- Jobs that use GPUs incur charges as specified in the Dataflow pricing page.
- To use GPUs, your Dataflow job must use Dataflow Runner v2.
This tutorial shows you how to use GPUs on Dataflow to process Landsat 8 satellite images and render them as JPEG files. The tutorial is based on the example Processing Landsat satellite images with GPUs.
Objectives
- Build a Docker image for Dataflow that has TensorFlow with GPU support.
- Run a Dataflow job with GPUs.
Costs
This tutorial uses billable components of Google Cloud, including:
- Cloud Storage
- Dataflow
- Artifact Registry
Use the pricing calculator to generate a cost estimate based on your projected usage.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- Install the Google Cloud CLI.
- If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
- To initialize the gcloud CLI, run the following command:
  gcloud init
- Create or select a Google Cloud project.
  Roles required to select or create a project:
  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
  - Create a Google Cloud project:
    gcloud projects create PROJECT_ID
    Replace PROJECT_ID with a name for the Google Cloud project you are creating.
  - Select the Google Cloud project that you created:
    gcloud config set project PROJECT_ID
    Replace PROJECT_ID with your Google Cloud project name.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Dataflow, Cloud Build, and Artifact Registry APIs:
  Roles required to enable APIs: you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
  gcloud services enable dataflow.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com
- If you're using a local shell, then create local authentication credentials for your user account:
  gcloud auth application-default login
  You don't need to do this if you're using Cloud Shell.
  If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.
- Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/iam.serviceAccountUser
  gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
  Replace the following:
  - PROJECT_ID: Your project ID.
  - USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
  - ROLE: The IAM role that you grant to your user account.
- Grant roles to your Compute Engine default service account. Run the following command once for each of the following IAM roles: roles/dataflow.admin, roles/dataflow.worker, roles/bigquery.dataEditor, roles/pubsub.editor, roles/storage.objectAdmin, and roles/artifactregistry.reader.
  gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" --role=SERVICE_ACCOUNT_ROLE
  Replace the following:
  - PROJECT_ID: Your project ID.
  - PROJECT_NUMBER: Your project number. To find your project number, see Identify projects.
  - SERVICE_ACCOUNT_ROLE: Each individual role.
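  If you'd rather grant all of these roles in one pass, a shell loop works as well. A minimal sketch, assuming you substitute your real project ID and project number:
  # Grants one role per iteration to the Compute Engine default service account.
  for SERVICE_ACCOUNT_ROLE in roles/dataflow.admin roles/dataflow.worker roles/bigquery.dataEditor roles/pubsub.editor roles/storage.objectAdmin roles/artifactregistry.reader; do
    gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" --role="$SERVICE_ACCOUNT_ROLE"
  done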
- To store the output JPEG image files from this tutorial, create a Cloud Storage bucket:
  - In the Google Cloud console, go to the Cloud Storage Buckets page.
  - Click Create.
  - On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
  - For Name your bucket, enter a unique bucket name. Don't include sensitive information in the bucket name, because the bucket namespace is global and publicly visible.
  - In the Choose where to store your data section, do the following:
    - Select a Location type.
    - Choose a location where your bucket's data is permanently stored from the Location type drop-down menu.
    - If you select the dual-region location type, you can also choose to enable turbo replication by using the relevant checkbox.
    - To set up cross-bucket replication, select Add cross-bucket replication via Storage Transfer Service and follow these steps:
      - In the Bucket menu, select a bucket.
      - In the Replication settings section, click Configure to configure settings for the replication job. The Configure cross-bucket replication pane appears.
      - To filter objects to replicate by object name prefix, enter a prefix that you want to include or exclude objects from, then click Add a prefix.
      - To set a storage class for the replicated objects, select a storage class from the Storage class menu. If you skip this step, the replicated objects will use the destination bucket's storage class by default.
      - Click Done.
  - In the Choose how to store your data section, in the Set a default class section, select Standard.
  - To enable hierarchical namespace, in the Optimize storage for data-intensive workloads section, select Enable hierarchical namespace on this bucket.
  - In the Choose how to control access to objects section, select whether or not your bucket enforces public access prevention, and select an access control method for your bucket's objects.
  - In the Choose how to protect object data section, select any of the options under Data protection that you want to set for your bucket:
    - To enable soft delete, click the Soft delete policy (For data recovery) checkbox, and specify the number of days you want to retain objects after deletion.
    - To set Object Versioning, click the Object versioning (For version control) checkbox, and specify the maximum number of versions per object and the number of days after which the noncurrent versions expire.
    - To enable the retention policy on objects and buckets, click the Retention (For compliance) checkbox, and then do the following:
      - To enable Object Retention Lock, click the Enable object retention checkbox.
      - To enable Bucket Lock, click the Set bucket retention policy checkbox, and choose a unit of time and a length of time for your retention period.
    - To choose how your object data will be encrypted, expand the Data encryption section and select a Data encryption method.
  - Click Create.
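Alternatively, a bucket with default settings is sufficient for this tutorial, so you can create one from the command line instead of the console. A minimal sketch, assuming the gcloud CLI is initialized (substitute your own BUCKET_NAME):
# Creates a bucket with the Standard storage class in us-central1.
gcloud storage buckets create gs://BUCKET_NAME --location=us-central1 --default-storage-class=STANDARD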
Prepare your working environment
Download the starter files, and then create your Artifact Registry repository.
Download the starter files
Download the starter files and then change directories.
- Clone the python-docs-samples repository:
  git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
- Navigate to the sample code directory:
  cd python-docs-samples/dataflow/gpu-examples/tensorflow-landsat
Configure Artifact Registry
Create an Artifact Registry repository so that you can upload artifacts. Each repository can contain artifacts for a single supported format.
All repository content is encrypted using either Google-owned and Google-managed encryption keys or customer-managed encryption keys. Artifact Registry uses Google-owned and Google-managed encryption keys by default and no configuration is required for this option.
You must have at least Artifact Registry Writer access to the repository.
Run the following command to create a new repository. The command uses the --async flag and returns immediately, without waiting for the operation in progress to complete.

gcloud artifacts repositories create REPOSITORY \
    --repository-format=docker \
    --location=LOCATION \
    --async

Replace REPOSITORY with a name for your repository. For each repository location in a project, repository names must be unique. Replace LOCATION with a supported Artifact Registry location, such as us-central1.
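Because --async returns before the operation completes, you might want to confirm that the repository exists before you push images. A quick check, using the same placeholder values:
# Succeeds once the create operation has finished.
gcloud artifacts repositories describe REPOSITORY --location=LOCATION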
Before you can push or pull images, configure Docker to authenticate requests for Artifact Registry. To set up authentication to Docker repositories, run the following command:
gcloud auth configure-docker LOCATION-docker.pkg.dev
The command updates your Docker configuration. You can now connect with Artifact Registry in your Google Cloud project to push images.
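To confirm that the credential helper was registered, you can inspect your Docker configuration file. This assumes Docker's default config location:
# LOCATION-docker.pkg.dev should now be listed under "credHelpers".
cat ~/.docker/config.json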
Build the Docker image
Cloud Build allows you to build a Docker image using a Dockerfile and save it into Artifact Registry, where the image is accessible to other Google Cloud products.
Build the container image by using the build.yaml config file:

gcloud builds submit --config build.yaml
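For orientation, the build config corresponds roughly to building and pushing the image yourself. A hypothetical local equivalent (the actual image path and build steps are defined in build.yaml):
# Build the image from the sample's Dockerfile, then push it to Artifact Registry.
docker build -t LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/tensorflow-landsat .
docker push LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/tensorflow-landsat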
Run the Dataflow job with GPUs
The following code block demonstrates how to launch this Dataflow pipeline with GPUs.
Run the Dataflow pipeline by using the run.yaml config file.
export PROJECT=PROJECT_NAME
export BUCKET=BUCKET_NAME
export JOB_NAME="satellite-images-$(date +%Y%m%d-%H%M%S)"
export OUTPUT_PATH="gs://$BUCKET/samples/dataflow/landsat/output-images/"
export REGION="us-central1"
export GPU_TYPE="nvidia-tesla-t4"

gcloud builds submit \
    --config run.yaml \
    --substitutions _JOB_NAME=$JOB_NAME,_OUTPUT_PATH=$OUTPUT_PATH,_REGION=$REGION,_GPU_TYPE=$GPU_TYPE \
    --no-source
Replace the following:
- PROJECT_NAME: the Google Cloud project name
- BUCKET_NAME: the Cloud Storage bucket name (without the gs:// prefix)
After you run this pipeline, wait for the command to finish. If you exit your shell, you might lose the environment variables that you've set.
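While the job runs, you can monitor it from a second shell. A minimal check, assuming REGION is set as above:
# Lists active Dataflow jobs, including satellite-images-<timestamp>.
gcloud dataflow jobs list --region=$REGION --status=active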
To avoid sharing the GPU between multiple worker processes, this sample uses a machine type with 1 vCPU. The memory requirements of the pipeline are addressed by using 13 GB of extended memory. For more information, read GPUs and worker parallelism .
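For context, the GPU and machine-type settings that run.yaml passes to the pipeline look roughly like the following flags. This is a sketch based on the values described above; the authoritative values live in the sample's run.yaml:
# One NVIDIA T4 per worker, with the GPU driver installed automatically.
--dataflow_service_options "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"
# A custom machine type with 1 vCPU and 13 GB (13312 MB) of extended memory.
--machine_type "custom-1-13312-ext"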
View your results
The pipeline in tensorflow-landsat/main.py
processes Landsat 8 satellite images and
renders them as JPEG files. Use the following steps to view these files.
- List the output JPEG files with details by using the Google Cloud CLI:
  gcloud storage ls "gs://$BUCKET/samples/dataflow/landsat/" --long --readable-sizes
- Copy the files into your local directory:
  mkdir outputs
  gcloud storage cp "gs://$BUCKET/samples/dataflow/landsat/*" outputs/
- Open these image files with the image viewer of your choice.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete .
- In the dialog, type the project ID, and then click Shut down to delete the project.
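If you prefer the gcloud CLI, you can delete the project with a single command instead:
# Schedules the project for deletion; it remains recoverable for a limited time.
gcloud projects delete PROJECT_ID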
What's next
- Look at a minimal GPU-enabled TensorFlow example.
- Look at a minimal GPU-enabled PyTorch example.
- Learn more about GPU support on Dataflow.
- Look through tasks for Using GPUs.
- Explore reference architectures, diagrams, and best practices about Google Cloud in the Cloud Architecture Center.

