Build and run a Flex Template


Dataflow Flex Templates allow you to package a Dataflow pipeline for deployment. This tutorial shows you how to build a Dataflow Flex Template and then run a Dataflow job using that template.

Objectives

  • Build a Dataflow Flex Template.
  • Use the template to run a Dataflow job.

Costs

In this document, you use the following billable components of Google Cloud: Dataflow, Compute Engine, Cloud Storage, Artifact Registry, and Cloud Build.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. Install the Google Cloud CLI.

  3. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  4. To initialize the gcloud CLI, run the following command:

    gcloud init
  5. Create or select a Google Cloud project.

    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project you are creating.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  6. Verify that billing is enabled for your Google Cloud project.

  7. Enable the Dataflow, Compute Engine, Logging, Cloud Storage, Cloud Storage JSON, Resource Manager, Artifact Registry, and Cloud Build APIs:

    gcloud services enable dataflow compute_component logging storage_component storage_api cloudresourcemanager.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com
  8. If you're using a local shell, then create local authentication credentials for your user account:

    gcloud auth application-default login

    You don't need to do this if you're using Cloud Shell.

    If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.

  9. Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/iam.serviceAccountUser

    gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

    Replace the following:

    • PROJECT_ID: your project ID.
    • USER_IDENTIFIER: the identifier for your user account—for example, myemail@example.com.
    • ROLE: the IAM role that you grant to your user account.
  10. Grant roles to your Compute Engine default service account. Run the following command once for each of the following IAM roles:

    • roles/dataflow.admin
    • roles/dataflow.worker
    • roles/storage.objectAdmin
    • roles/artifactregistry.writer

    gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" --role=SERVICE_ACCOUNT_ROLE

    Replace the following:

    • PROJECT_ID: your project ID
    • PROJECT_NUMBER: your project number
    • SERVICE_ACCOUNT_ROLE: each individual role

Prepare the environment

Install the SDK and any requirements for your development environment.

Java

  1. Download and install the Java Development Kit (JDK) version 17. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.

  2. Download and install Apache Maven by following Maven's installation guide for your specific operating system.

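To confirm that the environment is ready, you can check the versions that the tools report (a quick sanity check; the exact output varies by installation):

    java -version
    echo $JAVA_HOME
    mvn -version
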
Python

Install the Apache Beam SDK for Python.
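
For example, a common approach is to install the SDK into a Python virtual environment (the apache-beam[gcp] extra pulls in the Google Cloud dependencies):

    python -m venv env
    source env/bin/activate
    pip install 'apache-beam[gcp]'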

Go

Use Go's Download and install guide to download and install Go for your specific operating system. To learn which Go runtime environments are supported by Apache Beam, see Apache Beam runtime support.
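
You can verify the installation by checking the version that the go tool reports:

    go version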

Download the code sample

Java

  1. Clone the java-docs-samples repository.

    git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git

  2. Navigate to the code sample for this tutorial.

    cd java-docs-samples/dataflow/flex-templates/getting_started

  3. Build the Java project into an Uber JAR file.

    mvn clean package

    This Uber JAR file has all the dependencies embedded in it. You can run this file as a standalone application with no external dependencies on other libraries.
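
As a quick check, you can verify that the JAR was built and list a few of its entries (the path matches the --jar value used later in this tutorial):

    ls -lh target/flex-template-getting-started-1.0.jar
    jar tf target/flex-template-getting-started-1.0.jar | head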

Python

  1. Clone the python-docs-samples repository.

    git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git

  2. Navigate to the code sample for this tutorial.

    cd python-docs-samples/dataflow/flex-templates/getting_started

Go

  1. Clone the golang-samples repository.

    git clone https://github.com/GoogleCloudPlatform/golang-samples.git

  2. Navigate to the code sample for this tutorial.

    cd golang-samples/dataflow/flex-templates/wordcount

  3. Compile the Go binary.

    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o wordcount .
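
Optionally, confirm that the build produced a statically linked Linux binary (this check assumes the file utility is available on your system):

    file wordcount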

Create a Cloud Storage bucket

Use the gcloud storage buckets create command to create a Cloud Storage bucket:

gcloud storage buckets create gs://BUCKET_NAME

Replace BUCKET_NAME with a name for your Cloud Storage bucket. Cloud Storage bucket names must be globally unique and meet the bucket naming requirements.
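
For example, with a hypothetical bucket name and location (substitute your own values):

    gcloud storage buckets create gs://my-flex-template-bucket --location=us-central1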

Create an Artifact Registry repository

Create an Artifact Registry repository where you will push the Docker container image for the template.

  1. Use the gcloud artifacts repositories create command to create a new Artifact Registry repository.

    gcloud artifacts repositories create REPOSITORY \
        --repository-format=docker \
        --location=LOCATION

    Replace the following:

    • REPOSITORY: a name for your repository. Repository names must be unique for each repository location in a project.
    • LOCATION: the regional or multi-regional location for the repository.

  2. Use the gcloud auth configure-docker command to configure Docker to authenticate requests for Artifact Registry. This command updates your Docker configuration, so that you can connect with Artifact Registry to push images.

    gcloud auth configure-docker LOCATION-docker.pkg.dev
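
For example, with a hypothetical repository name and location (substitute your own values), the two commands might look like this:

    gcloud artifacts repositories create flex-templates \
        --repository-format=docker \
        --location=us-central1

    gcloud auth configure-docker us-central1-docker.pkg.dev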

Flex Templates can also use images stored in private registries. For more information, see Use an image from a private registry.

Build the Flex Template

In this step, you use the gcloud dataflow flex-template build command to build the Flex Template.

A Flex Template consists of the following components:

  • A Docker container image that packages your pipeline code. For Java and Python Flex Templates, the Docker image is built and pushed to your Artifact Registry repository when you run the gcloud dataflow flex-template build command.
  • A template specification file. This file is a JSON document that contains the location of the container image plus metadata about the template, such as pipeline parameters.

The sample repository in GitHub contains the metadata.json file.

To extend your template with additional metadata, you can create your own metadata.json file.
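
As an illustrative sketch (not the exact file in the sample repository), a minimal metadata.json names the template and describes its pipeline parameters:

    {
      "name": "Getting started",
      "description": "A sample Flex Template pipeline.",
      "parameters": [
        {
          "name": "output",
          "label": "Output destination",
          "helpText": "The Cloud Storage path prefix for the output files.",
          "isOptional": false,
          "regexes": ["^gs:\\/\\/.+$"]
        }
      ]
    }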

Java

gcloud dataflow flex-template build gs://BUCKET_NAME/getting_started-java.json \
    --image-gcr-path "LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/getting-started-java:latest" \
    --sdk-language "JAVA" \
    --flex-template-base-image JAVA17 \
    --metadata-file "metadata.json" \
    --jar "target/flex-template-getting-started-1.0.jar" \
    --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="com.example.dataflow.FlexTemplateGettingStarted"

Replace the following:

  • BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier
  • LOCATION: the location of the Artifact Registry repository that you created earlier
  • PROJECT_ID: the Google Cloud project ID
  • REPOSITORY: the name of the Artifact Registry repository that you created earlier
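
After the build completes, you can optionally confirm that the template specification file was written by printing it from Cloud Storage:

    gcloud storage cat gs://BUCKET_NAME/getting_started-java.json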

Python

gcloud dataflow flex-template build gs://BUCKET_NAME/getting_started-py.json \
    --image-gcr-path "LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/getting-started-python:latest" \
    --sdk-language "PYTHON" \
    --flex-template-base-image "PYTHON3" \
    --metadata-file "metadata.json" \
    --py-path "." \
    --env "FLEX_TEMPLATE_PYTHON_PY_FILE=getting_started.py" \
    --env "FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE=requirements.txt"

Replace the following:

  • BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier
  • LOCATION: the location of the Artifact Registry repository that you created earlier
  • PROJECT_ID: the Google Cloud project ID
  • REPOSITORY: the name of the Artifact Registry repository that you created earlier

Go

  1. Use the gcloud builds submit command to build the Docker image using a Dockerfile with Cloud Build. This command builds the image and pushes it to your Artifact Registry repository.

    gcloud builds submit --tag LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/wordcount-go:latest .

    Replace the following:

    • LOCATION: the location of the Artifact Registry repository that you created earlier
    • PROJECT_ID: the Google Cloud project ID
    • REPOSITORY: the name of the Artifact Registry repository that you created earlier

  2. Use the gcloud dataflow flex-template build command to create a Flex Template named wordcount-go.json in your Cloud Storage bucket.

    gcloud dataflow flex-template build gs://BUCKET_NAME/samples/dataflow/templates/wordcount-go.json \
        --image "LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/wordcount-go:latest" \
        --sdk-language "GO" \
        --metadata-file "metadata.json"

    Replace BUCKET_NAME with the name of the Cloud Storage bucket that you created earlier.

Run the Flex Template

In this step, you use the template to run a Dataflow job.

Java

  1. Use the gcloud dataflow flex-template run command to run a Dataflow job that uses the Flex Template.

    gcloud dataflow flex-template run "getting-started-`date +%Y%m%d-%H%M%S`" \
        --template-file-gcs-location "gs://BUCKET_NAME/getting_started-java.json" \
        --parameters output="gs://BUCKET_NAME/output-" \
        --additional-user-labels "LABELS" \
        --region "REGION"

    Replace the following:

    • BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier
    • REGION: the region where you want to deploy the Dataflow job
    • LABELS: Optional. Labels attached to your job, using the format <key1>=<val1>,<key2>=<val2>,...
  2. To view the status of the Dataflow job in the Google Cloud console, go to the Dataflow Jobs page.

If the job runs successfully, it writes the output to a file named gs://BUCKET_NAME/output--00000-of-00001.txt in your Cloud Storage bucket.
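
For example, you can read the output directly from the command line:

    gcloud storage cat gs://BUCKET_NAME/output--00000-of-00001.txt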

Python

  1. Use the gcloud dataflow flex-template run command to run a Dataflow job that uses the Flex Template.

    gcloud dataflow flex-template run "getting-started-`date +%Y%m%d-%H%M%S`" \
        --template-file-gcs-location "gs://BUCKET_NAME/getting_started-py.json" \
        --parameters output="gs://BUCKET_NAME/output-" \
        --additional-user-labels "LABELS" \
        --region "REGION"

    Replace the following:

    • BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier
    • REGION: the region where you want to deploy the Dataflow job
    • LABELS: Optional. Labels attached to your job, using the format <key1>=<val1>,<key2>=<val2>,...
  2. To view the status of the Dataflow job in the Google Cloud console, go to the Dataflow Jobs page.

If the job runs successfully, it writes the output to a file named gs://BUCKET_NAME/output--00000-of-00001.txt in your Cloud Storage bucket.

Go

  1. Use the gcloud dataflow flex-template run command to run a Dataflow job that uses the Flex Template.

    gcloud dataflow flex-template run "wordcount-go-`date +%Y%m%d-%H%M%S`" \
        --template-file-gcs-location "gs://BUCKET_NAME/samples/dataflow/templates/wordcount-go.json" \
        --parameters output="gs://BUCKET_NAME/samples/dataflow/templates/counts.txt" \
        --additional-user-labels "LABELS" \
        --region "REGION"

    Replace the following:

    • BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier
    • REGION: the region where you want to deploy the Dataflow job
    • LABELS: Optional. Labels attached to your job, using the format <key1>=<val1>,<key2>=<val2>,...
  2. To view the status of the Dataflow job in the Google Cloud console, go to the Dataflow Jobs page.

If the job runs successfully, it writes the output to a file named gs://BUCKET_NAME/samples/dataflow/templates/counts.txt in your Cloud Storage bucket.
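
As an alternative to the console, you can check the job status from the command line; for example, to list active jobs in the region where you launched the template:

    gcloud dataflow jobs list --region=REGION --status=active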

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

Delete a Google Cloud project:

    gcloud projects delete PROJECT_ID

Delete individual resources

  1. Delete the Cloud Storage bucket and all the objects in the bucket.

    gcloud storage rm gs://BUCKET_NAME --recursive

  2. Delete the Artifact Registry repository.

    gcloud artifacts repositories delete REPOSITORY \
        --location=LOCATION

  3. Revoke the roles that you granted to the Compute Engine default service account. Run the following command once for each of the following IAM roles:

    • roles/dataflow.admin
    • roles/dataflow.worker
    • roles/storage.objectAdmin
    • roles/artifactregistry.writer

    gcloud projects remove-iam-policy-binding PROJECT_ID \
        --member=serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com \
        --role=SERVICE_ACCOUNT_ROLE

  4. Optional: Revoke the authentication credentials that you created, and delete the local credential file.

    gcloud auth application-default revoke

  5. Optional: Revoke credentials from the gcloud CLI.

    gcloud auth revoke

What's next
