Run multiple BigQuery jobs in parallel


BigQuery hosts a number of public datasets that anyone can query. In this tutorial, you create a workflow that runs multiple BigQuery query jobs in parallel, demonstrating an improvement in performance compared to running the jobs serially, one after the other.

Objectives

In this tutorial, you will:
  1. Run a query against a Wikipedia public dataset to determine the most viewed titles in a specific month.
  2. Deploy and execute a workflow that runs multiple BigQuery query jobs serially, one after the other.
  3. Deploy and execute a workflow that runs the same BigQuery jobs using parallel iteration, so that the iterations of an ordinary for loop are executed in parallel.

You can run the following commands in the Google Cloud console, or by using the Google Cloud CLI in either your terminal or Cloud Shell.

Costs

In this document, you use the following billable components of Google Cloud:

  • BigQuery
  • Workflows

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

Before you begin

Security constraints defined by your organization might prevent you from completing the following steps. For troubleshooting information, see Develop applications in a constrained Google Cloud environment.

Console

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project .

  4. Enable the Workflows API.

    Enable the API

  5. Create a service account:

    1. In the Google Cloud console, go to the Create service account page.

      Go to Create service account
    2. Select your project.
    3. In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.

      In the Service account description field, enter a description. For example, Service account for quickstart.

    4. Click Create and continue.
    5. Grant the following roles to the service account: BigQuery > BigQuery Job User, Logging > Logs Writer.

      To grant a role, find the Select a role list, then select the role.

      To grant additional roles, click Add another role and add each additional role.

    6. Click Continue.
    7. Click Done to finish creating the service account.


gcloud

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project .

  4. Enable the Workflows API.

    Enable the API

  5. Create a service account:

    1. In the Google Cloud console, go to the Create service account page.

      Go to Create service account
    2. Select your project.
    3. In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.

      In the Service account description field, enter a description. For example, Service account for quickstart.

    4. Click Create and continue.
    5. Grant the following roles to the service account: roles/bigquery.jobUser, roles/logging.logWriter.

      To grant a role, find the Select a role list, then select the role.

      To grant additional roles, click Add another role and add each additional role.

    6. Click Continue.
    7. Click Done to finish creating the service account.

  6. Install the Google Cloud CLI.

  7. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  8. To initialize the gcloud CLI, run the following command:

    gcloud init
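
If you prefer to set up the service account entirely from the command line, the following commands are a minimal sketch of the equivalent steps; SA_NAME and PROJECT_ID are placeholders you must replace with your own values:

    # Create the service account (SA_NAME is a placeholder).
    gcloud iam service-accounts create SA_NAME \
        --display-name="Service account for quickstart"

    # Grant the BigQuery Job User role on the project.
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
        --role="roles/bigquery.jobUser"

    # Grant the Logs Writer role on the project.
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
        --role="roles/logging.logWriter"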

Run a BigQuery query job

In BigQuery, you can run an interactive (on-demand) query job. For more information, see Running interactive and batch query jobs.

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. Enter the following BigQuery SQL query in the Query editor text area:

    SELECT
      TITLE,
      SUM(views)
    FROM
      `bigquery-samples.wikipedia_pageviews.201207h`
    GROUP BY
      TITLE
    ORDER BY
      SUM(views) DESC
    LIMIT 100

  3. Click Run.

bq

In your terminal, enter the following bq query command to run an interactive query using standard SQL syntax:

 bq query \
--use_legacy_sql=false \
'SELECT
  TITLE, SUM(views)
FROM `bigquery-samples.wikipedia_pageviews.201207h` 
GROUP BY
  TITLE
ORDER BY
  SUM(views) DESC
LIMIT 100' 

This runs a query that returns the top 100 Wikipedia titles with the most views in a specific month and writes the output to a temporary table.

Note how long the query takes to run.
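
If you want a precise number, one option (not part of the original steps) is to prefix the bq command with the standard Unix time utility:

    # Measure wall-clock time for the interactive query (illustrative).
    time bq query \
    --use_legacy_sql=false \
    'SELECT
      TITLE, SUM(views)
    FROM `bigquery-samples.wikipedia_pageviews.201207h`
    GROUP BY
      TITLE
    ORDER BY
      SUM(views) DESC
    LIMIT 100'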

Deploy a workflow that runs multiple queries serially

A workflow definition is made up of a series of steps described using the Workflows syntax. After creating a workflow, you deploy it to make it available for execution. The deploy step also validates that the source file can be executed.

The following workflow defines a list of five tables and then, using the Workflows BigQuery connector, runs a query against each one. The queries run serially, one after the other, and the most viewed titles from each table are saved to a results map.

Console

  1. In the Google Cloud console, go to the Workflows page:

    Go to Workflows

  2. Click Create.

  3. Enter a name for the new workflow, such as workflow-serial-bqjobs.

  4. Choose an appropriate region; for example, us-central1.

  5. Select the service account you previously created.

    You should have already granted both the BigQuery > BigQuery Job User and Logging > Logs Writer IAM roles to the service account.

  6. Click Next.

  7. In the workflow editor, enter the following definition for your workflow:

    main:
        steps:
        - init:
            assign:
                - results: {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            for:
                value: table
                in: ${tables}
                steps:
                - logTable:
                    call: sys.log
                    args:
                        text: ${"Running query for table " + table}
                - runQuery:
                    call: googleapis.bigquery.v2.jobs.query
                    args:
                        projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                        body:
                            useLegacySql: false
                            useQueryCache: false
                            timeoutMs: 30000
                            # Find top 100 titles with most views on Wikipedia
                            query: ${
                                "SELECT TITLE, SUM(views)
                                FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                WHERE LENGTH(TITLE) > 10
                                GROUP BY TITLE
                                ORDER BY SUM(VIEWS) DESC
                                LIMIT 100"
                            }
                    result: queryResult
                - returnResult:
                    assign:
                        # Return the top title from each table
                        - results[table]: {}
                        - results[table].title: ${queryResult.rows[0].f[0].v}
                        - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  8. Click Deploy.

gcloud

  1. Open a terminal and create a source code file for your workflow:

    touch workflow-serial-bqjobs.yaml
  2. Copy the following workflow to your source code file:

    main:
        steps:
        - init:
            assign:
                - results: {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            for:
                value: table
                in: ${tables}
                steps:
                - logTable:
                    call: sys.log
                    args:
                        text: ${"Running query for table " + table}
                - runQuery:
                    call: googleapis.bigquery.v2.jobs.query
                    args:
                        projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                        body:
                            useLegacySql: false
                            useQueryCache: false
                            timeoutMs: 30000
                            # Find top 100 titles with most views on Wikipedia
                            query: ${
                                "SELECT TITLE, SUM(views)
                                FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                WHERE LENGTH(TITLE) > 10
                                GROUP BY TITLE
                                ORDER BY SUM(VIEWS) DESC
                                LIMIT 100"
                            }
                    result: queryResult
                - returnResult:
                    assign:
                        # Return the top title from each table
                        - results[table]: {}
                        - results[table].title: ${queryResult.rows[0].f[0].v}
                        - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  3. Deploy the workflow by entering the following command:

    gcloud workflows deploy workflow-serial-bqjobs \
        --source=workflow-serial-bqjobs.yaml \
        --service-account=MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com

    Replace MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com with the email of the service account you previously created.

    You should have already granted both the roles/bigquery.jobUser and roles/logging.logWriter IAM roles to the service account.
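
    As an optional check (not part of the original steps), you can confirm the deployment by describing the workflow; the state should be ACTIVE:

    # Describe the workflow and print its state (illustrative check).
    gcloud workflows describe workflow-serial-bqjobs --format="value(state)"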

Execute the workflow and run multiple queries serially

Executing a workflow runs the current workflow definition associated with the workflow.

Console

  1. In the Google Cloud console, go to the Workflows page:

    Go to Workflows

  2. On the Workflows page, select the workflow-serial-bqjobs workflow to go to its details page.

  3. On the Workflow Details page, click Execute.

  4. Click Execute again.

  5. View the results of the workflow in the Output pane.

gcloud

  1. Open a terminal.

  2. Execute the workflow:

      
    gcloud workflows run workflow-serial-bqjobs

Because the five queries run one after the other, the workflow execution should take roughly five times as long as the single query you ran earlier, or approximately a minute. The result includes each table and looks similar to the following:

 {
  "201201h": {
    "title": "Special:Search",
    "views": "14591339"
  },
  "201202h": {
    "title": "Special:Search",
    "views": "132765420"
  },
  "201203h": {
    "title": "Special:Search",
    "views": "123316818"
  },
  "201204h": {
    "title": "Special:Search",
    "views": "116830614"
  },
  "201205h": {
    "title": "Special:Search",
    "views": "131357063"
  }
} 
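
To record the execution time for a later comparison, you can list recent executions along with their start and end times. This optional check is a sketch that relies on standard gcloud output formatting:

 # List recent executions with timing fields (illustrative).
 gcloud workflows executions list workflow-serial-bqjobs \
     --limit=5 \
     --format="table(name.basename(), startTime, endTime, state)"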

Deploy and execute a workflow that runs multiple queries in parallel

Instead of running five queries sequentially, you can run the queries in parallel by making a few changes:

   
 - runQueries:
     parallel:
         shared: [results]
         for:
             value: table
             in: ${tables}
 
  • A parallel step allows each iteration of the for loop to run in parallel. (A sketch showing an optional concurrency cap follows this list.)
  • The results variable is declared as shared, which makes it writable by the parallel branches so that the result of each branch can be appended to it.
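
Workflows also supports an optional concurrency_limit field on a parallel step to cap how many iterations run at once; the following is a minimal sketch, with the limit of 2 chosen arbitrarily for illustration:

 - runQueries:
     parallel:
         shared: [results]
         concurrency_limit: 2  # run at most two iterations at a time (illustrative)
         for:
             value: table
             in: ${tables}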

Console

  1. In the Google Cloud console, go to the Workflows page:

    Go to Workflows

  2. Click Create.

  3. Enter a name for the new workflow, such as workflow-parallel-bqjobs.

  4. Choose an appropriate region; for example, us-central1.

  5. Select the service account you previously created.

  6. Click Next.

  7. In the workflow editor, enter the following definition for your workflow:

    main:
        steps:
        - init:
            assign:
                - results: {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            parallel:
                shared: [results]
                for:
                    value: table
                    in: ${tables}
                    steps:
                    - logTable:
                        call: sys.log
                        args:
                            text: ${"Running query for table " + table}
                    - runQuery:
                        call: googleapis.bigquery.v2.jobs.query
                        args:
                            projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                            body:
                                useLegacySql: false
                                useQueryCache: false
                                timeoutMs: 30000
                                # Find top 100 titles with most views on Wikipedia
                                query: ${
                                    "SELECT TITLE, SUM(views)
                                    FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                    WHERE LENGTH(TITLE) > 10
                                    GROUP BY TITLE
                                    ORDER BY SUM(VIEWS) DESC
                                    LIMIT 100"
                                }
                        result: queryResult
                    - returnResult:
                        assign:
                            # Return the top title from each table
                            - results[table]: {}
                            - results[table].title: ${queryResult.rows[0].f[0].v}
                            - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  8. Click Deploy.

  9. On the Workflow Details page, click Execute.

  10. Click Execute again.

  11. View the results of the workflow in the Output pane.

gcloud

  1. Open a terminal and create a source code file for your workflow:

    touch workflow-parallel-bqjobs.yaml
  2. Copy the following workflow to your source code file:

    main:
        steps:
        - init:
            assign:
                - results: {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            parallel:
                shared: [results]
                for:
                    value: table
                    in: ${tables}
                    steps:
                    - logTable:
                        call: sys.log
                        args:
                            text: ${"Running query for table " + table}
                    - runQuery:
                        call: googleapis.bigquery.v2.jobs.query
                        args:
                            projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                            body:
                                useLegacySql: false
                                useQueryCache: false
                                timeoutMs: 30000
                                # Find top 100 titles with most views on Wikipedia
                                query: ${
                                    "SELECT TITLE, SUM(views)
                                    FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                    WHERE LENGTH(TITLE) > 10
                                    GROUP BY TITLE
                                    ORDER BY SUM(VIEWS) DESC
                                    LIMIT 100"
                                }
                        result: queryResult
                    - returnResult:
                        assign:
                            # Return the top title from each table
                            - results[table]: {}
                            - results[table].title: ${queryResult.rows[0].f[0].v}
                            - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  3. Deploy the workflow by entering the following command:

    gcloud workflows deploy workflow-parallel-bqjobs \
        --source=workflow-parallel-bqjobs.yaml \
        --service-account=MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com

    Replace MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com with the email of the service account you previously created.

  4. Execute the workflow:

      
    gcloud workflows run workflow-parallel-bqjobs

The result will be similar to the previous output, but because the queries run in parallel, the workflow execution should take approximately twenty seconds or less.
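
To confirm the speedup, you can compare execution times. As one optional approach, gcloud can describe the most recent execution started from your session, including its start and end times:

 # Show details of the last execution started with gcloud, including
 # startTime and endTime (illustrative).
 gcloud workflows executions describe-last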

Clean up

If you created a new project for this tutorial, delete the project. If you used an existing project and wish to keep it without the changes added in this tutorial, delete resources created for the tutorial.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete tutorial resources

Delete the workflows created in this tutorial:

gcloud workflows delete WORKFLOW_NAME
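
For this tutorial, run the command once for each workflow you created:

# Delete both workflows created in this tutorial.
gcloud workflows delete workflow-serial-bqjobs
gcloud workflows delete workflow-parallel-bqjobs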
