Querying clustered tables

When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table's schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of the columns is important because it determines the sort order of the data.

To optimize performance when you run queries against clustered tables, use an expression that filters on a clustered column or on multiple clustered columns in the order the clustered columns are specified. Queries that filter on clustered columns generally perform better than queries that filter only on non-clustered columns.

BigQuery sorts the data in a clustered table based on the values in the clustering columns and organizes them into blocks.

When you submit a query that contains a filter on a clustered column, BigQuery uses the clustering information to efficiently determine whether a block contains any data relevant to the query. This allows BigQuery to scan only the relevant blocks, a process referred to as block pruning.

You can query clustered tables by:

  • Using the Google Cloud console
  • Using the bq command-line tool's bq query command (see the example after this list)
  • Calling the jobs.insert method and configuring a query job
  • Using the client libraries

Currently, you can only use GoogleSQL with clustered tables.
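
For example, you can run a GoogleSQL query against a clustered table with the bq query command. The following is a minimal sketch rather than a complete reference: mydataset.mytable and the customer_id clustering column are placeholders that you replace with your own table and clustering columns.

bq query --use_legacy_sql=false \
'SELECT
  COUNT(*) AS transactions
FROM
  `mydataset.mytable`
WHERE
  -- Filtering on the first clustering column lets BigQuery prune blocks.
  customer_id = 10000'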

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import (
    "context"
    "fmt"
    "io"

    "cloud.google.com/go/bigquery"
    "google.golang.org/api/iterator"
)

// queryClusteredTable demonstrates querying a table that has a clustering specification.
func queryClusteredTable(w io.Writer, projectID, datasetID, tableID string) error {
    // projectID := "my-project-id"
    // datasetID := "mydataset"
    // tableID := "mytable"
    ctx := context.Background()
    client, err := bigquery.NewClient(ctx, projectID)
    if err != nil {
        return fmt.Errorf("bigquery.NewClient: %v", err)
    }
    defer client.Close()

    q := client.Query(fmt.Sprintf(`
	SELECT
	  COUNT(1) as transactions,
	  SUM(amount) as total_paid,
	  COUNT(DISTINCT destination) as distinct_recipients
	FROM
	  `+"`%s.%s`"+`
	WHERE
	  timestamp > TIMESTAMP('2015-01-01')
	  AND origin = @wallet`, datasetID, tableID))
    q.Parameters = []bigquery.QueryParameter{
        {
            Name:  "wallet",
            Value: "wallet00001866cb7e0f09a890",
        },
    }
    // Run the query and print results when the query job is completed.
    job, err := q.Run(ctx)
    if err != nil {
        return err
    }
    status, err := job.Wait(ctx)
    if err != nil {
        return err
    }
    if err := status.Err(); err != nil {
        return err
    }
    it, err := job.Read(ctx)
    if err != nil {
        return err
    }
    for {
        var row []bigquery.Value
        err := it.Next(&row)
        if err == iterator.Done {
            break
        }
        if err != nil {
            return err
        }
        fmt.Fprintln(w, row)
    }
    return nil
}
 

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class QueryClusteredTable {

  public static void runQueryClusteredTable() throws Exception {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    queryClusteredTable(projectId, datasetName, tableName);
  }

  public static void queryClusteredTable(String projectId, String datasetName, String tableName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      String sourceTable = "`" + projectId + "." + datasetName + "." + tableName + "`";
      String query =
          "SELECT word, word_count\n"
              + "FROM " + sourceTable + "\n"
              // Optimize query performance by filtering the clustered columns in sort order
              + "WHERE corpus = 'romeoandjuliet'\n"
              + "AND word_count >= 1";

      QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).build();

      TableResult results = bigquery.query(queryConfig);

      results
          .iterateAll()
          .forEach(row -> row.forEach(val -> System.out.printf("%s,", val.toString())));

      System.out.println("Query clustered table performed successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}
 

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the destination table.
# table_id = "your-project.your_dataset.your_table_name"

sql = "SELECT * FROM `bigquery-public-data.samples.shakespeare`"
cluster_fields = ["corpus"]

job_config = bigquery.QueryJobConfig(
    clustering_fields=cluster_fields, destination=table_id
)

# Start the query, passing in the extra configuration.
query_job = client.query(sql, job_config=job_config)  # Make an API request.
query_job.result()  # Wait for the job to complete.

table = client.get_table(table_id)  # Make an API request.
if table.clustering_fields == cluster_fields:
    print(
        "The destination table is written using the cluster_fields configuration."
    )
 

Required permissions

To run a query job, you need the bigquery.jobs.create Identity and Access Management (IAM) permission on the project that runs the query job.

Each of the following predefined IAM roles includes the permissions that you need to run a query job:

  • roles/bigquery.admin
  • roles/bigquery.jobUser
  • roles/bigquery.user

You also need the bigquery.tables.getData permission on all tables and views that your query references. In addition, when querying a view, you need this permission on all underlying tables and views. However, if you are using authorized views or authorized datasets, you don't need access to the underlying source data.

Each of the following predefined IAM roles includes the permission that you need on all tables and views that the query references:

  • roles/bigquery.admin
  • roles/bigquery.dataOwner
  • roles/bigquery.dataEditor
  • roles/bigquery.dataViewer

For more information about IAM roles in BigQuery, see Predefined roles and permissions.
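
If you need to grant one of these roles, you can add a project-level IAM binding with the gcloud CLI. The following command is a minimal sketch that assumes you manage access at the project level; the project ID and user email are placeholders.

gcloud projects add-iam-policy-binding my-project-id \
    --member="user:user@example.com" \
    --role="roles/bigquery.jobUser"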

Best practices

To get the best performance from queries against clustered tables, use the following best practices.

For context, the sample table used in the best practice examples is a clustered table that is created by using a DDL statement. The DDL statement creates a table named ClusteredSalesData. The table is clustered by the following columns: customer_id, product_id, and order_id, in that sort order.

CREATE TABLE
  `mydataset.ClusteredSalesData`
PARTITION BY
  DATE(timestamp)
CLUSTER BY
  customer_id,
  product_id,
  order_id
AS
SELECT
  *
FROM
  `mydataset.SalesData`

Filter clustered columns by sort order

When you specify a filter, use expressions that filter on the clustered columns in sort order. Sort order is the column order given in the CLUSTER BY clause. To get the benefits of clustering, include all of the clustered columns or a subset of the columns in left-to-right sort order, starting with the first column. For example, if the column sort order is A, B, C, a query that filters on A and B might benefit from clustering, but a query that filters on B and C does not. The ordering of the column names inside the filter expression doesn't affect performance.

The following example queries the ClusteredSalesData clustered table that was created in the preceding example. The query includes a filter expression that filters on customer_id and then on product_id. This query optimizes performance by filtering the clustered columns in sort order (the column order given in the CLUSTER BY clause).

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  customer_id = 10000
  AND product_id LIKE 'gcp_analytics%'

The following query does not filter the clustered columns in sort order. As a result, the performance of the query is not optimal. This query filters on product_id and then on order_id (skipping customer_id).

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  product_id LIKE 'gcp_analytics%'
  AND order_id = 20000

Do not use clustered columns in complex filter expressions

If you use a clustered column in a complex filter expression, the performance of the query is not optimized because block pruning cannot be applied.

For example, the following query does not prune blocks because a clustered column, customer_id, is used in a function in the filter expression.

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  CAST(customer_id AS STRING) = "10000"

To optimize query performance by pruning blocks, use simple filter expressions like the following. In this example, a simple filter is applied to the clustered column customer_id.

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  customer_id = 10000

Do not compare clustered columns to other columns

If a filter expression compares a clustered column to another column (either a clustered column or a non-clustered column), the performance of the query is not optimized because block pruning cannot be applied.

The following query does not prune blocks because the filter expression compares a clustered column (customer_id) to another column (order_id).

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  customer_id = order_id

Table security

To control access to tables in BigQuery, see Introduction to table access controls .

What's next