Use BigLake metastore with tables in BigQuery

This document explains how to use BigLake metastore with BigQuery tables and Spark.

With BigLake metastore, you can create and use standard (built-in) tables, BigLake tables for Apache Iceberg in BigQuery, and Apache Iceberg external tables from BigQuery.

Before you begin

  1. Enable billing for your Google Cloud project. Learn how to check if billing is enabled on a project .
  2. Enable the BigQuery and Dataproc APIs.

    Enable the APIs

  3. Optional: Understand how BigLake metastore works and why you should use it.

Required roles

To get the permissions that you need to use Spark and Dataproc with BigLake metastore as a metadata store, ask your administrator to grant you the following IAM roles:

  • Create BigLake metastore tables in Spark:
    • Dataproc Worker (roles/dataproc.worker) on the Dataproc Serverless service account in the project
    • BigQuery Data Editor (roles/bigquery.dataEditor) on the Dataproc Serverless service account in the project
    • Storage Object Admin (roles/storage.objectAdmin) on the Dataproc Serverless service account in the project
  • Query BigLake metastore tables in BigQuery:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.
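
For example, an administrator can grant these roles from the gcloud CLI. The following is a minimal sketch, assuming the workload runs as the default Dataproc Serverless service account; SERVICE_ACCOUNT_EMAIL is a placeholder for the service account that your workload actually uses.

    # Grant each required role to the Dataproc Serverless service account.
    # SERVICE_ACCOUNT_EMAIL is a placeholder; substitute your workload's service account.
    for role in roles/dataproc.worker roles/bigquery.dataEditor roles/storage.objectAdmin; do
      gcloud projects add-iam-policy-binding PROJECT_ID \
          --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
          --role="$role"
    done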

Connect to a table

  1. Create a dataset in the Google Cloud console.

     CREATE SCHEMA `PROJECT_ID`.DATASET_NAME;

    Replace the following:

    • PROJECT_ID: the ID of the Google Cloud project in which to create the dataset.
    • DATASET_NAME: a name for your dataset.
  2. Create a Cloud Resource Connection.
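
    For example, you can create the connection with the bq command-line tool. The following is a minimal sketch, assuming a CLOUD_RESOURCE connection in the same location as your dataset; CONNECTION_ID is a placeholder for the connection name.

     bq mk --connection \
         --location=REGION \
         --project_id=PROJECT_ID \
         --connection_type=CLOUD_RESOURCE \
         CONNECTION_ID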

  3. Create a standard BigQuery table.

     CREATE TABLE `PROJECT_ID`.DATASET_NAME.TABLE_NAME (
       name STRING,
       id INT64
     );

    Replace the following:

    • TABLE_NAME: a name for your table.
  4. Insert data into the standard BigQuery table.

     INSERT INTO `PROJECT_ID`.DATASET_NAME.TABLE_NAME
     VALUES ('test_name1', 123), ('test_name2', 456), ('test_name3', 789);

  5. Create a BigLake table for Apache Iceberg in BigQuery .

    For example, to create a table, run the following CREATE statement.

     CREATE TABLE `PROJECT_ID`.DATASET_NAME.ICEBERG_TABLE_NAME (
       name STRING,
       id INT64
     )
     WITH CONNECTION `CONNECTION_NAME`
     OPTIONS (
       file_format = 'PARQUET',
       table_format = 'ICEBERG',
       storage_uri = 'STORAGE_URI'
     );

    Replace the following:

    • ICEBERG_TABLE_NAME: a name for your BigLake table for Apache Iceberg in BigQuery. For example, iceberg_managed_table.
    • CONNECTION_NAME: the name of the connection that you created in step 2. For example, myproject.us.myconnection.
    • STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
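
    The connection's service account also needs access to the Cloud Storage bucket referenced by STORAGE_URI. The following is a minimal sketch, assuming you grant Storage Object Admin at the bucket level; BUCKET_NAME is the bucket from STORAGE_URI and CONNECTION_SA is a placeholder for the service account ID shown in the connection details.

     # Show the connection details, including its service account ID.
     bq show --connection CONNECTION_NAME

     # Grant that service account access to the bucket (CONNECTION_SA is a placeholder).
     gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
         --member="serviceAccount:CONNECTION_SA" \
         --role="roles/storage.objectAdmin"
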
  6. Insert data into the BigLake table for Apache Iceberg in BigQuery.

     INSERT INTO `PROJECT_ID`.DATASET_NAME.ICEBERG_TABLE_NAME
     VALUES ('test_name1', 123), ('test_name2', 456), ('test_name3', 789);

  7. Create an Apache Iceberg external table .

    For example, to create an Iceberg external table, run the following CREATE statement.

     CREATE OR REPLACE EXTERNAL TABLE `PROJECT_ID`.DATASET_NAME.READONLY_ICEBERG_TABLE_NAME
     WITH CONNECTION `CONNECTION_NAME`
     OPTIONS (
       format = 'ICEBERG',
       uris = ['BUCKET_PATH'],
       require_partition_filter = FALSE
     );

    Replace the following:

    • READONLY_ICEBERG_TABLE_NAME: a name for your read-only table.
    • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name'].
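
    Optionally, confirm that the tables are also queryable directly from BigQuery. The following is a minimal sketch using the bq command-line tool with the placeholders defined above.

     bq query --use_legacy_sql=false \
         'SELECT * FROM `PROJECT_ID.DATASET_NAME.ICEBERG_TABLE_NAME`'
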
  8. From PySpark, query the standard table, BigLake table for Apache Iceberg in BigQuery, and Apache Iceberg external table.

     from pyspark.sql import SparkSession

     # Create a Spark session
     spark = SparkSession.builder \
         .appName("BigLake Metastore Iceberg") \
         .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
         .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
         .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
         .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
         .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
         .getOrCreate()
     spark.conf.set("viewsEnabled", "true")

     # Use the catalog
     spark.sql("USE `CATALOG_NAME`;")
     spark.sql("USE NAMESPACE DATASET_NAME;")

     # Configure Spark for temporary results
     spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
     spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

     # List the tables in the dataset
     df = spark.sql("SHOW TABLES;")
     df.show()

     # Query the tables
     sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
     df = spark.read.format("bigquery").load(sql)
     df.show()

     sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
     df = spark.read.format("bigquery").load(sql)
     df.show()

     sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
     df = spark.read.format("bigquery").load(sql)
     df.show()

    Replace the following:

    • WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder that's connected to your BigLake table for Apache Iceberg in BigQuery and your Apache Iceberg external table.
    • CATALOG_NAME: the name of the catalog that you're using.
    • LOCATION: the location of your BigQuery resources, for example US.
    • MATERIALIZATION_NAMESPACE: the namespace for storing temporary results.
  9. Run the PySpark script using Serverless Spark.

    gcloud dataproc batches submit pyspark SCRIPT_PATH \
        --version=2.2 \
        --project=PROJECT_ID \
        --region=REGION \
        --deps-bucket=YOUR_BUCKET

    Replace the following:

    • SCRIPT_PATH: the path to the script that the batch job uses.
    • PROJECT_ID: the ID of the Google Cloud project in which to run the batch job.
    • REGION: the region where your workload runs.
    • YOUR_BUCKET: the Cloud Storage bucket for uploading workload dependencies. The gs:// URI prefix of the bucket is not required; you can specify the bucket path or bucket name, for example, mybucketname1.
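
    To monitor the workload after submission, you can use the Dataproc batches commands. The following is a minimal sketch; BATCH_ID is the identifier that the submit command prints.

     # List recent batch workloads in the region.
     gcloud dataproc batches list --region=REGION

     # Show the details and state of a specific batch workload.
     gcloud dataproc batches describe BATCH_ID --region=REGION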

What's next
