This document explains how to use BigLake metastore with BigQuery tables and Spark.
With BigLake metastore, you can create and use standard (built-in) tables, BigLake tables for Apache Iceberg in BigQuery, and Apache Iceberg external tables from BigQuery.
Before you begin
- Enable billing for your Google Cloud project. Learn how to check if billing is enabled on a project.
- Enable the BigQuery and Dataproc APIs.
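  A minimal gcloud sketch for enabling both APIs from the command line (assuming the service names bigquery.googleapis.com and dataproc.googleapis.com, and your own PROJECT_ID); you can also enable them from the Google Cloud console:

  gcloud services enable bigquery.googleapis.com dataproc.googleapis.com \
      --project=PROJECT_ID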
Required roles
To get the permissions that you need to use Spark and Dataproc with BigLake metastore as a metadata store, ask your administrator to grant you the following IAM roles:
- Create BigLake metastore tables in Spark:
  - Dataproc Worker (roles/dataproc.worker) on the Dataproc Serverless service account in the project
  - BigQuery Data Editor (roles/bigquery.dataEditor) on the Dataproc Serverless service account in the project
  - Storage Object User (roles/storage.objectUser) on the Dataproc Serverless service account in the project
- Query BigLake metastore tables in BigQuery:
  - BigQuery Data Viewer (roles/bigquery.dataViewer) on the project
  - BigQuery User (roles/bigquery.user) on the project
  - Storage Object Viewer (roles/storage.objectViewer) on the project
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
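For example, a minimal gcloud sketch for granting one of the roles above to the Dataproc Serverless service account (the SERVICE_ACCOUNT_EMAIL placeholder is an assumption; repeat the command for each role you need):

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/dataproc.worker"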
Connect to a table
- Create a dataset in the Google Cloud console.

  CREATE SCHEMA `PROJECT_ID`.DATASET_NAME;

  Replace the following:
  - PROJECT_ID: the ID of the Google Cloud project in which to create the dataset.
  - DATASET_NAME: a name for your dataset.
- Create a Cloud resource connection.
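  For example, a minimal bq CLI sketch for creating a Cloud resource connection and looking up its service account (the CONNECTION_ID and LOCATION placeholders are assumptions; see the linked page for the full steps, including granting that service account access to your Cloud Storage bucket):

  # Create the connection.
  bq mk --connection \
      --location=LOCATION \
      --project_id=PROJECT_ID \
      --connection_type=CLOUD_RESOURCE \
      CONNECTION_ID

  # Show the connection, including the service account that it uses.
  bq show --connection PROJECT_ID.LOCATION.CONNECTION_ID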
- Create a standard BigQuery table.

  CREATE TABLE `PROJECT_ID`.DATASET_NAME.TABLE_NAME (name STRING, id INT64);

  Replace the following:
  - TABLE_NAME: a name for your table.
- Insert data into the standard BigQuery table.

  INSERT INTO `PROJECT_ID`.DATASET_NAME.TABLE_NAME VALUES ('test_name1', 123), ('test_name2', 456), ('test_name3', 789);
- Create a BigLake table for Apache Iceberg in BigQuery.

  For example, to create a table, run the following CREATE statement.

  CREATE TABLE `PROJECT_ID`.DATASET_NAME.ICEBERG_TABLE_NAME (
    name STRING,
    id INT64
  )
  WITH CONNECTION `CONNECTION_NAME`
  OPTIONS (
    file_format = 'PARQUET',
    table_format = 'ICEBERG',
    storage_uri = 'STORAGE_URI'
  );

  Replace the following:
  - ICEBERG_TABLE_NAME: a name for your BigLake table for Apache Iceberg in BigQuery. For example, iceberg_managed_table.
  - CONNECTION_NAME: the name of your connection. You created this in the previous step. For example, myproject.us.myconnection.
  - STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
- Insert data into the BigLake table for Apache Iceberg in BigQuery.

  INSERT INTO `PROJECT_ID`.DATASET_NAME.ICEBERG_TABLE_NAME VALUES ('test_name1', 123), ('test_name2', 456), ('test_name3', 789);
- Create an Apache Iceberg external table.

  For example, to create an Iceberg external table, run the following CREATE statement.

  CREATE OR REPLACE EXTERNAL TABLE `PROJECT_ID`.DATASET_NAME.READONLY_ICEBERG_TABLE_NAME
  WITH CONNECTION `CONNECTION_NAME`
  OPTIONS (
    format = 'ICEBERG',
    uris = ['BUCKET_PATH'],
    require_partition_filter = FALSE
  );

  Replace the following:
  - READONLY_ICEBERG_TABLE_NAME: a name for your read-only table.
  - BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name'].
- From PySpark, query the standard table, the BigLake table for Apache Iceberg in BigQuery, and the Apache Iceberg external table.

  from pyspark.sql import SparkSession

  # Create a Spark session
  spark = SparkSession.builder \
      .appName("BigLake Metastore Iceberg") \
      .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
      .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
      .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
      .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
      .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
      .getOrCreate()
  spark.conf.set("viewsEnabled", "true")

  # Use the catalog and the dataset namespace
  spark.sql("USE `CATALOG_NAME`;")
  spark.sql("USE NAMESPACE DATASET_NAME;")

  # Configure Spark for temporary results
  spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE;")
  spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

  # List the tables in the dataset
  df = spark.sql("SHOW TABLES;")
  df.show()

  # Query the tables
  sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
  df = spark.read.format("bigquery").load(sql)
  df.show()

  sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
  df = spark.read.format("bigquery").load(sql)
  df.show()

  sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
  df = spark.read.format("bigquery").load(sql)
  df.show()

  Replace the following:
  - WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder that's connected to your BigLake table for Apache Iceberg in BigQuery and your Apache Iceberg external table.
  - CATALOG_NAME: the name of the catalog that you're using.
  - MATERIALIZATION_NAMESPACE: the namespace for storing temporary results.
- Run the PySpark script using Serverless Spark.

  gcloud dataproc batches submit pyspark SCRIPT_PATH \
      --version=2.2 \
      --project=PROJECT_ID \
      --region=REGION \
      --deps-bucket=YOUR_BUCKET

  Replace the following:
  - SCRIPT_PATH: the path to the script that the batch job uses.
  - PROJECT_ID: the ID of the Google Cloud project to run the batch job in.
  - REGION: the region where your workload runs.
  - YOUR_BUCKET: the location of the Cloud Storage bucket to upload workload dependencies. The gs:// URI prefix of the bucket is not required. You can specify the bucket path or bucket name, for example, mybucketname1.
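  Optionally, you can check on the batch job after submitting it; a minimal sketch, assuming BATCH_ID is the batch ID reported by the submit command:

  gcloud dataproc batches describe BATCH_ID \
      --project=PROJECT_ID \
      --region=REGION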
What's next
- Set up optional BigLake metastore features.

