Use BigLake metastore with tables in BigQuery

This document explains how to use BigLake metastore with BigQuery tables and Spark.

With BigLake metastore, you can create and use standard (built-in) tables, BigLake tables for Apache Iceberg in BigQuery, and Apache Iceberg external tables from BigQuery.

Before you begin

  1. Enable billing for your Google Cloud project. Learn how to check if billing is enabled on a project .
  2. Enable the BigQuery and Dataproc APIs.

    Enable the APIs

  3. Optional: Understand how BigLake metastore works and why you should use it.

Required roles

To get the permissions that you need to use Spark and Dataproc with BigLake metastore as a metadata store, ask your administrator to grant you the following IAM roles:

  • Create BigLake metastore tables in Spark:
    • Dataproc Worker (roles/dataproc.worker) on the Dataproc Serverless service account in the project
    • BigQuery Data Editor (roles/bigquery.dataEditor) on the Dataproc Serverless service account in the project
    • Storage Object Admin (roles/storage.objectAdmin) on the Dataproc Serverless service account in the project
  • Query BigLake metastore tables in BigQuery:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.
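
For example, an administrator can grant these roles from the gcloud CLI. The following is a minimal sketch, assuming the workload runs as the default Dataproc Serverless service account; SERVICE_ACCOUNT_EMAIL is a placeholder for the service account that your workload actually uses.

    # Grant each required role to the Dataproc Serverless service account.
    # SERVICE_ACCOUNT_EMAIL is a placeholder; substitute your workload's service account.
    for role in roles/dataproc.worker roles/bigquery.dataEditor roles/storage.objectAdmin; do
      gcloud projects add-iam-policy-binding PROJECT_ID \
          --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
          --role="$role"
    done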

Connect to a table

  1. Create a dataset in the Google Cloud console.

     CREATE SCHEMA `PROJECT_ID`.DATASET_NAME;

    Replace the following:

    • PROJECT_ID: the ID of the Google Cloud project in which to create the dataset.
    • DATASET_NAME: a name for your dataset.
  2. Create a Cloud Resource Connection.
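
    For example, you can create the connection with the bq command-line tool. The following is a minimal sketch, assuming a CLOUD_RESOURCE connection in the same location as your dataset; CONNECTION_ID is a placeholder for the connection name.

     bq mk --connection \
         --location=REGION \
         --project_id=PROJECT_ID \
         --connection_type=CLOUD_RESOURCE \
         CONNECTION_ID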

  3. Create a standard BigQuery table.

     CREATE TABLE `PROJECT_ID`.DATASET_NAME.TABLE_NAME (
       name STRING,
       id INT64
     );

    Replace the following:

    • TABLE_NAME: a name for your table.
  4. Insert data into the standard BigQuery table.

     INSERT INTO `PROJECT_ID`.DATASET_NAME.TABLE_NAME
     VALUES ('test_name1', 123), ('test_name2', 456), ('test_name3', 789);

  5. Create a BigLake table for Apache Iceberg in BigQuery .

    For example, to create a table, run the following CREATE statement.

     CREATE TABLE `PROJECT_ID`.DATASET_NAME.ICEBERG_TABLE_NAME (
       name STRING,
       id INT64
     )
     WITH CONNECTION `CONNECTION_NAME`
     OPTIONS (
       file_format = 'PARQUET',
       table_format = 'ICEBERG',
       storage_uri = 'STORAGE_URI'
     );

    Replace the following:

    • ICEBERG_TABLE_NAME: a name for your BigLake table for Apache Iceberg in BigQuery. For example, iceberg_managed_table.
    • CONNECTION_NAME: the name of the connection that you created in step 2. For example, myproject.us.myconnection.
    • STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
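
    The connection's service account also needs access to the Cloud Storage bucket referenced by STORAGE_URI. The following is a minimal sketch, assuming you grant Storage Object Admin at the bucket level; BUCKET_NAME is the bucket from STORAGE_URI and CONNECTION_SA is a placeholder for the service account ID shown in the connection details.

     # Show the connection details, including its service account ID.
     bq show --connection CONNECTION_NAME

     # Grant that service account access to the bucket (CONNECTION_SA is a placeholder).
     gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
         --member="serviceAccount:CONNECTION_SA" \
         --role="roles/storage.objectAdmin"
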
  6. Insert data into the BigLake table for Apache Iceberg in BigQuery.

     INSERT INTO `PROJECT_ID`.DATASET_NAME.ICEBERG_TABLE_NAME
     VALUES ('test_name1', 123), ('test_name2', 456), ('test_name3', 789);

  7. Create an Apache Iceberg external table .

    For example, to create an Iceberg external table, run the following CREATE statement.

     CREATE OR REPLACE EXTERNAL TABLE `PROJECT_ID`.DATASET_NAME.READONLY_ICEBERG_TABLE_NAME
     WITH CONNECTION `CONNECTION_NAME`
     OPTIONS (
       format = 'ICEBERG',
       uris = ['BUCKET_PATH'],
       require_partition_filter = FALSE
     );

    Replace the following:

    • READONLY_ICEBERG_TABLE_NAME: a name for your read-only table.
    • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name'].
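
    Optionally, confirm that the tables are also queryable directly from BigQuery. The following is a minimal sketch using the bq command-line tool with the placeholders defined above.

     bq query --use_legacy_sql=false \
         'SELECT * FROM `PROJECT_ID.DATASET_NAME.ICEBERG_TABLE_NAME`'
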
  8. From PySpark, query the standard table, BigLake table for Apache Iceberg in BigQuery, and Apache Iceberg external table.

     from pyspark.sql import SparkSession

     # Create a Spark session
     spark = SparkSession.builder \
         .appName("BigLake Metastore Iceberg") \
         .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
         .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
         .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
         .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
         .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
         .getOrCreate()
     spark.conf.set("viewsEnabled", "true")

     # Use the catalog
     spark.sql("USE `CATALOG_NAME`;")
     spark.sql("USE NAMESPACE DATASET_NAME;")

     # Configure Spark for temporary results
     spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
     spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

     # List the tables in the dataset
     df = spark.sql("SHOW TABLES;")
     df.show()

     # Query the tables
     sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
     df = spark.read.format("bigquery").load(sql)
     df.show()

     sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
     df = spark.read.format("bigquery").load(sql)
     df.show()

     sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
     df = spark.read.format("bigquery").load(sql)
     df.show()

    Replace the following:

    • WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder that's connected to your BigLake table for Apache Iceberg in BigQuery and your Apache Iceberg external table.
    • CATALOG_NAME: the name of the catalog that you're using.
    • LOCATION: the location of your BigQuery resources, for example US.
    • MATERIALIZATION_NAMESPACE: the namespace for storing temporary results.
  9. Run the PySpark script using Serverless Spark.

    gcloud dataproc batches submit pyspark SCRIPT_PATH \
        --version=2.2 \
        --project=PROJECT_ID \
        --region=REGION \
        --deps-bucket=YOUR_BUCKET

    Replace the following:

    • SCRIPT_PATH: the path to the script that the batch job uses.
    • PROJECT_ID: the ID of the Google Cloud project in which to run the batch job.
    • REGION: the region where your workload runs.
    • YOUR_BUCKET: the Cloud Storage bucket for uploading workload dependencies. The gs:// URI prefix of the bucket is not required; you can specify the bucket path or bucket name, for example, mybucketname1.
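
    To monitor the workload after submission, you can use the Dataproc batches commands. The following is a minimal sketch; BATCH_ID is the identifier that the submit command prints.

     # List recent batch workloads in the region.
     gcloud dataproc batches list --region=REGION

     # Show the details and state of a specific batch workload.
     gcloud dataproc batches describe BATCH_ID --region=REGION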

What's next
