Use BigLake metastore with Spark stored procedures

This document explains how to use Apache Spark stored procedures with BigLake metastore.

Before you begin

  1. Enable billing for your Google Cloud project. Learn how to check if billing is enabled on a project.
  2. Enable the BigQuery and Dataflow APIs.

    Enable the APIs

  3. Optional: Learn more about BigLake metastore and stored procedures for Apache Spark.

Required roles

To use Spark stored procedures, review the required roles for stored procedures and grant the necessary roles.

To get the permissions that you need to use Spark and stored procedures with BigLake metastore as a metadata store, ask your administrator to grant you the following IAM roles:

  • Create BigLake metastore tables in Spark:
    • BigQuery Data Editor ( roles/bigquery.dataEditor ) on the Spark Connection service account in the project
    • Storage Object Admin ( roles/storage.objectAdmin ) on the Spark Connection service account in the project
  • Query BigLake metastore tables in BigQuery:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.
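
If you only need dataset-scoped access, BigQuery also supports granting roles with SQL DCL statements. The following is a minimal sketch, not the project-level grant described above: it gives BigQuery Data Editor on a single dataset to the Spark connection's service account. CONNECTION_SERVICE_ACCOUNT is a placeholder for the service account shown on your connection's details page, and the Storage Object Admin role still has to be granted through IAM.

    -- Grant dataset-level access to the Spark connection's service account.
    GRANT `roles/bigquery.dataEditor`
    ON SCHEMA `PROJECT_ID.BQ_DATASET_ID`
    TO "serviceAccount:CONNECTION_SERVICE_ACCOUNT";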

Create and run a stored procedure

The following example shows you how to create and run a stored procedure with BigLake metastore.

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, add the following sample code for the CREATE PROCEDURE statement.

    CREATE OR REPLACE PROCEDURE `PROJECT_ID.BQ_DATASET_ID.PROCEDURE_NAME`()
    WITH CONNECTION `PROJECT_ID.REGION.SPARK_CONNECTION_ID`
    OPTIONS (
      engine='SPARK',
      runtime_version='1.1',
      properties=[
        ("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY"),
        ("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION"),
        ("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID"),
        ("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog"),
        ("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog"),
        ("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
      ],
      jar_uris=["gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.6.1-1.0.1-beta.jar"]
    )
    LANGUAGE python AS R"""
    from pyspark.sql import SparkSession

    spark = SparkSession \
      .builder \
      .appName("BigLake Metastore Iceberg") \
      .getOrCreate()

    spark.sql("USE CATALOG_NAME;")
    spark.sql("CREATE NAMESPACE IF NOT EXISTS NAMESPACE_NAME;")
    spark.sql("USE NAMESPACE_NAME;")
    spark.sql("CREATE TABLE TABLE_NAME (id int, data string) USING ICEBERG LOCATION 'WAREHOUSE_DIRECTORY'")
    spark.sql("DESCRIBE TABLE_NAME;")
    spark.sql("INSERT INTO TABLE_NAME VALUES (1, \"first row\");")
    spark.sql("SELECT * from TABLE_NAME;")
    spark.sql("ALTER TABLE TABLE_NAME ADD COLUMNS (newDoubleCol double);")
    spark.sql("DESCRIBE TABLE_NAME;")
    """;

    CALL `PROJECT_ID.BQ_DATASET_ID.PROCEDURE_NAME`();

    Replace the following:

    • PROJECT_ID : the ID of your Google Cloud project.
    • BQ_DATASET_ID : the ID of the dataset in BigQuery that contains the procedure.
    • PROCEDURE_NAME : the name of the procedure that you're creating or replacing.
    • REGION : the location of your Spark connection.
    • LOCATION : the location of your BigQuery resources.
    • SPARK_CONNECTION_ID : the ID of your Spark connection.
    • CATALOG_NAME : the name of the catalog that you're using.
    • WAREHOUSE_DIRECTORY : the URI of the Cloud Storage folder that contains your data warehouse.
    • NAMESPACE_NAME : the namespace that you're using.
    • TABLE_NAME : the name of the table that you're creating.
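
After the CALL statement finishes, the table that the procedure created is registered in BigLake metastore, so you can also query it directly from BigQuery. A minimal check, assuming the Spark namespace surfaces as a BigQuery dataset with the same name (replace the placeholders as above):

    SELECT * FROM `PROJECT_ID.NAMESPACE_NAME.TABLE_NAME`;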

What's next
