Create an Apache Iceberg table with metadata in BigLake metastore
This document shows you how to create an Apache Iceberg table with metadata in
BigLake metastore using the Dataproc Jobs service,
the Spark SQL CLI, or
the Zeppelin web interface
running on a Dataproc cluster.
Before you begin
If you haven't done so, create a Google Cloud project, a
Cloud Storage bucket,
and a Dataproc cluster.
Set up your project
Sign in to your Google Cloud account. If you're new to
Google Cloud, create an account to evaluate how our products perform in
real-world scenarios. New customers also get $300 in free credits to
run, test, and deploy workloads.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
To filter objects to replicate by object name prefix,
enter a prefix to include or exclude objects, then click Add a prefix.
To set a storage class for the replicated objects,
select a storage class from the Storage class menu.
If you skip this step, the replicated objects use the
destination bucket's storage class by default.
Click Done.
In the Choose how to store your data section, do the following:
To enable hierarchical namespace, in the Optimize storage for data-intensive workloads section, select Enable hierarchical namespace on this bucket.
In the Choose how to control access to objects section, select
whether or not your bucket enforces public access prevention,
and select an access control method for your bucket's objects.
In the Choose how to protect object data section, do the
following:
Select any of the options under Data protection that you
want to set for your bucket.
To enable soft delete, click the Soft delete policy (For data recovery) checkbox,
and specify the number of days you want to retain objects
after deletion.
To set Object Versioning, click the Object versioning (For version control) checkbox,
and specify the maximum number of versions per object and the number of days after which
the noncurrent versions expire.
To enable the retention policy on objects and buckets, click the Retention (For compliance) checkbox, and then configure the retention settings.
If you want to run the Zeppelin web interface example in this guide,
you must use or create a Dataproc cluster with
the Zeppelin optional component enabled.
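If you need to create such a cluster, a gcloud command along the following lines is a minimal sketch; CLUSTER_NAME and REGION are placeholder values you choose, and you can add any other flags your environment requires:
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --optional-components=ZEPPELIN \
    --enable-component-gateway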
Note the following mapping between open source database and BigQuery
dataset terms:
OSS database term → BigQuery dataset term
Namespace, Database → Dataset
Partitioned or Unpartitioned Table → Table
View → View
Create an Iceberg table
This section shows you how to create an Iceberg table with metadata in
BigLake metastore by submitting Spark SQL code to the Dataproc service,
the Spark SQL CLI,
or the Zeppelin component web interface,
which run on a Dataproc cluster.
The examples in this section show you how to submit a
Spark SQL job to the Dataproc
service to create an Iceberg table with metadata in BigLake metastore
using the gcloud CLI, the Google Cloud console, or the
Dataproc REST API.
Prepare job files
Perform the following steps to create a Spark SQL job file. The file contains
Spark SQL commands to create and update an Iceberg table.
In a local terminal window or in Cloud Shell,
use a text editor, such as vi or nano, to copy the
following commands into an iceberg-table.sql file, then save the
file in the current directory.
USE CATALOG_NAME;
CREATE NAMESPACE IF NOT EXISTS example_namespace;
USE example_namespace;
DROP TABLE IF EXISTS example_table;
CREATE TABLE example_table (id int, data string) USING ICEBERG LOCATION 'gs://BUCKET/WAREHOUSE_FOLDER';
INSERT INTO example_table VALUES (1, 'first row');
ALTER TABLE example_table ADD COLUMNS (newDoubleCol double);
DESCRIBE TABLE example_table;
Replace the following:
CATALOG_NAME: Iceberg catalog name.
BUCKET and WAREHOUSE_FOLDER: Cloud Storage bucket
and folder used for the Iceberg warehouse.
Use the gcloud CLI to copy the local iceberg-table.sql file to your bucket in Cloud Storage.
gcloud storage cp iceberg-table.sql gs://BUCKET/
Next, download and copy the iceberg-spark-runtime-3.5_2.12-1.6.1 JAR file to
Cloud Storage.
In a local terminal window or in Cloud Shell,
run the following curl command to download the iceberg-spark-runtime-3.5_2.12-1.6.1 JAR file to the current directory.
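The Maven Central path below is an assumption based on the Apache Iceberg release coordinates, so verify it before use; after the download, copy the JAR to your bucket.
curl -o iceberg-spark-runtime-3.5_2.12-1.6.1.jar \
    https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.6.1/iceberg-spark-runtime-3.5_2.12-1.6.1.jar
gcloud storage cp iceberg-spark-runtime-3.5_2.12-1.6.1.jar gs://BUCKET/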
Select a tab to follow the instructions to submit the Spark SQL job to the
Dataproc service using the gcloud CLI,
Google Cloud console, or Dataproc
REST API.
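For the gcloud CLI, a command along the following lines submits the job. This is a sketch: the catalog properties mirror the PySpark configuration shown later in this guide, and CLUSTER_NAME and REGION are your cluster's name and region.
gcloud dataproc jobs submit spark-sql \
    --project=PROJECT_ID \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --jars="gs://BUCKET/iceberg-spark-runtime-3.5_2.12-1.6.1.jar" \
    --properties="spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog,spark.sql.catalog.CATALOG_NAME.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog,spark.sql.catalog.CATALOG_NAME.gcp_project=PROJECT_ID,spark.sql.catalog.CATALOG_NAME.gcp_location=LOCATION,spark.sql.catalog.CATALOG_NAME.warehouse=gs://BUCKET/WAREHOUSE_FOLDER" \
    --file="gs://BUCKET/iceberg-table.sql"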
Perform the following steps to use the Google Cloud console to submit
the Spark SQL job to the Dataproc service to create an
Iceberg table with metadata in BigLake metastore.
In the Google Cloud console, go to the Dataproc Submit a job page.
Properties: Click Add Property five times to create a list of five key-value input fields, then
copy the following Key and Value pairs to define
five properties.
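The following pairs are a sketch that mirrors the catalog configuration used in the PySpark example later in this guide; adjust them if your setup differs:
Key: spark.sql.catalog.CATALOG_NAME, Value: org.apache.iceberg.spark.SparkCatalog
Key: spark.sql.catalog.CATALOG_NAME.catalog-impl, Value: org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog
Key: spark.sql.catalog.CATALOG_NAME.gcp_project, Value: PROJECT_ID
Key: spark.sql.catalog.CATALOG_NAME.gcp_location, Value: LOCATION
Key: spark.sql.catalog.CATALOG_NAME.warehouse, Value: gs://BUCKET/WAREHOUSE_FOLDER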
PROJECT_ID: Your Google Cloud project ID.
Project IDs are listed in the Project info section on
the Google Cloud console Dashboard.
REGION: The region where your cluster is located.
LOCATION: A supported BigQuery location.
The default location is "US".
BUCKET and WAREHOUSE_FOLDER: Cloud Storage bucket
and folder used for the Iceberg warehouse.
Click Submit.
To monitor job progress and view job output, go to the Dataproc Jobs page in the Google Cloud console,
then click the Job ID to open the Job details page.
View table metadata in BigQuery
In the Google Cloud console, go to the BigQuery page.
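As an optional check (a sketch that is not taken from this guide, and assuming the Iceberg namespace maps to a BigQuery dataset as described earlier), you can also query the new table with the bq CLI:
bq query --use_legacy_sql=false \
    'SELECT * FROM `PROJECT_ID.example_namespace.example_table`;'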
You can use the Dataproc jobs.submit API
to submit the Spark SQL job to the Dataproc service to create an
Iceberg table with metadata in BigLake metastore.
Before using any of the request data,
make the following replacements:
PROJECT_ID: Your Google Cloud project ID.
Project IDs are listed in the Project info section on
the Google Cloud console Dashboard.
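As a minimal sketch of a jobs.submit request, the following curl call assumes the files staged earlier in this guide and the catalog properties used in the other examples; CLUSTER_NAME and REGION are your cluster's name and region.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
      "job": {
        "placement": {"clusterName": "CLUSTER_NAME"},
        "sparkSqlJob": {
          "queryFileUri": "gs://BUCKET/iceberg-table.sql",
          "jarFileUris": ["gs://BUCKET/iceberg-spark-runtime-3.5_2.12-1.6.1.jar"],
          "properties": {
            "spark.sql.catalog.CATALOG_NAME": "org.apache.iceberg.spark.SparkCatalog",
            "spark.sql.catalog.CATALOG_NAME.catalog-impl": "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
            "spark.sql.catalog.CATALOG_NAME.gcp_project": "PROJECT_ID",
            "spark.sql.catalog.CATALOG_NAME.gcp_location": "LOCATION",
            "spark.sql.catalog.CATALOG_NAME.warehouse": "gs://BUCKET/WAREHOUSE_FOLDER"
          }
        }
      }
    }' \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/jobs:submit"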
The following steps show you how to create an Iceberg table with table metadata
stored in BigLake metastore using the Spark SQL CLI running on the
master node of a Dataproc cluster.
Use SSH to connect to the master node
of your Dataproc cluster.
In the SSH session terminal, use the vi or nano text editor to copy the
following commands into an iceberg-table.sql file.
SET CATALOG_NAME = `CATALOG_NAME`;
SET BUCKET = `BUCKET`;
SET WAREHOUSE_FOLDER = `WAREHOUSE_FOLDER`;
USE `${CATALOG_NAME}`;
CREATE NAMESPACE IF NOT EXISTS `${CATALOG_NAME}`.example_namespace;
DROP TABLE IF EXISTS `${CATALOG_NAME}`.example_namespace.example_table;
CREATE TABLE `${CATALOG_NAME}`.example_namespace.example_table (id int, data string) USING ICEBERG LOCATION 'gs://${BUCKET}/${WAREHOUSE_FOLDER}';
INSERT INTO `${CATALOG_NAME}`.example_namespace.example_table VALUES (1, 'first row');
ALTER TABLE `${CATALOG_NAME}`.example_namespace.example_table ADD COLUMNS (newDoubleCol double);
DESCRIBE TABLE `${CATALOG_NAME}`.example_namespace.example_table;
Replace the following:
CATALOG_NAME: Iceberg catalog name.
BUCKET and WAREHOUSE_FOLDER: Cloud Storage bucket
and folder used for the Iceberg warehouse.
In the SSH session terminal, run the following spark-sql command to create
the Iceberg table.
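The command below is a sketch: the catalog settings mirror the PySpark configuration shown later in this guide, and the JAR path assumes the file you copied to Cloud Storage earlier.
spark-sql \
    --jars gs://BUCKET/iceberg-spark-runtime-3.5_2.12-1.6.1.jar \
    --conf spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.CATALOG_NAME.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog \
    --conf spark.sql.catalog.CATALOG_NAME.gcp_project=PROJECT_ID \
    --conf spark.sql.catalog.CATALOG_NAME.gcp_location=LOCATION \
    --conf spark.sql.catalog.CATALOG_NAME.warehouse=gs://BUCKET/WAREHOUSE_FOLDER \
    -f iceberg-table.sql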
The following steps show you how to create an Iceberg table with table
metadata stored in BigLake metastore using the Zeppelin web
interface running on the master node of a Dataproc cluster.
In the Google Cloud console, go to the Dataproc Clusters page.
Click Save at the bottom of the Spark interpreter section, then
click OK to update the interpreter and restart the Spark interpreter
with the new settings.
From the Zeppelin notebook menu, click Create new note.
In the Create new note dialog, enter a name for the notebook, and
accept the default spark interpreter. Click Create to open the
notebook.
Copy the following PySpark code into your Zeppelin notebook
after filling in the variables.
%pyspark
from pyspark.sql import SparkSession

project_id = "PROJECT_ID"
catalog = "CATALOG_NAME"
namespace = "NAMESPACE"
location = "LOCATION"
warehouse_dir = "gs://BUCKET/WAREHOUSE_DIRECTORY"

spark = SparkSession.builder \
  .appName("BigQuery Metastore Iceberg") \
  .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog") \
  .config(f"spark.sql.catalog.{catalog}.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
  .config(f"spark.sql.catalog.{catalog}.gcp_project", f"{project_id}") \
  .config(f"spark.sql.catalog.{catalog}.gcp_location", f"{location}") \
  .config(f"spark.sql.catalog.{catalog}.warehouse", f"{warehouse_dir}") \
  .getOrCreate()

spark.sql(f"USE `{catalog}`;")
spark.sql(f"CREATE NAMESPACE IF NOT EXISTS `{namespace}`;")
spark.sql(f"USE `{namespace}`;")

# Create table and display schema (without LOCATION)
spark.sql("DROP TABLE IF EXISTS example_iceberg_table")
spark.sql("CREATE TABLE example_iceberg_table (id int, data string) USING ICEBERG")
spark.sql("DESCRIBE example_iceberg_table;")

# Insert table data.
spark.sql("INSERT INTO example_iceberg_table VALUES (1, 'first row');")

# Alter table, then display schema.
spark.sql("ALTER TABLE example_iceberg_table ADD COLUMNS (newDoubleCol double);")

# Select and display the contents of the table.
spark.sql("SELECT * FROM example_iceberg_table").show()
Replace the following:
PROJECT_ID: Your Google Cloud project ID.
Project IDs are listed in the Project info section on
the Google Cloud console Dashboard.
CATALOG_NAME and NAMESPACE: The Iceberg catalog name
and namespace, which combine to identify the Iceberg table (catalog.namespace.table_name).
LOCATION: A supported BigQuery location.
The default location is "US".
BUCKET and WAREHOUSE_DIRECTORY: Cloud Storage bucket
and folder used as the Iceberg warehouse directory.
Click the run icon or press Shift-Enter to run the code. When the job
completes, the status message shows "Spark Job Finished", and the output
displays the table contents.
View table metadata in BigQuery
In the Google Cloud console, go to the BigQuery page.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[[["\u003cp\u003eThis content outlines the process of creating an Apache Iceberg table with metadata in BigQuery Metastore using Dataproc Jobs service, Spark SQL CLI, or Zeppelin web interface, all running on a Dataproc cluster.\u003c/p\u003e\n"],["\u003cp\u003eBefore proceeding, users must set up a Google Cloud project, create a Cloud Storage bucket, and provision a Dataproc cluster, ensuring Private Google Access is enabled.\u003c/p\u003e\n"],["\u003cp\u003eTo create an Iceberg table, you can submit a Spark SQL job to the Dataproc service via the gcloud CLI, Google Cloud console, or Dataproc REST API, by preparing a Spark SQL job file and necessary JAR files in Cloud Storage.\u003c/p\u003e\n"],["\u003cp\u003eAlternatively, one can create an Iceberg table by connecting to a Dataproc cluster's master node via SSH, then running the Spark SQL CLI with appropriate configurations, or by using the Zeppelin web interface to run PySpark code.\u003c/p\u003e\n"],["\u003cp\u003eThe product or feature mentioned is in Pre-GA stage, therefore it's available "as is", might have limited support, and users need to refer to the "Pre-GA Offerings Terms".\u003c/p\u003e\n"]]],[],null,[]]