Apache Iceberg tables

Apache Iceberg tables managed by BigQuery (formerly BigLake tables for Apache Iceberg in BigQuery) provide the foundation for building open-format lakehouses on Google Cloud. Managed Iceberg tables offer the same fully managed experience as standard BigQuery tables, but store data in customer-owned storage buckets. Managed Iceberg tables support the open Apache Iceberg table format for better interoperability with open-source and third-party compute engines on a single copy of data.

Managed Iceberg tables support the following features:

Architecture

Managed Iceberg tables bring the convenience of BigQuery resource management to tables that reside in your own cloud buckets. You can use BigQuery and open-source compute engines on these tables without moving the data out of the buckets that you control. You must configure a Cloud Storage bucket before you start using Managed Iceberg tables.

Managed Iceberg tables use the Lakehouse runtime catalog as the unified runtime catalog for all Iceberg data. The Lakehouse runtime catalog provides a single source of truth for managing metadata from multiple engines and allows for engine interoperability.

Using Apache Iceberg tables has the following implications on your bucket:

  • BigQuery creates new data files in the bucket in response to write requests, such as DML statements and streaming, and to background storage optimization.
  • Automatic compaction and clustering are performed on the data files in the bucket. After the time travel window expires, data files are garbage collected. However, if the table is deleted, the associated data files aren't garbage collected. For more information, see Storage optimization.

Creating an Iceberg table is similar to creating BigQuery tables. Because an Iceberg table stores data in open formats on Cloud Storage, you must do the following:

  • Specify the Cloud resource connection with WITH CONNECTION to configure the connection credentials for BigQuery to access Cloud Storage.
  • Specify the file format of data storage as PARQUET with the file_format = PARQUET statement.
  • Specify the open-source metadata table format as ICEBERG with the table_format = ICEBERG statement.

Best practices

Directly changing or adding files to the bucket outside of BigQuery can lead to data loss or unrecoverable errors. The following list describes possible scenarios, their consequences, and how to prevent them:

  • Operation: Add new files to the bucket outside BigQuery.
    Consequence: Data loss. New files or objects added outside of BigQuery aren't tracked by BigQuery, and untracked files are deleted by background garbage collection processes.
    Prevention: Add data exclusively through BigQuery. This lets BigQuery track the files and prevent them from being garbage collected. To prevent accidental additions and data loss, we also recommend restricting external tool write permissions on buckets containing Managed Iceberg tables.
  • Operation: Create a new Iceberg table in a non-empty prefix.
    Consequence: Data loss. Existing data isn't tracked by BigQuery, so these files are considered untracked and are deleted by background garbage collection processes.
    Prevention: Only create new Managed Iceberg tables in empty prefixes.
  • Operation: Modify or replace Iceberg table data files.
    Consequence: Data loss. On external modification or replacement, the table fails a consistency check and becomes unreadable, and queries against the table fail. There is no self-serve way to recover from this point; contact support for data recovery assistance.
    Prevention: Modify data exclusively through BigQuery. This lets BigQuery track the files and prevent them from being garbage collected. To prevent accidental modifications and data loss, we also recommend restricting external tool write permissions on buckets containing Managed Iceberg tables.
  • Operation: Create two Managed Iceberg tables on the same or overlapping URIs.
    Consequence: Data loss. BigQuery doesn't bridge identical URI instances of Managed Iceberg tables. Background garbage collection for each table considers the other table's files untracked and deletes them.
    Prevention: Use a unique storage URI for each Iceberg table.

Cloud Storage bucket configuration best practices

The configuration of your Cloud Storage bucket and its connection with BigQuery have a direct impact on the performance, cost, data integrity, security, and governance of your Managed Iceberg tables. The following are best practices to help with this configuration:

  • Select a name that clearly indicates that the bucket is only meant for Managed Iceberg tables.

  • Choose single-region Cloud Storage buckets that are co-located in the same region as your BigQuery dataset. This coordination improves performance and lowers costs by avoiding data transfer charges.

  • By default, Cloud Storage stores data in the Standard storage class, which provides sufficient performance. To optimize data storage costs, you can enable Autoclass to automatically manage storage class transitions. Autoclass starts with the Standard storage class and moves objects that aren't accessed to progressively colder classes in order to reduce storage costs. When the object is read again, it's moved back to the Standard class.

  • Enable uniform bucket-level access and public access prevention .

  • Verify that the required roles are assigned to the correct users and service accounts.

  • To prevent accidental Spark data deletion or corruption in your Cloud Storage bucket, restrict write and delete permissions for most users in your organization. You can do this by setting a bucket permission policy with conditions that deny PUT and DELETE requests for all users, except those that you specify.

  • Apply Google-managed or customer-managed encryption keys for extra protection of sensitive data.

  • Enable audit logging for operational transparency, troubleshooting, and monitoring data access.

  • Keep the default soft delete policy (7-day retention) to protect against accidental deletions. However, if you find that Iceberg data has been deleted, engage with support rather than restoring objects manually, because objects that are added or modified outside of BigQuery aren't tracked by BigQuery metadata.

  • Adaptive file sizing, automatic clustering, and garbage collection are enabled automatically and help with optimizing file performance and cost.

  • Avoid the following Cloud Storage features, as they are unsupported for Managed Iceberg tables:

You can implement these best practices by creating your bucket with the following command:

gcloud storage buckets create gs://BUCKET_NAME \
    --project=PROJECT_ID \
    --location=LOCATION \
    --enable-autoclass \
    --public-access-prevention \
    --uniform-bucket-level-access

Replace the following:

  • BUCKET_NAME : the name for your new bucket
  • PROJECT_ID : the ID of your project
  • LOCATION : the location for your new bucket

Spark table workflows

The following sections describe how to create, load, manage, and query managed tables.

Before you begin

Before creating and using Managed Iceberg tables, ensure that you have set up a Cloud resource connection to a storage bucket. Your connection needs write permissions on the storage bucket, as specified in the following Required roles section. For more information about required roles and permissions for connections, see Manage connections .

Required roles

To get the permissions that you need to let BigQuery manage tables in your project, ask your administrator to grant you the following IAM roles:

For more information about granting roles, see Manage access to projects, folders, and organizations .

These predefined roles contain the permissions required to let BigQuery manage tables in your project. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to let BigQuery manage tables in your project:

  • All:
    • bigquery.connections.delegate on your project
    • bigquery.jobs.create on your project
    • bigquery.readsessions.create on your project
    • bigquery.tables.create on your project
    • bigquery.tables.get on your project
    • bigquery.tables.getData on your project
    • storage.buckets.get on your bucket
    • storage.objects.create on your bucket
    • storage.objects.delete on your bucket
    • storage.objects.get on your bucket
    • storage.objects.list on your bucket

You might also be able to get these permissions with custom roles or other predefined roles .

Create Managed Iceberg tables

To create an Iceberg table, select one of the following methods:

SQL

CREATE TABLE [PROJECT_ID.]DATASET_ID.TABLE_NAME (
  COLUMN DATA_TYPE[, ...]
)
CLUSTER BY CLUSTER_COLUMN_LIST
WITH CONNECTION {CONNECTION_NAME | DEFAULT}
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'STORAGE_URI'
);

Replace the following:

  • PROJECT_ID : the project containing the dataset. If undefined, the command assumes the default project.
  • DATASET_ID : an existing dataset.
  • TABLE_NAME : the name of the table you're creating.
  • DATA_TYPE : the data type of the information that is contained in the column.
  • CLUSTER_COLUMN_LIST (optional): a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
  • CONNECTION_NAME : the name of the connection. For example, myproject.us.myconnection .

To use a default connection, specify DEFAULT instead of the connection string containing PROJECT_ID.REGION.CONNECTION_ID.

  • STORAGE_URI : a fully qualified Cloud Storage URI . For example, gs://mybucket/table .
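For instance, filling in the template with hypothetical values (a dataset mydataset, a table orders, a connection myproject.us.myconnection, and a bucket mybucket — none of which come from this document), the statement might look like the following sketch:

```sql
CREATE TABLE mydataset.orders (
  order_id INT64,
  created_at TIMESTAMP
)
CLUSTER BY order_id
WITH CONNECTION `myproject.us.myconnection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://mybucket/orders'
);
```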

bq

bq --project_id=PROJECT_ID mk \
    --table \
    --file_format=PARQUET \
    --table_format=ICEBERG \
    --connection_id=CONNECTION_NAME \
    --storage_uri=STORAGE_URI \
    --schema=COLUMN_NAME:DATA_TYPE[, ...] \
    --clustering_fields=CLUSTER_COLUMN_LIST \
    DATASET_ID.MANAGED_TABLE_NAME

Replace the following:

  • PROJECT_ID : the project containing the dataset. If undefined, the command assumes the default project.
  • CONNECTION_NAME : the name of the connection. For example, myproject.us.myconnection .
  • STORAGE_URI : a fully qualified Cloud Storage URI . For example, gs://mybucket/table .
  • COLUMN_NAME : the column name.
  • DATA_TYPE : the data type of the information contained in the column.
  • CLUSTER_COLUMN_LIST (optional): a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
  • DATASET_ID : the ID of an existing dataset.
  • MANAGED_TABLE_NAME : the name of the table you're creating.

API

Call the tables.insert method with a defined table resource, similar to the following:

{
  "tableReference": {
    "tableId": "TABLE_NAME"
  },
  "biglakeConfiguration": {
    "connectionId": "CONNECTION_NAME",
    "fileFormat": "PARQUET",
    "tableFormat": "ICEBERG",
    "storageUri": "STORAGE_URI"
  },
  "schema": {
    "fields": [
      {
        "name": "COLUMN_NAME",
        "type": "DATA_TYPE"
      }
      [, ...]
    ]
  }
}

Replace the following:

  • TABLE_NAME : the name of the table that you're creating.
  • CONNECTION_NAME : the name of the connection. For example, myproject.us.myconnection .
  • STORAGE_URI : a fully qualified Cloud Storage URI . Wildcards are also supported. For example, gs://mybucket/table .
  • COLUMN_NAME : the column name.
  • DATA_TYPE : the data type of the information contained in the column.

Import data into Managed Iceberg tables

The following sections describe how to import data from various table formats into Managed Iceberg tables.

Standard load from flat files

Managed Iceberg tables use BigQuery load jobs to load external files into Managed Iceberg tables. If you have an existing Iceberg table, follow the bq load CLI guide or the LOAD DATA SQL guide to load external data. After loading the data, new Parquet files are written into the STORAGE_URI/data folder.

If the prior instructions are used without an existing Iceberg table, a BigQuery table is created instead.

See the following for tool-specific examples of batch loads into managed tables:

SQL

LOAD DATA INTO MANAGED_TABLE_NAME
FROM FILES (
  uris=['STORAGE_URI'],
  format='FILE_FORMAT'
);

Replace the following:

  • MANAGED_TABLE_NAME : the name of an existing Iceberg table.
  • STORAGE_URI : a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table .
  • FILE_FORMAT : the source table format. For supported formats, see the format row of load_option_list .
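As an illustration, a load of Parquet files into a hypothetical table mydataset.orders from a hypothetical staging prefix might look like this:

```sql
LOAD DATA INTO mydataset.orders
FROM FILES (
  uris=['gs://mybucket/staging/*.parquet'],
  format='PARQUET'
);
```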

bq

bq load \
    --source_format=FILE_FORMAT \
    MANAGED_TABLE_NAME \
    STORAGE_URI

Replace the following:

  • FILE_FORMAT : the source table format. For supported formats, see the format row of load_option_list .
  • MANAGED_TABLE_NAME : the name of an existing Iceberg table.
  • STORAGE_URI : a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table .

Standard load from Apache Hive-partitioned files

You can load Apache Hive-partitioned files into Managed Iceberg tables using standard BigQuery load jobs. For more information, see Loading externally partitioned data .

Load streaming data from Pub/Sub

You can load streaming data into Managed Iceberg tables by using a Pub/Sub BigQuery subscription .

Export data from Managed Iceberg tables

The following sections describe how to export data from Managed Iceberg tables into various table formats.

Export data into flat formats

To export an Iceberg table into a flat format, use the EXPORT DATA statement and select a destination format. For more information, see Exporting data .
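For example, a hypothetical export of a table mydataset.orders to Parquet files under a made-up destination prefix might use an EXPORT DATA statement along these lines (the single wildcard in the URI is required by EXPORT DATA):

```sql
EXPORT DATA OPTIONS (
  uri = 'gs://mybucket/exports/orders-*.parquet',
  format = 'PARQUET'
) AS
SELECT * FROM mydataset.orders;
```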

Create Iceberg table metadata snapshots

To create an Iceberg table metadata snapshot, follow these steps:

  1. Export the metadata into the Iceberg V2 format with the EXPORT TABLE METADATA SQL statement.

  2. Optional: Schedule Iceberg metadata snapshot refresh. To refresh an Iceberg metadata snapshot based on a set time interval, use a scheduled query .

  3. Optional: Enable metadata auto-refresh for your project to automatically update your Iceberg table metadata snapshot on each table mutation. To enable metadata auto-refresh, contact bigquery-tables-for-apache-iceberg-help@google.com . EXPORT METADATA costs are applied on each refresh operation.

The following example creates a scheduled query named My Scheduled Snapshot Refresh Query using the DDL statement EXPORT TABLE METADATA FROM mydataset.test . The DDL statement runs every 24 hours.

bq query \
    --use_legacy_sql=false \
    --display_name='My Scheduled Snapshot Refresh Query' \
    --schedule='every 24 hours' \
    'EXPORT TABLE METADATA FROM mydataset.test'

After you refresh the Iceberg metadata snapshot, you can find the snapshot in the Cloud Storage URI that you used when you created the Iceberg table. The /data folder contains the Parquet file data shards, and the /metadata folder contains the Iceberg metadata snapshot. To find a table's storage URI, query the INFORMATION_SCHEMA.TABLES view:

SELECT
  table_name,
  REGEXP_EXTRACT(ddl, r"storage_uri\s*=\s*\"([^\"]+)\"") AS storage_uri
FROM
  `mydataset`.INFORMATION_SCHEMA.TABLES;

Note that mydataset is a placeholder for your actual dataset; the query returns the storage URI for each table in that dataset.
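As a local illustration of what that REGEXP_EXTRACT pattern matches, the same extraction can be reproduced with Python's re module. The DDL string below is a made-up example shaped like a table's ddl column, not output from a real table:

```python
import re

# Hypothetical DDL text, shaped like the ddl column of INFORMATION_SCHEMA.TABLES
ddl = (
    'CREATE TABLE `myproject.mydataset.test`\n'
    '(id INT64)\n'
    'OPTIONS(\n'
    '  file_format = "PARQUET",\n'
    '  table_format = "ICEBERG",\n'
    '  storage_uri = "gs://mybucket/table"\n'
    ');'
)

# Same pattern as the REGEXP_EXTRACT call in the query above:
# capture everything between the quotes after storage_uri =
match = re.search(r'storage_uri\s*=\s*"([^"]+)"', ddl)
print(match.group(1))  # gs://mybucket/table
```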

Read Managed Iceberg tables with Spark

The following sample sets up your environment to use Spark SQL with Apache Iceberg, and then executes a query to fetch data from a specified Iceberg table.

spark-sql \
    --packages org.apache.iceberg:iceberg-spark-runtime-ICEBERG_VERSION_NUMBER \
    --conf spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.CATALOG_NAME.type=hadoop \
    --conf spark.sql.catalog.CATALOG_NAME.warehouse='BUCKET_PATH'

# Query the table
SELECT * FROM CATALOG_NAME.FOLDER_NAME;

Replace the following:

  • ICEBERG_VERSION_NUMBER : the current version of the Apache Iceberg Spark runtime. Download the latest version from Apache Iceberg Releases .
  • CATALOG_NAME : the catalog to reference your Iceberg table.
  • BUCKET_PATH : the path to the bucket containing the table files. For example, gs://mybucket/ .
  • FOLDER_NAME : the folder containing the table files. For example, myfolder .

Modify Managed Iceberg tables

To modify an Iceberg table, follow the steps shown in Modifying table schemas .

Use multi-statement transactions

To gain access to multi-statement transactions for Managed Iceberg tables, fill out the sign-up form .

Use partitioning

To gain access to partitioning for Apache Iceberg tables, fill out the sign-up form .

You partition a table by specifying a partition column, which is used to segment the table. The following column types are supported for Managed Iceberg tables:

  • DATE
  • DATETIME
  • TIMESTAMP

Partitioning a table on a DATE , DATETIME , or TIMESTAMP column is known as time-unit column partitioning . You choose whether the partitions have hourly, daily, monthly, or yearly granularity .

Managed Iceberg tables also support clustering and combining clustered and partitioned tables .
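As a sketch, assuming partitioning is enabled for your project and using hypothetical names (a table mydataset.events, a connection myproject.us.myconnection, and a bucket mybucket), a daily-partitioned, clustered table could be declared like this:

```sql
CREATE TABLE mydataset.events (
  event_id INT64,
  event_time TIMESTAMP
)
PARTITION BY DATE(event_time)
CLUSTER BY event_id
WITH CONNECTION `myproject.us.myconnection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://mybucket/events'
);
```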

Partitioning limitations

Create a partitioned Iceberg table

To create a partitioned Iceberg table, follow the instructions to create a standard Iceberg table , and include one of the following, depending on your environment:

Modify and query partitioned Managed Iceberg tables

BigQuery data manipulation language (DML) statements and queries for partitioned Managed Iceberg tables are the same as for standard Iceberg tables. BigQuery automatically scopes the job to the right partitions, similar to Iceberg hidden partitioning . Additionally, any new data that you add to the table is automatically partitioned.

You can also query partitioned Managed Iceberg tables with other engines in the same way as standard Managed Iceberg tables. We recommend enabling metadata snapshots for the best experience.

For enhanced security, partitioning information for Managed Iceberg tables is decoupled from the data path and is managed entirely by the metadata layer.

Pricing

Iceberg table pricing consists of storage, storage optimization, and queries and jobs.

Storage

Managed Iceberg tables store all data in Cloud Storage . You are charged for all data stored, including historical table data. Cloud Storage data processing and transfer charges might also apply. Some Cloud Storage operation fees might be waived for operations that are processed through BigQuery or the BigQuery Storage API. There are no BigQuery-specific storage fees. For more information, see Cloud Storage Pricing .

Storage optimization

Managed Iceberg tables perform automatic table management, including compaction, clustering, garbage collection, and BigQuery metadata generation/refresh to optimize query performance and reduce storage costs. Compute resource usage for table management is billed in Data Compute Units (DCUs) over time, in per second increments. For more details, see Apache Iceberg table pricing .

Data export operations taking place while streaming through the Storage Write API are included in Storage Write API pricing and are not charged as background maintenance. For more information, see Data ingestion pricing .

To view the logs and compute usage for these background operations, query the INFORMATION_SCHEMA.JOBS view. For example queries, see the following:
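As a starting point, a query over the JOBS view might look like the following sketch. The region qualifier (region-us here) and the exact filters for isolating background optimization jobs depend on your setup, so treat this as an assumption-laden template rather than a definitive query:

```sql
SELECT job_id, job_type, total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_slot_ms DESC;
```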

Queries and jobs

Similar to BigQuery tables, you are charged for queries and bytes read (per TiB) if you are using BigQuery on-demand pricing , or slot consumption (per slot hour) if you are using BigQuery capacity compute pricing .

BigQuery pricing also applies to the BigQuery Storage Read API and the Storage Write API .

Load and export operations (such as EXPORT METADATA ) use Enterprise edition pay as you go slots . This differs from BigQuery tables, which are not charged for these operations. If PIPELINE reservations with Enterprise or Enterprise Plus slots are available, load and export operations preferentially use these reservation slots instead.

Limitations

Managed Iceberg tables have the following limitations:
