Create Amazon S3 BigLake tables
This document describes how to create an Amazon Simple Storage Service (Amazon S3) BigLake table. A BigLake table lets you use access delegation to query data in Amazon S3. Access delegation decouples access to the BigLake table from access to the underlying data store.
To get the permissions that you need to create an external table, ask your administrator to grant you the BigQuery Admin (roles/bigquery.admin) IAM role on your dataset. For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to create an external table. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create an external table:
Optional: To delete tables automatically, select the Enable table expiration checkbox and set the Default maximum table age in days. Data in Amazon S3 is not deleted when the table expires.
If you want to use default collation, expand the Advanced options section and then select the Enable default collation option.
The --project_id parameter overrides the default project.
Replace the following:
LOCATION: the location of your dataset. For information about supported regions, see Locations. After you create a dataset, you can't change its location. You can set a default value for the location by using the .bigqueryrc file.
PROJECT_ID: your project ID
DATASET_NAME: the name of the dataset that you want to create. To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_NAME.
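The bq command that these placeholders plug into isn't shown above. A minimal sketch, assuming the standard bq mk syntax for creating a dataset:

bq --location=LOCATION mk \
    --dataset \
    PROJECT_ID:DATASET_NAME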
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetInfo;

// Sample to create an aws dataset
public class CreateDatasetAws {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    // Note: As of now location only supports aws-us-east-1
    String location = "aws-us-east-1";
    createDatasetAws(projectId, datasetName, location);
  }

  public static void createDatasetAws(String projectId, String datasetName, String location) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      DatasetInfo datasetInfo =
          DatasetInfo.newBuilder(projectId, datasetName).setLocation(location).build();

      Dataset dataset = bigquery.create(datasetInfo);
      System.out.println(
          "Aws dataset created successfully: " + dataset.getDatasetId().getDataset());
    } catch (BigQueryException e) {
      System.out.println("Aws dataset was not created.\n" + e.toString());
    }
  }
}
In the Explorer pane, expand your project, and then select a dataset.
In the Dataset info section, click Create table.
On the Create table page, in the Source section, do the following:
For Create table from, select Amazon S3.
For Select S3 path, enter a URI pointing to the Amazon S3 data in the format s3://BUCKET_NAME/PATH. Replace BUCKET_NAME with the name of the Amazon S3 bucket; the bucket's region should be the same as the dataset's region. Replace PATH with the path to the data that you want to query; it can contain one wildcard *.
For File format, select the data format in Amazon S3. Supported formats are AVRO, CSV, DELTA_LAKE, ICEBERG, JSONL, ORC, and PARQUET.
In the Destination section, specify the following details:
For Dataset, choose the appropriate dataset.
In the Table field, enter the name of the table.
Verify that Table type is set to External table.
For Connection ID, choose the appropriate connection ID from the drop-down. For information about connections, see Connect to Amazon S3.
In the Schema section, you can either enable schema auto-detection or manually specify a schema if you have a source file. If you don't have a source file, you must manually specify a schema.
To enable schema auto-detection, select the Auto-detect option.
To manually specify a schema, leave the Auto-detect option unchecked. Enable Edit as text and enter the table schema as a JSON array.
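For example, a minimal two-column schema that matches the fields used in the Java sample later in this document; the column names are illustrative:

[
  {"name": "name", "type": "STRING"},
  {"name": "post_abbr", "type": "STRING"}
]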
S3_URI: a URI pointing to the Amazon S3 data (for example, s3://bucket/path)
STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.
To disable metadata caching, specify 0. This is the default.
To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4-hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Amazon S3 instead.
CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.
Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache. You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
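These placeholders plug into a CREATE EXTERNAL TABLE statement that isn't shown above. A minimal sketch, assuming the standard BigQuery DDL syntax and the connection placeholders defined later in this section:

CREATE EXTERNAL TABLE DATASET_NAME.TABLE_NAME
  WITH CONNECTION `AWS_LOCATION.CONNECTION_NAME`
  OPTIONS (
    format = "DATA_FORMAT",
    uris = ["S3_URI"],
    max_staleness = STALENESS_INTERVAL,
    metadata_cache_mode = "CACHE_MODE"
  );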
DATA_FORMAT: any of the supported BigQuery federated formats (such as AVRO, CSV, DELTA_LAKE, ICEBERG, or PARQUET).
S3_URI: a URI pointing to the Amazon S3 data (for example, s3://bucket/path).
AWS_LOCATION: an AWS location in Google Cloud (for example, aws-us-east-1).
CONNECTION_NAME: the name of the connection you created.
CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. You only need to include this flag if you also plan to use the --max_staleness flag in the subsequent bq mk command to enable metadata caching. For more information about metadata caching considerations, see Metadata caching for performance.
Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache. You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.
To disable metadata caching, specify 0. This is the default.
To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4-hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Amazon S3 instead.
DATASET_NAME: the name of the dataset you created.
TABLE_NAME: the name you want to give to this table.
For example, the following command creates a new BigLake table, awsdataset.awstable, which can query your Amazon S3 data that's stored at the path s3://s3-bucket/path/file.csv and has a read connection in the location aws-us-east-1:
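The command itself is missing here. A reconstruction from the bq mk placeholders above; the connection name s3-read-connection is an assumption, since the prose doesn't name the connection:

bq mk --external_table_definition=@CSV=s3://s3-bucket/path/file.csv@aws-us-east-1.s3-read-connection awsdataset.awstable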
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

// Sample to create an external aws table
public class CreateExternalTableAws {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String connectionId = "MY_CONNECTION_ID";
    String sourceUri = "s3://your-bucket-name/";
    CsvOptions options = CsvOptions.newBuilder().setSkipLeadingRows(1).build();
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("post_abbr", StandardSQLTypeName.STRING));
    ExternalTableDefinition externalTableDefinition =
        ExternalTableDefinition.newBuilder(sourceUri, options)
            .setConnectionId(connectionId)
            .setSchema(schema)
            .build();
    createExternalTableAws(projectId, datasetName, tableName, externalTableDefinition);
  }

  public static void createExternalTableAws(
      String projectId,
      String datasetName,
      String tableName,
      ExternalTableDefinition externalTableDefinition) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(projectId, datasetName, tableName);
      TableInfo tableInfo = TableInfo.newBuilder(tableId, externalTableDefinition).build();
      bigquery.create(tableInfo);
      System.out.println("Aws external table created successfully");

      // Clean up
      bigquery.delete(TableId.of(projectId, datasetName, tableName));
    } catch (BigQueryException e) {
      System.out.println("Aws external table was not created." + e.toString());
    }
  }
}
Create BigLake tables on partitioned data
You can create a BigLake table for Hive partitioned data in Amazon S3. After you create an externally partitioned table, you can't change the partition key. To change the partition key, you must recreate the table.
To create a BigLake table based on Hive partitioned data,
select one of the following options:
In the Explorer pane, expand your project and select a dataset.
Click View actions, and then click Create table. This opens the Create table pane.
In the Source section, specify the following details:
For Create table from, select Amazon S3.
Provide the path to the folder, using wildcards. For example, s3://mybucket/*.
The folder must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
From the File format list, select the file type.
Select the Source data partitioning checkbox, and then specify the following details:
For Select Source URI Prefix, enter the URI prefix. For example, s3://mybucket/my_files.
Optional: To require a partition filter on all queries for this table, select the Require partition filter checkbox. Requiring a partition filter can reduce cost and improve performance. For more information, see Requiring predicate filters on partition keys in queries.
In the Partition inference mode section, select one of the following options:
Automatically infer types: set the partition schema detection mode to AUTO.
All columns are strings: set the partition schema detection mode to STRINGS.
Provide my own: set the partition schema detection mode to CUSTOM and manually enter the schema information for the partition keys. For more information, see Provide a custom partition key schema.
In the Destination section, specify the following details:
For Project, select the project in which you want to create the table.
For Dataset, select the dataset in which you want to create the table.
For Table, enter the name of the table that you want to create.
For Connection ID, select the connection that you created earlier.
In the Schema section, you can either enable schema auto-detection or manually specify a schema if you have a source file. If you don't have a source file, you must manually specify a schema.
To enable schema auto-detection, select the Auto-detect option.
To manually specify a schema, leave the Auto-detect option unchecked. Enable Edit as text and enter the table schema as a JSON array.
To ignore rows with extra column values that don't match the schema, expand the Advanced options section and select Unknown values.
PROJECT_ID: the name of the project in which you want to create the table (for example, myproject).
DATASET: the name of the BigQuery dataset that you want to create the table in (for example, mydataset).
EXTERNAL_TABLE_NAME: the name of the table that you want to create (for example, mytable).
PARTITION_COLUMN: the name of the partitioning column.
PARTITION_COLUMN_TYPE: the type of the partitioning column.
REGION: the region that contains the connection (for example, us).
CONNECTION_ID: the name of the connection (for example, myconnection).
HIVE_PARTITION_URI_PREFIX: the Hive partitioning URI prefix (for example, s3://mybucket/).
FILE_PATH: the path to the data source for the external table that you want to create (for example, s3://mybucket/*.parquet).
TABLE_FORMAT: the format of the table that you want to create (for example, PARQUET).
STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.
To disable metadata caching, specify 0. This is the default.
To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4-hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Amazon S3 instead.
CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.
Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache. You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
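These placeholders plug into a CREATE EXTERNAL TABLE statement for partitioned data that isn't shown above. A minimal sketch, assuming the standard WITH PARTITION COLUMNS syntax:

CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
  WITH PARTITION COLUMNS (
    PARTITION_COLUMN PARTITION_COLUMN_TYPE
  )
  WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
  OPTIONS (
    hive_partition_uri_prefix = "HIVE_PARTITION_URI_PREFIX",
    uris = ["FILE_PATH"],
    format = "TABLE_FORMAT",
    max_staleness = STALENESS_INTERVAL,
    metadata_cache_mode = "CACHE_MODE"
  );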
SOURCE_FORMAT: the format of the external data source (for example, CSV).
REGION: the region that contains the connection (for example, us).
CONNECTION_ID: the name of the connection (for example, myconnection).
PARTITIONING_MODE: the Hive partitioning mode. Use one of the following values:
AUTO: Automatically detect the key names and types.
STRINGS: Automatically convert the key names to strings.
CUSTOM: Encode the key schema in the source URI prefix.
URI_SHARED_PREFIX: the source URI prefix.
BOOLEAN: specifies whether to require a predicate filter at query time. This flag is optional. The default value is false.
CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. You only need to include this flag if you also plan to use the --max_staleness flag in the subsequent bq mk command to enable metadata caching. For more information about metadata caching considerations, see Metadata caching for performance.
Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.
Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache. You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.
URIS: the path to the Amazon S3 folder, using wildcard format.
STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.
To disable metadata caching, specify 0. This is the default.
To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4-hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Amazon S3 instead.
DEFINITION_FILE: the path to the table definition file.
DATASET_NAME: the name of the dataset that contains the
table.
TABLE_NAME: the name of the table you're creating.
SCHEMA: specifies a path to a JSON schema file, or specifies the schema in the form field:data_type,field:data_type,.... To use schema auto-detection, omit this argument.
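The bq commands that these placeholders plug into aren't shown above. A minimal sketch, assuming the standard bq mkdef and bq mk flags for Hive partitioning:

bq mkdef --source_format=SOURCE_FORMAT \
  --connection_id=REGION.CONNECTION_ID \
  --hive_partitioning_mode=PARTITIONING_MODE \
  --hive_partitioning_source_uri_prefix=URI_SHARED_PREFIX \
  --require_hive_partition_filter=BOOLEAN \
  --metadata_cache_mode=CACHE_MODE \
  URIS > DEFINITION_FILE

bq mk --external_table_definition=DEFINITION_FILE \
  --max_staleness=STALENESS_INTERVAL \
  DATASET_NAME.TABLE_NAME \
  SCHEMA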
Examples
The following example uses AUTO Hive partitioning mode for Amazon S3 data:
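The example itself is missing here. A sketch that reuses names from elsewhere in this document (mybucket, my_files, myconnection, mydataset, mytable); treat them as illustrative:

bq mkdef --source_format=CSV \
  --connection_id=us.myconnection \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=s3://mybucket/my_files \
  --metadata_cache_mode=AUTOMATIC \
  "s3://mybucket/my_files/*" > mytable_def

bq mk --external_table_definition=mytable_def \
  mydataset.mytable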
If you set the hivePartitioningOptions.mode field to CUSTOM, you must encode the partition key schema in the hivePartitioningOptions.sourceUriPrefix field as follows: s3://BUCKET/PATH_TO_TABLE/{KEY1:TYPE1}/{KEY2:TYPE2}/...
To enforce the use of a predicate filter at query time, set the hivePartitioningOptions.requirePartitionFilter field to true.
Delta Lake tables
Delta Lake is an open source table format that supports petabyte-scale data tables. Delta Lake tables can be queried as both temporary and permanent tables, and are supported as BigLake tables.
Schema synchronization
Delta Lake maintains a canonical schema as part of its metadata. You can't update the schema by using a JSON metadata file; to update the schema, run a table update operation with schema auto-detection enabled.
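A minimal sketch of such an update, assuming the bq update command's --autodetect_schema flag; the project, dataset, and table names are illustrative:

bq update --autodetect_schema myproject:mydataset.mytable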
Delta Lake tables are only supported on BigQuery Omni and have the associated limitations.
You can't update a table with a new JSON metadata file. You must use a schema auto-detection table update operation. See Schema synchronization for more information.
BigLake security features only protect Delta Lake
tables when accessed through BigQuery services.
Create a Delta Lake table
The following example creates an external table by using the CREATE EXTERNAL TABLE statement with the Delta Lake format:
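The statement itself is missing here. A minimal sketch, assuming the DELTA_LAKE format value listed earlier; the table, connection, and bucket names are illustrative:

CREATE EXTERNAL TABLE mydataset.my_delta_table
  WITH CONNECTION `aws-us-east-1.myconnection`
  OPTIONS (
    format = "DELTA_LAKE",
    uris = ["s3://mybucket/delta_table"]
  );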
In the Query settings dialog, for Additional settings > Data location, select the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is aws-us-east-1, specify us-east4.
Select the remaining fields and click Save.
bq
Use the --location flag to set the job's processing location to the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is aws-us-east-1, specify us-east4.
If you are running jobs programmatically, set the location argument to the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is aws-us-east-1, specify us-east4.
The following example lists the metadata refresh jobs:
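The query itself is missing here. A sketch against the INFORMATION_SCHEMA.JOBS view, assuming the aws-us-east-1 region used elsewhere in this document and a job ID pattern that matches metadata cache refresh jobs:

SELECT *
FROM `region-aws-us-east-1`.INFORMATION_SCHEMA.JOBS
WHERE job_id LIKE '%metadata_cache_refresh%'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
ORDER BY start_time DESC
LIMIT 10;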
You can use VPC Service Controls perimeters to restrict access from
BigQuery Omni to an external cloud service as an extra layer of
defense. For example, VPC Service Controls perimeters can limit exports from
your BigQuery Omni tables to a specific Amazon S3 bucket
or Blob Storage container.
Ensure that you have the required permissions to configure service perimeters. To view a list of IAM roles required to configure VPC Service Controls, see Access control with IAM in the VPC Service Controls documentation.
Set up VPC Service Controls using the Google Cloud console
In the Google Cloud console navigation menu, click Security, and then click VPC Service Controls.
To set up VPC Service Controls for BigQuery Omni, follow the steps in the Create a service perimeter guide, and when you are in the Egress rules pane, follow these steps:
In the Egress rules panel, click Add rule.
In the From attributes of the API client section, select an option from the Identity list.
Select To attributes of external resources.
To add an external resource, click Add external resources.
In the Add external resource dialog, for External resource name, enter a valid resource name. For example:
For Amazon Simple Storage Service (Amazon S3): s3://BUCKET_NAME. Replace BUCKET_NAME with the name of your Amazon S3 bucket.
For Azure Blob Storage: azure://myaccount.blob.core.windows.net/CONTAINER_NAME. Replace CONTAINER_NAME with the name of your Blob Storage container.
Select the methods that you want to allow on your external resources:
If you want to allow all methods, select All methods in the Methods list.
If you want to allow specific methods, select Selected method, click Select methods, and then select the methods that you want to allow on your external resources.
Click Create perimeter.
Set up VPC Service Controls using the gcloud CLI
To set up VPC Service Controls using the gcloud CLI, follow these
steps:
An access policy is an organization-wide container for access levels and service perimeters. For information about setting a default access policy or getting an access policy name, see Managing an access policy.
Create the egress policy input file
An egress rule block defines the allowed access from within a perimeter to resources outside of that perimeter. For external resources, the externalResources property defines the external resource paths allowed access from within your VPC Service Controls perimeter.
Egress rules can be configured using a JSON file or a YAML file. The following sample uses the .yaml format:
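The sample file itself is missing here. A sketch assembled from the properties described below; EXTERNAL_RESOURCE_PATH and IDENTITY_TYPE are the placeholders those descriptions define:

- egressTo:
    operations:
    - serviceName: bigquery.googleapis.com
      methodSelectors:
      - method: "*"
    externalResources:
    - EXTERNAL_RESOURCE_PATH
  egressFrom:
    identityType: IDENTITY_TYPE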
methodSelectors: lists methods that a client satisfying the from conditions can access. For restrictable methods and permissions for services, see Supported service method restrictions.
method: a valid service method, or "*" to allow all serviceName methods.
permission: a valid service permission, such as "*", externalResource.read, or externalResource.write. Access to resources outside the perimeter is allowed for operations that require this permission.
externalResources: lists external resources that clients inside a perimeter can access. Replace EXTERNAL_RESOURCE_PATH with either a valid Amazon S3 bucket, such as s3://bucket_name, or a Blob Storage container path, such as azure://myaccount.blob.core.windows.net/container_name.
egressFrom: lists allowed service operations on Google Cloud resources in specified projects within the perimeter.
identityType or identities: defines the identity types that can access the specified resources outside the perimeter. Replace IDENTITY_TYPE with one of the following valid values:
ANY_IDENTITY: to allow all identities.
ANY_USER_ACCOUNT: to allow all users.
ANY_SERVICE_ACCOUNT: to allow all service accounts.
identities: lists service accounts that can access the specified resources outside the perimeter.
serviceAccount (optional): replace SERVICE_ACCOUNT with the service account that can access the specified resources outside the perimeter.
Examples
The following example is a policy that allows egress operations from inside the perimeter to the s3://mybucket Amazon S3 location in AWS.
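The policy file itself is missing here. A sketch in the .yaml format above, allowing any identity inside the perimeter to reach s3://mybucket through the BigQuery API:

- egressTo:
    operations:
    - serviceName: bigquery.googleapis.com
      methodSelectors:
      - method: "*"
    externalResources:
    - s3://mybucket
  egressFrom:
    identityType: ANY_IDENTITY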
To add the egress policy when you create a new service perimeter, use the gcloud access-context-manager perimeters create command. For example, the following command creates a new perimeter named omniPerimeter that includes the project with project number 12345, restricts the BigQuery API, and adds an egress policy defined in the egress.yaml file:
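The command itself is missing here. A sketch using flags that gcloud access-context-manager perimeters create supports; the --title value is an assumption:

gcloud access-context-manager perimeters create omniPerimeter \
    --title="Omni Perimeter" \
    --resources=projects/12345 \
    --restricted-services=bigquery.googleapis.com \
    --egress-policies=egress.yaml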
To add the egress policy to an existing service perimeter, use the gcloud access-context-manager perimeters update command. For example, the following command adds an egress policy defined in the egress.yaml file to an existing service perimeter named omniPerimeter:
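The command itself is missing here. A sketch, assuming the --set-egress-policies flag:

gcloud access-context-manager perimeters update omniPerimeter \
    --set-egress-policies=egress.yaml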
As a BigQuery administrator, you can create an S3 bucket policy to grant BigQuery Omni access to your Amazon S3 resources. This ensures that only authorized BigQuery Omni VPCs can interact with your Amazon S3 buckets, enhancing the security of your data.
Apply an S3 bucket policy for BigQuery Omni VPC
To apply an S3 bucket policy, use the AWS CLI or Terraform:
AWS CLI
Run the following command to apply an S3 bucket policy that includes a condition using the aws:SourceVpc attribute:
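The command itself is missing here. A sketch using aws s3api put-bucket-policy; the policy Id, Sid, and the specific allowed actions are assumptions:

aws s3api put-bucket-policy \
  --bucket=BUCKET_NAME \
  --policy "{
    \"Version\": \"2012-10-17\",
    \"Id\": \"RestrictBucketReads\",
    \"Statement\": [{
      \"Sid\": \"AccessOnlyToOmniVPC\",
      \"Principal\": \"*\",
      \"Action\": [\"s3:ListBucket\", \"s3:GetObject\"],
      \"Effect\": \"Allow\",
      \"Resource\": [\"arn:aws:s3:::BUCKET_NAME\", \"arn:aws:s3:::BUCKET_NAME/*\"],
      \"Condition\": {
        \"StringEquals\": {
          \"aws:SourceVpc\": \"VPC_ID\"
        }
      }
    }]
  }"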
BUCKET_NAME: the Amazon S3 bucket that
you want BigQuery to access.
VPC_ID: the BigQuery Omni VPC ID of the
BigQuery Omni region collocated with the Amazon S3
bucket. You can find this information in the table on this page.
Terraform
Add the following to your Terraform configuration file:
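The configuration itself is missing here. A sketch using the aws_s3_bucket_policy resource with the same aws:SourceVpc condition; the resource label and policy statement names are illustrative:

resource "aws_s3_bucket_policy" "omni_access" {
  bucket = "BUCKET_NAME"
  policy = jsonencode({
    Version = "2012-10-17"
    Id      = "RestrictBucketReads"
    Statement = [{
      Sid       = "AccessOnlyToOmniVPC"
      Principal = "*"
      Action    = ["s3:ListBucket", "s3:GetObject"]
      Effect    = "Allow"
      Resource  = ["arn:aws:s3:::BUCKET_NAME", "arn:aws:s3:::BUCKET_NAME/*"]
      Condition = {
        StringEquals = {
          "aws:SourceVpc" = "VPC_ID"
        }
      }
    }]
  })
}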