The Spanner to BigQuery template is a batch pipeline that reads data from a Spanner table and writes the data to BigQuery.
Pipeline requirements
- The source Spanner table must exist prior to running the pipeline.
- The BigQuery dataset must exist prior to running the pipeline.
- A JSON file that describes your BigQuery schema. The file must contain a top-level JSON array titled `fields`. The contents of the `fields` array must use the following pattern: `{"name": "COLUMN_NAME", "type": "DATA_TYPE"}`.
The following JSON describes an example BigQuery schema:
```json
{
  "fields": [
    {"name": "location", "type": "STRING"},
    {"name": "name", "type": "STRING"},
    {"name": "age", "type": "STRING"},
    {"name": "color", "type": "STRING"},
    {"name": "coffee", "type": "STRING"}
  ]
}
```
The Spanner to BigQuery batch template doesn't support importing data into `STRUCT` (Record) fields in the target BigQuery table.
Template parameters
Required parameters
- spannerInstanceId: The instance ID of the Spanner database to read from.
- spannerDatabaseId: The database ID of the Spanner database to export.
- outputTableSpec: The BigQuery output table location to write the output to. For example, `<PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME>`. Depending on the `createDisposition` specified, the output table might be created automatically using the user-provided schema.
Optional parameters
- spannerProjectId: The ID of the project that the Spanner database resides in. The default value for this parameter is the project where the Dataflow pipeline is running.
- spannerTableId: The table name of the Spanner database to export. Ignored if `sqlQuery` is set.
- spannerRpcPriority: The request priority (https://cloud.google.com/spanner/docs/reference/rest/v1/RequestOptions) for Spanner calls. Possible values are `HIGH`, `MEDIUM`, and `LOW`. The default value is `HIGH`.
- sqlQuery: The SQL query to use to read data from the Spanner database. Required if `spannerTableId` is empty.
- bigQuerySchemaPath: The Cloud Storage path (`gs://`) to the JSON file that defines your BigQuery schema. Required if `createDisposition` is not `CREATE_NEVER`. For example, `gs://your-bucket/your-schema.json`.
- writeDisposition: The BigQuery WriteDisposition (https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload) value. For example, `WRITE_APPEND`, `WRITE_EMPTY`, or `WRITE_TRUNCATE`. Defaults to `WRITE_APPEND`.
- createDisposition: The BigQuery CreateDisposition (https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload). For example, `CREATE_IF_NEEDED` and `CREATE_NEVER`. Defaults to `CREATE_IF_NEEDED`.
- useStorageWriteApi: If `true`, the pipeline uses the BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api). The default value is `false`. For more information, see Using the Storage Write API (https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api). A usage sketch follows this list.
- useStorageWriteApiAtLeastOnce: When using the Storage Write API, specifies the write semantics. To use at-least-once semantics (https://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics), set this parameter to `true`. To use exactly-once semantics, set the parameter to `false`. This parameter applies only when `useStorageWriteApi` is `true`. The default value is `false`.
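As a usage sketch for the two Storage Write API parameters above, the following hypothetical launch enables at-least-once semantics; every project, instance, bucket, and table name is a placeholder:
```sh
# Sketch: launch the template with the Storage Write API and at-least-once
# semantics. All resource names below are placeholders.
gcloud dataflow flex-template run spanner-to-bq-swa \
    --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Cloud_Spanner_to_BigQuery_Flex \
    --project=my-project \
    --region=us-central1 \
    --parameters \
spannerInstanceId=my-instance,\
spannerDatabaseId=my-database,\
spannerTableId=my-table,\
bigQuerySchemaPath=gs://my-bucket/schema.json,\
outputTableSpec=my-project:my_dataset.my_table,\
useStorageWriteApi=true,\
useStorageWriteApiAtLeastOnce=true
```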
Run the template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is `us-central1`. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Spanner to BigQuery template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
```sh
gcloud dataflow flex-template run JOB_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --parameters \
spannerInstanceId=SPANNER_INSTANCE_ID,\
spannerDatabaseId=SPANNER_DATABASE_ID,\
spannerTableId=SPANNER_TABLE_ID,\
sqlQuery=SQL_QUERY,\
outputTableSpec=OUTPUT_TABLE_SPEC
```
Replace the following:
- `JOB_NAME`: a unique job name of your choice
- `VERSION`: the version of the template that you want to use. You can use the following values:
  - `latest` to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like `2023-09-12-00_RC00`, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- `REGION_NAME`: the region where you want to deploy your Dataflow job, for example `us-central1`
- `SPANNER_INSTANCE_ID`: the Spanner instance ID
- `SPANNER_DATABASE_ID`: the Spanner database ID
- `SPANNER_TABLE_ID`: the Spanner table name
- `SQL_QUERY`: the SQL query
- `OUTPUT_TABLE_SPEC`: the BigQuery table location
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
```
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
  "launchParameter": {
    "jobName": "JOB_NAME",
    "parameters": {
      "spannerInstanceId": "SPANNER_INSTANCE_ID",
      "spannerDatabaseId": "SPANNER_DATABASE_ID",
      "spannerTableId": "SPANNER_TABLE_ID",
      "sqlQuery": "SQL_QUERY",
      "outputTableSpec": "OUTPUT_TABLE_SPEC"
    },
    "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex",
    "environment": { "maxWorkers": "10" }
  }
}
```
Replace the following:
- `PROJECT_ID`: the Google Cloud project ID where you want to run the Dataflow job
- `JOB_NAME`: a unique job name of your choice
- `VERSION`: the version of the template that you want to use. You can use the following values:
  - `latest` to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-LOCATION/latest/
  - the version name, like `2023-09-12-00_RC00`, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-LOCATION/
- `LOCATION`: the region where you want to deploy your Dataflow job, for example `us-central1`
- `SPANNER_INSTANCE_ID`: the Spanner instance ID
- `SPANNER_DATABASE_ID`: the Spanner database ID
- `SPANNER_TABLE_ID`: the Spanner table name
- `SQL_QUERY`: the SQL query
- `OUTPUT_TABLE_SPEC`: the BigQuery table location
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.