The BigQuery to ClickHouse template is a batch pipeline that copies data from a BigQuery table into an existing ClickHouse table. The template can either read the entire table or read specific records by using a supplied query.
Pipeline requirements
- The source BigQuery table must exist.
- The ClickHouse table must exist.
Template parameters
Required parameters
- jdbcUrl: The target ClickHouse JDBC URL, in the format jdbc:clickhouse://host:port/schema. Any JDBC option can be appended to the end of the JDBC URL, as shown in the example after this list. For example, jdbc:clickhouse://localhost:8123/default.
- clickHouseUsername: The ClickHouse username to authenticate with.
- clickHouseTable: The target ClickHouse table name to insert the data into.
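For instance, a URL that enables TLS and raises the socket timeout might look like the following sketch. The host is a placeholder, and ssl and socket_timeout are examples of driver settings; check the ClickHouse JDBC driver documentation for the options your driver version supports.

    jdbc:clickhouse://clickhouse.example.com:8443/default?ssl=true&socket_timeout=300000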
Optional parameters
- inputTableSpec: The BigQuery table to read from. If you specify inputTableSpec, the template reads the data directly from BigQuery storage by using the BigQuery Storage Read API (https://cloud.google.com/bigquery/docs/reference/storage). For information about limitations in the Storage Read API, see https://cloud.google.com/bigquery/docs/reference/storage#limitations. You must specify either inputTableSpec or query. If you set both parameters, the template uses the query parameter. For example, <BIGQUERY_PROJECT>:<DATASET_NAME>.<INPUT_TABLE>.
- outputDeadletterTable: The BigQuery table for messages that failed to reach the output table. If the table doesn't exist, it is created during pipeline execution. If not specified, <outputTableSpec>_error_records is used. For example, <PROJECT_ID>:<DATASET_NAME>.<DEADLETTER_TABLE>.
- query: The SQL query to use to read data from BigQuery. If the BigQuery dataset is in a different project than the Dataflow job, specify the full dataset name in the SQL query, for example: <PROJECT_ID>.<DATASET_NAME>.<TABLE_NAME>. By default, the query parameter uses GoogleSQL (https://cloud.google.com/bigquery/docs/introduction-sql), unless useLegacySql is true. You must specify either inputTableSpec or query. If you set both parameters, the template uses the query parameter. For example, select * from sampledb.sample_table.
- useLegacySql: Set to true to use legacy SQL. This parameter only applies when you use the query parameter. Defaults to false.
- queryLocation: Needed when reading from an authorized view without the underlying table's permission. For example, US.
- queryTempDataset: An existing dataset in which to create the temporary table that stores the results of the query. For example, temp_dataset.
- KMSEncryptionKey: If reading from BigQuery using the query source, use this Cloud KMS key to encrypt any temporary tables created. For example, projects/your-project/locations/global/keyRings/your-keyring/cryptoKeys/your-key.
- clickHousePassword: The ClickHouse password to authenticate with.
- maxInsertBlockSize: The maximum block size for insertion, if the pipeline controls the creation of insertion blocks (ClickHouseIO option).
- insertDistributedSync: If enabled, an insert query into a distributed table waits until the data has been sent to all nodes in the cluster (ClickHouseIO option).
- insertQuorum: For INSERT queries into a replicated table, wait for the data to be written to the specified number of replicas, and linearize the addition of the data. A value of 0 disables the quorum. This setting is disabled in the default server settings (ClickHouseIO option).
- insertDeduplicate: For INSERT queries into a replicated table, specifies that deduplication of inserted blocks should be performed.
- maxRetries: The maximum number of retries per insert.
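To make the parameter interplay concrete, the following sketch combines a query-based read with an explicit dead-letter table and a couple of ClickHouseIO insert options. All project, dataset, table, and credential names here are hypothetical placeholders. You would pass these values, joined with commas, through the --parameters flag of the gcloud command or the parameters object of the API request shown in the sections that follow.

    jdbcUrl=jdbc:clickhouse://clickhouse.example.com:8123/default
    clickHouseUsername=ingest_user
    clickHouseTable=events
    query=select * from `my-project.sampledb.sample_table`
    outputDeadletterTable=my-project:sampledb.error_records
    insertDeduplicate=true
    maxRetries=5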
Run the template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the BigQuery to ClickHouse template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/BigQuery_to_Clickhouse \
    --parameters \
jdbcUrl=JDBC_URL,\
clickHouseUsername=CLICKHOUSE_USERNAME,\
clickHouseTable=CLICKHOUSE_TABLE
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- REGION_NAME: the region where you want to deploy your Dataflow job, for example us-central1
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- JDBC_URL: your JDBC URL
- CLICKHOUSE_USERNAME: your ClickHouse username
- CLICKHOUSE_TABLE: your ClickHouse table
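For concreteness, a fully substituted invocation might look like the following sketch. The job name, project, region, and ClickHouse values are illustrative placeholders, and the command assumes the latest template version:

    gcloud dataflow flex-template run bigquery-to-clickhouse-job \
        --project=my-project \
        --region=us-central1 \
        --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/BigQuery_to_Clickhouse \
        --parameters \
    jdbcUrl=jdbc:clickhouse://clickhouse.example.com:8123/default,\
    clickHouseUsername=ingest_user,\
    clickHouseTable=events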
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.locations.flexTemplates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
  "launch_parameter": {
    "jobName": "JOB_NAME",
    "parameters": {
      "jdbcUrl": "JDBC_URL",
      "clickHouseUsername": "CLICKHOUSE_USERNAME",
      "clickHouseTable": "CLICKHOUSE_TABLE"
    },
    "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/BigQuery_to_Clickhouse"
  }
}
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- LOCATION: the region where you want to deploy your Dataflow job, for example us-central1
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-LOCATION/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket: gs://dataflow-templates-LOCATION/
- JDBC_URL: your JDBC URL
- CLICKHOUSE_USERNAME: your ClickHouse username
- CLICKHOUSE_TABLE: your ClickHouse table
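If you prefer to send the request from a shell, the same launch call can be issued with curl, using gcloud to mint an access token. All project, location, and parameter values below are illustrative placeholders:

    curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      -d '{
        "launch_parameter": {
          "jobName": "bigquery-to-clickhouse-job",
          "parameters": {
            "jdbcUrl": "jdbc:clickhouse://clickhouse.example.com:8123/default",
            "clickHouseUsername": "ingest_user",
            "clickHouseTable": "events"
          },
          "containerSpecGcsPath": "gs://dataflow-templates-us-central1/latest/flex/BigQuery_to_Clickhouse"
        }
      }' \
      "https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/flexTemplates:launch"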
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.