JDBC to Cloud Spanner template

Use the Dataproc Serverless JDBC to Spanner template to extract data from JDBC databases to Spanner.

This template supports the following databases as input:

  • MySQL
  • PostgreSQL
  • Microsoft SQL Server
  • Oracle

Use the template

Run the template using the gcloud CLI or Dataproc API.

gcloud

Before using any of the command data below, make the following replacements:

  • PROJECT_ID : Required. Your Google Cloud project ID listed in the IAM Settings .
  • REGION : Required. Compute Engine region .
  • TEMPLATE_VERSION : Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta (visit gs://dataproc-templates-binaries or run gsutil ls gs://dataproc-templates-binaries to list available template versions).
  • SUBNET : Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected.

    Example: projects/ PROJECT_ID /regions/ REGION /subnetworks/ SUBNET_NAME

  • JDBC_CONNECTOR_CLOUD_STORAGE_PATH : Required. The full Cloud Storage path, including the filename, where the JDBC connector jar is stored. You can use the following commands to download JDBC connectors for uploading to Cloud Storage:
    • MySQL:
       wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.30.tar.gz 
      
    • Postgres SQL:
       wget https://jdbc.postgresql.org/download/postgresql-42.2.6.jar 
      
    • Microsoft SQL Server:
       wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre8/mssql-jdbc-6.4.0.jre8.jar 
      
    • Oracle:
       wget https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.7.0.0/ojdbc8-21.7.0.0.jar 
      
  • The following variables are used to construct the required JDBC_CONNECTION_URL :
    • JDBC_HOST , JDBC_PORT , JDBC_DATABASE , or, for Oracle, JDBC_SERVICE , JDBC_USERNAME , and JDBC_PASSWORD : Required. JDBC host, port, database, username, and password.
      • MySQL:
         jdbc:mysql:// JDBC_HOST 
        : JDBC_PORT 
        / JDBC_DATABASE 
        ?user= JDBC_USERNAME 
        &password= JDBC_PASSWORD 
         
        
      • PostgreSQL:
         jdbc:postgresql:// JDBC_HOST 
        : JDBC_PORT 
        / JDBC_DATABASE 
        ?user= JDBC_USERNAME 
        &password= JDBC_PASSWORD 
         
        
      • Microsoft SQL Server:
         jdbc:sqlserver:// JDBC_HOST 
        : JDBC_PORT 
        ;databaseName= JDBC_DATABASE 
        ;user= JDBC_USERNAME 
        ;password= JDBC_PASSWORD 
         
        
      • Oracle:
         jdbc:oracle:thin:@// JDBC_HOST 
        : JDBC_PORT 
        / JDBC_SERVICE 
        ?user= JDBC_USERNAME 
        &password= JDBC_PASSWORD 
         
        
    • DRIVER : Required. The JDBC driver which will be used for the connection:
      • MySQL:
         com.mysql.cj.jdbc.Driver 
        
      • Postgres SQL:
         org.postgresql.Driver 
        
      • Microsoft SQL Server:
         com.microsoft.sqlserver.jdbc.SQLServerDriver 
        
      • Oracle:
         oracle.jdbc.driver.OracleDriver 
        
    • QUERY or QUERY_FILE : Required. Set either QUERY or QUERY_FILE to specify the query to use to extract data from JDBC
    • INPUT_PARTITION_COLUMN , LOWERBOUND , UPPERBOUND , NUM_PARTITIONS : Optional. If used, all of the following parameters must be specified:
      • INPUT_PARTITION_COLUMN: JDBC input table partition column name.
      • LOWERBOUND: JDBC input table partition column lower bound used to determine the partition stride.
      • UPPERBOUND:JDBC input table partition column upper bound used to decide the partition stride.
      • NUM_PARTITIONS:The maximum number of partitions that can be used for parallelism of table reads and writes. If specified, this value is used for the JDBC input and output connection. Default: 10 .
    • FETCHSIZE : Optional. How many rows to fetch per round trip. Default: 10.
    • JDBC_SESSION_INIT : Optional. Session initialization statement to read Java templates.
    • TEMPVIEW and SQL_QUERY : Optional. You can use these two optional parameters to apply a Spark SQL transformation while loading data into Spanner. TEMPVIEW is the temporary view name, and SQL_QUERY is the query statement. TEMPVIEW and the table name in SQL_QUERY must match.
    • INSTANCE : Required. Spanner instance ID.
    • SPANNER_DATABASE : Required. Spanner database ID.
    • TABLE : Required. Spanner output table name.
    • MODE : Optional. Write mode for Spanner output. Options: Append , Overwrite , Ignore , or ErrorIfExists . Defaults to ErrorIfExists .
    • PRIMARY_KEY : Required. Comma separated Primary key columns needed when creating Spanner output table.
    • SERVICE_ACCOUNT : Optional. If not provided, the default Compute Engine service account is used.
    • PROPERTY and PROPERTY_VALUE : Optional. Comma-separated list of Spark property = value pairs.
    • LABEL and LABEL_VALUE : Optional. Comma-separated list of label = value pairs.
    • LOG_LEVEL : Optional. Level of logging. Can be one of ALL , DEBUG , ERROR , FATAL , INFO , OFF , TRACE , or WARN . Default: INFO .
    • KMS_KEY : Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed key.

      Example: projects/ PROJECT_ID /regions/ REGION /keyRings/ KEY_RING_NAME /cryptoKeys/ KEY_NAME

    Execute the following command:

    Linux, macOS, or Cloud Shell

    gcloud dataproc batches submit spark \
        --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
        --version="1.1" \
        --project=" PROJECT_ID 
    " \
        --region=" REGION 
    " \
        --jars="gs://dataproc-templates-binaries/ TEMPLATE_VERSION 
    /java/dataproc-templates.jar, JDBC_CONNECTOR_CLOUD_STORAGE_PATH 
    " \
        --subnet=" SUBNET 
    " \
        --kms-key=" KMS_KEY 
    " \
        --service-account=" SERVICE_ACCOUNT 
    " \
        --properties=" PROPERTY 
    = PROPERTY_VALUE 
    " \
        --labels=" LABEL 
    = LABEL_VALUE 
    " \
        -- --template=JDBCTOSPANNER \
        --templateProperty log.level=" LOG_LEVEL 
    " \
        --templateProperty project.id=" PROJECT_ID 
    " \
        --templateProperty jdbctospanner.jdbc.url=" JDBC_CONNECTION_URL 
    " \
        --templateProperty jdbctospanner.jdbc.driver.class.name=" DRIVER 
    " \
        --templateProperty jdbctospanner.jdbc.fetchsize=" FETCHSIZE 
    " \
        --templateProperty jdbctospanner.jdbc.sessioninitstatement=" JDBC_SESSION_INIT 
    " \
        --templateProperty jdbctospanner.sql=" QUERY 
    " \
        --templateProperty jdbctospanner.sql.file=" QUERY_FILE 
    " \
        --templateProperty jdbctospanner.sql.numPartitions=" NUM_PARTITIONS 
    " \
        --templateProperty jdbctospanner.sql.partitionColumn=" INPUT_PARTITION_COLUMN 
    " \
        --templateProperty jdbctospanner.sql.lowerBound=" LOWERBOUND 
    " \
        --templateProperty jdbctospanner.sql.upperBound=" UPPERBOUND 
    " \
        --templateProperty jdbctospanner.output.instance=" INSTANCE 
    " \
        --templateProperty jdbctospanner.output.database=" SPANNER_DATABASE 
    " \
        --templateProperty jdbctospanner.output.table=" TABLE 
    " \
        --templateProperty jdbctospanner.output.saveMode=" MODE 
    " \
        --templateProperty jdbctospanner.output.primaryKey=" PRIMARY_KEY 
    " \
        --templateProperty jdbctospanner.output.batch.size=" BATCHSIZE 
    " \
        --templateProperty jdbctospanner.temp.table=" TEMPVIEW 
    " \
        --templateProperty jdbctospanner.temp.query=" SQL_QUERY 
    "

    Windows (PowerShell)

    gcloud dataproc batches submit spark `
        --class=com.google.cloud.dataproc.templates.main.DataProcTemplate `
        --version="1.1" `
        --project=" PROJECT_ID 
    " `
        --region=" REGION 
    " `
        --jars="gs://dataproc-templates-binaries/ TEMPLATE_VERSION 
    /java/dataproc-templates.jar, JDBC_CONNECTOR_CLOUD_STORAGE_PATH 
    " `
        --subnet=" SUBNET 
    " `
        --kms-key=" KMS_KEY 
    " `
        --service-account=" SERVICE_ACCOUNT 
    " `
        --properties=" PROPERTY 
    = PROPERTY_VALUE 
    " `
        --labels=" LABEL 
    = LABEL_VALUE 
    " `
        -- --template=JDBCTOSPANNER `
        --templateProperty log.level=" LOG_LEVEL 
    " `
        --templateProperty project.id=" PROJECT_ID 
    " `
        --templateProperty jdbctospanner.jdbc.url=" JDBC_CONNECTION_URL 
    " `
        --templateProperty jdbctospanner.jdbc.driver.class.name=" DRIVER 
    " `
        --templateProperty jdbctospanner.jdbc.fetchsize=" FETCHSIZE 
    " `
        --templateProperty jdbctospanner.jdbc.sessioninitstatement=" JDBC_SESSION_INIT 
    " `
        --templateProperty jdbctospanner.sql=" QUERY 
    " `
        --templateProperty jdbctospanner.sql.file=" QUERY_FILE 
    " `
        --templateProperty jdbctospanner.sql.numPartitions=" NUM_PARTITIONS 
    " `
        --templateProperty jdbctospanner.sql.partitionColumn=" INPUT_PARTITION_COLUMN 
    " `
        --templateProperty jdbctospanner.sql.lowerBound=" LOWERBOUND 
    " `
        --templateProperty jdbctospanner.sql.upperBound=" UPPERBOUND 
    " `
        --templateProperty jdbctospanner.output.instance=" INSTANCE 
    " `
        --templateProperty jdbctospanner.output.database=" SPANNER_DATABASE 
    " `
        --templateProperty jdbctospanner.output.table=" TABLE 
    " `
        --templateProperty jdbctospanner.output.saveMode=" MODE 
    " `
        --templateProperty jdbctospanner.output.primaryKey=" PRIMARY_KEY 
    " `
        --templateProperty jdbctospanner.output.batch.size=" BATCHSIZE 
    " `
        --templateProperty jdbctospanner.temp.table=" TEMPVIEW 
    " `
        --templateProperty jdbctospanner.temp.query=" SQL_QUERY 
    "

    Windows (cmd.exe)

    gcloud dataproc batches submit spark ^
        --class=com.google.cloud.dataproc.templates.main.DataProcTemplate ^
        --version="1.1" ^
        --project=" PROJECT_ID 
    " ^
        --region=" REGION 
    " ^
        --jars="gs://dataproc-templates-binaries/ TEMPLATE_VERSION 
    /java/dataproc-templates.jar, JDBC_CONNECTOR_CLOUD_STORAGE_PATH 
    " ^
        --subnet=" SUBNET 
    " ^
        --kms-key=" KMS_KEY 
    " ^
        --service-account=" SERVICE_ACCOUNT 
    " ^
        --properties=" PROPERTY 
    = PROPERTY_VALUE 
    " ^
        --labels=" LABEL 
    = LABEL_VALUE 
    " ^
        -- --template=JDBCTOSPANNER ^
        --templateProperty log.level=" LOG_LEVEL 
    " ^
        --templateProperty project.id=" PROJECT_ID 
    " ^
        --templateProperty jdbctospanner.jdbc.url=" JDBC_CONNECTION_URL 
    " ^
        --templateProperty jdbctospanner.jdbc.driver.class.name=" DRIVER 
    " ^
        --templateProperty jdbctospanner.jdbc.fetchsize=" FETCHSIZE 
    " ^
        --templateProperty jdbctospanner.jdbc.sessioninitstatement=" JDBC_SESSION_INIT 
    " ^
        --templateProperty jdbctospanner.sql=" QUERY 
    " ^
        --templateProperty jdbctospanner.sql.file=" QUERY_FILE 
    " ^
        --templateProperty jdbctospanner.sql.numPartitions=" NUM_PARTITIONS 
    " ^
        --templateProperty jdbctospanner.sql.partitionColumn=" INPUT_PARTITION_COLUMN 
    " ^
        --templateProperty jdbctospanner.sql.lowerBound=" LOWERBOUND 
    " ^
        --templateProperty jdbctospanner.sql.upperBound=" UPPERBOUND 
    " ^
        --templateProperty jdbctospanner.output.instance=" INSTANCE 
    " ^
        --templateProperty jdbctospanner.output.database=" SPANNER_DATABASE 
    " ^
        --templateProperty jdbctospanner.output.table=" TABLE 
    " ^
        --templateProperty jdbctospanner.output.saveMode=" MODE 
    " ^
        --templateProperty jdbctospanner.output.primaryKey=" PRIMARY_KEY 
    " ^
        --templateProperty jdbctospanner.output.batch.size=" BATCHSIZE 
    " ^
        --templateProperty jdbctospanner.temp.table=" TEMPVIEW 
    " ^
        --templateProperty jdbctospanner.temp.query=" SQL_QUERY 
    "

REST

Before using any of the request data, make the following replacements:

  • PROJECT_ID : Required. Your Google Cloud project ID listed in the IAM Settings .
  • REGION : Required. Compute Engine region .
  • TEMPLATE_VERSION : Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta (visit gs://dataproc-templates-binaries or run gsutil ls gs://dataproc-templates-binaries to list available template versions).
  • SUBNET : Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected.

    Example: projects/ PROJECT_ID /regions/ REGION /subnetworks/ SUBNET_NAME

  • JDBC_CONNECTOR_CLOUD_STORAGE_PATH : Required. The full Cloud Storage path, including the filename, where the JDBC connector jar is stored. You can use the following commands to download JDBC connectors for uploading to Cloud Storage:
    • MySQL:
       wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.30.tar.gz 
      
    • Postgres SQL:
       wget https://jdbc.postgresql.org/download/postgresql-42.2.6.jar 
      
    • Microsoft SQL Server:
       wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre8/mssql-jdbc-6.4.0.jre8.jar 
      
    • Oracle:
       wget https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.7.0.0/ojdbc8-21.7.0.0.jar 
      
  • The following variables are used to construct the required JDBC_CONNECTION_URL :
    • JDBC_HOST , JDBC_PORT , JDBC_DATABASE , or, for Oracle, JDBC_SERVICE , JDBC_USERNAME , and JDBC_PASSWORD : Required. JDBC host, port, database, username, and password.
      • MySQL:
         jdbc:mysql:// JDBC_HOST 
        : JDBC_PORT 
        / JDBC_DATABASE 
        ?user= JDBC_USERNAME 
        &password= JDBC_PASSWORD 
         
        
      • PostgreSQL:
         jdbc:postgresql:// JDBC_HOST 
        : JDBC_PORT 
        / JDBC_DATABASE 
        ?user= JDBC_USERNAME 
        &password= JDBC_PASSWORD 
         
        
      • Microsoft SQL Server:
         jdbc:sqlserver:// JDBC_HOST 
        : JDBC_PORT 
        ;databaseName= JDBC_DATABASE 
        ;user= JDBC_USERNAME 
        ;password= JDBC_PASSWORD 
         
        
      • Oracle:
         jdbc:oracle:thin:@// JDBC_HOST 
        : JDBC_PORT 
        / JDBC_SERVICE 
        ?user= JDBC_USERNAME 
        &password= JDBC_PASSWORD 
         
        
    • DRIVER : Required. The JDBC driver which will be used for the connection:
      • MySQL:
         com.mysql.cj.jdbc.Driver 
        
      • Postgres SQL:
         org.postgresql.Driver 
        
      • Microsoft SQL Server:
         com.microsoft.sqlserver.jdbc.SQLServerDriver 
        
      • Oracle:
         oracle.jdbc.driver.OracleDriver 
        
    • QUERY or QUERY_FILE : Required. Set either QUERY or QUERY_FILE to specify the query to use to extract data from JDBC
    • INPUT_PARTITION_COLUMN , LOWERBOUND , UPPERBOUND , NUM_PARTITIONS : Optional. If used, all of the following parameters must be specified:
      • INPUT_PARTITION_COLUMN: JDBC input table partition column name.
      • LOWERBOUND: JDBC input table partition column lower bound used to determine the partition stride.
      • UPPERBOUND:JDBC input table partition column upper bound used to decide the partition stride.
      • NUM_PARTITIONS:The maximum number of partitions that can be used for parallelism of table reads and writes. If specified, this value is used for the JDBC input and output connection. Default: 10 .
    • FETCHSIZE : Optional. How many rows to fetch per round trip. Default: 10.
    • JDBC_SESSION_INIT : Optional. Session initialization statement to read Java templates.
    • TEMPVIEW and SQL_QUERY : Optional. You can use these two optional parameters to apply a Spark SQL transformation while loading data into Spanner. TEMPVIEW is the temporary view name, and SQL_QUERY is the query statement. TEMPVIEW and the table name in SQL_QUERY must match.
    • INSTANCE : Required. Spanner instance ID.
    • SPANNER_DATABASE : Required. Spanner database ID.
    • TABLE : Required. Spanner output table name.
    • MODE : Optional. Write mode for Spanner output. Options: Append , Overwrite , Ignore , or ErrorIfExists . Defaults to ErrorIfExists .
    • PRIMARY_KEY : Required. Comma separated Primary key columns needed when creating Spanner output table.
    • SERVICE_ACCOUNT : Optional. If not provided, the default Compute Engine service account is used.
    • PROPERTY and PROPERTY_VALUE : Optional. Comma-separated list of Spark property = value pairs.
    • LABEL and LABEL_VALUE : Optional. Comma-separated list of label = value pairs.
    • LOG_LEVEL : Optional. Level of logging. Can be one of ALL , DEBUG , ERROR , FATAL , INFO , OFF , TRACE , or WARN . Default: INFO .
    • KMS_KEY : Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed key.

      Example: projects/ PROJECT_ID /regions/ REGION /keyRings/ KEY_RING_NAME /cryptoKeys/ KEY_NAME

    HTTP method and URL:

    POST https://dataproc.googleapis.com/v1/projects/ PROJECT_ID 
    /locations/ REGION 
    /batches

    Request JSON body:

    {
      "environmentConfig": {
        "executionConfig": {
          "subnetworkUri": " SUBNET 
    ",
          "kmsKey": " KMS_KEY 
    ",
          "serviceAccount": " SERVICE_ACCOUNT 
    "
        }
      },
      "labels": {
        " LABEL 
    ": " LABEL_VALUE 
    "
      },
      "runtimeConfig": {
        "version": "1.1",
        "properties": {
          " PROPERTY 
    ": " PROPERTY_VALUE 
    "
        }
      },
      "sparkBatch": {
        "mainClass": "com.google.cloud.dataproc.templates.main.DataProcTemplate",
        "args": [
          "--template","JDBCTOSPANNER",
          "--templateProperty","log.level= LOG_LEVEL 
    ",
          "--templateProperty","project.id= PROJECT_ID 
    ",
          "--templateProperty","jdbctospanner.jdbc.url= JDBC_CONNECTION_URL 
    ",
          "--templateProperty","jdbctospanner.jdbc.driver.class.name= DRIVER 
    ",
          "--templateProperty","jdbctospanner.jdbc.fetchsize= FETCHSIZE 
    ",
          "--templateProperty","jdbctospanner.jdbc.sessioninitstatement= JDBC_SESSION_INIT 
    ",
          "--templateProperty","jdbctospanner.sql= QUERY 
    ",
          "--templateProperty","jdbctospanner.sql.file= QUERY_FILE 
    ",
          "--templateProperty","jdbctospanner.sql.numPartitions= NUM_PARTITIONS 
    ",
          "--templateProperty","jdbctospanner.sql.partitionColumn= INPUT_PARTITION_COLUMN 
    ",
          "--templateProperty","jdbctospanner.sql.lowerBound= LOWERBOUND 
    ",
          "--templateProperty","jdbctospanner.sql.upperBound= UPPERBOUND 
    ",
          "--templateProperty","jdbctospanner.output.instance= INSTANCE 
    ",
          "--templateProperty","jdbctospanner.output.database= SPANNER_DATABASE 
    ",
          "--templateProperty","jdbctospanner.output.table= TABLE 
    ",
          "--templateProperty","jdbctospanner.output.saveMode= MODE 
    ",
          "--templateProperty","jdbctospanner.output.primaryKey= PRIMARY_KEY 
    ",
          "--templateProperty","jdbctospanner.output.batch.size= BATCHSIZE 
    ",
          "--templateProperty","jdbctospanner.temp.table= TEMPVIEW 
    ",
          "--templateProperty","jdbctospanner.temp.query= SQL_QUERY 
    " 
        ],
        "jarFileUris": [
          "gs://dataproc-templates-binaries/ TEMPLATE_VERSION 
    /java/dataproc-templates.jar"," JDBC_CONNECTOR_CLOUD_STORAGE_PATH 
    "
        ]
      }
    }

    To send your request, expand one of these options:

    You should receive a JSON response similar to the following:

    {
      "name": "projects/ PROJECT_ID 
    /regions/ REGION 
    /operations/ OPERATION_ID 
    ",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata",
        "batch": "projects/ PROJECT_ID 
    /locations/ REGION 
    /batches/ BATCH_ID 
    ",
        "batchUuid": "de8af8d4-3599-4a7c-915c-798201ed1583",
        "createTime": "2023-02-24T03:31:03.440329Z",
        "operationType": "BATCH",
        "description": "Batch"
      }
    }