Cloud Pub/Sub to Cloud Storage template
Use the Serverless for Apache Spark Cloud Pub/Sub to Cloud Storage template to extract data from Pub/Sub to Cloud Storage.
Use the template
Run the template using the gcloud CLI or Dataproc API.
gcloud
Before using any of the command data below, make the following replacements:
- PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
- REGION: Required. Compute Engine region.
- SUBNET: Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected. Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME
- TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta (visit gs://dataproc-templates-binaries or run gcloud storage ls gs://dataproc-templates-binaries to list available template versions).
- PUBSUB_SUBSCRIPTION_PROJECT_ID: Required. The Google Cloud project ID listed in the IAM Settings that contains the input Pub/Sub subscription to be read.
- SUBSCRIPTION: Required. Pub/Sub subscription name.
- CLOUD_STORAGE_OUTPUT_BUCKET_NAME: Required. Cloud Storage bucket name where output will be stored. Note: The output files are stored in the output/ folder inside the bucket.
- FORMAT: Required. Output data format. Options: avro or json. Note: If avro, you must add file:///usr/lib/spark/connector/spark-avro.jar to the jars gcloud CLI flag or API field. Example (the file:// prefix references a Serverless for Apache Spark jar file): --jars=file:///usr/lib/spark/connector/spark-avro.jar, [ ... other jars]
- TIMEOUT: Optional. Time in milliseconds before termination of the stream. Defaults to 60000.
- DURATION: Optional. Frequency in seconds of writes to Cloud Storage. Defaults to 15 seconds.
- NUM_RECEIVERS: Optional. Number of streams read from a Pub/Sub subscription in parallel. Defaults to 5.
- BATCHSIZE: Optional. Number of records to insert in one round trip into Cloud Storage. Defaults to 1000.
- SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
- PROPERTY and PROPERTY_VALUE: Optional. Comma-separated list of Spark property=value pairs (see the sketch after this list).
- LABEL and LABEL_VALUE: Optional. Comma-separated list of label=value pairs.
- LOG_LEVEL: Optional. Level of logging. Can be one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
- KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed encryption key. Example: projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
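As a minimal sketch of the PROPERTY and LABEL formats (the property names and label keys below are illustrative examples, not values required by the template), the two flags might be filled in as follows:

--properties="spark.executor.instances=2,spark.driver.memory=4g"
--labels="env=dev,team=streaming"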
Execute the following command:
Linux, macOS, or Cloud Shell
gcloud dataproc batches submit spark \
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
    --version="1.2" \
    --project="PROJECT_ID" \
    --region="REGION" \
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" \
    --subnet="SUBNET" \
    --kms-key="KMS_KEY" \
    --service-account="SERVICE_ACCOUNT" \
    --properties="PROPERTY=PROPERTY_VALUE" \
    --labels="LABEL=LABEL_VALUE" \
    -- --template=PUBSUBTOGCS \
    --templateProperty log.level="LOG_LEVEL" \
    --templateProperty pubsubtogcs.input.project.id="PUBSUB_SUBSCRIPTION_PROJECT_ID" \
    --templateProperty pubsubtogcs.input.subscription="SUBSCRIPTION" \
    --templateProperty pubsubtogcs.gcs.bucket.name="CLOUD_STORAGE_OUTPUT_BUCKET_NAME" \
    --templateProperty pubsubtogcs.gcs.output.data.format="FORMAT" \
    --templateProperty pubsubtogcs.timeout.ms="TIMEOUT" \
    --templateProperty pubsubtogcs.streaming.duration.seconds="DURATION" \
    --templateProperty pubsubtogcs.total.receivers="NUM_RECEIVERS" \
    --templateProperty pubsubtogcs.batch.size="BATCHSIZE"
Windows (PowerShell)
gcloud dataproc batches submit spark `
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate `
    --version="1.2" `
    --project="PROJECT_ID" `
    --region="REGION" `
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" `
    --subnet="SUBNET" `
    --kms-key="KMS_KEY" `
    --service-account="SERVICE_ACCOUNT" `
    --properties="PROPERTY=PROPERTY_VALUE" `
    --labels="LABEL=LABEL_VALUE" `
    -- --template=PUBSUBTOGCS `
    --templateProperty log.level="LOG_LEVEL" `
    --templateProperty pubsubtogcs.input.project.id="PUBSUB_SUBSCRIPTION_PROJECT_ID" `
    --templateProperty pubsubtogcs.input.subscription="SUBSCRIPTION" `
    --templateProperty pubsubtogcs.gcs.bucket.name="CLOUD_STORAGE_OUTPUT_BUCKET_NAME" `
    --templateProperty pubsubtogcs.gcs.output.data.format="FORMAT" `
    --templateProperty pubsubtogcs.timeout.ms="TIMEOUT" `
    --templateProperty pubsubtogcs.streaming.duration.seconds="DURATION" `
    --templateProperty pubsubtogcs.total.receivers="NUM_RECEIVERS" `
    --templateProperty pubsubtogcs.batch.size="BATCHSIZE"
Windows (cmd.exe)
gcloud dataproc batches submit spark ^
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate ^
    --version="1.2" ^
    --project="PROJECT_ID" ^
    --region="REGION" ^
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" ^
    --subnet="SUBNET" ^
    --kms-key="KMS_KEY" ^
    --service-account="SERVICE_ACCOUNT" ^
    --properties="PROPERTY=PROPERTY_VALUE" ^
    --labels="LABEL=LABEL_VALUE" ^
    -- --template=PUBSUBTOGCS ^
    --templateProperty log.level="LOG_LEVEL" ^
    --templateProperty pubsubtogcs.input.project.id="PUBSUB_SUBSCRIPTION_PROJECT_ID" ^
    --templateProperty pubsubtogcs.input.subscription="SUBSCRIPTION" ^
    --templateProperty pubsubtogcs.gcs.bucket.name="CLOUD_STORAGE_OUTPUT_BUCKET_NAME" ^
    --templateProperty pubsubtogcs.gcs.output.data.format="FORMAT" ^
    --templateProperty pubsubtogcs.timeout.ms="TIMEOUT" ^
    --templateProperty pubsubtogcs.streaming.duration.seconds="DURATION" ^
    --templateProperty pubsubtogcs.total.receivers="NUM_RECEIVERS" ^
    --templateProperty pubsubtogcs.batch.size="BATCHSIZE"
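For a concrete sketch, the following submission uses placeholder values (my-project, us-central1, my-subscription, and my-output-bucket are examples only) and selects avro output, which requires the extra spark-avro jar described in the FORMAT note above; optional flags are omitted so their defaults apply:

gcloud dataproc batches submit spark \
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
    --version="1.2" \
    --project="my-project" \
    --region="us-central1" \
    --jars="gs://dataproc-templates-binaries/latest/java/dataproc-templates.jar,file:///usr/lib/spark/connector/spark-avro.jar" \
    -- --template=PUBSUBTOGCS \
    --templateProperty pubsubtogcs.input.project.id="my-project" \
    --templateProperty pubsubtogcs.input.subscription="my-subscription" \
    --templateProperty pubsubtogcs.gcs.bucket.name="my-output-bucket" \
    --templateProperty pubsubtogcs.gcs.output.data.format="avro"

After submission, you can list or inspect batches with gcloud dataproc batches list --region=us-central1 and gcloud dataproc batches describe.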
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
- REGION: Required. Compute Engine region.
- SUBNET: Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected. Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME
- TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta (visit gs://dataproc-templates-binaries or run gcloud storage ls gs://dataproc-templates-binaries to list available template versions).
- PUBSUB_SUBSCRIPTION_PROJECT_ID: Required. The Google Cloud project ID listed in the IAM Settings that contains the input Pub/Sub subscription to be read.
- SUBSCRIPTION: Required. Pub/Sub subscription name.
- CLOUD_STORAGE_OUTPUT_BUCKET_NAME: Required. Cloud Storage bucket name where output will be stored. Note: The output files are stored in the output/ folder inside the bucket.
- FORMAT: Required. Output data format. Options: avro or json. Note: If avro, you must add file:///usr/lib/spark/connector/spark-avro.jar to the jars gcloud CLI flag or API field. Example (the file:// prefix references a Serverless for Apache Spark jar file): --jars=file:///usr/lib/spark/connector/spark-avro.jar, [ ... other jars]
- TIMEOUT: Optional. Time in milliseconds before termination of the stream. Defaults to 60000.
- DURATION: Optional. Frequency in seconds of writes to Cloud Storage. Defaults to 15 seconds.
- NUM_RECEIVERS: Optional. Number of streams read from a Pub/Sub subscription in parallel. Defaults to 5.
- BATCHSIZE: Optional. Number of records to insert in one round trip into Cloud Storage. Defaults to 1000.
- SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
- PROPERTY and PROPERTY_VALUE: Optional. Comma-separated list of Spark property=value pairs.
- LABEL and LABEL_VALUE: Optional. Comma-separated list of label=value pairs.
- LOG_LEVEL: Optional. Level of logging. Can be one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
- KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed encryption key. Example: projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
HTTP method and URL:
POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches
Request JSON body:
{ "environmentConfig":{ "executionConfig":{ "subnetworkUri":" SUBNET ", "kmsKey": " KMS_KEY ", "serviceAccount": " SERVICE_ACCOUNT " } }, "labels": { " LABEL ": " LABEL_VALUE " }, "runtimeConfig": { "version": "1.2", "properties": { " PROPERTY ": " PROPERTY_VALUE " } }, "sparkBatch":{ "mainClass":"com.google.cloud.dataproc.templates.main.DataProcTemplate", "args":[ "--template","PUBSUBTOGCS", "--templateProperty","log.level= LOG_LEVEL ", "--templateProperty","pubsubtogcs.input.project.id= PUBSUB_SUBSCRIPTION_PROJECT_ID ", "--templateProperty","pubsubtogcs.input.subscription= SUBSCRIPTION ", "--templateProperty","pubsubtogcs.gcs.bucket.name= CLOUD_STORAGE_OUTPUT_BUCKET_NAME ", "--templateProperty","pubsubtogcs.gcs.output.data.format= FORMAT ", "--templateProperty","pubsubtogcs.timeout.ms= TIMEOUT ", "--templateProperty","pubsubtogcs.streaming.duration.seconds= DURATION ", "--templateProperty","pubsubtogcs.total.receivers= NUM_RECEIVERS ", "--templateProperty","pubsubtogcs.batch.size= BATCHSIZE " ], "jarFileUris":[ "file:///usr/lib/spark/connector/spark-avro.jar", "gs://dataproc-templates-binaries/ TEMPLATE_VERSION /java/dataproc-templates.jar" ] } }
To send your request, use an HTTP client of your choice.
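For example, you can send the request with curl: save the request body to a file (request.json is a name chosen for this sketch) and authenticate with an access token from the gcloud CLI:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @request.json \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches"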
You should receive a JSON response similar to the following:
{ "name": "projects/ PROJECT_ID /regions/ REGION /operations/ OPERATION_ID ", "metadata": { "@type": "type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata", "batch": "projects/ PROJECT_ID /locations/ REGION /batches/ BATCH_ID ", "batchUuid": "de8af8d4-3599-4a7c-915c-798201ed1583", "createTime": "2023-02-24T03:31:03.440329Z", "operationType": "BATCH", "description": "Batch" } }