Cloud Storage SequenceFile to Bigtable template

The Cloud Storage SequenceFile to Bigtable template is a pipeline that reads data from SequenceFiles in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Pipeline requirements

  • The Bigtable table must exist. A sketch of creating one beforehand follows this list.
  • The input SequenceFiles must exist in a Cloud Storage bucket before running the pipeline.
  • The input SequenceFiles must have been exported from Bigtable or HBase.
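
For example, if the destination table doesn't exist yet, one way to create it beforehand is with the cbt CLI. The project, instance, table, and column family names below are placeholders, and the column families you create must match the families present in the exported data:

# Create the destination table in the target instance (placeholder names).
cbt -project my-project -instance my-instance createtable my-table

# Create each column family that the exported SequenceFiles reference.
cbt -project my-project -instance my-instance createfamily my-table cf1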

Template parameters

Required parameters

  • bigtableProject: The ID of the Google Cloud project that contains the Bigtable instance that you want to write data to.
  • bigtableInstanceId: The ID of the Bigtable instance that contains the table.
  • bigtableTableId: The ID of the Bigtable table to import.
  • sourcePattern: The Cloud Storage path pattern to the location of the data. For example, gs://your-bucket/your-path/prefix*.
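
    As a quick check before you run the pipeline, you can confirm that the pattern matches the exported SequenceFiles, for example with the gcloud storage command (the bucket and prefix below are the same placeholders as above):

    gcloud storage ls "gs://your-bucket/your-path/prefix*"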

Optional parameters

  • bigtableAppProfileId: The ID of the Bigtable application profile to use for the import. If you don't specify an application profile, Bigtable uses the instance's default application profile (https://cloud.google.com/bigtable/docs/app-profiles#default-app-profile). An example of listing the available profiles follows this list.
  • mutationThrottleLatencyMs: Optional. Enables mutation latency throttling and sets the throttling latency in milliseconds. Defaults to 0.
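
If you plan to set bigtableAppProfileId, one way to see which app profiles exist on the instance is to list them with the gcloud CLI; the instance ID below is a placeholder:

# List the app profiles defined on the target Bigtable instance (placeholder instance ID).
gcloud bigtable app-profiles list --instance=my-instance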

Run the template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the SequenceFile Files on Cloud Storage to Cloud Bigtable template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates-REGION_NAME/VERSION/GCS_SequenceFile_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProject=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
bigtableAppProfileId=APPLICATION_PROFILE_ID,\
sourcePattern=SOURCE_PATTERN

Replace the following:

  • JOB_NAME : a unique job name of your choice
  • VERSION : the version of the template that you want to use

    You can use the following values:

      ◦ latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
      ◦ the version name of a specific template version, which is nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/

  • REGION_NAME : the region where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID : the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID : the ID of the Bigtable instance that contains the table
  • TABLE_ID : the ID of the Bigtable table to import
  • APPLICATION_PROFILE_ID : the ID of the Bigtable application profile to be used for the import
  • SOURCE_PATTERN : the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
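
For example, with placeholder values filled in, the command might look like the following. All resource names (project, instance, table, app profile, and bucket) are hypothetical, and the optional mutationThrottleLatencyMs parameter is included to show where optional parameters go:

gcloud dataflow jobs run sequencefile-import-job \
    --gcs-location gs://dataflow-templates-us-central1/latest/GCS_SequenceFile_to_Cloud_Bigtable \
    --region us-central1 \
    --parameters \
bigtableProject=my-project,\
bigtableInstanceId=my-instance,\
bigtableTableId=my-table,\
bigtableAppProfileId=my-app-profile,\
sourcePattern=gs://my-bucket/exports/part-*,\
mutationThrottleLatencyMs=100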

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch .

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/GCS_SequenceFile_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProject": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "bigtableAppProfileId": "APPLICATION_PROFILE_ID",
       "sourcePattern": "SOURCE_PATTERN"
   },
   "environment": {
       "zone": "us-central1-f"
   }
}

Replace the following:

  • PROJECT_ID : the Google Cloud project ID where you want to run the Dataflow job
  • JOB_NAME : a unique job name of your choice
  • VERSION : the version of the template that you want to use

    You can use the following values:

      ◦ latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-LOCATION/latest/
      ◦ the version name of a specific template version, which is nested in the respective dated parent folder in the bucket: gs://dataflow-templates-LOCATION/

  • LOCATION : the region where you want to deploy your Dataflow job—for example, us-central1
  • BIGTABLE_PROJECT_ID : the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID : the ID of the Bigtable instance that contains the table
  • TABLE_ID : the ID of the Bigtable table to import
  • APPLICATION_PROFILE_ID : the ID of the Bigtable application profile to be used for the import
  • SOURCE_PATTERN : the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
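
For example, one way to send this request is with curl, using an access token from the gcloud CLI for authentication. All resource names and the job name below are placeholders:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
          "jobName": "sequencefile-import-job",
          "parameters": {
            "bigtableProject": "my-project",
            "bigtableInstanceId": "my-instance",
            "bigtableTableId": "my-table",
            "bigtableAppProfileId": "my-app-profile",
            "sourcePattern": "gs://my-bucket/exports/part-*"
          },
          "environment": { "zone": "us-central1-f" }
        }' \
    "https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/templates:launch?gcsPath=gs://dataflow-templates-us-central1/latest/GCS_SequenceFile_to_Cloud_Bigtable"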
