The Cloud Storage batch source plugin lets you read data from Cloud Storage buckets and bring it into Cloud Data Fusion for further processing and transformation. It supports loading data from multiple file formats, including the following:
- Structured: CSV, Avro, Parquet, ORC
- Semi-structured: JSON, XML
- Others: Text, Binary
Before you begin
Cloud Data Fusion typically has two service accounts:
- Design-time service account: Cloud Data Fusion API Service Agent
- Execution-time service account: Compute Engine Service Account
Before using the Cloud Storage batch source plugin, grant the following role or permissions to each service account.
Cloud Data Fusion API Service Agent
This service account already has all the required permissions and you don't need to add additional permissions.
Compute Engine Service Account
In your Google Cloud project, grant the following IAM roles or permissions to the Compute Engine Service Account:
- Storage Legacy Bucket Reader (roles/storage.legacyBucketReader). This predefined role contains the required storage.buckets.get permission.
- Storage Object Viewer (roles/storage.objectViewer). This predefined role contains the following required permissions:
  - storage.objects.get
  - storage.objects.list
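If you prefer to grant these roles programmatically instead of in the console, the following is a minimal sketch that uses the Cloud Storage client library for Python. It assumes you grant both roles on the specific bucket that the pipeline reads; the bucket name and service account address are placeholders.

```python
from google.cloud import storage

# Placeholders: replace with your bucket and the Compute Engine service account.
BUCKET_NAME = "my-bucket"
MEMBER = "serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Read the current bucket-level IAM policy and append the two role bindings.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({"role": "roles/storage.legacyBucketReader", "members": {MEMBER}})
policy.bindings.append({"role": "roles/storage.objectViewer", "members": {MEMBER}})
bucket.set_iam_policy(policy)
```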
Configure the plugin
- Go to the Cloud Data Fusion web interface and click Studio.
- Check that Data Pipeline - Batch is selected (not Realtime).
- In the Source menu, click GCS. The Cloud Storage node appears in your pipeline.
- To configure the source, go to the Cloud Storage node and click Properties.
- Enter the following properties. For a complete list, see Properties.
- Enter a Label for the Cloud Storage node, for example, Cloud Storage tables.
- Enter the connection details. You can create a new, one-time connection or use an existing, reusable connection.
New connection
To add a one-time connection to Cloud Storage, follow these steps:
- Keep Use connection turned off.
- In the Project ID field, leave the value as auto-detect.
- In the Service account type field, leave the value as File path and the Service account file path as auto-detect.
Reusable connection
To reuse an existing connection, follow these steps:
- Turn on Use connection.
- Click Browse connections.
- Click the connection name, for example, Cloud Storage Default.
- Optional: if a connection doesn't exist and you want to create a new reusable connection, click Add connection and refer to the steps in the New connection tab on this page.
- In the Reference name field, enter a name to use for lineage, for example, data-fusion-gcs-campaign.
- In the Path field, enter the path to read from, for example, gs://BUCKET_PATH.
- In the Format field, select one of the following file formats for the data being read (an example schema for the blob and text formats follows these steps):
  - avro
  - blob (the blob format requires a schema that contains a field named body of type bytes)
  - csv
  - delimited
  - json
  - parquet
  - text (the text format requires a schema that contains a field named body of type string)
  - tsv
  - The name of any format plugin that you have deployed in your environment
- Optional: to test connectivity, click Get schema.
- Optional: in the Sample size field, enter the maximum rows to check for the selected data type, for example, 1000.
- Optional: in the Override field, enter the column names and their respective data types to skip.
- Optional: enter Advanced properties, such as a minimum split size or a regular expression path filter (see Properties).
- Optional: in the Temporary bucket name field, enter a name for the Cloud Storage bucket.
- Optional: click Validate and address any errors found.
- Click Close. Properties are saved and you can continue to build your data pipeline in the Cloud Data Fusion Studio.
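The blob and text formats in the Format step require a specific output schema, described in the list above. As a rough illustration, the following sketch builds the Avro-style schema JSON for the blob case; the record name etlSchemaBody is only an assumed placeholder, not a value the plugin requires.

```python
import json

# Hypothetical output schema for the blob format: a single field named "body"
# of type bytes. For the text format, change the field type to "string".
blob_schema = {
    "type": "record",
    "name": "etlSchemaBody",  # arbitrary record name; placeholder only
    "fields": [
        {"name": "body", "type": "bytes"},
    ],
}

print(json.dumps(blob_schema, indent=2))
```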
Properties
The following table describes key properties of the Cloud Storage batch source:

| Property | Description |
|---|---|
| Project ID | The Google Cloud project ID. Default is auto-detect. |
| Service account type | Select one of the following options: File path (the path where the service account is located) or JSON (JSON content of the service account). |
| Service account file path | Path to the service account key. Default is auto-detect. |
| Path | The path to read from, for example, gs://bucket/path/to/directory/. To match a filename pattern, you can use an asterisk (*) as a wildcard. If no files are found or matched, the pipeline fails. |
| Format | Format of the data to read: avro, blob (requires a schema that contains a field named body of type bytes), csv, delimited, json, parquet, text (requires a schema that contains a field named body of type string), tsv, or the name of any format plugin that you have deployed in your environment. If the format is a macro, only the pre-packaged formats can be used. |
| Enable quoted values | Whether to treat content between quotation marks as a single value. For example, the row 1, "a, b, c" parses into two fields: the first field has 1 as its value and the second has a, b, c. The quotation mark characters are trimmed. The newline delimiter cannot be within quotes. The plugin assumes the quotes are correctly enclosed, for example, "a, b, c". Not closing a quote ("a,b,c,) causes an error. Default value is False. |
| Use first row as header | Whether to use the first line of each file as the column headers. Default is False. |
| Minimum split size | Minimum size, in bytes, of each input partition. If the Format value is blob, you cannot split the data. |
| Maximum split size | Maximum size, in bytes, of each input partition. If the Format value is blob, you cannot split the data. Default is 128 MB. |
| Data file encryption | Whether files are encrypted. For details, see Data file encryption. Default is False. |
| Encryption metadata file suffix | The filename suffix for the encryption metadata file. Default is metadata. |
| File encoding | The character encoding of the files to read. Default is UTF-8. |
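To make the quoted-value behavior in the table concrete, here is a small sketch that reproduces the example row with Python's csv module. The plugin itself doesn't use Python; this only mirrors the parsing rule described above.

```python
import csv
import io

# The example row from the properties description: 1, "a, b, c"
row = '1, "a, b, c"'

# skipinitialspace lets the reader treat the quoted text after ", " as a single
# quoted field, so the row splits into two fields and the quotes are trimmed.
fields = next(csv.reader(io.StringIO(row), skipinitialspace=True))
print(fields)  # ['1', 'a, b, c']
```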
Data file encryption
This section describes the Data file encryption property. If you set it to true, files are decrypted using the Streaming AEAD provided by the Tink library. Each data file must be accompanied by a metadata file that contains the cipher information. For example, an encrypted data file at gs://BUCKET/PATH_TO_DIRECTORY/file1.csv.enc must have a metadata file at gs://BUCKET/PATH_TO_DIRECTORY/file1.csv.enc.metadata. The metadata file contains a JSON object with the following properties:
| Property | Description |
|---|---|
| kms | The Cloud Key Management Service URI that was used to encrypt the Data Encryption Key. |
| aad | The Base64-encoded Additional Authenticated Data used in the encryption. |
| keyset | A JSON object representing the serialized keyset information from the Tink library. |
Example
```json
{
  "kms": "gcp-kms://projects/my-key-project/locations/us-west1/keyRings/my-key-ring/cryptoKeys/mykey",
  "aad": "73iT4SUJBM24umXecCCf3A==",
  "keyset": {
    "keysetInfo": {
      "primaryKeyId": 602257784,
      "keyInfo": [
        {
          "typeUrl": "type.googleapis.com/google.crypto.tink.AesGcmHkdfStreamingKey",
          "outputPrefixType": "RAW",
          "keyId": 602257784,
          "status": "ENABLED"
        }
      ]
    },
    "encryptedKeyset": "CiQAz5HH+nUA0Zuqnz4LCnBEVTHS72s/zwjpcnAMIPGpW6kxLggSrAEAcJKHmXeg8kfJ3GD4GuFeWDZzgGn3tfolk6Yf5d7rxKxDEChIMWJWGhWlDHbBW5B9HqWfKx2nQWSC+zjM8FLefVtPYrdJ8n6Eg8ksAnSyXmhN5LoIj6az3XBugtXvCCotQHrBuyoDY+j5ZH9J4tm/bzrLEjCdWAc+oAlhsUAV77jZhowJr6EBiyVuRVfcwLwiscWkQ9J7jjHc7ih9HKfnqAZmQ6iWP36OMrEn"
  }
}
```
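For context, the following sketch shows how a data file could be encrypted with Tink's streaming AEAD so that it matches the layout this plugin expects. It is only an assumption-laden illustration: the file names, the AAD value, and the key template are placeholders, and writing the KMS-encrypted keyset into the metadata file is left as a comment.

```python
import base64

import tink
from tink import streaming_aead

streaming_aead.register()

# Generate a streaming AEAD keyset and obtain the primitive.
keyset_handle = tink.new_keyset_handle(
    streaming_aead.streaming_aead_key_templates.AES256_GCM_HKDF_1MB)
primitive = keyset_handle.primitive(streaming_aead.StreamingAead)

# Placeholder additional authenticated data; its Base64 form goes in "aad".
aad = b"example-additional-authenticated-data"

# Encrypt file1.csv into the file1.csv.enc object that the source reads.
with open("file1.csv", "rb") as plaintext, open("file1.csv.enc", "wb") as ciphertext:
    with primitive.new_encrypting_stream(ciphertext, aad) as enc_stream:
        enc_stream.write(plaintext.read())

# file1.csv.enc.metadata must then contain a JSON object with "kms" (the
# Cloud KMS key URI), "aad" (for example, base64.b64encode(aad)), and "keyset"
# (the keyset serialized with Tink and encrypted with that KMS key).
print(base64.b64encode(aad).decode())
```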
What's next
- Learn more about plugins in Cloud Data Fusion.

