Cloud Storage batch source
This page provides guidance about configuring the Cloud Storage batch source
plugin in Cloud Data Fusion.
The Cloud Storage batch source plugin lets you read data from
Cloud Storage buckets and bring it into Cloud Data Fusion for
further processing and transformation. It lets you load data from multiple file
formats, including the following:
Structured: CSV, Avro, Parquet, ORC
Semi-structured: JSON, XML
Others: Text, Binary
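For example, the following sketch stages a small CSV object in Cloud Storage that the batch source could then read. It uses the Cloud Storage Python client; the project, bucket, and object names are placeholder assumptions.

```python
from google.cloud import storage

# Stage a small CSV object that the Cloud Storage batch source can read.
# "my-project", "my-bucket", and the object path are placeholder values.
client = storage.Client(project="my-project")
bucket = client.bucket("my-bucket")
blob = bucket.blob("campaigns/2024/campaign.csv")
blob.upload_from_string(
    "id,name,budget\n1,spring_sale,1000\n2,summer_sale,2500\n",
    content_type="text/csv",
)
print(f"Uploaded gs://{bucket.name}/{blob.name}")
```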
Before you begin
Cloud Data Fusion typically has two service accounts:

Design-time service account: Cloud Data Fusion API Service Agent
Execution-time service account: Compute Engine Service Account

Before you use the Cloud Storage batch source plugin, grant the following roles or permissions to each service account.

Cloud Data Fusion API Service Agent

This service account already has all the required permissions, and you don't need to add any. Note: when you design a pipeline, you need the storage.buckets.list permission on the bucket used by the pipeline. It isn't required to execute the pipeline.

Compute Engine Service Account

In your Google Cloud project, grant the following IAM roles or permissions to the Compute Engine Service Account:

Storage Legacy Bucket Reader (roles/storage.legacyBucketReader). This predefined role contains the required storage.buckets.get permission.
Storage Object Viewer (roles/storage.objectViewer). This predefined role contains the required storage.objects.get and storage.objects.list permissions.
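You can grant these roles in the Google Cloud console or with gcloud. As a minimal sketch using the Cloud Storage Python client, where the project, bucket, and service account email are placeholder assumptions:

```python
from google.cloud import storage

# Grant the execution service account read access on the source bucket.
# The project, bucket, and service account email are placeholder values.
client = storage.Client(project="my-project")
bucket = client.bucket("my-bucket")

policy = bucket.get_iam_policy(requested_policy_version=3)
member = "serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com"
for role in ("roles/storage.legacyBucketReader", "roles/storage.objectViewer"):
    policy.bindings.append({"role": role, "members": {member}})
bucket.set_iam_policy(policy)
```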
Configure the plugin

Go to the Cloud Data Fusion web interface and click Studio.
Check that Data Pipeline - Batch is selected (not Realtime).
In the Source menu, click GCS. The Cloud Storage node
appears in your pipeline.
To configure the source, go to the Cloud Storage node and click Properties.
Enter the following properties. For a complete list, see Properties.
Enter a Label for the Cloud Storage node, for
example, Cloud Storage tables.
Enter the connection details. You can set up a new, one-time connection
or use an existing, reusable connection.
New connection
To add a one-time connection to Cloud Storage, follow these
steps:
Keep Use connection turned off.
In the Project ID field, leave the value as auto-detect.
In the Service account type field, leave the value as File
path and the Service account file path as auto-detect.
Note: if the plugin isn't running on a Dataproc cluster, enter values for the
Service account type and Service account file path fields. For more information, see Properties.
Reusable connection
To reuse an existing connection, follow these steps:
Turn on Use connection.
Click Browse connections.
Click the connection name, for example, Cloud Storage Default.
Optional: if a connection doesn't exist and you want to create a
new reusable connection, click Add connection and refer to the
steps in the New connection tab on this page.
In the Reference name field, enter a name to use for
lineage, for example, data-fusion-gcs-campaign.
In the Path field, enter the path to read from, for
example, gs://BUCKET_PATH.
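Before you run the pipeline, you can confirm that the path contains objects. A minimal sketch using the Cloud Storage Python client, where the bucket name and prefix are placeholder assumptions:

```python
from google.cloud import storage

# List the objects under the prefix that the Path property points at.
# "my-bucket" and the prefix are placeholder values.
client = storage.Client()
blobs = list(client.list_blobs("my-bucket", prefix="path/to/directory/"))
if not blobs:
    raise SystemExit("No objects found; the pipeline would fail for this path.")
for blob in blobs:
    print(f"gs://my-bucket/{blob.name} ({blob.size} bytes)")
```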
In the Format field, select one of the following file formats for
the data being read:
avro
blob (the blob format requires a schema that contains a field
named body of type bytes)
csv
delimited
json
parquet
text (the text format requires a schema that contains a field
named body of type string)
tsv
The name of any format plugin that you have deployed in your
environment
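For instance, the text format expects an output schema with a single body field of type string (blob expects a body field of type bytes). A sketch of such a schema in the Avro-style JSON that Cloud Data Fusion uses, where the record name is an illustrative assumption:

```python
import json

# Avro-style schema with the single "body" field that the text format requires.
# For the blob format, use "type": "bytes" instead. The record name is illustrative.
text_schema = {
    "type": "record",
    "name": "etlSchemaBody",
    "fields": [{"name": "body", "type": "string"}],
}
print(json.dumps(text_schema, indent=2))
```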
Optional: to test connectivity, click Get schema.
Optional: in the Sample size field, enter the maximum rows to check
for the selected data type, for example, 1000.
Optional: in the Override field, enter the column names and their
respective data types; automatic data type detection is skipped for these columns.
Optional: enter Advanced properties, such as a minimum split size or
a regular expression path filter (see Properties).
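For example, a Regex path filter such as .*\.csv$ keeps only objects whose full path ends in .csv. The following sketch uses Python's re module as a stand-in for the Java Pattern class that the plugin uses; the bucket and paths are placeholder assumptions:

```python
import re

# The filter is compared against the full path, not just the filename.
pattern = re.compile(r".*\.csv$")
paths = [
    "gs://my-bucket/sales/2024/data.csv",
    "gs://my-bucket/sales/2024/notes.txt",
]
print([p for p in paths if pattern.match(p)])  # only the .csv path remains
```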
Optional: in the Temporary bucket name field, enter a name
for the Cloud Storage bucket.
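If you plan to use a specific temporary bucket, you can create it ahead of time. A sketch using the Cloud Storage Python client, where the project, bucket name, and location are placeholder assumptions:

```python
from google.cloud import storage
from google.cloud.exceptions import NotFound

# Create the temporary bucket ahead of time if it doesn't already exist.
# "my-project", "my-temp-bucket", and the location are placeholder values.
client = storage.Client(project="my-project")
try:
    client.get_bucket("my-temp-bucket")
except NotFound:
    client.create_bucket("my-temp-bucket", location="us-west1")
```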
Optional: click Validate and address any errors found.
Click Close. Properties are saved and you can continue to build your
data pipeline in the Cloud Data Fusion Studio.
Properties

The following list describes each property, whether it supports macros (macro enabled), and whether it's required.

Label (macro enabled: No; required: Yes)
The name of the node in your data pipeline.

Use connection (macro enabled: No; required: No)
Browse for a reusable connection to the source. For more information about adding, importing, and editing the connections that appear when you browse connections, see Manage connections.

Connection (macro enabled: Yes; required: Yes)
If Use connection is turned on, the name of the reusable connection you select appears in this field.

Project ID (macro enabled: Yes; required: No)
Used only when Use connection is turned off. A globally unique identifier for the project. Default is auto-detect.

Service account type (macro enabled: Yes; required: No)
Select one of the following options:
File path: the path where the service account is located.
JSON: JSON content of the service account.

Service account file path (macro enabled: Yes; required: No)
Used only when the Service account type value is File path. The path on the local file system of the service account key used for authorization. If jobs run on Dataproc clusters, set the value to auto-detect. If jobs run on other types of clusters, the file must be present on every node in the cluster. Default is auto-detect.

Service account JSON (macro enabled: Yes; required: No)
Used only when the Service account type value is JSON. The JSON file content of the service account.

Reference name (macro enabled: No; required: Yes)
Name that uniquely identifies this source for other services, such as lineage and annotating metadata.

Path (macro enabled: Yes; required: Yes)
Path to the files to be read. If a directory is specified, end the path with a forward slash (/), for example, gs://bucket/path/to/directory/. To match a filename pattern, you can use an asterisk (*) as a wildcard. If no files are found or matched, the pipeline fails.

Format (macro enabled: No; required: Yes)
Format of the data to read. The format must be one of the following:
avro
blob (the blob format requires a schema that contains a field named body of type bytes)
csv
delimited
json
parquet
text (the text format requires a schema that contains a field named body of type string)
tsv
The name of any format plugin that you have deployed in your environment
If the format is a macro, only the pre-packaged formats can be used.

Sample size (macro enabled: Yes; required: No)
The maximum number of rows that are investigated for automatic data type detection. Default is 1000.

Override (macro enabled: Yes; required: No)
A list of columns and their corresponding data types; automatic data type detection is skipped for these columns.

Delimiter (macro enabled: Yes; required: No)
Delimiter to use when the format is delimited. This property is ignored for other formats.

Enable quoted values (macro enabled: Yes; required: No)
Whether to treat content between quotes as a value. This property is only used for the csv, tsv, or delimited formats. For example, if this property is set to true, the input 1, "a, b, c" outputs two fields. The first field has 1 as its value, and the second has a, b, c. The quotation mark characters are trimmed. The newline delimiter cannot be within quotes. The plugin assumes the quotes are correctly enclosed, for example, "a, b, c". Not closing a quote ("a,b,c,) causes an error. Default is False.

Use first row as header (macro enabled: Yes; required: No)
Whether to use the first line of each file as the column header. Supported formats are text, csv, tsv, and delimited. Default is False.

Minimum split size (macro enabled: Yes; required: No)
Minimum size, in bytes, for each input partition. Smaller partitions increase the level of parallelism, but require more resources and overhead. If the Format value is blob, you cannot split the data.

Maximum split size (macro enabled: Yes; required: No)
Maximum size, in bytes, for each input partition. Smaller partitions increase the level of parallelism, but require more resources and overhead. If the Format value is blob, you cannot split the data. Default is 128 MB.

Regex path filter (macro enabled: Yes; required: No)
Regular expression that file paths must match to be included in the input. The full path is compared, not just the filename. If no value is given, no file filtering is done. For more information about regular expression syntax, see Pattern.

Path field (macro enabled: Yes; required: No)
Output field in which to place the path of the file that the record was read from. If not specified, the path isn't included in output records. If specified, the field must exist in the output schema as a string.

Path filename only (macro enabled: Yes; required: No)
If a Path field property is set, use only the filename and not the URI of the path. Default is False.

Read files recursively (macro enabled: Yes; required: No)
Whether files are read recursively from the path. Default is False.

Allow empty input (macro enabled: Yes; required: No)
Whether to allow an input path that contains no data. When set to False, the plugin errors when there is no data to read. When set to True, no error is thrown and zero records are read. Default is False.

Data file encrypted (macro enabled: Yes; required: No)
Whether files are encrypted. For more information, see Data file encryption. Default is False.

Encryption metadata file suffix (macro enabled: Yes; required: No)
The filename suffix for the encryption metadata file. Default is metadata.

File system properties (macro enabled: Yes; required: No)
Additional properties to use with the InputFormat when reading the data.

File encoding (macro enabled: Yes; required: No)
The character encoding for the files to be read. Default is UTF-8.

Output schema (macro enabled: Yes; required: No)
If a Path field property is set, it must be present in the schema as a string.
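To illustrate the Enable quoted values behavior described above, the following sketch uses Python's csv module as a stand-in for the plugin's parser; it is not the plugin's actual implementation.

```python
import csv
import io

# With quoted values enabled, content between quotes is treated as one value
# and the quotation marks are trimmed.
line = '1, "a, b, c"'
row = next(csv.reader(io.StringIO(line), skipinitialspace=True))
print(row)  # ['1', 'a, b, c'] -> two fields, quotes trimmed
```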
Data file encryption
This section describes the Data file encryption property. If you set it to true, files are decrypted
using the Streaming AEAD provided by the Tink library. Each data file
must be accompanied by a metadata file that contains the cipher
information. For example, an encrypted data file at gs://BUCKET/PATH_TO_DIRECTORY/file1.csv.enc must have a metadata file at gs://BUCKET/PATH_TO_DIRECTORY/file1.csv.enc.metadata. The metadata file
contains a JSON object with the following properties:
kms: the Cloud Key Management Service URI that was used to encrypt the Data Encryption Key.
aad: the Base64-encoded Additional Authenticated Data used in the encryption.
keyset: a JSON object representing the serialized keyset information from the Tink library.
Example
/* Counting example */
{
  "kms": "gcp-kms://projects/my-key-project/locations/us-west1/keyRings/my-key-ring/cryptoKeys/mykey",
  "aad": "73iT4SUJBM24umXecCCf3A==",
  "keyset": {
    "keysetInfo": {
      "primaryKeyId": 602257784,
      "keyInfo": [{
        "typeUrl": "type.googleapis.com/google.crypto.tink.AesGcmHkdfStreamingKey",
        "outputPrefixType": "RAW",
        "keyId": 602257784,
        "status": "ENABLED"
      }]
    },
    "encryptedKeyset": "CiQAz5HH+nUA0Zuqnz4LCnBEVTHS72s/zwjpcnAMIPGpW6kxLggSrAEAcJKHmXeg8kfJ3GD4GuFeWDZzgGn3tfolk6Yf5d7rxKxDEChIMWJWGhWlDHbBW5B9HqWfKx2nQWSC+zjM8FLefVtPYrdJ8n6Eg8ksAnSyXmhN5LoIj6az3XBugtXvCCotQHrBuyoDY+j5ZH9J4tm/bzrLEjCdWAc+oAlhsUAV77jZhowJr6EBiyVuRVfcwLwiscWkQ9J7jjHc7ih9HKfnqAZmQ6iWP36OMrEn"
  }
}
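As a sketch of the companion-file convention, the metadata object can be written next to the encrypted data file as follows. The bucket, object path, KMS URI, AAD, and keyset file are placeholder assumptions.

```python
import json
from google.cloud import storage

# Write the companion metadata object next to an already-encrypted data file,
# following the "<data file>.metadata" naming convention (default suffix).
# The bucket, object path, KMS URI, AAD, and keyset file are placeholder values.
client = storage.Client()
bucket = client.bucket("my-bucket")
data_object = "path/to/directory/file1.csv.enc"

with open("keyset_info.json") as f:
    keyset = json.load(f)  # serialized Tink keyset information

metadata = {
    "kms": "gcp-kms://projects/my-key-project/locations/us-west1/keyRings/my-key-ring/cryptoKeys/mykey",
    "aad": "73iT4SUJBM24umXecCCf3A==",
    "keyset": keyset,
}
bucket.blob(data_object + ".metadata").upload_from_string(
    json.dumps(metadata), content_type="application/json"
)
```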
Release notes

September 6, 2023: https://cdap.atlassian.net/wiki/spaces/DOCS/pages/1280901131/CDAP+Hub+Release+Log#September-6%2C-2023

What's next

Learn more about plugins in Cloud Data Fusion (/data-fusion/docs/concepts/plugins).