Loading CSV data from Cloud Storage
When you load CSV data from Cloud Storage, you can load the data into a new
table or partition, or you can append to or overwrite an existing table or
partition. When your data is loaded into BigQuery, it is
converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table,
the dataset that contains the table must be in the same regional or
multi-regional location as the Cloud Storage bucket.
You are subject to the following limitations when you load data into
BigQuery from a Cloud Storage bucket:
BigQuery does not guarantee data consistency for external data
sources. Changes to the underlying data while a query is running can result in
unexpected behavior.
BigQuery doesn't support Cloud Storage object versioning. If you
include a generation number in the Cloud Storage URI, then the load job
fails.
When you load CSV files into BigQuery, note the following:
CSV files don't support nested or repeated data.
Remove byte order mark (BOM) characters. They might cause unexpected issues.
If you use gzip compression,
BigQuery cannot read the data in parallel. Loading compressed
CSV data into BigQuery is slower than loading uncompressed data.
See Loading compressed and uncompressed data.
You cannot include both compressed and uncompressed files in the same load
job.
The maximum size for a gzip file is 4 GB.
Loading CSV data using schema autodetection does not automatically detect headers if all of the columns are string types.
In this case, add a numerical column to the input or declare the schema
explicitly.
When you load CSV or JSON data, values in DATE columns must use the dash
(-) separator, and the date must be in the following format: YYYY-MM-DD (year-month-day).
When you load JSON or CSV data, values in TIMESTAMP columns
must use a dash (-) or slash (/) separator for the date portion of the
timestamp, and the date must be in one of the following formats: YYYY-MM-DD (year-month-day) or YYYY/MM/DD (year/month/day).
The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon
(:) separator.
Your files must meet the CSV file size limits described in the load jobs limits.
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary
permissions to perform each task in this document, and create a dataset
to store your data.
Required permissions
To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.
Permissions to load data into BigQuery
To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:
bigquery.tables.create
bigquery.tables.updateData
bigquery.tables.update
bigquery.jobs.create
Several predefined IAM roles, including roles/bigquery.admin, include the permissions that you need to load data into a BigQuery table or partition.
To get the permissions that
you need to load data from a Cloud Storage bucket,
ask your administrator to grant you the Storage Admin (roles/storage.admin)
IAM role on the bucket.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains
the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are
required, expand the Required permissions section:
Required permissions
The following permissions are required to load data from a Cloud Storage bucket:
storage.buckets.get
storage.objects.get
storage.objects.list (required if you are using a URI wildcard)
You can use the gzip utility to compress CSV files. Note that gzip performs
full file compression, unlike the file content compression performed by
compression codecs for other file formats, such as Avro. Using gzip to
compress your CSV files might have a performance impact; for more information
about the trade-offs, see Loading compressed and uncompressed data.
Loading CSV data into a table
To load CSV data from Cloud Storage into a new BigQuery
table, select one of the following options:
Console
To follow step-by-step guidance for this task directly in the
Cloud Shell Editor, click Guide me:
In the Explorer pane, expand your project, and then select a dataset.
In the Dataset info section, click Create table.
In the Create table panel, specify the following details:
In the Source section, select Google Cloud Storage in the Create table from list.
Then, do the following:
Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI.
You cannot include multiple URIs
in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same
location as the dataset that contains the table you want to create, append, or
overwrite.
For File format, select CSV.
In the Destination section, specify the following details:
For Dataset, select the dataset in which you want to create the
table.
In the Table field, enter the name of the table that you want to create.
Verify that the Table type field is set to Native table.
In the Schema section, enter the schema definition.
To enable the auto detection of a schema,
select Auto detect.
You can enter schema information manually by using one of
the following methods:
Option 1: Click Edit as text and paste the schema in the form of a
JSON array. When you use a JSON array, you generate the schema using the
same process as creating a JSON schema file.
You can view the schema of an existing table in JSON format by entering the following
command:
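A sketch of the bq command-line tool command that prints a table's schema as JSON (replace the dataset and table names with your own):

bq show --format=prettyjson dataset.table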
For Write preference, leave Write if empty selected. This
option creates a new table and loads your data into it.
For Number of errors allowed, accept the default value of 0 or
enter the maximum number of rows containing errors that can be ignored.
If the number of rows with errors exceeds this value, the job will
result in an invalid message and fail. This option applies only to CSV and
JSON files.
For Time zone, enter the default time zone to apply when parsing timestamp
values that have no specific time zone. See the list of valid time zone
names for more information. If this value is not present, timestamp values without a specific
time zone are parsed using the default time zone, UTC.
(Preview).
For Date Format, enter the format elements that define how the DATE values are formatted in the input files. This
field expects SQL-style format elements (for example, MM/DD/YYYY). If this value is
present, this format is the only compatible DATE format. Schema autodetection also
decides the DATE column type based on this format instead of the existing format. If this
value is not present, the DATE field is parsed with the default formats.
(Preview).
For Datetime Format, enter the format elements that define how the DATETIME values are formatted in the input files.
This field expects SQL-style format elements (for example, MM/DD/YYYY HH24:MI:SS.FF3).
If this value is present, this format is the only compatible DATETIME format. Schema autodetection also
decides the DATETIME column type based on this format instead of the existing format. If
this value is not present, the DATETIME field is parsed with the default formats.
(Preview).
For Time Format, enter the format elements that define how the TIME values are formatted in the input files. This
field expects SQL-style format elements (for example, HH24:MI:SS.FF3). If this value is
present, this format is the only compatible TIME format. Schema autodetection also
decides the TIME column type based on this format instead of the existing format. If this
value is not present, the TIME field is parsed with the default formats.
(Preview).
For Timestamp Format, enter the format elements that define how the TIMESTAMP values are formatted in the input files.
This field expects SQL-style format elements (for example, MM/DD/YYYY HH24:MI:SS.FF3).
If this value is present, this format is the only compatible TIMESTAMP format. Schema autodetection also
decides the TIMESTAMP column type based on this format instead of the existing format. If
this value is not present, the TIMESTAMP field is parsed with the default formats.
(Preview).
If you want to ignore values in a row that are not present in the table's schema, then
select Unknown values.
For Field delimiter, choose the character that separates the cells
in your CSV file: Comma, Tab, Pipe, or Custom. If you
choose Custom, enter the delimiter in the Custom field delimiter box. The default value is Comma.
For Source column match, choose one of the following strategies
used to match the loaded columns to the schema (Preview).
Default: Default behavior is chosen based on how the schema is provided.
If autodetect is enabled, then the default behavior is to match columns by name.
Otherwise, the default is to match columns by position. This is done to keep the
behavior backward-compatible.
Position: Matches columns by position, assuming that the columns are
ordered the same way as the schema.
Name: Matches by name by reading the header row as the column names and
reordering columns to match the field names in the schema. Column names are read from
the last skipped row based on Header rows to skip.
For Header rows to skip, enter the number of header rows to skip
at the top of the CSV file. The default value is 0.
For Quoted newlines, check Allow quoted newlines to allow
quoted data sections that contain newline characters in a CSV file. The
default value is false.
For Jagged rows, check Allow jagged rows to accept rows in CSV
files that are missing trailing optional columns. The missing values are
treated as nulls. If unchecked, records with missing trailing columns
are treated as bad records, and if there are too many bad records, an
invalid error is returned in the job result. The default value is false.
For Null markers, enter a list of custom strings that represent NULL values
in CSV data. (Preview).
Use the bq load command, specify CSV using the --source_format flag, and include a Cloud Storage URI.
You can include a single URI, a comma-separated list of URIs, or a URI
containing a wildcard.
Supply the schema inline, in a schema definition file, or use schema auto-detect. If you don't specify a
schema, and --autodetect is false, and the destination
table exists, then the schema of the destination table is used.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
--allow_jagged_rows: When specified, accepts rows in CSV files that are
missing trailing optional columns. The missing values are treated as nulls.
Otherwise, records with missing trailing columns are treated as bad
records, and if there are too many bad records, an invalid error is returned
in the job result. The default value is false.
--allow_quoted_newlines: When specified, allows quoted data sections
that contain newline characters in a CSV file. The default value is false.
--field_delimiter: The character that indicates the boundary between
columns in the data. Both \t and tab are allowed for tab delimiters.
The default value is ,.
--null_marker: An optional custom string that represents a NULL value in
CSV data.
--null_markers: (Preview) An optional
comma-separated list of custom strings that represent NULL values in CSV
data. This option cannot be used with the --null_marker flag.
--source_column_match: (Preview)
Specifies the strategy used to match loaded columns to the schema. You can
specify POSITION to match loaded columns by position, assuming that the
columns are ordered the same way as the schema. You can also specify NAME to match by name by reading the header row as the column names and
reordering columns to match the field names in the schema. If this value is
unspecified, then the default is based on how the schema is provided. If --autodetect is enabled, then the default behavior is to match columns by
name. Otherwise, the default is to match columns by position.
--skip_leading_rows: Specifies the number of header rows to skip
at the top of the CSV file. The default value is 0.
--quote: The quote character to use to enclose records. The default
value is ". To indicate no quote character, use an empty string.
--max_bad_records: An integer that specifies the maximum number of bad
records allowed before the entire job fails. The default value is 0. At
most, five errors of any type are returned regardless of the --max_bad_records value.
--ignore_unknown_values: When specified, allows and ignores extra,
unrecognized values in CSV or JSON data.
--time_zone: (Preview) An optional
default time zone that will apply when parsing timestamp values that have no
specific time zone in CSV or JSON data.
--date_format: (Preview) An optional
custom string that defines how the DATE values are formatted in CSV or JSON
data.
--datetime_format: (Preview) An
optional custom string that defines how the DATETIME values are formatted in
CSV or JSON data.
--time_format: (Preview) An optional
custom string that defines how the TIME values are formatted in CSV or JSON
data.
--timestamp_format: (Preview) An
optional custom string that defines how the TIMESTAMP values are formatted
in CSV or JSON data.
--autodetect: When specified, enable schema auto-detection for CSV and
JSON data.
--time_partitioning_type: Enables time-based partitioning on a table and
sets the partition type. Possible values are HOUR, DAY, MONTH, and YEAR. This flag is optional when you create a
table partitioned on a DATE, DATETIME, or TIMESTAMP column. The
default partition type for time-based partitioning is DAY. You cannot
change the partitioning specification on an existing table.
--time_partitioning_expiration: An integer that specifies (in seconds)
when a time-based partition should be deleted. The expiration time evaluates
to the partition's UTC date plus the integer value.
--time_partitioning_field: The DATE or TIMESTAMP column used to
create a partitioned table. If time-based partitioning is enabled without
this value, an ingestion-time partitioned table is created.
--require_partition_filter: When enabled, this option requires users
to include a WHERE clause that specifies the partitions to query.
Requiring a partition filter may reduce cost and improve performance.
For more information, see Querying partitioned tables.
--clustering_fields: A comma-separated list of up to four column names
used to create a clustered table.
--destination_kms_key: The Cloud KMS key for encryption of the
table data.
--column_name_character_map: Defines the scope and handling of
characters in column names, with the option of enabling flexible column names.
Requires the --autodetect option for CSV files.
For more information, see load_option_list.
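The general form of the command is sketched below; the placeholders are described in the list that follows:

bq --location=location load \
    --source_format=format \
    dataset.table \
    path_to_source \
    schema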
location is your location. The --location flag is optional.
For example, if you are using BigQuery in the Tokyo region,
you can set the flag's value to asia-northeast1. You can set a default
value for the location by using the .bigqueryrc file.
format is CSV.
dataset is an existing dataset.
table is the name of the table into which you're loading data.
path_to_source is a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
schema is a valid schema. The schema can be a local JSON file,
or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.
Examples:
The following command loads data from gs://mybucket/mydata.csv into a
table named mytable in mydataset. The schema is defined in a local
schema file named myschema.json.
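A sketch of the corresponding command (adjust paths and names to your environment):

bq load \
    --source_format=CSV \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    ./myschema.json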
The following command loads data from gs://mybucket/mydata.csv into a
table named mytable in mydataset. The schema is defined in a local
schema file named myschema.json. The CSV file includes two header rows.
If --skip_leading_rows is unspecified, the default behavior is to assume
the file does not contain headers.
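A sketch of the corresponding command, using --skip_leading_rows to skip the two header rows:

bq load \
    --source_format=CSV \
    --skip_leading_rows=2 \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    ./myschema.json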
The following command loads data from gs://mybucket/mydata.csv into an
ingestion-time partitioned table named mytable in mydataset. The schema
is defined in a local schema file named myschema.json.
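A sketch of the corresponding command, using --time_partitioning_type to create an ingestion-time partitioned table:

bq load \
    --source_format=CSV \
    --time_partitioning_type=DAY \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    ./myschema.json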
The following command loads data from gs://mybucket/mydata.csv into a new
partitioned table named mytable in mydataset. The table is partitioned
on the mytimestamp column. The schema is defined in a local schema file
named myschema.json.
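A sketch of the corresponding command, using --time_partitioning_field to partition on the mytimestamp column:

bq load \
    --source_format=CSV \
    --time_partitioning_field mytimestamp \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    ./myschema.json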
The following command loads data from gs://mybucket/mydata.csv into a
table named mytable in mydataset. The schema is defined inline in the
format field:data_type,field:data_type.
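A sketch of the corresponding command; the field names and types shown are hypothetical, so substitute your own columns:

bq load \
    --source_format=CSV \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    name:STRING,count:INTEGER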
The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a
wildcard. The schema is autodetected.
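A sketch of the corresponding command; the wildcard pattern shown is illustrative:

bq load \
    --autodetect \
    --source_format=CSV \
    mydataset.mytable \
    "gs://mybucket/mydata*.csv"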
The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a
comma-separated list of Cloud Storage URIs with wildcards. The schema is
defined in a local schema file named myschema.json.
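A sketch of the corresponding command; the two wildcard URIs are hypothetical examples:

bq load \
    --source_format=CSV \
    mydataset.mytable \
    "gs://mybucket/00/*.csv","gs://mybucket/01/*.csv" \
    ./myschema.json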
Create a load job that points to the source data in Cloud Storage.
(Optional) Specify your location in
the location property in the jobReference section of the job resource.
The source URIs property must be fully qualified, in the format gs://bucket/object.
Each URI can contain one '*' wildcard character.
Specify the CSV data format by setting the sourceFormat property to CSV.
To check the job status, call jobs.get(job_id),
where job_id is the ID of the job returned by the initial
request.
If status.state = DONE, the job completed successfully.
If the status.errorResult property is present, the request failed,
and that object will include information describing what went wrong.
When a request fails, no table is created and no data is loaded.
If status.errorResult is absent, the job finished successfully,
although there might have been some nonfatal errors, such as problems
importing a few rows. Nonfatal errors are listed in the returned job
object's status.errors property.
API notes:
Load jobs are atomic and consistent; if a load job fails, none of the data
is available, and if a load job succeeds, all of the data is available.
As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This
approach is more robust to network failure because the client can poll or
retry on the known job ID.
Calling jobs.insert on a given job ID is idempotent. You can retry as
many times as you like on the same job ID, and at most one of those
operations will succeed.
using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsCsv
{
    public void LoadTableGcsCsv(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.csv";
        var dataset = client.GetDataset(datasetId);
        var schema = new TableSchemaBuilder
        {
            { "name", BigQueryDbType.String },
            { "post_abbr", BigQueryDbType.String }
        }.Build();
        var destinationTableRef = dataset.GetTableReference(tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            // The source format defaults to CSV; line below is optional.
            SourceFormat = FileFormat.Csv,
            SkipLeadingRows = 1
        };
        // Create and run job
        var loadJob = client.CreateLoadJob(
            sourceUri: gcsURI, destination: destinationTableRef,
            schema: schema, options: jobOptions);
        loadJob = loadJob.PollUntilCompleted().ThrowOnAnyError();  // Waits for the job to complete.
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine(
            $"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}
import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importCSVExplicitSchema demonstrates loading CSV data from Cloud Storage into a BigQuery
// table and providing an explicit schema for the data.
func importCSVExplicitSchema(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.csv")
	gcsRef.SkipLeadingRows = 1
	gcsRef.Schema = bigquery.Schema{
		{Name: "name", Type: bigquery.StringFieldType},
		{Name: "post_abbr", Type: bigquery.StringFieldType},
	}
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
	loader.WriteDisposition = bigquery.WriteEmpty

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
importcom.google.cloud.bigquery.BigQuery;importcom.google.cloud.bigquery.BigQueryException;importcom.google.cloud.bigquery.BigQueryOptions;importcom.google.cloud.bigquery.CsvOptions;importcom.google.cloud.bigquery.Field;importcom.google.cloud.bigquery.Job;importcom.google.cloud.bigquery.JobInfo;importcom.google.cloud.bigquery.LoadJobConfiguration;importcom.google.cloud.bigquery.Schema;importcom.google.cloud.bigquery.StandardSQLTypeName;importcom.google.cloud.bigquery.TableId;// Sample to load CSV data from Cloud Storage into a new BigQuery tablepublicclassLoadCsvFromGcs{publicstaticvoidrunLoadCsvFromGcs()throwsException{// TODO(developer): Replace these variables before running the sample.StringdatasetName="MY_DATASET_NAME";StringtableName="MY_TABLE_NAME";StringsourceUri="gs://cloud-samples-data/bigquery/us-states/us-states.csv";Schemaschema=Schema.of(Field.of("name",StandardSQLTypeName.STRING),Field.of("post_abbr",StandardSQLTypeName.STRING));loadCsvFromGcs(datasetName,tableName,sourceUri,schema);}publicstaticvoidloadCsvFromGcs(StringdatasetName,StringtableName,StringsourceUri,Schemaschema){try{// Initialize client that will be used to send requests. This client only needs to be created// once, and can be reused for multiple requests.BigQuerybigquery=BigQueryOptions.getDefaultInstance().getService();// Skip header row in the file.CsvOptionscsvOptions=CsvOptions.newBuilder().setSkipLeadingRows(1).build();TableIdtableId=TableId.of(datasetName,tableName);LoadJobConfigurationloadConfig=LoadJobConfiguration.newBuilder(tableId,sourceUri,csvOptions).setSchema(schema).build();// Load data from a GCS CSV file into the tableJobjob=bigquery.create(JobInfo.of(loadConfig));// Blocks until this load table job completes its execution, either failing or succeeding.job=job.waitFor();if(job.isDone()){System.out.println("CSV from GCS successfully added during load append job");}else{System.out.println("BigQuery was unable to load into the table due to an error:"+job.getStatus().getError());}}catch(BigQueryException|InterruptedExceptione){System.out.println("Column not added during load append \n"+e.toString());}}}
// Import the Google Cloud client librariesconst{BigQuery}=require('@google-cloud/bigquery');const{Storage}=require('@google-cloud/storage');// Instantiate clientsconstbigquery=newBigQuery();conststorage=newStorage();/*** This sample loads the CSV file at* https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.csv** TODO(developer): Replace the following lines with the path to your file.*/constbucketName='cloud-samples-data';constfilename='bigquery/us-states/us-states.csv';asyncfunctionloadCSVFromGCS(){// Imports a GCS file into a table with manually defined schema./*** TODO(developer): Uncomment the following lines before running the sample.*/// const datasetId = 'my_dataset';// const tableId = 'my_table';// Configure the load job. For full list of options, see:// https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoadconstmetadata={sourceFormat:'CSV',skipLeadingRows:1,schema:{fields:[{name:'name',type:'STRING'},{name:'post_abbr',type:'STRING'},],},location:'US',};// Load data from a Google Cloud Storage file into the tableconst[job]=awaitbigquery.dataset(datasetId).table(tableId).load(storage.bucket(bucketName).file(filename),metadata);// load() waits for the job to finishconsole.log(`Job${job.id}completed.`);// Check the job's status for errorsconsterrors=job.status.errors;if(errors&&errors.length>0){throwerrors;}}
use Google\Cloud\BigQuery\BigQueryClient;use Google\Cloud\Core\ExponentialBackoff;/** Uncomment and populate these variables in your code */// $projectId = 'The Google project ID';// $datasetId = 'The BigQuery dataset ID';// instantiate the bigquery table service$bigQuery = new BigQueryClient(['projectId' => $projectId,]);$dataset = $bigQuery->dataset($datasetId);$table = $dataset->table('us_states');// create the import job$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv';$schema = ['fields' => [['name' => 'name', 'type' => 'string'],['name' => 'post_abbr', 'type' => 'string']]];$loadConfig = $table->loadFromStorage($gcsUri)->schema($schema)->skipLeadingRows(1);$job = $table->runJob($loadConfig);// poll the job until it is complete$backoff = new ExponentialBackoff(10);$backoff->execute(function () use ($job) {print('Waiting for job to complete' . PHP_EOL);$job->reload();if (!$job->isComplete()) {throw new Exception('Job has not yet completed', 500);}});// check if the job has errorsif (isset($job->info()['status']['errorResult'])) {$error = $job->info()['status']['errorResult']['message'];printf('Error running job: %s' . PHP_EOL, $error);} else {print('Data imported successfully' . PHP_EOL);}
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
    skip_leading_rows=1,
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)  # Make an API request.
print("Loaded {} rows.".format(destination_table.num_rows))
require "google/cloud/bigquery"

def load_table_gcs_csv dataset_id = "your_dataset_id"
  bigquery = Google::Cloud::Bigquery.new
  dataset  = bigquery.dataset dataset_id
  gcs_uri  = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
  table_id = "us_states"

  load_job = dataset.load_job table_id, gcs_uri, skip_leading: 1 do |schema|
    schema.string "name"
    schema.string "post_abbr"
  end
  puts "Starting job #{load_job.job_id}"

  load_job.wait_until_done! # Waits for table load to complete.
  puts "Job finished."

  table = dataset.table table_id
  puts "Loaded #{table.rows_count} rows to table #{table.id}"
end
Loading CSV data into a table that uses column-based time partitioning
To load CSV data from Cloud Storage into a BigQuery table
that uses column-based time partitioning:
import (
	"context"
	"fmt"
	"time"

	"cloud.google.com/go/bigquery"
)

// importPartitionedTable demonstrates specifying time partitioning for a BigQuery table when loading
// CSV data from Cloud Storage.
func importPartitionedTable(projectID, destDatasetID, destTableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states-by-date.csv")
	gcsRef.SkipLeadingRows = 1
	gcsRef.Schema = bigquery.Schema{
		{Name: "name", Type: bigquery.StringFieldType},
		{Name: "post_abbr", Type: bigquery.StringFieldType},
		{Name: "date", Type: bigquery.DateFieldType},
	}
	loader := client.Dataset(destDatasetID).Table(destTableID).LoaderFrom(gcsRef)
	loader.TimePartitioning = &bigquery.TimePartitioning{
		Field:      "date",
		Expiration: 90 * 24 * time.Hour,
	}
	loader.WriteDisposition = bigquery.WriteEmpty

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
importcom.google.cloud.bigquery.BigQuery;importcom.google.cloud.bigquery.BigQueryException;importcom.google.cloud.bigquery.BigQueryOptions;importcom.google.cloud.bigquery.Field;importcom.google.cloud.bigquery.FormatOptions;importcom.google.cloud.bigquery.Job;importcom.google.cloud.bigquery.JobId;importcom.google.cloud.bigquery.JobInfo;importcom.google.cloud.bigquery.LoadJobConfiguration;importcom.google.cloud.bigquery.Schema;importcom.google.cloud.bigquery.StandardSQLTypeName;importcom.google.cloud.bigquery.TableId;importcom.google.cloud.bigquery.TimePartitioning;importjava.time.Duration;importjava.time.temporal.ChronoUnit;importjava.util.UUID;publicclassLoadPartitionedTable{publicstaticvoidrunLoadPartitionedTable()throwsException{// TODO(developer): Replace these variables before running the sample.StringdatasetName="MY_DATASET_NAME";StringtableName="MY_TABLE_NAME";StringsourceUri="/path/to/file.csv";loadPartitionedTable(datasetName,tableName,sourceUri);}publicstaticvoidloadPartitionedTable(StringdatasetName,StringtableName,StringsourceUri)throwsException{try{// Initialize client that will be used to send requests. This client only needs to be created// once, and can be reused for multiple requests.BigQuerybigquery=BigQueryOptions.getDefaultInstance().getService();TableIdtableId=TableId.of(datasetName,tableName);Schemaschema=Schema.of(Field.of("name",StandardSQLTypeName.STRING),Field.of("post_abbr",StandardSQLTypeName.STRING),Field.of("date",StandardSQLTypeName.DATE));// Configure time partitioning. For full list of options, see:// https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#TimePartitioningTimePartitioningpartitioning=TimePartitioning.newBuilder(TimePartitioning.Type.DAY).setField("date").setExpirationMs(Duration.of(90,ChronoUnit.DAYS).toMillis()).build();LoadJobConfigurationloadJobConfig=LoadJobConfiguration.builder(tableId,sourceUri).setFormatOptions(FormatOptions.csv()).setSchema(schema).setTimePartitioning(partitioning).build();// Create a job ID so that we can safely retry.JobIdjobId=JobId.of(UUID.randomUUID().toString());JobloadJob=bigquery.create(JobInfo.newBuilder(loadJobConfig).setJobId(jobId).build());// Load data from a GCS parquet file into the table// Blocks until this load table job completes its execution, either failing or succeeding.JobcompletedJob=loadJob.waitFor();// Check for errorsif(completedJob==null){thrownewException("Job not executed since it no longer exists.");}elseif(completedJob.getStatus().getError()!=null){// You can also look at queryJob.getStatus().getExecutionErrors() for all// errors, not just the latest one.thrownewException("BigQuery was unable to load into the table due to an error: \n"+loadJob.getStatus().getError());}System.out.println("Data successfully loaded into time partitioned table during load job");}catch(BigQueryException|InterruptedExceptione){System.out.println("Data not loaded into time partitioned table during load job \n"+e.toString());}}}
// Import the Google Cloud client librariesconst{BigQuery}=require('@google-cloud/bigquery');const{Storage}=require('@google-cloud/storage');// Instantiate clientsconstbigquery=newBigQuery();conststorage=newStorage();/*** This sample loads the CSV file at* https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.csv** TODO(developer): Replace the following lines with the path to your file.*/constbucketName='cloud-samples-data';constfilename='bigquery/us-states/us-states-by-date.csv';asyncfunctionloadTablePartitioned(){// Load data into a table that uses column-based time partitioning./*** TODO(developer): Uncomment the following lines before running the sample.*/// const datasetId = 'my_dataset';// const tableId = 'my_new_table';// Configure the load job. For full list of options, see:// https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoadconstpartitionConfig={type:'DAY',expirationMs:'7776000000',// 90 daysfield:'date',};constmetadata={sourceFormat:'CSV',skipLeadingRows:1,schema:{fields:[{name:'name',type:'STRING'},{name:'post_abbr',type:'STRING'},{name:'date',type:'DATE'},],},location:'US',timePartitioning:partitionConfig,};// Load data from a Google Cloud Storage file into the tableconst[job]=awaitbigquery.dataset(datasetId).table(tableId).load(storage.bucket(bucketName).file(filename),metadata);// load() waits for the job to finishconsole.log(`Job${job.id}completed.`);}
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
        bigquery.SchemaField("date", "DATE"),
    ],
    skip_leading_rows=1,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="date",  # Name of the column to use for partitioning.
        expiration_ms=7776000000,  # 90 days.
    ),
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states-by-date.csv"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Wait for the job to complete.

table = client.get_table(table_id)
print("Loaded {} rows to table {}".format(table.num_rows, table_id))
Appending to or overwriting a table with CSV data
You can load additional data into a table either from source files or by
appending query results.
In the Google Cloud console, use the Write preference option to specify
what action to take when you load data from a source file or from a query
result.
You have the following options when you load additional data into a table:
Write if empty: bq tool flag: not supported; BigQuery API property: WRITE_EMPTY.
Writes the data only if the table is empty.
Append to table: bq tool flag: --noreplace or --replace=false (if --[no]replace is unspecified, the default is append); BigQuery API property: WRITE_APPEND (default).
Appends the data to the end of the table.
Overwrite table: bq tool flag: --replace or --replace=true; BigQuery API property: WRITE_TRUNCATE.
Erases all existing data in a table before writing the new data.
This action also deletes the table schema and row-level security, and removes any
Cloud KMS key.
If you load data into an existing table, the load job can append the data or
overwrite the table.
Console
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, expand your project, and then select a dataset.
In the Dataset info section, click Create table.
In the Create table panel, specify the following details:
In the Source section, select Google Cloud Storage in the Create table from list.
Then, do the following:
Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI.
You cannot include multiple URIs
in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same
location as the dataset that contains the table you want to create, append, or
overwrite.
For File format, select CSV.
In the Destination section, specify the following details:
For Dataset, select the dataset in which you want to create the
table.
In the Table field, enter the name of the table that you want to create.
Verify that the Table type field is set to Native table.
In the Schema section, enter the schema definition.
To enable the auto detection of a schema,
select Auto detect.
You can enter schema information manually by using one of
the following methods:
Option 1: Click Edit as text and paste the schema in the form of a
JSON array. When you use a JSON array, you generate the schema using the
same process as creating a JSON schema file.
You can view the schema of an existing table in JSON format by entering the following
command:
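As in the earlier section, a sketch of the bq command-line tool command that prints a table's schema as JSON (replace the dataset and table names with your own):

bq show --format=prettyjson dataset.table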
Option 2: Click Add field and enter the table schema. Specify each field's Name, Type,
and Mode.
Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables.
You
cannot convert a table to a partitioned or clustered table by appending or
overwriting it. The Google Cloud console does not support
appending to or overwriting partitioned or clustered tables in a load job.
Click Advanced options and do the following:
For Write preference, choose Append to table or Overwrite
table.
For Number of errors allowed, accept the default value of 0 or
enter the maximum number of rows containing errors that can be ignored.
If the number of rows with errors exceeds this value, the job will
result in an invalid message and fail. This option applies only to CSV and
JSON files.
For Time zone, enter the default time zone to apply when parsing timestamp
values that have no specific time zone. See the list of valid time zone
names for more information. If this value is not present, timestamp values without a specific
time zone are parsed using the default time zone, UTC.
(Preview).
For Date Format, enter the format elements that define how the DATE values are formatted in the input files. This
field expects SQL-style format elements (for example, MM/DD/YYYY). If this value is
present, this format is the only compatible DATE format. Schema autodetection also
decides the DATE column type based on this format instead of the existing format. If this
value is not present, the DATE field is parsed with the default formats.
(Preview).
For Datetime Format, enter the format elements that define how the DATETIME values are formatted in the input files.
This field expects SQL-style format elements (for example, MM/DD/YYYY HH24:MI:SS.FF3).
If this value is present, this format is the only compatible DATETIME format. Schema autodetection also
decides the DATETIME column type based on this format instead of the existing format. If
this value is not present, the DATETIME field is parsed with the default formats.
(Preview).
For Time Format, enter the format elements that define how the TIME values are formatted in the input files. This
field expects SQL-style format elements (for example, HH24:MI:SS.FF3). If this value is
present, this format is the only compatible TIME format. Schema autodetection also
decides the TIME column type based on this format instead of the existing format. If this
value is not present, the TIME field is parsed with the default formats.
(Preview).
For Timestamp Format, enter the format elements that define how the TIMESTAMP values are formatted in the input files.
This field expects SQL-style format elements (for example, MM/DD/YYYY HH24:MI:SS.FF3).
If this value is present, this format is the only compatible TIMESTAMP format. Schema autodetection also
decides the TIMESTAMP column type based on this format instead of the existing format. If
this value is not present, the TIMESTAMP field is parsed with the default formats.
(Preview).
If you want to ignore values in a row that are not present in the table's schema, then
select Unknown values.
For Field delimiter, choose the character that separates the cells
in your CSV file: Comma, Tab, Pipe, or Custom. If you
choose Custom, enter the delimiter in the Custom field delimiter box. The default value is Comma.
For Source column match, choose one of the following strategies
used to match the loaded columns to the schema (Preview).
Default: Default behavior is chosen based on how the schema is provided.
If autodetect is enabled, then the default behavior is to match columns by name.
Otherwise, the default is to match columns by position. This is done to keep the
behavior backward-compatible.
Position: Matches columns by position, assuming that the columns are
ordered the same way as the schema.
Name: Matches by name by reading the header row as the column names and
reordering columns to match the field names in the schema. Column names are read from
the last skipped row based on Header rows to skip.
For Header rows to skip, enter the number of header rows to skip
at the top of the CSV file. The default value is 0.
For Quoted newlines, check Allow quoted newlines to allow
quoted data sections that contain newline characters in a CSV file. The
default value is false.
For Jagged rows, check Allow jagged rows to accept rows in CSV
files that are missing trailing optional columns. The missing values are
treated as nulls. If unchecked, records with missing trailing columns
are treated as bad records, and if there are too many bad records, an
invalid error is returned in the job result. The default value is false.
For Null markers, enter a list of custom strings that represent NULL values
in CSV data. (Preview).
Use the bq load command, specify CSV using the --source_format flag, and include a Cloud Storage URI.
You can include a single URI, a comma-separated list of URIs, or a URI
containing a wildcard.
Supply the schema inline, in a schema definition file, or use schema auto-detect. If you don't specify a
schema, and --autodetect is false, and the destination
table exists, then the schema of the destination table is used.
Specify the --replace flag to overwrite the
table. Use the --noreplace flag to append data to the table. If no flag is
specified, the default is to append data.
It is possible to modify the table's schema when you append or
overwrite it. For more information on supported schema changes during a load
operation, see Modifying table schemas.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
--allow_jagged_rows: When specified, accepts rows in CSV files that are
missing trailing optional columns. The missing values are treated as nulls.
Otherwise, records with missing trailing columns are treated as bad
records, and if there are too many bad records, an invalid error is returned
in the job result. The default value is false.
--allow_quoted_newlines: When specified, allows quoted data sections
that contain newline characters in a CSV file. The default value is false.
--field_delimiter: The character that indicates the boundary between
columns in the data. Both \t and tab are allowed for tab delimiters.
The default value is ,.
--null_marker: An optional custom string that represents a NULL value in
CSV data.
--null_markers: (Preview) An optional
comma-separated list of custom strings that represent NULL values in CSV
data. This option cannot be used with the --null_marker flag.
--source_column_match: (Preview)
Specifies the strategy used to match loaded columns to the schema. You can
specify POSITION to match loaded columns by position, assuming that the
columns are ordered the same way as the schema. You can also specify NAME to match by name by reading the header row as the column names and
reordering columns to match the field names in the schema. If this value is
unspecified, then the default is based on how the schema is provided. If --autodetect is enabled, then the default behavior is to match columns by
name. Otherwise, the default is to match columns by position.
--skip_leading_rows: Specifies the number of header rows to skip
at the top of the CSV file. The default value is 0.
--quote: The quote character to use to enclose records. The default
value is ". To indicate no quote character, use an empty string.
--max_bad_records: An integer that specifies the maximum number of bad
records allowed before the entire job fails. The default value is 0. At
most, five errors of any type are returned regardless of the --max_bad_records value.
--ignore_unknown_values: When specified, allows and ignores extra,
unrecognized values in CSV or JSON data.
--time_zone: (Preview) An optional
default time zone that will apply when parsing timestamp values that have no
specific time zone in CSV or JSON data.
--date_format: (Preview) An optional
custom string that defines how the DATE values are formatted in CSV or JSON
data.
--datetime_format: (Preview) An
optional custom string that defines how the DATETIME values are formatted in
CSV or JSON data.
--time_format: (Preview) An optional
custom string that defines how the TIME values are formatted in CSV or JSON
data.
--timestamp_format: (Preview) An
optional custom string that defines how the TIMESTAMP values are formatted
in CSV or JSON data.
--autodetect: When specified, enable schema auto-detection for CSV and
JSON data.
--destination_kms_key: The Cloud KMS key for encryption of the
table data.
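The general form of the command is sketched below; include --replace or --noreplace depending on whether you want to overwrite or append, and see the list that follows for the placeholders:

bq --location=location load \
    --[no]replace \
    --source_format=format \
    dataset.table \
    path_to_source \
    schema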
location is your location.
The --location flag is optional. You can set a default value for the
location by using the .bigqueryrc file.
format is CSV.
dataset is an existing dataset.
table is the name of the table into which you're loading data.
path_to_source is a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
schema is a valid schema. The schema can be a local JSON file,
or it can be typed inline as part of the command. You can also use the --autodetect flag instead of supplying a schema definition.
Examples:
The following command loads data from gs://mybucket/mydata.csv and
overwrites a table named mytable in mydataset. The schema is defined
using schema auto-detection.
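A sketch of the corresponding command, using --replace to overwrite the table and --autodetect in place of a schema:

bq load \
    --autodetect \
    --replace \
    --source_format=CSV \
    mydataset.mytable \
    gs://mybucket/mydata.csv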
The following command loads data from gs://mybucket/mydata.csv and
appends data to a table named mytable in mydataset. The schema is
defined using a JSON schema file named myschema.json.
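A sketch of the corresponding command, using --noreplace to append to the table:

bq load \
    --noreplace \
    --source_format=CSV \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    ./myschema.json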
Create a load job that points to the source data in Cloud Storage.
(Optional) Specify your location in
the location property in the jobReference section of the job resource.
The source URIs property
must be fully qualified, in the format gs://bucket/object. You can
include multiple URIs as a comma-separated list. Note that wildcards are
also supported.
Specify the data format by setting the configuration.load.sourceFormat property to CSV.
Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.
import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importCSVTruncate demonstrates loading data from CSV data in Cloud Storage and overwriting/truncating
// data in the existing table.
func importCSVTruncate(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.csv")
	gcsRef.SourceFormat = bigquery.CSV
	gcsRef.AutoDetect = true
	gcsRef.SkipLeadingRows = 1
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
	loader.WriteDisposition = bigquery.WriteTruncate

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
importcom.google.cloud.bigquery.BigQuery;importcom.google.cloud.bigquery.BigQueryException;importcom.google.cloud.bigquery.BigQueryOptions;importcom.google.cloud.bigquery.FormatOptions;importcom.google.cloud.bigquery.Job;importcom.google.cloud.bigquery.JobInfo;importcom.google.cloud.bigquery.JobInfo.WriteDisposition;importcom.google.cloud.bigquery.LoadJobConfiguration;importcom.google.cloud.bigquery.TableId;// Sample to overwrite the BigQuery table data by loading a CSV file from GCSpublicclassLoadCsvFromGcsTruncate{publicstaticvoidrunLoadCsvFromGcsTruncate()throwsException{// TODO(developer): Replace these variables before running the sample.StringdatasetName="MY_DATASET_NAME";StringtableName="MY_TABLE_NAME";StringsourceUri="gs://cloud-samples-data/bigquery/us-states/us-states.csv";loadCsvFromGcsTruncate(datasetName,tableName,sourceUri);}publicstaticvoidloadCsvFromGcsTruncate(StringdatasetName,StringtableName,StringsourceUri)throwsException{try{// Initialize client that will be used to send requests. This client only needs to be created// once, and can be reused for multiple requests.BigQuerybigquery=BigQueryOptions.getDefaultInstance().getService();TableIdtableId=TableId.of(datasetName,tableName);LoadJobConfigurationconfiguration=LoadJobConfiguration.builder(tableId,sourceUri).setFormatOptions(FormatOptions.csv())// Set the write disposition to overwrite existing table data.setWriteDisposition(WriteDisposition.WRITE_TRUNCATE).build();// For more information on Job see:// https://googleapis.dev/java/google-cloud-clients/latest/index.html?com/google/cloud/bigquery/package-summary.html// Load the tableJobloadJob=bigquery.create(JobInfo.of(configuration));// Load data from a GCS parquet file into the table// Blocks until this load table job completes its execution, either failing or succeeding.JobcompletedJob=loadJob.waitFor();// Check for errorsif(completedJob==null){thrownewException("Job not executed since it no longer exists.");}elseif(completedJob.getStatus().getError()!=null){// You can also look at queryJob.getStatus().getExecutionErrors() for all// errors, not just the latest one.thrownewException("BigQuery was unable to load into the table due to an error: \n"+loadJob.getStatus().getError());}System.out.println("Table is successfully overwritten by CSV file loaded from GCS");}catch(BigQueryException|InterruptedExceptione){System.out.println("Column not added during load append \n"+e.toString());}}}
To replace the rows in an existing table, set the writeDisposition value in the metadata parameter to 'WRITE_TRUNCATE'.
// Import the Google Cloud client librariesconst{BigQuery}=require('@google-cloud/bigquery');const{Storage}=require('@google-cloud/storage');// Instantiate clientsconstbigquery=newBigQuery();conststorage=newStorage();/*** This sample loads the CSV file at* https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.csv** TODO(developer): Replace the following lines with the path to your file.*/constbucketName='cloud-samples-data';constfilename='bigquery/us-states/us-states.csv';asyncfunctionloadCSVFromGCSTruncate(){/*** Imports a GCS file into a table and overwrites* table data if table already exists.*//*** TODO(developer): Uncomment the following lines before running the sample.*/// const datasetId = 'my_dataset';// const tableId = 'my_table';// Configure the load job. For full list of options, see:// https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoadconstmetadata={sourceFormat:'CSV',skipLeadingRows:1,schema:{fields:[{name:'name',type:'STRING'},{name:'post_abbr',type:'STRING'},],},// Set the write disposition to overwrite existing table data.writeDisposition:'WRITE_TRUNCATE',location:'US',};// Load data from a Google Cloud Storage file into the tableconst[job]=awaitbigquery.dataset(datasetId).table(tableId).load(storage.bucket(bucketName).file(filename),metadata);// load() waits for the job to finishconsole.log(`Job${job.id}completed.`);// Check the job's status for errorsconsterrors=job.status.errors;if(errors&&errors.length>0){throwerrors;}}
use Google\Cloud\BigQuery\BigQueryClient;use Google\Cloud\Core\ExponentialBackoff;/** Uncomment and populate these variables in your code */// $projectId = 'The Google project ID';// $datasetId = 'The BigQuery dataset ID';// $tableId = 'The BigQuery table ID';// instantiate the bigquery table service$bigQuery = new BigQueryClient(['projectId' => $projectId,]);$table = $bigQuery->dataset($datasetId)->table($tableId);// create the import job$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv';$loadConfig = $table->loadFromStorage($gcsUri)->skipLeadingRows(1)->writeDisposition('WRITE_TRUNCATE');$job = $table->runJob($loadConfig);// poll the job until it is complete$backoff = new ExponentialBackoff(10);$backoff->execute(function () use ($job) {print('Waiting for job to complete' . PHP_EOL);$job->reload();if (!$job->isComplete()) {throw new Exception('Job has not yet completed', 500);}});// check if the job has errorsif (isset($job->info()['status']['errorResult'])) {$error = $job->info()['status']['errorResult']['message'];printf('Error running job: %s' . PHP_EOL, $error);} else {print('Data imported successfully' . PHP_EOL);}
import six

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
)

body = six.BytesIO(b"Washington,WA")
client.load_table_from_file(body, table_id, job_config=job_config).result()
previous_rows = client.get_table(table_id).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
Loading hive-partitioned CSV data
BigQuery supports loading hive-partitioned CSV data stored on
Cloud Storage and will populate the hive partitioning columns as columns in
the destination BigQuery managed table. For more information, see Loading Externally Partitioned Data from Cloud Storage.
Details of loading CSV data
This section describes how BigQuery handles various CSV formatting
options.
Encoding
BigQuery expects CSV data to be UTF-8 encoded. If you have
CSV files with other supported encoding types, you should explicitly specify the
encoding so that BigQuery can properly convert the data to UTF-8.
BigQuery supports the following encoding types for CSV files:
UTF-8
ISO-8859-1
UTF-16BE (UTF-16 Big Endian)
UTF-16LE (UTF-16 Little Endian)
UTF-32BE (UTF-32 Big Endian)
UTF-32LE (UTF-32 Little Endian)
If you don't specify an encoding, or if you specify UTF-8 encoding when the CSV
file is not UTF-8 encoded, BigQuery attempts to convert the data
to UTF-8. Generally, if the CSV file is ISO-8859-1 encoded, your data will be
loaded successfully, but it may not exactly match what you expect. If the CSV
file is UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE encoded, the load might fail.
To avoid unexpected failures, specify the correct encoding by using the --encoding flag.
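For example, a sketch of a load command for an ISO-8859-1 encoded file; the dataset, table, and file names are illustrative:

bq load \
    --source_format=CSV \
    --encoding=ISO-8859-1 \
    mydataset.mytable \
    gs://mybucket/mydata.csv \
    ./myschema.json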
If BigQuery can't convert a character other than the ASCII 0 character, BigQuery converts the character to the standard
Unicode replacement character: �.
Field delimiters
Delimiters in CSV files can be any single-byte character. If the source file
uses ISO-8859-1 encoding, any character can be a delimiter. If the source file
uses UTF-8 encoding, any character in the decimal range 1-127 (U+0001-U+007F)
can be used without modification. You can insert an ISO-8859-1 character outside
of this range as a delimiter, and BigQuery will interpret it
correctly. However, if you use a multibyte character as a delimiter, some of the
bytes will be interpreted incorrectly as part of the field value.
Generally, it's a best practice to use a standard delimiter, such as a tab,
pipe, or comma. The default is a comma.
Data types
Boolean. BigQuery can parse any of the following pairs for
Boolean data: 1 or 0, true or false, t or f, yes or no, or y or n (all case
insensitive). Schema autodetection automatically detects any of these except 0 and 1.
Bytes. Columns with BYTES types must be encoded as Base64.
Date. Columns with DATE types must be in the format YYYY-MM-DD.
Datetime. Columns with DATETIME types must be in the format YYYY-MM-DD
HH:MM:SS[.SSSSSS].
Geography. Columns with GEOGRAPHY types must contain strings in one of the
following formats:
Interval. Columns with INTERVAL types must be in the format Y-M D H:M:S[.F], where:
Y = Year. Supported range is 0-10,000.
M = Month. Supported range is 1-12.
D = Day. Supported range is 1-[last day of the indicated month].
H = Hour.
M = Minute.
S = Second.
[.F] = Fractions of a second up to six digits, with microsecond precision.
You can indicate a negative value by prepending a dash (-).
The following list shows examples of valid data:
10-6 0 0:0:0
0-0 -5 0:0:0
0-0 0 0:0:1.25
To load INTERVAL data, you must use the bq load command and use
the --schema flag to specify a schema. You can't upload INTERVAL data by using
the console.
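For example, a sketch of a load command with an inline schema that includes an INTERVAL column; the field names and file path here are hypothetical, and if the inline schema syntax rejects the INTERVAL type in your environment, use a JSON schema file instead:

bq load \
    --source_format=CSV \
    --schema=id:INTEGER,duration:INTERVAL \
    mydataset.mytable \
    gs://mybucket/intervals.csv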
JSON. Quotes are escaped by using the two-character sequence "". For more
information, see an example of loading JSON data from a CSV file.
Time. Columns with TIME types must be in the format HH:MM:SS[.SSSSSS].
Timestamp. BigQuery accepts various timestamp formats.
The timestamp must include a date portion and a time portion.
The date portion can be formatted as YYYY-MM-DD or YYYY/MM/DD.
The timestamp portion must be formatted as HH:MM[:SS[.SSSSSS]] (seconds and
fractions of seconds are optional).
The date and time must be separated by a space or 'T'.
Optionally, the date and time can be followed by a UTC offset or the UTC zone
designator (Z). For more information, see Time zones.
For example, any of the following are valid timestamp values:
2018-08-19 12:11
2018-08-19 12:11:35
2018-08-19 12:11:35.22
2018/08/19 12:11
2018-07-05 12:54:00 UTC
2018-08-19 07:11:35.220 -05:00
2018-08-19T12:11:35.220Z
If you provide a schema, BigQuery also accepts Unix epoch time for
timestamp values. However, schema autodetection doesn't detect this case, and
treats the value as a numeric or string type instead.
Examples of Unix epoch timestamp values:
1534680695
1.534680695e12
RANGE. Represented in CSV files in the format [LOWER_BOUND, UPPER_BOUND),
where LOWER_BOUND and UPPER_BOUND are valid DATE, DATETIME, or TIMESTAMP
strings. NULL and UNBOUNDED represent unbounded start or end values.
The following are examples of CSV values for RANGE<DATE>:
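"[2020-01-01, 2021-01-01)"
"[UNBOUNDED, 2021-01-01)"
"[2020-03-01, UNBOUNDED)"
"[UNBOUNDED, UNBOUNDED)"
These values are illustrative; because they contain a comma, they must be quoted in the CSV file.
As noted in the INTERVAL entry earlier in this list, INTERVAL columns can be loaded only with the bq load command and an explicit schema. A minimal sketch, assuming a placeholder dataset, table, bucket, and a schema file that declares a column of type INTERVAL:
# Hypothetical names; myschema.json declares an INTERVAL column.
bq load \
  --source_format=CSV \
  --schema=./myschema.json \
  mydataset.mytable \
  gs://my-bucket/intervals.csv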
BigQuery infers headers by comparing the first row of the file
with other rows in the file. If the first line contains only strings, and the
other lines contain other data types, BigQuery assumes that the
first row is a header row. BigQuery assigns column names based on the field
names in the header row. The names might be modified to meet the naming rules
for columns in BigQuery. For example, spaces are replaced with underscores.
Otherwise, BigQuery assumes the first row is a data row and assigns generic
column names such as string_field_1. Note that the column names can't be
updated in the schema after the table is created, although you can change the
names manually after the table is created. Another option is to provide an
explicit schema instead of using autodetection.
You might have a CSV file with a header row, where all of the data fields are
strings. In that case, BigQuery will not automatically detect that
the first row is a header. Use the --skip_leading_rows option to skip the
header row. Otherwise, the header will be imported as data. Also consider
providing an explicit schema in this case, so that you can assign column names.
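For example, a minimal sketch that skips one header row and supplies an explicit inline schema; the dataset, table, bucket, and column names are placeholders:
# Hypothetical names; --skip_leading_rows=1 skips the header row, and the
# inline schema assigns the column names.
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  mydataset.mytable \
  gs://my-bucket/mydata.csv \
  name:STRING,city:STRING,notes:STRING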
CSV quoted new lines
BigQuery detects quoted new line characters within a CSV field
and does not interpret the quoted new line character as a row boundary.
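If your files contain quoted newlines, also set the --allow_quoted_newlines flag on the load job; it corresponds to the allowQuotedNewlines property described later in this document. A minimal sketch with placeholder names:
# Hypothetical names; --allow_quoted_newlines lets quoted fields span lines.
bq load \
  --source_format=CSV \
  --allow_quoted_newlines \
  mydataset.mytable \
  gs://my-bucket/mydata.csv \
  ./myschema.json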
Troubleshoot parsing errors
If there's a problem parsing your CSV files, then the
load job's errors resource is populated with the error details.
Generally, these errors identify the start of the problematic line with a byte
offset. For uncompressed files, you can use gcloud storage with the --recursive
argument to access the relevant line.
For example, you run the bq load command and receive an error.
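The original command isn't reproduced in this page; a minimal sketch that is consistent with the error output, using a placeholder dataset name, might look like this:
# Hypothetical dataset name; the bucket and file names come from the error output.
bq load \
  --source_format=CSV \
  --autodetect \
  mydataset.mytable \
  gs://my-bucket/mytable.csv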
The error in the output is similar to the following:
Waiting on bqjob_r5268069f5f49c9bf_0000018632e903d7_1 ... (0s)
Current status: DONE
BigQuery error in load operation: Error processing job
'myproject:bqjob_r5268069f5f49c9bf_0000018632e903d7_1': Error while reading
data, error message: Error detected while parsing row starting at position: 1405.
Error: Data between close quote character (") and field separator.
File: gs://my-bucket/mytable.csv
Failure details:
- gs://my-bucket/mytable.csv: Error while reading data,
  error message: Error detected while parsing row starting at
  position: 1405. Error: Data between close quote character (") and
  field separator. File: gs://my-bucket/mytable.csv
- Error while reading data, error message: CSV processing encountered
  too many errors, giving up. Rows: 22; errors: 1; max bad: 0; error
  percent: 0
Based on the preceding error, there's a format error in the file.
To view the file's content, run the gcloud storage cat command:
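For example, using the file named in the error output:
# Print the object's contents to the terminal.
gcloud storage cat gs://my-bucket/mytable.csv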
Based on the output of the file, the problem is a misplaced quote in "April 15, "1865.
Compressed CSV files
Debugging parsing errors is more challenging for compressed CSV files, since
the reported byte offset refers to the location in the uncompressed file.
The following gcloud storage cat command streams the file from Cloud Storage, decompresses the file, identifies
the appropriate byte offset, and prints the line with the format error:
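The exact command isn't reproduced in this page; a sketch under the assumption that the object is a gzip-compressed copy of the earlier example file, using the byte offset 1405 reported in the error:
# Hypothetical compressed object name. tail -c +1406 starts output at the
# reported byte offset (tail byte positions are 1-based), and head -n 1
# prints the row that begins there.
gcloud storage cat gs://my-bucket/mytable.csv.gz | gunzip - | tail -c +1406 | head -n 1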
CSV options
Field delimiter
--field_delimiter
(Optional) The separator for fields in a CSV file. The separator can be
any ISO-8859-1 single-byte character. BigQuery
converts the string to ISO-8859-1 encoding, and uses the first byte of
the encoded string to split the data in its raw, binary state.
BigQuery also supports the escape sequence "\t" to specify a
tab separator. The default value is a comma (`,`).
Header rows to skip
--skip_leading_rows
(Optional) An integer indicating the number of header rows in the source
data.
Source column match
Source column match: Default, Position, Name
--source_column_match
None
(Preview) (Optional) This
controls the strategy used to match loaded columns to the schema.
Supported values include:
POSITION: matches by position. This option assumes that
the columns are ordered the same way as the schema.
NAME: matches by name. This option reads the header row
as column names and reorders columns to match the field names in the
schema. Column names are read from the last skipped row based on the skipLeadingRows property.
If this value is unspecified, then the default is based on how the schema
is provided. If autodetect is enabled, then the default behavior is to
match columns by name. Otherwise, the default is to match columns by
position. This is done to keep the behavior backward-compatible.
Number of bad records allowed
--max_bad_records
(Optional) The maximum number of bad records that BigQuery
can ignore when running the job. If the number of bad records exceeds this
value, an invalid error is returned in the job result. The default value
is 0, which requires that all records are valid.
Null marker
--null_marker
(Optional) Specifies a string that represents a null value in a CSV file.
For example, if you specify "\N", BigQuery interprets "\N"
as a null value when loading a CSV file. The default value is the empty
string. If you set this property to a custom value, BigQuery
throws an error if an empty string is present for all data types except
for STRING and BYTE. For STRING and BYTE columns, BigQuery
interprets the empty string as an empty value.
Allow jagged rows
--allow_jagged_rows
(Optional) Accept rows that are missing trailing optional columns. The
missing values are treated as nulls. If false, records with missing
trailing columns are treated as bad records, and if there are too many bad
records, an invalid error is returned in the job result. The default value
is false. Only applicable to CSV; ignored for other formats.
Unknown values
--ignore_unknown_values
(Optional) Indicates if BigQuery should allow extra values
that are not represented in the table schema. If true, the extra values
are ignored. If false, records with extra columns are treated as bad
records, and if there are too many bad records, an invalid error is
returned in the job result. The default value is false. The sourceFormat
property determines what BigQuery treats as an extra value:
CSV: Trailing columns
JSON: Named values that don't match any column names
Quote
Quote character: Double quote, Single quote, None, Custom
--quote
(Optional) The value that is used to quote data sections in a CSV file.
BigQuery converts the string to ISO-8859-1 encoding, and
then uses the first byte of the encoded string to split the data in its
raw, binary state. The default value is a double-quote ('"'). If your data
does not contain quoted sections, set the property value to an empty
string. If your data contains quoted newline characters, you must also set
the allowQuotedNewlines property to true. To
include the specific quote character within a quoted value, precede it
with an additional matching quote character. For example, if you want to
escape the default character ' " ', use ' "" '.
Encoding
--encoding
(Optional) The character encoding of the data. The supported values are
UTF-8, ISO-8859-1, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
The default value is UTF-8. BigQuery
decodes the data after the raw, binary data has been split using the
values of the quote and fieldDelimiter properties.
ASCII control character
None
--preserve_ascii_control_characters
None
(Optional) If you want to allow ASCII 0 and other ASCII control
characters, then set --preserve_ascii_control_characters to true in your load jobs.
Null Markers
Null Markers
--null_markers
None
(Preview) (Optional)
A list of custom strings that represent a NULL value in CSV
data. This option can't be used with the --null_marker option.
Time Zone
Time Zone
--time_zone
None
(Preview) (Optional)
The default time zone to apply when parsing timestamp values
that have no specific time zone. Check the list of valid time zone names.
If this value is not present, timestamp values without a specific time
zone are parsed using the default time zone, UTC.
Date Format
Date Format
--date_format
None
(Preview) (Optional) Format elements that define how the DATE values are
formatted in the input files (for example, MM/DD/YYYY). If this value is
present, this format is the only compatible DATE format. Schema autodetection
will also decide the DATE column type based on this format instead of the
existing format. If this value is not present, the DATE field is parsed
with the default formats.
Datetime Format
Datetime Format
--datetime_format
None
(Preview) (Optional) Format elements that define how the DATETIME values are
formatted in the input files (for example, MM/DD/YYYY HH24:MI:SS.FF3). If this
value is present, this format is the only compatible DATETIME format. Schema
autodetection will also decide the DATETIME column type based on this format
instead of the existing format. If this value is not present, the DATETIME
field is parsed with the default formats.
Time Format
Time Format
--time_format
None
(Preview) (Optional) Format elements that define how the TIME values are
formatted in the input files (for example, HH24:MI:SS.FF3). If this value is
present, this format is the only compatible TIME format. Schema autodetection
will also decide the TIME column type based on this format instead of the
existing format. If this value is not present, the TIME field is parsed
with the default formats.
Timestamp Format
Timestamp Format
--timestamp_format
None
(Preview) (Optional) Format elements that define how the TIMESTAMP values are
formatted in the input files (for example, MM/DD/YYYY HH24:MI:SS.FF3). If this
value is present, this format is the only compatible TIMESTAMP format. Schema
autodetection will also decide the TIMESTAMP column type based on this format
instead of the existing format. If this value is not present, the TIMESTAMP
field is parsed with the default formats.
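Many of the bq tool flags described in this section can be combined in a single load job. The following sketch uses placeholder names and shows a custom null marker, tolerance for jagged rows, and ignoring unknown trailing columns:
# Hypothetical names; each flag corresponds to an option described above.
bq load \
  --source_format=CSV \
  --null_marker="\N" \
  --allow_jagged_rows \
  --ignore_unknown_values \
  --max_bad_records=10 \
  mydataset.mytable \
  gs://my-bucket/mydata.csv \
  ./myschema.json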
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-03 UTC."],[[["\u003cp\u003eLoading CSV data from Cloud Storage into BigQuery requires the dataset and storage bucket to be in the same regional or multi-regional location, with considerations for data format, compression, and schema definition.\u003c/p\u003e\n"],["\u003cp\u003eBigQuery offers several methods for loading CSV data, including the Google Cloud console, SQL's \u003ccode\u003eLOAD DATA\u003c/code\u003e statement, the \u003ccode\u003ebq\u003c/code\u003e command-line tool, and the BigQuery API, each with unique options for schema, error handling, and data transformation.\u003c/p\u003e\n"],["\u003cp\u003ePermissions for loading data include specific BigQuery table and Cloud Storage permissions, allowing users to create, update, and manage data in tables, as well as access data in Cloud Storage buckets.\u003c/p\u003e\n"],["\u003cp\u003eData can be loaded into new or existing BigQuery tables with options to append, overwrite, or only write if the table is empty, with specific write preferences available through the console, SQL, \u003ccode\u003ebq\u003c/code\u003e, and API methods.\u003c/p\u003e\n"],["\u003cp\u003eVarious formatting considerations are necessary for CSV data, including encoding, delimiters, date/time formats, handling null values, and escaping quotes, each manageable through specific configurations in the loading process.\u003c/p\u003e\n"]]],[],null,[]]