Loading Parquet data from Cloud Storage
This page provides an overview of loading Parquet data from Cloud Storage
into BigQuery.
Parquet is an
open source column-oriented data format that is widely used in the
Apache Hadoop ecosystem.
When you load Parquet data from Cloud Storage, you can load the data into a
new table or partition, or you can append to or overwrite an existing table or
partition. When your data is loaded into BigQuery, it is
converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table,
the dataset that contains the table must be in the same regional or
multi-regional location as the Cloud Storage bucket.
You are subject to the following limitations when you load data into
BigQuery from a Cloud Storage bucket:
BigQuery does not guarantee data consistency for external data
sources. Changes to the underlying data while a query is running can result in
unexpected behavior.
BigQuery doesn't support Cloud Storage object versioning. If you
include a generation number in the Cloud Storage URI, then the load job
fails.
You can't use a wildcard in the Cloud Storage URI if any of the
files to be loaded have different schemas. Any difference in the position of
columns qualifies as a different schema.
Input file requirements
To avoid resourcesExceeded errors when loading Parquet files into
BigQuery, follow these guidelines:
Keep row sizes to 50 MB or less.
If your input data contains more than 100 columns, consider reducing the page
size to be smaller than the default page size (1 * 1024 * 1024 bytes). This
is especially helpful if you are using significant compression.
For optimal performance, aim for row group sizes of at least 16 MiB.
Smaller row group sizes increase I/O and slow down loads and queries.
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary
permissions to perform each task in this document, and create a dataset
to store your data.
Required permissions
To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.
Permissions to load data into BigQuery
To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:
bigquery.tables.create
bigquery.tables.updateData
bigquery.tables.update
bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:
To get the permissions that
you need to load data from a Cloud Storage bucket,
ask your administrator to grant you the Storage Admin (roles/storage.admin) IAM role on the bucket.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains
the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are
required, expand the Required permissions section:
Required permissions
The following permissions are required to load data from a Cloud Storage bucket:
storage.buckets.get
storage.objects.get
storage.objects.list (required if you are using a URI wildcard)
When you load Parquet files into BigQuery, the table schema is
automatically retrieved from the self-describing source data. When
BigQuery retrieves the schema from the source data, the
alphabetically last file is used.
For example, you have the following Parquet files in Cloud Storage:
Running this command in the bq command-line tool loads all of the files (as a
comma-separated list), and the schema is derived from mybucket/01/b.parquet:
When you load multiple Parquet files that have different schemas, identical
columns specified in multiple schemas must have the same mode in each schema definition.
When BigQuery detects the schema, some Parquet data types are
converted to BigQuery data types to make them compatible with
GoogleSQL syntax. For more information, see Parquet conversions.
To provide a table schema for creating external tables, set the referenceFileSchemaUri property in the BigQuery API, or the --reference_file_schema_uri parameter in the bq command-line tool,
to the URL of the reference file.
For example, --reference_file_schema_uri="gs://mybucket/schema.parquet".
Parquet compression
BigQuery supports the following compression codecs for
Parquet file contents:
GZip
LZO_1C
LZO_1X
LZ4_RAW
Snappy
ZSTD
Loading Parquet data into a new table
You can load Parquet data into a new table by using one of the following:
The Google Cloud console
The bq command-line tool's bq load command
The jobs.insert API method, configuring a load job
The client libraries
To load Parquet data from Cloud Storage into a new BigQuery
table:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.math.BigInteger;

public class LoadParquet {

  public static void runLoadParquet() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    loadParquet(datasetName);
  }

  public static void loadParquet(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be
      // created once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet";
      TableId tableId = TableId.of(datasetName, "us_states");
      LoadJobConfiguration configuration =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.parquet())
              .build();

      // For more information on Job see:
      // https://googleapis.dev/java/google-cloud-clients/latest/index.html?com/google/cloud/bigquery/package-summary.html
      // Load the table
      Job job = bigquery.create(JobInfo.of(configuration));

      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = job.waitFor();
      if (completedJob == null) {
        System.out.println("Job not executed since it no longer exists.");
        return;
      } else if (completedJob.getStatus().getError() != null) {
        System.out.println(
            "BigQuery was unable to load the table due to an error: \n"
                + job.getStatus().getError());
        return;
      }

      // Check number of rows loaded into the table
      BigInteger numRows = bigquery.getTable(tableId).getNumRows();
      System.out.printf("Loaded %d rows. \n", numRows);
      System.out.println("GCS parquet loaded successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("GCS Parquet was not loaded. \n" + e.toString());
    }
  }
}
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the Parquet file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.parquet';

async function loadTableGCSParquet() {
  // Imports a GCS file into a table with Parquet source format.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'PARQUET',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('PARQUET');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
Appending to or overwriting a table with Parquet data
You can load additional data into a table either from source files or by
appending query results.
In the Google Cloud console, use the Write preference option to specify
what action to take when you load data from a source file or from a query
result.
You have the following options when you load additional data into a table:
Write if empty
  bq tool flag: not supported
  BigQuery API property: WRITE_EMPTY
  Writes the data only if the table is empty.

Append to table (default)
  bq tool flag: --noreplace or --replace=false; if --[no]replace is unspecified, the default is append
  BigQuery API property: WRITE_APPEND
  Appends the data to the end of the table.

Overwrite table
  bq tool flag: --replace or --replace=true
  BigQuery API property: WRITE_TRUNCATE
  Erases all existing data in a table before writing the new data. This action
  also deletes the table schema and row-level security, and removes any
  Cloud KMS key.
If you load data into an existing table, the load job can append the data or
overwrite the table.
You can append or overwrite a table by using the Google Cloud console, the
bq command-line tool's bq load command, the jobs.insert API method, or the
client libraries.
BigQuery supports loading hive-partitioned Parquet data stored on
Cloud Storage and populates the hive partitioning columns as columns in
the destination BigQuery managed table. For more information, see Loading externally partitioned data.
Parquet conversions
This section describes how BigQuery parses various data types when loading Parquet data.
Some Parquet data types (such as INT32, INT64, BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY) can be converted into multiple BigQuery data types. To ensure BigQuery converts the Parquet data types correctly, specify the appropriate data type in the Parquet file.
For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:
optional int32 date_col (DATE);
BigQuery converts Parquet data types to the
BigQuery data types that are described in the following sections.
Nested groups are converted into STRUCT types.
Other combinations of Parquet types and converted types are not supported.
Unsigned logical types
The Parquet UINT_8, UINT_16, UINT_32, and UINT_64 types are unsigned.
BigQuery treats values with these types as unsigned when loading them into a
signed BigQuery INTEGER column. In the case of UINT_64, an error is returned
if the unsigned value exceeds the maximum INTEGER value of
9,223,372,036,854,775,807.
Decimal logical type
Decimal logical types can be converted to NUMERIC, BIGNUMERIC, or STRING types. The converted type depends
on the precision and scale parameters of the decimal logical type and the
specified decimal target types. You specify the decimal target types in the
load job configuration.
You can enable schema inference for Parquet LIST logical types. BigQuery
checks whether the LIST node is in the standard form or in one of the forms described by the backward-compatibility rules:
// standard form
<optional | required> group <name> (LIST) {
  repeated group list {
    <optional | required> <element-type> element;
  }
}
If yes, the corresponding field for the LIST node in the converted schema is treated
as if the node has the following schema:

repeated <element-type> <name>
You can load Parquet files that contain WKT, hex-encoded WKB, or GeoJSON in a STRING column, or WKB in a BYTE_ARRAY column, by specifying a
BigQuery schema with the type GEOGRAPHY. For more information,
see Loading geospatial data.
You can also load GeoParquet files. In this case, the
columns described by the GeoParquet metadata are interpreted as type GEOGRAPHY by default. You can also load the raw WKB data into a BYTES column by
providing an explicit schema. For more information, see Loading GeoParquet
files.
Column name conversions
A column name can contain letters (a-z, A-Z), numbers (0-9), or underscores
(_), and it must start with a letter or underscore. If you use flexible column
names, BigQuery supports starting a column name with a number.
Exercise caution when starting columns with a number, since using flexible
column names with the BigQuery Storage Read API or
BigQuery Storage Write API requires special handling. For more information about
flexible column name support, see Flexible column names.
Column names have a maximum length of 300 characters. Column names can't use any
of the following prefixes:
_TABLE_
_FILE_
_PARTITION
_ROW_TIMESTAMP
__ROOT__
_COLIDENTIFIER
_CHANGE_SEQUENCE_NUMBER
_CHANGE_TYPE
_CHANGE_TIMESTAMP
Duplicate column names are not allowed even if the case differs. For example, a
column named Column1 is considered identical to a column named column1. To
learn more about column naming rules, see Column names in the
GoogleSQL reference.
If a table name (for example, test) is the same as one of its column names
(for example, test), the SELECT expression interprets the test column as
a STRUCT containing all other table columns. To avoid this collision, use
one of the following methods:
Avoid using the same name for a table and its columns.
Avoid using _field_ as a column name prefix. System-reserved prefixes
cause automatic renaming during queries. For example, the query
SELECT _field_ FROM project1.dataset.test returns a column named _field_1. If you must query a column with this name, use an alias to
control the output.
Assign the table a different alias. For example, the following query assigns
a table alias t to the table project1.dataset.test:
SELECT test FROM project1.dataset.test AS t;
Include the table name when referencing a column. For example:
SELECT test.test FROM project1.dataset.test;
Flexible column names
You have more flexibility in what you name columns, including expanded access
to characters in languages other than English as well as additional symbols.
Make sure to use backtick (`) characters to enclose flexible column names if they are quoted identifiers.
Flexible column names support the following characters:
Any letter in any language, as represented by the Unicode regular expression \p{L}.
Any numeric character in any language, as represented by the Unicode regular expression \p{N}.
Any connector punctuation character, including underscores, as represented by the Unicode regular expression \p{Pc}.
A hyphen or dash, as represented by the Unicode regular expression \p{Pd}.
Any mark intended to accompany another character, as represented by the Unicode regular expression \p{M}. For example, accents, umlauts, or enclosing boxes.
The following special characters:
An ampersand (&), as represented by the Unicode regular expression \u0026.
A percent sign (%), as represented by the Unicode regular expression \u0025.
An equals sign (=), as represented by the Unicode regular expression \u003D.
A plus sign (+), as represented by the Unicode regular expression \u002B.
A colon (:), as represented by the Unicode regular expression \u003A.
An apostrophe ('), as represented by the Unicode regular expression \u0027.
A less-than sign (<), as represented by the Unicode regular expression \u003C.
A greater-than sign (>), as represented by the Unicode regular expression \u003E.
A number sign (#), as represented by the Unicode regular expression \u0023.
A vertical line (|), as represented by the Unicode regular expression \u007C.
Whitespace.
Flexible column names don't support the following special characters:
An exclamation mark (!), as represented by the Unicode regular expression \u0021.
A quotation mark ("), as represented by the Unicode regular expression \u0022.
A dollar sign ($), as represented by the Unicode regular expression \u0024.
A left parenthesis ((), as represented by the Unicode regular expression \u0028.
A right parenthesis ()), as represented by the Unicode regular expression \u0029.
An asterisk (*), as represented by the Unicode regular expression \u002A.
A comma (,), as represented by the Unicode regular expression \u002C.
A period (.), as represented by the Unicode regular expression \u002E. Periods are not replaced by underscores in Parquet file column names when a column name character map is used. For more information, see flexible column limitations.
A slash (/), as represented by the Unicode regular expression \u002F.
A semicolon (;), as represented by the Unicode regular expression \u003B.
A question mark (?), as represented by the Unicode regular expression \u003F.
An at sign (@), as represented by the Unicode regular expression \u0040.
A left square bracket ([), as represented by the Unicode regular expression \u005B.
A backslash (\), as represented by the Unicode regular expression \u005C.
A right square bracket (]), as represented by the Unicode regular expression \u005D.
A circumflex accent (^), as represented by the Unicode regular expression \u005E.
A grave accent (`), as represented by the Unicode regular expression \u0060.
A left curly bracket ({), as represented by the Unicode regular expression \u007B.
A right curly bracket (}), as represented by the Unicode regular expression \u007D.
A tilde (~), as represented by the Unicode regular expression \u007E.
The expanded column characters are supported by both the BigQuery Storage Read API
and the BigQuery Storage Write API. To use the expanded list of Unicode characters
with the BigQuery Storage Read API, you must set a flag. You can use the displayName attribute to retrieve the column name. The following example
shows how to set a flag with the Python client:
from google.cloud.bigquery_storage import types

requested_session = types.ReadSession()

# Set Avro serialization options for flexible column names.
options = types.AvroSerializationOptions()
options.enable_display_name_attribute = True
requested_session.read_options.avro_serialization_options = options
To use the expanded list of Unicode characters with the BigQuery Storage Write API,
you must provide the schema with column_name notation, unless you are using
the JsonStreamWriter writer object. The following example shows how to
provide the schema:
syntax = "proto2";

package mypackage;

// Source protos located in github.com/googleapis/googleapis
import "google/cloud/bigquery/storage/v1/annotations.proto";

message FlexibleSchema {
  optional string item_name_column = 1
      [(.google.cloud.bigquery.storage.v1.column_name) = "name-列"];
  optional string item_description_column = 2
      [(.google.cloud.bigquery.storage.v1.column_name) = "description-列"];
}
In this example, item_name_column and item_description_column are
placeholder names that must be compliant with the protocol buffer naming
convention. Note that column_name annotations always take precedence over
placeholder names.
Limitations
Flexible column names are not supported with external tables.
You cannot load Parquet files containing columns that have a period
(.) in the column name.
Column names from Parquet files are treated as case-insensitive when loaded
into BigQuery. Names that are identical when case is ignored cause
collisions. To avoid this, either append an underscore to one of the duplicate
column names or rename the columns before loading.
Debugging your Parquet file
If your load jobs fail with data errors, you can use PyArrow to verify whether your
Parquet data files are corrupted. If PyArrow fails to read the files, the files
are likely to be rejected by the BigQuery load job. The following
example shows how to read the contents of a Parquet file by using PyArrow:
from pyarrow import parquet as pq

# Read the entire file
pq.read_table('your_sample_file.parquet')

# Read specific columns
pq.read_table('your_sample_file.parquet', columns=['some_column', 'another_column'])

# Read the metadata of specific columns
file_metadata = pq.read_metadata('your_sample_file.parquet')
for col in file_metadata.row_group(0).to_dict()['columns']:
    print(col['column_path_in_schema'])
    print(col['num_values'])
Last updated 2026-05-08 UTC.