Loading Parquet data from Cloud Storage
This page provides an overview of loading Parquet data from Cloud Storage
into BigQuery.
Parquet is an
open source column-oriented data format that is widely used in the
Apache Hadoop ecosystem.
When you load Parquet data from Cloud Storage, you can load the data into a
new table or partition, or you can append to or overwrite an existing table or
partition. When your data is loaded into BigQuery, it is
converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table,
the dataset that contains the table must be in the same regional or
multi-regional location as the Cloud Storage bucket.
You are subject to the following limitations when you load data into
BigQuery from a Cloud Storage bucket:
BigQuery does not guarantee data consistency for external data
sources. Changes to the underlying data while a query is running can result in
unexpected behavior.
BigQuery doesn't support Cloud Storage object versioning. If you
include a generation number in the Cloud Storage URI, then the load job
fails.
You can't use a wildcard in the Cloud Storage URI if any of the
files to be loaded have different schemas. Any difference in the position of
columns qualifies as a different schema.
Input file requirements
To avoid resourcesExceeded errors when loading Parquet files into
BigQuery, follow these guidelines:
Keep row sizes to 50 MB or less.
If your input data contains more than 100 columns, consider reducing the page
size to be smaller than the default page size (1 * 1024 * 1024 bytes). This
is especially helpful if you are using significant compression.
For optimal performance, aim for row group sizes of at least 16 MiB.
Smaller row group sizes increase I/O and slow down loads and queries.
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary
permissions to perform each task in this document, and create a dataset
to store your data.
Required permissions
To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.
Permissions to load data into BigQuery
To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:
bigquery.tables.create
bigquery.tables.updateData
bigquery.tables.update
bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:
To get the permissions that
you need to load data from a Cloud Storage bucket,
ask your administrator to grant you the Storage Admin (roles/storage.admin) IAM role on the bucket.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains
the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are
required, expand the Required permissions section:
Required permissions
The following permissions are required to load data from a Cloud Storage bucket:
storage.buckets.get
storage.objects.get
storage.objects.list (required if you are using a URI wildcard)
When you load Parquet files into BigQuery, the table schema is
automatically retrieved from the self-describing source data. When
BigQuery retrieves the schema from the source data, the
alphabetically last file is used.
For example, you have the following Parquet files in Cloud Storage:
Running this command in the bq command-line tool loads all of the files (as a
comma-separated list), and the schema is derived from mybucket/01/b.parquet:
When you load multiple Parquet files that have different schemas, identical
columns specified in multiple schemas must have the same mode in each schema definition.
When BigQuery detects the schema, some Parquet data types are
converted to BigQuery data types to make them compatible with
GoogleSQL syntax. For more information, see Parquet conversions.
To provide a table schema for creating external tables, set the referenceFileSchemaUri property in the BigQuery API, or the --reference_file_schema_uri parameter in the bq command-line tool,
to the URL of the reference file.
For example, --reference_file_schema_uri="gs://mybucket/schema.parquet".
Parquet compression
BigQuery supports the following compression codecs for
Parquet file contents:
GZip
LZO_1C
LZO_1X
LZ4_RAW
Snappy
ZSTD
Loading Parquet data into a new table
You can load Parquet data into a new table by using one of the following:
The Google Cloud console
The bq command-line tool's bq load command
The jobs.insert API method, configuring a load job
The client libraries
To load Parquet data from Cloud Storage into a new BigQuery
table:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.math.BigInteger;

public class LoadParquet {

  public static void runLoadParquet() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    loadParquet(datasetName);
  }

  public static void loadParquet(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be
      // created once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet";
      TableId tableId = TableId.of(datasetName, "us_states");
      LoadJobConfiguration configuration =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.parquet())
              .build();

      // For more information on Job see:
      // https://googleapis.dev/java/google-cloud-clients/latest/index.html?com/google/cloud/bigquery/package-summary.html
      // Load the table
      Job job = bigquery.create(JobInfo.of(configuration));

      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = job.waitFor();
      if (completedJob == null) {
        System.out.println("Job not executed since it no longer exists.");
        return;
      } else if (completedJob.getStatus().getError() != null) {
        System.out.println(
            "BigQuery was unable to load the table due to an error: \n"
                + job.getStatus().getError());
        return;
      }

      // Check number of rows loaded into the table
      BigInteger numRows = bigquery.getTable(tableId).getNumRows();
      System.out.printf("Loaded %d rows. \n", numRows);
      System.out.println("GCS parquet loaded successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("GCS Parquet was not loaded. \n" + e.toString());
    }
  }
}
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the Parquet file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.parquet';

async function loadTableGCSParquet() {
  // Imports a GCS file into a table with Parquet source format.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'PARQUET',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('PARQUET');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
Appending to or overwriting a table with Parquet data
You can load additional data into a table either from source files or by
appending query results.
In the Google Cloud console, use the Write preference option to specify
what action to take when you load data from a source file or from a query
result.
You have the following options when you load additional data into a table:
Write if empty
  bq tool flag: not supported
  BigQuery API property: WRITE_EMPTY
  Writes the data only if the table is empty.

Append to table (default)
  bq tool flag: --noreplace or --replace=false; if --[no]replace is unspecified, the default is append
  BigQuery API property: WRITE_APPEND
  Appends the data to the end of the table.

Overwrite table
  bq tool flag: --replace or --replace=true
  BigQuery API property: WRITE_TRUNCATE
  Erases all existing data in a table before writing the new data. This action
  also deletes the table schema and row-level security, and removes any
  Cloud KMS key.
If you load data into an existing table, the load job can append the data or
overwrite the table.
You can append or overwrite a table by using the Google Cloud console, the
bq command-line tool's bq load command, the jobs.insert API method, or the
client libraries.
BigQuery supports loading hive-partitioned Parquet data stored on
Cloud Storage and populates the hive partitioning columns as columns in
the destination BigQuery managed table. For more information, see Loading externally partitioned data.
Parquet conversions
This section describes how BigQuery parses various data types when loading Parquet data.
Some Parquet data types (such as INT32, INT64, BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY) can be converted into multiple BigQuery data types. To ensure BigQuery converts the Parquet data types correctly, specify the appropriate data type in the Parquet file.
For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:
optional int32 date_col (DATE);
BigQuery converts Parquet data types to the
BigQuery data types that are described in the following sections.
Nested groups are converted into STRUCT types.
Other combinations of Parquet types and converted types are not supported.
Unsigned logical types
The Parquet UINT_8, UINT_16, UINT_32, and UINT_64 types are unsigned.
BigQuery treats values with these types as unsigned when loading them into a
signed BigQuery INTEGER column. In the case of UINT_64, an error is returned
if the unsigned value exceeds the maximum INTEGER value of
9,223,372,036,854,775,807.
Decimal logical type
Decimal logical types can be converted to NUMERIC, BIGNUMERIC, or STRING types. The converted type depends
on the precision and scale parameters of the decimal logical type and the
specified decimal target types. You specify the decimal target types in the
load job configuration.
You can enable schema inference for Parquet LIST logical types. BigQuery
checks whether the LIST node is in the standard form or in one of the forms described by the backward-compatibility rules:
// standard form
<optional | required> group <name> (LIST) {
  repeated group list {
    <optional | required> <element-type> element;
  }
}
If yes, the corresponding field for the LIST node in the converted schema is treated
as if the node has the following schema:

repeated <element-type> <name>
You can load Parquet files that contain WKT, hex-encoded WKB, or GeoJSON in a STRING column, or WKB in a BYTE_ARRAY column, by specifying a
BigQuery schema with the type GEOGRAPHY. For more information,
see Loading geospatial data.
You can also load GeoParquet files. In this case, the
columns described by the GeoParquet metadata are interpreted as type GEOGRAPHY by default. You can also load the raw WKB data into a BYTES column by
providing an explicit schema. For more information, see Loading GeoParquet
files.
Column name conversions
A column name can contain letters (a-z, A-Z), numbers (0-9), or underscores
(_), and it must start with a letter or underscore. If you use flexible column
names, BigQuery supports starting a column name with a number.
Exercise caution when starting columns with a number, since using flexible
column names with the BigQuery Storage Read API or
BigQuery Storage Write API requires special handling. For more information about
flexible column name support, see Flexible column names.
Column names have a maximum length of 300 characters. Column names can't use any
of the following prefixes:
_TABLE_
_FILE_
_PARTITION
_ROW_TIMESTAMP
__ROOT__
_COLIDENTIFIER
_CHANGE_SEQUENCE_NUMBER
_CHANGE_TYPE
_CHANGE_TIMESTAMP
Duplicate column names are not allowed even if the case differs. For example, a
column named Column1 is considered identical to a column named column1. To
learn more about column naming rules, see Column names in the
GoogleSQL reference.
If a table name (for example, test) is the same as one of its column names
(for example, test), the SELECT expression interprets the test column as
a STRUCT containing all other table columns. To avoid this collision, use
one of the following methods:
Avoid using the same name for a table and its columns.
Avoid using _field_ as a column name prefix. System-reserved prefixes
cause automatic renaming during queries. For example, the query
SELECT _field_ FROM project1.dataset.test returns a column named _field_1. If you must query a column with this name, use an alias to
control the output.
Assign the table a different alias. For example, the following query assigns
a table alias t to the table project1.dataset.test:
SELECT test FROM project1.dataset.test AS t;
Include the table name when referencing a column. For example:
SELECT test.test FROM project1.dataset.test;
Flexible column names
You have more flexibility in what you name columns, including expanded access
to characters in languages other than English as well as additional symbols.
Make sure to use backtick (`) characters to enclose flexible column names if they are quoted identifiers.
Flexible column names support the following characters:
Any letter in any language, as represented by the Unicode regular expression \p{L}.
Any numeric character in any language, as represented by the Unicode regular expression \p{N}.
Any connector punctuation character, including underscores, as represented by the Unicode regular expression \p{Pc}.
A hyphen or dash, as represented by the Unicode regular expression \p{Pd}.
Any mark intended to accompany another character, as represented by the Unicode regular expression \p{M}. For example, accents, umlauts, or enclosing boxes.
The following special characters:
An ampersand (&), as represented by the Unicode regular expression \u0026.
A percent sign (%), as represented by the Unicode regular expression \u0025.
An equals sign (=), as represented by the Unicode regular expression \u003D.
A plus sign (+), as represented by the Unicode regular expression \u002B.
A colon (:), as represented by the Unicode regular expression \u003A.
An apostrophe ('), as represented by the Unicode regular expression \u0027.
A less-than sign (<), as represented by the Unicode regular expression \u003C.
A greater-than sign (>), as represented by the Unicode regular expression \u003E.
A number sign (#), as represented by the Unicode regular expression \u0023.
A vertical line (|), as represented by the Unicode regular expression \u007C.
Whitespace.
Flexible column names don't support the following special characters:
An exclamation mark (!), as represented by the Unicode regular expression \u0021.
A quotation mark ("), as represented by the Unicode regular expression \u0022.
A dollar sign ($), as represented by the Unicode regular expression \u0024.
A left parenthesis ((), as represented by the Unicode regular expression \u0028.
A right parenthesis ()), as represented by the Unicode regular expression \u0029.
An asterisk (*), as represented by the Unicode regular expression \u002A.
A comma (,), as represented by the Unicode regular expression \u002C.
A period (.), as represented by the Unicode regular expression \u002E. Periods are not replaced by underscores in Parquet file column names when a column name character map is used. For more information, see flexible column limitations.
A slash (/), as represented by the Unicode regular expression \u002F.
A semicolon (;), as represented by the Unicode regular expression \u003B.
A question mark (?), as represented by the Unicode regular expression \u003F.
An at sign (@), as represented by the Unicode regular expression \u0040.
A left square bracket ([), as represented by the Unicode regular expression \u005B.
A backslash (\), as represented by the Unicode regular expression \u005C.
A right square bracket (]), as represented by the Unicode regular expression \u005D.
A circumflex accent (^), as represented by the Unicode regular expression \u005E.
A grave accent (`), as represented by the Unicode regular expression \u0060.
A left curly bracket ({), as represented by the Unicode regular expression \u007B.
A right curly bracket (}), as represented by the Unicode regular expression \u007D.
A tilde (~), as represented by the Unicode regular expression \u007E.
The expanded column characters are supported by both the BigQuery Storage Read API
and the BigQuery Storage Write API. To use the expanded list of Unicode characters
with the BigQuery Storage Read API, you must set a flag. You can use the displayName attribute to retrieve the column name. The following example
shows how to set a flag with the Python client:
from google.cloud.bigquery_storage import types

requested_session = types.ReadSession()

# Set Avro serialization options for flexible column names.
options = types.AvroSerializationOptions()
options.enable_display_name_attribute = True
requested_session.read_options.avro_serialization_options = options
To use the expanded list of Unicode characters with the BigQuery Storage Write API,
you must provide the schema with column_name notation, unless you are using
the JsonStreamWriter writer object. The following example shows how to
provide the schema:
syntax = "proto2";

package mypackage;

// Source protos located in github.com/googleapis/googleapis
import "google/cloud/bigquery/storage/v1/annotations.proto";

message FlexibleSchema {
  optional string item_name_column = 1
      [(.google.cloud.bigquery.storage.v1.column_name) = "name-列"];
  optional string item_description_column = 2
      [(.google.cloud.bigquery.storage.v1.column_name) = "description-列"];
}
In this example, item_name_column and item_description_column are
placeholder names that must be compliant with the protocol buffer naming
convention. Note that column_name annotations always take precedence over
placeholder names.
Limitations
Flexible column names are not supported with external tables.
You cannot load Parquet files containing columns that have a period
(.) in the column name.
Column names from Parquet files are treated as case-insensitive when loaded
into BigQuery. Names that are identical when case is ignored cause
collisions. To avoid this, either append an underscore to one of the duplicate
column names or rename the columns before loading.
Debugging your Parquet file
If your load jobs fail with data errors, you can use PyArrow to verify whether your
Parquet data files are corrupted. If PyArrow fails to read the files, the files
are likely to be rejected by the BigQuery load job. The following
example shows how to read the contents of a Parquet file by using PyArrow:
from pyarrow import parquet as pq

# Read the entire file
pq.read_table('your_sample_file.parquet')

# Read specific columns
pq.read_table('your_sample_file.parquet', columns=['some_column', 'another_column'])

# Read the metadata of specific columns
file_metadata = pq.read_metadata('your_sample_file.parquet')
for col in file_metadata.row_group(0).to_dict()['columns']:
    print(col['column_path_in_schema'])
    print(col['num_values'])
Last updated 2026-05-08 UTC.