Loading Parquet data from Cloud Storage
This page provides an overview of loading Parquet data from Cloud Storage
into BigQuery.
Parquet is an
open source column-oriented data format that is widely used in the
Apache Hadoop ecosystem.
When you load Parquet data from Cloud Storage, you can load the data into a
new table or partition, or you can append to or overwrite an existing table or
partition. When your data is loaded into BigQuery, it is
converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table,
the dataset that contains the table must be in the same regional or
multi-regional location as the Cloud Storage bucket.
You are subject to the following limitations when you load data into
BigQuery from a Cloud Storage bucket:
BigQuery does not guarantee data consistency for external data
sources. Changes to the underlying data while a query is running can result in
unexpected behavior.
BigQuery doesn't support Cloud Storage object versioning. If you
include a generation number in the Cloud Storage URI, then the load job
fails.
You can't use a wildcard in the Cloud Storage URI if any of the
files to be loaded have different schemas. Any difference in the position of
columns qualifies as a different schema.
Input file requirements
To avoid resourcesExceeded errors when loading Parquet files into
BigQuery, follow these guidelines:
Keep row sizes to 50 MB or less.
If your input data contains more than 100 columns, consider reducing the page
size to be smaller than the default page size (1 * 1024 * 1024 bytes). This
is especially helpful if you are using significant compression.
For optimal performance, aim for row group sizes of at least 16 MiB.
Smaller row group sizes increase I/O and slow down loads and queries.
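For example, if you generate the Parquet files yourself with PyArrow, you can control both settings when writing. The following is a minimal sketch; the table contents, output path, and the 512 KiB page size are illustrative assumptions, not values from this guide:

import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative table; replace with your own data.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Write with an explicit row group size (rows per group) and a data page size
# (bytes) smaller than the 1 MiB default, which can help for wide tables.
pq.write_table(
    table,
    "my_data.parquet",
    row_group_size=100_000,      # tune so each row group is at least ~16 MiB on disk
    data_page_size=512 * 1024,   # 512 KiB pages instead of the 1 MiB default
)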
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary
permissions to perform each task in this document, and create a dataset
to store your data.
Required permissions
To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.
Permissions to load data into BigQuery
To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:
bigquery.tables.create
bigquery.tables.updateData
bigquery.tables.update
bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:
To get the permissions that
you need to load data from a Cloud Storage bucket,
ask your administrator to grant you the Storage Admin (roles/storage.admin)
IAM role on the bucket.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains
the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are
required, expand the Required permissions section:
Required permissions
The following permissions are required to load data from a Cloud Storage bucket:
storage.buckets.get
storage.objects.get
storage.objects.list (required if you are using a URI wildcard)
When you load Parquet files into BigQuery, the table schema is
automatically retrieved from the self-describing source data. When
BigQuery retrieves the schema from the source data, the
alphabetically last file is used.
For example, you have the following Parquet files in Cloud Storage:
Running this command in the bq command-line tool loads all of the files (as a
comma-separated list), and the schema is derived from mybucket/01/b.parquet:
When you load multiple Parquet files that have different schemas, identical
columns specified in multiple schemas must have the same mode in each schema definition.
When BigQuery detects the schema, some Parquet data types are
converted to BigQuery data types to make them compatible with
GoogleSQL syntax. For more information, see Parquet conversions.
To provide a table schema for creating external tables, set the referenceFileSchemaUri property in the BigQuery API or the --reference_file_schema_uri parameter in the bq command-line tool
to the URL of the reference file.
For example, --reference_file_schema_uri="gs://mybucket/schema.parquet".
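As an illustration, the Python client exposes the same setting on the external table configuration. This is a sketch only; it assumes a recent google-cloud-bigquery release that exposes reference_file_schema_uri, and the dataset, table, and URI names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names; replace with your own dataset, table, and reference file.
table_id = "your-project.your_dataset.your_external_table"
reference_uri = "gs://mybucket/schema.parquet"

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://mybucket/data/*.parquet"]
# Use the schema of the reference file instead of inferring it from the data files.
external_config.reference_file_schema_uri = reference_uri

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
table = client.create_table(table)  # API request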
Parquet compression
BigQuery supports the following compression codecs for
Parquet file contents:
GZip
LZO_1C
LZO_1X
LZ4_RAW
Snappy
ZSTD
Loading Parquet data into a new table
You can load Parquet data into a new table by using one of the following:
The Google Cloud console
The bq command-line tool's bq load command
The jobs.insert API method and configuring a load job
The client libraries
To load Parquet data from Cloud Storage into a new BigQuery
table:
Console
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, expand your project, and then select a dataset.
In the Dataset info section, click Create table.
In the Create table panel, specify the following details:
In the Source section, select Google Cloud Storage in the Create table from list.
Then, do the following:
Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI.
You cannot include multiple URIs
in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same
location as the dataset that contains the table you want to create, append, or
overwrite.
For File format, select Parquet.
In the Destination section, specify the following details:
For Dataset, select the dataset in which you want to create the
table.
In the Table field, enter the name of the table that you want to create.
Verify that the Table type field is set to Native table.
In the Schema section, no action is necessary. The schema is
self-described in Parquet files.
Use the bq load command, specify PARQUET using the --source_format flag, and include a Cloud Storage URI.
You can include a single URI, a comma-separated list of URIs, or a URI
containing a wildcard.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
--time_partitioning_type: Enables time-based partitioning on a table
and sets the partition type. Possible values are HOUR, DAY, MONTH,
and YEAR. This flag is optional when you create a
table partitioned on a DATE, DATETIME, or TIMESTAMP column. The
default partition type for time-based partitioning is DAY. You cannot
change the partitioning specification on an existing table.
--time_partitioning_expiration: An integer that specifies (in seconds)
when a time-based partition should be deleted. The expiration time
evaluates to the partition's UTC date plus the integer value.
--time_partitioning_field: The DATE or TIMESTAMP column used to
create a partitioned table. If time-based partitioning is enabled without
this value, an ingestion-time partitioned table is created.
--require_partition_filter: When enabled, this option requires users
to include a WHERE clause that specifies the partitions to query.
Requiring a partition filter can reduce cost and improve performance.
For more information, see Require a partition filter in queries.
--clustering_fields: A comma-separated list of up to four column names
used to create a clustered table.
--destination_kms_key: The Cloud KMS key for encryption of the
table data.
--column_name_character_map: Defines the scope and handling of
characters in column names, with the option of enabling flexible column names.
For more information, see load_option_list.
For more information on supported and unsupported characters, see flexible column names.
LOCATION: your location. The --location flag is
optional. For example, if you are using BigQuery in the
Tokyo region, you can set the flag's value to asia-northeast1. You can
set a default value for the location using the .bigqueryrc file.
FORMAT: PARQUET.
DATASET: an existing dataset.
TABLE: the name of the table into which you're
loading data.
PATH_TO_SOURCE: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.parquet into a
table named mytable in mydataset.
The following command loads data from gs://mybucket/mydata.parquet into a
partitioned table named mytable in mydataset. The table is partitioned
on the mytimestamp column.
The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a
comma-separated list of Cloud Storage URIs with wildcards.
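If you use a client library instead of the bq tool, the partitioning and clustering flags described above map directly onto the load job configuration. The following is a minimal sketch with the Python client; the mytimestamp column matches the hypothetical examples above, and the clustering column, expiration, and table names are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table and source URI.
table_id = "your-project.mydataset.mytable"
uri = "gs://mybucket/mydata.parquet"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    # Equivalent of --time_partitioning_type, --time_partitioning_field,
    # and --time_partitioning_expiration (expressed in milliseconds here).
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="mytimestamp",
        expiration_ms=7 * 24 * 60 * 60 * 1000,
    ),
    # Equivalent of --clustering_fields.
    clustering_fields=["customer_id"],
)

client.load_table_from_uri(uri, table_id, job_config=job_config).result()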
Create a load job that points to the source data in Cloud Storage.
(Optional) Specify your location in
the location property in the jobReference section of the job resource.
The source URIs property must be fully qualified, in the format gs://BUCKET/OBJECT.
Each URI can contain one '*' wildcard character.
Specify the Parquet data format by setting the sourceFormat property to PARQUET.
To check the job status, call jobs.get(JOB_ID*),
replacing JOB_ID with the ID of the job returned by the initial
request.
If status.state = DONE, the job completed successfully.
If the status.errorResult property is present, the request failed,
and that object includes information describing what went wrong.
When a request fails, no table is created and no data is loaded.
If status.errorResult is absent, the job finished successfully,
although there might have been some nonfatal errors, such as problems
importing a few rows. Nonfatal errors are listed in the returned job
object's status.errors property.
API notes:
Load jobs are atomic and consistent: if a load job fails, none of the data
is available, and if a load job succeeds, all of the data is available.
As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This
approach is more robust to network failure because the client can poll or
retry on the known job ID.
Calling jobs.insert on a given job ID is idempotent. You can retry as
many times as you like on the same job ID, and at most one of those
operations will succeed.
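For example, with the Python client you can supply your own job ID so that a retry after a network failure targets the same job rather than starting a new one. This is a sketch only; the job ID prefix, table, and URI are assumptions:

import uuid

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table and source URI.
table_id = "your-project.your_dataset.your_table_name"
uri = "gs://mybucket/mydata.parquet"

# Generate the job ID up front so a retry can poll or re-issue the same job.
job_id = f"load_parquet_{uuid.uuid4()}"

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
load_job = client.load_table_from_uri(uri, table_id, job_id=job_id, job_config=job_config)
load_job.result()  # Waits for the job to complete.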
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.math.BigInteger;

public class LoadParquet {

  public static void runLoadParquet() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    loadParquet(datasetName);
  }

  public static void loadParquet(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet";
      TableId tableId = TableId.of(datasetName, "us_states");

      LoadJobConfiguration configuration =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.parquet())
              .build();

      // For more information on Job see:
      // https://googleapis.dev/java/google-cloud-clients/latest/index.html?com/google/cloud/bigquery/package-summary.html
      // Load the table
      Job job = bigquery.create(JobInfo.of(configuration));

      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = job.waitFor();
      if (completedJob == null) {
        System.out.println("Job not executed since it no longer exists.");
        return;
      } else if (completedJob.getStatus().getError() != null) {
        System.out.println(
            "BigQuery was unable to load the table due to an error: \n"
                + job.getStatus().getError());
        return;
      }

      // Check number of rows loaded into the table
      BigInteger numRows = bigquery.getTable(tableId).getNumRows();
      System.out.printf("Loaded %d rows. \n", numRows);

      System.out.println("GCS parquet loaded successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("GCS Parquet was not loaded. \n" + e.toString());
    }
  }
}
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the Parquet file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.parquet';

async function loadTableGCSParquet() {
  // Imports a GCS file into a table with Parquet source format.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'PARQUET',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('PARQUET');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
Appending to or overwriting a table with Parquet data
You can load additional data into a table either from source files or by
appending query results.
In the Google Cloud console, use the Write preference option to specify
what action to take when you load data from a source file or from a query
result.
You have the following options when you load additional data into a table:
Write if empty: not supported as a bq tool flag; BigQuery API property WRITE_EMPTY. Writes the data only if the table is empty.
Append to table: bq tool flag --noreplace or --replace=false (if --[no]replace is unspecified, the default is to append); BigQuery API property WRITE_APPEND (the default). Appends the data to the end of the table.
Overwrite table: bq tool flag --replace or --replace=true; BigQuery API property WRITE_TRUNCATE. Erases all existing data in a table before writing the new data. This action also deletes the table schema and row-level security, and removes any Cloud KMS key.
If you load data into an existing table, the load job can append the data or
overwrite the table.
You can append or overwrite a table by using one of the following:
The Google Cloud console
The bq command-line tool's bq load command
The jobs.insert API method and configuring a load job
The client libraries
To append or overwrite a table with Parquet data:
Console
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, expand your project, and then select a dataset.
In the Dataset info section, click Create table.
In the Create table panel, specify the following details:
In the Source section, select Google Cloud Storage in the Create table from list.
Then, do the following:
Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI.
You cannot include multiple URIs
in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same
location as the dataset that contains the table you want to create, append, or
overwrite.
For File format, select Parquet.
In the Destination section, specify the following details:
For Dataset, select the dataset in which you want to create the
table.
In the Table field, enter the name of the table that you want to create.
Verify that the Table type field is set to Native table.
In the Schema section, no action is necessary. The schema is
self-described in Parquet files.
Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables.
You
cannot convert a table to a partitioned or clustered table by appending or
overwriting it. The Google Cloud console does not support
appending to or overwriting partitioned or clustered tables in a load job.
Click Advanced options and do the following:
For Write preference, choose Append to table or Overwrite
table.
If you want to ignore values in a row that are not present in the table's schema, then
select Unknown values.
Enter the bq load command with the --replace flag to overwrite the
table. Use the --noreplace flag to append data to the table. If no flag is
specified, the default is to append data. Supply the --source_format flag
and set it to PARQUET. Because Parquet schemas are automatically retrieved
from the self-describing source data, you don't need to provide a schema
definition.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
--destination_kms_key: The Cloud KMS key for encryption of the
table data.
Create a load job that points to the source data in Cloud Storage.
(Optional) Specify your location in
the location property in the jobReference section of the job resource.
The source URIs property
must be fully qualified, in the format gs://BUCKET/OBJECT. You can
include multiple URIs as a comma-separated list. Note that wildcards are
also supported.
Specify the data format by setting the configuration.load.sourceFormat property to PARQUET.
Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.
import("context""fmt""cloud.google.com/go/bigquery")// importParquetTruncate demonstrates loading Apache Parquet data from Cloud Storage into a table// and overwriting/truncating existing data in the table.funcimportParquetTruncate(projectID,datasetID,tableIDstring)error{// projectID := "my-project-id"// datasetID := "mydataset"// tableID := "mytable"ctx:=context.Background()client,err:=bigquery.NewClient(ctx,projectID)iferr!=nil{returnfmt.Errorf("bigquery.NewClient: %v",err)}deferclient.Close()gcsRef:=bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.parquet")gcsRef.SourceFormat=bigquery.ParquetgcsRef.AutoDetect=trueloader:=client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)loader.WriteDisposition=bigquery.WriteTruncatejob,err:=loader.Run(ctx)iferr!=nil{returnerr}status,err:=job.Wait(ctx)iferr!=nil{returnerr}ifstatus.Err()!=nil{returnfmt.Errorf("job completed with error: %v",status.Err())}returnnil}
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.JobInfo.WriteDisposition;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.math.BigInteger;

public class LoadParquetReplaceTable {

  public static void runLoadParquetReplaceTable() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    loadParquetReplaceTable(datasetName);
  }

  public static void loadParquetReplaceTable(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Imports a GCS file into a table and overwrites table data if table already exists.
      // This sample loads the Parquet file at:
      // https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
      String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet";
      TableId tableId = TableId.of(datasetName, "us_states");

      // For more information on LoadJobConfiguration see:
      // https://googleapis.dev/java/google-cloud-clients/latest/com/google/cloud/bigquery/LoadJobConfiguration.Builder.html
      LoadJobConfiguration configuration =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.parquet())
              // Set the write disposition to overwrite existing table data.
              .setWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
              .build();

      // For more information on Job see:
      // https://googleapis.dev/java/google-cloud-clients/latest/index.html?com/google/cloud/bigquery/package-summary.html
      // Load the table
      Job job = bigquery.create(JobInfo.of(configuration));

      // Load data from a GCS parquet file into the table
      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = job.waitFor();
      if (completedJob == null) {
        System.out.println("Job not executed since it no longer exists.");
        return;
      } else if (completedJob.getStatus().getError() != null) {
        System.out.println(
            "BigQuery was unable to load into the table due to an error: \n"
                + job.getStatus().getError());
        return;
      }

      // Check number of rows loaded into the table
      BigInteger numRows = bigquery.getTable(tableId).getNumRows();
      System.out.printf("Loaded %d rows. \n", numRows);

      System.out.println("GCS parquet overwrote existing table successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Table extraction job was interrupted. \n" + e.toString());
    }
  }
}
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the Parquet file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.parquet';

async function loadParquetFromGCSTruncate() {
  /**
   * Imports a GCS file into a table and overwrites
   * table data if table already exists.
   */

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_dataset";
  // const tableId = "my_table";

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'PARQUET',
    // Set the write disposition to overwrite existing table data.
    writeDisposition: 'WRITE_TRUNCATE',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';
// $tableID = 'The BigQuery table ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$table = $bigQuery->dataset($datasetId)->table($tableId);

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('PARQUET')->writeDisposition('WRITE_TRUNCATE');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
import io

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
)

body = io.BytesIO(b"Washington,WA")
client.load_table_from_file(body, table_id, job_config=job_config).result()
previous_rows = client.get_table(table_id).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.PARQUET,
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
Loading hive-partitioned Parquet data
BigQuery supports loading hive partitioned Parquet data stored on
Cloud Storage and populates the hive partitioning columns as columns in
the destination BigQuery managed table. For more information, see Loading externally partitioned data.
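For example, a load job can populate the hive partitioning columns by pointing at the common prefix of the partitioned layout. The following is a minimal sketch with the Python client; the bucket layout, table, and URI prefix are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical layout: gs://mybucket/events/dt=2024-01-01/*.parquet, and so on.
table_id = "your-project.your_dataset.events"
uri = "gs://mybucket/events/*"

hive_config = bigquery.HivePartitioningOptions()
hive_config.mode = "AUTO"  # infer partition key types from the path
hive_config.source_uri_prefix = "gs://mybucket/events/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    hive_partitioning=hive_config,
)

client.load_table_from_uri(uri, table_id, job_config=job_config).result()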
Parquet conversions
This section describes how BigQuery parses various data types when loading Parquet data.
Some Parquet data types (such as INT32, INT64, BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY) can be converted into multiple BigQuery data types. To ensure BigQuery converts the Parquet data types correctly, specify the appropriate data type in the Parquet file.
For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:
optional int32 date_col (DATE);
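For example, if you write the file with PyArrow, declaring the column as date32 produces exactly this INT32 column with the DATE logical type. A minimal sketch; the column name, values, and output path are illustrative:

import datetime

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field("date_col", pa.date32())])
table = pa.table({"date_col": [datetime.date(2024, 1, 1)]}, schema=schema)

# The resulting Parquet column is "optional int32 date_col (DATE)".
pq.write_table(table, "dates.parquet")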
BigQuery converts Parquet data types to the
BigQuery data types that are described in the following sections.
Nested groups are converted into STRUCT types.
Other combinations of Parquet types and converted types are not supported.
Unsigned logical types
The Parquet UINT_8, UINT_16, UINT_32, and UINT_64 types are unsigned.
BigQuery treats values with these types as unsigned when loading them into a
BigQuery signed INTEGER column. In the case of UINT_64, an error is returned
if the unsigned value exceeds the maximum INTEGER value of
9,223,372,036,854,775,807.
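For example, a PyArrow column declared as uint32 or uint64 carries the corresponding unsigned annotation in the file, and BigQuery loads it into a signed INTEGER column as long as the values fit. A minimal sketch; the column names, values, and output path are illustrative:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "small_unsigned": pa.array([0, 42], type=pa.uint32()),
    # Values above 9,223,372,036,854,775,807 in a uint64 column would make the load fail.
    "large_unsigned": pa.array([0, 9_223_372_036_854_775_807], type=pa.uint64()),
})
pq.write_table(table, "unsigned.parquet")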
Decimal logical type
Decimal logical types can be converted to NUMERIC, BIGNUMERIC, or STRING types. The converted type depends
on the precision and scale parameters of the decimal logical type and the
specified decimal target types. Specify the decimal target type as follows:
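In the Python client, for example, the equivalent load option is decimal_target_types. This is a sketch only; the table, URI, and the particular preference order are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table and source URI.
table_id = "your-project.your_dataset.your_table_name"
uri = "gs://mybucket/decimals.parquet"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    # Preference order: the first listed type that can hold the column's
    # precision and scale is used.
    decimal_target_types=["NUMERIC", "BIGNUMERIC", "STRING"],
)

client.load_table_from_uri(uri, table_id, job_config=job_config).result()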
You can enable schema inference for Parquet LIST logical types. BigQuery
checks whether the LIST node is in the standard form or in one of the forms described by the backward-compatibility rules:
// standard form
<optional|required> group <name> (LIST) {
  repeated group list {
    <optional|required> <element-type> element;
  }
}
If yes, the corresponding field for the LIST node in the converted schema is treated
as if the node has the following schema:
You can load Parquet files that contain WKT, hex-encoded WKB, or GeoJSON in a STRING column, or WKB in a BYTE_ARRAY column, by specifying a
BigQuery schema with the type GEOGRAPHY. For more information,
see Loading geospatial data.
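For example, a load job with an explicit schema can place WKT strings into a GEOGRAPHY column. A minimal sketch with the Python client; the table, URI, and column names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table and source URI.
table_id = "your-project.your_dataset.places"
uri = "gs://mybucket/places.parquet"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    # The Parquet file stores WKT in a string column named "geom";
    # the explicit schema tells BigQuery to interpret it as GEOGRAPHY.
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("geom", "GEOGRAPHY"),
    ],
)

client.load_table_from_uri(uri, table_id, job_config=job_config).result()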
You can also load GeoParquet files. In this case, the
columns described by the GeoParquet metadata are interpreted as type GEOGRAPHY by default. You can also load the raw WKB data into a BYTES column by
providing an explicit schema. For more information, see Loading GeoParquet
files.
Column name conversions
A column name can contain letters (a-z, A-Z), numbers (0-9), or underscores
(_), and it must start with a letter or underscore. If you use flexible column
names, BigQuery supports starting a column name with a number.
Exercise caution when starting columns with a number, since using flexible
column names with the BigQuery Storage Read API or
BigQuery Storage Write API requires special handling. For more information about
flexible column name support, see flexible column names.
Column names have a maximum length of 300 characters. Column names can't use any
of the following prefixes:
_TABLE_
_FILE_
_PARTITION
_ROW_TIMESTAMP
__ROOT__
_COLIDENTIFIER
Duplicate column names are not allowed even if the case differs. For example, a
column named Column1 is considered identical to a column named column1. To
learn more about column naming rules, see Column
names in the
GoogleSQL reference.
If a table name (for example, test) is the same as one of its column names
(for example, test), the SELECT expression interprets the test column as
a STRUCT containing all other table columns. To avoid this collision, use
one of the following methods:
Avoid using the same name for a table and its columns.
Assign the table a different alias. For example, the following query assigns
a table alias t to the table project1.dataset.test:
SELECT test FROM project1.dataset.test AS t;
Include the table name when referencing a column. For example:
SELECT test.test FROM project1.dataset.test;
Flexible column names
You have more flexibility in what you name columns, including expanded access
to characters in languages other than English as well as additional symbols.
Make sure to use backtick (`) characters to enclose flexible column names if they are quoted identifiers.
Flexible column names support the following characters:
Any letter in any language, as represented by the Unicode regular expression \p{L}.
Any numeric character in any language as represented by the Unicode regular
expression \p{N}.
Any connector punctuation character, including underscores, as represented
by the Unicode regular expression \p{Pc}.
A hyphen or dash as represented by the Unicode regular expression \p{Pd}.
Any mark intended to accompany another character as represented by the
Unicode regular expression \p{M}.
For example, accents, umlauts, or enclosing boxes.
The following special characters:
An ampersand (&) as represented by the Unicode regular
expression\u0026.
A percent sign (%) as represented by the Unicode regular
expression\u0025.
An equals sign (=) as represented by the Unicode regular
expression\u003D.
A plus sign (+) as represented by the Unicode regular
expression\u002B.
A colon (:) as represented by the Unicode regular
expression\u003A.
An apostrophe (') as represented by the Unicode regular
expression\u0027.
A less-than sign (<) as represented by the Unicode regular
expression\u003C.
A greater-than sign (>) as represented by the Unicode regular
expression\u003E.
A number sign (#) as represented by the Unicode regular
expression\u0023.
A vertical line (|) as represented by the Unicode regular
expression\u007c.
Whitespace.
Flexible column names don't support the following special characters:
An exclamation mark (!) as represented by the Unicode regular expression \u0021.
A quotation mark (") as represented by the Unicode regular expression \u0022.
A dollar sign ($) as represented by the Unicode regular expression \u0024.
A left parenthesis (() as represented by the Unicode regular expression \u0028.
A right parenthesis ()) as represented by the Unicode regular expression \u0029.
An asterisk (*) as represented by the Unicode regular expression \u002A.
A comma (,) as represented by the Unicode regular expression \u002C.
A period (.) as represented by the Unicode regular expression \u002E. Periods are not replaced by underscores in Parquet file
column names when a column name character map is used. For more information,
see flexible column limitations.
A slash (/) as represented by the Unicode regular expression \u002F.
A semicolon (;) as represented by the Unicode regular expression \u003B.
A question mark (?) as represented by the Unicode regular expression \u003F.
An at sign (@) as represented by the Unicode regular expression \u0040.
A left square bracket ([) as represented by the Unicode regular expression \u005B.
A backslash (\) as represented by the Unicode regular expression \u005C.
A right square bracket (]) as represented by the Unicode regular expression \u005D.
A circumflex accent (^) as represented by the Unicode regular expression \u005E.
A grave accent (`) as represented by the Unicode regular expression \u0060.
A left curly bracket ({) as represented by the Unicode regular expression \u007B.
A right curly bracket (}) as represented by the Unicode regular expression \u007D.
A tilde (~) as represented by the Unicode regular expression \u007E.
The expanded column characters are supported by both the BigQuery Storage Read API
and the BigQuery Storage Write API. To use the expanded list of Unicode characters
with the BigQuery Storage Read API, you must set a flag. You can use the displayName attribute to retrieve the column name. The following example
shows how to set a flag with the Python client:
from google.cloud.bigquery_storage import types

requested_session = types.ReadSession()

# Set Avro serialization options for flexible columns.
options = types.AvroSerializationOptions()
options.enable_display_name_attribute = True
requested_session.read_options.avro_serialization_options = options
To use the expanded list of Unicode characters with the BigQuery Storage Write API,
you must provide the schema with column_name notation, unless you are using
the JsonStreamWriter writer object. The following example shows how to
provide the schema:
syntax="proto2";packagemypackage;// Source protos located in github.com/googleapis/googleapisimport"google/cloud/bigquery/storage/v1/annotations.proto";messageFlexibleSchema{optionalstringitem_name_column=1[(.google.cloud.bigquery.storage.v1.column_name)="name-列"];optionalstringitem_description_column=2[(.google.cloud.bigquery.storage.v1.column_name)="description-列"];}
In this example, item_name_column and item_description_column are
placeholder names which need to be compliant with the protocol buffer naming
convention. Note that column_name annotations always take precedence over
placeholder names.
Limitations
Flexible column names are not supported with external tables.
You cannot load Parquet files containing columns that have a period
(.) in the column name.
Column names from Parquet files are treated as case-insensitive when loaded
into BigQuery. Identical case-insensitive names will cause
collisions. To avoid this, either append an underscore to one of the duplicate
column names or rename the columns before loading.
Debugging your Parquet file
If your load jobs fail with data errors, you can use PyArrow to verify if your
Parquet data files are corrupted. If PyArrow fails to read the files, the files
are likely to be rejected by the BigQuery load job. The following
example shows how to read the contents of a Parquet file by using PyArrow:
from pyarrow import parquet as pq

# Read the entire file
pq.read_table('your_sample_file.parquet')

# Read specific columns
pq.read_table('your_sample_file.parquet', columns=['some_column', 'another_column'])

# Read the metadata of specific columns
file_metadata = pq.read_metadata('your_sample_file.parquet')
for col in file_metadata.row_group(0).to_dict()['columns']:
    print(col['column_path_in_schema'])
    print(col['num_values'])
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-04 UTC."],[[["\u003cp\u003eParquet data from Cloud Storage can be loaded into BigQuery, either into a new table or by appending to or overwriting an existing table or partition, with the data being converted to BigQuery's columnar Capacitor storage format during the process.\u003c/p\u003e\n"],["\u003cp\u003eWhen loading Parquet data, the Cloud Storage bucket and the BigQuery dataset containing the table must be in the same regional or multi-regional location, and the schema is automatically retrieved from the self-describing Parquet source data, typically from the alphabetically last file when loading multiple files.\u003c/p\u003e\n"],["\u003cp\u003eLoading Parquet data into BigQuery is subject to certain limitations, including requirements on dataset location, no guarantees of data consistency for external sources, no support for Cloud Storage object versioning, and the potential need to manage row sizes and row group sizes to prevent errors.\u003c/p\u003e\n"],["\u003cp\u003eBigQuery supports various compression codecs for Parquet files, and it provides different methods for loading data, including using the Google Cloud console, the \u003ccode\u003ebq\u003c/code\u003e command-line tool, the \u003ccode\u003ejobs.insert\u003c/code\u003e API method, and client libraries, while also offering options for data partitioning and clustering during the load process.\u003c/p\u003e\n"],["\u003cp\u003eWhen loading Parquet data, BigQuery performs conversions between Parquet and BigQuery data types, and it supports loading hive-partitioned data and geospatial data, as well as offering the flexibility of expanding the range of characters that can be used in column names when in preview mode.\u003c/p\u003e\n"]]],[],null,[]]