Loading ORC data from Cloud Storage
This page provides an overview of loading ORC data from Cloud Storage into
BigQuery.
ORC is an open source column-oriented data format that is widely used in the
Apache Hadoop ecosystem.
When you load ORC data from Cloud Storage, you can load the data into a new
table or partition, or you can append to or overwrite an existing table or
partition. When your data is loaded into BigQuery, it is
converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table,
the dataset that contains the table must be in the same regional or
multi-regional location as the Cloud Storage bucket.
You are subject to the following limitations when you load data into
BigQuery from a Cloud Storage bucket:
If your dataset's location is set to a value other than the US multi-region,
then the Cloud Storage bucket must be in the same region or contained
in the same multi-region as the dataset.
BigQuery does not guarantee data consistency for external data
sources. Changes to the underlying data while a query is running can result in
unexpected behavior.
BigQuery does not support Cloud Storage object versioning. If you
include a generation number in the Cloud Storage URI, then the load job
fails.
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary
permissions to perform each task in this document, and create a dataset
to store your data.
Required permissions
To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.
Permissions to load data into BigQuery
To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:
bigquery.tables.create
bigquery.tables.updateData
bigquery.tables.update
bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:
roles/bigquery.dataEditor
roles/bigquery.dataOwner
roles/bigquery.admin (includes the bigquery.jobs.create permission)
bigquery.user (includes the bigquery.jobs.create permission)
bigquery.jobUser (includes the bigquery.jobs.create permission)
To get the permissions that you need to load data from a Cloud Storage bucket,
ask your administrator to grant you the Storage Admin (roles/storage.admin) IAM role on the bucket.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains
the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are
required, expand the Required permissions section:
Required permissions
The following permissions are required to load data from a Cloud Storage bucket:
storage.buckets.get
storage.objects.get
storage.objects.list (required if you are using a URI wildcard)
ORC schemas
When you load ORC files into BigQuery, the table schema is
automatically retrieved from the self-describing source data. When
BigQuery retrieves the schema from the source data, the
alphabetically last file is used.
For example, suppose you have the following ORC files in Cloud Storage:
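gs://mybucket/00/
  a.orc
  z.orc
gs://mybucket/01/
  b.orc

Running the following bq command-line tool command loads all of the files (as a
comma-separated list), and the schema is derived from gs://mybucket/01/b.orc,
the alphabetically last file (the bucket and file names here are illustrative):

bq load \
--source_format=ORC \
mydataset.mytable \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"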
When BigQuery detects the schema, some ORC data types are
converted to BigQuery data types to make them compatible with
GoogleSQL syntax. All fields in the detected schema are NULLABLE. For more information, see ORC conversions.
When you load multiple ORC files that have different schemas, identical
fields (with the same name and same nested level) specified in multiple
schemas must map to the same converted BigQuery data type in
each schema definition.
To provide a table schema for creating external tables, set the referenceFileSchemaUri property in the BigQuery API or the --reference_file_schema_uri parameter in the bq command-line tool
to the URL of the reference file.
For example, --reference_file_schema_uri="gs://mybucket/schema.orc".
ORC compression
BigQuery supports the following compression codecs for
ORC file contents:
Zlib
Snappy
LZO
LZ4
Data in ORC files doesn't remain compressed after it is uploaded to
BigQuery. Data storage is reported in logical bytes or physical
bytes, depending on the dataset storage billing model.
To get information on storage usage, query the INFORMATION_SCHEMA.TABLE_STORAGE view.
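For example, the following bq query command returns the logical and physical
bytes for a single table; the region qualifier and table name are illustrative:

bq query --nouse_legacy_sql \
'SELECT table_name, total_logical_bytes, total_physical_bytes
 FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
 WHERE table_name = "mytable"'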
Loading ORC data into a new table
You can load ORC data into a new table by:
Using the Google Cloud console
Using the bq command-line tool's bq load command
Calling the jobs.insert API method and configuring a load job
Using the client libraries
To load ORC data from Cloud Storage into a new BigQuery
table:
Console
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, expand your project, and then select a dataset.
In the Dataset info section, click Create table.
In the Create table panel, specify the following details:
In the Source section, select Google Cloud Storage in the Create table from list.
Then, do the following:
Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI.
You cannot include multiple URIs
in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same
location as the dataset that contains the table you want to create, append, or
overwrite.
For File format, select ORC.
In the Destination section, specify the following details:
For Dataset, select the dataset in which you want to create the
table.
In the Table field, enter the name of the table that you want to create.
Verify that the Table type field is set to Native table.
In the Schema section, no action is necessary. The schema is
self-described in ORC files.
bq
Use the bq load command, specify ORC as the source_format, and include a Cloud Storage URI.
You can include a single URI, a comma-separated list of URIs, or a URI
containing a wildcard.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
--time_partitioning_type: Enables time-based partitioning on a table and
sets the partition type. Possible values are HOUR, DAY, MONTH, and YEAR. This flag is optional when you create a
table partitioned on a DATE, DATETIME, or TIMESTAMP column. The
default partition type for time-based partitioning is DAY. You cannot
change the partitioning specification on an existing table.
--time_partitioning_expiration: An integer that specifies (in seconds)
when a time-based partition should be deleted. The expiration time evaluates
to the partition's UTC date plus the integer value.
--time_partitioning_field: The DATE or TIMESTAMP column used to
create a partitioned table. If time-based partitioning is enabled without
this value, an ingestion-time partitioned table is created.
--require_partition_filter: When enabled, this option requires users
to include a WHERE clause that specifies the partitions to query.
Requiring a partition filter can reduce cost and improve performance.
For more information, see Querying partitioned tables.
--clustering_fields: A comma-separated list of up to four column names
used to create a clustered table.
--destination_kms_key: The Cloud KMS key for encryption of the
table data.
For more information about partitioned tables, see Creating partitioned tables.
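To load ORC data into BigQuery, enter the following command:

bq --location=location load \
--source_format=format \
dataset.table \
path_to_source

where: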
location is your location. The --location flag is optional.
For example, if you are using BigQuery in the Tokyo region,
you can set the flag's value to asia-northeast1. You can set a default
value for the location using the .bigqueryrc file.
format is ORC.
dataset is an existing dataset.
table is the name of the table into which you're loading data.
path_to_source is a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.orc into a
table named mytable in mydataset.
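bq load \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc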
The following command loads data from gs://mybucket/mydata.orc into a
partitioned table named mytable in mydataset. The table is partitioned
on the mytimestamp column.
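bq load \
--source_format=ORC \
--time_partitioning_field mytimestamp \
mydataset.mytable \
gs://mybucket/mydata.orc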
The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards.
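For example (the wildcard paths shown are illustrative):

bq load \
--source_format=ORC \
mydataset.mytable \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"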
API
Create a load job that points to the source data in Cloud Storage.
(Optional) Specify your location in
the location property in the jobReference section of the job resource.
The source URIs property must be fully qualified, in the format gs://bucket/object.
Each URI can contain one '*' wildcard character.
Specify the ORC data format by setting the sourceFormat property to ORC.
To check the job status, call jobs.get(job_id*),
where job_id is the ID of the job returned by the initial
request.
If status.state = DONE, the job completed successfully.
If the status.errorResult property is present, the request failed,
and that object includes information describing what went wrong.
When a request fails, no table is created and no data is loaded.
If status.errorResult is absent, the job finished successfully,
although there might have been some non-fatal errors, such as problems
importing a few rows. Non-fatal errors are listed in the returned job
object's status.errors property.
API notes:
Load jobs are atomic and consistent; if a load job fails, none of the data
is available, and if a load job succeeds, all of the data is available.
As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This
approach is more robust to network failure because the client can poll or
retry on the known job ID.
Calling jobs.insert on a given job ID is idempotent. You can retry as
many times as you like on the same job ID, and at most one of those
operations succeeds.
C#

using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsOrc
{
    public void LoadTableGcsOrc(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
        var dataset = client.GetDataset(datasetId);
        TableReference destinationTableRef = dataset.GetTableReference(
            tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            SourceFormat = FileFormat.Orc
        };
        // Create and run job
        var loadJob = client.CreateLoadJob(
            sourceUri: gcsURI, destination: destinationTableRef,
            // Pass null as the schema because the schema is inferred when
            // loading Orc data
            schema: null, options: jobOptions);
        loadJob = loadJob.PollUntilCompleted().ThrowOnAnyError();  // Waits for the job to complete.
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine(
            $"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}
Go

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importORC demonstrates loading Apache ORC data from Cloud Storage into a table.
func importORC(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.orc")
	gcsRef.SourceFormat = bigquery.ORC
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
Java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;

// Sample to load ORC data from Cloud Storage into a new BigQuery table
public class LoadOrcFromGCS {

  public static void runLoadOrcFromGCS() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("post_abbr", StandardSQLTypeName.STRING));
    loadOrcFromGCS(datasetName, tableName, sourceUri, schema);
  }

  public static void loadOrcFromGCS(
      String datasetName, String tableName, String sourceUri, Schema schema) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(datasetName, tableName);
      LoadJobConfiguration loadConfig =
          LoadJobConfiguration.newBuilder(tableId, sourceUri, FormatOptions.orc())
              .setSchema(schema)
              .build();

      // Load data from a GCS ORC file into the table
      Job job = bigquery.create(JobInfo.of(loadConfig));
      // Blocks until this load table job completes its execution, either failing or succeeding.
      job = job.waitFor();
      if (job.isDone() && job.getStatus().getError() == null) {
        System.out.println("ORC from GCS successfully added during load append job");
      } else {
        System.out.println(
            "BigQuery was unable to load into the table due to an error:"
                + job.getStatus().getError());
      }
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Column not added during load append \n" + e.toString());
    }
  }
}
Node.js

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the ORC file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.orc
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.orc';

async function loadTableGCSORC() {
  // Imports a GCS file into a table with ORC source format.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'ORC',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
PHP

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('ORC');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
Python

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.ORC)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
require"google/cloud/bigquery"defload_table_gcs_orcdataset_id="your_dataset_id"bigquery=Google::Cloud::Bigquery.newdataset=bigquery.datasetdataset_idgcs_uri="gs://cloud-samples-data/bigquery/us-states/us-states.orc"table_id="us_states"load_job=dataset.load_jobtable_id,gcs_uri,format:"orc"puts"Startingjob#{load_job.job_id}"load_job.wait_until_done!# Waits for table load to complete.puts"Jobfinished."table=dataset.tabletable_idputs"Loaded#{table.rows_count} rows to table #{table.id}"end
Append to or overwrite a table with ORC data
You can load additional data into a table either from source files or by
appending query results.
In the Google Cloud console, use the Write preference option to specify
what action to take when you load data from a source file or from a query
result.
You have the following options when you load additional data into a table:
Console option | bq tool flag | BigQuery API property | Description
Write if empty | Not supported | WRITE_EMPTY | Writes the data only if the table is empty.
Append to table | --noreplace or --replace=false; if --[no]replace is unspecified, the default is append | WRITE_APPEND | (Default) Appends the data to the end of the table.
Overwrite table | --replace or --replace=true | WRITE_TRUNCATE | Erases all existing data in a table before writing the new data. This action also deletes the table schema and row-level security, and removes any Cloud KMS key.
If you load data into an existing table, the load job can append the data or
overwrite the table.
You can append or overwrite a table by:
Using the Google Cloud console
Using the bq command-line tool's bq load command
Calling the jobs.insert API method and configuring a load job
Using the client libraries
To append or overwrite a table with ORC data:
Console
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, expand your project, and then select a dataset.
In the Dataset info section, click Create table.
In the Create table panel, specify the following details:
In the Source section, select Google Cloud Storage in the Create table from list.
Then, do the following:
Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI.
You cannot include multiple URIs
in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same
location as the dataset that contains the table you want to create, append, or
overwrite.
For File format, select ORC.
In the Destination section, specify the following details:
For Dataset, select the dataset in which you want to create the
table.
In the Table field, enter the name of the table that you want to create.
Verify that the Table type field is set to Native table.
In the Schema section, no action is necessary. The schema is
self-described in ORC files.
Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables.
You cannot convert a table to a partitioned or clustered table by appending or
overwriting it. The Google Cloud console does not support
appending to or overwriting partitioned or clustered tables in a load job.
Click Advanced options and do the following:
For Write preference, choose Append to table or Overwrite table.
If you want to ignore values in a row that are not present in the table's schema, then
select Unknown values.
bq
Enter the bq load command with the --replace flag to overwrite the
table. Use the --noreplace flag to append data to the table. If no flag is
specified, the default is to append data. Supply the --source_format flag
and set it to ORC. Because ORC schemas are automatically retrieved
from the self-describing source data, you don't need to provide a schema
definition.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
--destination_kms_key: The Cloud KMS key for encryption of the
table data.
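For example, the following command overwrites mytable in mydataset with data
from gs://mybucket/mydata.orc (the bucket and table names are illustrative):

bq load \
--replace \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc

To append instead, use the --noreplace flag:

bq load \
--noreplace \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc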
API
Create a load job that points to the source data in Cloud Storage.
(Optional) Specify your location in the location property in the jobReference section of the job resource.
The source URIs property
must be fully qualified, in the format gs://bucket/object. You can
include multiple URIs as a comma-separated list. Note that wildcards are
also supported.
Specify the data format by setting the configuration.load.sourceFormat property to ORC.
Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.
C#

using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsOrcTruncate
{
    public void LoadTableGcsOrcTruncate(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id",
        string tableId = "your_table_id"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
        var dataset = client.GetDataset(datasetId);
        TableReference destinationTableRef = dataset.GetTableReference(
            tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            SourceFormat = FileFormat.Orc,
            WriteDisposition = WriteDisposition.WriteTruncate
        };
        // Create and run job
        var loadJob = client.CreateLoadJob(
            sourceUri: gcsURI, destination: destinationTableRef,
            // Pass null as the schema because the schema is inferred when
            // loading Orc data
            schema: null, options: jobOptions);
        loadJob = loadJob.PollUntilCompleted().ThrowOnAnyError();  // Waits for the job to complete.
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine(
            $"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}
Go

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importORCTruncate demonstrates loading Apache ORC data from Cloud Storage into a table
// and overwriting/truncating existing data in the table.
func importORCTruncate(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.orc")
	gcsRef.SourceFormat = bigquery.ORC
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
	// Default for import jobs is to append data to a table. WriteTruncate
	// specifies that existing data should instead be replaced/overwritten.
	loader.WriteDisposition = bigquery.WriteTruncate

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}
Java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

// Sample to overwrite the BigQuery table data by loading a ORC file from GCS
public class LoadOrcFromGcsTruncate {

  public static void runLoadOrcFromGcsTruncate() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
    loadOrcFromGcsTruncate(datasetName, tableName, sourceUri);
  }

  public static void loadOrcFromGcsTruncate(
      String datasetName, String tableName, String sourceUri) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(datasetName, tableName);
      LoadJobConfiguration loadConfig =
          LoadJobConfiguration.newBuilder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.orc())
              // Set the write disposition to overwrite existing table data
              .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
              .build();

      // Load data from a GCS ORC file into the table
      Job job = bigquery.create(JobInfo.of(loadConfig));
      // Blocks until this load table job completes its execution, either failing or succeeding.
      job = job.waitFor();
      if (job.isDone() && job.getStatus().getError() == null) {
        System.out.println("Table is successfully overwritten by ORC file loaded from GCS");
      } else {
        System.out.println(
            "BigQuery was unable to load into the table due to an error:"
                + job.getStatus().getError());
      }
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Column not added during load append \n" + e.toString());
    }
  }
}
Node.js

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate the clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the ORC file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.orc
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.orc';

async function loadORCFromGCSTruncate() {
  /**
   * Imports a GCS file into a table and overwrites
   * table data if table already exists.
   */

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'ORC',
    // Set the write disposition to overwrite existing table data.
    writeDisposition: 'WRITE_TRUNCATE',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
PHP

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';
// $tableID = 'The BigQuery table ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$table = $bigQuery->dataset($datasetId)->table($tableId);

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('ORC')->writeDisposition('WRITE_TRUNCATE');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}
Python

import io

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
)

body = io.BytesIO(b"Washington,WA")
client.load_table_from_file(body, table_id, job_config=job_config).result()
previous_rows = client.get_table(table_id).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.ORC,
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"
load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
require"google/cloud/bigquery"defload_table_gcs_orc_truncatedataset_id="your_dataset_id",table_id="your_table_id"bigquery=Google::Cloud::Bigquery.newdataset=bigquery.datasetdataset_idgcs_uri="gs://cloud-samples-data/bigquery/us-states/us-states.orc"load_job=dataset.load_jobtable_id,gcs_uri,format:"orc",write:"truncate"puts"Startingjob#{load_job.job_id}"load_job.wait_until_done!# Waits for table load to complete.puts"Jobfinished."table=dataset.tabletable_idputs"Loaded#{table.rows_count} rows to table #{table.id}"end
Load hive-partitioned ORC data
BigQuery supports loading hive-partitioned ORC data stored in
Cloud Storage and populates the hive partitioning columns as columns in
the destination BigQuery managed table. For more information, see Loading externally partitioned data from Cloud Storage.
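As a sketch, a load job with hive partition inference enabled might look like
the following bq command; the bucket layout and source URI prefix are
assumptions:

bq load \
--source_format=ORC \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=gs://mybucket/orc/ \
mydataset.mytable \
"gs://mybucket/orc/*"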
ORC conversions
BigQuery converts ORC data types to the following
BigQuery data types:
Primitive types

ORC data type | BigQuery data type | Notes
boolean | BOOLEAN |
byte | INTEGER |
short | INTEGER |
int | INTEGER |
long | INTEGER |
float | FLOAT |
double | FLOAT |
string | STRING | UTF-8 only
varchar | STRING | UTF-8 only
char | STRING | UTF-8 only
binary | BYTES |
date | DATE | An attempt to convert any value in the ORC data that is less than -719162 days or greater than 2932896 days returns an "invalid date value" error. If this affects you, contact Support to have unsupported values converted to the BigQuery minimum value of 0001-01-01 or maximum value of 9999-12-31, as appropriate.
timestamp | TIMESTAMP | ORC supports nanosecond precision, but BigQuery converts sub-microsecond values to microseconds when the data is read. An attempt to convert any value in the ORC data that is less than -719162 days or greater than 2932896 days returns an "invalid date value" error. If this affects you, contact Support to have unsupported values converted to the BigQuery minimum value of 0001-01-01 or maximum value of 9999-12-31, as appropriate.
Decimal logical types can be converted to NUMERIC, BIGNUMERIC, or STRING types. The converted type depends
on the precision and scale parameters of the decimal logical type and the
specified decimal target types. Specify the decimal target types with the --decimal_target_types flag in the bq command-line tool or the JobConfigurationLoad.decimalTargetTypes field in the BigQuery API.
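For example, the following bq command prefers NUMERIC and falls back to STRING
for decimal columns that don't fit; the target-type list and paths are
illustrative:

bq load \
--source_format=ORC \
--decimal_target_types=NUMERIC,STRING \
mydataset.mytable \
gs://mybucket/mydata.orc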
Complex types

ORC data type | BigQuery data type | Notes
map<K,V> | repeated RECORD | An ORC map<K,V> field is converted to a repeated RECORD that contains two fields: a key of the same data type as K, and a value of the same data type as V. Both fields are NULLABLE.
list | repeated fields | Nested lists and lists of maps are not supported.
union | RECORD | When a union has only one variant, it is converted to a NULLABLE field. Otherwise, a union is converted to a RECORD with a list of NULLABLE fields. The NULLABLE fields have suffixes such as field_0, field_1, and so on. Only one of these fields is assigned a value when the data is read.
Column names
A column name can contain letters (a-z, A-Z), numbers (0-9), or underscores
(_), and it must start with a letter or underscore. If you use flexible column
names, BigQuery supports starting a column name with a number.
Exercise caution when starting columns with a number, since using flexible
column names with the BigQuery Storage Read API or
BigQuery Storage Write API requires special handling. For more information about
flexible column name support, see Flexible column names.
Column names have a maximum length of 300 characters. Column names can't use any
of the following prefixes:
_TABLE_
_FILE_
_PARTITION
_ROW_TIMESTAMP
__ROOT__
_COLIDENTIFIER
Duplicate column names are not allowed even if the case differs. For example, a
column named Column1 is considered identical to a column named column1. To
learn more about column naming rules, see Column names in the
GoogleSQL reference.
If a table name (for example, test) is the same as one of its column names
(for example, test), the SELECT expression interprets the test column as
a STRUCT containing all other table columns. To avoid this collision, use
one of the following methods:
Avoid using the same name for a table and its columns.
Assign the table a different alias. For example, the following query assigns
a table alias t to the table project1.dataset.test:
SELECT test FROM project1.dataset.test AS t;
Include the table name when referencing a column. For example:
SELECT test.test FROM project1.dataset.test;
Flexible column names
You have more flexibility in what you name columns, including expanded access
to characters in languages other than English as well as additional symbols.
Flexible column names support the following characters:
Any letter in any language, as represented by the Unicode regular expression \p{L}.
Any numeric character in any language as represented by the Unicode regular
expression \p{N}.
Any connector punctuation character, including underscores, as represented
by the Unicode regular expression \p{Pc}.
A hyphen or dash as represented by the Unicode regular expression \p{Pd}.
Any mark intended to accompany another character as represented by the
Unicode regular expression \p{M}.
For example, accents, umlauts, or enclosing boxes.
The following special characters:
An ampersand (&) as represented by the Unicode regular expression \u0026.
A percent sign (%) as represented by the Unicode regular expression \u0025.
An equals sign (=) as represented by the Unicode regular expression \u003D.
A plus sign (+) as represented by the Unicode regular expression \u002B.
A colon (:) as represented by the Unicode regular expression \u003A.
An apostrophe (') as represented by the Unicode regular expression \u0027.
A less-than sign (<) as represented by the Unicode regular expression \u003C.
A greater-than sign (>) as represented by the Unicode regular expression \u003E.
A number sign (#) as represented by the Unicode regular expression \u0023.
A vertical line (|) as represented by the Unicode regular expression \u007C.
Whitespace.
Flexible column names don't support the following special characters:
An exclamation mark (!) as represented by the Unicode regular expression \u0021.
A quotation mark (") as represented by the Unicode regular expression \u0022.
A dollar sign ($) as represented by the Unicode regular expression \u0024.
A left parenthesis (() as represented by the Unicode regular expression \u0028.
A right parenthesis ()) as represented by the Unicode regular expression \u0029.
An asterisk (*) as represented by the Unicode regular expression \u002A.
A comma (,) as represented by the Unicode regular expression \u002C.
A period (.) as represented by the Unicode regular expression \u002E.
A slash (/) as represented by the Unicode regular expression \u002F.
A semicolon (;) as represented by the Unicode regular expression \u003B.
A question mark (?) as represented by the Unicode regular expression \u003F.
An at sign (@) as represented by the Unicode regular expression \u0040.
A left square bracket ([) as represented by the Unicode regular expression \u005B.
A backslash (\) as represented by the Unicode regular expression \u005C.
A right square bracket (]) as represented by the Unicode regular expression \u005D.
A circumflex accent (^) as represented by the Unicode regular expression \u005E.
A grave accent (`) as represented by the Unicode regular expression \u0060.
A left curly bracket ({) as represented by the Unicode regular expression \u007B.
A right curly bracket (}) as represented by the Unicode regular expression \u007D.
A tilde (~) as represented by the Unicode regular expression \u007E.
The expanded column characters are supported by both the BigQuery Storage Read API
and the BigQuery Storage Write API. To use the expanded list of Unicode characters
with the BigQuery Storage Read API, you must set a flag. You can use the displayName attribute to retrieve the column name. The following example
shows how to set a flag with the Python client:
from google.cloud.bigquery_storage import types

requested_session = types.ReadSession()

# Set Avro serialization options for flexible column.
options = types.AvroSerializationOptions()
options.enable_display_name_attribute = True
requested_session.read_options.avro_serialization_options = options
To use the expanded list of Unicode characters with the BigQuery Storage Write API,
you must provide the schema with column_name notation, unless you are using
the JsonStreamWriter writer object. The following example shows how to
provide the schema:
syntax="proto2";packagemypackage;// Source protos located in github.com/googleapis/googleapisimport"google/cloud/bigquery/storage/v1/annotations.proto";messageFlexibleSchema{optionalstringitem_name_column=1[(.google.cloud.bigquery.storage.v1.column_name)="name-列"];optionalstringitem_description_column=2[(.google.cloud.bigquery.storage.v1.column_name)="description-列"];}
In this example, item_name_column and item_description_column are
placeholder names that need to be compliant with the protocol buffer naming
convention. Note that column_name annotations always take precedence over
placeholder names.
Limitations
Flexible column names are not supported with external tables.
NULL values
Note that for load jobs, BigQuery ignores NULL elements for the list compound type, since otherwise they would be translated to NULL ARRAY elements, which cannot persist to a table (see Data Types for
details).