Loading ORC data from Cloud Storage

This page provides an overview of loading ORC data from Cloud Storage into BigQuery.

ORC is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.

When you load ORC data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.

For information about loading ORC data from a local file, see Loading data into BigQuery from a local data source .

Limitations

You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:

  • If your dataset's location is set to a value other than the US multi-region, then the Cloud Storage bucket must be in the same region or contained in the same multi-region as the dataset.
  • BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
  • BigQuery does not support Cloud Storage object versioning . If you include a generation number in the Cloud Storage URI, then the load job fails.

Before you begin

Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document, and create a dataset to store your data.

Required permissions

To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.

Permissions to load data into BigQuery

To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.tables.update
  • bigquery.jobs.create

Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin (includes the bigquery.jobs.create permission)
  • roles/bigquery.user (includes the bigquery.jobs.create permission)
  • roles/bigquery.jobUser (includes the bigquery.jobs.create permission)

Additionally, if you have the bigquery.datasets.create permission, you can create and update tables using a load job in the datasets that you create.

For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions .
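
For example, assuming you have permission to modify the project's IAM policy, a project-level grant of one of these roles might look like the following sketch (the project ID and user email are placeholders):

# Example only: replace my-project and the member with your own values.
gcloud projects add-iam-policy-binding my-project \
    --member="user:load-user@example.com" \
    --role="roles/bigquery.dataEditor"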

Permissions to load data from Cloud Storage

To get the permissions that you need to load data from a Cloud Storage bucket, ask your administrator to grant you the Storage Admin (roles/storage.admin) IAM role on the bucket. For more information about granting roles, see Manage access to projects, folders, and organizations.

This predefined role contains the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to load data from a Cloud Storage bucket:

  • storage.buckets.get
  • storage.objects.get
  • storage.objects.list (required if you are using a URI wildcard )

You might also be able to get these permissions with custom roles or other predefined roles .
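
For example, a bucket-level grant of the Storage Admin role might look like the following sketch (the bucket name and user email are placeholders):

# Example only: replace the bucket and member with your own values.
gcloud storage buckets add-iam-policy-binding gs://mybucket \
    --member="user:load-user@example.com" \
    --role="roles/storage.admin"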

Create a dataset

Create a BigQuery dataset to store your data.
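
For example, you can create a dataset with the bq command-line tool; the dataset name and location below are placeholders:

# Example only: choose your own dataset name and location.
bq --location=US mk --dataset mydataset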

ORC schemas

When you load ORC files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.

For example, you have the following ORC files in Cloud Storage:

gs://mybucket/00/
  a.orc
  z.orc
gs://mybucket/01/
  b.orc

Running this command in the bq command-line tool loads all of the files (as a comma-separated list), and the schema is derived from mybucket/01/b.orc:

bq load \
--source_format=ORC \
dataset.table \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"

When BigQuery detects the schema, some ORC data types are converted to BigQuery data types to make them compatible with GoogleSQL syntax. All fields in the detected schema are NULLABLE . For more information, see ORC conversions .

When you load multiple ORC files that have different schemas, identical fields (with the same name and same nested level) specified in multiple schemas must map to the same converted BigQuery data type in each schema definition.

To provide a table schema for creating external tables, set the referenceFileSchemaUri property in the BigQuery API or the --reference_file_schema_uri parameter in the bq command-line tool to the URL of the reference file.

For example, --reference_file_schema_uri="gs://mybucket/schema.orc" .
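
As a sketch, and assuming your version of the bq command-line tool accepts the flag on load jobs, a command that supplies a reference schema file might look like the following (the bucket, dataset, and table names are placeholders):

# Example only: bucket, dataset, and table names are placeholders.
bq load \
--source_format=ORC \
--reference_file_schema_uri="gs://mybucket/schema.orc" \
mydataset.mytable \
"gs://mybucket/data/*.orc"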

ORC compression

BigQuery supports the following compression codecs for ORC file contents:

  • Zlib
  • Snappy
  • LZO
  • LZ4

Data in ORC files doesn't remain compressed after it is uploaded to BigQuery. Data storage is reported in logical bytes or physical bytes, depending on the dataset storage billing model . To get information on storage usage, query the INFORMATION_SCHEMA.TABLE_STORAGE view .
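
For example, a query such as the following sketch reports both measures for a single table; the region qualifier, dataset, and table name are placeholders, and the column names assume the documented layout of the TABLE_STORAGE view:

# Example only: adjust the region qualifier, dataset, and table name.
bq query --nouse_legacy_sql '
SELECT table_name, total_logical_bytes, total_physical_bytes
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = "mydataset" AND table_name = "mytable"'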

Loading ORC data into a new table

You can load ORC data into a new table by:

  • Using the Google Cloud console
  • Using the bq command-line tool's bq load command
  • Calling the jobs.insert API method and configuring a load job
  • Using the client libraries

To load ORC data from Cloud Storage into a new BigQuery table:

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project, and then select a dataset.
  3. In the Dataset info section, click Create table .
  4. In the Create table panel, specify the following details:
    1. In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
      1. Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI . You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
      2. For File format , select ORC .
    2. In the Destination section, specify the following details:
      1. For Dataset , select the dataset in which you want to create the table.
      2. In the Table field, enter the name of the table that you want to create.
      3. Verify that the Table type field is set to Native table .
    3. In the Schema section, no action is necessary. The schema is self-described in ORC files.
    4. Optional: Specify Partition and cluster settings . For more information, see Creating partitioned tables and Creating and using clustered tables .
    5. Click Advanced options and do the following:
      • For Write preference , leave Write if empty selected. This option creates a new table and loads your data into it.
      • If you want to ignore values in a row that are not present in the table's schema, then select Unknown values .
      • For Encryption , click Customer-managed key to use a Cloud Key Management Service key . If you leave the Google-managed key setting, BigQuery encrypts the data at rest .
    6. Click Create table .

SQL

Use the LOAD DATA DDL statement . The following example loads an ORC file into the new table mytable :

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

     LOAD DATA OVERWRITE mydataset.mytable
     FROM FILES (
       format = 'ORC',
       uris = ['gs://bucket/path/file.orc']);

  3. Click Run.

For more information about how to run queries, see Run an interactive query .

bq

Use the bq load command, specify ORC as the source_format, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.

(Optional) Supply the --location flag and set the value to your location .

Other optional flags include:

  • --time_partitioning_type : Enables time-based partitioning on a table and sets the partition type. Possible values are HOUR , DAY , MONTH , and YEAR . This flag is optional when you create a table partitioned on a DATE , DATETIME , or TIMESTAMP column. The default partition type for time-based partitioning is DAY . You cannot change the partitioning specification on an existing table.
  • --time_partitioning_expiration : An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
  • --time_partitioning_field : The DATE or TIMESTAMP column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
  • --require_partition_filter : When enabled, this option requires users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables .
  • --clustering_fields : A comma-separated list of up to four column names used to create a clustered table .
  • --destination_kms_key : The Cloud KMS key for encryption of the table data.

    For more information, see Creating partitioned tables, Creating and using clustered tables, and the documentation on table encryption.

To load ORC data into BigQuery, enter the following command:

bq --location=location load \
--source_format=format \
dataset.table \
path_to_source

Where:

  • location is your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1 . You can set a default value for the location using the .bigqueryrc file .
  • format is ORC .
  • dataset is an existing dataset.
  • table is the name of the table into which you're loading data.
  • path_to_source is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

The following command loads data from gs://mybucket/mydata.orc into a table named mytable in mydataset.

bq load \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc

The following command loads data from gs://mybucket/mydata.orc into a new ingestion-time partitioned table named mytable in mydataset.

bq load \
--source_format=ORC \
--time_partitioning_type=DAY \
mydataset.mytable \
gs://mybucket/mydata.orc

The following command loads data from gs://mybucket/mydata.orc into a partitioned table named mytable in mydataset. The table is partitioned on the mytimestamp column.

bq load \
--source_format=ORC \
--time_partitioning_field mytimestamp \
mydataset.mytable \
gs://mybucket/mydata.orc

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard.

bq load \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata*.orc

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards.

bq load --autodetect \
--source_format=ORC \
mydataset.mytable \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"

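The following command is a sketch that combines column-based partitioning with clustering; mytimestamp and customer_id are placeholder column names that must exist in the ORC data.

# Example only: mytimestamp and customer_id are placeholder column names.
bq load \
--source_format=ORC \
--time_partitioning_field mytimestamp \
--clustering_fields customer_id \
mydataset.mytable \
gs://mybucket/mydata.orc
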
API

  1. Create a load job that points to the source data in Cloud Storage.

  2. (Optional) Specify your location in the location property in the jobReference section of the job resource .

  3. The source URIs property must be fully-qualified, in the format gs://bucket/object. Each URI can contain one '*' wildcard character .

  4. Specify the ORC data format by setting the sourceFormat property to ORC (see the configuration sketch after this list).

  5. To check the job status, call jobs.get(job_id), where job_id is the ID of the job returned by the initial request.

    • If status.state = DONE , the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object includes information describing what went wrong. When a request fails, no table is created and no data is loaded.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.
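
For reference, a minimal load configuration in the job resource might look like the following sketch (the project, dataset, table, and bucket names are placeholders):

{
  "jobReference": {
    "projectId": "my-project",
    "jobId": "my-unique-job-id",
    "location": "US"
  },
  "configuration": {
    "load": {
      "sourceUris": ["gs://mybucket/mydata.orc"],
      "sourceFormat": "ORC",
      "destinationTable": {
        "projectId": "my-project",
        "datasetId": "mydataset",
        "tableId": "mytable"
      },
      "writeDisposition": "WRITE_EMPTY"
    }
  }
}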

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations succeeds.

C#

Before trying this sample, follow the C# setup instructions in the BigQuery quickstart using client libraries . For more information, see the BigQuery C# API reference documentation .

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries .

using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsOrc
{
    public void LoadTableGcsOrc(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
        var dataset = client.GetDataset(datasetId);
        TableReference destinationTableRef = dataset.GetTableReference(
            tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            SourceFormat = FileFormat.Orc
        };
        // Create and run job
        var loadJob = client.CreateLoadJob(
            sourceUri: gcsURI,
            destination: destinationTableRef,
            // Pass null as the schema because the schema is inferred when
            // loading Orc data
            schema: null,
            options: jobOptions);
        // Waits for the job to complete.
        loadJob = loadJob.PollUntilCompleted().ThrowOnAnyError();
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine(
            $"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}