Using schema auto-detection
Schema auto-detection
Schema auto-detection enables BigQuery to infer the schema for
CSV, JSON, or Google Sheets data. Schema auto-detection is available when you load data into BigQuery and when
you query an external data source.
When auto-detection is enabled, BigQuery infers the data type for
each column. BigQuery selects a random file in the data source
and scans up to the first 500 rows of data to use as a representative sample.
BigQuery then examines each field and attempts to assign a data
type to that field based on the values in the sample. If all of the rows in a
column are empty, auto-detection defaults to the STRING data type for the column.
If you don't enable schema auto-detection for CSV, JSON, or Google Sheets
data, then you must provide the schema manually when creating the table.
You don't need to enable schema auto-detection for Avro, Parquet, ORC, Firestore
export, or Datastore export files. These file formats are self-describing, so
BigQuery automatically infers the table schema from the source
data. For Parquet, Avro, and ORC files, you can optionally provide an explicit
schema to override the inferred schema.
You can see the detected schema for a table in the Google Cloud console or by using the bq command-line tool.
To enable schema auto-detection when loading data, use one of these approaches:
In the Google Cloud console, in the Schema section, for Auto detect,
check the Schema and input parameters option.
In the bq command-line tool, use the bq load command with the --autodetect parameter.
When schema auto-detection is enabled, BigQuery makes a
best-effort attempt to automatically infer the schema for CSV and JSON files.
The auto-detection logic infers the schema field types by reading up to the
first 500 rows of data. Leading lines are skipped if the --skip_leading_rows flag is present. The field types are based on the rows having the most fields.
Therefore, auto-detection should work as expected as long as there is at least
one row of data that has values in every column or field.
Schema auto-detection is not used with Avro files, Parquet files, ORC files,
Firestore export files, or Datastore export files. When you
load these files into BigQuery, the table schema is automatically
retrieved from the self-describing source data.
To use schema auto-detection when you load JSON or CSV data:
Console
In the Google Cloud console, go to the BigQuery page.
In the Explorer panel, expand your project and select a dataset.
Expand the Actions option and click Open.
In the details panel, click Create table.
On the Create table page, in the Source section:
For Create table from, select your desired source type.
In the source field, browse for the file or Cloud Storage bucket, or
enter the Cloud Storage URI. Note that you cannot
include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same
location as the dataset that contains the table you're creating.
For File format, select CSV or JSON.
On the Create table page, in the Destination section:
For Dataset name, choose the appropriate dataset.
In the Table name field, enter the name of the table you're creating.
Verify that Table type is set to Native table.
In the Schema section, for Auto detect, check the Schema and input parameters option.
Click Create table.
bq
Issue the bq load command with the --autodetect parameter.
(Optional) Supply the --location flag and set the value to your location.
The following command loads a file using schema auto-detect:
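bq --location=LOCATION load \
--autodetect \
--source_format=FORMAT \
DATASET.TABLE \
PATH_TO_SOURCE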
LOCATION: the name of your location. The --location flag is optional. For example, if you are using
BigQuery in the Tokyo region, set the flag's value to asia-northeast1. You can set a default value for the location by using
the .bigqueryrc file.
FORMAT: either NEWLINE_DELIMITED_JSON or CSV.
DATASET: the dataset that contains the table
into which you're loading data.
TABLE: the name of the table into which you're
loading data.
PATH_TO_SOURCE: the location of the CSV or
JSON file.
Examples:
Enter the following command to load myfile.csv from your local
machine into a table named mytable that is stored in a dataset named mydataset.
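bq load \
--autodetect \
--source_format=CSV \
mydataset.mytable \
./myfile.csv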
Create a load job that points to the source data. For information about
creating jobs, see Running BigQuery jobs programmatically.
Specify your location in the location property in the jobReference section.
Specify the data format by setting the sourceFormat property. To use
schema auto-detection, this value must be set to NEWLINE_DELIMITED_JSON or CSV.
Use the autodetect property to set schema auto-detection to true.
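For example, a minimal load job configuration that enables schema auto-detection for a newline-delimited JSON file might look similar to the following sketch; the bucket, file, project, dataset, and table names are placeholders:

{
  "jobReference": {
    "location": "US"
  },
  "configuration": {
    "load": {
      "sourceUris": ["gs://BUCKET/FILE.json"],
      "sourceFormat": "NEWLINE_DELIMITED_JSON",
      "autodetect": true,
      "destinationTable": {
        "projectId": "PROJECT_ID",
        "datasetId": "DATASET",
        "tableId": "TABLE"
      }
    }
  }
}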
import("context""fmt""cloud.google.com/go/bigquery")// importJSONAutodetectSchema demonstrates loading data from newline-delimited JSON data in Cloud Storage// and using schema autodetection to identify the available columns.funcimportJSONAutodetectSchema(projectID,datasetID,tableIDstring)error{// projectID := "my-project-id"// datasetID := "mydataset"// tableID := "mytable"ctx:=context.Background()client,err:=bigquery.NewClient(ctx,projectID)iferr!=nil{returnfmt.Errorf("bigquery.NewClient: %v",err)}deferclient.Close()gcsRef:=bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.json")gcsRef.SourceFormat=bigquery.JSONgcsRef.AutoDetect=trueloader:=client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)loader.WriteDisposition=bigquery.WriteEmptyjob,err:=loader.Run(ctx)iferr!=nil{returnerr}status,err:=job.Wait(ctx)iferr!=nil{returnerr}ifstatus.Err()!=nil{returnfmt.Errorf("job completed with error: %v",status.Err())}returnnil}
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

// Sample to load JSON data with autodetect schema from Cloud Storage into a new BigQuery table
public class LoadJsonFromGCSAutodetect {

  public static void runLoadJsonFromGCSAutodetect() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.json";
    loadJsonFromGCSAutodetect(datasetName, tableName, sourceUri);
  }

  public static void loadJsonFromGCSAutodetect(String datasetName, String tableName, String sourceUri) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(datasetName, tableName);

      LoadJobConfiguration loadConfig =
          LoadJobConfiguration.newBuilder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.json())
              .setAutodetect(true)
              .build();

      // Load data from a GCS JSON file into the table
      Job job = bigquery.create(JobInfo.of(loadConfig));
      // Blocks until this load table job completes its execution, either failing or succeeding.
      job = job.waitFor();
      if (job.isDone() && job.getStatus().getError() == null) {
        System.out.println("Json Autodetect from GCS successfully loaded in a table");
      } else {
        System.out.println(
            "BigQuery was unable to load into the table due to an error:" + job.getStatus().getError());
      }
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Load job was interrupted or failed: \n" + e.toString());
    }
  }
}
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

// Sample to load CSV data with autodetect schema from Cloud Storage into a new BigQuery table
public class LoadCsvFromGcsAutodetect {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv";
    loadCsvFromGcsAutodetect(datasetName, tableName, sourceUri);
  }

  public static void loadCsvFromGcsAutodetect(String datasetName, String tableName, String sourceUri) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(datasetName, tableName);

      // Skip header row in the file.
      CsvOptions csvOptions = CsvOptions.newBuilder().setSkipLeadingRows(1).build();

      LoadJobConfiguration loadConfig =
          LoadJobConfiguration.newBuilder(tableId, sourceUri)
              .setFormatOptions(csvOptions)
              .setAutodetect(true)
              .build();

      // Load data from a GCS CSV file into the table
      Job job = bigquery.create(JobInfo.of(loadConfig));
      // Blocks until this load table job completes its execution, either failing or succeeding.
      job = job.waitFor();
      if (job.isDone() && job.getStatus().getError() == null) {
        System.out.println("CSV Autodetect from GCS successfully loaded in a table");
      } else {
        System.out.println(
            "BigQuery was unable to load into the table due to an error:" + job.getStatus().getError());
      }
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Load job was interrupted or failed: \n" + e.toString());
    }
  }
}
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const datasetId = "my_dataset";
// const tableId = "my_table";

/**
 * This sample loads the JSON file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.json
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.json';

async function loadJSONFromGCSAutodetect() {
  // Imports a GCS file into a table with autodetected schema.

  // Instantiate clients
  const bigquery = new BigQuery();
  const storage = new Storage();

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'NEWLINE_DELIMITED_JSON',
    autodetect: true,
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);
  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}
loadJSONFromGCSAutodetect();
use Google\Cloud\BigQuery\BigQueryClient;

/**
 * Imports data to the given table from json file present in GCS by auto
 * detecting options and schema.
 *
 * @param string $projectId The project Id of your Google Cloud Project.
 * @param string $datasetId The BigQuery dataset ID.
 * @param string $tableId The BigQuery table ID.
 */
function import_from_storage_json_autodetect(
    string $projectId,
    string $datasetId,
    string $tableId = 'us_states'
): void {
    // instantiate the bigquery table service
    $bigQuery = new BigQueryClient([
        'projectId' => $projectId,
    ]);
    $dataset = $bigQuery->dataset($datasetId);
    $table = $dataset->table($tableId);

    // create the import job
    $gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.json';
    $loadConfig = $table->loadFromStorage($gcsUri)->autodetect(true)->sourceFormat('NEWLINE_DELIMITED_JSON');
    $job = $table->runJob($loadConfig);

    // check if the job is complete
    $job->reload();
    if (!$job->isComplete()) {
        throw new \Exception('Job has not yet completed', 500);
    }
    // check if the job has errors
    if (isset($job->info()['status']['errorResult'])) {
        $error = $job->info()['status']['errorResult']['message'];
        printf('Error running job: %s' . PHP_EOL, $error);
    } else {
        print('Data imported successfully' . PHP_EOL);
    }
}
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    autodetect=True, source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.json"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
require"google/cloud/bigquery"defload_table_gcs_json_autodetectdataset_id="your_dataset_id"bigquery=Google::Cloud::Bigquery.newdataset=bigquery.datasetdataset_idgcs_uri="gs://cloud-samples-data/bigquery/us-states/us-states.json"table_id="us_states"load_job=dataset.load_jobtable_id,gcs_uri,format:"json",autodetect:trueputs"Starting job#{load_job.job_id}"load_job.wait_until_done!# Waits for table load to complete.puts"Job finished."table=dataset.tabletable_idputs"Loaded#{table.rows_count}rows to table#{table.id}"end
Schema auto-detection for external data sources
Schema auto-detection can be used with CSV, JSON, and Google Sheets external
data sources. When schema auto-detection is enabled, BigQuery
makes a best-effort attempt to automatically infer the schema from the source
data. If you don't enable schema auto-detection for these sources, then you
must provide an explicit schema.
You don't need to enable schema auto-detection when you query external Avro,
Parquet, ORC, Firestore export, or Datastore export files. These file formats
are self-describing, so BigQuery automatically infers the table
schema from the source data. For Parquet, Avro, and ORC files, you can
optionally provide an explicit schema to override the inferred schema.
Using the Google Cloud console, you can enable schema auto-detection by
checking the Schema and input parameters option for Auto detect.
Using the bq command-line tool, you can enable schema auto-detection when you
create a table definition file for CSV,
JSON, or Google Sheets data. When using the bq tool to create a
table definition file, pass the --autodetect flag to the mkdef command to
enable schema auto-detection, or pass the --noautodetect flag to disable
auto-detection.
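For example, the following command creates a table definition file for a CSV file in Cloud Storage with auto-detection enabled; the bucket, file, and output file names are placeholders:

bq mkdef --autodetect --source_format=CSV \
"gs://BUCKET/FILE.csv" > table_def.json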
When you use the --autodetect flag, the autodetect setting is set to true in the table definition file. When you use the --noautodetect flag, the autodetect setting is set to false. If you do not provide a schema
definition for the external data source when you create a table definition, and
you do not use the --noautodetect or --autodetect flag, the autodetect setting defaults to true.
When you create a table definition file by using the API, set the value of the autodetect property to true or false. Setting autodetect to true enables auto-detection. Setting autodetect to false disables auto-detection.
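For example, a table definition file for a CSV source with auto-detection enabled might look similar to the following sketch; the bucket and file names are placeholders:

{
  "autodetect": true,
  "sourceFormat": "CSV",
  "sourceUris": [
    "gs://BUCKET/FILE.csv"
  ]
}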
Auto-detection details
In addition to detecting schema details, auto-detection recognizes the
following:
Compression
BigQuery recognizes gzip-compatible file compression when opening
a file.
Date and time values
BigQuery detects date and time values based on the formatting of
the source data.
Values in DATE columns must be in the following format: YYYY-MM-DD.
Values in TIME columns must be in the following format: HH:MM:SS[.SSSSSS] (the fractional-second component is optional).
For TIMESTAMP columns, BigQuery detects a wide array of
timestamp formats, including, but not limited to:
YYYY-MM-DD HH:MM
YYYY-MM-DD HH:MM:SS
YYYY-MM-DD HH:MM:SS.SSSSSS
YYYY/MM/DD HH:MM
A timestamp can also contain a UTC offset or the UTC zone designator ('Z').
Here are some examples of values that BigQuery will automatically
detect as timestamp values:
2018-08-19 12:11
2018-08-19 12:11:35.22
2018/08/19 12:11
2018-08-19 07:11:35.220 -05:00
If auto-detection isn't enabled, and your value is in a format not present in the
preceding examples, then BigQuery can only load the column as a STRING data type. You can enable auto-detection to have BigQuery
recognize these columns as timestamps. For example, BigQuery
only loads 2025-06-16T16:55:22Z as a timestamp if you enable auto-detection.
Alternatively, you can preprocess the source data
before loading it. For example, if you are exporting CSV data from a
spreadsheet, set the date format to match one of the examples shown here.
You can also transform the data after loading it into
BigQuery.
Schema auto-detection for CSV data
CSV delimiter
BigQuery detects the following delimiters:
comma ( , )
pipe ( | )
tab ( \t )
CSV header
BigQuery infers headers by comparing the first row of the file
with other rows in the file. If the first line contains only strings, and the
other lines contain other data types, BigQuery assumes that the
first row is a header row. BigQuery assigns column names based on the field names in the header row. The names might be modified to meet the naming rules for columns in BigQuery. For example, spaces are replaced with underscores.
Otherwise, BigQuery assumes the first row is a data row, and
assigns generic column names such as string_field_1. Note that auto-detection can't update
the column names in the schema, although you can change the names
manually after the table is created. Another option is to provide an explicit schema
instead of using auto-detection.
You might have a CSV file with a header row, where all of the data fields are
strings. In that case, BigQuery will not automatically detect that
the first row is a header. Use the--skip_leading_rowsoption to skip the
header row. Otherwise, the header will be imported as data. Also consider
providing an explicit schema in this case, so that you can assign column names.
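For example, the following command skips the first row of the file and lets auto-detection infer types from the remaining rows; the dataset, table, and file names are placeholders:

bq load \
--autodetect \
--skip_leading_rows=1 \
--source_format=CSV \
mydataset.mytable \
./myfile.csv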
CSV quoted new lines
BigQuery detects quoted new line characters within a CSV field
and does not interpret the quoted new line character as a row boundary.
Schema auto-detection for JSON data
JSON nested and repeated fields
BigQuery infers nested and repeated fields in JSON files. If a
field value is a JSON object, then BigQuery loads the column as a RECORD type. If a field value is an array, then BigQuery loads
the column as a repeated column. For an example of JSON data with nested and
repeated data, see Loading nested and repeated JSON data.
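For instance, in an illustrative newline-delimited JSON record like the following, auto-detection infers address as a RECORD column and phone_numbers as a repeated column:

{"name": "Alice", "address": {"city": "Seattle", "zip": "98101"}, "phone_numbers": ["555-0100", "555-0101"]}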
String conversion
If you enable schema auto-detection, then BigQuery converts
strings into Boolean, numeric, or date/time types when possible. For example,
given JSON data like the following illustrative records, schema auto-detection converts the id field
to an INTEGER column:
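{"name": "Alice", "id": "12"}
{"name": "Bob", "id": "34"}
{"name": "Charlie", "id": "45"}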
Schema auto-detection for Google Sheets
For Sheets, BigQuery auto-detects whether the
first row is a header row, similar to auto-detection for CSV files. If the first
line is identified as a header, BigQuery assigns column names
based on the field names in the header row and skips the row. The names might be
modified to meet the naming rules for
columns in BigQuery. For example, spaces are replaced with underscores.
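For example, the following command creates a table definition file for a Sheets-backed external table with auto-detection enabled; the spreadsheet URL and output file name are placeholders:

bq mkdef --autodetect --source_format=GOOGLE_SHEETS \
"https://docs.google.com/spreadsheets/d/SPREADSHEET_ID" > table_def.json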