Loading JSON data from Cloud Storage
You can load newline-delimited JSON (ndJSON) data from Cloud Storage into a new table or partition, or append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
The ndJSON format is the same format as the JSON Lines format.
Limitations
You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:
- If your dataset's location is set to a value other than the US multi-region, then the Cloud Storage bucket must be in the same region or contained in the same multi-region as the dataset.
- BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
- BigQuery does not support Cloud Storage object versioning. If you include a generation number in the Cloud Storage URI, then the load job fails.
When you load JSON files into BigQuery, note the following:
- JSON data must be newline delimited, or ndJSON. Each JSON object must be on a separate line in the file.
- If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.
- You cannot include both compressed and uncompressed files in the same load job.
- The maximum size for a gzip file is 4 GB.
- BigQuery supports the JSON type even if schema information is not known at the time of ingestion. A field that is declared as JSON type is loaded with the raw JSON values.
- If you use the BigQuery API to load an integer outside the range of [-2^53+1, 2^53-1] (usually this means larger than 9,007,199,254,740,991) into an integer (INT64) column, pass it as a string to avoid data corruption. This issue is caused by a limitation on integer size in JSON or ECMAScript. For more information, see the Numbers section of RFC 7159.
- When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
- When you load JSON or CSV data, values in TIMESTAMP columns must use a dash (-) or slash (/) separator for the date portion of the timestamp, and the date must be in one of the following formats: YYYY-MM-DD (year-month-day) or YYYY/MM/DD (year/month/day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon (:) separator. The sample rows after this list show values that follow these rules.
- Your files must meet the JSON file size limits described in the load jobs limits.
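As an illustration of these rules, the following ndJSON rows use hypothetical field names for an INT64 column, a DATE column, and a TIMESTAMP column. The first row passes an out-of-range integer as a string; the second row shows the slash-separated date form that is allowed only in TIMESTAMP values:

{"id": "9007199254740993", "order_date": "2025-01-31", "updated_at": "2025-01-31 08:30:00"}
{"id": 42, "order_date": "2025-02-01", "updated_at": "2025/02/01 17:05:30"}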
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document, and create a dataset to store your data.
Required permissions
To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.
Permissions to load data into BigQuery
To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:
- bigquery.tables.create
- bigquery.tables.updateData
- bigquery.tables.update
- bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:
- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.admin (includes the bigquery.jobs.create permission)
- roles/bigquery.user (includes the bigquery.jobs.create permission)
- roles/bigquery.jobUser (includes the bigquery.jobs.create permission)
Additionally, if you have the bigquery.datasets.create permission, you can create and update tables using a load job in the datasets that you create.
For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.
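For example, an administrator might grant one of these roles at the project level with the gcloud CLI. This is a sketch; the project ID and user email are placeholders:

# Grant the BigQuery Data Editor role on a project (placeholder values).
gcloud projects add-iam-policy-binding my-project \
    --member="user:load-user@example.com" \
    --role="roles/bigquery.dataEditor"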
Permissions to load data from Cloud Storage
To get the permissions that you need to load data from a Cloud Storage bucket, ask your administrator to grant you the Storage Admin (roles/storage.admin) IAM role on the bucket. For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to load data from a Cloud Storage bucket:
- storage.buckets.get
- storage.objects.get
- storage.objects.list (required if you are using a URI wildcard)
You might also be able to get these permissions with custom roles or other predefined roles.
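As a sketch of a bucket-level grant, an administrator could attach the Storage Admin role to a single bucket with the gcloud CLI; the bucket name and user email below are placeholders:

# Grant the Storage Admin role on one bucket only (placeholder values).
gcloud storage buckets add-iam-policy-binding gs://mybucket \
    --member="user:load-user@example.com" \
    --role="roles/storage.admin"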
Create a dataset
Create a BigQuery dataset to store your data.
JSON compression
You can use the gzip utility to compress JSON files. Note that gzip performs full file compression, unlike the file content compression performed by compression codecs for other file formats, such as Avro. Using gzip to compress your JSON files might have a performance impact; for more information about the trade-offs, see Loading compressed and uncompressed data.
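For example, you might compress a local file and copy it to Cloud Storage before loading it. This is a sketch; the file and bucket names are placeholders:

# Compress the whole file; BigQuery cannot read gzip-compressed JSON in parallel.
gzip mydata.json
# Copy the compressed file to a Cloud Storage bucket.
gcloud storage cp mydata.json.gz gs://mybucket/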
Loading JSON data into a new table
To load JSON data from Cloud Storage into a new BigQuery table:
Console
- In the Google Cloud console, go to the BigQuery page.
- In the Explorer pane, expand your project, and then select a dataset.
- In the Dataset info section, click Create table.
- In the Create table panel, specify the following details:
- In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
- Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
- For File format, select JSONL (Newline delimited JSON).
- In the Destination section, specify the following details:
- For Dataset, select the dataset in which you want to create the table.
- In the Table field, enter the name of the table that you want to create.
- Verify that the Table type field is set to Native table.
- In the Schema section, enter the schema definition. To enable the auto detection of a schema, select Auto detect. You can enter schema information manually by using one of the following methods:
- Option 1: Click Edit as text and paste the schema in the form of a JSON array. When you use a JSON array, you generate the schema using the same process as creating a JSON schema file. You can view the schema of an existing table in JSON format by entering the following command:
bq show --format=prettyjson dataset.table
- Option 2: Click Add field and enter the table schema. Specify each field's Name, Type, and Mode.
- Optional: Specify Partition and cluster settings . For more information, see Creating partitioned tables and Creating and using clustered tables .
- Click Advanced options and do the following:
- For Write preference , leave Write if empty selected. This option creates a new table and loads your data into it.
- For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job will result in an invalid message and fail. This option applies only to CSV and JSON files.
- If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
- For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
- Click Create table.
SQL
Use the LOAD DATA DDL statement.
The following example loads a JSON file into the new table mytable:
- In the Google Cloud console, go to the BigQuery page.
- In the query editor, enter the following statement:
LOAD DATA OVERWRITE mydataset.mytable
  (x INT64, y STRING)
FROM FILES (
  format = 'JSON',
  uris = ['gs://bucket/path/file.json']);
- Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Use the bq load command, specify NEWLINE_DELIMITED_JSON using the --source_format flag, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard. Supply the schema inline, in a schema definition file, or use schema auto-detect.
(Optional) Supply the --location flag and set the value to your location.
Other optional flags include:
- --max_bad_records: An integer that specifies the maximum number of bad records allowed before the entire job fails. The default value is 0. At most, five errors of any type are returned regardless of the --max_bad_records value.
- --ignore_unknown_values: When specified, allows and ignores extra, unrecognized values in CSV or JSON data.
- --autodetect: When specified, enables schema auto-detection for CSV and JSON data.
- --time_partitioning_type: Enables time-based partitioning on a table and sets the partition type. Possible values are HOUR, DAY, MONTH, and YEAR. This flag is optional when you create a table partitioned on a DATE, DATETIME, or TIMESTAMP column. The default partition type for time-based partitioning is DAY. You cannot change the partitioning specification on an existing table.
- --time_partitioning_expiration: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
- --time_partitioning_field: The DATE or TIMESTAMP column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
- --require_partition_filter: When enabled, this option requires users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter can reduce cost and improve performance. For more information, see Querying partitioned tables.
- --clustering_fields: A comma-separated list of up to four column names used to create a clustered table.
- --destination_kms_key: The Cloud KMS key for encryption of the table data.
For more information on partitioned tables, see Creating partitioned tables.
For more information on clustered tables, see Creating and using clustered tables.
For more information on table encryption, see Customer-managed Cloud KMS keys.
To load JSON data into BigQuery, enter the following command:
bq --location=LOCATION load \
    --source_format=FORMAT \
    DATASET.TABLE \
    PATH_TO_SOURCE \
    SCHEMA
Replace the following:
- LOCATION: your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
- FORMAT: NEWLINE_DELIMITED_JSON.
- DATASET: an existing dataset.
- TABLE: the name of the table into which you're loading data.
- PATH_TO_SOURCE: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
- SCHEMA: a valid schema. The schema can be a local JSON file, or it can be typed inline as part of the command. If you use a schema file, do not give it an extension. You can also use the --autodetect flag instead of supplying a schema definition.
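For reference, a local schema file such as the myschema file used in the following examples contains a JSON array of field definitions. A minimal sketch might look like the following; the field names mirror the inline-schema example further below, and the modes shown are assumptions:

[
  {"name": "qtr", "type": "STRING", "mode": "REQUIRED"},
  {"name": "sales", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "year", "type": "STRING", "mode": "NULLABLE"}
]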
Examples:
The following command loads data from gs://mybucket/mydata.json into a table named mytable in mydataset. The schema is defined in a local schema file named myschema.
bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/mydata.json \
    ./myschema
The following command loads data from gs://mybucket/mydata.json into a new ingestion-time partitioned table named mytable in mydataset. The schema is defined in a local schema file named myschema.
bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    --time_partitioning_type=DAY \
    mydataset.mytable \
    gs://mybucket/mydata.json \
    ./myschema
The following command loads data from gs://mybucket/mydata.json into a partitioned table named mytable in mydataset. The table is partitioned on the mytimestamp column. The schema is defined in a local schema file named myschema.
bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    --time_partitioning_field mytimestamp \
    mydataset.mytable \
    gs://mybucket/mydata.json \
    ./myschema
The following command loads data from gs://mybucket/mydata.json into a table named mytable in mydataset. The schema is auto detected.
bq load \
    --autodetect \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/mydata.json
The following command loads data from gs://mybucket/mydata.json into a table named mytable in mydataset. The schema is defined inline in the format FIELD:DATA_TYPE, FIELD:DATA_TYPE.
bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/mydata.json \
    qtr:STRING,sales:FLOAT,year:STRING
The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard. The schema is auto detected.
bq load \
    --autodetect \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    gs://mybucket/mydata*.json
The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards. The schema is defined in a local schema file named myschema.
bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable \
    "gs://mybucket/00/*.json","gs://mybucket/01/*.json" \
    ./myschema
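As an additional illustration (not one of the original examples), the following command combines the --max_bad_records and --ignore_unknown_values flags described earlier; the limit of 10 bad records is arbitrary:

bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    --max_bad_records=10 \
    --ignore_unknown_values \
    mydataset.mytable \
    gs://mybucket/mydata.json \
    ./myschema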
API
- Create a load job that points to the source data in Cloud Storage.
- (Optional) Specify your location in the location property in the jobReference section of the job resource.
- The source URIs property must be fully qualified, in the format gs://BUCKET/OBJECT. Each URI can contain one '*' wildcard character.
- Specify the JSON data format by setting the sourceFormat property to NEWLINE_DELIMITED_JSON.
- To check the job status, call jobs.get(JOB_ID), replacing JOB_ID with the ID of the job returned by the initial request.
  - If status.state = DONE, the job completed successfully.
  - If the status.errorResult property is present, the request failed, and that object includes information describing what went wrong. When a request fails, no table is created and no data is loaded.
  - If status.errorResult is absent, the job finished successfully, although there might have been some nonfatal errors, such as problems importing a few rows. Nonfatal errors are listed in the returned job object's status.errors property.
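As a sketch of what such a request can look like, the body of a jobs.insert call for this kind of load job might resemble the following; the project, dataset, table, bucket, and job ID values are placeholders:

{
  "jobReference": {
    "projectId": "my-project",
    "jobId": "load-json-20250101-0001"
  },
  "configuration": {
    "load": {
      "sourceUris": ["gs://mybucket/mydata.json"],
      "sourceFormat": "NEWLINE_DELIMITED_JSON",
      "destinationTable": {
        "projectId": "my-project",
        "datasetId": "mydataset",
        "tableId": "mytable"
      },
      "autodetect": true
    }
  }
}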
API notes:
- Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.
- As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.
- Calling jobs.insert on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations succeeds.
C#
Before trying this sample, follow the C# setup instructions in the BigQuery quickstart using client libraries . For more information, see the BigQuery C# API reference documentation .
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries .
Use the BigQueryClient.CreateLoadJob() method to start a load job from Cloud Storage. To use JSONL, create a CreateLoadJobOptions object and set its SourceFormat property to FileFormat.NewlineDelimitedJson.
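The following is a minimal sketch of that approach using the Google.Cloud.BigQuery.V2 client library; the project, dataset, table, and bucket names are placeholders, and schema auto-detection is assumed instead of an explicit schema:

using System;
using Google.Cloud.BigQuery.V2;

public class LoadJsonFromGcs
{
    public static void Main()
    {
        // Placeholder project, dataset, table, and bucket names.
        BigQueryClient client = BigQueryClient.Create("my-project");
        var destination = client.GetTableReference("mydataset", "mytable");
        var options = new CreateLoadJobOptions
        {
            SourceFormat = FileFormat.NewlineDelimitedJson,
            // Let BigQuery detect the schema; alternatively, pass a TableSchema below.
            Autodetect = true
        };
        // Start the load job from the Cloud Storage URI and wait for it to finish.
        BigQueryJob job = client.CreateLoadJob(
            "gs://mybucket/mydata.json", destination, schema: null, options: options);
        job = job.PollUntilCompleted().ThrowOnAnyError();
        Console.WriteLine("JSON data loaded into mydataset.mytable.");
    }
}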