Generate dataset insights
This document describes how to generate dataset insights for BigQuery datasets. Dataset insights help you understand relationships between tables in a dataset by generating relationship graphs and cross-table queries.
Dataset insights accelerate the exploration of datasets that contain multiple tables by automatically discovering and visualizing table relationships in a graph, identifying primary-key and foreign-key relationships, and generating sample cross-table queries. This is useful for understanding data structure without documentation, discovering schema-defined, usage-based, or AI-inferred relationships between tables, and generating complex queries that join multiple tables.
For an overview of table and dataset insights, see Data insights overview .
Modes for generating dataset insights
When generating dataset insights, BigQuery provides two modes: Generate and publish and Generate without publishing.
Generate and publish
This mode persists generated dataset insights into Knowledge Catalog as metadata aspects and relationships. You must have the necessary permissions to publish. When you use Generate and publish, BigQuery does the following:
- Stores the dataset description in Knowledge Catalog.
- Captures suggested queries and questions as reusable aspects.
- Captures relationships as metadata in Knowledge Catalog.
- Makes published insights accessible to all users who have appropriate Knowledge Catalog access, ensuring shared organizational knowledge.
- Lets you edit and save descriptions directly in Knowledge Catalog using the API. You can edit the suggested queries using the Google Cloud console.
Use this mode for enterprise-wide data documentation that persists and is reusable, or when building catalog-driven governance workflows.
Generate without publishing
This mode creates dataset insights such as descriptions, natural language questions, relationships, and SQL queries on demand. Generate without publishing doesn't publish insights to Knowledge Catalog.
Use this mode for quick, ad hoc exploration to avoid cluttering the catalog.
Before you begin
Data insights are generated using Gemini in BigQuery . To start generating insights, you must first set up Gemini in BigQuery .
Enable APIs
To use data insights, enable the following APIs in your project: Dataplex API, BigQuery API, and Gemini for Google Cloud API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
For more information about enabling the Gemini for Google Cloud API, see Enable the Gemini for Google Cloud API in a Google Cloud project .
Complete a data profile scan
To improve the quality of insights, run a data profile scan on the tables in your dataset.
Required roles
To get the permissions that you need to generate, manage, and retrieve dataset insights, ask your administrator to grant you the following IAM roles:
- To generate, manage, and retrieve insights:
  - Dataplex DataScan Editor (roles/dataplex.dataScanEditor) or Dataplex DataScan Administrator (roles/dataplex.dataScanAdmin) on the project
  - BigQuery Data Editor (roles/bigquery.dataEditor) on the tables
  - BigQuery User (roles/bigquery.user) or BigQuery Studio User (roles/bigquery.studioUser) on the project
  - BigQuery Resource Viewer (roles/bigquery.resourceViewer) on the project
- To view insights:
  - Dataplex DataScan DataViewer (roles/dataplex.dataScanDataViewer) on the project
  - BigQuery Data Viewer (roles/bigquery.dataViewer) on the dataset
- To publish insights to Knowledge Catalog: Dataplex Entry and EntryLink Owner (roles/dataplex.entryOwner) on the entry group
For more information about granting roles, see Manage access to projects, folders, and organizations .
You might also be able to get the required permissions through custom roles or other predefined roles .
To see the exact permissions that are required to generate insights, expand the Required permissions section:
Required permissions
- bigquery.datasets.get: read dataset metadata
- bigquery.jobs.create: create jobs
- bigquery.jobs.listAll: list all jobs in the project
- bigquery.tables.get: get table metadata
- bigquery.tables.getData: get table data and metadata
- dataplex.datascans.create: create a DataScan resource
- dataplex.datascans.get: read DataScan resource metadata
- dataplex.datascans.getData: read DataScan execution results
- dataplex.datascans.run: run an on-demand DataScan
- dataplex.entryGroups.useSchemaJoinEntryLink: use schema-join entry links
- dataplex.entryGroups.useSchemaJoinAspect: use schema join aspects
- dataplex.entryLinks.create: create entry links
- dataplex.entryLinks.update: update entry links
- dataplex.entryLinks.delete: delete entry links
- dataplex.entries.link: link entries
- dataplex.entries.update: update entries
- dataplex.entryGroups.useDescriptionsAspect: use description aspects
- dataplex.entryGroups.useQueriesAspect: use query aspects
Generate dataset insights
Console
-
In the Google Cloud console, go to BigQuery Studio.
-
In the Explorer pane, select the project and then the dataset for which you want to generate insights.
-
Click the Insights tab.
-
To generate insights and publish them to Knowledge Catalog, click Generate and publish.
To generate insights without publishing them to Knowledge Catalog, click Generate without publishing.
For more information about the differences between the Generate and publish and Generate without publishing modes, see Modes for generating dataset insights .
-
If your dataset is in a multi-region, you might be prompted to select a region in which to generate insights. Select a region within the multi-region where the insights scan will be created.
It takes a few minutes for the insights to be populated. The quality of insights improves if the tables in the dataset have data profiling results .
After insights are generated, BigQuery displays a dataset description, a relationship graph, a relationship table, and sample cross-table queries.
REST
To generate insights programmatically, use the Knowledge Catalog DataScans API . To do this, complete the following steps:
- Generate a data documentation datascan for the BigQuery dataset
- Check the data documentation scan status
- Verify publishing to Knowledge Catalog
Generate a data documentation datascan for the BigQuery dataset
-
Create a data documentation data scan using the dataScans.create method. Optionally, you can publish these insights to Knowledge Catalog by setting the catalog_publishing_enabled parameter to true. For example:

```
alias gcurl='curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"'

gcurl -X POST \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans?dataScanId=DATASCAN_ID" \
  -d '{
    "data": {
      "resource": "//bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET_ID"
    },
    "executionSpec": {
      "trigger": {
        "onDemand": {}
      }
    },
    "type": "DATA_DOCUMENTATION",
    "dataDocumentationSpec": {
      "catalog_publishing_enabled": true
    }
  }'
```

Replace the following:
- PROJECT_ID : the ID of your Google Cloud project where the dataset resides
- LOCATION : the region where the data scan runs
- DATASCAN_ID : a unique name you provide for this scan
- DATASET_ID : the ID of the BigQuery dataset being scanned
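If you prefer to assemble the request body in code before sending it, the JSON payload above maps directly to a small builder. The following sketch is illustrative; the helper name build_data_documentation_scan is our own, not part of any Google client library:

```python
def build_data_documentation_scan(project_id: str, dataset_id: str,
                                  publish_to_catalog: bool = True) -> dict:
    """Assemble the request body for a DATA_DOCUMENTATION DataScan.

    Mirrors the JSON payload shown above; the helper name is illustrative.
    """
    return {
        "data": {
            # Fully qualified BigQuery resource name of the dataset to scan.
            "resource": f"//bigquery.googleapis.com/projects/{project_id}/datasets/{dataset_id}"
        },
        # Run the scan once, on demand, rather than on a schedule.
        "executionSpec": {"trigger": {"onDemand": {}}},
        "type": "DATA_DOCUMENTATION",
        # Toggle persistence of the insights to Knowledge Catalog.
        "dataDocumentationSpec": {"catalog_publishing_enabled": publish_to_catalog},
    }

body = build_data_documentation_scan("my-project", "sales_dataset")
```

You can then serialize the dictionary with json.dumps and send it as the -d payload of the dataScans.create request.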
-
Start the data documentation scan job using the dataScans.run method. For example:

```
gcurl -X POST \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/DATASCAN_ID:run"
```

This request returns a unique job ID along with the initial state.
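The run response embeds the new job resource, whose name ends in the job ID that the status checks need. A minimal sketch of extracting it, assuming the response shape {"job": {"name": ".../dataScans/ID/jobs/JOB_ID", ...}} (the helper name is our own):

```python
def extract_job_id(run_response: dict) -> str:
    """Return the job ID from a dataScans.run response.

    Assumes the response carries the new job under "job", with its resource
    name ending in .../jobs/JOB_ID.
    """
    job_name = run_response["job"]["name"]
    # The job ID is the final path segment of the job resource name.
    return job_name.rsplit("/", 1)[-1]

response = {"job": {"name": "projects/p/locations/us/dataScans/scan1/jobs/job-123",
                    "state": "PENDING"}}
job_id = extract_job_id(response)  # "job-123"
```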
Check the data documentation scan status
Check completion of the scan job run using the dataScans.jobs.get method. To retrieve the full results, including the insights and the publishing status, set the view parameter to FULL. Use the job ID to fetch the status of the job. For example:
```
gcurl -X GET \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/DATASCAN_ID/jobs/JOB_ID?view=FULL"
```
The job completes when the status is either SUCCEEDED or FAILED. A successful job response contains the generated insights in the dataDocumentationResult field.
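Because the scan runs asynchronously, a client typically polls the job until it reaches a terminal state. A minimal polling sketch, with the HTTP call injected as a callable so the logic is independent of whichever client you use (function names are our own):

```python
import time
from typing import Callable


def wait_for_scan_job(fetch_job: Callable[[], dict],
                      poll_interval_s: float = 10.0,
                      timeout_s: float = 600.0) -> dict:
    """Poll a DataScan job until it reaches a terminal state.

    fetch_job is any callable that performs the jobs.get request shown above
    and returns the decoded JSON response for the job.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job()
        # Stop on any terminal state; SUCCEEDED carries dataDocumentationResult.
        if job.get("state") in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"scan job still {job.get('state')} after {timeout_s}s")
        time.sleep(poll_interval_s)
```

In practice, fetch_job would wrap the authenticated GET request to the jobs endpoint; the timeout and interval defaults here are arbitrary.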
Verify publishing to Knowledge Catalog
If catalog_publishing_enabled is set to true, then the insights are published to Knowledge Catalog asynchronously after the datascan job completes. To verify that insights were persisted, use the Dataplex API to inspect the aspects of the dataset.
While insights are generated from the dataset-level datascan, the resulting entry links are stored between the tables they connect. To verify these relationships, use the lookupEntryLinks method to retrieve the entry links associated with a specific table entry.
To retrieve metadata for your BigQuery dataset, use the entries.get method.
To include all aspects, set the view parameter to FULL. For example:

```
gcurl -X GET \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/entryGroups/@bigquery/entries/bigquery.googleapis.com/projects/DATASET_PROJECT_ID/datasets/DATASET_ID?view=FULL"
```
Replace the following:
- PROJECT_ID : the ID of your Google Cloud project where the DataScan was configured
- LOCATION : the region where the entry group resides
- DATASET_PROJECT_ID : the ID of the Google Cloud project where the BigQuery dataset resides
- DATASET_ID : the ID of the BigQuery dataset
If publishing to Knowledge Catalog is successful, the following aspects are attached to the BigQuery dataset:
- Descriptions: contains AI-generated descriptions of the dataset
- Queries: contains relevant SQL queries related to the dataset
- Relationships: persisted as entry links between the tables present in the dataset
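To confirm programmatically that these aspects were attached, you can scan the aspect map on the returned entry. The sketch below assumes the entries.get response exposes aspects as a JSON map keyed by aspect type; the key strings in the example are illustrative, not the exact production format:

```python
def list_insight_aspects(entry: dict, keyword: str) -> list[str]:
    """Return the aspect-map keys on an entry that mention a keyword.

    entry is the decoded JSON from entries.get with view=FULL; the aspect
    key format shown in the test data is illustrative only.
    """
    return sorted(key for key in entry.get("aspects", {})
                  if keyword.lower() in key.lower())
```

For example, list_insight_aspects(entry, "descriptions") returning a non-empty list indicates that the AI-generated description aspect was published.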
View and save the dataset description
Gemini generates a natural language description of the dataset, summarizing the types of tables it contains and the business domain it represents. To save this description to the metadata of the dataset, click Save to details.
You can edit the description before saving the details.
Explore the relationship graph
The Relationships graph provides a visual representation of how tables in the dataset relate to each other. It displays the top 10 most connected tables as nodes, with lines representing relationships between them.
- To see relationship details, such as the columns that join two tables, hover over the edge connecting the table nodes.
- To rearrange the graph for better visibility, drag the table nodes.
Use the relationship table
The Relationship table lists the discovered relationships in a tabular format. Each row represents a relationship between two tables, showing the source table and column, and the destination table and column. The Source column indicates how the relationship was determined:
- LLM inferred. Relationships inferred by Gemini, based on table and column names and descriptions across the dataset.
- Usage based. Relationships extracted from query logs, based on frequent joins.
- Schema-defined. Relationships derived from existing primary key and foreign key mappings in the table schema.
You can filter the relationships for a specific table or provide feedback on the quality of detected relationships. To export the generated dataset description and relationships to a JSON file, click Export to JSON.
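Once exported, the JSON file can be analyzed offline, for example to see how each relationship was discovered. This sketch assumes each exported relationship record carries a "source" field with one of the values above; the exact export schema may differ:

```python
from collections import Counter


def count_relationships_by_source(relationships: list[dict]) -> Counter:
    """Tally exported relationships by discovery source.

    Assumes each record has a "source" field such as "LLM inferred",
    "Usage based", or "Schema-defined"; records without one are counted
    as UNKNOWN. The export schema here is an assumption.
    """
    return Counter(rel.get("source", "UNKNOWN") for rel in relationships)
```

A quick tally like this helps you judge how many relationships are backed by schema constraints or query logs versus inferred by the model.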
Use query recommendations
Based on the discovered relationships, Gemini generates sample queries. These are natural language questions with corresponding SQL queries that join multiple tables in the dataset.
-
To view a SQL query, click a question.
-
To open the query in the BigQuery query editor, click Copy to query. You can then run the query or modify it.
-
To ask a follow-up question, click Ask a follow-up, which opens an untitled data canvas where you can chat with Gemini to explore your data.
What's next
- Read the data insights overview .
- Learn how to generate table insights .
- Learn more about Knowledge Catalog data profiling .

