Generate dataset insights

This document describes how to generate dataset insights for BigQuery datasets. Dataset insights help you understand relationships between tables in a dataset by generating relationship graphs and cross-table queries.

Dataset insights help you accelerate the exploration of datasets with multiple tables by automatically discovering and visualizing relationships between tables in a graph, identifying primary-key and foreign-key relationships, and generating sample cross-table queries. This is useful for understanding data structure without documentation, discovering schema-defined, usage-based, or AI-inferred relationships between tables, and generating complex queries that join multiple tables.

For an overview of table and dataset insights, see Data insights overview .

Modes for generating dataset insights

When generating dataset insights, BigQuery provides two modes:

Mode

Description

Usage

Generate and publish

Persists generated dataset insights into Knowledge Catalog as metadata aspects and relationships. You must have the necessary permissions to publish. When you use Generate and publish , BigQuery does the following:

Stores the dataset description in Knowledge Catalog.
Captures suggested queries and questions as reusable aspects.
Captures relationships as metadata in Knowledge Catalog.
Makes published insights accessible to all users who have appropriate Knowledge Catalog access, ensuring shared organizational knowledge.
Lets you edit and save descriptions directly in Knowledge Catalog using the API. You can edit the suggested queries using the Google Cloud console.

Use this mode for enterprise-wide data documentation that persists and is reusable, or when building catalog-driven governance workflows.

Generate without publishing

Creates dataset insights such as descriptions, natural language questions, relationships, and SQL queries on demand. Generate without publishing doesn't publish insights to Knowledge Catalog.

Use this mode for quick, ad hoc exploration to avoid cluttering the catalog.

Before you begin

Data insights are generated using Gemini in BigQuery . To start generating insights, you must first set up Gemini in BigQuery .

Enable APIs

To use data insights, enable the following APIs in your project: Dataplex API, BigQuery API, and Gemini for Google Cloud API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role ( roles/serviceusage.serviceUsageAdmin ), which contains the serviceusage.services.enable permission. Learn how to grant roles .

Enable the APIs

For more information about enabling the Gemini for Google Cloud API, see Enable the Gemini for Google Cloud API in a Google Cloud project .

Complete a data profile scan

To improve the quality of insights, generate data profile scan for tables in your dataset.

Required roles

To get the permissions that you need to generate, manage, and retrieve dataset insights, ask your administrator to grant you the following IAM roles:

To generate, manage, and retrieve insights:
- Dataplex DataScan Editor ( roles/dataplex.dataScanEditor ) or Dataplex DataScan Administrator ( roles/dataplex.dataScanAdmin ) on project
- BigQuery Data Editor ( roles/bigquery.dataEditor ) on tables
- BigQuery User ( roles/bigquery.user ) or BigQuery Studio User ( roles/bigquery.studioUser ) on project
- BigQuery Resource Viewer ( roles/bigquery.resourceViewer ) on project
To view insights:
- Dataplex DataScan DataViewer ( roles/dataplex.dataScanDataViewer ) on project
- BigQuery Data Viewer ( roles/bigquery.dataViewer ) on dataset
To publish insights to Knowledge Catalog: Dataplex Entry and EntryLink Owner ( roles/dataplex.entryOwner ) on entry group

For more information about granting roles, see Manage access to projects, folders, and organizations .

You might also be able to get the required permissions through custom roles or other predefined roles .

To see the exact permissions that are required to generate insights, expand the Required permissionssection:

Required permissions

bigquery.datasets.get : read dataset metadata
bigquery.jobs.create : create jobs
bigquery.jobs.listAll : list all jobs in the project
bigquery.tables.get : get table metadata
bigquery.tables.getData : get table data and metadata
dataplex.datascans.create : create DataScan resource
dataplex.datascans.get : read DataScan resource metadata
dataplex.datascans.getData : read DataScan execution results
dataplex.datascans.run : run on-demand DataScan
dataplex.entryGroups.useSchemaJoinEntryLink : use schema-join entry links
dataplex.entryGroups.useSchemaJoinAspect : use schema join aspects
dataplex.entryLinks.create : create entry links
dataplex.entryLinks.update : update entry links
dataplex.entryLinks.delete : delete entry links
dataplex.entries.link : link entries
dataplex.entries.update : update entries
dataplex.entryGroups.useDescriptionsAspect : use description aspects
dataplex.entryGroups.useQueriesAspect : use query aspects

Generate dataset insights

Console

In the Google Cloud console, go to BigQuery Studio.

Go to BigQuery Studio
In the Explorerpane, select the project and then the dataset for which you want to generate insights.
Click the Insightstab.
To generate insights and publish them to Knowledge Catalog, click Generate and publish.

To generate insights without publishing them to Knowledge Catalog, click Generate without publishing.

For more information about the differences between the Generate and publishand Generate without publishingmodes, see Modes for generating dataset insights .
If your dataset is in a multi-region, you might be prompted to select a region to generate insights. Select a region corresponding to the multi-region where the insights scan is going to be created.

It takes a few minutes for the insights to be populated. The quality of insights improves if the tables in the dataset have data profiling results .

After insights are generated, BigQuery displays a dataset description, a relationship graph, a relationship table, and sample cross-table queries.

REST

To generate insights programmatically, use the Knowledge Catalog DataScans API . To do this, complete the following steps:

Generate a data documentation datascan for the BigQuery dataset
Check the data documentation scan status
Verify publishing to Knowledge Catalog

Generate a data documentation datascan for the BigQuery dataset

Create a data documentation data scan using the dataScans.create method . Optionally, you can publish these insights to Knowledge Catalog by setting the catalog_publishing_enabled parameter to true .

For example:

 alias gcurl='curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"'
gcurl -X POST \
https://dataplex.googleapis.com/v1/projects/ PROJECT_ID 
/locations/ LOCATION 
/\
dataScans?dataScanId= DATASCAN_ID 
\
-d '{
  "data": {
    "resource": "//bigquery.googleapis.com/projects/ PROJECT_ID 
/datasets/ DATASET_ID 
"
  },
  "executionSpec": {
    "trigger": { "onDemand": {} }
  },
  "type": "DATA_DOCUMENTATION",
  "dataDocumentationSpec": {
    "catalog_publishing_enabled": true
  }
}'

Replace the following:

PROJECT_ID : the ID of your Google Cloud project where the dataset resides
LOCATION : the region where the data scan runs
DATASCAN_ID : a unique name you provide for this scan
DATASET_ID : the ID of the BigQuery dataset being scanned

Start the data documentation scan job using the dataScans.run method .

For example:
```
 gcurl -X POST \
https://dataplex.googleapis.com/v1/projects/ PROJECT_ID 
/locations/ LOCATION 
/\
dataScans/ DATASCAN_ID 
:run 
```
This request returns a unique job ID along with the initial state.

Check the data documentation scan status

Check completion of the scan job run using the dataScans.get method . To retrieve the full results, including the insights and the publishing status, set the view parameter to FULL .

Use the job ID to fetch the status of the job. For example:

 gcurl -X GET https://dataplex.googleapis.com/v1/projects/ PROJECT_ID 
/locations/ LOCATION 
/dataScans/ DATASCAN_ID 
/jobs/ JOB_ID 
?view=FULL

The job completes when the status is either SUCCEEDED or FAILURE .

A successful job response contains the generated insights in the dataDocumentationResult field.

Verify publishing to Knowledge Catalog

If catalog_publishing_enabled is set to true , then the insights are published to Knowledge Catalog asynchronously after the datascan job completes. To verify that insights were persisted, use the Dataplex API to inspect the aspects of the dataset.

While insights are generated from the dataset-level datascan, the resulting entry links are stored between the tables they connect. To verify these relationships, use the lookupEntryLinks method to retrieve the entry links associated with a specific table entry.

To retrieve metadata for your BigQuery dataset, use the entries.get method . To include all aspects, set the view parameter to FULL . For example:

 gcurl -X GET https://dataplex.googleapis.com/v1/projects/ PROJECT_ID 
/locations/ LOCATION 
/entryGroups/@bigquery/entries/bigquery.googleapis.com/projects/ DATASET_PROJECT_ID 
/datasets/ DATASET_ID 
?view=FULL

Replace the following:

PROJECT_ID : the ID of your Google Cloud project where the DataScan was configured
LOCATION : the region where the entry group resides
DATASET_PROJECT_ID : the ID of the Google Cloud project where the BigQuery dataset resides
DATASET : the ID of the BigQuery dataset

If publishing to Knowledge Catalog is successful, the following aspects are attached to the BigQuery dataset:

Descriptions: contains AI-generated descriptions of the dataset
Queries: contains relevant SQL queries related to the dataset
Relationships: persisted as entry links between the tables present in the dataset

View and save the dataset description

Gemini generates a natural language description of the dataset, summarizing the types of tables it contains and the business domain it represents. To save this description to the metadata of the dataset, click Save to details.

You can edit the description before saving the details.

Explore the relationship graph

The Relationshipsgraph provides a visual representation of how tables in the dataset relate to each other. It displays the top 10 most connected tables as nodes, with lines representing relationships between them.

To see relationship details, such as the columns that join two tables, hover over the edge connecting the table nodes.
To rearrange the graph for better visibility, drag the table nodes.

Use the relationship table

The Relationship tablelists the discovered relationships in a tabular format. Each row represents a relationship between two tables, showing the source table and column, and the destination table and column. The Sourcecolumn indicates how the relationship was determined:

LLM inferred.Relationships inferred by Gemini, based on table and column names and descriptions across the dataset.
Usage based.Relationships extracted from query logs, based on frequent joins.
Schema-defined.Relationships derived from existing primary key and foreign key mappings in the table schema.

You can filter the relationships for a specific table or provide feedback on the quality of detected relationships. To export the generated dataset description and relationships to a JSON file, click Export to JSON.

Use query recommendations

Based on the discovered relationships, Gemini generates sample queries. These are natural language questions with corresponding SQL queries that join multiple tables in the dataset.

To view a SQL query, click a question.
To open the query in the BigQuery query editor, click Copy to query. You can then run the query or modify it.
To ask a follow up question, click Ask a follow-up, which opens an untitled data canvas where you can chat with Gemini to explore your data.

Manage generated insights

After you generate insights for a dataset, you can manage, update, or delete them in Knowledge Catalog. For more information, see manage dataset insights .

What's next

Learn about data insights overview .
Learn how to generate table insights .
Learn more about Knowledge Catalog data profiling .