This document describes how to use Knowledge Catalog (formerly Dataplex Universal Catalog) data quality scans to measure, monitor, and manage the quality of your data. Data quality scans help you automate the process of validating your data for completeness, validity, and consistency.
With data quality scans, you can define rules to check for missing values, ensure values match a regular expression or belong to a set, verify uniqueness, or use custom SQL for more complex validations such as anomaly detection. This document explains how to create and manage data quality scans.
To learn more about data quality scans, see About auto data quality.
Before you begin
- Enable the Dataplex API.

  Roles required to enable APIs

  To enable APIs, you need the Service Usage Admin IAM role (`roles/serviceusage.serviceUsageAdmin`), which contains the `serviceusage.services.enable` permission. Learn how to grant roles.

- Optional: If you want Knowledge Catalog to generate recommendations for data quality rules based on the results of a data profile scan, create and run the data profile scan.
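The API can also be enabled from the gcloud CLI. The following is a sketch; `PROJECT_ID` is a placeholder for your project ID.

```shell
# Enable the Dataplex API (PROJECT_ID is a placeholder).
gcloud services enable dataplex.googleapis.com --project=PROJECT_ID
```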
Required roles and permissions
This section describes the IAM roles and permissions needed to use Knowledge Catalog data quality scans.
User roles and permissions
To get the permissions that you need to run and manage data quality scans, ask your administrator to grant you the following IAM roles:
- Run a data quality scan on a BigQuery table:
  - BigQuery Job User (`roles/bigquery.jobUser`) on the project to run scan jobs
  - BigQuery Data Viewer (`roles/bigquery.dataViewer`) on the BigQuery table to be scanned
- Publish data quality scan results to Knowledge Catalog:
  - BigQuery Data Editor (`roles/bigquery.dataEditor`) on the scanned table
  - Dataplex Catalog Editor (`roles/dataplex.catalogEditor`) on the `@bigquery` entry group in the same location as the table
- Perform specific tasks on `DataScan` resources:
  - Dataplex DataScan Administrator (`roles/dataplex.dataScanAdmin`) on the project for full access
  - Dataplex DataScan Creator (`roles/dataplex.dataScanCreator`) on the project to create scans
  - Dataplex DataScan Editor (`roles/dataplex.dataScanEditor`) on the project for write access
  - Dataplex DataScan Viewer (`roles/dataplex.dataScanViewer`) on the project to read scan metadata
  - Dataplex DataScan DataViewer (`roles/dataplex.dataScanDataViewer`) on the project to read scan data, including rules and results
For more information about granting roles, see Manage access to projects, folders, and organizations.
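As a sketch of how any of the preceding roles can be granted with the gcloud CLI (`PROJECT_ID` and `USER_EMAIL` are placeholders; substitute whichever role fits the task):

```shell
# Grant the Dataplex DataScan Editor role to a user on the project.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/dataplex.dataScanEditor"
```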
These predefined roles contain the permissions required to run and manage data quality scans. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to run and manage data quality scans:
- Run a data quality scan on a BigQuery table:
  - `bigquery.jobs.create` on the project to run scan jobs
  - `bigquery.tables.get` on the BigQuery table to be scanned
  - `bigquery.tables.getData` on the BigQuery table to be scanned
- Publish data quality scan results to Knowledge Catalog:
  - `bigquery.tables.update` on the scanned table
  - `dataplex.entryGroups.useDataQualityScorecardAspect` on the `@bigquery` entry group in the same location as the table
- Create a `DataScan`: `dataplex.datascans.create` on the project
- Delete a `DataScan`: `dataplex.datascans.delete` on the project
- View `DataScan` metadata: `dataplex.datascans.get` on the project
- View `DataScan` details, including rules and results: `dataplex.datascans.getData` on the project
- List `DataScan` resources: `dataplex.datascans.list` on the project
- Run a `DataScan`: `dataplex.datascans.run` on the project
- Update a `DataScan`: `dataplex.datascans.update` on the project
- Get or set the IAM policy on a `DataScan`:
  - `dataplex.datascans.getIamPolicy` on the project
  - `dataplex.datascans.setIamPolicy` on the project
You might also be able to get these permissions with custom roles or other predefined roles.
If you need to access columns protected by BigQuery column-level access policies, then you also need permissions for those columns.
Knowledge Catalog service account roles and permissions
If you haven't created any data quality or data profile scans, or you don't have a Knowledge Catalog lake in this project, create a service identity by running the following command: `gcloud beta services identity create --service=dataplex.googleapis.com`. This command creates the Knowledge Catalog service identity if needed, and returns it if it already exists.
To ensure that the Knowledge Catalog service account of the project that contains the data quality scan can read data from various sources and export results, ask your administrator to grant that service account the following IAM roles:
- Read BigQuery table data: BigQuery Data Viewer (`roles/bigquery.dataViewer`) on the BigQuery tables to be scanned and any other tables referenced in rules
- Export scan results to a BigQuery table: BigQuery Data Editor (`roles/bigquery.dataEditor`) on the results dataset and table
- Scan BigQuery data organized in a Knowledge Catalog lake:
  - Dataplex Metadata Reader (`roles/dataplex.metadataReader`) on Dataplex resources
  - Dataplex Viewer (`roles/dataplex.viewer`) on Dataplex resources
- Scan a BigQuery external table from Cloud Storage: Storage Object Viewer (`roles/storage.objectViewer`) on the Cloud Storage bucket
For more information about granting roles, see Manage access to projects, folders, and organizations.
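For example, granting the BigQuery Data Viewer role to the service account might look like the following sketch; `PROJECT_ID` and `PROJECT_NUMBER` are placeholders, and the service account address follows the `service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com` format used elsewhere in this document.

```shell
# Grant BigQuery Data Viewer to the Knowledge Catalog service account
# (PROJECT_ID and PROJECT_NUMBER are placeholders).
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"
```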
These predefined roles contain the permissions required to read data from various sources and export results. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to read data from various sources and export results:
- Read BigQuery table data:
  - `bigquery.tables.get` on BigQuery tables
  - `bigquery.tables.getData` on BigQuery tables
- Export scan results to a BigQuery table:
  - `bigquery.datasets.get` on the results dataset and table
  - `bigquery.tables.create` on the results dataset and table
  - `bigquery.tables.get` on the results dataset and table
  - `bigquery.tables.getData` on the results dataset and table
  - `bigquery.tables.update` on the results dataset and table
  - `bigquery.tables.updateData` on the results dataset and table
- Scan BigQuery data organized in a Knowledge Catalog lake:
  - `dataplex.lakes.list` on Dataplex resources
  - `dataplex.lakes.get` on Dataplex resources
  - `dataplex.zones.list` on Dataplex resources
  - `dataplex.zones.get` on Dataplex resources
  - `dataplex.entities.list` on Dataplex resources
  - `dataplex.entities.get` on Dataplex resources
  - `dataplex.operations.get` on Dataplex resources
- Scan a BigQuery external table from Cloud Storage:
  - `storage.buckets.get` on the Cloud Storage bucket
  - `storage.objects.get` on the Cloud Storage bucket
Your administrator might also be able to grant these permissions to the Knowledge Catalog service account through custom roles or other predefined roles.
If you need to access columns protected by BigQuery column-level access policies, then assign the Knowledge Catalog service account permissions for those columns.
If a table has BigQuery row-level access policies enabled, then you can only scan rows visible to the Knowledge Catalog service account. Note that the individual user's access privileges are not evaluated for row-level policies.
Define data quality rules
You can define data quality rules by using built-in rules or custom SQL checks. If you're using the Google Cloud CLI, you can define these rules in a JSON or YAML file.
The examples in the following sections show how to define a variety of data quality rules. The rules validate a sample table that contains data about customer transactions. Assume the table has the following schema:
| Column name | Column type | Column description |
|---|---|---|
| `transaction_timestamp` | Timestamp | Timestamp of the transaction. The table is partitioned on this field. |
| `customer_id` | String | A customer ID in the format of 8 digits followed by 16 letters. |
| `transaction_id` | String | The transaction ID must be unique across the table. |
| `currency_id` | String | One of the supported currencies. The currency type must match one of the available currencies in the dimension table `dim_currency`. |
| `amount` | float | Transaction amount. |
| `discount_pct` | float | Discount percentage. This value must be between 0 and 100. |
Define data quality rules using built-in rule types
The following example rules are based on built-in rule types. You can create rules based on built-in rule types using the Google Cloud console or the API. Knowledge Catalog might recommend some of these rules.
| Column name | Rule type | Suggested dimension | Rule parameters |
|---|---|---|---|
| `transaction_id` | Uniqueness check | Uniqueness | Threshold: Not applicable |
| `amount` | Null check | Completeness | Threshold: 100% |
| `customer_id` | Regex (regular expression) check | Validity | Regular expression: `^[0-9]{8}[a-zA-Z]{16}$`<br>Threshold: 100% |
| `currency_id` | Value set check | Validity | Set of: `USD,JPY,INR,GBP,CAN`<br>Threshold: 100% |
Define data quality rules using custom SQL rules
To build custom SQL rules, use the following framework:
- When you create a rule that evaluates one row at a time, create an expression that generates the number of successful rows when Knowledge Catalog evaluates the query `SELECT COUNTIF(CUSTOM_SQL_EXPRESSION) FROM TABLE`. Knowledge Catalog checks the number of successful rows against the threshold.
- When you create a rule that evaluates across rows or uses a table condition, create an expression that returns success or failure when Knowledge Catalog evaluates the query `SELECT IF(CUSTOM_SQL_EXPRESSION) FROM TABLE`.
- When you create a rule that evaluates the invalid state of a dataset, provide a statement that returns invalid rows. If any rows are returned, the rule fails. Omit the trailing semicolon from the SQL statement.
- You can refer to a data source table and all of its precondition filters by using the data reference parameter `${data()}` in a rule, instead of explicitly mentioning the source table and its filters. Examples of precondition filters include row filters, sampling percents, and incremental filters. The `${data()}` parameter is case-sensitive.
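Before attaching a row-level rule to a scan, you can dry-run the equivalent `COUNTIF` query yourself with the bq CLI. This is a sketch; the table path reuses the sample `myProject.myDataset.transactions_table` name from the examples below and is a placeholder.

```shell
# Dry-run a row-condition rule: count passing rows versus total rows.
# The table path is a placeholder from the examples in this document.
bq query --use_legacy_sql=false \
  'SELECT
     COUNTIF(0 < discount_pct AND discount_pct < 100) AS passing_rows,
     COUNT(*) AS total_rows
   FROM `myProject.myDataset.transactions_table`'
```

Comparing `passing_rows / total_rows` with your intended threshold shows whether the rule would pass before you schedule the scan.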
The following example rules are based on custom SQL rules.
| Rule type | Rule description | SQL expression |
|---|---|---|
| Row condition | Checks if the value of `discount_pct` is between 0 and 100. | `0 < discount_pct AND discount_pct < 100` |
| Row condition | Reference check to validate that `currency_id` is one of the supported currencies. | `currency_id in (select id from my_project_id.dim_dataset.dim_currency)` |
| Table condition | Aggregate SQL expression that checks if the average `discount_pct` is between 30% and 50%. | `30 < avg(discount_pct) AND avg(discount_pct) < 50` |
| Row condition | Checks if a date is not in the future. | `TIMESTAMP(transaction_timestamp) < CURRENT_TIMESTAMP()` |
| Table condition | A BigQuery user-defined function (UDF) to check that the average transaction amount is less than a predefined value per country. Create the (JavaScript) UDF by running the following command:<br>`CREATE OR REPLACE FUNCTION myProject.myDataset.average_by_country(country STRING, average FLOAT64) RETURNS BOOL LANGUAGE js AS R""" if (country == "CAN" && average < 5000){ return 1 } else if (country == "IND" && average < 1000){ return 1 } else { return 0 } """;` | Example rule to check the average transaction amount for `country=CAN`:<br>`myProject.myDataset.average_by_country("CAN", (SELECT avg(amount) FROM myProject.myDataset.transactions_table WHERE currency_id = 'CAN'))` |
| Table condition | A BigQuery ML predict clause to identify anomalies in `discount_pct`. It checks whether a discount should be applied based on `customer`, `currency`, and `transaction`. The rule checks if the prediction matches the actual value at least 99% of the time. Assumption: the ML model is created before using the rule. Create the ML model using the following command:<br>``CREATE MODEL model-project-id.dataset-id.model-name OPTIONS(model_type='logistic_reg') AS SELECT IF(discount_pct IS NULL, 0, 1) AS label, IFNULL(customer_id, "") AS customer, IFNULL(currency_id, "") AS currency, IFNULL(amount, 0.0) AS amount FROM `data-project-id.dataset-id.table-names` WHERE transaction_timestamp < '2022-01-01';`` | The following rule checks if prediction accuracy is greater than 99%:<br>`SELECT accuracy > 0.99 FROM ML.EVALUATE(MODEL model-project-id.dataset-id.model-name, (SELECT customer_id, currency_id, amount, discount_pct FROM data-project-id.dataset-id.table-names WHERE transaction_timestamp > '2022-01-01'))` |
| Row condition | A BigQuery ML predict function to identify anomalies in `discount_pct`. The function checks if a discount should be applied based on `customer`, `currency`, and `transaction`. The rule identifies all occurrences where the prediction didn't match. Assumption: the ML model is created before using the rule. Create the ML model using the following command:<br>``CREATE MODEL model-project-id.dataset-id.model-name OPTIONS(model_type='logistic_reg') AS SELECT IF(discount_pct IS NULL, 0, 1) AS label, IFNULL(customer_id, "") AS customer, IFNULL(currency_id, "") AS currency, IFNULL(amount, 0.0) AS amount FROM `data-project-id.dataset-id.table-names` WHERE transaction_timestamp < '2022-01-01';`` | The following rule checks if the discount prediction matches the actual value for every row:<br>`IF(discount_pct > 0, 1, 0) = (SELECT predicted_label FROM ML.PREDICT(MODEL model-project-id.dataset-id.model-name, (SELECT customer_id, currency_id, amount, discount_pct FROM data-project-id.dataset-id.table-names AS t WHERE t.transaction_timestamp = transaction_timestamp LIMIT 1)))` |
| SQL assertion | Validates if the `discount_pct` is greater than 30% for today by checking whether any rows exist with a discount percent less than or equal to 30. | `SELECT * FROM my_project_id.dim_dataset.dim_currency WHERE discount_pct <= 30 AND transaction_timestamp >= current_date()` |
| SQL assertion | Checks whether any rows exist today with a discount percent greater than 30. The date filter `transaction_timestamp >= current_date()` is applied as a row filter on the data source table, and the data reference parameter `${data()}` refers to the table with that filter applied. | `SELECT * FROM ${data()} WHERE discount_pct > 30` |
Define data quality rules using the gcloud CLI
The following example YAML file uses some of the same rules as the sample rules using built-in types and the sample custom SQL rules. This YAML file also contains other specifications for the data quality scan, such as filters and sampling percent. When you use the gcloud CLI to create or update a data quality scan, you can use a YAML file like this as input to the `--data-quality-spec-file` argument.
```yaml
rules:
- uniquenessExpectation: {}
  column: transaction_id
  dimension: UNIQUENESS
- nonNullExpectation: {}
  column: amount
  dimension: COMPLETENESS
  threshold: 1
- regexExpectation:
    regex: '^[0-9]{8}[a-zA-Z]{16}$'
  column: customer_id
  ignoreNull: true
  dimension: VALIDITY
  threshold: 1
- setExpectation:
    values:
    - 'USD'
    - 'JPY'
    - 'INR'
    - 'GBP'
    - 'CAN'
  column: currency_id
  ignoreNull: true
  dimension: VALIDITY
  threshold: 1
- rangeExpectation:
    minValue: '0'
    maxValue: '100'
  column: discount_pct
  ignoreNull: true
  dimension: VALIDITY
  threshold: 1
- rowConditionExpectation:
    sqlExpression: 0 < `discount_pct` AND `discount_pct` < 100
  column: discount_pct
  dimension: VALIDITY
  threshold: 1
- rowConditionExpectation:
    sqlExpression: currency_id in (select id from `my_project_id.dim_dataset.dim_currency`)
  column: currency_id
  dimension: VALIDITY
  threshold: 1
- tableConditionExpectation:
    sqlExpression: 30 < avg(discount_pct) AND avg(discount_pct) < 50
  dimension: VALIDITY
- rowConditionExpectation:
    sqlExpression: TIMESTAMP(transaction_timestamp) < CURRENT_TIMESTAMP()
  column: transaction_timestamp
  dimension: VALIDITY
  threshold: 1
- sqlAssertion:
    sqlStatement: SELECT * FROM `my_project_id.dim_dataset.dim_currency` WHERE discount_pct > 100
  dimension: VALIDITY
debugQueries:
- sqlStatement: SELECT MAX(discount_pct) FROM `my_project_id.dim_dataset.dim_currency`
samplingPercent: 50
rowFilter: discount_pct > 100
postScanActions:
  bigqueryExport:
    resultsTable: projects/my_project_id/datasets/dim_dataset/tables/dim_currency
  notificationReport:
    recipients:
      emails:
      - '222larabrown@gmail.com'
      - 'cloudysanfrancisco@gmail.com'
    scoreThresholdTrigger:
      scoreThreshold: 50
    jobFailureTrigger: {}
    jobEndTrigger: {}
catalogPublishingEnabled: true
```
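Assuming a specification like the one above is saved as `data-quality-spec.yaml`, a scan can be created and run with commands like the following sketch. `SCAN_ID`, `PROJECT_ID`, `REGION`, and the table path are placeholders; check `gcloud dataplex datascans create data-quality --help` for the full flag list.

```shell
# Create a data quality scan from the YAML spec, then trigger a run.
# SCAN_ID, PROJECT_ID, REGION, and the table path are placeholders.
gcloud dataplex datascans create data-quality SCAN_ID \
  --project=PROJECT_ID \
  --location=REGION \
  --data-source-resource="//bigquery.googleapis.com/projects/myProject/datasets/myDataset/tables/transactions_table" \
  --data-quality-spec-file="data-quality-spec.yaml"

gcloud dataplex datascans run SCAN_ID \
  --project=PROJECT_ID \
  --location=REGION
```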
Configure execution identity
By default, data quality scans run using the Knowledge Catalog service account. You can override this to use a custom service account or your own End-User Credentials (EUC).
Using a custom execution identity changes how you are billed for the scan. When you specify a custom execution identity, the compute and storage costs associated with the scan are billed directly to your BigQuery project, bypassing the standard Knowledge Catalog Premium SKUs.
Required permissions for custom execution identities
To configure a custom service account or use end-user credentials, you must have the following additional IAM permissions:
- To use a custom service account, you need the following:
  - The `iam.serviceAccounts.actAs` permission on the project that contains the service account (for example, through the `roles/iam.serviceAccountUser` role).
  - Your project's service agent (`service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com`) needs the `iam.serviceAccounts.getAccessToken` permission on the custom service account (for example, by having the `roles/iam.serviceAccountTokenCreator` role).
  - The custom service account needs `bigquery.tables.getData` on the table to scan, `bigquery.jobs.create` in the scan project, and `roles/bigquery.dataEditor` on the export dataset (if you export results).
- To use end-user credentials, you need the following:
  - `bigquery.tables.getData` on the table to scan.
  - `bigquery.jobs.create` in the scan project.
  - `roles/bigquery.dataEditor` on the export dataset (if you export results).
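As a sketch, the token-minting grant for a custom service account can be made with the gcloud CLI; the service account email and `PROJECT_NUMBER` are placeholders.

```shell
# Allow the Dataplex service agent to impersonate the custom service
# account (emails and PROJECT_NUMBER are placeholders).
gcloud iam service-accounts add-iam-policy-binding \
  CUSTOM_SA@PROJECT_ID.iam.gserviceaccount.com \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountTokenCreator"
```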
Console
To configure the execution identity in the Google Cloud console, select the identity when you create your data quality scan.

- In the Execution Identity section, select one of the following:
  - Dataplex Service Agent: The default behavior.
  - Specific Service Account: Enter the email address of the service account that you want to use.
  - User Credentials: Use your own credentials to run the scan.