This document describes how to use Knowledge Catalog (formerly Dataplex Universal Catalog) data quality scans to measure, monitor, and manage the quality of your data. Data quality scans help you automate the process of validating your data for completeness, validity, and consistency.
With data quality scans, you can define rules to check for missing values, ensure values match a regular expression or belong to a set, verify uniqueness, or use custom SQL for more complex validations such as anomaly detection. This document explains how to create and manage data quality scans.
To learn more about data quality scans, see About auto data quality.
Before you begin
- Enable the Dataplex API.

  Roles required to enable APIs

  To enable APIs, you need the Service Usage Admin IAM role (`roles/serviceusage.serviceUsageAdmin`), which contains the `serviceusage.services.enable` permission. Learn how to grant roles.

- Optional: If you want Knowledge Catalog to generate recommendations for data quality rules based on the results of a data profile scan, create and run the data profile scan.
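The API can also be enabled from the gcloud CLI. The following is a sketch; `PROJECT_ID` is a placeholder for your project ID.

```shell
# Enable the Dataplex API (PROJECT_ID is a placeholder).
gcloud services enable dataplex.googleapis.com --project=PROJECT_ID
```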
Required roles and permissions
This section describes the IAM roles and permissions needed to use Knowledge Catalog data quality scans.
User roles and permissions
To get the permissions that you need to run and manage data quality scans, ask your administrator to grant you the following IAM roles:
- Run a data quality scan on a BigQuery table:
  - BigQuery Job User (`roles/bigquery.jobUser`) on the project to run scan jobs
  - BigQuery Data Viewer (`roles/bigquery.dataViewer`) on the BigQuery table to be scanned
- Publish data quality scan results to Knowledge Catalog:
  - BigQuery Data Editor (`roles/bigquery.dataEditor`) on the scanned table
  - Dataplex Catalog Editor (`roles/dataplex.catalogEditor`) on the `@bigquery` entry group in the same location as the table
- Perform specific tasks on `DataScan` resources:
  - Dataplex DataScan Administrator (`roles/dataplex.dataScanAdmin`) on the project for full access
  - Dataplex DataScan Creator (`roles/dataplex.dataScanCreator`) on the project to create scans
  - Dataplex DataScan Editor (`roles/dataplex.dataScanEditor`) on the project for write access
  - Dataplex DataScan Viewer (`roles/dataplex.dataScanViewer`) on the project to read scan metadata
  - Dataplex DataScan DataViewer (`roles/dataplex.dataScanDataViewer`) on the project to read scan data, including rules and results
For more information about granting roles, see Manage access to projects, folders, and organizations.
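As a sketch of how any of the preceding roles can be granted with the gcloud CLI (`PROJECT_ID` and `USER_EMAIL` are placeholders; substitute whichever role fits the task):

```shell
# Grant the Dataplex DataScan Editor role to a user on the project.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/dataplex.dataScanEditor"
```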
These predefined roles contain the permissions required to run and manage data quality scans. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to run and manage data quality scans:
- Run a data quality scan on a BigQuery table:
  - `bigquery.jobs.create` on the project to run scan jobs
  - `bigquery.tables.get` on the BigQuery table to be scanned
  - `bigquery.tables.getData` on the BigQuery table to be scanned
- Publish data quality scan results to Knowledge Catalog:
  - `bigquery.tables.update` on the scanned table
  - `dataplex.entryGroups.useDataQualityScorecardAspect` on the `@bigquery` entry group in the same location as the table
- Create a `DataScan`: `dataplex.datascans.create` on the project
- Delete a `DataScan`: `dataplex.datascans.delete` on the project
- View `DataScan` metadata: `dataplex.datascans.get` on the project
- View `DataScan` details, including rules and results: `dataplex.datascans.getData` on the project
- List `DataScan` resources: `dataplex.datascans.list` on the project
- Run a `DataScan`: `dataplex.datascans.run` on the project
- Update a `DataScan`: `dataplex.datascans.update` on the project
- Get or set the IAM policy on a `DataScan`:
  - `dataplex.datascans.getIamPolicy` on the project
  - `dataplex.datascans.setIamPolicy` on the project
You might also be able to get these permissions with custom roles or other predefined roles.
If you need to access columns protected by BigQuery column-level access policies, then you also need permissions for those columns.
Knowledge Catalog service account roles and permissions
If you haven't created any data quality or data profile scans, or you don't have a Knowledge Catalog lake in this project, create a service identity by running the following command: `gcloud beta services identity create --service=dataplex.googleapis.com`. This command creates the Knowledge Catalog service identity if needed, and returns it if it already exists.
To ensure that the Knowledge Catalog service account of the project that contains the data quality scan can read data from various sources and export results, ask your administrator to grant that service account the following IAM roles:
- Read BigQuery table data: BigQuery Data Viewer (`roles/bigquery.dataViewer`) on the BigQuery tables to be scanned and any other tables referenced in rules
- Export scan results to a BigQuery table: BigQuery Data Editor (`roles/bigquery.dataEditor`) on the results dataset and table
- Scan BigQuery data organized in a Knowledge Catalog lake:
  - Dataplex Metadata Reader (`roles/dataplex.metadataReader`) on Dataplex resources
  - Dataplex Viewer (`roles/dataplex.viewer`) on Dataplex resources
- Scan a BigQuery external table from Cloud Storage: Storage Object Viewer (`roles/storage.objectViewer`) on the Cloud Storage bucket
For more information about granting roles, see Manage access to projects, folders, and organizations.
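For example, granting the BigQuery Data Viewer role to the service account might look like the following sketch; `PROJECT_ID` and `PROJECT_NUMBER` are placeholders, and the service account address follows the `service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com` format used elsewhere in this document.

```shell
# Grant BigQuery Data Viewer to the Knowledge Catalog service account
# (PROJECT_ID and PROJECT_NUMBER are placeholders).
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"
```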
These predefined roles contain the permissions required to read data from various sources and export results. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to read data from various sources and export results:
- Read BigQuery table data:
  - `bigquery.tables.get` on BigQuery tables
  - `bigquery.tables.getData` on BigQuery tables
- Export scan results to a BigQuery table:
  - `bigquery.datasets.get` on the results dataset and table
  - `bigquery.tables.create` on the results dataset and table
  - `bigquery.tables.get` on the results dataset and table
  - `bigquery.tables.getData` on the results dataset and table
  - `bigquery.tables.update` on the results dataset and table
  - `bigquery.tables.updateData` on the results dataset and table
- Scan BigQuery data organized in a Knowledge Catalog lake:
  - `dataplex.lakes.list` on Dataplex resources
  - `dataplex.lakes.get` on Dataplex resources
  - `dataplex.zones.list` on Dataplex resources
  - `dataplex.zones.get` on Dataplex resources
  - `dataplex.entities.list` on Dataplex resources
  - `dataplex.entities.get` on Dataplex resources
  - `dataplex.operations.get` on Dataplex resources
- Scan a BigQuery external table from Cloud Storage:
  - `storage.buckets.get` on the Cloud Storage bucket
  - `storage.objects.get` on the Cloud Storage bucket
Your administrator might also be able to grant these permissions to the Knowledge Catalog service account through custom roles or other predefined roles.
If you need to access columns protected by BigQuery column-level access policies, then assign the Knowledge Catalog service account permissions for those columns.
If a table has BigQuery row-level access policies enabled, then you can only scan rows visible to the Knowledge Catalog service account. Note that the individual user's access privileges are not evaluated for row-level policies.
Define data quality rules
You can define data quality rules by using built-in rules or custom SQL checks. If you're using the Google Cloud CLI, you can define these rules in a JSON or YAML file.
The examples in the following sections show how to define a variety of data quality rules. The rules validate a sample table that contains data about customer transactions. Assume the table has the following schema:
| Column name | Column type | Column description |
|---|---|---|
| `transaction_timestamp` | Timestamp | Timestamp of the transaction. The table is partitioned on this field. |
| `customer_id` | String | A customer ID in the format of 8 digits followed by 16 letters. |
| `transaction_id` | String | The transaction ID must be unique across the table. |
| `currency_id` | String | One of the supported currencies. The currency type must match one of the available currencies in the dimension table `dim_currency`. |
| `amount` | float | Transaction amount. |
| `discount_pct` | float | Discount percentage. This value must be between 0 and 100. |
Define data quality rules using built-in rule types
The following example rules are based on built-in rule types. You can create rules based on built-in rule types using the Google Cloud console or the API. Knowledge Catalog might recommend some of these rules.
| Column name | Rule type | Suggested dimension | Rule parameters |
|---|---|---|---|
| `transaction_id` | Uniqueness check | Uniqueness | Threshold: Not applicable |
| `amount` | Null check | Completeness | Threshold: 100% |
| `customer_id` | Regex (regular expression) check | Validity | Regular expression: `^[0-9]{8}[a-zA-Z]{16}$`<br>Threshold: 100% |
| `currency_id` | Value set check | Validity | Set of: `USD,JPY,INR,GBP,CAN`<br>Threshold: 100% |
Define data quality rules using custom SQL rules
To build custom SQL rules, use the following framework:
- When you create a rule that evaluates one row at a time, create an expression that generates the number of successful rows when Knowledge Catalog evaluates the query `SELECT COUNTIF(CUSTOM_SQL_EXPRESSION) FROM TABLE`. Knowledge Catalog checks the number of successful rows against the threshold.
- When you create a rule that evaluates across rows or uses a table condition, create an expression that returns success or failure when Knowledge Catalog evaluates the query `SELECT IF(CUSTOM_SQL_EXPRESSION) FROM TABLE`.
- When you create a rule that evaluates the invalid state of a dataset, provide a statement that returns invalid rows. If any rows are returned, the rule fails. Omit the trailing semicolon from the SQL statement.
- You can refer to a data source table and all of its precondition filters by using the data reference parameter `${data()}` in a rule, instead of explicitly mentioning the source table and its filters. Examples of precondition filters include row filters, sampling percents, and incremental filters. The `${data()}` parameter is case-sensitive.
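Before attaching a row-level rule to a scan, you can dry-run the equivalent `COUNTIF` query yourself with the bq CLI. This is a sketch; the table path reuses the sample `myProject.myDataset.transactions_table` name from the examples below and is a placeholder.

```shell
# Dry-run a row-condition rule: count passing rows versus total rows.
# The table path is a placeholder from the examples in this document.
bq query --use_legacy_sql=false \
  'SELECT
     COUNTIF(0 < discount_pct AND discount_pct < 100) AS passing_rows,
     COUNT(*) AS total_rows
   FROM `myProject.myDataset.transactions_table`'
```

Comparing `passing_rows / total_rows` with your intended threshold shows whether the rule would pass before you schedule the scan.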
The following example rules are based on custom SQL rules.
| Rule type | Rule description | SQL expression |
|---|---|---|
| Row condition | Checks if the value of `discount_pct` is between 0 and 100. | `0 < discount_pct AND discount_pct < 100` |
| Row condition | Reference check to validate that `currency_id` is one of the supported currencies. | `currency_id in (select id from my_project_id.dim_dataset.dim_currency)` |
| Table condition | Aggregate SQL expression that checks if the average `discount_pct` is between 30% and 50%. | `30 < avg(discount_pct) AND avg(discount_pct) < 50` |
| Row condition | Checks if a date is not in the future. | `TIMESTAMP(transaction_timestamp) < CURRENT_TIMESTAMP()` |
| Table condition | A BigQuery user-defined function (UDF) to check that the average transaction amount is less than a predefined value per country. Create the (JavaScript) UDF by running the following command:<br>`CREATE OR REPLACE FUNCTION myProject.myDataset.average_by_country(country STRING, average FLOAT64) RETURNS BOOL LANGUAGE js AS R""" if (country == "CAN" && average < 5000){ return 1 } else if (country == "IND" && average < 1000){ return 1 } else { return 0 } """;` | Example rule to check the average transaction amount for `country=CAN`:<br>`myProject.myDataset.average_by_country("CAN", (SELECT avg(amount) FROM myProject.myDataset.transactions_table WHERE currency_id = 'CAN'))` |
| Table condition | A BigQuery ML predict clause to identify anomalies in `discount_pct`. It checks whether a discount should be applied based on `customer`, `currency`, and `transaction`. The rule checks if the prediction matches the actual value at least 99% of the time. Assumption: the ML model is created before using the rule. Create the ML model using the following command:<br>``CREATE MODEL model-project-id.dataset-id.model-name OPTIONS(model_type='logistic_reg') AS SELECT IF(discount_pct IS NULL, 0, 1) AS label, IFNULL(customer_id, "") AS customer, IFNULL(currency_id, "") AS currency, IFNULL(amount, 0.0) AS amount FROM `data-project-id.dataset-id.table-names` WHERE transaction_timestamp < '2022-01-01';`` | The following rule checks if prediction accuracy is greater than 99%:<br>`SELECT accuracy > 0.99 FROM ML.EVALUATE(MODEL model-project-id.dataset-id.model-name, (SELECT customer_id, currency_id, amount, discount_pct FROM data-project-id.dataset-id.table-names WHERE transaction_timestamp > '2022-01-01'))` |
| Row condition | A BigQuery ML predict function to identify anomalies in `discount_pct`. The function checks if a discount should be applied based on `customer`, `currency`, and `transaction`. The rule identifies all occurrences where the prediction didn't match. Assumption: the ML model is created before using the rule. Create the ML model using the following command:<br>``CREATE MODEL model-project-id.dataset-id.model-name OPTIONS(model_type='logistic_reg') AS SELECT IF(discount_pct IS NULL, 0, 1) AS label, IFNULL(customer_id, "") AS customer, IFNULL(currency_id, "") AS currency, IFNULL(amount, 0.0) AS amount FROM `data-project-id.dataset-id.table-names` WHERE transaction_timestamp < '2022-01-01';`` | The following rule checks if the discount prediction matches the actual value for every row:<br>`IF(discount_pct > 0, 1, 0) = (SELECT predicted_label FROM ML.PREDICT(MODEL model-project-id.dataset-id.model-name, (SELECT customer_id, currency_id, amount, discount_pct FROM data-project-id.dataset-id.table-names AS t WHERE t.transaction_timestamp = transaction_timestamp LIMIT 1)))` |
| SQL assertion | Validates if the `discount_pct` is greater than 30% for today by checking whether any rows exist with a discount percent less than or equal to 30. | `SELECT * FROM my_project_id.dim_dataset.dim_currency WHERE discount_pct <= 30 AND transaction_timestamp >= current_date()` |
| SQL assertion | Checks whether any rows exist today with a discount percent greater than 30. The date filter `transaction_timestamp >= current_date()` is applied as a row filter on the data source table, and the data reference parameter `${data()}` refers to the table with that filter applied. | `SELECT * FROM ${data()} WHERE discount_pct > 30` |
Define data quality rules using the gcloud CLI
The following example YAML file uses some of the same rules as the sample rules using built-in types and the sample custom SQL rules. This YAML file also contains other specifications for the data quality scan, such as filters and sampling percent. When you use the gcloud CLI to create or update a data quality scan, you can use a YAML file like this as input to the `--data-quality-spec-file` argument.
```yaml
rules:
- uniquenessExpectation: {}
  column: transaction_id
  dimension: UNIQUENESS
- nonNullExpectation: {}
  column: amount
  dimension: COMPLETENESS
  threshold: 1
- regexExpectation:
    regex: '^[0-9]{8}[a-zA-Z]{16}$'
  column: customer_id
  ignoreNull: true
  dimension: VALIDITY
  threshold: 1
- setExpectation:
    values:
    - 'USD'
    - 'JPY'
    - 'INR'
    - 'GBP'
    - 'CAN'
  column: currency_id
  ignoreNull: true
  dimension: VALIDITY
  threshold: 1
- rangeExpectation:
    minValue: '0'
    maxValue: '100'
  column: discount_pct
  ignoreNull: true
  dimension: VALIDITY
  threshold: 1
- rowConditionExpectation:
    sqlExpression: 0 < `discount_pct` AND `discount_pct` < 100
  column: discount_pct
  dimension: VALIDITY
  threshold: 1
- rowConditionExpectation:
    sqlExpression: currency_id in (select id from `my_project_id.dim_dataset.dim_currency`)
  column: currency_id
  dimension: VALIDITY
  threshold: 1
- tableConditionExpectation:
    sqlExpression: 30 < avg(discount_pct) AND avg(discount_pct) < 50
  dimension: VALIDITY
- rowConditionExpectation:
    sqlExpression: TIMESTAMP(transaction_timestamp) < CURRENT_TIMESTAMP()
  column: transaction_timestamp
  dimension: VALIDITY
  threshold: 1
- sqlAssertion:
    sqlStatement: SELECT * FROM `my_project_id.dim_dataset.dim_currency` WHERE discount_pct > 100
  dimension: VALIDITY
debugQueries:
- sqlStatement: SELECT MAX(discount_pct) FROM `my_project_id.dim_dataset.dim_currency`
samplingPercent: 50
rowFilter: discount_pct > 100
postScanActions:
  bigqueryExport:
    resultsTable: projects/my_project_id/datasets/dim_dataset/tables/dim_currency
  notificationReport:
    recipients:
      emails:
      - '222larabrown@gmail.com'
      - 'cloudysanfrancisco@gmail.com'
    scoreThresholdTrigger:
      scoreThreshold: 50
    jobFailureTrigger: {}
    jobEndTrigger: {}
catalogPublishingEnabled: true
```
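Assuming a specification like the one above is saved as `data-quality-spec.yaml`, a scan can be created and run with commands like the following sketch. `SCAN_ID`, `PROJECT_ID`, `REGION`, and the table path are placeholders; check `gcloud dataplex datascans create data-quality --help` for the full flag list.

```shell
# Create a data quality scan from the YAML spec, then trigger a run.
# SCAN_ID, PROJECT_ID, REGION, and the table path are placeholders.
gcloud dataplex datascans create data-quality SCAN_ID \
  --project=PROJECT_ID \
  --location=REGION \
  --data-source-resource="//bigquery.googleapis.com/projects/myProject/datasets/myDataset/tables/transactions_table" \
  --data-quality-spec-file="data-quality-spec.yaml"

gcloud dataplex datascans run SCAN_ID \
  --project=PROJECT_ID \
  --location=REGION
```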
Configure execution identity
By default, data quality scans run using the Knowledge Catalog service account. You can override this to use a custom service account or your own End-User Credentials (EUC).
Using a custom execution identity changes how you are billed for the scan. When you specify a custom execution identity, the compute and storage costs associated with the scan are billed directly to your BigQuery project, bypassing the standard Knowledge Catalog Premium SKUs.
Required permissions for custom execution identities
To configure a custom service account or use end-user credentials, you must have the following additional IAM permissions:
- To use a custom service account, you need the following:
  - The `iam.serviceAccounts.actAs` permission on the project that contains the service account (for example, through the `roles/iam.serviceAccountUser` role).
  - Your project's service agent (`service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com`) needs the `iam.serviceAccounts.getAccessToken` permission on the custom service account (for example, by having the `roles/iam.serviceAccountTokenCreator` role).
  - The custom service account needs `bigquery.tables.getData` on the table to scan, `bigquery.jobs.create` in the scan project, and `roles/bigquery.dataEditor` on the export dataset (if you export results).
- To use end-user credentials, you need the following:
  - `bigquery.tables.getData` on the table to scan.
  - `bigquery.jobs.create` in the scan project.
  - `roles/bigquery.dataEditor` on the export dataset (if you export results).
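As a sketch, the token-minting grant for a custom service account can be made with the gcloud CLI; the service account email and `PROJECT_NUMBER` are placeholders.

```shell
# Allow the Dataplex service agent to impersonate the custom service
# account (emails and PROJECT_NUMBER are placeholders).
gcloud iam service-accounts add-iam-policy-binding \
  CUSTOM_SA@PROJECT_ID.iam.gserviceaccount.com \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountTokenCreator"
```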
Console
To configure the execution identity in the Google Cloud console, select the identity when you create your data quality scan.

- In the Execution Identity section, select one of the following:
  - Dataplex Service Agent: The default behavior.
  - Specific Service Account: Enter the email address of the service account that you want to use.
  - User Credentials: Use your own credentials to run the scan.