Scan for data quality issues
This document explains how to use BigQuery and Dataplex Universal Catalog together to ensure that data meets your quality expectations. Dataplex Universal Catalog automatic data quality lets you define and measure the quality of the data in your BigQuery tables. You can automate the scanning of data, validate data against defined rules, and log alerts if your data doesn't meet quality requirements.
For more information about automatic data quality, see the Auto data quality overview.
Before you begin
- Enable the Dataplex API.

  **Roles required to enable APIs**

  To enable APIs, you need the Service Usage Admin IAM role (`roles/serviceusage.serviceUsageAdmin`), which contains the `serviceusage.services.enable` permission. Learn how to grant roles.

- Optional: If you want Dataplex Universal Catalog to generate recommendations for data quality rules based on the results of a data profile scan, create and run the data profile scan.
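For example, you can enable the Dataplex API from the command line with the Google Cloud CLI (this assumes your default project is already set):

```shell
# Enable the Dataplex API in the current project.
gcloud services enable dataplex.googleapis.com
```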
Required roles
This section describes the IAM roles and permissions needed to use Dataplex Universal Catalog data quality scans.
User roles and permissions
To get the permissions that you need to run and manage data quality scans, ask your administrator to grant you the following IAM roles:
- Run a data quality scan on a BigQuery table:
  - BigQuery Job User (`roles/bigquery.jobUser`) on the project to run scan jobs
  - BigQuery Data Viewer (`roles/bigquery.dataViewer`) on the BigQuery table to be scanned
- Publish data quality scan results to Dataplex Universal Catalog:
  - BigQuery Data Editor (`roles/bigquery.dataEditor`) on the scanned table
  - Dataplex Catalog Editor (`roles/dataplex.catalogEditor`) on the `@bigquery` entry group in the same location as the table
- Perform specific tasks on `DataScan` resources:
  - Dataplex DataScan Administrator (`roles/dataplex.dataScanAdmin`) on the project for full access
  - Dataplex DataScan Creator (`roles/dataplex.dataScanCreator`) on the project to create scans
  - Dataplex DataScan Editor (`roles/dataplex.dataScanEditor`) on the project for write access
  - Dataplex DataScan Viewer (`roles/dataplex.dataScanViewer`) on the project to read scan metadata
  - Dataplex DataScan DataViewer (`roles/dataplex.dataScanDataViewer`) on the project to read scan data, including rules and results
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to run and manage data quality scans. To see the exact permissions that are required, expand the **Required permissions** section:
Required permissions
The following permissions are required to run and manage data quality scans:
- Run a data quality scan on a BigQuery table:
  - `bigquery.jobs.create` on the project to run scan jobs
  - `bigquery.tables.get` on the BigQuery table to be scanned
  - `bigquery.tables.getData` on the BigQuery table to be scanned
- Publish data quality scan results to Dataplex Universal Catalog:
  - `bigquery.tables.update` on the scanned table
  - `dataplex.entryGroups.useDataQualityScorecardAspect` on the `@bigquery` entry group in the same location as the table
- Create a `DataScan`: `dataplex.datascans.create` on the project
- Delete a `DataScan`: `dataplex.datascans.delete` on the project
- View `DataScan` metadata: `dataplex.datascans.get` on the project
- View `DataScan` details, including rules and results: `dataplex.datascans.getData` on the project
- List `DataScan` resources: `dataplex.datascans.list` on the project
- Run a `DataScan`: `dataplex.datascans.run` on the project
- Update a `DataScan`: `dataplex.datascans.update` on the project
- Get or set the IAM policy on a `DataScan`:
  - `dataplex.datascans.getIamPolicy` on the project
  - `dataplex.datascans.setIamPolicy` on the project
You might also be able to get these permissions with custom roles or other predefined roles.
If you need to access columns protected by BigQuery column-level access policies, then you also need permissions for those columns.
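As an illustration, an administrator could grant two of these roles at the project level with the gcloud CLI. The project ID and principal below are hypothetical; substitute your own:

```shell
# Hypothetical project ID and principal; replace with your own values.
gcloud projects add-iam-policy-binding test-project \
    --member="user:analyst@example.com" \
    --role="roles/dataplex.dataScanAdmin"
gcloud projects add-iam-policy-binding test-project \
    --member="user:analyst@example.com" \
    --role="roles/bigquery.jobUser"
```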
Dataplex Universal Catalog service account roles and permissions
If you haven't created any data quality or data profile scans, or you don't have a Dataplex Universal Catalog lake in this project, create a service identity by running:

```shell
gcloud beta services identity create --service=dataplex.googleapis.com
```

If the Dataplex Universal Catalog service identity already exists, this command returns it.
The Dataplex Universal Catalog service account of the project that contains the data quality scan needs permissions to read data from various sources and export results. Ask your administrator to grant that service account the following IAM roles:
- Read BigQuery table data: BigQuery Data Viewer (`roles/bigquery.dataViewer`) on the BigQuery tables to be scanned and any other tables referenced in rules
- Export scan results to a BigQuery table: BigQuery Data Editor (`roles/bigquery.dataEditor`) on the results dataset and table
- Scan BigQuery data organized in a Dataplex Universal Catalog lake:
  - Dataplex Metadata Reader (`roles/dataplex.metadataReader`) on the Dataplex resources
  - Dataplex Viewer (`roles/dataplex.viewer`) on the Dataplex resources
- Scan a BigQuery external table from Cloud Storage: Storage Object Viewer (`roles/storage.objectViewer`) on the Cloud Storage bucket
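For example, to let the service account read the tables to be scanned, an administrator might run the following. The project ID and project number are hypothetical, and the service account address assumes the usual `service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com` pattern for the Dataplex service identity:

```shell
# Hypothetical values; substitute your project ID and project number.
gcloud projects add-iam-policy-binding test-project \
    --member="serviceAccount:service-123456789012@gcp-sa-dataplex.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataViewer"
```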
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to read data from various sources and export results. To see the exact permissions that are required, expand the **Required permissions** section:
Required permissions
The following permissions are required to read data from various sources and export results:
- Read BigQuery table data:
  - `bigquery.tables.get` on BigQuery tables
  - `bigquery.tables.getData` on BigQuery tables
- Export scan results to a BigQuery table:
  - `bigquery.datasets.get` on the results dataset and table
  - `bigquery.tables.create` on the results dataset and table
  - `bigquery.tables.get` on the results dataset and table
  - `bigquery.tables.getData` on the results dataset and table
  - `bigquery.tables.update` on the results dataset and table
  - `bigquery.tables.updateData` on the results dataset and table
- Scan BigQuery data organized in a Dataplex Universal Catalog lake:
  - `dataplex.lakes.list` on Dataplex resources
  - `dataplex.lakes.get` on Dataplex resources
  - `dataplex.zones.list` on Dataplex resources
  - `dataplex.zones.get` on Dataplex resources
  - `dataplex.entities.list` on Dataplex resources
  - `dataplex.entities.get` on Dataplex resources
  - `dataplex.operations.get` on Dataplex resources
- Scan a BigQuery external table from Cloud Storage:
  - `storage.buckets.get` on the Cloud Storage bucket
  - `storage.objects.get` on the Cloud Storage bucket
Your administrator might also be able to grant the Dataplex Universal Catalog service account these permissions with custom roles or other predefined roles.
If you need to access columns protected by BigQuery column-level access policies, then assign the Dataplex Universal Catalog service account permissions for those columns.
If a table has BigQuery row-level access policies enabled, then you can only scan rows visible to the Dataplex Universal Catalog service account. Note that the individual user's access privileges are not evaluated for row-level policies.
Create a data quality scan
Console
- In the Google Cloud console, on the BigQuery **Metadata curation** page, go to the **Data profiling & quality** tab.
- Click **Create data quality scan**.
- In the **Define scan** window, fill in the following fields:
  - Optional: Enter a **Display name**.
  - Enter an ID. See the resource naming conventions.
  - Optional: Enter a **Description**.
  - In the **Table** field, click **Browse**. Choose the table to scan, and then click **Select**. Only standard BigQuery tables are supported.

    For tables in multi-region datasets, choose a region in which to create the data scan.

    To browse the tables organized within Dataplex Universal Catalog lakes, click **Browse within Dataplex Lakes**.
  - In the **Scope** field, choose **Incremental** or **Entire data**.
    - If you choose **Incremental**: In the **Timestamp column** field, select a column of type `DATE` or `TIMESTAMP` from your BigQuery table that increases as new records are added and that can be used to identify new records. It can be a column that partitions the table.
  - To filter your data, select the **Filter rows** checkbox. Provide a row filter consisting of a valid SQL expression that can be used as part of a `WHERE` clause in GoogleSQL syntax. For example, `col1 >= 0`. The filter can be a combination of multiple column conditions. For example, `col1 >= 0 AND col2 < 10`.
  - To sample your data, in the **Sampling size** list, select a sampling percentage. Choose a percentage value between 0.0% and 100.0%, with up to 3 decimal digits. For larger datasets, choose a lower sampling percentage. For example, for a 1 PB table, if you enter a value between 0.1% and 1.0%, the data quality scan samples between 1 and 10 TB of data. For incremental data scans, the scan applies sampling to the latest increment.
  - To publish the data quality scan results as Dataplex Universal Catalog metadata, select the **Publish results to Dataplex Catalog** checkbox.

    You can view the latest scan results on the **Data quality** tab in the BigQuery and Dataplex Universal Catalog pages for the source table. To enable users to access the published scan results, see the Grant access to data quality scan results section of this document.
  - In the **Schedule** section, choose one of the following options:
    - **Repeat**: Run the data quality scan on a schedule: hourly, daily, weekly, monthly, or custom. Specify how often the scan runs and at what time. If you choose custom, use cron format to specify the schedule. For example, `0 */6 * * *` runs the scan every six hours.
    - **On-demand**: Run the data quality scan on demand.
    - **One-time run**: Run the data quality scan once now, and remove the scan after the time-to-live period. This feature is in preview.
      - **Time to live**: The time-to-live value is the time span between when the scan is executed and when the scan is deleted. A data quality scan without a specified time-to-live is automatically deleted 24 hours after its execution. The time-to-live can range from 0 seconds (immediate deletion) to 365 days.
  - Click **Continue**.
- In the **Data quality rules** window, define the rules to configure for this data quality scan.
  - Click **Add rules**, and then choose from the following options.
    - **Profile based recommendations**: Build rules from the recommendations based on an existing data profiling scan.
      - **Choose columns**: Select the columns to get recommended rules for.
      - **Choose scan project**: If the data profiling scan is in a different project than the project where you are creating the data quality scan, select the project to pull profile scans from.
      - **Choose profile results**: Select one or more profile results, and then click **OK**. This populates a list of suggested rules that you can use as a starting point.
      - Select the checkbox for the rules that you want to add, and then click **Select**. The selected rules are added to your current rule list. Then, you can edit the rules.
    - **Built-in rule types**: Build rules from predefined rules. See the list of predefined rules.
      - **Choose columns**: Select the columns to select rules for.
      - **Choose rule types**: Select the rule types that you want to choose from, and then click **OK**. The rule types that appear depend on the columns that you selected.
      - Select the checkbox for the rules that you want to add, and then click **Select**. The selected rules are added to your current rules list. Then, you can edit the rules.
    - **SQL row check rule**: Create a custom SQL rule to apply to each row.
      - In **Dimension**, choose one dimension.
      - In **Passing threshold**, choose a percentage of records that must pass the check.
      - In **Column name**, choose a column.
      - In the **Provide a SQL expression** field, enter a SQL expression that evaluates to a boolean `true` (pass) or `false` (fail). For more information, see Supported custom SQL rule types and the examples in Define data quality rules.
      - Click **Add**.
    - **SQL aggregate check rule**: Create a custom SQL table condition rule.
      - In **Dimension**, choose one dimension.
      - In **Column name**, choose a column.
      - In the **Provide a SQL expression** field, enter a SQL expression that evaluates to a boolean `true` (pass) or `false` (fail). For more information, see Supported custom SQL rule types and the examples in Define data quality rules.
      - Click **Add**.
    - **SQL assertion rule**: Create a custom SQL assertion rule to check for an invalid state of the data.
      - In **Dimension**, choose one dimension.
      - Optional: In **Column name**, choose a column.
      - In the **Provide a SQL statement** field, enter a SQL statement that returns rows that match the invalid state. If any rows are returned, this rule fails. Omit the trailing semicolon from the SQL statement. For more information, see Supported custom SQL rule types and the examples in Define data quality rules.
      - Click **Add**.
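To make the three custom SQL rule types concrete, here are illustrative sketches. The table and column names (`my_project.my_dataset.orders`, `discount_pct`, `amount`) are hypothetical and do not come from this document:

```sql
-- SQL row check rule: evaluated once per row; a row passes when the
-- expression is true. (Hypothetical column name.)
discount_pct BETWEEN 0 AND 100

-- SQL aggregate check rule: a single boolean condition evaluated
-- over the whole table.
COUNT(*) > 0

-- SQL assertion rule: returns the rows that are in an invalid state;
-- the rule fails if any rows are returned. No trailing semicolon.
SELECT * FROM my_project.my_dataset.orders WHERE amount < 0
```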
  - Optional: For any data quality rule, you can assign a custom rule name to use for monitoring and alerting, and a description. To do this, edit a rule and specify the following details:
    - **Rule name**: Enter a custom rule name with up to 63 characters. The rule name can include letters (a-z, A-Z), digits (0-9), and hyphens (-), and it must start with a letter and end with a number or a letter.
    - **Description**: Enter a rule description with a maximum length of 1,024 characters.
  - Repeat the previous steps to add additional rules to the data quality scan. When finished, click **Continue**.
- Optional: Export the scan results to a BigQuery standard table. In the **Export scan results to BigQuery table** section, do the following:
  - In the **Select BigQuery dataset** field, click **Browse**. Select a BigQuery dataset to store the data quality scan results.
  - In the **BigQuery table** field, specify the table to store the data quality scan results. If you're using an existing table, make sure that it is compatible with the export table schema. If the specified table doesn't exist, Dataplex Universal Catalog creates it for you.
- Optional: Add labels. Labels are key-value pairs that let you group related objects together or with other Google Cloud resources.
- Optional: Set up email notification reports to alert people about the status and results of a data quality scan job. In the **Notification report** section, click **Add email ID** and enter up to five email addresses. Then, select the scenarios that you want to send reports for:
  - **Quality score (<=)**: Sends a report when a job succeeds with a data quality score that is lower than the specified target score. Enter a target quality score between 0 and 100.
  - **Job failures**: Sends a report when the job itself fails, regardless of the data quality results.
  - **Job completion (success or failure)**: Sends a report when the job ends, regardless of the data quality results.
- Click **Create**.

After the scan is created, you can run it at any time by clicking **Run now**.
gcloud
To create a data quality scan, use the `gcloud dataplex datascans create data-quality` command.

If the source data is organized in a Dataplex Universal Catalog lake, include the `--data-source-entity` flag:

```shell
gcloud dataplex datascans create data-quality DATASCAN \
    --location=LOCATION \
    --data-quality-spec-file=DATA_QUALITY_SPEC_FILE \
    --data-source-entity=DATA_SOURCE_ENTITY
```

If the source data isn't organized in a Dataplex Universal Catalog lake, include the `--data-source-resource` flag:

```shell
gcloud dataplex datascans create data-quality DATASCAN \
    --location=LOCATION \
    --data-quality-spec-file=DATA_QUALITY_SPEC_FILE \
    --data-source-resource=DATA_SOURCE_RESOURCE
```
Replace the following variables:

- `DATASCAN`: The name of the data quality scan.
- `LOCATION`: The Google Cloud region in which to create the data quality scan.
- `DATA_QUALITY_SPEC_FILE`: The path to the JSON or YAML file containing the specifications for the data quality scan. The file can be a local file or a Cloud Storage path with the prefix `gs://`. Use this file to specify the data quality rules for the scan. You can also specify additional details in this file, such as filters, sampling percent, and post-scan actions like exporting to BigQuery or sending email notification reports. See the documentation for the JSON representation and the example YAML representation.
- `DATA_SOURCE_ENTITY`: The Dataplex Universal Catalog entity that contains the data for the data quality scan. For example, `projects/test-project/locations/test-location/lakes/test-lake/zones/test-zone/entities/test-entity`.
- `DATA_SOURCE_RESOURCE`: The name of the resource that contains the data for the data quality scan. For example, `//bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table`.
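As a sketch, a minimal `DATA_QUALITY_SPEC_FILE` in YAML might look like the following. The project, dataset, table, and column names are hypothetical, and the field names follow the `DataQualitySpec` JSON representation; verify them against the API reference before use:

```yaml
# Hypothetical data quality spec; all identifiers are placeholders.
samplingPercent: 10.0
rowFilter: "col1 >= 0"
rules:
  - column: id
    dimension: UNIQUENESS
    uniquenessExpectation: {}
  - column: id
    dimension: COMPLETENESS
    threshold: 1.0
    nonNullExpectation: {}
  - column: amount
    dimension: VALIDITY
    threshold: 0.95
    rowConditionExpectation:
      sqlExpression: "amount BETWEEN 0 AND 10000"
postScanActions:
  bigqueryExport:
    resultsTable: "//bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/dq_results"
```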
C#

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.