Profile your data
This document explains how to use data profile scans to better understand your data. BigQuery uses Dataplex Universal Catalog to analyze the statistical characteristics of your data, such as average values, unique values, and maximum values. Dataplex Universal Catalog also uses this information to recommend rules for data quality checks.
For more information about data profiling, see About data profiling.
Before you begin
Enable the Dataplex API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Required roles
To get the permissions that you need to create and manage data profile scans, ask your administrator to grant you the following IAM roles on the relevant resource, such as the project or table:
- To create, run, update, and delete data profile scans: Dataplex DataScan Editor (roles/dataplex.dataScanEditor) role on the project containing the data scan.
- To allow Dataplex Universal Catalog to run data profile scans against BigQuery data, grant the following roles to the Dataplex Universal Catalog service account:
  - BigQuery Job User (roles/bigquery.jobUser) role on the project running the scan.
  - BigQuery Data Viewer (roles/bigquery.dataViewer) role on the tables being scanned.
- To run data profile scans for BigQuery external tables that use Cloud Storage data: grant the Dataplex Universal Catalog service account the Storage Object Viewer (roles/storage.objectViewer) and Storage Legacy Bucket Reader (roles/storage.legacyBucketReader) roles on the Cloud Storage bucket.
- To view data profile scan results, jobs, and history: Dataplex DataScan Viewer (roles/dataplex.dataScanViewer) role on the project containing the data scan.
- To export data profile scan results to a BigQuery table: BigQuery Data Editor (roles/bigquery.dataEditor) role on the table.
- To publish data profile scan results to Dataplex Universal Catalog: Dataplex Catalog Editor (roles/dataplex.catalogEditor) role on the @bigquery entry group.
- To view published data profile scan results in BigQuery on the Data profile tab: BigQuery Data Viewer (roles/bigquery.dataViewer) role on the table.
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Required permissions
If you use custom roles, you need to grant the following IAM permissions:
- To create, run, update, and delete data profile scans:
  - dataplex.datascans.create on project: Create a DataScan
  - dataplex.datascans.update on data scan: Update the description of a DataScan
  - dataplex.datascans.delete on data scan: Delete a DataScan
  - dataplex.datascans.run on data scan: Run a DataScan
  - dataplex.datascans.get on data scan: View DataScan details excluding results
  - dataplex.datascans.list on project: List DataScans
  - dataplex.dataScanJobs.get on data scan job: Read DataScan job resources
  - dataplex.dataScanJobs.list on data scan: List DataScan job resources in a project
- To allow Dataplex Universal Catalog to run data profile scans against BigQuery data:
  - bigquery.jobs.create on project: Run jobs
  - bigquery.tables.get on table: Get table metadata
  - bigquery.tables.getData on table: Get table data
- To run data profile scans for BigQuery external tables that use Cloud Storage data:
  - storage.buckets.get on bucket: Read bucket metadata
  - storage.objects.get on object: Read object data
- To view data profile scan results, jobs, and history:
  - dataplex.datascans.getData on data scan: View DataScan details including results
  - dataplex.datascans.list on project: List DataScans
  - dataplex.dataScanJobs.get on data scan job: Read DataScan job resources
  - dataplex.dataScanJobs.list on data scan: List DataScan job resources in a project
- To export data profile scan results to a BigQuery table:
  - bigquery.tables.create on dataset: Create tables
  - bigquery.tables.updateData on table: Write data to tables
- To publish data profile scan results to Dataplex Universal Catalog:
  - dataplex.entryGroups.useDataProfileAspect on entry group: Allows Dataplex Universal Catalog data profile scans to save their results to Dataplex Universal Catalog
  - Additionally, you need one of the following permissions:
    - bigquery.tables.update on table: Update table metadata
    - dataplex.entries.update on entry: Update entries
- To view published data profile results for a table in BigQuery or Dataplex Universal Catalog:
  - bigquery.tables.get on table: Get table metadata
  - bigquery.tables.getData on table: Get table data
If a table uses BigQuery row-level security, then Dataplex Universal Catalog can only scan rows visible to the Dataplex Universal Catalog service account. To allow Dataplex Universal Catalog to scan all rows, add its service account to a row filter where the predicate is TRUE.
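As a rough illustration of the row filter described above, the following Python sketch renders a GoogleSQL CREATE ROW ACCESS POLICY statement that grants the service account a filter of TRUE. The policy name, table path, and project number are hypothetical placeholders, and the service account address follows the service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com format given later on this page. The sketch only builds the statement text; it does not run it.

```python
def dataplex_service_account(project_number: str) -> str:
    """Return the service account email in the format this page describes."""
    return f"service-{project_number}@gcp-sa-dataplex.iam.gserviceaccount.com"


def row_access_policy_ddl(policy: str, table: str, project_number: str) -> str:
    """Render DDL for a row access policy whose predicate is TRUE.

    `policy` and `table` are hypothetical names supplied by the caller.
    """
    member = f"serviceAccount:{dataplex_service_account(project_number)}"
    return (
        f"CREATE ROW ACCESS POLICY {policy}\n"
        f"ON `{table}`\n"
        f'GRANT TO ("{member}")\n'
        "FILTER USING (TRUE);"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(
        row_access_policy_ddl(
            "dataplex_all_rows",
            "my_project.my_dataset.my_table",
            "123456789012",
        )
    )
```

You could paste the printed statement into the BigQuery query editor after replacing the placeholders. Review it against your existing row access policies first, because adding a policy affects which rows other principals can see.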
If a table uses BigQuery column-level security, then Dataplex Universal Catalog requires access to scan protected columns. To grant access, give the Dataplex Universal Catalog service account the Data Catalog Fine-Grained Reader (roles/datacatalog.fineGrainedReader) role on all policy tags used in the table. The user creating or updating a data scan also needs permissions on protected columns.
Grant roles to the Dataplex Universal Catalog service account
To run data profile scans, Dataplex Universal Catalog uses a service account that requires permissions to run BigQuery jobs and read BigQuery table data. To grant the required roles, follow these steps:
- Get the Dataplex Universal Catalog service account email address. If you haven't created a data profile or data quality scan in this project before, run the following gcloud command to generate the service identity:
    gcloud beta services identity create --service=dataplex.googleapis.com
  The command returns the service account email, which has the following format: service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com.
  If the service account already exists, you can find its email by viewing principals with the Dataplex name on the IAM page in the Google Cloud console.
- Grant the service account the BigQuery Job User (roles/bigquery.jobUser) role on your project. This role lets the service account run BigQuery jobs for the scan.
    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
      --role="roles/bigquery.jobUser"
  Replace the following:
  - PROJECT_ID: your Google Cloud project ID.
  - service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com: the email of the Dataplex Universal Catalog service account.
- Grant the service account the BigQuery Data Viewer (roles/bigquery.dataViewer) role for each table that you want to profile. This role grants read-only access to the tables.
    gcloud bigquery tables add-iam-policy-binding DATASET_ID.TABLE_ID \
      --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
      --role="roles/bigquery.dataViewer"
  Replace the following:
  - DATASET_ID: the ID of the dataset containing the table.
  - TABLE_ID: the ID of the table to profile.
  - service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com: the email of the Dataplex Universal Catalog service account.

Create a data profile scan
Console
- In the Google Cloud console, on the BigQuery Metadata curation page, go to the Data profiling & quality tab.
- Click Create data profile scan.
- Optional: Enter a Display name.
- Enter an ID. See the Resource naming conventions.
- Optional: Enter a Description.
- In the Table field, click Browse. Choose the table to scan, and then click Select.
  For tables in multi-region datasets, choose the region in which to create the data scan.
  To browse the tables organized within Dataplex Universal Catalog lakes, click Browse within Dataplex Lakes.
- In the Scope field, choose Incremental or Entire data.
  - If you choose Incremental data, in the Timestamp column field, select a column of type DATE or TIMESTAMP from your BigQuery table that increases as new records are added, and that can be used to identify new records. For tables partitioned on a column of type DATE or TIMESTAMP, we recommend using the partition column as the timestamp field.
- Optional: To filter your data, do any of the following:
  - To filter by rows, select the Filter rows checkbox. Enter a valid SQL expression that can be used in a WHERE clause in GoogleSQL syntax. For example: col1 >= 0.
    The filter can be a combination of SQL conditions over multiple columns. For example: col1 >= 0 AND col2 < 10.
  - To filter by columns, select the Filter columns checkbox.
    - To include columns in the profile scan, in the Include columns field, click Browse. Select the columns to include, and then click Select.
    - To exclude columns from the profile scan, in the Exclude columns field, click Browse. Select the columns to exclude, and then click Select.
- To apply sampling to your data profile scan, in the Sampling size list, select a sampling percentage. Choose a value between 0.0% and 100.0% with up to 3 decimal digits.
  - For larger datasets, choose a lower sampling percentage. For example, for a 1 PB table, if you enter a value between 0.1% and 1.0%, the data profile samples between 1 TB and 10 TB of data.
  - There must be at least 100 records in the sampled data to return a result.
  - For incremental data scans, the data profile scan applies sampling to the latest increment.
- Optional: Publish the data profile scan results in the BigQuery and Dataplex Universal Catalog pages in the Google Cloud console for the source table. Select the Publish results to BigQuery and Dataplex Catalog checkbox.
  You can view the latest scan results in the Data profile tab in the BigQuery and Dataplex Universal Catalog pages for the source table. To enable users to access the published scan results, see the Grant access to data profile scan results section of this document.
  The publishing option might not be available in the following cases:
  - You don't have the required permissions on the table.
  - Another data quality scan is set to publish results.
- In the Schedule section, choose one of the following options:
  - Repeat: Run the data profile scan on a schedule: hourly, daily, weekly, monthly, or custom. Specify how often the scan should run and at what time. If you choose custom, use cron format to specify the schedule.
  - On-demand: Run the data profile scan on demand.
- Click Continue.
- Optional: Export the scan results to a BigQuery standard table. In the Export scan results to BigQuery table section, do the following:
  - In the Select BigQuery dataset field, click Browse. Select a BigQuery dataset to store the data profile scan results.
  - In the BigQuery table field, specify the table to store the data profile scan results. If you're using an existing table, make sure that it is compatible with the export table schema. If the specified table doesn't exist, Dataplex Universal Catalog creates it for you.
- Optional: Add labels. Labels are key-value pairs that let you group related objects together or with other Google Cloud resources.
- To create the scan, click Create.
  If you set the schedule to on-demand, you can also run the scan now by clicking Run scan.
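The sampling guidance in the console steps above comes down to simple arithmetic. The following Python sketch is one way to check a candidate sampling percentage before entering it. The function names are hypothetical, the sizes use decimal units (1 PB = 1,000 TB), and the record check is only an expected value based on the percentage, not a guarantee of what a scan actually samples.

```python
from decimal import Decimal

TB = 10**12  # bytes, decimal units
PB = 10**15  # bytes, decimal units


def validate_sampling_percent(value: str) -> Decimal:
    """Accept 0.0 to 100.0 with at most 3 decimal digits."""
    percent = Decimal(value)
    if not (Decimal("0") <= percent <= Decimal("100")):
        raise ValueError("sampling percentage must be between 0.0 and 100.0")
    # normalize() drops trailing zeros so "0.1000" counts as one decimal digit.
    if -percent.normalize().as_tuple().exponent > 3:
        raise ValueError("sampling percentage allows at most 3 decimal digits")
    return percent


def estimated_sampled_bytes(table_bytes: int, percent: Decimal) -> Decimal:
    """Estimate how many bytes a given sampling percentage covers."""
    return Decimal(table_bytes) * percent / Decimal(100)


def expects_enough_records(row_count: int, percent: Decimal, minimum: int = 100) -> bool:
    """Check whether the expected sample reaches the 100-record minimum."""
    return Decimal(row_count) * percent / Decimal(100) >= minimum


if __name__ == "__main__":
    for text in ("0.1", "1.0"):
        pct = validate_sampling_percent(text)
        print(text, "% of 1 PB is about", estimated_sampled_bytes(PB, pct) / TB, "TB")
```

For small tables the record check matters more than the byte estimate: a 5,000-row table sampled at 1% yields an expected 50 records, which falls short of the 100-record minimum.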
gcloud
To create a data profile scan, use the gcloud dataplex datascans create data-profile command.
If the source data is organized in a Dataplex Universal Catalog lake, include the --data-source-entity flag:
    gcloud dataplex datascans create data-profile DATASCAN \
      --location=LOCATION \
      --data-source-entity=DATA_SOURCE_ENTITY
If the source data isn't organized in a Dataplex Universal Catalog lake, include the --data-source-resource flag:
    gcloud dataplex datascans create data-profile DATASCAN \
      --location=LOCATION \
      --data-source-resource=DATA_SOURCE_RESOURCE
Replace the following variables:
- DATASCAN: The name of the data profile scan.
- LOCATION: The Google Cloud region in which to create the data profile scan.
- DATA_SOURCE_ENTITY: The Dataplex Universal Catalog entity that contains the data for the data profile scan. For example, projects/test-project/locations/test-location/lakes/test-lake/zones/test-zone/entities/test-entity.
- DATA_SOURCE_RESOURCE: The name of the resource that contains the data for the data profile scan. For example, //bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table.
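If you script scan creation, a small helper can assemble the DATA_SOURCE_RESOURCE value and the command arguments shown above. The following Python sketch is a minimal example under stated assumptions: the helper names are hypothetical, the scan ID and region are placeholders, and it uses only the --location and --data-source-resource flags from this section. It prints the command instead of running it.

```python
import shlex


def bigquery_table_resource(project: str, dataset: str, table: str) -> str:
    """Build the resource name in the format shown for DATA_SOURCE_RESOURCE."""
    return (
        f"//bigquery.googleapis.com/projects/{project}"
        f"/datasets/{dataset}/tables/{table}"
    )


def create_data_profile_command(datascan: str, location: str, resource: str) -> list[str]:
    """Return the gcloud argument list for a table outside a lake."""
    return [
        "gcloud", "dataplex", "datascans", "create", "data-profile", datascan,
        f"--location={location}",
        f"--data-source-resource={resource}",
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    resource = bigquery_table_resource("test-project", "test-dataset", "test-table")
    command = create_data_profile_command("my-profile-scan", "us-central1", resource)
    print(shlex.join(command))
```

Keeping the command as a list of arguments means it can be passed directly to subprocess.run(command, check=True) on a machine where gcloud is installed and authenticated, without shell quoting concerns.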
C#
Before trying this sample, follow the C# setup instructions in the Dataplex Universal Catalog quickstart using client libraries. For more information, see the Dataplex Universal Catalog C# API reference documentation.
To authenticate to Dataplex Universal Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.