Dataplex Universal Catalog lets you identify common statistical characteristics (common values, data distribution, null counts) of the columns in your BigQuery tables. This information helps you to understand and analyze your data more effectively.
For more information about Dataplex Universal Catalog data profile scans, see About data profiling.
Before you begin
Enable the Dataplex API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Required roles and permissions
This section describes the IAM roles and permissions needed to use Dataplex Universal Catalog data profile scans.
User roles and permissions
To get the permissions that you need to create and manage data profile scans, ask your administrator to grant you the following IAM roles:
- Create, run, update, and delete data profile scans: Dataplex DataScan Editor (roles/dataplex.dataScanEditor) on the project containing the data scan
- View data profile scan results, jobs, and history: Dataplex DataScan Viewer (roles/dataplex.dataScanViewer) on the project containing the data scan
- Publish data profile scan results to Dataplex Universal Catalog: Dataplex Catalog Editor (roles/dataplex.catalogEditor) on the @bigquery entry group
- View published data profile scan results in BigQuery on the Data profile tab: BigQuery Data Viewer (roles/bigquery.dataViewer) on the table
For more information about granting roles, see Manage access to projects, folders, and organizations .
These predefined roles contain the permissions required to create and manage data profile scans. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create and manage data profile scans:
- Create, run, update, and delete data profile scans:
  - dataplex.datascans.create on project
  - dataplex.datascans.update on data scan
  - dataplex.datascans.delete on data scan
  - dataplex.datascans.run on data scan
  - dataplex.datascans.get on data scan
  - dataplex.datascans.list on project
  - dataplex.dataScanJobs.get on data scan job
  - dataplex.dataScanJobs.list on data scan
- View data profile scan results, jobs, and history:
  - dataplex.datascans.getData on data scan
  - dataplex.datascans.list on project
  - dataplex.dataScanJobs.get on data scan job
  - dataplex.dataScanJobs.list on data scan
- Publish data profile scan results to Dataplex Universal Catalog:
  - dataplex.entryGroups.useDataProfileAspect on entry group
  - bigquery.tables.update on table
  - dataplex.entries.update on entry
- View published data profile results for a table in BigQuery or Dataplex Universal Catalog:
  - bigquery.tables.get on table
  - bigquery.tables.getData on table
You might also be able to get these permissions with custom roles or other predefined roles.
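As a sketch of how an administrator might grant one of these roles with gcloud, the following assembles the grant command as a string rather than running it; the project ID and user email are hypothetical placeholders, and in practice you would run the printed command against your own project.

```shell
# Sketch only: builds the gcloud command that grants the Dataplex DataScan
# Editor role to a user. PROJECT_ID and USER_EMAIL are hypothetical
# placeholders, not real resources.
PROJECT_ID="my-project"
USER_EMAIL="analyst@example.com"
GRANT_CMD="gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member=user:${USER_EMAIL} \
  --role=roles/dataplex.dataScanEditor"
echo "${GRANT_CMD}"
```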
Dataplex Universal Catalog service account roles and permissions
To ensure that the Dataplex Universal Catalog service account has the necessary permissions to run data profile scans and export results, ask your administrator to grant the following IAM roles to the Dataplex Universal Catalog service account:
- Run data profile scans against BigQuery data:
  - BigQuery Job User (roles/bigquery.jobUser) on the project running the scan
  - BigQuery Data Viewer (roles/bigquery.dataViewer) on the tables being scanned
- Run data profile scans for BigQuery external tables that use Cloud Storage data:
  - Storage Object Viewer (roles/storage.objectViewer) on the Cloud Storage bucket
  - Storage Legacy Bucket Reader (roles/storage.legacyBucketReader) on the Cloud Storage bucket
- Export data profile scan results to a BigQuery table: BigQuery Data Editor (roles/bigquery.dataEditor) on the table
For more information about granting roles, see Manage access to projects, folders, and organizations .
These predefined roles contain the permissions required to run data profile scans and export results. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to run data profile scans and export results:
- Run data profile scans against BigQuery data:
  - bigquery.jobs.create on project
  - bigquery.tables.get on table
  - bigquery.tables.getData on table
- Run data profile scans for BigQuery external tables that use Cloud Storage data:
  - storage.buckets.get on bucket
  - storage.objects.get on object
- Export data profile scan results to a BigQuery table:
  - bigquery.tables.create on dataset
  - bigquery.tables.updateData on table
Your administrator might also be able to give the Dataplex Universal Catalog service account these permissions with custom roles or other predefined roles .
If a table uses BigQuery row-level security, then Dataplex Universal Catalog can only scan the rows that are visible to the Dataplex Universal Catalog service account. To allow Dataplex Universal Catalog to scan all rows, add its service account to a row filter where the predicate is TRUE.
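One way to create such a row filter is a BigQuery row access policy whose TRUE predicate makes every row visible to the service account. The following sketch only assembles the DDL; the dataset, table, policy name, and project number are hypothetical, and in practice you would submit the statement with a tool such as bq query --use_legacy_sql=false.

```shell
# Sketch only: BigQuery DDL for a row access policy that grants the
# Dataplex service agent visibility into all rows. All names below are
# hypothetical placeholders.
SERVICE_AGENT="service-123456789012@gcp-sa-dataplex.iam.gserviceaccount.com"
DDL="CREATE ROW ACCESS POLICY dataplex_scan_all
ON mydataset.mytable
GRANT TO ('serviceAccount:${SERVICE_AGENT}')
FILTER USING (TRUE);"
echo "${DDL}"
```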
If a table uses BigQuery column-level security, then Dataplex Universal Catalog requires access to scan the protected columns. To grant access, give the Dataplex Universal Catalog service account the Data Catalog Fine-Grained Reader (roles/datacatalog.fineGrainedReader) role on all policy tags used in the table. The user who creates or updates a data scan also needs permissions on the protected columns.
Grant roles to the Dataplex Universal Catalog service account
To run data profile scans, Dataplex Universal Catalog uses a service account that requires permissions to run BigQuery jobs and read BigQuery table data. To grant the required roles, follow these steps:
- Get the Dataplex Universal Catalog service account email address. If you haven't created a data profile or data quality scan in this project before, run the following gcloud command to generate the service identity:

  gcloud beta services identity create --service=dataplex.googleapis.com

  The command returns the service account email, which has the following format: service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com.

  If the service account already exists, you can find its email by viewing principals with the Dataplex name on the IAM page in the Google Cloud console.
- Grant the service account the BigQuery Job User (roles/bigquery.jobUser) role on your project. This role lets the service account run BigQuery jobs for the scan.

  gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
      --role="roles/bigquery.jobUser"

  Replace the following:
  - PROJECT_ID: your Google Cloud project ID.
  - service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com: the email of the Dataplex Universal Catalog service account.
- Grant the service account the BigQuery Data Viewer (roles/bigquery.dataViewer) role for each table that you want to profile. This role grants read-only access to the tables.

  gcloud bigquery tables add-iam-policy-binding DATASET_ID.TABLE_ID \
      --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
      --role="roles/bigquery.dataViewer"

  Replace the following:
  - DATASET_ID: the ID of the dataset containing the table.
  - TABLE_ID: the ID of the table to profile.
  - service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com: the email of the Dataplex Universal Catalog service account.
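The service account email used in the steps above follows a fixed format, so it can be derived from your project number. A minimal sketch, assuming a hypothetical project number:

```shell
# Derives the Dataplex service agent email from a project number.
# The project number shown here is a hypothetical placeholder.
PROJECT_NUMBER="123456789012"
SA_EMAIL="service-${PROJECT_NUMBER}@gcp-sa-dataplex.iam.gserviceaccount.com"
echo "${SA_EMAIL}"
```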
Create a data profile scan
Console
- In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.
- Click Create data profile scan.
- Optional: Enter a Display name.
- Enter an ID. See the Resource naming conventions.
- Optional: Enter a Description.
- In the Table field, click Browse. Choose the table to scan, and then click Select.

  For tables in multi-region datasets, choose a region in which to create the data scan.

  To browse the tables organized within Dataplex Universal Catalog lakes, click Browse within Dataplex Lakes.
- In the Scope field, choose Incremental or Entire data.
  - If you choose Incremental data, in the Timestamp column field, select a column of type DATE or TIMESTAMP from your BigQuery table that increases as new records are added and that can be used to identify new records. For tables partitioned on a column of type DATE or TIMESTAMP, we recommend using the partition column as the timestamp field.
- Optional: To filter your data, do any of the following:
  - To filter by rows, select the Filter rows checkbox. Enter a valid SQL expression that can be used in a WHERE clause in GoogleSQL syntax. For example: col1 >= 0. The filter can be a combination of SQL conditions over multiple columns. For example: col1 >= 0 AND col2 < 10.
  - To filter by columns, select the Filter columns checkbox.
    - To include columns in the profile scan, in the Include columns field, click Browse. Select the columns to include, and then click Select.
    - To exclude columns from the profile scan, in the Exclude columns field, click Browse. Select the columns to exclude, and then click Select.
- Optional: To apply sampling to your data profile scan, in the Sampling size list, select a sampling percentage. Choose a percentage value between 0.0% and 100.0%, with up to 3 decimal digits.
  - For larger datasets, choose a lower sampling percentage. For example, for a 1 PB table, if you enter a value between 0.1% and 1.0%, the data profile samples between 1 and 10 TB of data.
  - There must be at least 100 records in the sampled data to return a result.
  - For incremental data scans, the data profile scan applies sampling to the latest increment.
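The sampling arithmetic above can be checked directly, taking 1 PB as 1000 TB and working in tenths of a percent to stay in integer arithmetic:

```shell
# Verifies the sampled-volume estimate for a 1 PB (1000 TB) table at the
# two percentages mentioned above: 0.1% and 1.0%.
TABLE_TB=1000
LOW_PCT_X10=1     # 0.1%, expressed in tenths of a percent
HIGH_PCT_X10=10   # 1.0%, expressed in tenths of a percent
LOW_TB=$(( TABLE_TB * LOW_PCT_X10 / 1000 ))
HIGH_TB=$(( TABLE_TB * HIGH_PCT_X10 / 1000 ))
echo "samples between ${LOW_TB} and ${HIGH_TB} TB"
```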
- Optional: To publish the data profile scan results in the BigQuery and Dataplex Universal Catalog pages in the Google Cloud console for the source table, select the Publish results to Dataplex Catalog checkbox.

  You can view the latest scan results in the Data profile tab in the BigQuery and Dataplex Universal Catalog pages for the source table. To enable users to access the published scan results, see the Grant access to data profile scan results section of this document.

  The publishing option might not be available in the following cases:
  - You don't have the required permissions on the table.
  - Another data profile scan is set to publish results.
- In the Schedule section, choose one of the following options:
  - Repeat: Run the data profile scan on a schedule: hourly, daily, weekly, monthly, or custom. Specify how often the scan should run and at what time. If you choose custom, use cron format to specify the schedule.
  - On-demand: Run the data profile scan on demand.
  - One-time: Run the data profile scan once now, and remove the scan after the time-to-live period.
    - Time to live: The time-to-live value defines how long a data profile scan remains active after it runs. A data profile scan without a specified time-to-live is automatically removed after 24 hours. The time-to-live can range from 0 seconds (immediate deletion) to 365 days.
- Click Continue.
- Optional: Export the scan results to a BigQuery standard table. In the Export scan results to BigQuery table section, do the following:
  - In the Select BigQuery dataset field, click Browse. Select a BigQuery dataset to store the data profile scan results.
  - In the BigQuery table field, specify the table to store the data profile scan results. If you're using an existing table, make sure that it is compatible with the export table schema. If the specified table doesn't exist, Dataplex Universal Catalog creates it for you.
- Optional: Add labels. Labels are key-value pairs that let you group related objects together or with other Google Cloud resources.
- To create the scan, click Create.

  If you set the schedule to on-demand, you can also run the scan now by clicking Run scan.
gcloud
To create a data profile scan, use the gcloud dataplex datascans create data-profile command.
If the source data is organized in a Dataplex Universal Catalog lake, include the --data-source-entity flag:

gcloud dataplex datascans create data-profile DATASCAN \
    --location=LOCATION \
    --data-source-entity=DATA_SOURCE_ENTITY
If the source data isn't organized in a Dataplex Universal Catalog lake, include the --data-source-resource flag:

gcloud dataplex datascans create data-profile DATASCAN \
    --location=LOCATION \
    --data-source-resource=DATA_SOURCE_RESOURCE
Replace the following variables:
- DATASCAN: The name of the data profile scan.
- LOCATION: The Google Cloud region in which to create the data profile scan.
- DATA_SOURCE_ENTITY: The Dataplex Universal Catalog entity that contains the data for the data profile scan. For example, projects/test-project/locations/test-location/lakes/test-lake/zones/test-zone/entities/test-entity.
- DATA_SOURCE_RESOURCE: The name of the resource that contains the data for the data profile scan. For example, //bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table.
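A filled-in version of the command might look like the following sketch. The scan name and location are hypothetical placeholders, and the command is assembled as a string rather than executed so that no real project is required:

```shell
# Sketch only: a filled-in example of the create command. DATASCAN and
# LOCATION are hypothetical; the resource path reuses the example above.
DATASCAN="orders-profile-scan"
LOCATION="us-central1"
DATA_SOURCE_RESOURCE="//bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table"
CREATE_CMD="gcloud dataplex datascans create data-profile ${DATASCAN} \
  --location=${LOCATION} \
  --data-source-resource=${DATA_SOURCE_RESOURCE}"
echo "${CREATE_CMD}"
```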
C#
Before trying this sample, follow the C# setup instructions in the Dataplex Universal Catalog quickstart using client libraries . For more information, see the Dataplex Universal Catalog C# API reference documentation .
To authenticate to Dataplex Universal Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment .

