Profile and ensure quality in data

Build a policy-as-code data quality workflow. This tutorial covers how to move beyond manual, UI-driven processes by defining data quality expectations in declarative, version-controlled files.

Using a Human-in-the-Loop approach, where AI drafts the initial rules and you review, refine, and validate them, you can quickly translate profile statistics into a data quality framework.

Objectives

  • Flatten nested BigQuery data with materialized views to enable Knowledge Catalog profiling.
  • Run Knowledge Catalog profile scans using the Python client library.
  • Use the Gemini CLI to generate data quality rules based on profile statistics.
  • Validate and deploy AI-generated rules as Knowledge Catalog quality scans using a Human-in-the-Loop review process.

Before you begin

Before you begin, make sure you have a Google Cloud project with billing enabled.

Prepare your environment

The following steps use Cloud Shell, a command-line environment running in the cloud.

  1. In the Google Cloud console, click Activate Cloud Shell in the top-right toolbar. The environment takes a few moments to provision and connect.

  2. In Cloud Shell, set up your project ID and environment variables:

     export PROJECT_ID=$(gcloud config get-value project)
     gcloud config set project $PROJECT_ID
     export LOCATION="us-central1"
     export BQ_LOCATION="us"
     export DATASET_ID="dataplex_dq_codelab"
     export TABLE_ID="ga4_transactions"

    Use us (multi-region) as the BigQuery location, because the public sample data is also stored in the us multi-region. For BigQuery queries, the source data and the destination table must be in the same location.

  3. Enable the required services:

     gcloud services enable dataplex.googleapis.com \
       bigquery.googleapis.com \
       serviceusage.googleapis.com
    
  4. Create a BigQuery dataset to store sample data and results:

     bq --location=$BQ_LOCATION mk --dataset $PROJECT_ID:$DATASET_ID
    
  5. Prepare the sample data, which comes from a public ecommerce dataset from the Google Merchandise Store.

    The following bq command creates a new table, ga4_transactions, in your dataplex_dq_codelab dataset. To ensure scans run quickly, it copies data from a single day (2021-01-31) only.

     bq query \
       --use_legacy_sql=false \
       --destination_table=$PROJECT_ID:$DATASET_ID.$TABLE_ID \
       --replace=true \
       'SELECT * FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`'
    
  6. Clone the GitHub repository that contains the folder structure and supporting files for this tutorial:

     # Perform a shallow clone to get only the latest repository structure without the full history
     git clone --depth 1 --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/devrel-demos.git
     cd devrel-demos

     # Specify and download only the folder we need for this lab
     git sparse-checkout set data-analytics/programmatic-dq
     cd data-analytics/programmatic-dq
    

    This directory is your active working area.

Profile nested data

With data profiling, Knowledge Catalog computes statistics for top-level columns, such as null percentages, uniqueness, and value distributions, to help you understand your data.

To get statistics for nested fields, you can flatten the data using a set of materialized views. Flattening turns each nested field into a top-level column that Knowledge Catalog can profile.

Get the nested schema

Get the full schema of your source table, including all nested structures, and save the output as a JSON file:

bq show --schema --format=json $PROJECT_ID:$DATASET_ID.$TABLE_ID > bq_schema.json

View the schema:

jq . < bq_schema.json

The bq_schema.json file reveals complex structures, including nested RECORD and REPEATED fields that a profile scan can't reach until you flatten them.

Flatten data with a materialized view

When you flatten nested data, it's important not to unnest multiple independent arrays in the same view. Doing so performs an implicit cross join (Cartesian product) between the arrays, which multiplies rows incorrectly and corrupts your data. For example, an event with three event_params entries and two items entries would expand to six rows if both arrays were unnested together.

It's best to create multiple views instead, each built for a specific purpose. Each view should keep a single, clear level of detail. In this step, you create the following materialized views:

  • Session flat view (mv_ga4_user_session_flat.sql): one row per event.
  • Transactions view (mv_ga4_ecommerce_transactions.sql): one row per transaction.
  • Items view (mv_ga4_ecommerce_items.sql): one row per item.

The project repository provides three SQL files in the devrel-demos/data-analytics/programmatic-dq directory that define these views.
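
For illustration, the following is a minimal sketch of the flattening pattern such a view might use, expressed through the google-cloud-bigquery client to keep the example self-contained. The view name and column choices are hypothetical; the repository's SQL files are the authoritative definitions.

    # Hypothetical sketch of the flattening pattern the repo's SQL files apply:
    # promote nested STRUCT fields to top-level columns so Knowledge Catalog
    # can profile them. The real view definitions live in the cloned directory.
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = f"""
    CREATE MATERIALIZED VIEW IF NOT EXISTS
      `{client.project}.dataplex_dq_codelab.mv_example_flat` AS
    SELECT
      event_date,
      event_name,
      device.category AS device_category,  -- nested field becomes a flat column
      geo.country     AS geo_country
    FROM `{client.project}.dataplex_dq_codelab.ga4_transactions`
    """
    client.query(sql).result()  # wait for the DDL job to finish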

Run these files from Cloud Shell using the following commands. envsubst substitutes your exported environment variables into each SQL file before piping it to bq query.

envsubst < mv_ga4_user_session_flat.sql | bq query --use_legacy_sql=false
envsubst < mv_ga4_ecommerce_transactions.sql | bq query --use_legacy_sql=false
envsubst < mv_ga4_ecommerce_items.sql | bq query --use_legacy_sql=false

Run profile scans with the Python client

You can now create and run Knowledge Catalog data profile scans for each materialized view. The following Python script uses the google-cloud-dataplex client library to automate this process.

Before you run the script, create an isolated Python virtual environment in your project directory.

# Create the virtual environment
python3 -m venv dq_venv

# Activate the environment
source dq_venv/bin/activate

Install the Knowledge Catalog client library inside the virtual environment.

# Install the Dataplex client library
pip install google-cloud-dataplex

Now that you've set up the environment and installed the library, you're ready to use the 1_run_dataplex_scans.py script. This script profiles your three materialized views by creating and running a scan for each one. When it finishes, it outputs a rich statistical summary that you use in the next step to generate AI-powered data quality rules.

Run the script from your Cloud Shell terminal.

python3 1_run_dataplex_scans.py
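
For reference, the script's core logic resembles the following minimal sketch, which creates and runs one profile scan per view with the client library. The repository's 1_run_dataplex_scans.py is the authoritative version; the project ID here is a placeholder.

    # Minimal sketch: create and run a Knowledge Catalog profile scan for each
    # materialized view. The repo's 1_run_dataplex_scans.py is authoritative.
    from google.cloud import dataplex_v1

    PROJECT_ID = "your-project-id"  # placeholder; the real script reads your env
    LOCATION = "us-central1"
    DATASET_ID = "dataplex_dq_codelab"

    client = dataplex_v1.DataScanServiceClient()
    parent = f"projects/{PROJECT_ID}/locations/{LOCATION}"

    for view in ("mv_ga4_user_session_flat",
                 "mv_ga4_ecommerce_transactions",
                 "mv_ga4_ecommerce_items"):
        scan = dataplex_v1.DataScan()
        scan.data.resource = (f"//bigquery.googleapis.com/projects/{PROJECT_ID}"
                              f"/datasets/{DATASET_ID}/tables/{view}")
        scan.data_profile_spec = dataplex_v1.DataProfileSpec()
        scan_id = "profile-scan-" + view.replace("_", "-")
        # create_data_scan returns a long-running operation; wait for it
        client.create_data_scan(parent=parent, data_scan=scan,
                                data_scan_id=scan_id).result()
        # trigger an on-demand run of the newly created scan
        client.run_data_scan(name=f"{parent}/dataScans/{scan_id}")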

Check your profile scans

You can check out the new profile scans in the Google Cloud console.

  1. In the navigation menu, go to Knowledge Catalog and select Data profiling & quality in the Govern section.
  2. Find your three profile scans listed, along with their latest job status. Click a scan to explore its detailed results.

Export profile results to JSON

In order for Gemini to read your profile scans, you need to extract their contents into a local file.

Use the 2_dq_profile_save.py script to find the latest successful scan for the mv_ga4_user_session_flat view, download the profile data, and save it to a file named dq_profile_results.json.

python3 2_dq_profile_save.py

When the script finishes, it creates a dq_profile_results.json file in the directory. This file holds the detailed statistical metadata you need to generate data quality rules. Take a look at its contents by running the following command:

cat dq_profile_results.json
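
For reference, the export logic resembles the following sketch: list the scan's jobs, fetch the first successful one with the FULL view (which includes the profile statistics), and serialize it to JSON. The repository's 2_dq_profile_save.py is the authoritative version.

    # Minimal sketch: save the latest successful profile result as JSON.
    # The repo's 2_dq_profile_save.py is the authoritative version.
    from google.cloud import dataplex_v1

    PROJECT_ID = "your-project-id"  # placeholder
    scan_name = (f"projects/{PROJECT_ID}/locations/us-central1"
                 "/dataScans/profile-scan-mv-ga4-user-session-flat")

    client = dataplex_v1.DataScanServiceClient()
    for job in client.list_data_scan_jobs(parent=scan_name):
        if job.state == dataplex_v1.DataScanJob.State.SUCCEEDED:
            # request the FULL view, which includes the profile statistics
            full_job = client.get_data_scan_job(
                request=dataplex_v1.GetDataScanJobRequest(
                    name=job.name,
                    view=dataplex_v1.GetDataScanJobRequest.DataScanJobView.FULL,
                )
            )
            with open("dq_profile_results.json", "w") as f:
                f.write(dataplex_v1.DataScanJob.to_json(full_job))
            break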

Generate data quality rules with the Gemini CLI

Now you can use the Gemini CLI to read the local profile scan results.

Manually writing data quality rules for complex datasets is time-consuming and error-prone. Generative AI accelerates this workflow by generating a comprehensive initial data quality configuration in seconds. This helps you pivot from manual task execution to high-level oversight.

To start the Gemini CLI, use the following command:

 gemini 

Now you're ready to generate quality rules. Because the CLI can read files in your current directory, it can use your new profile scan data directly.

Prompt Gemini to create a plan

Ask Gemini to act as an expert analyst and propose a plan for creating your data quality rules. Tell Gemini not to write the YAML file yet so it focuses on analysis. Gemini analyzes the JSON file and returns a structured plan.

You are an expert Google Cloud Dataplex engineer.
Your first task is to create a plan. I have a file in the current directory named ./dq_profile_results.json.
Based on the statistical data within that file, propose a step-by-step plan to create a Dataplex data quality rules file.
Your plan should identify which specific columns are good candidates for rules like nonNullExpectation, setExpectation, or rangeExpectation, and explain why based on the metrics (for example, "Plan to create a nonNullExpectation for column X because its null percentage is 0%").
Do not write the YAML file yet. Just provide the plan.

Generate data quality rules

Gemini's plan relies entirely on statistical patterns and lacks your specific business knowledge.

Review the plan and ask yourself the following questions:

  • Does it align with your business goals and context?
  • Are any statistically sound rules actually impractical (like a strict rowCount for a growing table)?

Refine the plan with Gemini, or approve it as-is. The following example prompt starts with sample feedback lines (choose or adapt one), then instructs Gemini to generate the dq_rules.yaml file in your working directory and to conform to the DataQualityRule specification, because Knowledge Catalog requires a precise YAML structure. This helps prevent syntax errors and the use of outdated schema versions.

- "The plan looks good. Please proceed."
- "The rowCount rule is not necessary, as the table size changes daily. The rest of the plan is approved. Please proceed."
- "For the setExpectation on the geo_continent column, please also include 'Antarctica'."

Once you have incorporated my feedback, please generate the `dq_rules.yaml` file.

You must adhere to the following strict requirements:

- Schema Compliance: The YAML structure must strictly conform to the DataQualityRule specification. For a definitive source of truth, you must refer to the sample_rule.yaml file in the current directory and the DataQualityRule class definition. Search for the `data_quality.py` file inside the `./dq_venv/lib/` directory to read this class definition.
- Data-Driven Values: All rule parameters, such as thresholds or expected values, must be derived directly from the statistical metrics in dq_profile_results.json.
- Rule Justification: For each rule, add a comment (#) on the line above explaining the justification, as you outlined in your plan.
- Output Purity: The final output must only be the raw YAML code block, perfectly formatted and ready for immediate deployment.
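
After Gemini writes dq_rules.yaml and you exit the CLI (next section), you can optionally sanity-check the file locally before deploying it. The following sketch assumes pyyaml is installed and loads the YAML into the client library's DataQualitySpec message, which raises an error on fields that don't match the schema:

    # Optional Human-in-the-Loop sanity check (a sketch): confirm that the
    # AI-generated rules file parses into the DataQualitySpec schema.
    import json

    import yaml  # assumption: pyyaml is installed (pip install pyyaml)
    from google.cloud import dataplex_v1

    with open("dq_rules.yaml") as f:
        spec_dict = yaml.safe_load(f)

    # from_json raises if the file contains fields the schema doesn't define
    spec = dataplex_v1.DataQualitySpec.from_json(json.dumps(spec_dict))
    print(f"OK: parsed {len(spec.rules)} rules")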

Create and run a data quality scan

You now have an agent-generated set of data quality rules that you can register and deploy as a scan.

  1. Exit the Gemini CLI by entering /quit or pressing Ctrl+C twice.

  2. Then, create a data scan in Knowledge Catalog:

      export DQ_SCAN="dq-scan"

      gcloud dataplex datascans create data-quality $DQ_SCAN \
        --project=$PROJECT_ID \
        --location=$LOCATION \
        --data-quality-spec-file=dq_rules.yaml \
        --data-source-resource="//bigquery.googleapis.com/projects/$PROJECT_ID/datasets/$DATASET_ID/tables/mv_ga4_user_session_flat"
    
  3. Finally, run the scan:

     gcloud dataplex datascans run $DQ_SCAN \
       --location=$LOCATION \
       --project=$PROJECT_ID
    

    This command triggers a run of the dq-scan data quality scan that you created in the previous step.

  4. Check your scan's progress in the Knowledge Catalog section of the Google Cloud console.

    1. In the navigation menu, go to Knowledge Catalog and select Data profiling & quality in the Govern section.
    2. Find dq-scan in the list. When the scan completes, click it to see the results. Alternatively, inspect the job from the command line with the Python sketch that follows.
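
    The following sketch shows that command-line alternative: fetch the scan's latest successful job with the client library and print per-rule results. The project ID is a placeholder.

      # Sketch: inspect the dq-scan results from Python instead of the console.
      from google.cloud import dataplex_v1

      PROJECT_ID = "your-project-id"  # placeholder
      scan_name = f"projects/{PROJECT_ID}/locations/us-central1/dataScans/dq-scan"

      client = dataplex_v1.DataScanServiceClient()
      for job in client.list_data_scan_jobs(parent=scan_name):
          full_job = client.get_data_scan_job(
              request=dataplex_v1.GetDataScanJobRequest(
                  name=job.name,
                  view=dataplex_v1.GetDataScanJobRequest.DataScanJobView.FULL,
              )
          )
          if full_job.state == dataplex_v1.DataScanJob.State.SUCCEEDED:
              result = full_job.data_quality_result
              print("Overall passed:", result.passed)
              for rule_result in result.rules:
                  print(f"  {rule_result.rule.column}: passed={rule_result.passed}")
              break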

Clean up

To avoid recurring billing charges for the resources you created in this tutorial, delete them.

Delete the Knowledge Catalog scans

Delete your profile and quality scans using the specific scan names from this tutorial:

# Delete the Data Quality Scan
gcloud dataplex datascans delete dq-scan \
  --location=us-central1 \
  --project=$PROJECT_ID --quiet

# Delete the Data Profile Scans
gcloud dataplex datascans delete profile-scan-mv-ga4-user-session-flat \
  --location=us-central1 \
  --project=$PROJECT_ID --quiet

gcloud dataplex datascans delete profile-scan-mv-ga4-ecommerce-transactions \
  --location=us-central1 \
  --project=$PROJECT_ID --quiet

gcloud dataplex datascans delete profile-scan-mv-ga4-ecommerce-items \
  --location=us-central1 \
  --project=$PROJECT_ID --quiet

Delete the sample dataset

Delete your temporary BigQuery dataset and its tables.

bq rm -r -f --dataset $PROJECT_ID:dataplex_dq_codelab

Delete local files

Deactivate the Python virtual environment and remove the cloned repository and its contents:

deactivate
cd ../../..
rm -rf devrel-demos

Conclusion

Congratulations, you just built an end-to-end, programmatic data governance workflow.

By pairing Gemini with Knowledge Catalog, you've built a foundation for AI-assisted governance. This approach doesn't take you out of the governance loop; it accelerates rule creation so that you can focus on validating and refining rules based on your business logic.
