Using a Human-in-the-Loop approach, where AI drafts the initial rules and you review, refine, and validate them, you can quickly translate profile statistics into a data quality framework.
Objectives
- Flatten nested BigQuery data with materialized views to enable Knowledge Catalog profiling.
- Run Knowledge Catalog profile scans using the Python client library.
- Use the Gemini CLI to generate data quality rules based on profile statistics.
- Validate and deploy AI-generated rules as Knowledge Catalog quality scans using a Human-in-the-Loop review process.
Before you begin
Before you begin, make sure you have a Google Cloud project with billing enabled.
Prepare your environment
The following steps use Cloud Shell, a command-line environment running in the cloud.
- In the Google Cloud console, click Activate Cloud Shell in the top-right toolbar. The environment takes a few moments to provision and connect.
- In Cloud Shell, set your project ID and environment variables:

  ```bash
  export PROJECT_ID=$(gcloud config get-value project)
  gcloud config set project $PROJECT_ID
  export LOCATION="us-central1"
  export BQ_LOCATION="us"
  export DATASET_ID="dataplex_dq_codelab"
  export TABLE_ID="ga4_transactions"
  ```

  Use `us` (multi-region) as the BigQuery location because the public sample data is also stored in the `us` multi-region. For BigQuery queries, the source data and the destination table must be in the same location.
- Enable the required services:

  ```bash
  gcloud services enable dataplex.googleapis.com \
      bigquery.googleapis.com \
      serviceusage.googleapis.com
  ```
- Create a BigQuery dataset to store the sample data and results:

  ```bash
  bq --location=us mk --dataset $PROJECT_ID:$DATASET_ID
  ```
- Prepare the sample data, which comes from a public ecommerce dataset for the Google Merchandise Store. The following `bq` command creates a new table, `ga4_transactions`, in your `dataplex_dq_codelab` dataset. To ensure scans run quickly, it only copies data from one day (2021-01-31).

  ```bash
  bq query \
      --use_legacy_sql=false \
      --destination_table=$PROJECT_ID:$DATASET_ID.$TABLE_ID \
      --replace=true \
      'SELECT * FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`'
  ```
- Clone the GitHub repository that contains the folder structure and supporting files for this tutorial:

  ```bash
  # Perform a shallow clone to get only the latest repository structure without the full history
  git clone --depth 1 --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/devrel-demos.git
  cd devrel-demos

  # Specify and download only the folder we need for this lab
  git sparse-checkout set data-analytics/programmatic-dq
  cd data-analytics/programmatic-dq
  ```

  This directory is your active working area.
Profile nested data
With data profiling, Knowledge Catalog computes statistics for top-level columns, such as null percentages, uniqueness, and value distributions, to help you understand your data.
To get statistics for nested fields, you can flatten the data using a set of materialized views. This turns each nested field into a top-level column that Knowledge Catalog can profile.
Get the nested schema
Get the full schema of your source table, including all nested structures, and save the output as a JSON file:
```bash
bq show --schema --format=json $PROJECT_ID:$DATASET_ID.$TABLE_ID > bq_schema.json
```
View the schema:
```bash
jq . bq_schema.json
```
The `bq_schema.json` file reveals the table's complex nested structures.
Flatten data with a materialized view
When you flatten nested data, it's important not to unnest multiple independent arrays in the same view. Doing so performs an implicit cross join (Cartesian product) between the arrays, which multiplies rows incorrectly and corrupts your data.
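To see why this matters, the following sketch simulates the implicit cross join in plain Python (the field names are illustrative, not actual GA4 columns): unnesting two independent arrays in one view pairs every element of the first with every element of the second.

```python
from itertools import product

# One event row with two independent repeated fields, mimicking a nested record.
# "items" and "experiment_ids" are hypothetical names for illustration only.
event = {
    "items": ["hat", "mug", "shirt"],    # 3 elements
    "experiment_ids": ["expA", "expB"],  # 2 elements
}

# Unnesting both arrays in the same view behaves like a Cartesian product:
rows = list(product(event["items"], event["experiment_ids"]))
print(len(rows))  # prints 6: every item is paired with every experiment ID
```

One source row becomes six, so any per-item or per-experiment aggregate computed over the combined view is inflated.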
It's best to create multiple views instead, each built for a specific purpose. Each view should keep a single, clear level of detail. In this step, you create the following materialized views:
- Session flat view (`mv_ga4_user_session_flat.sql`): one row per event.
- Transactions view (`mv_ga4_ecommerce_transactions.sql`): one row per transaction.
- Items view (`mv_ga4_ecommerce_items.sql`): one row per item.
The project repository provides three SQL files in the `devrel-demos/data-analytics/programmatic-dq` directory that define these views. Run them from Cloud Shell with the following BigQuery commands:
```bash
envsubst < mv_ga4_user_session_flat.sql | bq query --use_legacy_sql=false
envsubst < mv_ga4_ecommerce_transactions.sql | bq query --use_legacy_sql=false
envsubst < mv_ga4_ecommerce_items.sql | bq query --use_legacy_sql=false
```
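Here, `envsubst` substitutes environment variables such as `$PROJECT_ID` into the SQL before piping it to `bq`. As a rough sketch of that behavior, Python's `os.path.expandvars` does something similar (the SQL template below is illustrative, not the actual contents of the repository files):

```python
import os

# Illustrative template; the real SQL lives in mv_ga4_user_session_flat.sql.
template = (
    "CREATE MATERIALIZED VIEW `$PROJECT_ID.$DATASET_ID.mv_ga4_user_session_flat` AS\n"
    "SELECT event_date, event_name FROM `$PROJECT_ID.$DATASET_ID.$TABLE_ID`"
)

os.environ.update({
    "PROJECT_ID": "my-project",
    "DATASET_ID": "dataplex_dq_codelab",
    "TABLE_ID": "ga4_transactions",
})

# os.path.expandvars replaces $VAR references, much like envsubst does.
rendered = os.path.expandvars(template)
print(rendered.splitlines()[0])
```

This is why the earlier `export` commands matter: if a variable is unset, `envsubst` substitutes an empty string and the resulting SQL fails.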
Run profile scans with the Python client
You can now create and run Knowledge Catalog data profile scans for each materialized view. The following Python script uses the `google-cloud-dataplex` client library to automate this process.
Before you run the script, create an isolated Python virtual environment in your project directory.
```bash
# Create the virtual environment
python3 -m venv dq_venv

# Activate the environment
source dq_venv/bin/activate
```
Install the Knowledge Catalog client library inside the virtual environment.
```bash
# Install the Dataplex client library
pip install google-cloud-dataplex
```
Now that you've set up the environment and installed the library, you're ready to use the `1_run_dataplex_scans.py` script. This script profiles your three materialized views by creating and running a scan for each one. When it finishes, it outputs a rich statistical summary that you use in the next step to generate AI-powered data quality rules.
Run the script from your Cloud Shell terminal.
```bash
python3 1_run_dataplex_scans.py
```
Check your profile scans
You can check out the new profile scans in the Google Cloud console.
- In the navigation menu, go to Knowledge Catalog and select Data profiling & quality in the Govern section.
- Find your three profile scans listed, along with their latest job status. Click a scan to explore its detailed results.
Export profile results to JSON
For Gemini to read your profile scans, you need to extract their contents into a local file.
Use the `2_dq_profile_save.py` script to find the latest successful scan for the `mv_ga4_user_session_flat` view, download the profile data, and save it to a file named `dq_profile_results.json`.
```bash
python3 2_dq_profile_save.py
```
When the script finishes, it creates a `dq_profile_results.json` file in the directory. This file holds the detailed statistical metadata you need to generate data quality rules. Take a look at its contents by running the following command:
```bash
cat dq_profile_results.json
```
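The mapping from profile statistics to rule types is mechanical enough to sketch in a few lines. The following is a simplified illustration of the kind of reasoning you will ask Gemini to apply; the dict below is a hand-written stand-in, not the actual schema of `dq_profile_results.json`:

```python
# Simplified stand-in for the profile output; the real file follows the
# Dataplex DataProfileResult schema, which nests these statistics differently.
profile = {
    "fields": [
        {"name": "event_date", "null_ratio": 0.0},
        {"name": "device_category", "null_ratio": 0.0,
         "top_values": ["desktop", "mobile", "tablet"]},
        {"name": "purchase_revenue", "null_ratio": 0.92,
         "min": 0.0, "max": 812.0},
    ]
}

def propose_rules(profile):
    """Map simple column statistics to candidate Dataplex rule types."""
    rules = []
    for f in profile["fields"]:
        if f["null_ratio"] == 0.0:            # never null so far -> nonNullExpectation
            rules.append((f["name"], "nonNullExpectation"))
        if len(f.get("top_values", [])) > 0:  # small closed domain -> setExpectation
            rules.append((f["name"], "setExpectation"))
        if "min" in f and "max" in f:         # numeric bounds -> rangeExpectation
            rules.append((f["name"], "rangeExpectation"))
    return rules

print(propose_rules(profile))
```

Gemini performs this same style of analysis over the full JSON file, but with your review in the loop for every proposed rule.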
Generate data quality rules with the Gemini CLI
Now you can use the Gemini CLI to read the local profile scan results.
Manually writing data quality rules for complex datasets is time-consuming and error-prone. Generative AI accelerates this workflow by generating a comprehensive initial data quality configuration in seconds. This helps you pivot from manual task execution to high-level oversight.
To start the Gemini CLI, use the following command:
```bash
gemini
```
Now you're ready to generate quality rules. Because the CLI can read files in your current directory, it can use your new profile scan data directly.
Prompt Gemini to create a plan
Ask Gemini to act as an expert analyst and propose a plan for creating your data quality rules. Tell Gemini not to write the YAML file yet so that it focuses on analysis. Gemini analyzes the JSON file and returns a structured plan.
```
You are an expert Google Cloud Dataplex engineer. Your first task is to create a plan.

I have a file in the current directory named ./dq_profile_results.json. Based on the statistical data within that file, propose a step-by-step plan to create a Dataplex data quality rules file.

Your plan should identify which specific columns are good candidates for rules like nonNullExpectation, setExpectation, or rangeExpectation, and explain why based on the metrics (for example, "Plan to create a nonNullExpectation for column X because its null percentage is 0%").

Do not write the YAML file yet. Just provide the plan.
```
Generate data quality rules
Gemini's plan relies entirely on statistical patterns and lacks your specific business knowledge.
Review the plan and ask yourself the following questions:
- Does it align with your business goals and context?
- Are any statistically sound rules actually impractical (like a strict rowCount for a growing table)?
Refine the plan with Gemini, or approve it as-is, using feedback such as the following examples. The full prompt that follows starts by providing feedback, then instructs Gemini to generate the `dq_rules.yaml` file in your working directory and to conform to the DataQualityRule specification, because Knowledge Catalog requires a precise YAML structure. This helps prevent syntax errors and the use of outdated schema versions.
- "The plan looks good. Please proceed."
- "The rowCount rule is not necessary, as the table size changes daily. The rest of the plan is approved. Please proceed."
- "For the setExpectation on the geo_continent column, please also include 'Antarctica'."
```
Once you have incorporated my feedback, please generate the `dq_rules.yaml` file.

You must adhere to the following strict requirements:

- Schema Compliance: The YAML structure must strictly conform to the DataQualityRule specification. For a definitive source of truth, you must refer to the sample_rule.yaml file in the current directory and the DataQualityRule class definition. Search for the `data_quality.py` file inside the `./dq_venv/lib/` directory to read this class definition.
- Data-Driven Values: All rule parameters, such as thresholds or expected values, must be derived directly from the statistical metrics in dq_profile_results.json.
- Rule Justification: For each rule, add a comment (#) on the line above explaining the justification, as you outlined in your plan.
- Output Purity: The final output must only be the raw YAML code block, perfectly formatted and ready for immediate deployment.
```
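Before deploying AI-generated rules, a quick local sanity check can catch structural mistakes. The sketch below assumes `dq_rules.yaml` has already been parsed into a Python dict (for example with PyYAML); the expectation names follow the Dataplex DataQualityRule specification, simplified here for illustration:

```python
# Expectation types defined by the Dataplex DataQualityRule spec (a subset).
EXPECTATIONS = {
    "nonNullExpectation", "setExpectation", "rangeExpectation",
    "regexExpectation", "uniquenessExpectation", "rowConditionExpectation",
}

# Stand-in for yaml.safe_load(open("dq_rules.yaml")); contents are illustrative.
spec = {
    "rules": [
        {"column": "event_date", "dimension": "COMPLETENESS",
         "nonNullExpectation": {}},
        {"column": "device_category", "dimension": "VALIDITY",
         "setExpectation": {"values": ["desktop", "mobile", "tablet"]}},
    ]
}

def check_rules(spec):
    """Return a list of structural problems; an empty list means the file looks sane."""
    problems = []
    for i, rule in enumerate(spec.get("rules", [])):
        found = EXPECTATIONS & rule.keys()
        if len(found) != 1:  # each rule must declare exactly one expectation
            problems.append(f"rule {i}: expected exactly one expectation, got {sorted(found)}")
        if "dimension" not in rule:
            problems.append(f"rule {i}: missing dimension")
    return problems

print(check_rules(spec))  # prints []
```

A check like this is a complement to, not a replacement for, your human review of each rule's business logic.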
Create and run a data quality scan
You now have an agent-generated set of data quality rules that you can register and deploy as a scan.
- Exit the Gemini CLI by entering `/quit` or pressing `Ctrl+C` twice.
- Then, create a data scan in Knowledge Catalog:

  ```bash
  export DQ_SCAN="dq-scan"
  gcloud dataplex datascans create data-quality $DQ_SCAN \
      --project=$PROJECT_ID \
      --location=$LOCATION \
      --data-quality-spec-file=dq_rules.yaml \
      --data-source-resource="//bigquery.googleapis.com/projects/$PROJECT_ID/datasets/$DATASET_ID/tables/mv_ga4_user_session_flat"
  ```
- Finally, run the scan:

  ```bash
  gcloud dataplex datascans run $DQ_SCAN --location=$LOCATION --project=$PROJECT_ID
  ```

  This command runs the data quality scan named `dq-scan`.
- Check your scan's progress in the Knowledge Catalog section of the Google Cloud console.
- In the navigation menu, go to Knowledge Catalog and select Data profiling & quality in the Govern section.
- Find `dq-scan` in the list. When the scan completes, click it to see the results.
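You can also inspect results programmatically once a job finishes. The sketch below summarizes a result dict; this is a simplified stand-in for the `dataQualityResult` payload a scan job produces, and the field names here are illustrative, not the exact API schema:

```python
# Simplified, hand-written stand-in for a data quality job result.
result = {
    "rules": [
        {"column": "event_date", "type": "nonNullExpectation", "passed": True},
        {"column": "device_category", "type": "setExpectation", "passed": True},
        {"column": "purchase_revenue", "type": "rangeExpectation", "passed": False},
    ]
}

def summarize(result):
    """Count rules and collect the columns whose rules failed."""
    rules = result["rules"]
    failed = [r["column"] for r in rules if not r["passed"]]
    return {"total": len(rules), "failed": failed}

print(summarize(result))  # prints {'total': 3, 'failed': ['purchase_revenue']}
```

A summary like this is a natural input for alerting or for the next round of rule refinement with Gemini.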
Clean up
To avoid recurring billing charges for the resources you created in this tutorial, delete them.
Delete the Knowledge Catalog scans
Delete your profile and quality scans using the specific scan names from this codelab:
```bash
# Delete the Data Quality Scan
gcloud dataplex datascans delete dq-scan \
    --location=us-central1 \
    --project=$PROJECT_ID --quiet

# Delete the Data Profile Scans
gcloud dataplex datascans delete profile-scan-mv-ga4-user-session-flat \
    --location=us-central1 \
    --project=$PROJECT_ID --quiet

gcloud dataplex datascans delete profile-scan-mv-ga4-ecommerce-transactions \
    --location=us-central1 \
    --project=$PROJECT_ID --quiet

gcloud dataplex datascans delete profile-scan-mv-ga4-ecommerce-items \
    --location=us-central1 \
    --project=$PROJECT_ID --quiet
```
Delete the sample dataset
Delete your temporary BigQuery dataset and its tables.
```bash
bq rm -r -f --dataset $PROJECT_ID:dataplex_dq_codelab
```
Delete local files
Deactivate the Python virtual environment and remove the cloned repository and its contents:
```bash
deactivate
cd ../../..
rm -rf devrel-demos
```
Conclusion
Congratulations, you just built an end-to-end, programmatic data governance workflow.
By pairing Gemini with Knowledge Catalog, you've built a foundation for AI-assisted governance. This approach doesn't replace the governance loop, but accelerates the process of rule creation so that you can focus on validating and refining rules based on your business logic.
What's next
- Read more about the philosophy behind this architecture in AI-Assisted Governance: Accelerating Data Quality with Human Oversight.
- Manage data quality as code by creating a CI/CD pipeline.
- Explore using custom SQL rules to enforce business-specific logic.
- Optimize your scans with filters and sampling to reduce costs.
- Automate your infrastructure by provisioning Knowledge Catalog resources with Terraform to manage your data governance at scale.
- Learn more about the Gemini CLI open-source AI agent.
- Try other Knowledge Catalog use cases.

