The ML.TFDV_VALIDATE function
=============================

This document describes the `ML.TFDV_VALIDATE` function, which you can use to
compare the statistics for training and serving data, or for two sets of
serving data, to identify anomalous differences between the two data
sets. Calling this function provides the same behavior as calling the
TensorFlow
[`validate_statistics` API](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/validate_statistics).
You can use the data output by this function for
[model monitoring](/bigquery/docs/model-monitoring-overview).

`ML.TFDV_VALIDATE` takes the following arguments:

- `base_statistics`: the statistics of the training or serving data that you want to use as the baseline for comparison. This must be a TensorFlow [`DatasetFeatureStatisticsList` protocol buffer](https://www.tensorflow.org/tfx/tf_metadata/api_docs/python/tfmd/proto/statistics_pb2/DatasetFeatureStatisticsList) in JSON format. You can generate a protocol buffer in the correct format by running the [`ML.TFDV_DESCRIBE` function](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-tfdv-describe), or you can load it from outside of BigQuery.
- `study_statistics`: the statistics of the training or serving data that you want to compare to the baseline. This must be a TensorFlow `DatasetFeatureStatisticsList` protocol buffer in JSON format. You can generate a protocol buffer in the correct format by running the `ML.TFDV_DESCRIBE` function, or you can load it from outside of BigQuery.
- `detection_type`: a `STRING` value that specifies the type of comparison that you want to make. Valid values are as follows:
  - `SKEW`: returns the data skew, which represents the statistical variation between training and serving data.
  - `DRIFT`: returns the data drift, which represents the statistical variation between two different sets of serving data.
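- `categorical_default_threshold`: a `FLOAT64` value that specifies the custom threshold to use for anomaly detection for categorical and `ARRAY<categorical>` features. The value must be in the range `[0, 1)`. The default value is `0.3`.
- `categorical_metric_type`: a `STRING` value that specifies the metric used to compare statistics for categorical and `ARRAY<categorical>` features. Valid values are as follows:
  - `L_INFTY`: use [L-infinity distance](https://en.wikipedia.org/wiki/Chebyshev_distance). This value is the default.
  - `JENSEN_SHANNON_DIVERGENCE`: use [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence).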
- `numerical_default_threshold`: a `FLOAT64` value that specifies the custom threshold to use for anomaly detection for numerical, `ARRAY<numerical>`, and `ARRAY<STRUCT<INT64, numerical>>` features. The value must be in the range `[0, 1)`. The default value is `0.3`.
- `numerical_metric_type`: a `STRING` value that specifies the metric used to compare statistics for numerical, `ARRAY<numerical>`, and `ARRAY<STRUCT<INT64, numerical>>` features. The only valid value is `JENSEN_SHANNON_DIVERGENCE`.
- `thresholds`: an `ARRAY<STRUCT<STRING, FLOAT64>>` value that specifies the anomaly detection thresholds for one or more columns for which you don't want to use the default threshold. The `STRING` value in the struct specifies the column name, and the `FLOAT64` value specifies the threshold. The `FLOAT64` value must be in the range `[0, 1)`. For example, `[('col_a', 0.1), ('col_b', 0.8)]`.
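
To give a feel for what an L-infinity metric and a threshold such as `0.3` mean for a categorical feature, the following sketch computes an L-infinity (Chebyshev) distance between two made-up categorical distributions in plain GoogleSQL. The category names and frequencies are invented for illustration, and the query is not necessarily how `ML.TFDV_VALIDATE` computes the distance internally. It only shows the definition: the largest absolute difference between the per-category relative frequencies.

```sql
-- Illustrative only: hypothetical relative frequencies for one categorical feature.
WITH base AS (
  SELECT 'red' AS category, 0.5 AS freq UNION ALL
  SELECT 'green', 0.3 UNION ALL
  SELECT 'blue', 0.2
),
study AS (
  SELECT 'red' AS category, 0.2 AS freq UNION ALL
  SELECT 'green', 0.3 UNION ALL
  SELECT 'blue', 0.5
)
SELECT
  -- A category that is missing on one side counts as frequency 0 on that side.
  MAX(ABS(IFNULL(b.freq, 0) - IFNULL(s.freq, 0))) AS l_infinity_distance
FROM base AS b
FULL OUTER JOIN study AS s USING (category);
```

The result is a single value between 0 and 1, which is the kind of quantity that a per-feature threshold in the range `[0, 1)` is compared against.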

`ML.TFDV_VALIDATE` uses positional arguments, so if you specify an
optional argument, you must also specify all arguments prior to that argument.
For more information on argument types, see
[Named arguments](/bigquery/docs/reference/standard-sql/functions-reference#named_arguments).
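
As a minimal sketch of the positional rule, suppose you only want to tighten `numerical_default_threshold` to `0.1`. Because it is the sixth argument, the three optional arguments before it must be written out as well; here the two categorical arguments are simply set to their documented defaults. The table names are placeholders, and the `anomalies` alias and variable names are assumptions made for this illustration.

```sql
-- Placeholder tables; replace them with your own training and serving tables.
DECLARE base_stats JSON;
DECLARE study_stats JSON;

SET base_stats = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`));
SET study_stats = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`));

SELECT ML.TFDV_VALIDATE(
  base_stats,   -- base_statistics
  study_stats,  -- study_statistics
  'SKEW',       -- detection_type: must be given to reach the later arguments
  0.3,          -- categorical_default_threshold: documented default, restated
  'L_INFTY',    -- categorical_metric_type: documented default, restated
  0.1           -- numerical_default_threshold: the only value being changed
) AS anomalies;
```

The trailing `numerical_metric_type` and `thresholds` arguments can be left out because nothing after them is specified.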

Syntax
------

```sql
ML.TFDV_VALIDATE(
  base_statistics,
  study_statistics
  [, detection_type]
  [, categorical_default_threshold]
  [, categorical_metric_type]
  [, numerical_default_threshold]
  [, numerical_metric_type]
  [, thresholds]
)
```

Output
------

`ML.TFDV_VALIDATE` returns a TensorFlow
[`Anomalies` protocol buffer](https://www.tensorflow.org/tfx/tf_metadata/api_docs/python/tfmd/proto/anomalies_pb2/Anomalies)
in JSON format.

Examples
--------

The following example returns the skew between training and serving data
and also sets custom anomaly detection thresholds for two of the feature
columns:

```sql
DECLARE stats1 JSON;
DECLARE stats2 JSON;

-- Compute statistics for the training (baseline) and serving (study) tables.
SET stats1 = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`));

SET stats2 = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`));

-- Compare the two sets of statistics, overriding the thresholds for two features.
SELECT ML.TFDV_VALIDATE(
  stats1, stats2, 'SKEW', .3, 'L_INFTY', .3, 'JENSEN_SHANNON_DIVERGENCE', [('feature1', 0.2), ('feature2', 0.5)]
);

-- Save the statistics in stats1 together with a timestamp.
INSERT `myproject.mydataset.serve_stats`
  (t, dataset_feature_statistics_list)
SELECT CURRENT_TIMESTAMP() AS t, stats1;
```

The following example returns the drift between two sets of serving data:

```sql
SELECT ML.TFDV_VALIDATE(
  -- Baseline: previously saved statistics.
  (SELECT dataset_feature_statistics_list FROM `myproject.mydataset.servingJan24`),
  -- Study: statistics computed from the current serving table.
  (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)),
  'DRIFT'
);
```

Limitations
-----------

The `ML.TFDV_VALIDATE` function doesn't conduct schema validation.

`ML.TFDV_VALIDATE` handles type mismatch as follows:

- If you specify `JENSEN_SHANNON_DIVERGENCE` for the `categorical_metric_type` or `numerical_metric_type` argument, the feature isn't included in the final anomaly report.
- If you specify `L_INFTY` for the `categorical_metric_type` argument, the function outputs the computed feature distance as expected.