The ML.TFDV_DESCRIBE function
This document describes the ML.TFDV_DESCRIBE function, which you can use to generate fine-grained statistics for the columns in a table. For example, you might want to know statistics for a table of training or serving data that you plan to use with a machine learning (ML) model. Calling this function provides the same behavior as calling the TensorFlow tfdv.generate_statistics_from_csv API.
You can use the data output by this function for purposes such as feature preprocessing or model monitoring.
Syntax
ML.TFDV_DESCRIBE(
  { TABLE `PROJECT_ID.DATASET.TABLE_NAME` | (QUERY_STATEMENT) },
  STRUCT(
    [NUM_HISTOGRAM_BUCKETS AS num_histogram_buckets]
    [, NUM_QUANTILES_HISTOGRAM_BUCKETS AS num_quantiles_histogram_buckets]
    [, NUM_VALUES_HISTOGRAM_BUCKETS AS num_values_histogram_buckets]
    [, NUM_RANK_HISTOGRAM_BUCKETS AS num_rank_histogram_buckets])
)
Arguments
ML.TFDV_DESCRIBE takes the following arguments:
- PROJECT_ID: your project ID.
- DATASET: the BigQuery dataset that contains the table.
- TABLE_NAME: the name of the input table that contains the training or serving data to calculate statistics for.
- QUERY_STATEMENT: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the QUERY_STATEMENT clause, see GoogleSQL query syntax.
- NUM_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a histogram with equal-width buckets. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- NUM_QUANTILES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_quantiles_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- NUM_VALUES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to ARRAY columns. The num_values_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
- NUM_RANK_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a rank histogram. Only applies to categorical and ARRAY<categorical> columns. The num_rank_histogram_buckets value must be in the range [1, 10,000]. The default value is 50.
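As a rough sketch of how these arguments combine, the following query passes a query statement instead of a table and sets two of the histogram options. The column names (species, island, body_mass_g) are assumptions about the penguins public dataset, and the bucket counts are arbitrary values chosen from within the documented ranges.

```sql
-- Sketch: compute statistics over a query result rather than a whole table.
-- The selected column names are assumed to exist in the penguins public dataset.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT species, island, body_mass_g
    FROM `bigquery-public-data.ml_datasets.penguins`
  ),
  STRUCT(
    5 AS num_histogram_buckets,       -- equal-width buckets; allowed range [1, 1,000]
    20 AS num_rank_histogram_buckets  -- rank histogram buckets; allowed range [1, 10,000]
  )
);
```

Options that are left out of the STRUCT fall back to the default values listed above.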
Output
ML.TFDV_DESCRIBE returns a column named dataset_feature_statistics_list that contains a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format.
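If you want to keep the statistics for later comparison, for example as a baseline for model monitoring, one possible approach is to save the output column to a table. This is a minimal sketch: the destination table `mydataset.penguins_stats` is a hypothetical name, and the bucket count repeats the documented default.

```sql
-- Sketch: persist the statistics so they can be reused later.
-- `mydataset.penguins_stats` is a hypothetical destination table.
CREATE OR REPLACE TABLE `mydataset.penguins_stats` AS
SELECT dataset_feature_statistics_list
FROM ML.TFDV_DESCRIBE(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(10 AS num_histogram_buckets)  -- same as the documented default
);
```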
Example
The following example returns statistics for the penguins public dataset and uses 20 buckets for rank histograms for string values:
SELECT *
FROM ML.TFDV_DESCRIBE(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(20 AS num_rank_histogram_buckets)
);
Limitations
Input data for the ML.TFDV_DESCRIBE function can only contain columns of the following data types:
- Numeric types
- STRING
- BOOL
- BYTES
- DATE
- DATETIME
- TIME
- TIMESTAMP
- ARRAY<STRUCT<INT64, FLOAT64>> (a sparse tensor)
- STRUCT columns that contain any of the following types:
  - Numeric types
  - STRING
  - BOOL
  - BYTES
  - DATE
  - DATETIME
  - TIME
  - TIMESTAMP
- ARRAY columns that contain any of the following types:
  - Numeric types
  - STRING
  - BOOL
  - BYTES
  - DATE
  - DATETIME
  - TIME
  - TIMESTAMP
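When a table contains a column whose type is not in the preceding list, one way to work around the limitation is to pass a query statement that leaves that column out. The following is a sketch under stated assumptions: the table `myproject.mydataset.events` and its GEOGRAPHY column `location` are hypothetical names used only for illustration.

```sql
-- Sketch: exclude a column of an unsupported type before computing statistics.
-- `myproject.mydataset.events` and its `location` column are hypothetical.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT * EXCEPT (location)  -- GEOGRAPHY is not among the supported types
    FROM `myproject.mydataset.events`
  ),
  STRUCT(10 AS num_histogram_buckets)
);
```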
Pricing
The ML.TFDV_DESCRIBE function uses BigQuery on-demand compute pricing.
What's next
- For more information about model monitoring in BigQuery ML, see Model monitoring overview.
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models.

