The ML.TFDV_DESCRIBE function
This document describes the ML.TFDV_DESCRIBE function, which you can use
to generate fine-grained statistics for the columns in a table. For example, you
might want to know statistics for a table of training or serving data
that you plan to use with a machine learning (ML) model. Calling
this function provides the same behavior as calling the TensorFlow
tfdv.generate_statistics_from_csv API.
You can use the data output by this function for such purposes as feature
preprocessing or model monitoring.
Syntax
ML.TFDV_DESCRIBE(
  { TABLE `PROJECT_ID.DATASET.TABLE_NAME` | (QUERY_STATEMENT) },
  STRUCT(
    [NUM_HISTOGRAM_BUCKETS AS num_histogram_buckets]
    [, NUM_QUANTILES_HISTOGRAM_BUCKETS AS num_quantiles_histogram_buckets]
    [, NUM_VALUES_HISTOGRAM_BUCKETS AS num_values_histogram_buckets]
    [, NUM_RANK_HISTOGRAM_BUCKETS AS num_rank_histogram_buckets])
)
Arguments
ML.TFDV_DESCRIBE takes the following arguments:
-  PROJECT_ID: your project ID.
-  DATASET: the BigQuery dataset that contains the table.
-  TABLE_NAME: the name of the input table that contains the training or serving data to calculate statistics for.
-  QUERY_STATEMENT: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the QUERY_STATEMENT clause, see GoogleSQL query syntax.
-  NUM_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a histogram with equal-width buckets. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_histogram_buckets value must be in the range [1, 1000]. The default value is 10.
-  NUM_QUANTILES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_quantiles_histogram_buckets value must be in the range [1, 1000]. The default value is 10.
-  NUM_VALUES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to ARRAY columns. The num_values_histogram_buckets value must be in the range [1, 1000]. The default value is 10.
-  NUM_RANK_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a rank histogram. Only applies to categorical and ARRAY<categorical> columns. The num_rank_histogram_buckets value must be in the range [1, 10000]. The default value is 50.
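As an illustration of how these arguments combine, the following sketch passes a query statement instead of a table and sets two of the optional bucket counts. It assumes that the penguins public dataset has columns named species, island, and body_mass_g; the column selection and bucket values are arbitrary choices made for this illustration. Any option left out of the STRUCT keeps its default value.

```sql
-- Sketch: describe a subset of columns from the penguins public dataset.
-- The selected columns and bucket counts are illustrative assumptions.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT species, island, body_mass_g
    FROM `bigquery-public-data.ml_datasets.penguins`
  ),
  STRUCT(
    5 AS num_histogram_buckets,       -- equal-width buckets for numerical columns
    20 AS num_rank_histogram_buckets  -- rank buckets for categorical columns
  )
);
```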
Output
ML.TFDV_DESCRIBE returns a column named dataset_feature_statistics_list
that contains a TensorFlow DatasetFeatureStatisticsList protocol buffer
in JSON format.
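To work with individual features rather than the whole JSON document, one option is to unnest the output. The following is a minimal sketch, not a documented pattern. It assumes that the JSON keys match the datasets and features field names of the DatasetFeatureStatisticsList protocol buffer, and that the GoogleSQL JSON_QUERY_ARRAY function accepts the output column. Check the JSON paths against the actual output before relying on them.

```sql
-- Sketch: return one row per feature from the statistics JSON.
-- The path '$.datasets[0].features' is an assumption based on the
-- DatasetFeatureStatisticsList field names; verify it against real output.
SELECT feature
FROM ML.TFDV_DESCRIBE(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(20 AS num_rank_histogram_buckets)
) AS stats,
UNNEST(
  JSON_QUERY_ARRAY(stats.dataset_feature_statistics_list, '$.datasets[0].features')
) AS feature;
```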
Example
The following example returns statistics for the penguins public dataset and
uses 20 buckets for rank histograms for string values:
SELECT *
FROM ML.TFDV_DESCRIBE(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(20 AS num_rank_histogram_buckets)
);
Limitations
Input data for the ML.TFDV_DESCRIBE function can only contain columns of the
following data types:
- Numeric types
- STRING
- BOOL
- BYTES
- DATE
- DATETIME
- TIME
- TIMESTAMP
- ARRAY<STRUCT<INT64, FLOAT64>> (a sparse tensor)
- STRUCT columns that contain any of the following types:
  - Numeric types
  - STRING
  - BOOL
  - BYTES
  - DATE
  - DATETIME
  - TIME
  - TIMESTAMP
- ARRAY columns that contain any of the following types:
  - Numeric types
  - STRING
  - BOOL
  - BYTES
  - DATE
  - DATETIME
  - TIME
  - TIMESTAMP
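When a table includes columns outside this list, one way to stay within the limitation is to pass a query statement that drops or converts them. The following sketch assumes a hypothetical table named mydataset.stores with a GEOGRAPHY column named location and a JSON column named metadata. Those names are illustrative only and do not come from this document.

```sql
-- Sketch: drop or convert columns whose types are not supported.
-- `mydataset.stores`, location, and metadata are hypothetical names.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT
      * EXCEPT (location, metadata),       -- remove the unsupported columns
      ST_ASTEXT(location) AS location_wkt  -- keep the geography as a STRING
    FROM `mydataset.stores`
  ),
  STRUCT(10 AS num_histogram_buckets)
);
```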
 
Pricing
The ML.TFDV_DESCRIBE function uses BigQuery on-demand compute pricing.
What's next
- For more information about model monitoring in BigQuery ML, see Model monitoring overview.
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models.

