The ML.DESCRIBE_DATA function
This document describes the ML.DESCRIBE_DATA
function, which you can use to
generate descriptive statistics for the columns in a table or subquery. For
example, you might want to know statistics for a table of training or serving
data that you plan to use with a machine learning (ML) model. You can use the
data output by this function for such purposes as feature preprocessing
or model
monitoring
.
Syntax
ML . DESCRIBE_DATA ( { TABLE ` PROJECT_ID . DATASET . TABLE_NAME ` | ( QUERY_STATEMENT ) } , STRUCT ( [ NUM_QUANTILES AS num_quantiles ] [, NUM_ARRAY_LENGTH_QUANTILES AS num_array_length_quantiles ] [, TOP_K AS top_k ]) )
Arguments
ML.DESCRIBE_DATA
takes the following arguments:
-
PROJECT_ID: your project ID. -
DATASET: the BigQuery dataset that contains the table. -
TABLE_NAME: the name of the input table that contains the training or serving data to calculate statistics for. -
QUERY_STATEMENT: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of theQUERY_STATEMENTclause, see GoogleSQL query syntax . -
NUM_QUANTILES: anINT64value that specifies the number of quantiles to return for numerical,ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>columns. This affects the number of results shown in thequantilesoutput column. These quantiles describe the distribution of the data in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. TheNUM_QUANTILESvalue must be in the range[1, 100,000]. The default value is2. -
NUM_ARRAY_LENGTH_QUANTILES: anINT64value that specifies the number of quantiles to return forARRAYcolumns. This affects the number of results shown in thearray_length_quantilesoutput column. These quantiles describe the distribution of the length of the arrays in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. TheNUM_ARRAY_LENGTH_QUANTILESvalue must be in the range[1, 100,000]. The default value is10. -
TOP_K: anINT64value that specifies the number of top values to return for categorical andARRAY<categorical>columns. This affects the number of results shown in thetop_valuesoutput column. The top values are the values that are shown most frequently in the column. TheTOP_Kvalue must be in the range[1, 10,000]. The default value is1.
Details
ML.DESCRIBE_DATA
handles input columns as follows:
-
ARRAYcolumns are unnested before statistics are computed on them. -
ARRAY<STRUCT<INT64, numerical>>. TheINT64value is the index, and the numerical value is the value. For statistics computation, BigQuery ML treats columns of this type asARRAY<numerical>based on the value. The value of the dimension column in the output isMAX(index) + 1. -
STRUCTfields are expanded, and then categorical columns are cast toSTRINGand numerical columns are cast toFLOAT64. - Columns of the following data types are cast
to
STRINGand return the same statistics asSTRINGcolumns:-
BOOL -
BYTE -
DATE -
DATETIME -
TIME -
TIMESTAMPColumns of the following data types are cast toFLOAT64and return the same statistics asFLOAT64columns: -
INT64 -
NUMERIC -
BIGNUMERIC
-
Output
ML.DESCRIBE_DATA
returns one row for each column in the input data. ML.DESCRIBE_DATA
output contains the following columns:
-
name: aSTRINGcolumn that contains the name of the input column. -
num_rows: anINT64column that contains the total number of rows for the input column. -
num_nulls: anINT64column that returns the number ofNULLvalues found in the column. -
num_zeros: anINT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the number of0values found in the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
-
min: aSTRINGcolumn that contains theMINvalue for the column. -
max: aSTRINGcolumn that contains theMAXvalue for the column. -
mean: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the mean value calculated for the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
-
stdev: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the standard deviation value calculated for the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
-
median: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the median value calculated for the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
-
quantiles: anARRAY<FLOAT64>column that contains information about the quantiles in an input column, as computed by theAPPROX_QUANTILESfunction. Thequantilescolumn contains one of the following values:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returns the quantiles computed for the column. - For categorical or
ARRAY<categorical>input columns, returnsNULL.
- For numerical,
-
unique: anINT64column that contains information about the number of unique values in an input column, as computed by theAPPROX_COUNT_DISTINCTfunction. Theuniquecolumn contains one of the following values:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns the number of unique values in the input column.
- For numerical,
-
avg_string_length: aFLOAT64column that contains one of the following:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns the average length of the values in the column.
- For numerical,
-
num_values: anINT64column that contains the number of array elements forARRAYcolumns, and the number of values in the column for other types of columns. -
top_values: aARRAY<STRUCT<STRING, INT64>>column that contains information about the top values and number of occurrences in an input column, as computed by theAPPROX_TOP_COUNTfunction. Thetop_valuescolumn contains the following fields:-
top_values.value: aSTRINGfield that contains one of the following values:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns one of the top values in the input column.
- For numerical,
-
top_values.count: anINT64field that contains one of the following values:- For numerical,
ARRAY<numerical>, andARRAY<STRUCT<INT64, numerical>>input columns, returnsNULL. - For categorical or
ARRAY<categorical>input columns, returns the number of times the related top value appears.
- For numerical,
-
-
min_array_length: anINT64column that contains one of the following values:- For
ARRAYinput columns, returns the minimum length of an array in the column. - For other types of input columns, returns
NULL.
- For
-
max_array_length: anINT64column that contains one of the following values:- For
ARRAYinput columns, returns the maximum length of an array in the column. - For other types of input columns, returns
NULL.
- For
-
avg_array_length: aFLOAT64column that contains one of the following values:- For
ARRAYinput columns, returns the average length of an array in the column. - For other types of input columns, returns
NULL.
- For
-
total_array_length: anINT64column that contains one of the following values:- For
ARRAYinput columns, returns the sum of the size of the arrays in the column. - For other types of input columns, returns
NULL.
- For
-
array_length_quantiles: anARRAY<INT64>column that contains the information about the quantiles for the array length in an input column, as computed by theAPPROX_QUANTILESfunction. Thearray_length_quantilescolumn contains one of the following values:- For
ARRAYinput columns, returns the quantiles for the array length computed for the column. - For other types of input columns, returns
0.
- For
-
dimension: anINT64column that contains one of the following:- For
ARRAY<STRUCT<INT64, numerical>>input columns, returns the dimension computed for the column, which isMAX(index) + 1for sparse input . - For other types of input columns, returns
NULL.
- For
Example
The following example returns statistics for a table with five quantiles calculated for numeric columns and three top values returned for non-numeric columns:
SELECT * FROM ML . DESCRIBE_DATA ( TABLE ` myproject . mydataset . mytable ` , STRUCT ( 5 AS num_quantiles , 3 AS top_k ) );
Limitations
Input data for the ML.DESCRIBE_DATA
function can only contain columns of the
following data types:
- Numeric types
-
STRING -
BOOL -
BYTE -
DATE -
DATETIME -
TIME -
TIMESTAMP -
ARRAY<STRUCT<INT64, FLOAT64>>(a sparse tensor) -
STRUCTcolumns that contain any of the following types:- Numeric types
-
STRING -
BOOL -
BYTE -
DATE -
DATETIME -
TIME -
TIMESTAMP
-
ARRAYcolumns that contain any of the following types:- Numeric types
-
STRING -
BOOL -
BYTE -
DATE -
DATETIME -
TIME -
TIMESTAMP
Pricing
The ML.DESCRIBE_DATA
function uses BigQuery on-demand compute pricing
.
What's next
- For more information about model monitoring in BigQuery ML, see Model monitoring overview .
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models .

