The ML.DESCRIBE_DATA function
This document describes the ML.DESCRIBE_DATA
function, which you can use to
generate descriptive statistics for the columns in a table or subquery. For
example, you might want to know statistics for a table of training or serving
data that you plan to use with a machine learning (ML) model. You can use the
data output by this function for such purposes as feature preprocessing
or model
monitoring
.
Syntax
ML . DESCRIBE_DATA ( { TABLE ` PROJECT_ID . DATASET . TABLE_NAME ` | ( QUERY_STATEMENT ) } , STRUCT ( [ NUM_QUANTILES AS num_quantiles ] [, NUM_ARRAY_LENGTH_QUANTILES AS num_array_length_quantiles ] [, TOP_K AS top_k ]) )
Arguments
ML.DESCRIBE_DATA
takes the following arguments:
-
PROJECT_ID
: your project ID. -
DATASET
: the BigQuery dataset that contains the table. -
TABLE_NAME
: the name of the input table that contains the training or serving data to calculate statistics for. -
QUERY_STATEMENT
: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of theQUERY_STATEMENT
clause, see GoogleSQL query syntax . -
NUM_QUANTILES
: anINT64
value that specifies the number of quantiles to return for numerical,ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
columns. This affects the number of results shown in thequantiles
output column. These quantiles describe the distribution of the data in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. TheNUM_QUANTILES
value must be in the range[1, 100,000]
. The default value is2
. -
NUM_ARRAY_LENGTH_QUANTILES
: anINT64
value that specifies the number of quantiles to return forARRAY
columns. This affects the number of results shown in thearray_length_quantiles
output column. These quantiles describe the distribution of the length of the arrays in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. TheNUM_ARRAY_LENGTH_QUANTILES
value must be in the range[1, 100,000]
. The default value is10
. -
TOP_K
: anINT64
value that specifies the number of top values to return for categorical andARRAY<categorical>
columns. This affects the number of results shown in thetop_values
output column. The top values are the values that are shown most frequently in the column. TheTOP_K
value must be in the range[1, 10,000]
. The default value is1
.
Details
ML.DESCRIBE_DATA
handles input columns as follows:
-
ARRAY
columns are unnested before statistics are computed on them. -
ARRAY<STRUCT<INT64, numerical>>
. TheINT64
value is the index, and the numerical value is the value. For statistics computation, BigQuery ML treats columns of this type asARRAY<numerical>
based on the value. The value of the dimension column in the output isMAX(index) + 1
. -
STRUCT
fields are expanded, and then categorical columns are cast toSTRING
and numerical columns are cast toFLOAT64
. - Columns of the following data types are cast
to
STRING
and return the same statistics asSTRING
columns:-
BOOL
-
BYTE
-
DATE
-
DATETIME
-
TIME
-
TIMESTAMP
Columns of the following data types are cast toFLOAT64
and return the same statistics asFLOAT64
columns: -
INT64
-
NUMERIC
-
BIGNUMERIC
-
Output
ML.DESCRIBE_DATA
returns one row for each column in the input data. ML.DESCRIBE_DATA
output contains the following columns:
-
name
: aSTRING
column that contains the name of the input column. -
num_rows
: anINT64
column that contains the total number of rows for the input column. -
num_nulls
: anINT64
column that returns the number ofNULL
values found in the column. -
num_zeros
: anINT64
column that contains one of the following:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returns the number of0
values found in the column. - For categorical or
ARRAY<categorical>
input columns, returnsNULL
.
- For numerical,
-
min
: aSTRING
column that contains theMIN
value for the column. -
max
: aSTRING
column that contains theMAX
value for the column. -
mean
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returns the mean value calculated for the column. - For categorical or
ARRAY<categorical>
input columns, returnsNULL
.
- For numerical,
-
stdev
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returns the standard deviation value calculated for the column. - For categorical or
ARRAY<categorical>
input columns, returnsNULL
.
- For numerical,
-
median
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returns the median value calculated for the column. - For categorical or
ARRAY<categorical>
input columns, returnsNULL
.
- For numerical,
-
quantiles
: anARRAY<FLOAT64>
column that contains information about the quantiles in an input column, as computed by theAPPROX_QUANTILES
function. Thequantiles
column contains one of the following values:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returns the quantiles computed for the column. - For categorical or
ARRAY<categorical>
input columns, returnsNULL
.
- For numerical,
-
unique
: anINT64
column that contains information about the number of unique values in an input column, as computed by theAPPROX_COUNT_DISTINCT
function. Theunique
column contains one of the following values:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returnsNULL
. - For categorical or
ARRAY<categorical>
input columns, returns the number of unique values in the input column.
- For numerical,
-
avg_string_length
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returnsNULL
. - For categorical or
ARRAY<categorical>
input columns, returns the average length of the values in the column.
- For numerical,
-
num_values
: anINT64
column that contains the number of array elements forARRAY
columns, and the number of values in the column for other types of columns. -
top_values
: aARRAY<STRUCT<STRING, INT64>>
column that contains information about the top values and number of occurrences in an input column, as computed by theAPPROX_TOP_COUNT
function. Thetop_values
column contains the following fields:-
top_values.value
: aSTRING
field that contains one of the following values:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returnsNULL
. - For categorical or
ARRAY<categorical>
input columns, returns one of the top values in the input column.
- For numerical,
-
top_values.count
: anINT64
field that contains one of the following values:- For numerical,
ARRAY<numerical>
, andARRAY<STRUCT<INT64, numerical>>
input columns, returnsNULL
. - For categorical or
ARRAY<categorical>
input columns, returns the number of times the related top value appears.
- For numerical,
-
-
min_array_length
: anINT64
column that contains one of the following values:- For
ARRAY
input columns, returns the minimum length of an array in the column. - For other types of input columns, returns
NULL
.
- For
-
max_array_length
: anINT64
column that contains one of the following values:- For
ARRAY
input columns, returns the maximum length of an array in the column. - For other types of input columns, returns
NULL
.
- For
-
avg_array_length
: aFLOAT64
column that contains one of the following values:- For
ARRAY
input columns, returns the average length of an array in the column. - For other types of input columns, returns
NULL
.
- For
-
total_array_length
: anINT64
column that contains one of the following values:- For
ARRAY
input columns, returns the sum of the size of the arrays in the column. - For other types of input columns, returns
NULL
.
- For
-
array_length_quantiles
: anARRAY<INT64>
column that contains the information about the quantiles for the array length in an input column, as computed by theAPPROX_QUANTILES
function. Thearray_length_quantiles
column contains one of the following values:- For
ARRAY
input columns, returns the quantiles for the array length computed for the column. - For other types of input columns, returns
0
.
- For
-
dimension
: anINT64
column that contains one of the following:- For
ARRAY<STRUCT<INT64, numerical>>
input columns, returns the dimension computed for the column, which isMAX(index) + 1
for sparse input . - For other types of input columns, returns
NULL
.
- For
Example
The following example returns statistics for a table with five quantiles calculated for numeric columns and three top values returned for non-numeric columns:
SELECT * FROM ML . DESCRIBE_DATA ( TABLE ` myproject . mydataset . mytable ` , STRUCT ( 5 AS num_quantiles , 3 AS top_k ) );
Limitations
Input data for the ML.DESCRIBE_DATA
function can only contain columns of the
following data types:
- Numeric types
-
STRING
-
BOOL
-
BYTE
-
DATE
-
DATETIME
-
TIME
-
TIMESTAMP
-
ARRAY<STRUCT<INT64, FLOAT64>>
(a sparse tensor) -
STRUCT
columns that contain any of the following types:- Numeric types
-
STRING
-
BOOL
-
BYTE
-
DATE
-
DATETIME
-
TIME
-
TIMESTAMP
-
ARRAY
columns that contain any of the following types:- Numeric types
-
STRING
-
BOOL
-
BYTE
-
DATE
-
DATETIME
-
TIME
-
TIMESTAMP