The ML.DESCRIBE_DATA function

This document describes the ML.DESCRIBE_DATA function, which you can use to generate descriptive statistics for the columns in a table or subquery. For example, you might want to know statistics for a table of training or serving data that you plan to use with a machine learning (ML) model. You can use the data output by this function for such purposes as feature preprocessing or model monitoring .

Syntax

 ML 
 . 
 DESCRIBE_DATA 
 ( 
  
 { 
  
 TABLE 
  
`  PROJECT_ID 
 
 . 
  DATASET 
 
 . 
  TABLE_NAME 
 
`  
 | 
  
 ( 
  QUERY_STATEMENT 
 
 ) 
  
 } 
 , 
  
 STRUCT 
 ( 
  
 [ 
  NUM_QUANTILES 
 
  
 AS 
  
 num_quantiles 
 ] 
  
 [, 
  
  NUM_ARRAY_LENGTH_QUANTILES 
 
  
 AS 
  
 num_array_length_quantiles 
 ] 
  
 [, 
  
  TOP_K 
 
  
 AS 
  
 top_k 
 ]) 
 ) 

Arguments

ML.DESCRIBE_DATA takes the following arguments:

  • PROJECT_ID : your project ID.
  • DATASET : the BigQuery dataset that contains the table.
  • TABLE_NAME : the name of the input table that contains the training or serving data to calculate statistics for.
  • QUERY_STATEMENT : a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the QUERY_STATEMENT clause, see GoogleSQL query syntax .
  • NUM_QUANTILES : an INT64 value that specifies the number of quantiles to return for numerical, ARRAY<numerical> , and ARRAY<STRUCT<INT64, numerical>> columns. This affects the number of results shown in the quantiles output column. These quantiles describe the distribution of the data in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. The NUM_QUANTILES value must be in the range [1, 100,000] . The default value is 2 .
  • NUM_ARRAY_LENGTH_QUANTILES : an INT64 value that specifies the number of quantiles to return for ARRAY columns. This affects the number of results shown in the array_length_quantiles output column. These quantiles describe the distribution of the length of the arrays in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. The NUM_ARRAY_LENGTH_QUANTILES value must be in the range [1, 100,000] . The default value is 10 .
  • TOP_K : an INT64 value that specifies the number of top values to return for categorical and ARRAY<categorical> columns. This affects the number of results shown in the top_values output column. The top values are the values that are shown most frequently in the column. The TOP_K value must be in the range [1, 10,000] . The default value is 1 .

Details

ML.DESCRIBE_DATA handles input columns as follows:

  • ARRAY columns are unnested before statistics are computed on them.
  • ARRAY<STRUCT<INT64, numerical>> . The INT64 value is the index, and the numerical value is the value. For statistics computation, BigQuery ML treats columns of this type as ARRAY<numerical> based on the value. The value of the dimension column in the output is MAX(index) + 1 .
  • STRUCT fields are expanded, and then categorical columns are cast to STRING and numerical columns are cast to FLOAT64 .
  • Columns of the following data types are cast to STRING and return the same statistics as STRING columns:
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP Columns of the following data types are cast to FLOAT64 and return the same statistics as FLOAT64 columns:
    • INT64
    • NUMERIC
    • BIGNUMERIC

Output

ML.DESCRIBE_DATA returns one row for each column in the input data. ML.DESCRIBE_DATA output contains the following columns:

  • name : a STRING column that contains the name of the input column.
  • num_rows : an INT64 column that contains the total number of rows for the input column.
  • num_nulls : an INT64 column that returns the number of NULL values found in the column.
  • num_zeros : an INT64 column that contains one of the following:
    • For numerical, ARRAY<numerical> , and ARRAY<STRUCT<INT64, numerical>> input columns, returns the number of 0 values found in the column.
    • For categorical or ARRAY<categorical> input columns, returns NULL .
  • min : a STRING column that contains the MIN value for the column.
  • max : a STRING column that contains the MAX value for the column.
  • mean : a FLOAT64 column that contains one of the following:
    • For numerical, ARRAY<numerical> , and ARRAY<STRUCT<INT64, numerical>> input columns, returns the mean value calculated for the column.
    • For categorical or ARRAY<categorical> input columns, returns NULL .
  • stdev : a FLOAT64 column that contains one of the following:
    • For numerical, ARRAY<numerical> , and ARRAY<STRUCT<INT64, numerical>> input columns, returns the standard deviation value calculated for the column.
    • For categorical or ARRAY<categorical> input columns, returns NULL .
  • median : a FLOAT64 column that contains one of the following:
    • For numerical, ARRAY<numerical> , and ARRAY<STRUCT<INT64, numerical>> input columns, returns the median value calculated for the column.
    • For categorical or ARRAY<categorical> input columns, returns NULL .
  • quantiles : an ARRAY<FLOAT64> column that contains information about the quantiles in an input column, as computed by the APPROX_QUANTILES function. The quantiles column contains one of the following values:
    • For numerical, ARRAY<numerical> , and ARRAY<STRUCT<INT64, numerical>> input columns, returns the quantiles computed for the column.
    • For categorical or ARRAY<categorical> input columns, returns NULL .
  • unique : an INT64 column that contains information about the number of unique values in an input column, as computed by the APPROX_COUNT_DISTINCT function. The unique column contains one of the following values:
    • For numerical, ARRAY<numerical> , and ARRAY<STRUCT<INT64, numerical>> input columns, returns NULL .
    • For categorical or ARRAY<categorical> input columns, returns the number of unique values in the input column.
  • avg_string_length : a FLOAT64 column that contains one of the following:
    • For numerical, ARRAY<numerical> , and ARRAY<STRUCT<INT64, numerical>> input columns, returns NULL .
    • For categorical or ARRAY<categorical> input columns, returns the average length of the values in the column.
  • num_values : an INT64 column that contains the number of array elements for ARRAY columns, and the number of values in the column for other types of columns.
  • top_values : a ARRAY<STRUCT<STRING, INT64>> column that contains information about the top values and number of occurrences in an input column, as computed by the APPROX_TOP_COUNT function. The top_values column contains the following fields:
    • top_values.value : a STRING field that contains one of the following values:
      • For numerical, ARRAY<numerical> , and ARRAY<STRUCT<INT64, numerical>> input columns, returns NULL .
      • For categorical or ARRAY<categorical> input columns, returns one of the top values in the input column.
    • top_values.count : an INT64 field that contains one of the following values:
      • For numerical, ARRAY<numerical> , and ARRAY<STRUCT<INT64, numerical>> input columns, returns NULL .
      • For categorical or ARRAY<categorical> input columns, returns the number of times the related top value appears.
  • min_array_length : an INT64 column that contains one of the following values:
    • For ARRAY input columns, returns the minimum length of an array in the column.
    • For other types of input columns, returns NULL .
  • max_array_length : an INT64 column that contains one of the following values:
    • For ARRAY input columns, returns the maximum length of an array in the column.
    • For other types of input columns, returns NULL .
  • avg_array_length : a FLOAT64 column that contains one of the following values:
    • For ARRAY input columns, returns the average length of an array in the column.
    • For other types of input columns, returns NULL .
  • total_array_length : an INT64 column that contains one of the following values:
    • For ARRAY input columns, returns the sum of the size of the arrays in the column.
    • For other types of input columns, returns NULL .
  • array_length_quantiles : an ARRAY<INT64> column that contains the information about the quantiles for the array length in an input column, as computed by the APPROX_QUANTILES function. The array_length_quantiles column contains one of the following values:
    • For ARRAY input columns, returns the quantiles for the array length computed for the column.
    • For other types of input columns, returns 0 .
  • dimension : an INT64 column that contains one of the following:
    • For ARRAY<STRUCT<INT64, numerical>> input columns, returns the dimension computed for the column, which is MAX(index) + 1 for sparse input .
    • For other types of input columns, returns NULL .

Example

The following example returns statistics for a table with five quantiles calculated for numeric columns and three top values returned for non-numeric columns:

 SELECT 
  
 * 
 FROM 
  
 ML 
 . 
 DESCRIBE_DATA 
 ( 
  
 TABLE 
  
 ` 
 myproject 
 . 
 mydataset 
 . 
 mytable 
 ` 
 , 
  
 STRUCT 
 ( 
 5 
  
 AS 
  
 num_quantiles 
 , 
  
 3 
  
 AS 
  
 top_k 
 ) 
 ); 

Limitations

Input data for the ML.DESCRIBE_DATA function can only contain columns of the following data types:

  • Numeric types
  • STRING
  • BOOL
  • BYTE
  • DATE
  • DATETIME
  • TIME
  • TIMESTAMP
  • ARRAY<STRUCT<INT64, FLOAT64>> (a sparse tensor)
  • STRUCT columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP
  • ARRAY columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP
Design a Mobile Site
View Site in Mobile | Classic
Share by: