The ML.TFDV_DESCRIBE function

This document describes the ML.TFDV_DESCRIBE function, which you can use to generate fine-grained statistics for the columns in a table. For example, you might want statistics for a table of training or serving data that you plan to use with a machine learning (ML) model. Calling this function provides the same behavior as calling the TensorFlow tfdv.generate_statistics_from_csv API. You can use the data output by this function for such purposes as feature preprocessing or model monitoring.

Syntax

  ML.TFDV_DESCRIBE(
    { TABLE `PROJECT_ID.DATASET.TABLE_NAME` | (QUERY_STATEMENT) },
    STRUCT(
      [NUM_HISTOGRAM_BUCKETS AS num_histogram_buckets]
      [, NUM_QUANTILES_HISTOGRAM_BUCKETS AS num_quantiles_histogram_buckets]
      [, NUM_VALUES_HISTOGRAM_BUCKETS AS num_values_histogram_buckets]
      [, NUM_RANK_HISTOGRAM_BUCKETS AS num_rank_histogram_buckets])
  )

Arguments

ML.TFDV_DESCRIBE takes the following arguments:

  • PROJECT_ID : your project ID.
  • DATASET : the BigQuery dataset that contains the table.
  • TABLE_NAME : the name of the input table that contains the training or serving data to calculate statistics for.
  • QUERY_STATEMENT: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the QUERY_STATEMENT clause, see GoogleSQL query syntax.
  • NUM_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a histogram with equal-width buckets. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • NUM_QUANTILES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns. The num_quantiles_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • NUM_VALUES_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a quantiles histogram. Only applies to ARRAY columns. The num_values_histogram_buckets value must be in the range [1, 1,000]. The default value is 10.
  • NUM_RANK_HISTOGRAM_BUCKETS: an INT64 value that specifies the number of buckets to use for a rank histogram. Only applies to categorical and ARRAY<categorical> columns. The num_rank_histogram_buckets value must be in the range [1, 10,000]. The default value is 50.
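
As an illustration of how these arguments combine, the following sketch passes a query statement instead of a table and sets several bucket options at once. The column selection, the filter, and the specific bucket counts are arbitrary choices made for this illustration; they only need to stay within the ranges listed above.

```sql
-- Sketch: use the QUERY_STATEMENT form and override three of the four options.
-- Options that are omitted (here num_values_histogram_buckets) keep their defaults.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT species, flipper_length_mm, body_mass_g
    FROM `bigquery-public-data.ml_datasets.penguins`
    WHERE body_mass_g IS NOT NULL
  ),
  STRUCT(
    20 AS num_histogram_buckets,            -- equal-width histogram, range [1, 1,000]
    5 AS num_quantiles_histogram_buckets,   -- quantiles histogram, range [1, 1,000]
    10 AS num_rank_histogram_buckets        -- rank histogram, range [1, 10,000]
  )
);
```

Because the STRUCT fields are named, they can be supplied in any combination; only the options relevant to the column types in the input have an effect.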

Output

ML.TFDV_DESCRIBE returns a column named dataset_feature_statistics_list that contains a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format.
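
If you want to reuse the statistics later, for example as a baseline for model monitoring, one option is to persist the result in a table. The following is a minimal sketch; the destination `my_project.my_dataset.penguins_stats` is a hypothetical name, and the destination dataset is assumed to exist in the same location as the input data.

```sql
-- Sketch: store the statistics so that they can be queried again later.
-- `my_project.my_dataset.penguins_stats` is a placeholder table name.
CREATE OR REPLACE TABLE `my_project.my_dataset.penguins_stats` AS
SELECT dataset_feature_statistics_list
FROM ML.TFDV_DESCRIBE(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(20 AS num_rank_histogram_buckets)
);
```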

Example

The following example returns statistics for the penguins public dataset and uses 20 buckets for rank histograms for string values:

  SELECT *
  FROM ML.TFDV_DESCRIBE(
    TABLE `bigquery-public-data.ml_datasets.penguins`,
    STRUCT(20 AS num_rank_histogram_buckets)
  );

Limitations

Input data for the ML.TFDV_DESCRIBE function can only contain columns of the following data types:

  • Numeric types
  • STRING
  • BOOL
  • BYTES
  • DATE
  • DATETIME
  • TIME
  • TIMESTAMP
  • ARRAY<STRUCT<INT64, FLOAT64>> (a sparse tensor)
  • STRUCT columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTES
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP
  • ARRAY columns that contain any of the following types:
    • Numeric types
    • STRING
    • BOOL
    • BYTE
    • DATE
    • DATETIME
    • TIME
    • TIMESTAMP
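
When a table contains a column whose type is not in the list above, one possible workaround is to pass a query statement that drops or converts that column. The sketch below assumes a hypothetical table `my_project.my_dataset.trips` with a GEOGRAPHY column named `pickup_location`; both names are placeholders for illustration.

```sql
-- Sketch: remove an unsupported GEOGRAPHY column and keep a STRING
-- representation of it, so that every input column has a supported type.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT
      * EXCEPT (pickup_location),
      ST_ASTEXT(pickup_location) AS pickup_location_wkt  -- GEOGRAPHY to STRING (WKT)
    FROM `my_project.my_dataset.trips`
  ),
  STRUCT()
);
```

Whether the converted column yields useful statistics depends on the data; dropping the column entirely with `* EXCEPT (...)` is the simpler choice when it is not needed.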