The ML.QUANTILE_BUCKETIZE function

This document describes the ML.QUANTILE_BUCKETIZE function, which lets you break a continuous numerical feature into buckets based on quantiles.

When used in the TRANSFORM clause , the same quantiles are automatically used in prediction.

You can use this function with models that support manual feature preprocessing . For more information, see the following documents:

Syntax

ML.QUANTILE_BUCKETIZE(numerical_expression, num_buckets [, output_format]) OVER()

Arguments

ML.QUANTILE_BUCKETIZE takes the following arguments:

numerical_expression : the numerical expression to bucketize.
num_buckets : an INT64 value that specifies the number of buckets to split numerical_expression into.
output_format : a STRING value that specifies the output format of the bucket. Valid output formats are as follows:
- bucket_names : returns a STRING value in the format bin_<bucket_index> . For example, bin_3 . The bucket_index value starts at 1. This is the default bucket format.
- bucket_ranges : returns a STRING value in the format [lower_bound, upper_bound) in interval notation . For example, (-inf, 2.5) , [2.5, 4.6) , [4.6, +inf) .
- bucket_ranges_json : returns a JSON-formatted STRING value in the format {"start": "lower_bound", "end": "upper_bound"} . For example, {"start": "-Infinity", "end": "2.5"} , {"start": "2.5", "end": "4.6"} , {"start": "4.6", "end": "Infinity"} . The inclusivity and exclusivity of the lower and upper bound follow the same pattern as the bucket_ranges option.

Output

ML.QUANTILE_BUCKETIZE returns a STRING value that contains the name of the bucket, in the format specified by the output_format argument.

Example

The following example breaks a numerical expression of five elements into three buckets:

 SELECT 
  
 f 
 , 
  
 ML 
 . 
 QUANTILE_BUCKETIZE 
 ( 
 f 
 , 
  
 3 
 ) 
  
 OVER 
 () 
  
 AS 
  
 bucket 
 , 
  
 ML 
 . 
 QUANTILE_BUCKETIZE 
 ( 
 f 
 , 
  
 3 
 , 
  
 "bucket_ranges" 
 ) 
  
 OVER 
 () 
  
 AS 
  
 bucket_ranges 
 , 
  
 ML 
 . 
 QUANTILE_BUCKETIZE 
 ( 
 f 
 , 
  
 3 
 , 
  
 "bucket_ranges_json" 
 ) 
  
 OVER 
 () 
  
 AS 
  
 bucket_ranges_json 
 FROM 
  
 UNNEST 
 ([ 
 1 
 , 
 2 
 , 
 3 
 , 
 4 
 , 
 5 
 ]) 
  
 AS 
  
 f 
 ORDER 
  
 BY 
  
 f 
 ;

The output looks similar to the following:

+---+--------+---------------+------------------------------------+
| f | bucket | bucket_ranges | bucket_ranges_json                 |
|---|--------|---------------|------------------------------------|
| 1 | bin_1  | (-inf, 2)     | {"start": "-Infinity", "end": "2"} |
| 2 | bin_2  | [2, 4)        | {"start": "2", "end": "4"}         |
| 3 | bin_2  | [2, 4)        | {"start": "2", "end": "4"}         |
| 4 | bin_3  | [4, +inf)     | {"start": "4", "end": "Infinity"}  |
| 5 | bin_3  | [4, +inf)     | {"start": "4", "end": "Infinity"}  |
+---+--------+---------------+------------------------------------+

What's next

For information about feature preprocessing, see Feature preprocessing overview .