The ML.BUCKETIZE function

This document describes the ML.BUCKETIZE function, which lets you split a numerical expression into buckets.

You can use this function with models that support manual feature preprocessing.

Syntax

ML.BUCKETIZE(numerical_expression, array_split_points [, exclude_boundaries] [, output_format])

Arguments

ML.BUCKETIZE takes the following arguments:

numerical_expression: the numerical expression to bucketize.
array_split_points: an array of numerical values that provide the points at which to split the numerical_expression value. The numerical values in the array must be finite: -inf, inf, and NaN aren't allowed. Provide the numerical values in order, lowest to highest. The range of possible buckets is determined by the upper and lower boundaries of the array. For example, if the array_split_points value is [1, 2, 3, 4], then there are five potential buckets that the numerical_expression value can be bucketized into: one below 1, one for each of the three intervals between adjacent split points, and one for 4 and above.
exclude_boundaries: a BOOL value that determines whether the upper and lower boundaries from array_split_points are used. If TRUE, then the boundary values aren't used to create buckets. For example, if the array_split_points value is [1, 2, 3, 4] and exclude_boundaries is TRUE, then only 2 and 3 act as split points, so there are three potential buckets that the numerical_expression value can be bucketized into. The default value is FALSE.
output_format: a STRING value that specifies the output format of the bucket. Valid output formats are as follows:
- bucket_names: returns a STRING value in the format bin_<bucket_index>. For example, bin_3. The bucket_index value starts at 1. This is the default bucket format.
- bucket_ranges: returns a STRING value in the format [lower_bound, upper_bound), in interval notation. For example, (-inf, 2.5), [2.5, 4.6), [4.6, +inf).
- bucket_ranges_json: returns a JSON-formatted STRING value in the format {"start": "lower_bound", "end": "upper_bound"}. For example, {"start": "-Infinity", "end": "2.5"}, {"start": "2.5", "end": "4.6"}, {"start": "4.6", "end": "Infinity"}. The inclusivity and exclusivity of the lower and upper bounds follow the same pattern as the bucket_ranges option.
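
To make the effect of array_split_points and exclude_boundaries more concrete, the following sketch bucketizes a handful of sample values with the [1, 2, 3, 4] array used in the argument descriptions. The sample values and the column aliases are illustrative assumptions, not part of the function.

```sql
-- Illustrative sample values; x is a hypothetical column alias.
SELECT
  x,
  -- Split points 1, 2, 3, 4: five potential buckets.
  ML.BUCKETIZE(x, [1, 2, 3, 4]) AS with_boundaries,
  -- Boundaries excluded: three potential buckets.
  ML.BUCKETIZE(x, [1, 2, 3, 4], TRUE) AS without_boundaries,
  -- Same buckets as with_boundaries, labeled by interval.
  ML.BUCKETIZE(x, [1, 2, 3, 4], FALSE, "bucket_ranges") AS bucket_range
FROM
  UNNEST([0.5, 1.0, 2.5, 4.0, 7.0]) AS x
ORDER BY
  x;
```

Given the lower-inclusive, upper-exclusive ranges described for bucket_ranges, the with_boundaries column should read bin_1, bin_2, bin_3, bin_5, bin_5 for these five values, and the without_boundaries column should read bin_1, bin_1, bin_2, bin_3, bin_3.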

Output

ML.BUCKETIZE returns a STRING value that contains the name of the bucket, in the format specified by the output_format argument.

Example

The following example bucketizes a numerical expression with and without boundaries, and shows the bucket_ranges and bucket_ranges_json output formats:

SELECT
  ML.BUCKETIZE(2.5, [1, 2, 3]) AS bucket,
  ML.BUCKETIZE(2.5, [1, 2, 3], TRUE) AS bucket_without_boundaries,
  ML.BUCKETIZE(2.5, [1, 2, 3], FALSE, "bucket_ranges") AS bucket_ranges,
  ML.BUCKETIZE(2.5, [1, 2, 3], FALSE, "bucket_ranges_json") AS bucket_ranges_json;

The output looks similar to the following:

+--------+---------------------------+---------------+----------------------------+
| bucket | bucket_without_boundaries | bucket_ranges | bucket_ranges_json         |
|--------|---------------------------|---------------|----------------------------|
| bin_3  | bin_2                     | [2, 3)        | {"start": "2", "end": "3"} |
+--------+---------------------------+---------------+----------------------------+
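
For manual feature preprocessing, the function would typically appear in the TRANSFORM clause of a CREATE MODEL statement. The following is a minimal sketch under stated assumptions: the dataset, table, model, and column names (mydataset, trips, trip_fare_model, trip_distance, fare_amount) and the split points are hypothetical, and a linear regression model is used only as one example of a model type.

```sql
-- Hypothetical dataset, table, model, and column names; adjust to your schema.
CREATE OR REPLACE MODEL `mydataset.trip_fare_model`
  TRANSFORM(
    -- Replace the raw distance with a categorical bucket feature.
    ML.BUCKETIZE(trip_distance, [1, 5, 10, 25]) AS trip_distance_bucket,
    -- Pass the label column through unchanged.
    fare_amount
  )
  OPTIONS (
    model_type = 'LINEAR_REG',
    input_label_cols = ['fare_amount']
  ) AS
SELECT
  trip_distance,
  fare_amount
FROM
  `mydataset.trips`;
```

Because the bucketization is declared in the TRANSFORM clause, it is stored with the model, so the same split points are applied to input data at prediction time.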

What's next

For information about feature preprocessing, see Feature preprocessing overview.