The ML.ONE_HOT_ENCODER function
This document describes the `ML.ONE_HOT_ENCODER` function, which lets you
encode a string expression using a
[one-hot](/bigquery/docs/auto-preprocessing#one_hot_encoding) or
[dummy](/bigquery/docs/auto-preprocessing#dummy_encoding) encoding scheme.

The encoding vocabulary is sorted alphabetically. `NULL` values and categories
that aren't in the vocabulary are encoded with an `index` value of `0`. If you
use dummy encoding, the dropped category is encoded with a `value` of `0`.

When used in the
[`TRANSFORM` clause](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create#transform),
the vocabulary and dropped category values calculated during training, along
with the top k and frequency threshold values that you specified, are
automatically used in prediction.
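
As a minimal sketch of that behavior, the following statements train a model whose `TRANSFORM` clause applies the encoder and then run prediction on raw input. The dataset, table, model, and column names (`mydataset`, `training_data`, `new_data`, `my_model`, `category`, `label`) are hypothetical placeholders, and the model type is an arbitrary choice for illustration.

```sql
-- Hypothetical names throughout; adjust to your own dataset and columns.
CREATE OR REPLACE MODEL `mydataset.my_model`
  TRANSFORM(
    -- Dummy-encode the raw category column; the vocabulary and the dropped
    -- category are computed from the training data.
    ML.ONE_HOT_ENCODER(category, 'most_frequent', 10, 0) OVER () AS category_encoded,
    label
  )
  OPTIONS (
    model_type = 'logistic_reg',
    input_label_cols = ['label']
  )
AS
SELECT category, label
FROM `mydataset.training_data`;

-- At prediction time, pass the raw column. The transform stored with the
-- model is applied to it automatically.
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.my_model`,
  (SELECT category FROM `mydataset.new_data`)
);
```

Because the encoder lives in the `TRANSFORM` clause, the prediction query does not repeat the encoding arguments.
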
Syntax
`ML.ONE_HOT_ENCODER(string_expression [, drop] [, top_k] [, frequency_threshold]) OVER()`
Arguments
`ML.ONE_HOT_ENCODER` takes the following arguments:

- `string_expression`: the `STRING` expression to encode.
- `drop`: a `STRING` value that specifies whether the function drops
  a category. Valid values are as follows:
  - `none`: Retain all categories. This is the default value.
  - `most_frequent`: Drop the most frequent category found in
    the string expression. Selecting this value causes the function to use
    dummy encoding.
- `top_k`: an `INT64` value that specifies the number of categories
  included in the encoding vocabulary. The function selects the `top_k`
  most frequent categories in the data and uses those; all other categories
  are encoded to `0`. This value must be less than `1,000,000` to avoid
  problems due to high dimensionality. The default value is `32,000`.
- `frequency_threshold`: an `INT64` value that limits the categories
  included in the encoding vocabulary based on category frequency. The
  function uses categories whose frequency is greater than or equal to
  `frequency_threshold`; categories below this threshold are encoded to `0`.
  The default value is `5`.
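
To show how the optional arguments interact, the following sketch encodes the same small input three ways. The input values and column aliases are made up for illustration. The results described in the comments are what the argument descriptions above imply, not captured output.

```sql
SELECT
  f,
  -- All defaults: drop = 'none', top_k = 32000, frequency_threshold = 5.
  -- No category in this input occurs 5 or more times, so every row would be
  -- expected to fall outside the vocabulary and receive index 0.
  ML.ONE_HOT_ENCODER(f) OVER () AS default_args,
  -- One-hot encoding with the frequency threshold lowered to 1, so every
  -- category that appears can enter the vocabulary.
  ML.ONE_HOT_ENCODER(f, 'none', 32000, 1) OVER () AS all_categories,
  -- Keep only the 2 most frequent categories ('c' and 'b' in this input);
  -- 'a' would be expected to receive index 0.
  ML.ONE_HOT_ENCODER(f, 'none', 2, 1) OVER () AS top_2
FROM UNNEST(['a', 'b', 'b', 'c', 'c', 'c']) AS f
ORDER BY f;
```

The syntax lists the optional arguments positionally, so this sketch supplies `drop` and `top_k` explicitly whenever it sets `frequency_threshold`. With small test inputs, the default threshold of `5` is worth overriding.
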
Output
`ML.ONE_HOT_ENCODER` returns an array of struct values, in the form
`ARRAY<STRUCT<INT64, FLOAT64>>`. The first element in the struct provides the
index of the encoded string expression, and the second element provides the
value of the encoded string expression.
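
Because the result is an array of structs, one way to work with the index and value as ordinary columns is to flatten the array with `UNNEST`. This sketch assumes the struct fields are named `index` and `value`. The input values and the aliases `encoded` and `e` are illustrative.

```sql
WITH encoded AS (
  SELECT
    f,
    -- One-hot encode with no category dropped and no frequency cutoff.
    ML.ONE_HOT_ENCODER(f, 'none', 10, 0) OVER () AS output
  FROM UNNEST(['a', 'b', 'b']) AS f
)
SELECT
  f,
  e.index AS encoded_index,  -- assumed field name
  e.value AS encoded_value   -- assumed field name
FROM encoded, UNNEST(output) AS e
ORDER BY f;
```
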
Example
The following example performs dummy encoding on a set of string expressions.
It limits the encoding vocabulary to the ten most frequent categories and sets
the frequency threshold to `0`, so no category is excluded for occurring too
rarely.

```sql
SELECT f, ML.ONE_HOT_ENCODER(f, 'most_frequent', 10, 0) OVER () AS output
FROM UNNEST([NULL, 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd']) AS f
ORDER BY f;
```

The output looks similar to the following:

```
+------+--------------+--------------+
| f    | output.index | output.value |
+------+--------------+--------------+
| NULL | 0            | 1.0          |
| a    | 1            | 1.0          |
| b    | 2            | 1.0          |
| b    | 2            | 1.0          |
| c    | 3            | 0.0          |
| c    | 3            | 0.0          |
| c    | 3            | 0.0          |
| d    | 4            | 1.0          |
| d    | 4            | 1.0          |
+------+--------------+--------------+
```

What's next
- For information about feature preprocessing, see [Feature preprocessing overview](/bigquery/docs/preprocess-overview).
- For information about the supported SQL statements and functions for each model type, see [End-to-end user journey for each model](/bigquery/docs/e2e-journey).

Last updated 2025-08-29 UTC.