GoogleSQL for BigQuery supports the following text analysis functions.
Function list
Name | Summary |
---|---|
BAG_OF_WORDS | Gets the frequency of each term (token) in a tokenized document. |
TEXT_ANALYZE | Extracts terms (tokens) from text and converts them into a tokenized document. |
TF_IDF | Evaluates how relevant a term (token) is to a tokenized document in a set of tokenized documents. |
BAG_OF_WORDS
BAG_OF_WORDS(tokenized_document)
Description
Gets the frequency of each term (token) in a tokenized document.
Definitions
- tokenized_document: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.
Return type
ARRAY<STRUCT<term STRING, count INT64>>
Definitions:
- term: A unique term in the tokenized document.
- count: The number of times the term was found in the tokenized document.
Examples
The following query produces terms and their frequencies in two tokenized documents:
WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
  )
SELECT id, BAG_OF_WORDS(f) AS results
FROM ExampleTable
ORDER BY id;
/*----+------------------------------------------------*
| id | results |
+----+------------------------------------------------+
| 1 | [(null, 1), ('I', 1), ('like', 1), ('pie', 3)] |
| 2 | [(null, 1), ('pie', 1), ('yum', 2)] |
*----+------------------------------------------------*/
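Because the return value is an array of STRUCT values, it can be flattened back into one row per term. The following query is a sketch rather than part of the official examples; it reuses the same ExampleTable data and relies only on a correlated UNNEST:
WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
  )
SELECT
  id,
  result.term,   -- a unique term in the tokenized document
  result.count   -- how many times that term appears in the document
FROM ExampleTable
CROSS JOIN UNNEST(BAG_OF_WORDS(f)) AS result
ORDER BY id, result.term;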
TEXT_ANALYZE
TEXT_ANALYZE(
  text
  [, analyzer => { 'LOG_ANALYZER' | 'NO_OP_ANALYZER' | 'PATTERN_ANALYZER' }]
  [, analyzer_options => analyzer_options_values]
)
Description
Extracts terms (tokens) from text and converts them into a tokenized document.
Definitions
- text: STRING value that represents the input text to tokenize.
- analyzer: A named argument with a STRING value. Determines which analyzer to use to convert text into an array of terms (tokens). This can be:
  - 'LOG_ANALYZER' (default): Breaks the input into terms when delimiters are encountered and then normalizes the terms. If analyzer isn't specified, this is used by default. For more information, see LOG_ANALYZER text analyzer.
  - 'NO_OP_ANALYZER': Extracts the text as a single term (token), but doesn't apply normalization. For more information, see NO_OP_ANALYZER text analyzer.
  - 'PATTERN_ANALYZER': Breaks the input into terms that match a regular expression. For more information, see PATTERN_ANALYZER text analyzer.
- analyzer_options: A named argument with a JSON-formatted STRING value. Takes a list of text analysis rules. For more information, see Text analyzer options. A sketch that uses this argument appears at the end of the examples below.
Details
There is no guarantee on the order of the tokens produced by this function.
If no analyzer is specified, the LOG_ANALYZER analyzer is used by default.
Return type
ARRAY<STRING>
Examples
The following query uses the default text analyzer, LOG_ANALYZER, with the input text:
SELECT TEXT_ANALYZE('I like pie, you like-pie, they like 2 PIEs.') AS results
/*--------------------------------------------------------------------------*
| results |
+--------------------------------------------------------------------------+
| ['i', 'like', 'pie', 'you', 'like', 'pie', 'they', 'like', '2', 'pies' ] |
*--------------------------------------------------------------------------*/
The following query uses the NO_OP_ANALYZER text analyzer with the input text:
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'NO_OP_ANALYZER'
) AS results
/*-----------------------------------------------*
| results |
+-----------------------------------------------+
| 'I like pie, you like-pie, they like 2 PIEs.' |
*-----------------------------------------------*/
The following query uses the PATTERN_ANALYZER text analyzer with the input text:
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'PATTERN_ANALYZER'
) AS results
/*----------------------------------------------------------------*
| results |
+----------------------------------------------------------------+
| ['like', 'pie', 'you', 'like', 'pie', 'they', 'like', 'pies' ] |
*----------------------------------------------------------------*/
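The analyzer_options argument isn't used in the queries above. The following query is only a sketch: it assumes the LOG_ANALYZER options format documented on the Text analyzer options page, and the JSON field names shown here (token_filters, stop_words) are assumptions to verify against that page rather than confirmed syntax:
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'LOG_ANALYZER',
  -- Assumed option format: drop the listed stop words from the tokenized output.
  analyzer_options => '{"token_filters": [{"stop_words": ["they", "you"]}]}'
) AS results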
For additional examples that include analyzer options, see Text analysis .
For helpful analyzer recipes that you can use to enhance analyzer-supported queries, see Search with text analyzers .
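Because TEXT_ANALYZE returns ARRAY<STRING>, its output is itself a tokenized document and can be passed directly to the other functions on this page. A minimal sketch, not part of the official examples:
-- Tokenize free text, then count each resulting term with BAG_OF_WORDS.
SELECT BAG_OF_WORDS(
  TEXT_ANALYZE('I like pie, you like-pie, they like 2 PIEs.')
) AS results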
TF_IDF
TF_IDF(tokenized_document) OVER()
TF_IDF(tokenized_document, max_distinct_tokens) OVER()
TF_IDF(tokenized_document, max_distinct_tokens, frequency_threshold) OVER()
Description
Evaluates how relevant a term is to a tokenized document in a set of tokenized documents, using the TF-IDF (term frequency-inverse document frequency) algorithm.
Definitions
- tokenized_document: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.
- max_distinct_tokens: Optional argument. Takes a non-negative INT64 value, which represents the size of the dictionary, excluding the unknown term. Terms are added to the dictionary until this threshold is met. So, if this value is 20, the first 20 unique terms are added and then no additional terms are added. If this argument isn't provided, the default value is 32000. If this argument is specified, the maximum value is 1048576.
- frequency_threshold: Optional argument. Takes a non-negative INT64 value that represents the minimum number of times a term must appear in a tokenized document to be included in the dictionary. So, if this value is 3, a term must appear at least three times in the tokenized document to be added to the dictionary. If this argument isn't provided, the default value is 5.
Details
This function uses a TF-IDF (term frequency-inverse document frequency) algorithm to compute the relevance of terms in a set of tokenized documents. TF-IDF multiplies two metrics: how many times a term appears in a document (term frequency), and the inverse document frequency of the term across a collection of documents (inverse document frequency).
- TF-IDF: term frequency * inverse document frequency
- term frequency: (count of term in document) / (document size)
- inverse document frequency: log(1 + document set size / (1 + count of documents containing term))
Terms are added to a dictionary of terms if they satisfy the criteria for max_distinct_tokens and frequency_threshold; otherwise, they are considered the unknown term. The unknown term is always the first term in the dictionary and is represented as NULL. The rest of the dictionary is ordered by term frequency rather than alphabetically.
Return type
ARRAY<STRUCT<term STRING, tf_idf DOUBLE>>
Definitions:
- term: The unique term that was added to the dictionary.
- tf_idf: The TF-IDF computation for the term.
Examples
The following query computes the relevance of up to 10 terms that appear at least twice in a set of tokenized documents. In this example, 10 represents max_distinct_tokens and 2 represents frequency_threshold:
WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
    SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
    SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
  )
SELECT id, TF_IDF(f, 10, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;
/*----+-------------------------------------------------*
| id | results |
+----+-------------------------------------------------+
| 1 | [{"index":null,"value":"0.1304033435859887"}, |
| | {"index":"I","value":"0.1412163100645339"}, |
| | {"index":"like","value":"0.1412163100645339"}, |
| | {"index":"pie","value":"0.29389333245105953"}] |
+----+-------------------------------------------------+
| 2 | [{"index":null,"value":"0.1956050153789831"}, |
| | {"index":"pie","value":"0.14694666622552977"}, |
| | {"index":"yum","value":"0.4236489301936017"}] |
+----+-------------------------------------------------+
| 3 | [{"index":null,"value":"0.1956050153789831"}, |
| | {"index":"I","value":"0.21182446509680086"}, |
| | {"index":"pie","value":"0.14694666622552977"}, |
| | {"index":"yum","value":"0.21182446509680086"}] |
+----+-------------------------------------------------+
| 4 | [{"index":null,"value":"0.4694520369095594"}, |
| | {"index":"like","value":"0.1694595720774407"}, |
| | {"index":"pie","value":"0.11755733298042381"}] |
*----+-------------------------------------------------*/
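As a rough check on the formula in the Details section, the value for 'pie' in document 1 can be reproduced by hand, assuming the natural logarithm and a document size of 6 (that is, the NULL entry counts toward the size):
- term frequency: 3 / 6 = 0.5
- inverse document frequency: log(1 + 4 / (1 + 4)) = log(1.8) ≈ 0.58779
- TF-IDF: 0.5 * 0.58779 ≈ 0.29389
This matches the 0.29389333245105953 reported for 'pie' in the first row.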
The following query computes the relevance of up to three terms that appear at least twice in a set of tokenized documents. In this example, 3 represents max_distinct_tokens and 2 represents frequency_threshold:
WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
    SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
    SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
  )
SELECT id, TF_IDF(f, 3, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;
/*----+-------------------------------------------------*
| id | results |
+----+-------------------------------------------------+
| 1 | [{"index":null,"value":"0.12679902142647365"}, |
| | {"index":"I","value":"0.1412163100645339"}, |
| | {"index":"like","value":"0.1412163100645339"}, |
| | {"index":"pie","value":"0.29389333245105953"}] |
+----+-------------------------------------------------+
| 2 | [{"index":null,"value":"0.5705955964191315"}, |
| | {"index":"pie","value":"0.14694666622552977"}] |
+----+-------------------------------------------------+
| 3 | [{"index":null,"value":"0.380397064279421"}, |
| | {"index":"I","value":"0.21182446509680086"}, |
| | {"index":"pie","value":"0.14694666622552977"}] |
+----+-------------------------------------------------+
| 4 | [{"index":null,"value":"0.45647647713530515"}, |
| | {"index":"like","value":"0.1694595720774407"}, |
| | {"index":"pie","value":"0.11755733298042381"}] |
*----+-------------------------------------------------*/