The ML.CORRELATION function
This document describes the ML.CORRELATION
function, which calculates
statistical correlations between a target column and one or more metric columns. ML.CORRELATION
offers the following features:
- Multi-column matrix: correlates one target variable against multiple metrics simultaneously.
- Dimensional slicing: automatically computes correlations for all
combinations of specified dimensions, similar to a
GROUP BY CUBEoperation. - Flexible methods: supports Pearson , Spearman , and Kendall correlation methods.
Syntax
ML.CORRELATION( { TABLE TABLE_NAME | ( QUERY_STATEMENT ) }, target_col => TARGET_COL , target_correlation_cols => TARGET_CORRELATION_COLS [, dimension_cols => DIMENSION_COLS , ] [, method => METHOD ] );
Arguments
The ML.CORRELATION
function takes the following arguments:
-
TABLE_NAME: the name of a BigQuery table that contains the data to analyze. -
QUERY_STATEMENT: a SQL query whose results contain the data to analyze. -
TARGET_COL: aSTRINGthat contains the name of the primary numerical column to analyze. -
TARGET_CORRELATION_COLS: aSTRINGorARRAY<STRING>value that contains the names of one or more numerical columns to correlate against theTARGET_COLcolumn. -
DIMENSION_COLS: aSTRINGorARRAY<STRING>value that contains the names of columns to slice the data by. The function calculates correlations for every combination of these dimensions. You can specify a maximum of 12 columns. Each column must be a groupable type . -
METHOD: aSTRINGthat specifies the statistical method to use for correlation. Supported values arePEARSON,SPEARMAN, andKENDALL. The default value isPEARSON. TheKENDALLmethod has higher complexity and can be slow on large datasets. For large tables, we recommend that you usePEARSONorSPEARMAN.
Output
The ML.CORRELATION
function returns a table where each row
represents the correlation for a specific pair of columns
within a specific data segment. The results are sorted by segment_size
in descending order, and then by corr_col
in ascending order.
The output table contains the following columns:
-
segment: anARRAY<STRUCT<dimension_col STRING, dimension_value JSON>>value that contains the key-value pair for each dimension. TheJSONvalue fordimension_valueis generated using theTO_JSONfunction. -
dimension_col: a column fromdimension_cols, if you specified any dimension columns. The output includes one column for each dimension specified. ANULLvalue in one of these columns indicates either a trueNULLvalue in the column or a placeholderNULLvalue that means the column was part of a rollup. This is conceptually similar to the presence ofNULLplaceholder values generated when you use grouping sets . To determine whether theNULLvalue is from the column itself, check for aNULLvalue in thedimension_valuefield for that column in thesegmentcolumn of the output. -
target_col: aSTRINGvalue that contains the name of the input target column. -
corr_col: aSTRINGvalue that contains the name of the metric column being correlated against the target column. -
correlation: aFLOAT64value that contains the correlation coefficient in the range of-1.0to1.0. -
segment_size: anINT64value that contains the number of rows used to calculate the correlation for this segment. -
segment_proportion: aFLOAT64value that contains the fraction of total rows in the input table (segment_size/total rows) that belong to this segment.
Examples
The following examples show how to use the ML.CORRELATION
function with
the my_dataset.marketing_sample
table:
CREATE
OR
REPLACE
TABLE
my_dataset
.
marketing_sample
AS
(
-- New York data
SELECT
'USA'
AS
country
,
'New York'
AS
city
,
'Electronics'
AS
product_category
,
100
AS
ad_spend
,
150
AS
budget
,
1000
AS
revenue
UNION
ALL
SELECT
'USA'
,
'New York'
,
'Electronics'
,
150
,
200
,
1500
UNION
ALL
SELECT
'USA'
,
'New York'
,
'Apparel'
,
200
,
250
,
1800
UNION
ALL
-- Seattle data
SELECT
'USA'
,
'Seattle'
,
'Apparel'
,
220
,
240
,
2400
UNION
ALL
SELECT
'USA'
,
'Seattle'
,
'Apparel'
,
300
,
320
,
2700
UNION
ALL
-- London data (Genuine NULL country)
SELECT
NULL
,
'London'
,
'Electronics'
,
100
,
120
,
500
UNION
ALL
SELECT
NULL
,
'London'
,
'Electronics'
,
200
,
220
,
900
UNION
ALL
-- Missing city data (Genuine NULL city)
SELECT
NULL
,
NULL
,
'Apparel'
,
200
,
200
,
1000
UNION
ALL
SELECT
NULL
,
NULL
,
'Apparel'
,
250
,
250
,
1200
);
/*---------+----------+------------------+----------+--------+---------+
| country | city | product_category | ad_spend | budget | revenue |
+---------+----------+------------------+----------+--------+---------+
| USA | New York | Electronics | 100 | 150 | 1000 |
| USA | New York | Electronics | 150 | 200 | 1500 |
| USA | New York | Apparel | 200 | 250 | 1800 |
| USA | Seattle | Apparel | 220 | 240 | 2400 |
| USA | Seattle | Apparel | 300 | 320 | 2700 |
| NULL | London | Electronics | 100 | 120 | 500 |
| NULL | London | Electronics | 200 | 220 | 900 |
| NULL | NULL | Apparel | 200 | 200 | 1000 |
| NULL | NULL | Apparel | 250 | 250 | 1200 |
+---------+----------+------------------+----------+--------+---------*/
Calculate Pearson correlation
The following example calculates the Pearson correlation between revenue
and ad_spend
from the table my_dataset.marketing_sample
and uses country
as
a dimension column:
SELECT
country
,
segment
,
correlation
,
segment_size
FROM
ML
.
CORRELATION
(
TABLE
my_dataset
.
marketing_sample
,
target_col
=
>
'revenue'
,
target_correlation_cols
=
>
'ad_spend'
,
dimension_cols
=
>
[
'country'
]
);
/*---------+--------------------------------------------------------+-------------+--------------+
| country | segment | correlation | segment_size |
+---------+--------------------------------------------------------+-------------+--------------+
| NULL | [] | 0.698 | 9 |
| 'USA' | [{dimension_col: 'country', dimension_value: '"USA"'}] | 0.968 | 5 |
| NULL | [{dimension_col: 'country', dimension_value: 'null'}] | 0.990 | 4 |
+---------+--------------------------------------------------------+-------------+--------------*/
The first
row of the result contains NULL
for country because it corresponds to
aggregation over all countries. The third row of the result corresponds to a
genuine NULL
in the input data for country because the dimension_value
field is NULL
.
Calculate correlation for multiple columns
The following example calculates the correlation of revenue
with ad_spend
and budget
, sliced by city
and product_category
, from
the table my_dataset.marketing_sample
:
SELECT
*
FROM
ML
.
CORRELATION
(
(
SELECT
*
FROM
my_dataset
.
marketing_sample
WHERE
country
=
'USA'
),
target_col
=
>
'revenue'
,
target_correlation_cols
=
>
[
'ad_spend'
,
'budget'
]
,
dimension_cols
=
>
[
'city'
,
'product_category'
]
)
ORDER
BY
segment_size
DESC
,
corr_col
LIMIT
5
;
/*---------------------------------------------------------+----------+------------------+------------+----------+-------------+--------------+--------------------+
| segment | city | product_category | target_col | corr_col | correlation | segment_size | segment_proportion |
+---------------------------------------------------------+----------+------------------+------------+----------+-------------+--------------+--------------------+
| [] | NULL | NULL | revenue | ad_spend | 0.968 | 5 | 1.0 |
| [] | NULL | NULL | revenue | budget | 0.924 | 5 | 1.0 |
| [{dimension_col: 'city', dimension_value: '"New York"'}]| New York | NULL | revenue | ad_spend | 0.990 | 3 | 0.6 |
| [{dimension_col: 'product_category', ...: '"Apparel"'}] | NULL | Apparel | revenue | ad_spend | 0.866 | 3 | 0.6 |
| [{dimension_col: 'city', dimension_value: '"New York"'}]| New York | NULL | revenue | budget | 0.990 | 3 | 0.6 |
+---------------------------------------------------------+----------+------------------+------------+----------+-------------+--------------+--------------------*/
Distinguish between NULLs caused by global aggregates and NULLs caused by missing data
The following example shows how to use the segment
column to label your rows
clearly for reporting. If a city is NULL
and isn't
in the segment array,
it's a global aggregate. If a city is NULL
and is
in the segment array, it's
missing data.
SELECT
-- Create a clean label for reporting
CASE
-- If 'city' is NULL and not in the segment array, it's a global rollup
WHEN
city
IS
NULL
AND
NOT
EXISTS
(
SELECT
1
FROM
UNNEST
(
segment
)
s
WHERE
s
.
dimension_col
=
'city'
)
THEN
'ALL CITIES (Global)'
-- If 'city' is NULL and in the segment array, it's missing data
WHEN
city
IS
NULL
THEN
'UNKNOWN CITY'
ELSE
city
END
AS
city_label
,
correlation
,
segment_size
FROM
ML
.
CORRELATION
(
TABLE
my_dataset
.
marketing_sample
,
target_col
=
>
'revenue'
,
target_correlation_cols
=
>
'ad_spend'
,
dimension_cols
=
>
[
'city'
]
);
/*---------------------+-------------+--------------+
| city_label | correlation | segment_size |
+---------------------+-------------+--------------+
| ALL CITIES (Global) | 0.698 | 9 |
| New York | 0.990 | 3 |
| UNKNOWN CITY | 1.0 | 2 |
| London | 1.0 | 2 |
| Seattle | 1.0 | 2 |
+---------------------+-------------+--------------*/

