Get data insights from a contribution analysis model
In this tutorial, you use a contribution analysis model to analyze taxi fare changes between 2011 and 2012 in New York City. This tutorial guides you through performing the following tasks:
- Creating an input table based on publicly available taxi data.
- Creating a contribution analysis model that uses a summable metric . This type of model summarizes a given metric for a combination of one or more dimensions in the data, to determine how those dimensions contribute to the metric value.
- Get the metric insights from the model by using the
ML.GET_INSIGHTS
function .
Before starting this tutorial, you should be familiar with the contribution analysis use case.
Required permissions
-
To create the dataset, you need the
bigquery.datasets.create
Identity and Access Management (IAM) permission. -
To create the model, you need the following permissions:
-
bigquery.jobs.create
-
bigquery.models.create
-
bigquery.models.getData
-
bigquery.models.updateData
-
-
To run inference, you need the following permissions:
-
bigquery.models.getData
-
bigquery.jobs.create
-
Costs
In this document, you use the following billable components of Google Cloud:
- BigQuery ML : You incur costs for the data that you process in BigQuery.
To generate a cost estimate based on your projected usage,
use the pricing calculator
.
For more information about BigQuery pricing, see BigQuery pricing in the BigQuery documentation.
Before you begin
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project .
-
Enable the BigQuery API.
Create a dataset
Create a BigQuery dataset to store your ML model:
-
In the Google Cloud console, go to the BigQuery page.
-
In the Explorerpane, click your project name.
-
Click View actions > Create dataset.
-
On the Create datasetpage, do the following:
-
For Dataset ID, enter
bqml_tutorial
. -
For Location type, select Multi-region, and then select US (multiple regions in United States).
The public datasets are stored in the
US
multi-region . For simplicity, store your dataset in the same location. -
Leave the remaining default settings as they are, and click Create dataset.
-
Create a table of input data
Create a table that contains test and control data to analyze. The following query creates two intermediate tables, a test table with taxi data from 2012 and a control table with taxi data from 2011, and then performs a union of the intermediate tables to create a table with both test and control rows and the same set of columns.
-
In the Google Cloud console, go to the BigQuerypage.
-
In the query editor, run the following statement:
CREATE OR REPLACE TABLE bqml_tutorial . taxi_control_and_test AS ( SELECT vendor_id , passenger_count , payment_type , pickup_location_id , EXTRACT ( MONTH FROM pickup_datetime ) AS month , AVG ( total_amount ) AS avg_total_fare , FALSE AS is_test FROM ` bigquery - public - data . new_york_taxi_trips . tlc_yellow_trips_2011 ` WHERE total_amount > 0 GROUP BY vendor_id , passenger_count , payment_type , pickup_location_id , month , is_test ) UNION ALL ( SELECT vendor_id , passenger_count , payment_type , pickup_location_id , EXTRACT ( MONTH FROM pickup_datetime ) AS month , AVG ( total_amount ) AS avg_total_fare , TRUE AS is_test FROM ` bigquery - public - data . new_york_taxi_trips . tlc_yellow_trips_2012 ` WHERE total_amount > 0 GROUP BY vendor_id , passenger_count , payment_type , pickup_location_id , month , is_test );
Create the model
Create a contribution analysis model:
-
In the Google Cloud console, go to the BigQuerypage.
-
In the query editor, run the following statement:
CREATE OR REPLACE MODEL ` bqml_tutorial . taxi_contribution_analysis_model ` OPTIONS ( MODEL_TYPE = 'CONTRIBUTION_ANALYSIS' , CONTRIBUTION_METRIC = 'SUM(avg_total_fare)' , DIMENSION_ID_COLS = [ 'vendor_id' , 'passenger_count' , 'pickup_location_id' , 'payment_type' , 'month' ], IS_TEST_COL = 'is_test' , MIN_APRIORI_SUPPORT = 0 . 05 ) AS SELECT * FROM bqml_tutorial . taxi_control_and_test ;
The query takes approximately 20 seconds to complete, after which the model taxi_contribution_analysis_model
appears in the bqml_tutorial
dataset in
the Explorerpane. Because the query uses a CREATE MODEL
statement to
create a model, there are no query results.
Get insights from the model
Get insights generated by the contribution analysis model by using the ML.GET_INSIGHTS
function.
-
In the Google Cloud console, go to the BigQuerypage.
-
In the query editor, run the following statement:
SELECT contributors , metric_test , metric_control , difference , relative_difference , unexpected_difference , relative_unexpected_difference , apriori_support FROM ML . GET_INSIGHTS ( MODEL ` bqml_tutorial . taxi_contribution_analysis_model ` ) WHERE relative_difference IS NOT NULL ORDER BY unexpected_difference DESC ;
The
WHERE
clause excludes the results for contributors that are present in the test set, but not in the control set.If you had used
SELECT *
rather than specifying columns in theSELECT
statement for the function, the output would also include the dimension columns specified in theDIMENSION_ID_COLS
option of theCREATE MODEL
statement.The first several rows of the output should look similar to the following:
+-------------------------------------------------------------+------------------+------------------+------------------+---------------------+-----------------------+--------------------------------+-----------------+ | contributors | metric_test | metric_control | difference | relative_difference | unexpected_difference | relative_unexpected_difference | apriori_support | +-------------------------------------------------------------+------------------+------------------+------------------+---------------------+-----------------------+--------------------------------+-----------------+ | ["payment_type=5"] | 82996.99307095 | 138.26 | 82858.73307095 | 599.296492629 | 82825.246757081 | 482.253417818 | 0.063593301 | +-------------------------------------------------------------+------------------+------------------+------------------+---------------------+-----------------------+--------------------------------+-----------------+ | ["vendor_id=1"] | 651674.026105812 | 475749.798118647 | 175924.227987165 | 0.369783085 | 39985.82041025 | 0.065369611 | 0.499320531 | +-------------------------------------------------------------+------------------+------------------+------------------+---------------------+-----------------------+--------------------------------+-----------------+ | ["passenger_count=1"] | 304615.252142054 | 214839.058249037 | 89776.193893017 | 0.417876501 | 25149.907437652 | 0.08999294 | 0.233399895 | +-------------------------------------------------------------+------------------+------------------+------------------+---------------------+-----------------------+--------------------------------+-----------------+ | ["passenger_count=6"] | 106980.236314059 | 64751.247894565 | 42228.988419494 | 0.65217258 | 22582.018639759 | 0.267565108 | 0.081969553 | +-------------------------------------------------------------+------------------+------------------+------------------+---------------------+-----------------------+--------------------------------+-----------------+ | ["vendor_id=1","passenger_count=1"] | 169184.64374 | 114583.997774386 | 54600.645965614 | 0.476511965 | 19471.037967023 | 0.130055237 | 0.129631323 | +-------------------------------------------------------------+------------------+------------------+------------------+---------------------+-----------------------+--------------------------------+-----------------+
Because you performed contribution analysis on a summable metric, the results contain the summable metric output columns .
Clean up
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete .
- In the dialog, type the project ID, and then click Shut down to delete the project.