Create recommendations based on explicit feedback with a matrix factorization model
This tutorial teaches you how to create a matrix factorization model and train it on the customer movie ratings in the movielens1m dataset. You then
use the matrix factorization model to generate movie recommendations for users.
Using customer-provided ratings to train the model is called
training with explicit feedback. Matrix factorization models are trained
using the Alternating Least Squares algorithm when you use
explicit feedback as training data.
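To make the idea of alternating least squares concrete, the following is a minimal pure-Python sketch that fits a single latent factor per user and per item. It is not BigQuery ML's implementation. The toy ratings, the `als_rank1` function name, and the regularization value are assumptions chosen for illustration. Each step holds one side fixed and solves a small regularized least-squares problem for the other side.

```python
# Toy explicit-feedback ratings: (user, item) -> rating. Illustrative values only.
ratings = {
    ("u1", "m1"): 5.0, ("u1", "m2"): 4.0,
    ("u2", "m1"): 4.0, ("u2", "m3"): 2.0,
    ("u3", "m2"): 5.0, ("u3", "m3"): 3.0,
}


def als_rank1(ratings, l2_reg=0.1, iterations=20):
    """Fit rating ~ user_factor * item_factor with alternating least squares."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = dict.fromkeys(users, 1.0)
    item_f = dict.fromkeys(items, 1.0)
    losses = []
    for _ in range(iterations):
        # Hold item factors fixed: each user factor has a closed-form solution.
        for u in users:
            rated = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            user_f[u] = sum(r * item_f[i] for i, r in rated) / (
                sum(item_f[i] ** 2 for i, _ in rated) + l2_reg
            )
        # Hold user factors fixed: solve for each item factor the same way.
        for i in items:
            rated = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            item_f[i] = sum(r * user_f[u] for u, r in rated) / (
                sum(user_f[u] ** 2 for u, _ in rated) + l2_reg
            )
        # Record the regularized squared-error objective after this iteration.
        squared_error = sum(
            (r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items()
        )
        penalty = l2_reg * (
            sum(f * f for f in user_f.values()) + sum(f * f for f in item_f.values())
        )
        losses.append(squared_error + penalty)
    return user_f, item_f, losses


user_factors, item_factors, losses = als_rank1(ratings)

# Predict a rating for a pair that has no rating in the toy data.
prediction = user_factors["u1"] * item_factors["m3"]
print(f"final objective: {losses[-1]:.4f}, predicted u1/m3: {prediction:.2f}")
```

Because every update exactly minimizes the objective for one factor, the recorded objective never increases from one iteration to the next. A real model uses many factors per user and item rather than one.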
Objectives
This tutorial guides you through completing the following tasks:
Creating a matrix factorization model by using the CREATE MODEL statement.
Sign in to your Google Cloud account. If you're new to
Google Cloud, create an account to evaluate how our products perform in
real-world scenarios. New customers also get $300 in free credits to
run, test, and deploy workloads.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Roles required to select or create a project
Select a project: Selecting a project doesn't require a specific
IAM role; you can select any project that you've been
granted a role on.
Create a project: To create a project, you need the Project Creator role
(roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant
roles.
BigQuery is automatically enabled in new projects.
To activate BigQuery in a pre-existing project, go to
Enable the BigQuery API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM
role (roles/serviceusage.serviceUsageAdmin), which
contains the serviceusage.services.enable permission. Learn how to grant
roles.
First, create a Client object with bqclient = google.cloud.bigquery.Client(), then load the movielens1m data
into the dataset you created in the previous step.
import io
import zipfile

import google.api_core.exceptions
import google.cloud.bigquery
import requests

try:
    # Check if you've already created the Movielens tables to avoid downloading
    # and uploading the dataset unnecessarily.
    bqclient.get_table("bqml_tutorial.ratings")
    bqclient.get_table("bqml_tutorial.movies")
except google.api_core.exceptions.NotFound:
    # Download the https://grouplens.org/datasets/movielens/1m/ dataset.
    ml1m = requests.get("http://files.grouplens.org/datasets/movielens/ml-1m.zip")
    ml1m_file = io.BytesIO(ml1m.content)
    ml1m_zip = zipfile.ZipFile(ml1m_file)

    # Upload the ratings data into the ratings table.
    with ml1m_zip.open("ml-1m/ratings.dat") as ratings_file:
        ratings_content = ratings_file.read()

    ratings_csv = io.BytesIO(ratings_content.replace(b"::", b","))
    ratings_config = google.cloud.bigquery.LoadJobConfig()
    ratings_config.source_format = "CSV"
    ratings_config.write_disposition = "WRITE_TRUNCATE"
    ratings_config.schema = [
        google.cloud.bigquery.SchemaField("user_id", "INT64"),
        google.cloud.bigquery.SchemaField("item_id", "INT64"),
        google.cloud.bigquery.SchemaField("rating", "FLOAT64"),
        google.cloud.bigquery.SchemaField("timestamp", "TIMESTAMP"),
    ]
    bqclient.load_table_from_file(
        ratings_csv, "bqml_tutorial.ratings", job_config=ratings_config
    ).result()

    # Upload the movie data into the movies table.
    with ml1m_zip.open("ml-1m/movies.dat") as movies_file:
        movies_content = movies_file.read()

    movies_csv = io.BytesIO(movies_content.replace(b"::", b"@"))
    movies_config = google.cloud.bigquery.LoadJobConfig()
    movies_config.source_format = "CSV"
    movies_config.field_delimiter = "@"
    movies_config.write_disposition = "WRITE_TRUNCATE"
    movies_config.schema = [
        google.cloud.bigquery.SchemaField("movie_id", "INT64"),
        google.cloud.bigquery.SchemaField("movie_title", "STRING"),
        google.cloud.bigquery.SchemaField("genre", "STRING"),
    ]
    bqclient.load_table_from_file(
        movies_csv, "bqml_tutorial.movies", job_config=movies_config
    ).result()
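The load code rewrites the "::" separator to a comma for the ratings file but to "@" for the movies file. A likely reason, stated here as an assumption, is that movie titles can contain commas, which would split a title into extra fields. The following sketch shows the effect on two illustrative rows in the movies.dat layout. The `parse_rows` helper is hypothetical and runs locally without BigQuery.

```python
import csv
import io

# Two illustrative rows in the "::"-separated movies.dat layout.
sample_movies = (
    b"1::Toy Story (1995)::Animation|Children's|Comedy\n"
    b"11::American President, The (1995)::Comedy|Drama|Romance\n"
)


def parse_rows(raw, delimiter):
    """Swap the "::" separator for a one-character delimiter, then parse."""
    text = raw.replace(b"::", delimiter.encode()).decode("latin-1")
    reader = csv.reader(
        io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE
    )
    return list(reader)


comma_rows = parse_rows(sample_movies, ",")
at_rows = parse_rows(sample_movies, "@")

# A comma delimiter splits the second title into two fields.
print([len(row) for row in comma_rows])
# The "@" delimiter keeps three fields per row.
print([len(row) for row in at_rows])
```

The ratings file contains only numeric fields, so a comma delimiter is safe there.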
Create the model
Create a matrix factorization model and train it on the data in the ratings table. The model is trained to predict a rating for every user-item pair,
based on the customer-provided movie ratings.
SQL
The following CREATE MODEL statement uses these columns to generate
recommendations:
user_id: The user ID.
item_id: The movie ID.
rating: The explicit rating from 1 to 5 that the user gave the
item.
Follow these steps to create the model:
In the Google Cloud console, go to the BigQuery page.
The query takes about 10 minutes to complete, after which the mf_explicit model appears in the Explorer pane. Because
the query uses a CREATE MODEL statement to create a model, you don't see
query results.
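As a rough sketch of what such a CREATE MODEL statement can look like, the following Python snippet assembles one as a string. This is a reconstruction under stated assumptions, not necessarily the exact statement this tutorial runs. The `build_create_model_sql` helper is hypothetical, and the defaults of 34 factors and an L2 regularization of 9.83 are taken from the BigQuery DataFrames example in this tutorial.

```python
def build_create_model_sql(
    model_id="bqml_tutorial.mf_explicit",
    ratings_table="bqml_tutorial.ratings",
    num_factors=34,
    l2_reg=9.83,
):
    """Return a CREATE MODEL statement for an explicit-feedback model."""
    return f"""
CREATE OR REPLACE MODEL `{model_id}`
OPTIONS (
  MODEL_TYPE = 'matrix_factorization',
  FEEDBACK_TYPE = 'explicit',
  USER_COL = 'user_id',
  ITEM_COL = 'item_id',
  RATING_COL = 'rating',
  L2_REG = {l2_reg},
  NUM_FACTORS = {num_factors}
) AS
SELECT user_id, item_id, rating
FROM `{ratings_table}`
""".strip()


create_model_sql = build_create_model_sql()
print(create_model_sql)

# To run the statement from Python instead of the console, assuming
# `bqclient` is the google.cloud.bigquery.Client object created earlier:
# bqclient.query(create_model_sql).result()
```

You can paste the printed statement into the query editor, or submit it through the client as shown in the trailing comment.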
from bigframes.ml import decomposition
import bigframes.pandas as bpd

# Load data from BigQuery
bq_df = bpd.read_gbq(
    "bqml_tutorial.ratings", columns=("user_id", "item_id", "rating")
)

# Create the Matrix Factorization model
model = decomposition.MatrixFactorization(
    num_factors=34,
    feedback_type="explicit",
    user_col="user_id",
    item_col="item_id",
    rating_col="rating",
    l2_reg=9.83,
)
model.fit(bq_df)
model.to_gbq(
    your_model_id, replace=True  # For example: "bqml_tutorial.mf_explicit"
)
The code takes about 10 minutes to complete, after which the mf_explicit model appears in the Explorer pane.
Get training statistics
Optionally, you can view the model's training statistics in the
Google Cloud console.
A machine learning algorithm builds a model by creating many iterations of
the model using different parameters, and then selecting the version of the
model that minimizes loss.
This process is called empirical risk minimization. The model's training
statistics let you see the loss associated with each iteration of the model.
Follow these steps to view the model's training statistics:
In the Google Cloud console, go to the BigQuery page.
The Training Data Loss column represents the loss metric calculated
after the model is trained. Because this is a matrix factorization model,
this column shows the mean squared error.
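As a small illustration of the loss metric, the following sketch computes the mean squared error for two sets of predictions against the same ratings. All values are made up for the example, and the `mean_squared_error` helper is hypothetical. A lower value means the predictions sit closer to the actual ratings.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Illustrative ratings and two hypothetical sets of predictions.
actual_ratings = [5.0, 3.0, 4.0, 1.0]
early_iteration = [3.0, 3.0, 3.0, 3.0]
later_iteration = [4.5, 3.5, 4.0, 1.5]

early_loss = mean_squared_error(actual_ratings, early_iteration)
later_loss = mean_squared_error(actual_ratings, later_iteration)
print(early_loss, later_loss)  # 2.25 0.1875
```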
Evaluate the performance of the model by comparing the predicted movie ratings
returned by the model against the actual user movie ratings from the training
data.
SQL
Use the ML.EVALUATE function to evaluate the model:
In the Google Cloud console, go to the BigQuery page.
An important metric in the evaluation results is the R² score.
The R² score is a statistical measure that indicates how closely the
linear regression predictions approximate the actual data. A value of 0 indicates that the model explains none of the variability of the
response data around the mean. A value of 1 indicates that the model
explains all the variability of the response data around the mean.
For more information about the ML.EVALUATE function output, see Output.
You can also call ML.EVALUATE without providing the input data. In that case, it
uses the evaluation metrics calculated during training.
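The two boundary values of the score can be checked with a few lines of Python. This sketch uses the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. The ratings are illustrative and the `r2_score` helper is hypothetical.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean = sum(actual) / len(actual)
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_ss = sum((a - mean) ** 2 for a in actual)
    return 1.0 - residual_ss / total_ss


actual_ratings = [5.0, 3.0, 4.0, 2.0]  # mean is 3.5

# Predictions that match exactly explain all of the variability.
perfect_score = r2_score(actual_ratings, actual_ratings)

# Always predicting the mean explains none of the variability.
mean_only_score = r2_score(actual_ratings, [3.5, 3.5, 3.5, 3.5])

print(perfect_score, mean_only_score)  # 1.0 0.0
```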
Join the predicted ratings with the movie information, and select the top
five results per user. In the query editor, paste in the
following query and click Run:
# import bigframes.bigquery as bbq

# Load movies
movies = bpd.read_gbq("bqml_tutorial.movies")

# Merge the movies df with the previously created predicted df
merged_df = bpd.merge(predicted, movies, left_on="item_id", right_on="movie_id")

# Separate users and predicted data, setting the index to 'movie_id'
users = merged_df[["user_id", "movie_id"]].set_index("movie_id")

# Take the predicted data and sort it in descending order by 'predicted_rating', setting the index to 'movie_id'
sort_data = (
    merged_df[["movie_title", "genre", "predicted_rating", "movie_id"]]
    .sort_values(by="predicted_rating", ascending=False)
    .set_index("movie_id")
)

# re-merge the separated dfs by index
merged_user = sort_data.join(users, how="outer")

# group the users and set the user_id as the index
# (assign the result so that the printed frame holds the top five rows per user)
merged_user = merged_user.groupby("user_id").head(5).set_index("user_id").sort_index()

print(merged_user)

# Output:
#                         movie_title                    genre  predicted_rating
# user_id
# 1        Saving Private Ryan (1998)         Action|Drama|War           5.19326
# 1                      Fargo (1996)     Crime|Drama|Thriller          4.996954
# 1         Driving Miss Daisy (1989)                    Drama          4.983671
# 1                    Ben-Hur (1959)   Action|Adventure|Drama          4.877622
# 1           Schindler's List (1993)                Drama|War          4.802336
# 2        Saving Private Ryan (1998)         Action|Drama|War           5.19326
# 2                 Braveheart (1995)         Action|Drama|War          5.174145
# 2                  Gladiator (2000)             Action|Drama          5.066372
# 2             On Golden Pond (1981)                    Drama           5.01198
# 2         Driving Miss Daisy (1989)                    Drama          4.983671
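The "top results per user" selection can also be sketched without any DataFrame library. The following example groups hypothetical predicted rows by user, sorts each group by predicted rating in descending order, and keeps the first `n` entries. The row values, placeholder titles, and the `top_n_per_user` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical (user_id, movie_title, predicted_rating) rows.
predicted_rows = [
    (1, "Movie A", 4.9), (1, "Movie B", 3.1), (1, "Movie C", 4.2),
    (2, "Movie A", 2.5), (2, "Movie C", 4.8), (2, "Movie D", 4.0),
]


def top_n_per_user(rows, n=5):
    """Return each user's n highest predicted ratings, best first."""
    by_user = defaultdict(list)
    for user_id, title, rating in rows:
        by_user[user_id].append((title, rating))
    return {
        user_id: sorted(items, key=lambda item: item[1], reverse=True)[:n]
        for user_id, items in sorted(by_user.items())
    }


top_two = top_n_per_user(predicted_rows, n=2)
for user_id, recommendations in top_two.items():
    print(user_id, recommendations)
```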
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this
tutorial, either delete the project that contains the resources, or keep the project and
delete the individual resources.
You can delete the project you created.
Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you
prefer to reuse the project, you can delete the dataset you created in this
tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
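If you prefer to delete the dataset from Python rather than the console, one possible approach is the `delete_dataset` method of the BigQuery client. The sketch below wraps that call in a hypothetical `delete_tutorial_dataset` helper. It assumes the dataset is named bqml_tutorial, as in the table IDs used earlier, and that you pass in a `google.cloud.bigquery.Client` object.

```python
def delete_tutorial_dataset(client, dataset_id="bqml_tutorial"):
    """Delete the tutorial dataset together with its contents.

    `client` is expected to be a google.cloud.bigquery.Client instance.
    `not_found_ok=True` makes the call a no-op if the dataset is already gone.
    """
    client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)


# Usage, assuming `bqclient` is the Client object created earlier:
# delete_tutorial_dataset(bqclient)
```

This deletion is permanent, so check the dataset ID before you run it.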
Last updated 2026-04-08 UTC.