Perform classification with a boosted trees model
This tutorial teaches you how to use a boosted trees classifier model to predict the income range of individuals based on their demographic data.
The model predicts whether a value falls into one of two categories, in this
case whether an individual's annual income falls above or below $50,000.
Sign in to your Google Cloud account. If you're new to
Google Cloud, create an account to evaluate how our products perform in
real-world scenarios. New customers also get $300 in free credits to
run, test, and deploy workloads.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Roles required to select or create a project
Select a project: Selecting a project doesn't require a specific
IAM role—you can select any project that you've been
granted a role on.
Create a project: To create a project, you need the Project Creator role
(roles/resourcemanager.projectCreator), which contains the
resourcemanager.projects.create permission. Learn how to grant roles.
BigQuery is automatically enabled in new projects.
To activate BigQuery in a pre-existing project, go to
Enable the BigQuery API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM
role (roles/serviceusage.serviceUsageAdmin), which
contains the serviceusage.services.enable permission. Learn how to grant roles.
The model you create in this tutorial predicts the income bracket for census
respondents, based on the following features:
Age
Type of work performed
Marital status
Level of education
Occupation
Hours worked per week
The education column isn't included in the training data, because
the education and education_num columns both express the respondent's level
of education in different formats.
You separate the data into training, evaluation, and prediction sets by creating
a new dataframe column that is derived from the functional_weight column.
Eighty percent of the data is used for training the model, and the remaining
twenty percent of the data is used for evaluation and prediction.
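The split logic can be sketched locally in pandas. This is only an illustration, assuming the split keys off the last digit of functional_weight (MOD 10: digits 0–7 for training, 8 for evaluation, 9 for prediction, which yields roughly 80/10/10); the functional_weight values below are invented:

```python
import pandas as pd

# Toy stand-in for the census table; functional_weight values are invented.
df = pd.DataFrame({
    "functional_weight": [77516, 83311, 215646, 234721, 338409,
                          284582, 160187, 209642, 45788, 159449],
})

# Bucket each row by the last digit of functional_weight:
# digits 0-7 -> training, 8 -> evaluation, 9 -> prediction.
digit = df["functional_weight"] % 10
df["dataframe"] = digit.map(
    lambda d: "training" if d < 8 else ("evaluation" if d == 8 else "prediction")
)

counts = df["dataframe"].value_counts().to_dict()
print(counts)
```

Because the bucketing is deterministic, every run assigns the same rows to the same set, which keeps the training, evaluation, and prediction sets disjoint and reproducible.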
SQL
To prepare your sample data, create a view to
contain the training data. This view is used by the CREATE MODEL statement
later in this tutorial.
Run the query that prepares the sample data:
In the Google Cloud console, go to the BigQuery page.
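The preparation query might look like the following sketch. It selects the feature columns listed earlier plus the income_bracket label from the public census_adult_income table, and derives the dataframe column from functional_weight; the bqml_tutorial dataset name is an assumption, so substitute your own:

```sql
CREATE OR REPLACE VIEW
  `bqml_tutorial.input_data` AS
SELECT
  age,
  workclass,
  marital_status,
  education_num,
  occupation,
  hours_per_week,
  income_bracket,
  CASE
    WHEN MOD(functional_weight, 10) < 8 THEN 'training'
    WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
    WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
  END AS dataframe
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
```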
Create a boosted trees model to predict census respondents' income bracket, and
train it on the census data. The query takes about 30 minutes to complete.
SQL
Follow these steps to create the model:
In the Google Cloud console, go to the BigQuery page.
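The CREATE MODEL statement might look like the following sketch, which mirrors the hyperparameters used in the Python example later in this tutorial (one parallel tree, hist tree method, 0.85 subsample); the bqml_tutorial dataset name and the input_data view name are assumptions:

```sql
CREATE MODEL `bqml_tutorial.tree_model`
OPTIONS (
  MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
  BOOSTER_TYPE = 'GBTREE',
  NUM_PARALLEL_TREE = 1,
  MAX_ITERATIONS = 1,  -- For a more accurate model, try 50 iterations.
  TREE_METHOD = 'HIST',
  SUBSAMPLE = 0.85,
  INPUT_LABEL_COLS = ['income_bracket'])
AS
SELECT * EXCEPT (dataframe)
FROM `bqml_tutorial.input_data`
WHERE dataframe = 'training';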
After the query completes, the tree_model model can be accessed through the Explorer pane. Because
the query uses a CREATE MODEL statement to create a model, you don't see
query results.
```python
from bigframes.ml import ensemble

# input_data is defined in an earlier step.
training_data = input_data[input_data["dataframe"] == "training"]
X = training_data.drop(columns=["income_bracket", "dataframe"])
y = training_data["income_bracket"]

# Create and train the model.
tree_model = ensemble.XGBClassifier(
    n_estimators=1,
    booster="gbtree",
    tree_method="hist",
    max_iterations=1,  # For a more accurate model, try 50 iterations.
    subsample=0.85,
)
tree_model.fit(X, y)

tree_model.to_gbq(
    your_model_id,  # For example: "your-project.bqml_tutorial.tree_model"
    replace=True,
)
```
Evaluate the model
SQL
Follow these steps to evaluate the model:
In the Google Cloud console, go to the BigQuery page.
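The evaluation query might look like the following sketch, which runs ML.EVALUATE over the rows reserved for evaluation; the bqml_tutorial dataset name and the input_data view name are assumptions:

```sql
SELECT *
FROM ML.EVALUATE(
  MODEL `bqml_tutorial.tree_model`,
  (SELECT * FROM `bqml_tutorial.input_data` WHERE dataframe = 'evaluation'));
```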
```python
# Select the model you'll use for predictions. `read_gbq_model` loads model
# data from BigQuery, but you could also use the `tree_model` object
# from the previous step.
tree_model = bpd.read_gbq_model(
    your_model_id,  # For example: "your-project.bqml_tutorial.tree_model"
)

# input_data is defined in an earlier step.
evaluation_data = input_data[input_data["dataframe"] == "evaluation"]
X = evaluation_data.drop(columns=["income_bracket", "dataframe"])
y = evaluation_data["income_bracket"]

# The score() method evaluates how the model performs compared to the
# actual data. The output DataFrame matches that of ML.EVALUATE().
score = tree_model.score(X, y)
score.peek()
# Output:
#    precision    recall  accuracy  f1_score  log_loss   roc_auc
# 0   0.671924  0.578804  0.839429  0.621897  0.344054  0.887335
```
The evaluation metrics indicate good model performance, in particular
the fact that the roc_auc score is greater than 0.8.
For more information about the evaluation metrics, see Output.
Use the model to predict classifications
SQL
Follow these steps to predict data with the model:
In the Google Cloud console, go to the BigQuery page.
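The prediction query might look like the following sketch, which runs ML.PREDICT over the rows reserved for prediction; the bqml_tutorial dataset name and the input_data view name are assumptions:

```sql
SELECT *
FROM ML.PREDICT(
  MODEL `bqml_tutorial.tree_model`,
  (SELECT * FROM `bqml_tutorial.input_data` WHERE dataframe = 'prediction'));
```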
```python
# Select the model you'll use for predictions. `read_gbq_model` loads model
# data from BigQuery, but you could also use the `tree_model` object
# from previous steps.
tree_model = bpd.read_gbq_model(
    your_model_id,  # For example: "your-project.bqml_tutorial.tree_model"
)

# input_data is defined in an earlier step.
prediction_data = input_data[input_data["dataframe"] == "prediction"]

predictions = tree_model.predict(prediction_data)
predictions.peek()
# Output:
#   predicted_income_bracket  predicted_income_bracket_probs.label  predicted_income_bracket_probs.prob
#                      <=50K                                  >50K                  0.05183430016040802
#                                                            <=50K                  0.94816571474075317
#                      <=50K                                  >50K                  0.00365859130397439
#                                                            <=50K                  0.99634140729904175
#                      <=50K                                  >50K                 0.037775970995426178
#                                                            <=50K                  0.96222406625747681
```
The predicted_income_bracket column contains the predicted value from the model.
The predicted_income_bracket_probs.label column shows the two labels that the
model had to choose between, and the predicted_income_bracket_probs.prob
column shows the probability of the given label being the correct one.
To avoid incurring charges to your Google Cloud account for the resources used in this
tutorial, either delete the project that contains the resources, or keep the project and
delete the individual resources.
You can delete the project you created.
Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you
prefer to reuse the project, you can delete the dataset you created in this
tutorial:
If necessary, open the BigQuery page in the
Google Cloud console.
Last updated 2026-04-01 UTC.