Forecast multiple time series with an ARIMA_PLUS univariate model
This tutorial teaches you how to use an ARIMA_PLUS univariate time series model to forecast the future value of a given
column, based on the historical values for that column.
This tutorial forecasts for multiple time series. Forecasted values are
calculated for each time point, for each value in one or more specified columns.
For example, if you wanted to forecast weather and specified a column containing
city data, the forecasted data would contain forecasts for all time points for
City A, then forecasted values for all time points for City B, and so forth.
This tutorial guides you through tasks that include the following:
Retrieving the forecasted bike ride information from the model by using the ML.FORECAST function.
Retrieving components of the time series, such as seasonality and trend,
by using the ML.EXPLAIN_FORECAST function.
You can inspect these time series components in order to explain the
forecasted values.
Costs
This tutorial uses billable components of Google Cloud, including:
BigQuery
BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
Sign in to your Google Cloud account. If you're new to
Google Cloud, create an account to evaluate how our products perform in
real-world scenarios. New customers also get $300 in free credits to
run, test, and deploy workloads.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Roles required to select or create a project
Select a project: Selecting a project doesn't require a specific
IAM role—you can select any project that you've been
granted a role on.
Create a project: To create a project, you need the Project Creator role
(roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant
roles.
BigQuery is automatically enabled in new projects.
To activate BigQuery in a pre-existing project, go to
Enable the BigQuery API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM
role (roles/serviceusage.serviceUsageAdmin), which
contains the serviceusage.services.enable permission. Learn how to grant
roles.
Before creating the model, you can optionally visualize your input
time series data to get a sense of the distribution. You can do this by using
Looker Studio.
SQL
The SELECT statement of the following query uses the EXTRACT function to extract the date information from the starttime column. The query uses
COUNT(*) to get the daily total number of Citi Bike trips.
Follow these steps to visualize the time series data:
In the Google Cloud console, go to the BigQuery page.
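For reference, the daily-trip query described above can be sketched as follows. It reads the same bigquery-public-data.new_york.citibike_trips public table used throughout this tutorial; you can then chart num_trips over date in Looker Studio:

```sql
-- Extract the date from starttime and count the daily total number
-- of Citi Bike trips.
SELECT
  EXTRACT(DATE FROM starttime) AS date,
  COUNT(*) AS num_trips
FROM
  `bigquery-public-data.new_york.citibike_trips`
GROUP BY date;
```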
import bigframes.pandas as bpd

df = bpd.read_gbq("bigquery-public-data.new_york.citibike_trips")

features = bpd.DataFrame(
    {
        "num_trips": df.starttime,
        "date": df["starttime"].dt.date,
    }
)
num_trips = features.groupby(["date"]).count()

# Results from running "print(num_trips)"
#             num_trips
# date
# 2013-07-01      16650
# 2013-07-02      22745
# 2013-07-03      21864
# 2013-07-04      22326
# 2013-07-05      21842
# 2013-07-06      20467
# 2013-07-07      20477
# 2013-07-08      21615
# 2013-07-09      26641
# 2013-07-10      25732
# 2013-07-11      24417
# 2013-07-12      19006
# 2013-07-13      26119
# 2013-07-14      29287
# 2013-07-15      28069
# 2013-07-16      29842
# 2013-07-17      30550
# 2013-07-18      28869
# 2013-07-19      26591
# 2013-07-20      25278
# 2013-07-21      30297
# 2013-07-22      25979
# 2013-07-23      32376
# 2013-07-24      35271
# 2013-07-25      31084

num_trips.plot.line(
    # Rotate the x labels so they are more visible.
    rot=45,
)
Create the time series model
You want to forecast the number of bike trips for each Citi Bike station, which requires many time series models, one for each Citi Bike station in the input data. You could write multiple queries to create these models, but that can be a tedious and time-consuming process, especially when you have a large number of time series. Instead, you can use a single query to create and fit a set of time series models in order to forecast multiple time series at once.
SQL
In the following query, the OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...) clause indicates that you are creating an ARIMA-based
time series model. You use the time_series_id_col option of the CREATE MODEL statement to specify one or more columns in the input data
that you want to get forecasts for, in this case the Citi Bike station, as
represented by the start_station_name column. You use the WHERE clause to
limit the start stations to those with Central Park in their names. The auto_arima_max_order option of the CREATE MODEL statement controls the
search space for hyperparameter tuning in the auto.ARIMA algorithm. The decompose_time_series option of the CREATE MODEL statement defaults to TRUE, so that information about
the time series data is returned when you evaluate the model in the next step.
Follow these steps to create the model:
In the Google Cloud console, go to the BigQuery page.
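Putting the options described above together, the CREATE MODEL query might look like the following sketch. The bqml_tutorial dataset name is an assumption for illustration; replace it with your own dataset:

```sql
-- Create a set of ARIMA_PLUS models, one per Central Park start station.
CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_group`
OPTIONS(
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'date',
  time_series_data_col = 'num_trips',
  time_series_id_col = 'start_station_name',
  auto_arima_max_order = 5
) AS
SELECT
  start_station_name,
  EXTRACT(DATE FROM starttime) AS date,
  COUNT(*) AS num_trips
FROM
  `bigquery-public-data.new_york.citibike_trips`
WHERE start_station_name LIKE '%Central Park%'
GROUP BY start_station_name, date;
```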
The query takes approximately 24 seconds to complete, after which you can access the nyc_citibike_arima_model_group model. Because the query uses a CREATE MODEL statement, you don't see
query results.
This query creates twelve time series models, one for each of the twelve
Citi Bike start stations in the input data. The time cost, approximately 24
seconds, is only 1.4 times more than that of creating a single time series
model because of the parallelism. However, if you remove the WHERE ... LIKE ... clause, there would be 600+ time series to forecast, and
they wouldn't be forecast completely in parallel because of slot capacity
limitations. In that case, the query would take approximately 15 minutes to
finish. To reduce the query runtime with the compromise of a potential slight
drop in model quality, you could decrease the value of the auto_arima_max_order option.
This shrinks the search space of hyperparameter tuning in the auto.ARIMA algorithm. For more information, see Large-scale time series forecasting best practices.
BigQuery DataFrames
In the following snippet, you are creating an ARIMA-based
time series model.
from bigframes.ml import forecasting
import bigframes.pandas as bpd

model = forecasting.ARIMAPlus(
    # To reduce the query runtime with the compromise of a potential slight
    # drop in model quality, you could decrease the value of the
    # auto_arima_max_order. This shrinks the search space of hyperparameter
    # tuning in the auto.ARIMA algorithm.
    auto_arima_max_order=5,
)

df = bpd.read_gbq("bigquery-public-data.new_york.citibike_trips")

# This query creates twelve time series models, one for each of the twelve
# Citi Bike start stations in the input data. If you remove this row
# filter, there would be 600+ time series to forecast.
df = df[df["start_station_name"].str.contains("Central Park")]

features = bpd.DataFrame(
    {
        "start_station_name": df["start_station_name"],
        "num_trips": df["starttime"],
        "date": df["starttime"].dt.date,
    }
)
num_trips = features.groupby(
    ["start_station_name", "date"],
    as_index=False,
).count()

X = num_trips["date"].to_frame()
y = num_trips["num_trips"].to_frame()

model.fit(
    X,
    y,
    # The input data that you want to get forecasts for,
    # in this case the Citi Bike station, as represented by the
    # start_station_name column.
    id_col=num_trips["start_station_name"].to_frame(),
)

# The model.fit() call above created a temporary model.
# Use the to_gbq() method to write to a permanent location.
model.to_gbq(
    your_model_id,  # For example: "bqml_tutorial.nyc_citibike_arima_model"
    replace=True,
)
This creates twelve time series models, one for each of the twelve Citi Bike start stations in the input data. The time cost, approximately 24 seconds, is only 1.4 times more than that of creating a single time series model because of the parallelism.
Evaluate the model
SQL
Evaluate the time series model by using the ML.ARIMA_EVALUATE function. The ML.ARIMA_EVALUATE function shows you the evaluation metrics that
were generated for the model during the process of automatic
hyperparameter tuning.
Follow these steps to evaluate the model:
In the Google Cloud console, go to the BigQuery page.
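As a sketch, the evaluation query might look like the following, assuming the bqml_tutorial.nyc_citibike_arima_model_group model created earlier:

```sql
-- Return the evaluation metrics generated during automatic
-- hyperparameter tuning, one row per time series.
SELECT *
FROM
  ML.ARIMA_EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`);
```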
While auto.ARIMA evaluates dozens of candidate ARIMA models for each
time series, ML.ARIMA_EVALUATE by default only outputs the information of the
best model to keep the output table compact. To view all the candidate models,
you can set the ML.ARIMA_EVALUATE function's show_all_candidate_models argument to TRUE.
# Evaluate the time series models by using the summary() function. The summary()
# function shows you the evaluation metrics of all the candidate models evaluated
# during the process of automatic hyperparameter tuning.
summary = model.summary()
print(summary.peek())

# Expected output:
# start_station_name non_seasonal_p non_seasonal_d non_seasonal_q has_drift log_likelihood AIC variance ...
# 1 Central Park West & W 72 St 0 1 5 False -1966.449243 3944.898487 1215.689281 ...
# 8 Central Park W & W 96 St 0 0 5 False -274.459923 562.919847 655.776577 ...
# 9 Central Park West & W 102 St 0 0 0 False -226.639918 457.279835 258.83582 ...
# 11 Central Park West & W 76 St 1 1 2 False -1700.456924 3408.913848 383.254161 ...
# 4 Grand Army Plaza & Central Park S 0 1 5 False -5507.553498 11027.106996 624.138741 ...
The start_station_name column identifies the input data column for which
time series were created. This is the column that you specified with the time_series_id_col option when creating the model.
The non_seasonal_p, non_seasonal_d, non_seasonal_q, and has_drift output columns define an ARIMA model in the training pipeline. The log_likelihood, AIC, and variance output columns are relevant to the ARIMA
model fitting process. The fitting process determines the best ARIMA model by
using the auto.ARIMA algorithm, one for each time series.
The auto.ARIMA algorithm uses the KPSS test to determine the best value
for non_seasonal_d, which in this case is 1. When non_seasonal_d is 1,
the auto.ARIMA algorithm trains 42 different candidate ARIMA models in parallel.
In this example, all 42 candidate models are valid, so the output contains 42
rows, one for each candidate ARIMA model; in cases where some of the models
aren't valid, they are excluded from the output. These candidate models are
returned in ascending order by AIC. The model in the first row has the lowest
AIC, and is considered the best model. This best model is saved as the final
model and is used when you forecast data, evaluate the model, and
inspect the model's coefficients as shown in the following steps.
The seasonal_periods column contains information about the seasonal pattern
identified in the time series data. Each time series can have different seasonal
patterns. For example, from the figure, you can see that one time series has a
yearly pattern, while others don't.
The has_holiday_effect, has_spikes_and_dips, and has_step_changes columns
are only populated when decompose_time_series=TRUE. These columns also reflect
information about the input time series data, and are not related to the ARIMA
modeling. These columns also have the same values across all output rows.
Inspect the model's coefficients
SQL
Inspect the time series model's coefficients by using the ML.ARIMA_COEFFICIENTS function.
Follow these steps to retrieve the model's coefficients:
In the Google Cloud console, go to the BigQuery page.
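A minimal sketch of the coefficients query, assuming the bqml_tutorial.nyc_citibike_arima_model_group model created earlier:

```sql
-- Return the AR and MA coefficients and the intercept or drift term
-- for each fitted time series model.
SELECT *
FROM
  ML.ARIMA_COEFFICIENTS(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`);
```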
coef = model.coef_
print(coef.peek())

# Expected output:
# start_station_name ar_coefficients ma_coefficients intercept_or_drift
# 5 Central Park West & W 68 St [] [-0.41014089 0.21979212 -0.59854213 -0.251438... 0.0
# 6 Central Park S & 6 Ave [] [-0.71488957 -0.36835772 0.61008532 0.183290... 0.0
# 0 Central Park West & W 85 St [] [-0.39270166 -0.74494638 0.76432596 0.489146... 0.0
# 3 W 82 St & Central Park West [-0.50219511 -0.64820817] [-0.20665325 0.67683137 -0.68108631] 0.0
# 11 W 106 St & Central Park West [-0.70442887 -0.66885553 -0.25030325 -0.34160669] [] 0.0
The start_station_name column identifies the input data column for which
time series were created. This is the column that you specified in the time_series_id_col option when creating the model.
The ar_coefficients output column shows the model coefficients of the
autoregressive (AR) part of the ARIMA model. Similarly, the ma_coefficients output column shows the model coefficients of the moving-average (MA) part of
the ARIMA model. Both of these columns contain array values, whose lengths are
equal to non_seasonal_p and non_seasonal_q, respectively. The intercept_or_drift value is the constant term in the ARIMA model.
Use the model to forecast data
SQL
Forecast future time series values by using the ML.FORECAST function.
In the following GoogleSQL query, the STRUCT(3 AS horizon, 0.9 AS confidence_level) clause indicates that the
query forecasts 3 future time points, and generates a prediction interval
with a 90% confidence level.
Follow these steps to forecast data with the model:
In the Google Cloud console, go to the BigQuery page.
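A sketch of the forecasting query, assuming the bqml_tutorial.nyc_citibike_arima_model_group model created earlier:

```sql
-- Forecast 3 future time points per station with a 90% confidence level.
SELECT *
FROM
  ML.FORECAST(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`,
              STRUCT(3 AS horizon, 0.9 AS confidence_level));
```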
prediction = model.predict(horizon=3, confidence_level=0.9)
print(prediction.peek())

# Expected output:
# forecast_timestamp start_station_name forecast_value standard_error confidence_level ...
# 4 2016-10-01 00:00:00+00:00 Central Park S & 6 Ave 302.377201 32.572948 0.9 ...
# 14 2016-10-02 00:00:00+00:00 Central Park North & Adam Clayton Powell Blvd 263.917567 45.284082 0.9 ...
# 1 2016-09-25 00:00:00+00:00 Central Park West & W 85 St 189.574706 39.874856 0.9 ...
# 20 2016-10-02 00:00:00+00:00 Central Park West & W 72 St 175.474862 40.940794 0.9 ...
# 12 2016-10-01 00:00:00+00:00 W 106 St & Central Park West 63.88163 18.088868 0.9 ...
The first column, start_station_name, identifies the time series that each
time series model is fitted against. Each start_station_name has three
rows of forecasted results, as specified by the horizon value.
For each start_station_name, the output rows are in chronological order by the forecast_timestamp column value. In time series forecasting, the prediction
interval, as represented by the prediction_interval_lower_bound and prediction_interval_upper_bound column values, is as important as the forecast_value column value. The forecast_value value is the middle point
of the prediction interval. The prediction interval depends on the standard_error and confidence_level column values.
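To make this relationship concrete, the following Python sketch approximates how a symmetric prediction interval relates to the forecast midpoint, the standard error, and the confidence level, using a normal quantile. This illustrates the general idea only, not the exact formula BigQuery ML uses; the prediction_interval helper is hypothetical, and the sample numbers are taken from the forecast output above.

```python
from statistics import NormalDist

def prediction_interval(forecast_value, standard_error, confidence_level):
    """Illustrative only: approximate a symmetric prediction interval
    around the forecast midpoint using a normal quantile. The exact
    formula BigQuery ML uses may differ."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return (forecast_value - z * standard_error,
            forecast_value + z * standard_error)

# Sample values from the forecast output for Central Park S & 6 Ave.
low, high = prediction_interval(302.377201, 32.572948, 0.9)

# The forecast value is the middle point of the interval by construction.
assert abs((low + high) / 2 - 302.377201) < 1e-9
```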
Explain the forecasting results
SQL
You can get explainability metrics in addition to forecast data by using the ML.EXPLAIN_FORECAST function. The ML.EXPLAIN_FORECAST function forecasts
future time series values and also returns all the separate components of the
time series. If you just want to return forecast data, use the ML.FORECAST function instead, as shown in Use the model to forecast data.
The STRUCT(3 AS horizon, 0.9 AS confidence_level) clause used in the ML.EXPLAIN_FORECAST function indicates that the query forecasts 3 future
time points and generates a prediction interval with 90% confidence.
Follow these steps to explain the model's results:
In the Google Cloud console, go to the BigQuery page.
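A sketch of the explainability query, assuming the bqml_tutorial.nyc_citibike_arima_model_group model created earlier:

```sql
-- Forecast 3 future time points per station and return the separate
-- components of each time series (trend, seasonality, and so on).
SELECT *
FROM
  ML.EXPLAIN_FORECAST(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`,
                      STRUCT(3 AS horizon, 0.9 AS confidence_level));
```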
The query takes less than a second to complete. The results should look
like the following:
The first thousand rows returned are all history data. You must scroll
through the results to see the forecast data.
The output rows are ordered first by start_station_name, then
chronologically by the time_series_timestamp column value. In time series
forecasting, the prediction
interval, as represented by the prediction_interval_lower_bound and prediction_interval_upper_bound column values, is as important as the forecast_value column value. The forecast_value value is the middle point
of the prediction interval. The prediction interval depends on the standard_error and confidence_level column values.
You can get explainability metrics in addition to forecast data by using the predict_explain function. The predict_explain function forecasts
future time series values and also returns all the separate components of the
time series. If you just want to return forecast data, use the predict function instead, as shown in Use the model to forecast data.
The horizon=3, confidence_level=0.9 arguments used in the predict_explain function indicate that the query forecasts 3 future
time points and generates a prediction interval with 90% confidence.
explain = model.predict_explain(horizon=3, confidence_level=0.9)
print(explain.peek(5))

# Expected output:
# time_series_timestamp start_station_name time_series_type time_series_data time_series_adjusted_data standard_error confidence_level prediction_interval_lower_bound prediction_interval_upper_bound trend seasonal_period_yearly seasonal_period_quarterly seasonal_period_monthly seasonal_period_weekly seasonal_period_daily holiday_effect spikes_and_dips step_changes residual
# 0 2013-07-01 00:00:00+00:00 Central Park S & 6 Ave history 69.0 154.168527 32.572948 <NA> <NA> <NA> 0.0 35.477484 <NA> <NA> -28.402102 <NA> <NA> 0.0 -85.168527 147.093145
# 1 2013-07-01 00:00:00+00:00 Grand Army Plaza & Central Park S history 79.0 79.0 24.982769 <NA> <NA> <NA> 0.0 43.46428 <NA> <NA> -30.01599 <NA> <NA> 0.0 0.0 65.55171
# 2 2013-07-02 00:00:00+00:00 Central Park S & 6 Ave history 180.0 204.045651 32.572948 <NA> <NA> <NA> 147.093045 72.498327 <NA> <NA> -15.545721 <NA> <NA> 0.0 -85.168527 61.122876
# 3 2013-07-02 00:00:00+00:00 Grand Army Plaza & Central Park S history 129.0 99.556269 24.982769 <NA> <NA> <NA> 65.551665 45.836432 <NA> <NA> -11.831828 <NA> <NA> 0.0 0.0 29.443731
# 4 2013-07-03 00:00:00+00:00 Central Park S & 6 Ave history 115.0 205.968236 32.572948 <NA> <NA> <NA> 191.32754 59.220766 <NA> <NA> -44.580071 <NA> <NA> 0.0 -85.168527 -5.799709
The output rows are ordered chronologically by the time_series_timestamp column value, then by the start_station_name column value. In time series
forecasting, the prediction
interval, as represented by the prediction_interval_lower_bound and prediction_interval_upper_bound column values, is as important as the forecast_value column value. The forecast_value value is the middle point
of the prediction interval. The prediction interval depends on the standard_error and confidence_level column values.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this
tutorial, either delete the project that contains the resources, or keep the project and
delete the individual resources.
You can delete the project you created.
Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you
prefer to reuse the project, you can delete the dataset you created in this
tutorial:
If necessary, open the BigQuery page in the
Google Cloud console.
Last updated 2026-03-19 UTC.