Work with time series data
This document describes how to use SQL functions to support time series analysis.
Introduction
A time series is a sequence of data points, each consisting of a time and a value associated with that time. Usually, a time series also has an identifier, which uniquely names the time series.
In relational databases, a time series is modeled as a table with the following groups of columns:
- Time column
- Optionally, one or more partitioning columns, for example, zip code
- One or more value columns, or a STRUCT type combining multiple values, for example, temperature and AQI
The following is an example of time series data modeled as a table (a few rows from the dataset used throughout this document):

+----------+---------------------+-----+-------------+
| zip_code |        time         | aqi | temperature |
+----------+---------------------+-----+-------------+
|    60606 | 2020-09-08 00:30:51 |  22 |          66 |
|    60606 | 2020-09-08 01:32:10 |  23 |          63 |
|    94105 | 2020-09-08 00:07:11 |  60 |          74 |
|    94105 | 2020-09-08 01:07:24 |  61 |          73 |
+----------+---------------------+-----+-------------+

Here, time is the time column, zip_code is a partitioning column, and aqi and temperature are the value columns.
Aggregate a time series
In time series analysis, time aggregation is an aggregation performed along the time axis.
You can perform time aggregation in BigQuery with the help of the time bucketing functions (TIMESTAMP_BUCKET, DATE_BUCKET, and DATETIME_BUCKET).
Time bucketing functions map input time values to the bucket they belong to.
Typically, time aggregation is performed to combine multiple data points in a time window into a single data point, using an aggregation function such as AVG, MIN, MAX, COUNT, or SUM. Examples include the 15-minute average request latency, daily minimum and maximum temperatures, and the daily number of taxi trips.
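As a minimal illustration (this query is not part of the sample dataset created below), the following shows how a bucketing function maps a single timestamp to the start of the 3-hour bucket that contains it, using the function's default origin:

SELECT
  TIMESTAMP_BUCKET(TIMESTAMP '2020-09-08 04:33:05', INTERVAL 3 HOUR) AS bucket_start;

 /*---------------------+
 |     bucket_start     |
 +---------------------+
 | 2020-09-08 03:00:00 |
 +---------------------*/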
For the queries in this section, create a table called mydataset.environmental_data_hourly:
CREATE OR REPLACE TABLE mydataset.environmental_data_hourly AS
SELECT * FROM UNNEST(
  ARRAY<STRUCT<zip_code INT64, time TIMESTAMP, aqi INT64, temperature INT64>>[
    STRUCT(60606, TIMESTAMP '2020-09-08 00:30:51', 22, 66),
    STRUCT(60606, TIMESTAMP '2020-09-08 01:32:10', 23, 63),
    STRUCT(60606, TIMESTAMP '2020-09-08 02:30:35', 22, 60),
    STRUCT(60606, TIMESTAMP '2020-09-08 03:29:39', 21, 58),
    STRUCT(60606, TIMESTAMP '2020-09-08 04:33:05', 21, 59),
    STRUCT(60606, TIMESTAMP '2020-09-08 05:32:01', 21, 57),
    STRUCT(60606, TIMESTAMP '2020-09-08 06:31:14', 22, 56),
    STRUCT(60606, TIMESTAMP '2020-09-08 07:31:06', 28, 55),
    STRUCT(60606, TIMESTAMP '2020-09-08 08:29:59', 30, 55),
    STRUCT(60606, TIMESTAMP '2020-09-08 09:29:34', 31, 55),
    STRUCT(60606, TIMESTAMP '2020-09-08 10:31:24', 38, 56),
    STRUCT(60606, TIMESTAMP '2020-09-08 11:31:24', 38, 56),
    STRUCT(60606, TIMESTAMP '2020-09-08 12:32:38', 38, 57),
    STRUCT(60606, TIMESTAMP '2020-09-08 13:29:59', 38, 56),
    STRUCT(60606, TIMESTAMP '2020-09-08 14:31:22', 43, 59),
    STRUCT(60606, TIMESTAMP '2020-09-08 15:31:38', 42, 63),
    STRUCT(60606, TIMESTAMP '2020-09-08 16:34:22', 43, 65),
    STRUCT(60606, TIMESTAMP '2020-09-08 17:33:23', 42, 68),
    STRUCT(60606, TIMESTAMP '2020-09-08 18:28:47', 36, 69),
    STRUCT(60606, TIMESTAMP '2020-09-08 19:30:28', 34, 67),
    STRUCT(60606, TIMESTAMP '2020-09-08 20:30:53', 29, 67),
    STRUCT(60606, TIMESTAMP '2020-09-08 21:32:28', 27, 67),
    STRUCT(60606, TIMESTAMP '2020-09-08 22:31:45', 25, 65),
    STRUCT(60606, TIMESTAMP '2020-09-08 23:31:02', 22, 63),
    STRUCT(94105, TIMESTAMP '2020-09-08 00:07:11', 60, 74),
    STRUCT(94105, TIMESTAMP '2020-09-08 01:07:24', 61, 73),
    STRUCT(94105, TIMESTAMP '2020-09-08 02:08:07', 60, 71),
    STRUCT(94105, TIMESTAMP '2020-09-08 03:11:05', 69, 69),
    STRUCT(94105, TIMESTAMP '2020-09-08 04:07:26', 72, 67),
    STRUCT(94105, TIMESTAMP '2020-09-08 05:08:11', 70, 66),
    STRUCT(94105, TIMESTAMP '2020-09-08 06:07:30', 68, 65),
    STRUCT(94105, TIMESTAMP '2020-09-08 07:07:10', 77, 64),
    STRUCT(94105, TIMESTAMP '2020-09-08 08:06:35', 81, 64),
    STRUCT(94105, TIMESTAMP '2020-09-08 09:10:18', 82, 63),
    STRUCT(94105, TIMESTAMP '2020-09-08 10:08:10', 107, 62),
    STRUCT(94105, TIMESTAMP '2020-09-08 11:08:01', 115, 62),
    STRUCT(94105, TIMESTAMP '2020-09-08 12:07:39', 120, 62),
    STRUCT(94105, TIMESTAMP '2020-09-08 13:06:03', 125, 61),
    STRUCT(94105, TIMESTAMP '2020-09-08 14:08:37', 129, 62),
    STRUCT(94105, TIMESTAMP '2020-09-08 15:09:19', 150, 62),
    STRUCT(94105, TIMESTAMP '2020-09-08 16:06:39', 151, 62),
    STRUCT(94105, TIMESTAMP '2020-09-08 17:08:01', 155, 63),
    STRUCT(94105, TIMESTAMP '2020-09-08 18:09:23', 154, 64),
    STRUCT(94105, TIMESTAMP '2020-09-08 19:08:43', 151, 67),
    STRUCT(94105, TIMESTAMP '2020-09-08 20:07:19', 150, 69),
    STRUCT(94105, TIMESTAMP '2020-09-08 21:07:37', 148, 72),
    STRUCT(94105, TIMESTAMP '2020-09-08 22:08:01', 143, 76),
    STRUCT(94105, TIMESTAMP '2020-09-08 23:08:41', 137, 75)
]);
 
 
One interesting observation about the preceding data is that measurements are taken at arbitrary times; a time series like this is known as an unaligned time series. Aggregation is one of the ways to align a time series.
Get a 3-hour average
The following query computes a 3-hour average air quality index (AQI) and temperature for each zip code. The TIMESTAMP_BUCKET function performs the time aggregation by assigning each time value to a particular 3-hour bucket.
SELECT
  TIMESTAMP_BUCKET(time, INTERVAL 3 HOUR) AS time,
  zip_code,
  CAST(AVG(aqi) AS INT64) AS aqi,
  CAST(AVG(temperature) AS INT64) AS temperature
FROM mydataset.environmental_data_hourly
GROUP BY zip_code, time
ORDER BY zip_code, time;
 /*---------------------+----------+-----+-------------+ 
 |        time         | zip_code | aqi | temperature | 
 +---------------------+----------+-----+-------------+ 
 | 2020-09-08 00:00:00 |    60606 |  22 |          63 | 
 | 2020-09-08 03:00:00 |    60606 |  21 |          58 | 
 | 2020-09-08 06:00:00 |    60606 |  27 |          55 | 
 | 2020-09-08 09:00:00 |    60606 |  36 |          56 | 
 | 2020-09-08 12:00:00 |    60606 |  40 |          57 | 
 | 2020-09-08 15:00:00 |    60606 |  42 |          65 | 
 | 2020-09-08 18:00:00 |    60606 |  33 |          68 | 
 | 2020-09-08 21:00:00 |    60606 |  25 |          65 | 
 | 2020-09-08 00:00:00 |    94105 |  60 |          73 | 
 | 2020-09-08 03:00:00 |    94105 |  70 |          67 | 
 | 2020-09-08 06:00:00 |    94105 |  75 |          64 | 
 | 2020-09-08 09:00:00 |    94105 | 101 |          62 | 
 | 2020-09-08 12:00:00 |    94105 | 125 |          62 | 
 | 2020-09-08 15:00:00 |    94105 | 152 |          62 | 
 | 2020-09-08 18:00:00 |    94105 | 152 |          67 | 
 | 2020-09-08 21:00:00 |    94105 | 143 |          74 | 
 +---------------------+----------+-----+-------------*/ 
 
 
Get a 3-hour minimum and maximum value
In the following query, you compute 3-hour minimum and maximum temperatures for each zip code:
SELECT
  TIMESTAMP_BUCKET(time, INTERVAL 3 HOUR) AS time,
  zip_code,
  MIN(temperature) AS temperature_min,
  MAX(temperature) AS temperature_max
FROM mydataset.environmental_data_hourly
GROUP BY zip_code, time
ORDER BY zip_code, time;
 /*---------------------+----------+-----------------+-----------------+ 
 |        time         | zip_code | temperature_min | temperature_max | 
 +---------------------+----------+-----------------+-----------------+ 
 | 2020-09-08 00:00:00 |    60606 |              60 |              66 | 
 | 2020-09-08 03:00:00 |    60606 |              57 |              59 | 
 | 2020-09-08 06:00:00 |    60606 |              55 |              56 | 
 | 2020-09-08 09:00:00 |    60606 |              55 |              56 | 
 | 2020-09-08 12:00:00 |    60606 |              56 |              59 | 
 | 2020-09-08 15:00:00 |    60606 |              63 |              68 | 
 | 2020-09-08 18:00:00 |    60606 |              67 |              69 | 
 | 2020-09-08 21:00:00 |    60606 |              63 |              67 | 
 | 2020-09-08 00:00:00 |    94105 |              71 |              74 | 
 | 2020-09-08 03:00:00 |    94105 |              66 |              69 | 
 | 2020-09-08 06:00:00 |    94105 |              64 |              65 | 
 | 2020-09-08 09:00:00 |    94105 |              62 |              63 | 
 | 2020-09-08 12:00:00 |    94105 |              61 |              62 | 
 | 2020-09-08 15:00:00 |    94105 |              62 |              63 | 
 | 2020-09-08 18:00:00 |    94105 |              64 |              69 | 
 | 2020-09-08 21:00:00 |    94105 |              72 |              76 | 
 +---------------------+----------+-----------------+-----------------*/ 
 
 
Get a 3-hour average with custom alignment
When you perform time series aggregation, you use a specific alignment for time series windows, either implicitly or explicitly. The preceding queries used implicit alignment, which produced buckets that started at times like 00:00:00, 03:00:00, and 06:00:00. To explicitly set this alignment in the TIMESTAMP_BUCKET function, pass an optional argument that specifies the origin.
In the following query, the origin is set as 2020-01-01 02:00:00. This changes the alignment and produces buckets that start at times like 02:00:00, 05:00:00, and 08:00:00:
SELECT
  TIMESTAMP_BUCKET(time, INTERVAL 3 HOUR, TIMESTAMP '2020-01-01 02:00:00') AS time,
  zip_code,
  CAST(AVG(aqi) AS INT64) AS aqi,
  CAST(AVG(temperature) AS INT64) AS temperature
FROM mydataset.environmental_data_hourly
GROUP BY zip_code, time
ORDER BY zip_code, time;
 /*---------------------+----------+-----+-------------+ 
 |        time         | zip_code | aqi | temperature | 
 +---------------------+----------+-----+-------------+ 
 | 2020-09-07 23:00:00 |    60606 |  23 |          65 | 
 | 2020-09-08 02:00:00 |    60606 |  21 |          59 | 
 | 2020-09-08 05:00:00 |    60606 |  24 |          56 | 
 | 2020-09-08 08:00:00 |    60606 |  33 |          55 | 
 | 2020-09-08 11:00:00 |    60606 |  38 |          56 | 
 | 2020-09-08 14:00:00 |    60606 |  43 |          62 | 
 | 2020-09-08 17:00:00 |    60606 |  37 |          68 | 
 | 2020-09-08 20:00:00 |    60606 |  27 |          66 | 
 | 2020-09-08 23:00:00 |    60606 |  22 |          63 | 
 | 2020-09-07 23:00:00 |    94105 |  61 |          74 | 
 | 2020-09-08 02:00:00 |    94105 |  67 |          69 | 
 | 2020-09-08 05:00:00 |    94105 |  72 |          65 | 
 | 2020-09-08 08:00:00 |    94105 |  90 |          63 | 
 | 2020-09-08 11:00:00 |    94105 | 120 |          62 | 
 | 2020-09-08 14:00:00 |    94105 | 143 |          62 | 
 | 2020-09-08 17:00:00 |    94105 | 153 |          65 | 
 | 2020-09-08 20:00:00 |    94105 | 147 |          72 | 
 | 2020-09-08 23:00:00 |    94105 | 137 |          75 | 
 +---------------------+----------+-----+-------------*/ 
 
 
Aggregate a time series with gap filling
Sometimes after you aggregate a time series, the data has gaps that need to be filled with some values for further analysis or presentation of the data. The technique used to fill those gaps is called gap filling. In BigQuery, you can use the GAP_FILL table function to fill gaps in time series data, using one of the provided gap-filling methods:
- NULL, also known as constant
- LOCF, last observation carried forward
- Linear, linear interpolation between the two neighboring data points
For the queries in this section, create a table called mydataset.environmental_data_hourly_with_gaps, which is based on the data used in the preceding section, but with gaps in it. In real-world scenarios, data could have missing data points due to a short-term weather station malfunction.
CREATE OR REPLACE TABLE mydataset.environmental_data_hourly_with_gaps AS
SELECT * FROM UNNEST(
  ARRAY<STRUCT<zip_code INT64, time TIMESTAMP, aqi INT64, temperature INT64>>[
    STRUCT(60606, TIMESTAMP '2020-09-08 00:30:51', 22, 66),
    STRUCT(60606, TIMESTAMP '2020-09-08 01:32:10', 23, 63),
    STRUCT(60606, TIMESTAMP '2020-09-08 02:30:35', 22, 60),
    STRUCT(60606, TIMESTAMP '2020-09-08 03:29:39', 21, 58),
    STRUCT(60606, TIMESTAMP '2020-09-08 04:33:05', 21, 59),
    STRUCT(60606, TIMESTAMP '2020-09-08 05:32:01', 21, 57),
    STRUCT(60606, TIMESTAMP '2020-09-08 06:31:14', 22, 56),
    STRUCT(60606, TIMESTAMP '2020-09-08 07:31:06', 28, 55),
    STRUCT(60606, TIMESTAMP '2020-09-08 08:29:59', 30, 55),
    STRUCT(60606, TIMESTAMP '2020-09-08 09:29:34', 31, 55),
    STRUCT(60606, TIMESTAMP '2020-09-08 10:31:24', 38, 56),
    STRUCT(60606, TIMESTAMP '2020-09-08 11:31:24', 38, 56),
    -- No data points between hours 12 and 15.
    STRUCT(60606, TIMESTAMP '2020-09-08 16:34:22', 43, 65),
    STRUCT(60606, TIMESTAMP '2020-09-08 17:33:23', 42, 68),
    STRUCT(60606, TIMESTAMP '2020-09-08 18:28:47', 36, 69),
    STRUCT(60606, TIMESTAMP '2020-09-08 19:30:28', 34, 67),
    STRUCT(60606, TIMESTAMP '2020-09-08 20:30:53', 29, 67),
    STRUCT(60606, TIMESTAMP '2020-09-08 21:32:28', 27, 67),
    STRUCT(60606, TIMESTAMP '2020-09-08 22:31:45', 25, 65),
    STRUCT(60606, TIMESTAMP '2020-09-08 23:31:02', 22, 63),
    STRUCT(94105, TIMESTAMP '2020-09-08 00:07:11', 60, 74),
    STRUCT(94105, TIMESTAMP '2020-09-08 01:07:24', 61, 73),
    STRUCT(94105, TIMESTAMP '2020-09-08 02:08:07', 60, 71),
    STRUCT(94105, TIMESTAMP '2020-09-08 03:11:05', 69, 69),
    STRUCT(94105, TIMESTAMP '2020-09-08 04:07:26', 72, 67),
    STRUCT(94105, TIMESTAMP '2020-09-08 05:08:11', 70, 66),
    STRUCT(94105, TIMESTAMP '2020-09-08 06:07:30', 68, 65),
    STRUCT(94105, TIMESTAMP '2020-09-08 07:07:10', 77, 64),
    STRUCT(94105, TIMESTAMP '2020-09-08 08:06:35', 81, 64),
    STRUCT(94105, TIMESTAMP '2020-09-08 09:10:18', 82, 63),
    STRUCT(94105, TIMESTAMP '2020-09-08 10:08:10', 107, 62),
    STRUCT(94105, TIMESTAMP '2020-09-08 11:08:01', 115, 62),
    STRUCT(94105, TIMESTAMP '2020-09-08 12:07:39', 120, 62),
    STRUCT(94105, TIMESTAMP '2020-09-08 13:06:03', 125, 61),
    STRUCT(94105, TIMESTAMP '2020-09-08 14:08:37', 129, 62),
    -- No data points between hours 15 and 18.
    STRUCT(94105, TIMESTAMP '2020-09-08 19:08:43', 151, 67),
    STRUCT(94105, TIMESTAMP '2020-09-08 20:07:19', 150, 69),
    STRUCT(94105, TIMESTAMP '2020-09-08 21:07:37', 148, 72),
    STRUCT(94105, TIMESTAMP '2020-09-08 22:08:01', 143, 76),
    STRUCT(94105, TIMESTAMP '2020-09-08 23:08:41', 137, 75)
]);
 
 
Get a 3-hour average (include gaps)
The following query computes 3-hour average AQI and temperature for each zip code:
SELECT
  TIMESTAMP_BUCKET(time, INTERVAL 3 HOUR) AS time,
  zip_code,
  CAST(AVG(aqi) AS INT64) AS aqi,
  CAST(AVG(temperature) AS INT64) AS temperature
FROM mydataset.environmental_data_hourly_with_gaps
GROUP BY zip_code, time
ORDER BY zip_code, time;
 /*---------------------+----------+-----+-------------+ 
 |        time         | zip_code | aqi | temperature | 
 +---------------------+----------+-----+-------------+ 
 | 2020-09-08 00:00:00 |    60606 |  22 |          63 | 
 | 2020-09-08 03:00:00 |    60606 |  21 |          58 | 
 | 2020-09-08 06:00:00 |    60606 |  27 |          55 | 
 | 2020-09-08 09:00:00 |    60606 |  36 |          56 | 
 | 2020-09-08 15:00:00 |    60606 |  43 |          67 | 
 | 2020-09-08 18:00:00 |    60606 |  33 |          68 | 
 | 2020-09-08 21:00:00 |    60606 |  25 |          65 | 
 | 2020-09-08 00:00:00 |    94105 |  60 |          73 | 
 | 2020-09-08 03:00:00 |    94105 |  70 |          67 | 
 | 2020-09-08 06:00:00 |    94105 |  75 |          64 | 
 | 2020-09-08 09:00:00 |    94105 | 101 |          62 | 
 | 2020-09-08 12:00:00 |    94105 | 125 |          62 | 
 | 2020-09-08 18:00:00 |    94105 | 151 |          68 | 
 | 2020-09-08 21:00:00 |    94105 | 143 |          74 | 
 +---------------------+----------+-----+-------------*/ 
 
 
Note how the output has gaps at certain time intervals. For example, the time series for zip code 60606 doesn't have a data point at 2020-09-08 12:00:00, and the time series for zip code 94105 doesn't have a data point at 2020-09-08 15:00:00.
Get a 3-hour average (fill gaps)
Use the query from the previous section and add the GAP_FILL function to fill the gaps:
WITH aggregated_3_hr AS (
  SELECT
    TIMESTAMP_BUCKET(time, INTERVAL 3 HOUR) AS time,
    zip_code,
    CAST(AVG(aqi) AS INT64) AS aqi,
    CAST(AVG(temperature) AS INT64) AS temperature
  FROM mydataset.environmental_data_hourly_with_gaps
  GROUP BY zip_code, time)
SELECT *
FROM GAP_FILL(
  TABLE aggregated_3_hr,
  ts_column => 'time',
  bucket_width => INTERVAL 3 HOUR,
  partitioning_columns => ['zip_code']
)
ORDER BY zip_code, time;
 /*---------------------+----------+------+-------------+ 
 |        time         | zip_code | aqi  | temperature | 
 +---------------------+----------+------+-------------+ 
 | 2020-09-08 00:00:00 |    60606 |   22 |          63 | 
 | 2020-09-08 03:00:00 |    60606 |   21 |          58 | 
 | 2020-09-08 06:00:00 |    60606 |   27 |          55 | 
 | 2020-09-08 09:00:00 |    60606 |   36 |          56 | 
 | 2020-09-08 12:00:00 |    60606 | NULL |        NULL | 
 | 2020-09-08 15:00:00 |    60606 |   43 |          67 | 
 | 2020-09-08 18:00:00 |    60606 |   33 |          68 | 
 | 2020-09-08 21:00:00 |    60606 |   25 |          65 | 
 | 2020-09-08 00:00:00 |    94105 |   60 |          73 | 
 | 2020-09-08 03:00:00 |    94105 |   70 |          67 | 
 | 2020-09-08 06:00:00 |    94105 |   75 |          64 | 
 | 2020-09-08 09:00:00 |    94105 |  101 |          62 | 
 | 2020-09-08 12:00:00 |    94105 |  125 |          62 | 
 | 2020-09-08 15:00:00 |    94105 | NULL |        NULL | 
 | 2020-09-08 18:00:00 |    94105 |  151 |          68 | 
 | 2020-09-08 21:00:00 |    94105 |  143 |          74 | 
 +---------------------+----------+------+-------------*/ 
 
 
The output table now contains the previously missing rows at 2020-09-08 12:00:00 for zip code 60606 and at 2020-09-08 15:00:00 for zip code 94105, with NULL values in the corresponding metric columns. Because you didn't specify a gap-filling method, GAP_FILL used the default gap-filling method, NULL.
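If you prefer to spell out the default, you can pass the value_columns argument with the NULL (constant) method for each metric column. The following is a minimal sketch that replaces only the final SELECT of the previous query (it must still run inside the same WITH aggregated_3_hr clause, and it assumes the method name 'null', matching the method list earlier in this section); the result is the same as before:

SELECT *
FROM GAP_FILL(
  TABLE aggregated_3_hr,
  ts_column => 'time',
  bucket_width => INTERVAL 3 HOUR,
  partitioning_columns => ['zip_code'],
  -- Explicitly request the NULL (constant) gap-filling method for each value column.
  value_columns => [
    ('aqi', 'null'),
    ('temperature', 'null')
  ]
)
ORDER BY zip_code, time;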
Fill gaps with linear and LOCF gap filling
In the following query, the GAP_FILL function is used with the LOCF gap-filling method for the aqi column and linear interpolation for the temperature column:
WITH aggregated_3_hr AS (
  SELECT
    TIMESTAMP_BUCKET(time, INTERVAL 3 HOUR) AS time,
    zip_code,
    CAST(AVG(aqi) AS INT64) AS aqi,
    CAST(AVG(temperature) AS INT64) AS temperature
  FROM mydataset.environmental_data_hourly_with_gaps
  GROUP BY zip_code, time)
SELECT *
FROM GAP_FILL(
  TABLE aggregated_3_hr,
  ts_column => 'time',
  bucket_width => INTERVAL 3 HOUR,
  partitioning_columns => ['zip_code'],
  value_columns => [
    ('aqi', 'locf'),
    ('temperature', 'linear')
  ]
)
ORDER BY zip_code, time;
 /*---------------------+----------+-----+-------------+ 
 |        time         | zip_code | aqi | temperature | 
 +---------------------+----------+-----+-------------+ 
 | 2020-09-08 00:00:00 |    60606 |  22 |          63 | 
 | 2020-09-08 03:00:00 |    60606 |  21 |          58 | 
 | 2020-09-08 06:00:00 |    60606 |  27 |          55 | 
 | 2020-09-08 09:00:00 |    60606 |  36 |          56 | 
 | 2020-09-08 12:00:00 |    60606 |  36 |          62 | 
 | 2020-09-08 15:00:00 |    60606 |  43 |          67 | 
 | 2020-09-08 18:00:00 |    60606 |  33 |          68 | 
 | 2020-09-08 21:00:00 |    60606 |  25 |          65 | 
 | 2020-09-08 00:00:00 |    94105 |  60 |          73 | 
 | 2020-09-08 03:00:00 |    94105 |  70 |          67 | 
 | 2020-09-08 06:00:00 |    94105 |  75 |          64 | 
 | 2020-09-08 09:00:00 |    94105 | 101 |          62 | 
 | 2020-09-08 12:00:00 |    94105 | 125 |          62 | 
 | 2020-09-08 15:00:00 |    94105 | 125 |          65 | 
 | 2020-09-08 18:00:00 |    94105 | 151 |          68 | 
 | 2020-09-08 21:00:00 |    94105 | 143 |          74 | 
 +---------------------+----------+-----+-------------*/ 
 
 
In this query, the first gap-filled row has aqi value 36, which is taken from the previous data point of this time series (zip code 60606) at 2020-09-08 09:00:00. The temperature value 62 is the result of linear interpolation between the data points at 2020-09-08 09:00:00 and 2020-09-08 15:00:00 (halfway between 56 and 67). The other missing row was created in a similar way: the aqi value 125 was carried over from the previous data point of this time series (zip code 94105), and the temperature value 65 is the result of linear interpolation between the previous and the next available data points (halfway between 62 and 68).
Align a time series with gap filling
Time series can be aligned or unaligned. A time series is aligned when data points only occur at regular intervals.
In the real world, at the time of collection, time series are rarely aligned and usually require some further processing to align them.
For example, consider IoT devices that send their metrics to a centralized
collector every minute. It would be unreasonable to expect the devices to send
their metrics at exactly the same instants of time. Usually, each device sends
its metrics with the same frequency (period) but with different time offset
(alignment). Each device might send its data at a one-minute interval, with some instances of missing data (Device 3 at 9:36:39) and delayed data (Device 1 at 9:37:28).

You can perform time series alignment on unaligned data, using time aggregation. This is helpful if you want to change the sampling period of the time series, such as changing from the original 1-minute sampling period to a 15-minute period. You can align data for further time series processing, such as joining the time series data, or for display purposes (such as graphing).
You can use the GAP_FILL table function with the LOCF or linear gap-filling methods to perform time series alignment. The idea is to use GAP_FILL with the selected output period and alignment (controlled by the optional origin argument). The result of the operation is a table with aligned time series, where values for each data point are derived from the input time series with the gap-filling method used for that particular value column (LOCF or linear).
Create a table mydataset.device_data that captures this example:
CREATE OR REPLACE TABLE mydataset.device_data AS
SELECT * FROM UNNEST(
  ARRAY<STRUCT<device_id INT64, time TIMESTAMP, signal INT64, state STRING>>[
    STRUCT(2, TIMESTAMP '2023-11-01 09:35:07', 87, 'ACTIVE'),
    STRUCT(1, TIMESTAMP '2023-11-01 09:35:26', 82, 'ACTIVE'),
    STRUCT(3, TIMESTAMP '2023-11-01 09:35:39', 74, 'INACTIVE'),
    STRUCT(2, TIMESTAMP '2023-11-01 09:36:07', 88, 'ACTIVE'),
    STRUCT(1, TIMESTAMP '2023-11-01 09:36:26', 82, 'ACTIVE'),
    STRUCT(2, TIMESTAMP '2023-11-01 09:37:07', 88, 'ACTIVE'),
    STRUCT(1, TIMESTAMP '2023-11-01 09:37:28', 80, 'ACTIVE'),
    STRUCT(3, TIMESTAMP '2023-11-01 09:37:39', 77, 'ACTIVE'),
    STRUCT(2, TIMESTAMP '2023-11-01 09:38:07', 86, 'ACTIVE'),
    STRUCT(1, TIMESTAMP '2023-11-01 09:38:26', 81, 'ACTIVE'),
    STRUCT(3, TIMESTAMP '2023-11-01 09:38:39', 77, 'ACTIVE')
]);
 
 
The following is the actual data ordered by the time and device_id columns:
SELECT * FROM mydataset.device_data ORDER BY time, device_id;
 /*-----------+---------------------+--------+----------+ 
 | device_id |        time         | signal |  state   | 
 +-----------+---------------------+--------+----------+ 
 |         2 | 2023-11-01 09:35:07 |     87 | ACTIVE   | 
 |         1 | 2023-11-01 09:35:26 |     82 | ACTIVE   | 
 |         3 | 2023-11-01 09:35:39 |     74 | INACTIVE | 
 |         2 | 2023-11-01 09:36:07 |     88 | ACTIVE   | 
 |         1 | 2023-11-01 09:36:26 |     82 | ACTIVE   | 
 |         2 | 2023-11-01 09:37:07 |     88 | ACTIVE   | 
 |         1 | 2023-11-01 09:37:28 |     80 | ACTIVE   | 
 |         3 | 2023-11-01 09:37:39 |     77 | ACTIVE   | 
 |         2 | 2023-11-01 09:38:07 |     86 | ACTIVE   | 
 |         1 | 2023-11-01 09:38:26 |     81 | ACTIVE   | 
 |         3 | 2023-11-01 09:38:39 |     77 | ACTIVE   | 
 +-----------+---------------------+--------+----------*/ 
 
 
The table contains the time series for each device with two metric columns:
- signal: the signal level as observed by the device at the time of sampling, represented as an integer value between 0 and 100.
- state: the state of the device at the time of sampling, represented as a free-form string.
In the following query, the GAP_FILL function is used to align the time series at 1-minute intervals. Note how linear interpolation is used to compute values for the signal column and LOCF for the state column. For this example data, linear interpolation is a suitable choice to compute the output values.
SELECT *
FROM GAP_FILL(
  TABLE mydataset.device_data,
  ts_column => 'time',
  bucket_width => INTERVAL 1 MINUTE,
  partitioning_columns => ['device_id'],
  value_columns => [
    ('signal', 'linear'),
    ('state', 'locf')
  ]
)
ORDER BY time, device_id;
  
 /*---------------------+-----------+--------+----------+ 
 |        time         | device_id | signal |  state   | 
 +---------------------+-----------+--------+----------+ 
 | 2023-11-01 09:36:00 |         1 |     82 | ACTIVE   | 
 | 2023-11-01 09:36:00 |         2 |     88 | ACTIVE   | 
 | 2023-11-01 09:36:00 |         3 |     75 | INACTIVE | 
 | 2023-11-01 09:37:00 |         1 |     81 | ACTIVE   | 
 | 2023-11-01 09:37:00 |         2 |     88 | ACTIVE   | 
 | 2023-11-01 09:37:00 |         3 |     76 | INACTIVE | 
 | 2023-11-01 09:38:00 |         1 |     81 | ACTIVE   | 
 | 2023-11-01 09:38:00 |         2 |     86 | ACTIVE   | 
 | 2023-11-01 09:38:00 |         3 |     77 | ACTIVE   | 
 +---------------------+-----------+--------+----------*/ 
 
 
The output table contains an aligned time series for each device, with the value columns (signal and state) computed using the gap-filling methods specified in the function call.
Join time series data
You can join time series data using a windowed join or an AS OF join.
Windowed join
Sometimes you need to join two or more tables with time series data. Consider the following two tables:
- mydataset.sensor_temperatures contains temperature data reported by each sensor every 15 seconds.
- mydataset.sensor_fuel_rates contains the fuel consumption rate measured by each sensor every 15 seconds.
To create these tables, run the following queries:
CREATE OR REPLACE TABLE mydataset.sensor_temperatures AS
SELECT * FROM UNNEST(
  ARRAY<STRUCT<sensor_id INT64, ts TIMESTAMP, temp FLOAT64>>[
    (1, TIMESTAMP '2020-01-01 12:00:00.063', 37.1),
    (1, TIMESTAMP '2020-01-01 12:00:15.024', 37.2),
    (1, TIMESTAMP '2020-01-01 12:00:30.032', 37.3),
    (2, TIMESTAMP '2020-01-01 12:00:01.001', 38.1),
    (2, TIMESTAMP '2020-01-01 12:00:15.082', 38.2),
    (2, TIMESTAMP '2020-01-01 12:00:31.009', 38.3)
]);

CREATE OR REPLACE TABLE mydataset.sensor_fuel_rates AS
SELECT * FROM UNNEST(
  ARRAY<STRUCT<sensor_id INT64, ts TIMESTAMP, rate FLOAT64>>[
    (1, TIMESTAMP '2020-01-01 12:00:11.016', 10.1),
    (1, TIMESTAMP '2020-01-01 12:00:26.015', 10.2),
    (1, TIMESTAMP '2020-01-01 12:00:41.014', 10.3),
    (2, TIMESTAMP '2020-01-01 12:00:08.099', 11.1),
    (2, TIMESTAMP '2020-01-01 12:00:23.087', 11.2),
    (2, TIMESTAMP '2020-01-01 12:00:38.077', 11.3)
]);
 
 
The following is the actual data from the tables:
SELECT * FROM mydataset.sensor_temperatures ORDER BY sensor_id, ts;
  
 /*-----------+---------------------+------+ 
 | sensor_id |         ts          | temp | 
 +-----------+---------------------+------+ 
 |         1 | 2020-01-01 12:00:00 | 37.1 | 
 |         1 | 2020-01-01 12:00:15 | 37.2 | 
 |         1 | 2020-01-01 12:00:30 | 37.3 | 
 |         2 | 2020-01-01 12:00:01 | 38.1 | 
 |         2 | 2020-01-01 12:00:15 | 38.2 | 
 |         2 | 2020-01-01 12:00:31 | 38.3 | 
 +-----------+---------------------+------*/ 
SELECT * FROM mydataset.sensor_fuel_rates ORDER BY sensor_id, ts;
  
 /*-----------+---------------------+------+ 
 | sensor_id |         ts          | rate | 
 +-----------+---------------------+------+ 
 |         1 | 2020-01-01 12:00:11 | 10.1 | 
 |         1 | 2020-01-01 12:00:26 | 10.2 | 
 |         1 | 2020-01-01 12:00:41 | 10.3 | 
 |         2 | 2020-01-01 12:00:08 | 11.1 | 
 |         2 | 2020-01-01 12:00:23 | 11.2 | 
 |         2 | 2020-01-01 12:00:38 | 11.3 | 
 +-----------+---------------------+------*/ 
 
 
To check the fuel consumption rate at the temperature reported by each sensor, you can join the two time series.
Although the data in the two time series is unaligned, it is sampled at the same interval (15 seconds), so such data is a good candidate for a windowed join. Use the time bucketing functions to align the timestamps used as join keys.
The following queries illustrate how each timestamp can be assigned to a 15-second window using the TIMESTAMP_BUCKET function:
SELECT *, TIMESTAMP_BUCKET(ts, INTERVAL 15 SECOND) AS ts_window
FROM mydataset.sensor_temperatures
ORDER BY sensor_id, ts;
 /*-----------+---------------------+------+---------------------+ 
 | sensor_id |         ts          | temp |      ts_window      | 
 +-----------+---------------------+------+---------------------+ 
 |         1 | 2020-01-01 12:00:00 | 37.1 | 2020-01-01 12:00:00 | 
 |         1 | 2020-01-01 12:00:15 | 37.2 | 2020-01-01 12:00:15 | 
 |         1 | 2020-01-01 12:00:30 | 37.3 | 2020-01-01 12:00:30 | 
 |         2 | 2020-01-01 12:00:01 | 38.1 | 2020-01-01 12:00:00 | 
 |         2 | 2020-01-01 12:00:15 | 38.2 | 2020-01-01 12:00:15 | 
 |         2 | 2020-01-01 12:00:31 | 38.3 | 2020-01-01 12:00:30 | 
 +-----------+---------------------+------+---------------------*/ 
SELECT *, TIMESTAMP_BUCKET(ts, INTERVAL 15 SECOND) AS ts_window
FROM mydataset.sensor_fuel_rates
ORDER BY sensor_id, ts;
 /*-----------+---------------------+------+---------------------+ 
 | sensor_id |         ts          | rate |      ts_window      | 
 +-----------+---------------------+------+---------------------+ 
 |         1 | 2020-01-01 12:00:11 | 10.1 | 2020-01-01 12:00:00 | 
 |         1 | 2020-01-01 12:00:26 | 10.2 | 2020-01-01 12:00:15 | 
 |         1 | 2020-01-01 12:00:41 | 10.3 | 2020-01-01 12:00:30 | 
 |         2 | 2020-01-01 12:00:08 | 11.1 | 2020-01-01 12:00:00 | 
 |         2 | 2020-01-01 12:00:23 | 11.2 | 2020-01-01 12:00:15 | 
 |         2 | 2020-01-01 12:00:38 | 11.3 | 2020-01-01 12:00:30 | 
 +-----------+---------------------+------+---------------------*/ 
 
 
You can use this concept to join the fuel consumption rate data with the temperature reported by each sensor:
SELECT
  t1.sensor_id AS sensor_id,
  t1.ts AS temp_ts,
  t1.temp AS temp,
  t2.ts AS rate_ts,
  t2.rate AS rate
FROM mydataset.sensor_temperatures t1
LEFT JOIN mydataset.sensor_fuel_rates t2
  ON TIMESTAMP_BUCKET(t1.ts, INTERVAL 15 SECOND) =
     TIMESTAMP_BUCKET(t2.ts, INTERVAL 15 SECOND)
  AND t1.sensor_id = t2.sensor_id
ORDER BY sensor_id, temp_ts;
 /*-----------+---------------------+------+---------------------+------+ 
 | sensor_id |       temp_ts       | temp |       rate_ts       | rate | 
 +-----------+---------------------+------+---------------------+------+ 
 |         1 | 2020-01-01 12:00:00 | 37.1 | 2020-01-01 12:00:11 | 10.1 | 
 |         1 | 2020-01-01 12:00:15 | 37.2 | 2020-01-01 12:00:26 | 10.2 | 
 |         1 | 2020-01-01 12:00:30 | 37.3 | 2020-01-01 12:00:41 | 10.3 | 
 |         2 | 2020-01-01 12:00:01 | 38.1 | 2020-01-01 12:00:08 | 11.1 | 
 |         2 | 2020-01-01 12:00:15 | 38.2 | 2020-01-01 12:00:23 | 11.2 | 
 |         2 | 2020-01-01 12:00:31 | 38.3 | 2020-01-01 12:00:38 | 11.3 | 
 +-----------+---------------------+------+---------------------+------*/ 
 
 
AS OF join
For this section, use the mydataset.sensor_temperatures table and create a new table, mydataset.sensor_locations.
The mydataset.sensor_temperatures table contains temperature data from different sensors, reported every 15 seconds:
SELECT * FROM mydataset.sensor_temperatures ORDER BY sensor_id, ts;
 /*-----------+---------------------+------+ 
 | sensor_id |         ts          | temp | 
 +-----------+---------------------+------+ 
 |         1 | 2020-01-01 12:00:00 | 37.1 | 
 |         1 | 2020-01-01 12:00:15 | 37.2 | 
 |         1 | 2020-01-01 12:00:30 | 37.3 | 
 |         2 | 2020-01-01 12:00:45 | 38.1 | 
 |         2 | 2020-01-01 12:01:01 | 38.2 | 
 |         2 | 2020-01-01 12:01:15 | 38.3 | 
 +-----------+---------------------+------*/ 
 
 
To create mydataset.sensor_locations, run the following query:
CREATE OR REPLACE TABLE mydataset.sensor_locations AS
SELECT * FROM UNNEST(
  ARRAY<STRUCT<sensor_id INT64, ts TIMESTAMP, location GEOGRAPHY>>[
    (1, TIMESTAMP '2020-01-01 11:59:47.063', ST_GEOGPOINT(-122.022, 37.406)),
    (1, TIMESTAMP '2020-01-01 12:00:08.185', ST_GEOGPOINT(-122.021, 37.407)),
    (1, TIMESTAMP '2020-01-01 12:00:28.032', ST_GEOGPOINT(-122.020, 37.405)),
    (2, TIMESTAMP '2020-01-01 07:28:41.239', ST_GEOGPOINT(-122.390, 37.790))
]);
 /*-----------+---------------------+------------------------+ 
 | sensor_id |         ts          |        location        | 
 +-----------+---------------------+------------------------+ 
 |         1 | 2020-01-01 11:59:47 | POINT(-122.022 37.406) | 
 |         1 | 2020-01-01 12:00:08 | POINT(-122.021 37.407) | 
 |         1 | 2020-01-01 12:00:28 |  POINT(-122.02 37.405) | 
 |         2 | 2020-01-01 07:28:41 |   POINT(-122.39 37.79) | 
 +-----------+---------------------+------------------------*/ 
 
 
Now join data from mydataset.sensor_temperatures with data from mydataset.sensor_locations.
In this scenario, you can't use a windowed join, because the temperature data and the location data are not reported at the same interval.
One way to do this in BigQuery is to transform the timestamp data into a range, using the RANGE data type. The range represents the temporal validity of a row, providing the start and end time for which the row is valid.
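As a minimal illustration of the type, a RANGE value can be constructed directly from a start and end timestamp; the resulting range includes the start boundary and excludes the end boundary:

SELECT
  -- Build a range that is valid from 12:00:00 (inclusive) to 12:00:15 (exclusive).
  RANGE(TIMESTAMP '2020-01-01 12:00:00', TIMESTAMP '2020-01-01 12:00:15') AS ts_range;

 /*--------------------------------------------+
 |                  ts_range                   |
 +--------------------------------------------+
 | [2020-01-01 12:00:00, 2020-01-01 12:00:15) |
 +--------------------------------------------*/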
Use the LEAD window function to find the next data point in the time series, relative to the current data point, which is also the end boundary of the temporal validity of the current row. The following query demonstrates this, converting the location data to validity ranges:
WITH locations_ranges AS (
  SELECT
    sensor_id,
    RANGE(ts, LEAD(ts) OVER (PARTITION BY sensor_id ORDER BY ts ASC)) AS ts_range,
    location
  FROM mydataset.sensor_locations
)
SELECT * FROM locations_ranges ORDER BY sensor_id, ts_range;
 /*-----------+--------------------------------------------+------------------------+ 
 | sensor_id |                  ts_range                  |        location        | 
 +-----------+--------------------------------------------+------------------------+ 
 |         1 | [2020-01-01 11:59:47, 2020-01-01 12:00:08) | POINT(-122.022 37.406) | 
 |         1 | [2020-01-01 12:00:08, 2020-01-01 12:00:28) | POINT(-122.021 37.407) | 
 |         1 |           [2020-01-01 12:00:28, UNBOUNDED) |  POINT(-122.02 37.405) | 
 |         2 |           [2020-01-01 07:28:41, UNBOUNDED) |   POINT(-122.39 37.79) | 
 +-----------+--------------------------------------------+------------------------*/ 
 
 
Now you can join the temperature data (left) with the location data (right):
WITH locations_ranges AS (
  SELECT
    sensor_id,
    RANGE(ts, LEAD(ts) OVER (PARTITION BY sensor_id ORDER BY ts ASC)) AS ts_range,
    location
  FROM mydataset.sensor_locations
)
SELECT
  t1.sensor_id AS sensor_id,
  t1.ts AS temp_ts,
  t1.temp AS temp,
  t2.location AS location
FROM mydataset.sensor_temperatures t1
LEFT JOIN locations_ranges t2
  ON RANGE_CONTAINS(t2.ts_range, t1.ts)
  AND t1.sensor_id = t2.sensor_id
ORDER BY sensor_id, temp_ts;
 /*-----------+---------------------+------+------------------------+ 
 | sensor_id |       temp_ts       | temp |        location        | 
 +-----------+---------------------+------+------------------------+ 
 |         1 | 2020-01-01 12:00:00 | 37.1 | POINT(-122.022 37.406) | 
 |         1 | 2020-01-01 12:00:15 | 37.2 | POINT(-122.021 37.407) | 
 |         1 | 2020-01-01 12:00:30 | 37.3 |  POINT(-122.02 37.405) | 
 |         2 | 2020-01-01 12:00:01 | 38.1 |   POINT(-122.39 37.79) | 
 |         2 | 2020-01-01 12:00:15 | 38.2 |   POINT(-122.39 37.79) | 
 |         2 | 2020-01-01 12:00:31 | 38.3 |   POINT(-122.39 37.79) | 
 +-----------+---------------------+------+------------------------*/ 
 
 
Combine and split range data
In this section, combine range data that have overlapping ranges and split range data into smaller ranges.
Combine range data
Tables with range values might have overlapping ranges. In the following query, the time ranges capture the state of sensors at approximately 5-minute intervals:
CREATE OR REPLACE TABLE mydataset.sensor_metrics AS
SELECT * FROM UNNEST(
  ARRAY<STRUCT<sensor_id INT64, duration RANGE<DATETIME>, flow INT64, spins INT64>>[
    (1, RANGE<DATETIME> "[2020-01-01 12:00:01, 2020-01-01 12:05:23)", 10, 1),
    (1, RANGE<DATETIME> "[2020-01-01 12:05:12, 2020-01-01 12:10:46)", 10, 20),
    (1, RANGE<DATETIME> "[2020-01-01 12:10:27, 2020-01-01 12:15:56)", 11, 4),
    (1, RANGE<DATETIME> "[2020-01-01 12:16:00, 2020-01-01 12:20:58)", 11, 9),
    (1, RANGE<DATETIME> "[2020-01-01 12:20:33, 2020-01-01 12:25:08)", 11, 8),
    (2, RANGE<DATETIME> "[2020-01-01 12:00:19, 2020-01-01 12:05:08)", 21, 31),
    (2, RANGE<DATETIME> "[2020-01-01 12:05:08, 2020-01-01 12:10:30)", 21, 2),
    (2, RANGE<DATETIME> "[2020-01-01 12:10:22, 2020-01-01 12:15:42)", 21, 10)
]);
 
 
The following query on the table shows several overlapping ranges:
SELECT * FROM mydataset.sensor_metrics;
 /*-----------+--------------------------------------------+------+-------+ 
 | sensor_id |                  duration                  | flow | spins | 
 +-----------+--------------------------------------------+------+-------+ 
 |         1 | [2020-01-01 12:00:01, 2020-01-01 12:05:23) | 10   |     1 | 
 |         1 | [2020-01-01 12:05:12, 2020-01-01 12:10:46) | 10   |    20 | 
 |         1 | [2020-01-01 12:10:27, 2020-01-01 12:15:56) | 11   |     4 | 
 |         1 | [2020-01-01 12:16:00, 2020-01-01 12:20:58) | 11   |     9 | 
 |         1 | [2020-01-01 12:20:33, 2020-01-01 12:25:08) | 11   |     8 | 
 |         2 | [2020-01-01 12:00:19, 2020-01-01 12:05:08) | 21   |    31 | 
 |         2 | [2020-01-01 12:05:08, 2020-01-01 12:10:30) | 21   |     2 | 
 |         2 | [2020-01-01 12:10:22, 2020-01-01 12:15:42) | 21   |    10 | 
 +-----------+--------------------------------------------+------+-------*/ 
 
 
For some of the overlapping ranges, the value in the flow column is the same. For example, rows 1 and 2 overlap, and they also have the same flow readings. You can combine these two rows to reduce the number of rows in the table. You can use the RANGE_SESSIONIZE table function to find the ranges that overlap with each row, and it provides an additional session_range column that contains a range that is the union of all ranges that overlap. To display the session ranges for each row, run the following query:
SELECT sensor_id, session_range, flow
FROM RANGE_SESSIONIZE(
  # Input data.
  (SELECT sensor_id, duration, flow FROM mydataset.sensor_metrics),
  # Range column.
  "duration",
  # Partitioning columns. Ranges are sessionized only within these partitions.
  ["sensor_id", "flow"],
  # Sessionize mode.
  "OVERLAPS")
ORDER BY sensor_id, session_range;
 /*-----------+--------------------------------------------+------+ 
 | sensor_id |                session_range               | flow | 
 +-----------+--------------------------------------------+------+ 
 |         1 | [2020-01-01 12:00:01, 2020-01-01 12:10:46) | 10   | 
 |         1 | [2020-01-01 12:00:01, 2020-01-01 12:10:46) | 10   | 
 |         1 | [2020-01-01 12:10:27, 2020-01-01 12:15:56) | 11   | 
 |         1 | [2020-01-01 12:16:00, 2020-01-01 12:25:08) | 11   | 
 |         1 | [2020-01-01 12:16:00, 2020-01-01 12:25:08) | 11   | 
 |         2 | [2020-01-01 12:00:19, 2020-01-01 12:05:08) | 21   | 
 |         2 | [2020-01-01 12:05:08, 2020-01-01 12:15:42) | 21   | 
 |         2 | [2020-01-01 12:05:08, 2020-01-01 12:15:42) | 21   | 
 +-----------+--------------------------------------------+------*/ 
 
 
Note that for sensor_id 2, the first row's end boundary has the same datetime value as the second row's start boundary. However, because end boundaries are exclusive, the ranges don't overlap (they only meet), so the rows were not placed in the same session range. If you want to place these two rows in the same session range, use the MEETS sessionize mode.
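The following is a minimal sketch of the same query with the sessionize mode changed to MEETS; with this mode, ranges that either overlap or only touch at a boundary should be sessionized together, so the two sensor 2 rows that meet at 2020-01-01 12:05:08 would fall into one session range:

SELECT sensor_id, session_range, flow
FROM RANGE_SESSIONIZE(
  (SELECT sensor_id, duration, flow FROM mydataset.sensor_metrics),
  "duration",
  ["sensor_id", "flow"],
  # Sessionize ranges that overlap or meet at a boundary.
  "MEETS")
ORDER BY sensor_id, session_range;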
To combine the ranges, group the results by session_range and the partitioning columns (sensor_id and flow):
SELECT sensor_id, session_range, flow
FROM RANGE_SESSIONIZE(
  (SELECT sensor_id, duration, flow FROM mydataset.sensor_metrics),
  "duration",
  ["sensor_id", "flow"],
  "OVERLAPS")
GROUP BY sensor_id, session_range, flow
ORDER BY sensor_id, session_range;
 /*-----------+--------------------------------------------+------+ 
 | sensor_id |                session_range               | flow | 
 +-----------+--------------------------------------------+------+ 
 |         1 | [2020-01-01 12:00:01, 2020-01-01 12:10:46) | 10   | 
 |         1 | [2020-01-01 12:10:27, 2020-01-01 12:15:56) | 11   | 
 |         1 | [2020-01-01 12:16:00, 2020-01-01 12:25:08) | 11   | 
 |         2 | [2020-01-01 12:00:19, 2020-01-01 12:05:08) | 21   | 
 |         2 | [2020-01-01 12:05:08, 2020-01-01 12:15:42) | 21   | 
 +-----------+--------------------------------------------+------*/ 
 
 
Finally, add the spins column to the session data by aggregating it with SUM:
SELECT sensor_id, session_range, flow, SUM(spins) AS spins
FROM RANGE_SESSIONIZE(
  TABLE mydataset.sensor_metrics,
  "duration",
  ["sensor_id", "flow"],
  "OVERLAPS")
GROUP BY sensor_id, session_range, flow
ORDER BY sensor_id, session_range;
 /*-----------+--------------------------------------------+------+-------+ 
 | sensor_id |                session_range               | flow | spins | 
 +-----------+--------------------------------------------+------+-------+ 
 |         1 | [2020-01-01 12:00:01, 2020-01-01 12:10:46) | 10   |    21 | 
 |         1 | [2020-01-01 12:10:27, 2020-01-01 12:15:56) | 11   |     4 | 
 |         1 | [2020-01-01 12:16:00, 2020-01-01 12:25:08) | 11   |    17 | 
 |         2 | [2020-01-01 12:00:19, 2020-01-01 12:05:08) | 21   |    31 | 
 |         2 | [2020-01-01 12:05:08, 2020-01-01 12:15:42) | 21   |    12 | 
 +-----------+--------------------------------------------+------+-------*/ 
 
 
Split range data
You can also split a range into smaller ranges. For this example, use the following table with range data:
  /*-----------+--------------------------+------+-------+ 
 | sensor_id |         duration         | flow | spins | 
 +-----------+--------------------------+------+-------+ 
 |         1 | [2020-01-01, 2020-12-31) | 10   |    21 | 
 |         1 | [2021-01-01, 2021-12-31) | 11   |     4 | 
 |         2 | [2020-04-15, 2021-04-15) | 21   |    31 | 
 |         2 | [2021-04-15, 2022-04-15) | 21   |    12 | 
 +-----------+--------------------------+------+-------*/ 
 
 
Now, split the original ranges into 3-month intervals:
WITH sensor_data AS (
  SELECT * FROM UNNEST(
    ARRAY<STRUCT<sensor_id INT64, duration RANGE<DATE>, flow INT64, spins INT64>>[
      (1, RANGE<DATE> "[2020-01-01, 2020-12-31)", 10, 21),
      (1, RANGE<DATE> "[2021-01-01, 2021-12-31)", 11, 4),
      (2, RANGE<DATE> "[2020-04-15, 2021-04-15)", 21, 31),
      (2, RANGE<DATE> "[2021-04-15, 2022-04-15)", 21, 12)
    ])
)
SELECT sensor_id, expanded_range, flow, spins
FROM sensor_data, UNNEST(GENERATE_RANGE_ARRAY(duration, INTERVAL 3 MONTH)) AS expanded_range;
 /*-----------+--------------------------+------+-------+ 
 | sensor_id |      expanded_range      | flow | spins | 
 +-----------+--------------------------+------+-------+ 
 |         1 | [2020-01-01, 2020-04-01) |   10 |    21 | 
 |         1 | [2020-04-01, 2020-07-01) |   10 |    21 | 
 |         1 | [2020-07-01, 2020-10-01) |   10 |    21 | 
 |         1 | [2020-10-01, 2020-12-31) |   10 |    21 | 
 |         1 | [2021-01-01, 2021-04-01) |   11 |     4 | 
 |         1 | [2021-04-01, 2021-07-01) |   11 |     4 | 
 |         1 | [2021-07-01, 2021-10-01) |   11 |     4 | 
 |         1 | [2021-10-01, 2021-12-31) |   11 |     4 | 
 |         2 | [2020-04-15, 2020-07-15) |   21 |    31 | 
 |         2 | [2020-07-15, 2020-10-15) |   21 |    31 | 
 |         2 | [2020-10-15, 2021-01-15) |   21 |    31 | 
 |         2 | [2021-01-15, 2021-04-15) |   21 |    31 | 
 |         2 | [2021-04-15, 2021-07-15) |   21 |    12 | 
 |         2 | [2021-07-15, 2021-10-15) |   21 |    12 | 
 |         2 | [2021-10-15, 2022-01-15) |   21 |    12 | 
 |         2 | [2022-01-15, 2022-04-15) |   21 |    12 | 
 +-----------+--------------------------+------+-------*/ 
 
 
In the previous query, each original range was broken down into smaller ranges, with the width set to INTERVAL 3 MONTH. However, the 3-month ranges are not aligned to a common origin. To align these ranges to a common origin of 2020-01-01, run the following query:
WITH sensor_data AS (
  SELECT * FROM UNNEST(
    ARRAY<STRUCT<sensor_id INT64, duration RANGE<DATE>, flow INT64, spins INT64>>[
      (1, RANGE<DATE> "[2020-01-01, 2020-12-31)", 10, 21),
      (1, RANGE<DATE> "[2021-01-01, 2021-12-31)", 11, 4),
      (2, RANGE<DATE> "[2020-04-15, 2021-04-15)", 21, 31),
      (2, RANGE<DATE> "[2021-04-15, 2022-04-15)", 21, 12)
    ])
)
SELECT sensor_id, expanded_range, flow, spins
FROM sensor_data
JOIN UNNEST(GENERATE_RANGE_ARRAY(RANGE<DATE> "[2020-01-01, 2022-12-31)", INTERVAL 3 MONTH)) AS expanded_range
  ON RANGE_OVERLAPS(duration, expanded_range);
 /*-----------+--------------------------+------+-------+ 
 | sensor_id |      expanded_range      | flow | spins | 
 +-----------+--------------------------+------+-------+ 
 |         1 | [2020-01-01, 2020-04-01) |   10 |    21 | 
 |         1 | [2020-04-01, 2020-07-01) |   10 |    21 | 
 |         1 | [2020-07-01, 2020-10-01) |   10 |    21 | 
 |         1 | [2020-10-01, 2021-01-01) |   10 |    21 | 
 |         1 | [2021-01-01, 2021-04-01) |   11 |     4 | 
 |         1 | [2021-04-01, 2021-07-01) |   11 |     4 | 
 |         1 | [2021-07-01, 2021-10-01) |   11 |     4 | 
 |         1 | [2021-10-01, 2022-01-01) |   11 |     4 | 
 |         2 | [2020-04-01, 2020-07-01) |   21 |    31 | 
 |         2 | [2020-07-01, 2020-10-01) |   21 |    31 | 
 |         2 | [2020-10-01, 2021-01-01) |   21 |    31 | 
 |         2 | [2021-01-01, 2021-04-01) |   21 |    31 | 
 |         2 | [2021-04-01, 2021-07-01) |   21 |    31 | 
 |         2 | [2021-04-01, 2021-07-01) |   21 |    12 | 
 |         2 | [2021-07-01, 2021-10-01) |   21 |    12 | 
 |         2 | [2021-10-01, 2022-01-01) |   21 |    12 | 
 |         2 | [2022-01-01, 2022-04-01) |   21 |    12 | 
 |         2 | [2022-04-01, 2022-07-01) |   21 |    12 | 
 +-----------+--------------------------+------+-------*/ 
 
 
In the previous query, the row with the range [2020-04-15, 2021-04-15) is split into 5 ranges, starting with the range [2020-04-01, 2020-07-01). Note that the start boundary now extends beyond the original start boundary, in order to align with the common origin. If you don't want the start boundary to extend beyond the original start boundary, you can restrict the JOIN condition:
WITH sensor_data AS (
  SELECT * FROM UNNEST(
    ARRAY<STRUCT<sensor_id INT64, duration RANGE<DATE>, flow INT64, spins INT64>>[
      (1, RANGE<DATE> "[2020-01-01, 2020-12-31)", 10, 21),
      (1, RANGE<DATE> "[2021-01-01, 2021-12-31)", 11, 4),
      (2, RANGE<DATE> "[2020-04-15, 2021-04-15)", 21, 31),
      (2, RANGE<DATE> "[2021-04-15, 2022-04-15)", 21, 12)
    ])
)
SELECT sensor_id, expanded_range, flow, spins
FROM sensor_data
JOIN UNNEST(GENERATE_RANGE_ARRAY(RANGE<DATE> "[2020-01-01, 2022-12-31)", INTERVAL 3 MONTH)) AS expanded_range
  ON RANGE_CONTAINS(duration, RANGE_START(expanded_range));
 /*-----------+--------------------------+------+-------+ 
 | sensor_id |      expanded_range      | flow | spins | 
 +-----------+--------------------------+------+-------+ 
 |         1 | [2020-01-01, 2020-04-01) |   10 |    21 | 
 |         1 | [2020-04-01, 2020-07-01) |   10 |    21 | 
 |         1 | [2020-07-01, 2020-10-01) |   10 |    21 | 
 |         1 | [2020-10-01, 2021-01-01) |   10 |    21 | 
 |         1 | [2021-01-01, 2021-04-01) |   11 |     4 | 
 |         1 | [2021-04-01, 2021-07-01) |   11 |     4 | 
 |         1 | [2021-07-01, 2021-10-01) |   11 |     4 | 
 |         1 | [2021-10-01, 2022-01-01) |   11 |     4 | 
 |         2 | [2020-07-01, 2020-10-01) |   21 |    31 | 
 |         2 | [2020-10-01, 2021-01-01) |   21 |    31 | 
 |         2 | [2021-01-01, 2021-04-01) |   21 |    31 | 
 |         2 | [2021-04-01, 2021-07-01) |   21 |    31 | 
 |         2 | [2021-07-01, 2021-10-01) |   21 |    12 | 
 |         2 | [2021-10-01, 2022-01-01) |   21 |    12 | 
 |         2 | [2022-01-01, 2022-04-01) |   21 |    12 | 
 |         2 | [2022-04-01, 2022-07-01) |   21 |    12 | 
 +-----------+--------------------------+------+-------*/ 
 
 
You now see that the range [2020-04-15, 2021-04-15) was split into 4 ranges, starting with the range [2020-07-01, 2020-10-01).
Best practices for storing data
-  When storing time series data, it is important to consider the query patterns that are used against the tables where the data is stored. Typically, when you query time series data, you filter the data for a specific time range.
-  To optimize these usage patterns, it is recommended to store time series data in partitioned tables, with data partitioned either by the time column or by ingestion time. This can significantly improve query performance, because it lets BigQuery prune the partitions that don't contain the queried data.
-  You can enable clustering on the time, range, or one of the partitioning columns for further query performance improvements, as shown in the sketch after this list.
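As a minimal sketch of these recommendations (the table name environmental_data_partitioned is illustrative and not part of the examples above), the following DDL stores the environmental readings in a table partitioned by day on the time column and clustered by the zip_code partitioning column:

CREATE OR REPLACE TABLE mydataset.environmental_data_partitioned
(
  zip_code    INT64,
  time        TIMESTAMP,
  aqi         INT64,
  temperature INT64
)
-- Daily partitions let BigQuery prune partitions outside the queried time range.
PARTITION BY DATE(time)
-- Clustering by the time series identifier further reduces the data scanned per query.
CLUSTER BY zip_code;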

