Spanner provides built-in tables that store statistics for the queries and data manipulation language (DML) statements that used the most CPU, as well as for all queries in aggregate (including change stream queries).
Access query statistics
Spanner provides query statistics in the SPANNER_SYS schema. You can access SPANNER_SYS data in the following ways:
- A database's Spanner Studio page in the Google Cloud console.
- The gcloud spanner databases execute-sql command.
- Query insights dashboards.
- The executeSql or executeStreamingSql method.
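For example, the following minimal query lists the most recent one-minute aggregate intervals; any of the access methods listed above can run it. This is only an illustrative sketch, and the table and columns it reads are described later on this page.

SELECT interval_end,
       execution_count,
       avg_latency_seconds
FROM SPANNER_SYS.QUERY_STATS_TOTAL_MINUTE
ORDER BY interval_end DESC
LIMIT 5;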
The following single read methods that Spanner provides don't support SPANNER_SYS:
- Performing a strong read from a single row or multiple rows in a table.
- Performing a stale read from a single row or multiple rows in a table.
- Reading from a single row or multiple rows in a secondary index.

For more information, see Single read methods.
CPU usage grouped by query
The following tables track the queries with the highest CPU usage during a specific time period:
- SPANNER_SYS.QUERY_STATS_TOP_MINUTE: Queries during 1-minute intervals
- SPANNER_SYS.QUERY_STATS_TOP_10MINUTE: Queries during 10-minute intervals
- SPANNER_SYS.QUERY_STATS_TOP_HOUR: Queries during 1-hour intervals
These tables have the following properties:
- Each table contains data for non-overlapping time intervals of the length the table name specifies.
- Intervals are based on clock times. 1-minute intervals end on the minute, 10-minute intervals end every 10 minutes starting on the hour, and 1-hour intervals end on the hour.
  For example, at 11:59:30 AM, the most recent intervals available to SQL queries are:
  - 1 minute: 11:58:00–11:58:59 AM
  - 10 minute: 11:40:00–11:49:59 AM
  - 1 hour: 10:00:00–10:59:59 AM
- Spanner groups the statistics by the text of the SQL query. If a query uses query parameters, Spanner groups all executions of that query into one row. If the query uses string literals, Spanner only groups the statistics if the full query text is identical; when any text differs, each query appears as a separate row. For batch DML, Spanner normalizes the batch by deduplicating consecutive identical statements prior to generating the fingerprint.
- If a request tag is present, FPRINT is the hash of the request tag. Otherwise, it is the hash of the TEXT value. For partitioned DML statements, FPRINT is always the hash of the TEXT value. (For an illustration of how tagged and untagged executions are grouped, see the example query after these lists.)
- Each row contains statistics for all executions of a particular SQL query that Spanner captures statistics for during the specified interval.
- If Spanner is unable to store all queries run during the interval, the system prioritizes queries with the highest CPU usage during the specified interval.
- Tracked queries include those that completed, failed, or were canceled by the user.
- A subset of statistics is specific to queries that ran but did not complete:
  - Execution count and mean latency in seconds across all the queries that did not succeed.
  - Execution count for queries that timed out.
  - Execution count for queries that were canceled by the user or failed due to network connectivity issues.
- All columns in the tables are nullable.
Query statistics for previously executed partitioned DML statements have the following properties:
- Every successful partitioned DML statement counts as exactly one execution. A partitioned DML statement that failed, was canceled, or is still executing has an execution count of zero.
- The ALL_FAILED_EXECUTION_COUNT, ALL_FAILED_AVG_LATENCY_SECONDS, CANCELLED_OR_DISCONNECTED_EXECUTION_COUNT, and TIMED_OUT_EXECUTION_COUNT statistics aren't tracked for partitioned DML.
- The statistics for each previously executed partitioned DML statement might appear in different intervals. SPANNER_SYS.QUERY_STATS_TOP_10MINUTE and SPANNER_SYS.QUERY_STATS_TOP_HOUR provide an aggregated view for partitioned DML statements that complete within 10 minutes and 1 hour, respectively. To view statistics for statements whose duration is longer than 1 hour, see the query example later on this page.
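To see the effect of these grouping rules, you can inspect the REQUEST_TAG, TEXT_FINGERPRINT, and TEXT columns side by side. The following query is an illustrative sketch (it assumes your database has recent traffic in the 1-minute table): tagged executions collapse into one row per tag, while untagged executions are grouped by the query text.

SELECT interval_end,
       request_tag,
       text_fingerprint,
       execution_count,
       text
FROM SPANNER_SYS.QUERY_STATS_TOP_MINUTE
ORDER BY interval_end DESC, execution_count DESC
LIMIT 20;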
Table schema
Column name | Type | Description
---|---|---
INTERVAL_END | TIMESTAMP | End of the time interval in which the included query executions occurred.
REQUEST_TAG | STRING | The optional request tag for this query operation. For more information about using tags, see Troubleshooting with request tags.
QUERY_TYPE | STRING | Indicates whether a query is a PARTITIONED_QUERY or a QUERY. A PARTITIONED_QUERY is a query with a partitionToken obtained from the PartitionQuery API, or a partitioned DML statement. All other queries and DML statements are denoted by the QUERY query type.
TEXT | STRING | SQL query text, truncated to approximately 64KB. Statistics for multiple queries that have the same tag string are grouped in a single row with the REQUEST_TAG matching that tag string; only the text of one of those queries is shown in this field. For batch DML, the SQL statements are flattened into a single row and concatenated using a semicolon delimiter. Consecutive identical SQL texts are deduplicated before truncating.
TEXT_TRUNCATED | BOOL | Whether the query text was truncated.
TEXT_FINGERPRINT | INT64 | The hash of the REQUEST_TAG value if present; otherwise, the hash of the TEXT value. Corresponds to the query_fingerprint field in the audit log.
EXECUTION_COUNT | INT64 | Number of times Spanner saw the query during the interval.
AVG_LATENCY_SECONDS | FLOAT64 | Average length of time, in seconds, for each query execution within the database. This average excludes the encoding and transmission time for the result set as well as overhead.
AVG_ROWS | FLOAT64 | Average number of rows that the query returned.
AVG_BYTES | FLOAT64 | Average number of data bytes that the query returned, excluding transmission encoding overhead.
AVG_ROWS_SCANNED | FLOAT64 | Average number of rows that the query scanned, excluding deleted values.
AVG_CPU_SECONDS | FLOAT64 | Average number of seconds of CPU time Spanner spent on all operations to execute the query.
ALL_FAILED_EXECUTION_COUNT | INT64 | Number of times the query failed during the interval.
ALL_FAILED_AVG_LATENCY_SECONDS | FLOAT64 | Average length of time, in seconds, for each failed query execution within the database. This average excludes the encoding and transmission time for the result set as well as overhead.
CANCELLED_OR_DISCONNECTED_EXECUTION_COUNT | INT64 | Number of times the query was canceled by the user or failed due to a broken network connection during the interval.
TIMED_OUT_EXECUTION_COUNT | INT64 | Number of times the query timed out during the interval.
AVG_BYTES_WRITTEN | FLOAT64 | Average number of bytes written by the statement.
AVG_ROWS_WRITTEN | FLOAT64 | Average number of rows modified by the statement.
STATEMENT_COUNT | INT64 | The sum of the statements aggregated into this entry. For regular queries and DML, this is equal to the execution count. For batch DML, Spanner captures the number of statements in the batch.
RUN_IN_RW_TRANSACTION_EXECUTION_COUNT | INT64 | The number of times the query was run as part of a read-write transaction. This column helps you determine whether you can avoid lock contention by moving the query to a read-only transaction.
LATENCY_DISTRIBUTION | ARRAY<STRUCT> | A histogram of the query execution time, measured in seconds. The array contains a single element. To calculate the percentile latency from the distribution, use the SPANNER_SYS.DISTRIBUTION_PERCENTILE function. For more information, see Percentiles and distribution-valued metrics.
AVG_MEMORY_PEAK_USAGE_BYTES | FLOAT64 | During a distributed query execution, the average peak memory usage, in bytes. Use this statistic to identify which queries or table data sizes are likely to encounter memory limits.
AVG_MEMORY_USAGE_PERCENTAGE | FLOAT64 | During a distributed query execution, the average memory usage required, as a percentage of the memory limit allowed for this query. This statistic only tracks memory that is required for the query to execute. Some operators use additional buffering memory to improve performance; that buffering memory is visible in the query plan but isn't used to calculate this value. Use this statistic to identify queries that are nearing the memory usage limit and are at risk of failing if the data size increases. To mitigate the risk of the query failing, see SQL best practices to optimize these queries, or split the query into pieces that read less data.
AVG_QUERY_PLAN_CREATION_TIME_SECS | FLOAT64 | The average CPU time, in seconds, spent on query compilation, including query runtime creation. If the value of this column is high, use parameterized queries.
AVG_FILESYSTEM_DELAY_SECS | FLOAT64 | The average time the query spends reading from the file system or being blocked on input/output (I/O). Use this statistic to identify potential high latency caused by file system I/O. To mitigate, add an index.
AVG_REMOTE_SERVER_CALLS | FLOAT64 | The average number of remote server calls (RPCs) completed by the query. Use this statistic to identify whether different queries that scan the same number of rows have a vastly different number of RPCs. The query with a higher RPC value might benefit from adding an index.
AVG_ROWS_SPOOLED | FLOAT64 | The average number of rows written to temporary disk (not in-memory) by the query statement. Use this statistic to identify potentially high-latency queries that are memory-expensive and can't be executed in-memory.
AVG_DISK_IO_COST | FLOAT64 | The average cost of this query in terms of Spanner HDD disk load. Use this value to make relative HDD I/O cost comparisons between reads that you run in the database. Querying data on HDD storage incurs a charge against the HDD disk load capacity of the instance. A higher value indicates that you are using more HDD disk load and your query might be slower than if it was running on SSD. Furthermore, if your HDD disk load is at capacity, the performance of your queries might be further impacted. You can monitor the total HDD disk load capacity of the instance as a percentage. To add more HDD disk load capacity, you can add more processing units or nodes to your instance. For more information, see Change the compute capacity. To improve query performance, also consider moving some data to SSD. For workloads that consume a lot of disk I/O, we recommend that you store frequently accessed data on SSD storage. Data accessed from SSD doesn't consume HDD disk load capacity. You can store selective tables, columns, or secondary indexes on SSD storage as needed, while keeping infrequently accessed data on HDD storage. For more information, see Tiered storage overview.
EXECUTION_COUNT, AVG_LATENCY_SECONDS, and LATENCY_DISTRIBUTION for failed queries include queries that failed due to incorrect syntax or encountered a transient error but succeeded on retrying. These statistics don't track failed and cancelled partitioned DML statements.
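As an illustration of how you might use the memory-related columns described above, the following sketch lists the queries in the most recent complete hour interval that came closest to their per-query memory limit. It reuses the same query pattern as the examples later on this page.

SELECT text_fingerprint,
       execution_count,
       avg_memory_usage_percentage,
       avg_memory_peak_usage_bytes
FROM SPANNER_SYS.QUERY_STATS_TOP_HOUR
WHERE interval_end = (SELECT MAX(interval_end)
                      FROM SPANNER_SYS.QUERY_STATS_TOP_HOUR)
ORDER BY avg_memory_usage_percentage DESC
LIMIT 10;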
Aggregate statistics
There are also tables that track aggregate data for all the queries for which Spanner captured statistics in a specific time period:
- SPANNER_SYS.QUERY_STATS_TOTAL_MINUTE: Queries during 1-minute intervals
- SPANNER_SYS.QUERY_STATS_TOTAL_10MINUTE: Queries during 10-minute intervals
- SPANNER_SYS.QUERY_STATS_TOTAL_HOUR: Queries during 1-hour intervals
These tables have the following properties:
- Each table contains data for non-overlapping time intervals of the length the table name specifies.
- Intervals are based on clock times. 1-minute intervals end on the minute, 10-minute intervals end every 10 minutes starting on the hour, and 1-hour intervals end on the hour.
  For example, at 11:59:30 AM, the most recent intervals available to SQL queries are:
  - 1 minute: 11:58:00–11:58:59 AM
  - 10 minute: 11:40:00–11:49:59 AM
  - 1 hour: 10:00:00–10:59:59 AM
- Each row contains statistics for all queries executed over the database during the specified interval, aggregated together. There is only one row per time interval, and it includes completed queries, failed queries, and queries canceled by the user.
- The statistics captured in the TOTAL tables might include queries that Spanner did not capture in the TOP tables.
- Some columns in these tables are exposed as metrics in Cloud Monitoring. The exposed metrics are:
  - Query execution count
  - Query failures
  - Query latencies
  - Rows returned count
  - Rows scanned count
  - Bytes returned count
  - Query CPU time

For more information, see Spanner metrics.
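For reference, the following sketch retrieves the aggregate columns that back the Cloud Monitoring metrics listed above, for the most recent complete 1-minute interval. It mirrors the example queries later on this page.

SELECT interval_end,
       execution_count,
       all_failed_execution_count,
       avg_latency_seconds,
       avg_rows,
       avg_rows_scanned,
       avg_bytes,
       avg_cpu_seconds
FROM SPANNER_SYS.QUERY_STATS_TOTAL_MINUTE
WHERE interval_end = (SELECT MAX(interval_end)
                      FROM SPANNER_SYS.QUERY_STATS_TOTAL_MINUTE);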
Table schema
Column name | Type | Description
---|---|---
INTERVAL_END | TIMESTAMP | End of the time interval in which the included query executions occurred.
EXECUTION_COUNT | INT64 | Number of times Spanner saw the query during the interval.
AVG_LATENCY_SECONDS | FLOAT64 | Average length of time, in seconds, for each query execution within the database. This average excludes the encoding and transmission time for the result set as well as overhead.
AVG_ROWS | FLOAT64 | Average number of rows that the query returned.
AVG_BYTES | FLOAT64 | Average number of data bytes that the query returned, excluding transmission encoding overhead.
AVG_ROWS_SCANNED | FLOAT64 | Average number of rows that the query scanned, excluding deleted values.
AVG_CPU_SECONDS | FLOAT64 | Average number of seconds of CPU time Spanner spent on all operations to execute the query.
ALL_FAILED_EXECUTION_COUNT | INT64 | Number of times the query failed during the interval.
ALL_FAILED_AVG_LATENCY_SECONDS | FLOAT64 | Average length of time, in seconds, for each failed query execution within the database. This average excludes the encoding and transmission time for the result set as well as overhead.
CANCELLED_OR_DISCONNECTED_EXECUTION_COUNT | INT64 | Number of times the query was canceled by the user or failed due to a broken network connection during the interval.
TIMED_OUT_EXECUTION_COUNT | INT64 | Number of times the query timed out during the interval.
AVG_BYTES_WRITTEN | FLOAT64 | Average number of bytes written by the statement.
AVG_ROWS_WRITTEN | FLOAT64 | Average number of rows modified by the statement.
RUN_IN_RW_TRANSACTION_EXECUTION_COUNT | INT64 | The number of times queries were run as part of read-write transactions. This column helps you determine whether you can avoid lock contention by moving some queries to read-only transactions.
LATENCY_DISTRIBUTION | ARRAY<STRUCT> | A histogram of the execution time across queries, measured in seconds. The array contains a single element. To calculate the percentile latency from the distribution, use the SPANNER_SYS.DISTRIBUTION_PERCENTILE function. For more information, see Percentiles and distribution-valued metrics.
Data retention
At a minimum, Spanner keeps data for each table for the following time periods:
- SPANNER_SYS.QUERY_STATS_TOP_MINUTE and SPANNER_SYS.QUERY_STATS_TOTAL_MINUTE: Intervals covering the previous 6 hours.
- SPANNER_SYS.QUERY_STATS_TOP_10MINUTE and SPANNER_SYS.QUERY_STATS_TOTAL_10MINUTE: Intervals covering the previous 4 days.
- SPANNER_SYS.QUERY_STATS_TOP_HOUR and SPANNER_SYS.QUERY_STATS_TOTAL_HOUR: Intervals covering the previous 30 days.
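To confirm how much history is currently available in your database, you can check the oldest retained interval directly. The following is a sketch; the actual retention can be longer than the minimums listed above.

SELECT MIN(interval_end) AS oldest_interval,
       MAX(interval_end) AS newest_interval
FROM SPANNER_SYS.QUERY_STATS_TOP_10MINUTE;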
Example queries
This section includes several example SQL statements that retrieve query statistics. You can run these SQL statements using the client libraries , the Google Cloud CLI , or the Google Cloud console .
List the basic statistics for each query in a given time period
The following query returns the raw data for the top queries in the previous minute:
SELECT text,
       request_tag,
       interval_end,
       execution_count,
       avg_latency_seconds,
       avg_rows,
       avg_bytes,
       avg_rows_scanned,
       avg_cpu_seconds
FROM spanner_sys.query_stats_top_minute
ORDER BY interval_end DESC;
List the statistics for partitioned DML statements that run for more than one hour
The following query returns the execution count and average number of rows written by the top partitioned DML statements in previous hourly intervals:
SELECT text,
       request_tag,
       interval_end,
       SUM(execution_count) AS execution_count,
       SUM(avg_rows_written * execution_count) / SUM(execution_count) AS avg_rows_written
FROM spanner_sys.query_stats_top_hour
WHERE STARTS_WITH(text, "UPDATE")
  AND query_type = "PARTITIONED_QUERY"
GROUP BY text, request_tag, interval_end
ORDER BY interval_end DESC;
List the queries with the highest CPU usage
The following query returns the queries with the highest CPU usage in the previous hour:
SELECT text,
       request_tag,
       execution_count AS count,
       avg_latency_seconds AS latency,
       avg_cpu_seconds AS cpu,
       execution_count * avg_cpu_seconds AS total_cpu
FROM spanner_sys.query_stats_top_hour
WHERE interval_end = (
  SELECT MAX(interval_end)
  FROM spanner_sys.query_stats_top_hour
)
ORDER BY total_cpu DESC;
Find the total execution count in a given time period
The following query returns the total number of queries executed in the most recent complete 1-minute interval:
SELECT interval_end,
       execution_count
FROM spanner_sys.query_stats_total_minute
WHERE interval_end = (
  SELECT MAX(interval_end)
  FROM spanner_sys.query_stats_top_minute
);
Find the average latency for a query
The following query returns the average latency information for a specific query:
SELECT avg_latency_seconds
FROM spanner_sys.query_stats_top_hour
WHERE text LIKE "SELECT x FROM table WHERE x=@foo;";
Find the 99th percentile latency for queries
The following query returns the 99th percentile of execution time across queries run in the previous 10 minutes:
SELECT interval_end,
       avg_latency_seconds,
       SPANNER_SYS.DISTRIBUTION_PERCENTILE(latency_distribution[OFFSET(0)], 99.0) AS percentile_latency
FROM spanner_sys.query_stats_total_10minute
WHERE interval_end = (
  SELECT MAX(interval_end)
  FROM spanner_sys.query_stats_total_10minute
)
ORDER BY interval_end;
Comparing the average latency with the 99th percentile latency helps identify possible outlier queries with high execution times.
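For a side-by-side view, you can compute several percentiles from the same distribution. The following sketch assumes that SPANNER_SYS.DISTRIBUTION_PERCENTILE accepts any percentile value, and compares the median with the 99th percentile for the most recent complete interval:

SELECT interval_end,
       avg_latency_seconds,
       SPANNER_SYS.DISTRIBUTION_PERCENTILE(latency_distribution[OFFSET(0)], 50.0) AS p50_latency,
       SPANNER_SYS.DISTRIBUTION_PERCENTILE(latency_distribution[OFFSET(0)], 99.0) AS p99_latency
FROM spanner_sys.query_stats_total_10minute
WHERE interval_end = (
  SELECT MAX(interval_end)
  FROM spanner_sys.query_stats_total_10minute
);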
Find the queries that scan the most data
You can use the number of rows scanned by a query as a measure of the amount of data that the query scanned. The following query returns the number of rows scanned by queries executed in the previous hour:
SELECT text,
       execution_count,
       avg_rows_scanned
FROM spanner_sys.query_stats_top_hour
WHERE interval_end = (
  SELECT MAX(interval_end)
  FROM spanner_sys.query_stats_top_hour
)
ORDER BY avg_rows_scanned DESC;
Find the statements that wrote the most data
You can use the number of rows written (or bytes written) by DML statements as a measure of the amount of data that the statement modified. The following query returns the number of rows written by DML statements executed in the previous hour:
SELECT text,
       execution_count,
       avg_rows_written
FROM spanner_sys.query_stats_top_hour
WHERE interval_end = (
  SELECT MAX(interval_end)
  FROM spanner_sys.query_stats_top_hour
)
ORDER BY avg_rows_written DESC;
Total the CPU usage across all queries
The following query returns the number of CPU hours used in the previous hour:
SELECT (avg_cpu_seconds * execution_count / 60 / 60) AS total_cpu_hours
FROM spanner_sys.query_stats_total_hour
WHERE interval_end = (
  SELECT MAX(interval_end)
  FROM spanner_sys.query_stats_total_hour
);
List the queries that have failed in a given time period
The following query returns the raw data including execution count and average latency of failed queries for the top queries in the previous minute. These statistics don't track failed and cancelled partitioned DML statements.
SELECT text,
       request_tag,
       interval_end,
       execution_count,
       all_failed_execution_count,
       all_failed_avg_latency_seconds,
       avg_latency_seconds,
       avg_rows,
       avg_bytes,
       avg_rows_scanned,
       avg_cpu_seconds
FROM spanner_sys.query_stats_top_minute
WHERE all_failed_execution_count > 0
ORDER BY interval_end;
Find the total error count in a given time period
The following query returns the total number of queries that failed to execute in the most recent complete 1 minute interval. These statistics don't track failed and cancelled partitioned DML statements.
SELECT interval_end,
       all_failed_execution_count
FROM spanner_sys.query_stats_total_minute
WHERE interval_end = (
  SELECT MAX(interval_end)
  FROM spanner_sys.query_stats_top_minute
)
ORDER BY interval_end;
List the queries that time out the most
The following query returns the queries with the highest timeout count in the previous hour.
SELECT text,
       execution_count AS count,
       timed_out_execution_count AS timeout_count,
       avg_latency_seconds AS latency,
       avg_cpu_seconds AS cpu,
       execution_count * avg_cpu_seconds AS total_cpu
FROM spanner_sys.query_stats_top_hour
WHERE interval_end = (
  SELECT MAX(interval_end)
  FROM spanner_sys.query_stats_top_hour
)
ORDER BY timed_out_execution_count DESC;
Find the average latency of successful and failed executions for a query
The following query returns the combined average latency, average latency for successful executions, and average latency for failed executions for a specific query. These statistics don't track failed and cancelled partitioned DML statements.
SELECT avg_latency_seconds AS combined_avg_latency,
       all_failed_avg_latency_seconds AS failed_execution_latency,
       (avg_latency_seconds * execution_count
        - all_failed_avg_latency_seconds * all_failed_execution_count)
       / (execution_count - all_failed_execution_count) AS success_execution_latency
FROM spanner_sys.query_stats_top_hour
WHERE text LIKE "select x from table where x=@foo;";
Troubleshoot high CPU usage or elevated query latency with query statistics
Query statistics are useful when you need to investigate high CPU usage on your Spanner database, or when you want to understand the CPU-heavy query shapes on your database. Inspecting queries that use significant database resources gives you a potential way to reduce operational costs and possibly improve general system latencies.
You can use SQL code or the Query insights dashboard to investigate problematic queries in your database. The following topics show how you can investigate such queries by using SQL code.
While the following example focuses on CPU usage, similar steps can be followed to troubleshoot elevated query latency and find the queries with the highest latencies. Simply select time intervals and queries by latency instead of CPU usage.
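For example, a latency-oriented variant of the CPU-focused query used later in this walkthrough might look like the following sketch, which ranks queries by average latency instead of total CPU:

SELECT text_fingerprint,
       execution_count,
       avg_latency_seconds,
       avg_cpu_seconds
FROM spanner_sys.query_stats_top_hour
WHERE interval_end = (
  SELECT MAX(interval_end)
  FROM spanner_sys.query_stats_top_hour
)
ORDER BY avg_latency_seconds DESC;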
Select a time period to investigate
Start your investigation by looking for the time when your application began to experience high CPU usage. In this example, the issue started occurring around 5:00 PM on July 24, 2020 (UTC).
Gather query statistics for the selected time period
Having selected a time period to start our investigation, we'll look at statistics gathered in the QUERY_STATS_TOTAL_10MINUTE table around that time. The results of this query might indicate how CPU and other query statistics changed over that period of time.
The following query returns the aggregated query statistics from 16:30 to 17:30 UTC, inclusive. We use ROUND in the query to restrict the number of decimal places for display purposes.
SELECT interval_end,
       execution_count AS count,
       ROUND(avg_latency_seconds, 2) AS latency,
       ROUND(avg_rows, 2) AS rows_returned,
       ROUND(avg_bytes, 2) AS bytes,
       ROUND(avg_rows_scanned, 2) AS rows_scanned,
       ROUND(avg_cpu_seconds, 3) AS avg_cpu
FROM spanner_sys.query_stats_total_10minute
WHERE interval_end >= "2020-07-24T16:30:00Z"
  AND interval_end <= "2020-07-24T17:30:00Z"
ORDER BY interval_end;
Running the query produced the following results.
interval_end | count | latency | rows_returned | bytes | rows_scanned | avg_cpu
---|---|---|---|---|---|---
2020-07-24T16:30:00Z | 6 | 0.06 | 5.00 | 536.00 | 16.67 | 0.035
2020-07-24T16:40:00Z | 55 | 0.02 | 0.22 | 25.29 | 0.22 | 0.004
2020-07-24T16:50:00Z | 102 | 0.02 | 0.30 | 33.35 | 0.30 | 0.004
2020-07-24T17:00:00Z | 154 | 1.06 | 4.42 | 486.33 | 7792208.12 | 4.633
2020-07-24T17:10:00Z | 94 | 0.02 | 1.68 | 106.84 | 1.68 | 0.006
2020-07-24T17:20:00Z | 110 | 0.02 | 0.38 | 34.60 | 0.38 | 0.005
2020-07-24T17:30:00Z | 47 | 0.02 | 0.23 | 24.96 | 0.23 | 0.004
In the preceding table, we see that average CPU time (the avg_cpu column in the results) is highest in the interval ending at 17:00. We also see a much higher average number of rows scanned in that interval. This indicates that more expensive queries ran between 16:50 and 17:00. Choose that interval to investigate further in the next step.
Find the queries that are causing high CPU usage
With a time interval to investigate selected, we now query the QUERY_STATS_TOP_10MINUTE table. The results of this query can help indicate which queries cause high CPU usage.
SELECT text_fingerprint AS fingerprint,
       execution_count AS count,
       ROUND(avg_latency_seconds, 2) AS latency,
       ROUND(avg_cpu_seconds, 3) AS cpu,
       ROUND(execution_count * avg_cpu_seconds, 3) AS total_cpu
FROM spanner_sys.query_stats_top_10minute
WHERE interval_end = "2020-07-24T17:00:00Z"
ORDER BY total_cpu DESC;
Running this query yields the following results.
fingerprint | count | latency | cpu | total_cpu
---|---|---|---|---
5505124206529314852 | 30 | 3.88 | 17.635 | 529.039
1697951036096498470 | 10 | 4.49 | 18.388 | 183.882
2295109096748351518 | 1 | 0.33 | 0.048 | 0.048
11618299167612903606 | 1 | 0.25 | 0.021 | 0.021
10302798842433860499 | 1 | 0.04 | 0.006 | 0.006
123771704548746223 | 1 | 0.04 | 0.006 | 0.006
4216063638051261350 | 1 | 0.04 | 0.006 | 0.006
3654744714919476398 | 1 | 0.04 | 0.006 | 0.006
2999453161628434990 | 1 | 0.04 | 0.006 | 0.006
823179738756093706 | 1 | 0.02 | 0.005 | 0.0056
The top two queries in the results table are outliers in terms of average CPU and latency, as well as number of executions and total CPU. We'll investigate the first query listed in these results.
Compare query runs over time
Having narrowed down the investigation, we can turn our attention to the QUERY_STATS_TOP_MINUTE table. By comparing runs over time for a particular query, we can look for correlations between the number of rows or bytes returned, or the number of rows scanned, and elevated CPU or latency. A deviation may indicate non-uniformity in the data. Consistently high numbers of rows scanned may indicate the lack of appropriate indexes or sub-optimal join ordering.
Investigate the query exhibiting the highest average CPU usage and latency by running the following statement, which filters on the text_fingerprint of that query.
SELECT interval_end,
       ROUND(avg_latency_seconds, 2) AS latency,
       avg_rows AS rows_returned,
       avg_bytes AS bytes_returned,
       avg_rows_scanned AS rows_scanned,
       ROUND(avg_cpu_seconds, 3) AS cpu
FROM spanner_sys.query_stats_top_minute
WHERE text_fingerprint = 5505124206529314852
ORDER BY interval_end DESC;
Running this query returns the following results.
interval_end | latency | rows_returned | bytes_returned | rows_scanned | cpu
---|---|---|---|---|---
2020-07-24T17:00:00Z | 4.55 | 21 | 2365 | 30000000 | 19.255
2020-07-24T16:00:00Z | 3.62 | 21 | 2365 | 30000000 | 17.255
2020-07-24T15:00:00Z | 4.37 | 21 | 2365 | 30000000 | 18.350
2020-07-24T14:00:00Z | 4.02 | 21 | 2365 | 30000000 | 17.748
2020-07-24T13:00:00Z | 3.12 | 21 | 2365 | 30000000 | 16.380
2020-07-24T12:00:00Z | 3.45 | 21 | 2365 | 30000000 | 15.476
2020-07-24T11:00:00Z | 4.94 | 21 | 2365 | 30000000 | 22.611
2020-07-24T10:00:00Z | 6.48 | 21 | 2365 | 30000000 | 21.265
2020-07-24T09:00:00Z | 0.23 | 21 | 2365 | 5 | 0.040
2020-07-24T08:00:00Z | 0.04 | 21 | 2365 | 5 | 0.021
2020-07-24T07:00:00Z | 0.09 | 21 | 2365 | 5 | 0.030
Examining the preceding results, we see that the number of rows scanned, CPU used, and latency all changed significantly around 9:00 am. To understand why these numbers increased so dramatically, we'll examine the query text and see whether any changes in the schema might have impacted the query.
Use the following query to retrieve the query text for the query we are investigating.
SELECT text,
       text_truncated
FROM spanner_sys.query_stats_top_hour
WHERE text_fingerprint = 5505124206529314852
LIMIT 1;
This returns the following result.
text | text_truncated
---|---
select * from orders where o_custkey = 36901; | false
Examining the query text that is returned, we realize that the query is filtering on a field called o_custkey. This is a non-key column on the orders table. As it happens, there used to be an index on that column that was dropped around 9 AM, which explains the change in cost for this query. We can add the index back or, if the query is run infrequently, decide to not have the index and accept the higher read cost.
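For illustration, recreating such an index could look like the following DDL sketch. The table and column names come from the query text above; the index name is a hypothetical placeholder, and you should weigh the additional write cost before adding it.

CREATE INDEX OrdersByCustKey ON orders(o_custkey);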
So far, our investigation has focused on queries that completed successfully, and we found one reason why the database was experiencing performance degradation. In the next step, we'll focus on failed or canceled queries and show how to examine that data for more insights.
Investigate failed queries
Queries that don't complete successfully still consume resources before they time out, are cancelled, or otherwise fail. Spanner tracks the execution count and resources consumed by failed queries along with successful ones. These statistics don't track failed and cancelled partitioned DML statements.
To check whether failed queries are a significant contributor to system utilization, we can first check how many queries failed in the time interval of interest.
SELECT interval_end,
       all_failed_execution_count AS failed_count,
       all_failed_avg_latency_seconds AS latency
FROM spanner_sys.query_stats_total_minute
WHERE interval_end >= "2020-07-24T16:50:00Z"
  AND interval_end <= "2020-07-24T17:00:00Z"
ORDER BY interval_end;
interval_end | failed_count | latency
---|---|---
2020-07-24T16:52:00Z | 1 | 15.211391
2020-07-24T16:53:00Z | 3 | 58.312232
Investigating further, we can look for queries that are most likely to fail using the following query.
SELECT interval_end,
       text_fingerprint,
       execution_count,
       avg_latency_seconds AS avg_latency,
       all_failed_execution_count AS failed_count,
       all_failed_avg_latency_seconds AS failed_latency,
       cancelled_or_disconnected_execution_count AS cancel_count,
       timed_out_execution_count AS to_count
FROM spanner_sys.query_stats_top_minute
WHERE all_failed_execution_count > 0
ORDER BY interval_end;
interval_end | text_fingerprint | execution_count | failed_count | cancel_count | to_count
---|---|---|---|---|---
2020-07-24T16:52:00Z | 5505124206529314852 | 3 | 1 | 1 | 0
2020-07-24T16:53:00Z | 1697951036096498470 | 2 | 1 | 1 | 0
2020-07-24T16:53:00Z | 5505124206529314852 | 5 | 2 | 1 | 1
As the preceding table shows, the query with fingerprint 5505124206529314852 has failed multiple times during different time intervals. Given a pattern of failures such as this, it is interesting to compare the latency of successful and unsuccessful runs.
SELECT interval_end,
       avg_latency_seconds AS combined_avg_latency,
       all_failed_avg_latency_seconds AS failed_execution_latency,
       (avg_latency_seconds * execution_count
        - all_failed_avg_latency_seconds * all_failed_execution_count)
       / (execution_count - all_failed_execution_count) AS success_execution_latency
FROM spanner_sys.query_stats_top_hour
WHERE text_fingerprint = 5505124206529314852;
interval_end | combined_avg_latency | failed_execution_latency | success_execution_latency
---|---|---|---
2020-07-24T17:00:00Z | 3.880420 | 13.830709 | 2.774832
Apply best practices
Having identified a candidate query for optimization, we can next look at the query profile and try to optimize using SQL best practices .
What's next
- Use Oldest active queries to determine the longest-running active queries.
- Learn more about Investigating high CPU utilization.
- Learn about other Introspection tools.
- Learn about other information Spanner stores for each database in the database's information schema tables.
- Learn more about SQL best practices for Spanner.