Split statistics

This document describes how to detect and debug hotspots in your database. You can access statistics about hotspots in splits with both GoogleSQL and PostgreSQL.

Spanner stores your data as a contiguous key space, ordered by the primary keys of your tables and indexes. A split is a range of rows from a set of tables or an index. The split's start is called the split start. The split limit sets the end of the split. The split includes the split start, but not the split limit.

In Spanner, hotspots are situations where too many requests are sent to the same server, which saturates the resources of the server and potentially causes high latencies. The splits affected by hotspots are known as hot or warm splits.

A split's hotspot statistic (identified in the system as CPU_USAGE_SCORE) is a measurement of the load on a split that's constrained by the resources available on the server. This measurement is given as a percentage. If more than 50% of the load on a split is constrained by the available resources, then the split is considered warm. If 100% of the load on a split is constrained, then the split is considered hot. Such hot splits might also affect the latency of the requests served by them.

The CPU_USAGE_SCORE of a split can remain constant or vary over time based on the workload accessing the split and changes in the split boundaries.

Based on the warm and hot split resource constraints, Spanner might use load-based splitting to evenly distribute the load across the key space. The warm and hot splits can be moved across the instance's servers for load balancing. Spanner performs load-based splitting in the background, minimizing the impact on latency. However, Spanner might not be able to balance the load, even after multiple attempts at splitting, due to anti-patterns in the application. The UNSPLITTABLE_REASONS column in the statistics views provides specific reasons why a hot or warm split couldn't be divided further. Hence, persistent warm or hot splits that last at least 10 minutes might need further troubleshooting and potential application changes, especially when UNSPLITTABLE_REASONS are present.

The Spanner hot split statistics help you identify the splits where hotspots occur and understand why they may persist. These statistics, combined with UNSPLITTABLE_REASONS codes, can help you diagnose what actions you need to take to resolve hotspots. You can then make changes to your application or schema, as needed.

How to access hot split statistics

Spanner provides the hot split statistics in the SPANNER_SYS schema. SPANNER_SYS data is available through GoogleSQL and PostgreSQL interfaces. You can access this data in the following ways:

The following single read methods that Spanner provides don't support SPANNER_SYS:

  • Performing a strong read from a single row or multiple rows in a table.
  • Performing a stale read from a single row or multiple rows in a table.
  • Reading from a single row or multiple rows in a secondary index.

Hot split statistics

You use the following views to track hot splits:

  • SPANNER_SYS.SPLIT_STATS_TOP_MINUTE: shows splits that are hot during 1-minute intervals.
  • SPANNER_SYS.SPLIT_STATS_TOP_10MINUTE: shows splits that are hot during any part of a 10-minute interval.
  • SPANNER_SYS.SPLIT_STATS_TOP_HOUR: shows splits that are hot during any part of a 1-hour interval.

These views have the following properties:

  • Each view contains data for non-overlapping time intervals of the duration the view name specifies.
  • Intervals are based on clock times:
    • 1-minute intervals end on the minute.
    • 10-minute intervals end on the 10th minute of the hour, for example, 11:10:00, 11:20:00.
    • 1-hour intervals end on the hour.
  • After each interval, Spanner collects data from all servers and then makes the data available in the SPANNER_SYS views shortly thereafter. For example, at 11:59:30 AM, the most recent intervals available to SQL queries are:
    • 1 minute: 11:58:00-11:58:59 AM
    • 10 minutes: 11:40:00-11:49:59 AM
    • 1 hour: 10:00:00-10:59:59 AM
  • Spanner groups the statistics by splits.
  • Each row contains statistics, including the CPU_USAGE_SCORE percentage which indicates how hot or warm a split is, for each split that Spanner captures statistics for during the specified interval.
  • The SPANNER_SYS.SPLIT_STATS_TOP_MINUTE view offers the granular split statistics for every minute. Use this view for detailed debugging of recent events.
  • The SPANNER_SYS.SPLIT_STATS_TOP_10MINUTE and SPANNER_SYS.SPLIT_STATS_TOP_HOUR views provide an aggregated view within 10-minute and hour intervals, respectively. Use these views for trend analysis or investigating issues over the past few days or weeks. For more information on aggregation, see View event aggregation.
  • If Spanner is unable to store all the hot splits during the interval, the system prioritizes the splits with the highest CPU_USAGE_SCORE percentage during the specified interval. If there are no splits returned, it's an indication of the absence of any hot splits.

Data retention

The maximum amount of data that Spanner retains for each view, at any point in time, is as follows:

  • SPANNER_SYS.SPLIT_STATS_TOP_MINUTE: intervals covering the previous 24 hours.
  • SPANNER_SYS.SPLIT_STATS_TOP_10MINUTE: intervals covering the previous 4 days.
  • SPANNER_SYS.SPLIT_STATS_TOP_HOUR: intervals covering the previous 30 days.

These retention periods cannot be increased or decreased, and you can't prevent Spanner from collecting hot split statistics.

  • To delete statistics data, you must either delete the database being tracked or wait until the statistics data rolls out of retention.
  • To retain statistics data for longer periods, periodically copy data out of the hot split statistics views.

View schema

The following table shows the schema for hot split statistics:

Column name Type Description
INTERVAL_END TIMESTAMP End of the time interval during which the split was warm or hot.
SPLIT_START STRING The starting key of the range of rows in the split. The split start might also be <begin>, indicating the beginning of the key space.
SPLIT_LIMIT STRING The limit key for the range of rows in the split. The limit key might also be <end>, indicating the end of the key space.
CPU_USAGE_SCORE INT64 The CPU_USAGE_SCORE percentage of the splits. A CPU_USAGE_SCORE percentage of 50% indicates the presence of warm or hot splits.
AFFECTED_TABLES STRING ARRAY The tables whose rows might be in the split.
UNSPLITTABLE_REASONS STRING ARRAY Identifies the type of hotspots present that load-based splitting cannot mitigate, often due to anti-patterns. The presence of any reason indicates user intervention, such as schema or workload adjustments, is likely needed. An empty array means either no unsplittable conditions were detected during this interval or the high load was too short-lived for Spanner to determine if it was unsplittable. See UNSPLITTABLE_REASONS types for more details.

Split start and split limit keys

A split is a contiguous row range of a database, and is defined by its start and limit keys. A split can be a single row, a narrow row range, or a wide row range, and the split can include multiple tables or indexes.

The SPLIT_START and SPLIT_LIMIT columns identify the primary keys of a warm or hot split.

Example schema

The following schema is an example table for the topics in this page.

GoogleSQL

CREATE TABLE Users (
  UserId INT64 NOT NULL,
  FirstName STRING(MAX),
  LastName STRING(MAX),
) PRIMARY KEY(UserId);

CREATE INDEX UsersByFirstName ON Users(FirstName DESC);

CREATE TABLE Threads (
  UserId INT64 NOT NULL,
  ThreadId INT64 NOT NULL,
  Starred BOOL,
) PRIMARY KEY(UserId, ThreadId),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;

CREATE TABLE Messages (
  UserId INT64 NOT NULL,
  ThreadId INT64 NOT NULL,
  MessageId INT64 NOT NULL,
  Subject STRING(MAX),
  Body STRING(MAX),
) PRIMARY KEY(UserId, ThreadId, MessageId),
  INTERLEAVE IN PARENT Threads ON DELETE CASCADE;

CREATE INDEX MessagesIdx ON Messages(UserId, ThreadId, Subject),
INTERLEAVE IN Threads;

PostgreSQL

CREATE TABLE users
(
   userid    BIGINT NOT NULL PRIMARY KEY,-- INT64 to BIGINT
   firstname VARCHAR(max),-- STRING(MAX) to VARCHAR(MAX)
   lastname  VARCHAR(max)
);

CREATE INDEX usersbyfirstname
  ON users(firstname DESC);

CREATE TABLE threads
  (
    userid   BIGINT NOT NULL,
    threadid BIGINT NOT NULL,
    starred  BOOLEAN, -- BOOL to BOOLEAN
    PRIMARY KEY (userid, threadid),
    CONSTRAINT fk_threads_user FOREIGN KEY (userid) REFERENCES users(userid) ON
    DELETE CASCADE -- Interleave to Foreign Key constraint
  );

CREATE TABLE messages
  (
    userid    BIGINT NOT NULL,
    threadid  BIGINT NOT NULL,
    messageid BIGINT NOT NULL PRIMARY KEY,
    subject   VARCHAR(max),
    body      VARCHAR(max),
    CONSTRAINT fk_messages_thread FOREIGN KEY (userid, threadid) REFERENCES
    threads(userid, threadid) ON DELETE CASCADE
  -- Interleave to Foreign Key constraint
  );

CREATE INDEX messagesidx ON messages(userid, threadid, subject), REFERENCES
threads(userid, threadid);

Imagine your key space looks like this:

PRIMARY KEY
<begin>
Users()
Threads()
Users(2)
Users(3)
Threads(3)
Threads(3,"a")
Messages(3,"a",1)
Messages(3,"a",2)
Threads(3, "aa")
Users(9)
Users(10)
Threads(10)
UsersByFirstName("abc")
UsersByFirstName("abcd")
<end>

Example of splits

The following shows some example splits to help you understand what splits look like.

The SPLIT_START and SPLIT_LIMIT might indicate the row of a table or index, or they can be <begin> and <end>, representing the boundaries of the key space of the database. The SPLIT_START and SPLIT_LIMIT might also contain truncated keys, which are keys preceding any full key in the table. For example, Threads(10) is a prefix for any Threads row interleaved in Users(10).

SPLIT_START SPLIT_LIMIT AFFECTED_TABLES EXPLANATION
Users(3) Users(10) UsersByFirstName, Users, Threads, Messages, MessagesIdx Split starts at row with UserId=3 and ends at the row before the row with UserId = 10. The split contains the Users table rows and all its interleaved tables rows for UserId=3 to 10.
Messages(3,"a",1) Threads(3,"aa") Threads, Messages, MessagesIdx The split starts at the row with UserId=3, ThreadId="a" and MessageId=1 and ends at the row preceding the row with the key of UserId=3 and ThreadsId = "aa". The split contains all the tables between Messages(3,"a",1) and Threads(3,"aa"). As the split_start and split_limit are interleaved in the same top-level table row, the split contains the interleaved tables rows between the start and limit. See schemas-overview to understand how interleaved tables are co-located.
Messages(3,"a",1) <end> UsersByFirstName, Users, Threads, Messages, MessagesIdx The split starts in the messages table at the row with key UserId=3, ThreadId="a" and MessageId=1. The split hosts all the rows from the split_start to <end>, the end of the key space of the database. All the rows of the tables following the split_start, like Users(4) are included in the split.
<begin> Users(9) UsersByFirstName, Users, Threads, Messages, MessagesIdx The split starts at <begin>, the beginning of the key space of the database and ends at the row preceding the Users row with UserId=9. So the split has all the table rows preceding Users and all the rows of Users table preceding UserId=9 and the rows of its interleaved tables.
Messages(3,"a",1) Threads(10) UsersByFirstName, Users, Threads, Messages, MessagesIdx Split starts at Messages(3,"a", 1) interleaved in Users(3) and ends at the row preceding Threads(10). Threads(10) is a truncated split key that is a prefix of any key of the Threads table interleaved in Users(10).
Users() <end> UsersByFirstName, Users, Threads, Messages, MessagesIdx The split starts at the truncated split key of Users() which precedes any full key of the Users table. The split extends until the end of the possible key space in the database. The affected_tables hence cover the Users table, its interleaved tables and indexes and all the tables that might appear after users.
Threads(10) UsersByFirstName("abc") UsersByFirstName, Users, Threads, Messages, MessagesIdx The split starts at the Threads row with UserId = 10 and ends at the index, UsersByFirstName at the key preceding "abc".

UNSPLITTABLE_REASONS types

When Spanner cannot mitigate a hotspot through load-based splitting, the UNSPLITTABLE_REASONS column in the SPLIT_STATS_TOP_* views cites one or more of the following reasons:

HOT_ROW

Description: High load is concentrated on a single row. Spanner cannot add split points within an individual row.

Common Causes:

  • Frequent high-volume operations (reads, writes, or updates) on a single key.
  • Schema designs that centralize access to a single row.

Mitigation Strategies:

  • Reduce QPS to the hot split.
  • Redesign schema to distribute load. For example, shard counters across multiple rows.
  • Review Schema design best practices.

MOVING_HOT_SPOT

Description: The key range experiencing high load shifts over time, often sequentially. Load-based splitting is ineffective as the hotspot relocates before Spanner can split the previously affected range.

Common Causes:

  • Inserts with a monotonically increasing or decreasing leading key part, such as a commit timestamp.
  • Sequential point reads across the keyspace of a table.

Mitigation Strategies:

  • Avoid monotonically increasing or decreasing keys for the first part of the primary key in write-intensive workloads. For detailed mitigation strategies, see Schema design best practices. Techniques include using UUIDs or prepending a hash of the key.

LARGE_SCAN_HOT_SPOT

Description: The split experiences high load due to frequent or resource-intensive operations that scan across a key range. This can include range reads (including reads issued as part of a transaction) or queries. Spanner refrains from excessively splitting the ranges covered by such operations to avoid potential performance degradation for these scans, which can occur if the data becomes too fragmented across many small splits.

Common Causes:

  • Queries or read operations executing broad range scans on frequently accessed data.
  • DML statements (UPDATE, DELETE) with WHERE clauses that require scanning ranges.
  • Absence of suitable indexes, leading to base table scans.

Mitigation Strategies:

  • Optimize SQL statements (SELECT, UPDATE, DELETE) to reduce the number of rows scanned.
  • Create appropriate indexes to support common query and DML predicates, minimizing the number of rows scanned.

UNISOLATABLE_HOT_ROW

Description: Spanner identifies a narrow, high-load key but cannot isolate it by inserting new split points, due to the unavailability of a suitable split point. This case is similar to HOT_ROW, but the SPLIT_START and SPLIT_LIMIT don't completely isolate the hotspot.

Common Causes:

  • Intense, localized load on one row or adjacent rows sharing a key prefix.

Mitigation Strategies:

  • Analyze application access patterns for the keys within the reported SPLIT_START and SPLIT_LIMIT.
  • Mitigation strategies often overlap with HOT_ROW, focusing on reducing direct operational load on the problematic narrow key range.

UNSPECIFIED

Description: The split is experiencing high load and cannot be split, but the cause does not fall under the other specific categories. This can happen in complex load scenarios or due to internal system behavior.

Mitigation Strategies:

  • Investigate application workloads, queries, or transactions accessing the tables within the hot split (listed in AFFECTED_TABLES) that have shown increased load.
  • Use tools like Query Insights and Transaction Insights to identify expensive operations.
  • Evaluate the workload and ensure that you are using the Schema design best practices and SQL best practices.
  • If the hotspot persists for more than 10 minutes despite the prior optimizations, open a support case.

View event aggregation

Entries in the SPANNER_SYS.SPLIT_STATS_TOP_10MINUTE and SPANNER_SYS.SPLIT_STATS_TOP_HOUR views represent an aggregation of the 1-minute intervals within their respective windows:

CPU_USAGE_SCORE: This shows the maximum CPU_USAGE_SCORE recorded for the split in any 1-minute interval within the 10-minute or 1-hour window.

UNSPLITTABLE_REASONS: This array is a union of all unique UNSPLITTABLE_REASONS observed for the split across all 1-minute intervals within the window.

A split appears in these views if its CPU_USAGE_SCORE was 50% or higher in at least one of the constituent 1-minute intervals.

Example of aggregation

Examine the split from Users(101) to Users(102). The following table shows its potential entries in the MINUTE view over a 10-minute period from 10:00:00 to 10:10:00:

INTERVAL_END SPLIT_START SPLIT_LIMIT CPU_USAGE_SCORE AFFECTED_TABLES UNSPLITTABLE_REASONS
10:01:00 Users(101) Users(102) 60 [Messages,Users,Threads] []
10:02:00 Users(101) Users(102) 95 [Messages,Users,Threads] [HOT_ROW]
10:03:00 Users(101) Users(102) 80 [Messages,Users,Threads] [HOT_ROW]
10:04:00 Users(101) Users(102) 55 [Users,Threads] []
10:06:00 Users(101) Users(102) 70 [Users,Threads] [LARGE_SCAN_HOT_SPOT]
10:07:00 Users(101) Users(102) 65 [Users,Threads] [LARGE_SCAN_HOT_SPOT]
10:09:00 Users(101) Users(102) 52 [Users,Threads] []

The corresponding aggregated entry in 10MINUTE for the interval ending at 10:10:00 for this split would be:

INTERVAL_END SPLIT_START SPLIT_LIMIT CPU_USAGE_SCORE AFFECTED_TABLES UNSPLITTABLE_REASONS
10:10:00 Users(101) Users(102) 95 [Messages,Users,Threads] [HOT_ROW, LARGE_SCAN_HOT_SPOT]
  • CPU_USAGE_SCORE: 95 is the maximum value from the CPU_USAGE_SCORE column in the 1-minute view for this split within the window.
  • UNSPLITTABLE_REASONS: [HOT_ROW, LARGE_SCAN_HOT_SPOT] is the union of all unique reasons present in the UNSPLITTABLE_REASONS column in the 1-minute view.

This example shows how the 10MINUTE view summarizes the most intense load and all types of unsplittable issues encountered during the period. The HOUR view follows the same aggregation logic over a 60-minute window.

Find hot splits

You can use the following SQL statement to retrieve hot split statistics. You can run these SQL statements using the client libraries, Google Cloud CLI, or the Google Cloud console.

SELECT
  t.interval_end,
  t.split_start,
  t.split_limit,
  t.cpu_usage_score,
  t.affected_tables,
  t.unsplittable_reasons
FROM
  SPANNER_SYS.SPLIT_STATS_TOP_DURATION AS t
WHERE
  -- Optional: Filter by a specific interval end time
  -- t.interval_end = 'INTERVAL_END_TIME'
ORDER BY
  t.interval_end DESC, t.cpu_usage_score DESC;

Replace the following:

  • DURATION: choose MINUTE, 10MINUTE, or HOUR, based on the observation period. For example, SPANNER_SYS.SPLIT_STATS_TOP_HOUR.
  • INTERVAL_END_TIME: Replace with a TIMESTAMP of the end time of your observation period. For example, 2072-06-08 08:30:00Z.

Interpret query results

For a complete list of UNSPLITTABLE_REASONS codes and their possible diagnoses, see UNSPLITTABLE_REASONS types. For example, your query output might look like the following:

SPLIT_START SPLIT_LIMIT CPU_USAGE_SCORE AFFECTED_TABLES UNSPLITTABLE_REASONS
Threads(10) Threads(10, "aa") 100 Messages,Threads [UNISOLATABLE_HOT_ROW]
Messages(631, "abc", 1) Messages(631, "abc", 3) 100 Messages [HOT_ROW]
Users(620) <end> 100 Messages,Users,Threads [MOVING_HOT_SPOT]
Users(101) Users(102) 90 Messages,Users,Threads [HOT_ROW]
Users(13) Users(76) 82 Messages,Users,Threads [LARGE_SCAN_HOT_SPOT]
Threads(12, "zebra") Users(14) 76 Messages,Users,Threads []

From these results, you could infer the following issues:

  • Threads(10) to Threads(10, "aa"): Hot at 100% with [UNISOLATABLE_HOT_ROW]. A single key, a key range prefix in the Threads table, or an interleaved table is hot, and Spanner cannot split the range further.
  • Messages(631, "abc", 1) to Messages(631, "abc", 3): Hot at 100% with [HOT_ROW]. Load is concentrated on MessageId 1 and 2 for this User and Thread.
  • Users(620) to <end>: Hot at 100% with [MOVING_HOT_SPOT]. This often indicates a pattern of inserts with monotonically increasing or decreasing User IDs, causing the end of the key space to be persistently hot.
  • Users(101) to Users(102): Hot at 90% with [HOT_ROW]. The load is concentrated on the single Users row UserId = 101 and its interleaved children.
  • Users(13) to Users(76): Hot at 82% with [LARGE_SCAN_HOT_SPOT]. This suggests frequent or expensive scans across this range of User IDs.
  • Threads(12, "zebra") to Users(14): Warm at 76% usage. No unsplittable reasons were detected in this interval. Spanner might still be able to split this if the load persists or increases.

Troubleshoot hotspots using hot split statistics

This section describes how to detect and troubleshoot hotspots.

Select a time period to investigate

Check the latency metrics for your Spanner database to find the time period when your application experienced high latency and CPU usage. For example, it might show you that an issue started around 10:50 PM on May 18, 2072.

Check for unsplittable reasons

As Spanner balances load with load-based splitting, we recommend that you investigate hotspots that continue for more than 10 minutes, especially if they have UNSPLITTABLE_REASONS. The presence of UNSPLITTABLE_REASONS indicates that Spanner cannot split the hot split, and schema or workload changes might be needed to mitigate the hotspot.

You can query for UNSPLITTABLE_REASONS as shown in the following example query:

SELECT
  reason,
  COUNT(*) AS occurrences
FROM
  SPANNER_SYS.SPLIT_STATS_TOP_MINUTE AS t,
  UNNEST(t.unsplittable_reasons) AS reason
WHERE
  t.cpu_usage_score >= 50
  AND ARRAY_LENGTH(t.unsplittable_reasons) > 0
  AND t.interval_end >= "2072-05-18T17:40:00Z"  -- Start of window
  AND t.interval_end <= "2072-05-18T17:50:00Z"  -- End of window
GROUP BY
  reason
ORDER BY
  occurrences DESC;

The presence of UNSPLITTABLE_REASONS indicates a need for further debugging.

You can also monitor unsplittable reasons using Cloud Monitoring. The metric to use is unsplittable_reason_count. For more information, see Spanner metrics.

Find the splits with the highest CPU_USAGE_SCORE and their UNSPLITTABLE_REASONS

For this example, we run the following SQL to find the row ranges with the highest CPU_USAGE_SCORE level and their corresponding UNSPLITTABLE_REASONS:

GoogleSQL

SELECT t.split_start,
     t.split_limit,
     t.cpu_usage_score,
     t.affected_tables,
     t.unsplittable_reasons
FROM   SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE  t.cpu_usage_score >= 50
AND  t.interval_end = "interval_end_date_time";

Replace interval_end_date_time with the date and time for the interval, using the format YYYY-MM-DDTHH:MM:SSZ. For example, 2072-05-18T17:40:00Z.

PostgreSQL

SELECT t.split_start,
     t.split_limit,
     t.cpu_usage_score,
     t.affected_tables,
     t.unsplittable_reasons
FROM   spanner_sys.split_stats_top_minute t
WHERE  t.cpu_usage_score >= 50
AND  t.interval_end = 'interval_end_date_time'::timestamptz;

Replace interval_end_date_time with the date and time for the interval, using the format YYYY-MM-DDTHH:MM:SSZ. For example, 2072-05-18T17:40:00Z.

The previous SQL outputs the following:

SPLIT_START SPLIT_LIMIT CPU_USAGE_SCORE AFFECTED_TABLES UNSPLITTABLE_REASONS
Users(180) <end> 85 Messages,Users,Threads [MOVING_HOT_SPOT]
Users(24) Users(76) 76 Messages,Users,Threads [HOT_ROW, LARGE_SCAN_HOT_SPOT]
Threads(10) UsersByFirstName("abc") 100 UsersByFirstName, Users, Threads, Messages, MessagesIdx []

From this table of results, we can see that there are three hot splits and two of them are unsplittable. Hotspots with UNSPLITTABLE_REASONS that persist over time warrant further investigation. To understand what each reason means and how to mitigate it, see UNSPLITTABLE_REASONS types.

Best practices to mitigate hotspots

Note: If you've had an increase in your load and have recently upscaled your instance, Spanner might take a few minutes to perform its load-balancing operations before your latency decreases.

If load-balancing doesn't decrease latency, the next step is to identify the cause of the hotspots. After that, options are to either reduce the hotspot workload, or optimize the application schema and logic to avoid hotspots.

Identify the cause

  • Use Lock & Transaction Insights to look for transactions that have high lock wait time where the row range start key is within the hot split.
  • Use Query Insights to look for queries that read from the table that contains the hot split, and have recently increased latency, or a higher ratio of latency to CPU.
  • Use Oldest Active Queries to look for queries that read from the table that contains the hot split, and have higher than expected latency.

Some special cases to watch for:

  • Check to see if time to live (TTL) was enabled recently. If there are a lot of splits from old data, then TTL can raise CPU_USAGE_SCORE levels during mass deletes. In this case, the issue should self-resolve once the initial deletions complete.

Optimize the workload

  • Follow SQL best practices. Consider stale reads, writes that don't perform reads first, or adding indexes.
  • Follow Schema best practices. Ensure your schema is designed to handle load balancing and avoid hotspots.

What's next