This document describes best practices for tuning Spanner Graph query performance, which include the following optimizations:
- Avoid a full scan of the input table for nodes and edges.
- Reduce the amount of data the query needs to read from storage.
- Reduce the size of intermediate data.
Start from lower cardinality nodes
Write the path traversal so that it starts with the lower cardinality nodes. This approach keeps the intermediate result set small, and speeds up query execution.
For example, the following queries have the same semantics:
Forward edge traversal:
GRAPH FinGraph MATCH (p:Person {name:"Alex"})-[:Owns]->(a:Account {is_blocked: true}) RETURN p.id AS person_id, a.id AS account_id;Reverse edge traversal:
GRAPH FinGraph MATCH (a:Account {is_blocked:true})<-[:Owns]-(p:Person {name: "Alex"}) RETURN p.id AS person_id, a.id AS account_id;
Assuming that there are fewer people with the name Alex than there are
blocked accounts, we recommend that you write this query in the forward
edge traversal.
Starting from lower cardinality nodes is especially important for variable-length path traversal. The following example shows the recommended way to find accounts that are within three transfers of a given account.
GRAPH FinGraph
MATCH (:Account {id: 7})-[:Transfers]->{1,3}(a:Account)
RETURN a.id;
Specify all labels by default
Spanner Graph infers the qualifying nodes and edge labels if labels are omitted. We recommend that you specify labels for all nodes and edges where possible, because this inference might not always be possible and it might cause more labels than necessary to be scanned.
Single MATCH statement
The following example finds accounts linked by at most 3 transfers from the given account:
GRAPH FinGraph
MATCH (src:Account {id: 7})-[:Transfers]->{1,3}(dst:Account)
RETURN dst.id;
Across MATCH statements
Specify labels on nodes and edges when they refer to the same element but are
across MATCH statements.
The following example shows this recommended approach:
GRAPH FinGraph
MATCH (acct:Account {id: 7})-[:Transfers]->{1,3}(other_acct:Account)
RETURN acct, COUNT(DISTINCT other_acct) AS related_accts
GROUP BY acct
NEXT
MATCH (acct:Account)<-[:Owns]-(p:Person)
RETURN p.id AS person, acct.id AS acct, related_accts;
Use IS_FIRST to optimize queries
You can use the
IS_FIRST
function to improve query performance by sampling edges and limiting traversals
in graphs. This function helps handle high-cardinality nodes and optimize
multi-hop queries.
If your specified sample size is too small, the query might return no data. Because of this, you might need to try different sample sizes to find the optimal balance of returned data and improved query performance.
These IS_FIRST examples use FinGraph, a financial graph with Account nodes
and Transfers edges for money transfers. To create the FinGraph and use it
to run the sample queries, see
Set up and query Spanner Graph.
Limit traversed edges to improve query performance
When you query graphs, some nodes can have a significantly larger number of incoming or outgoing edges compared to other nodes. These high-cardinality nodes are sometimes called super nodes or hub nodes. Super nodes can cause performance issues because traversals through them might involve processing huge amounts of data, which leads to data skew and long execution times.
To optimize a query of a graph with super nodes, use the IS_FIRST function
within a FILTER clause to limit the number of edges the query traverses from a
node. Because accounts in FinGraph might have significantly higher numbers of
transactions than others, you might use IS_FIRST to prevent an inefficient
query. This technique is particularly useful when you don't need a complete
enumeration of all connections from a super node.
The following query finds accounts (a2) that either directly or indirectly
receive transfers from blocked accounts (a1). The query uses IS_FIRST to
prevent slow performance when an account has many transfers by limiting the
number of Transfers edges to consider for each Account.
GRAPH FinGraph
MATCH
(a1:Account {is_blocked: true})
-[e:Transfers WHERE e IN
{
MATCH -[selected_e:Transfers]->
FILTER IS_FIRST(@max_transfers_per_account) OVER (
PARTITION BY SOURCE_NODE_ID(selected_e)
ORDER BY selected_e.create_time DESC)
RETURN selected_e
}
]->{1,5}
(a2:Account)
RETURN a1.id AS src_id, a2.id AS dst_id;
This example uses the following:
@max_transfers_per_account: A query parameter that specifies the maximum number ofTransfersedges to consider for each account (a1).PARTITION BY SOURCE_NODE_ID(selected_e): Ensures that theIS_FIRSTlimit applies independently for each account (a1).ORDER BY selected_e.create_time DESC: Specifies that the most recent transfers are returned.
Sample intermediate nodes to optimize multi-hop queries
You can also improve query efficiency by using IS_FIRST to sample intermediate
nodes in multi-hop queries. This technique improves efficiency by limiting the
number of paths the query considers for each intermediate node. To do this,
break a multi-hop query into multiple MATCH statements separated by NEXT, and
apply IS_FIRST at the midpoint where you need to sample:
GRAPH FinGraph
MATCH (a1:Account {is_blocked: true})-[e1:Transfers]->(a2:Account)
FILTER IS_FIRST(1) OVER (PARTITION BY a2)
RETURN a1, a2
NEXT
MATCH (a2)-[e2:Transfers]->(a3:Account)
RETURN a1.id AS src_id, a2.id AS mid_id, a3.id AS dst_id;
To understand how IS_FIRST optimizes this query:
The clause
FILTER IS_FIRST(1) OVER (PARTITION BY a2)is applied in the firstMATCHstatement.For each intermediate account node (
a2),IS_FIRSTconsiders only the first incomingTransfersedge (e1), reducing the number of paths to explore in the secondMATCHstatement.The overall two-hop query's efficiency is improved because the second
MATCHdoesn't process unnecessary data, especially whena2has many incoming transfers.
Use factorized execution to optimize queries
Optimize Spanner Graph query performance by factorizing a query that traverses a graph pattern and creates duplicate intermediate results.
A graph query can execute in factorized mode, which deduplicates
high-cardinality path prefixes before traversing the rest of the path. This
optimization improves query performance by reducing repeated key comparisons or
hash table probes during graph query execution. To factorize a query, add the
@{factorized_mode} hint
to the individual pattern traversal or at the query level.
Set factorized_mode to one of the following:
factorize_left: Optimizes graph traversals where many paths converge on a few nodes. Spanner Graph deduplicates paths by destination node ID and then fetches node properties to reduce redundant work.factorize_both: Optimizes non-linear queries with multiple sub-paths that share nodes. Spanner Graph avoids generating intermediate results for each path independently. Instead, it computes each sub-path once and then combines them, delaying the cross-product of results until the end. Use this for patterns like triangles or branches.
For more information about traversal hints, see Graph hints.
The following examples show how to use factorized mode hints to improve graph
query performance. To run the sample code, create the FinGraph schema in
Set up and query Spanner Graph.
Reduce intermediate results in a linear query
Fraudulent accounts often receive a high volume of transactions. Queries designed to identify these accounts can generate large intermediate results, many of which are duplicates because they converge on a small set of destination accounts. Retrieving fraudulent node properties from these redundant intermediate results requires unnecessary computation, which can significantly slow down query performance.
The @{factorized_mode} hint addresses this issue by optimizing query execution.
The following example demonstrates how to use this hint to retrieve account and
transaction details. It traces transfers originating from a known blocked
account to potentially fraudulent accounts that are one to three hops away. The
unoptimized version of this query is available in the second tab.
Factorized query
GRAPH FinGraph
MATCH (a1:Account {is_blocked: true})-[e1:Transfers]->{1,3}@{factorized_mode=factorize_left}(a2:Account)
LET total_amount = SUM(e1.amount)
RETURN a2.id, a2.create_time, total_amount;
Unoptimized query
GRAPH FinGraph
MATCH (a1:Account {is_blocked: true})-[e1:Transfers]->{1,3}(a2:Account)
LET total_amount = SUM(e1.amount)
RETURN a2.id, a2.create_time, total_amount;
When the number of distinct destination nodes a2 is much smaller than the
number of paths leading to a2, this technique prevents redundant work. This is
common in fraud networks, where intermediaries transfer money to a few
recipients. The factorized hint performs the following actions:
Finds all paths that include one to three edges starting from a blocked node, and the sum of transfers along those paths.
Determines the number of distinct destination nodes
a2.Retrieves the
create_timefor each distinct destination nodea2only once.Combines the
create_timefor each nodea2with the list of paths leading toa2to form the query result.
Defer intermediate results generation in complex non-linear queries
Graph queries often have a complex structure, consisting of several linear
subpath patterns. The following example shows a query that analyzes transactions
originating from an account with id equal to 1. This query categorizes all
destination accounts by their create_time into three distinct buckets:
- Within the past 30 days
- 30 to 90 days ago
- Older than 90 days
The query returns all triples and the total transaction amount for these transactions.
The @{factorized_mode=factorize_both} hint optimizes query execution by
avoiding early intermediate result materialization. It uses the fact that all
subpaths in the MATCH clause are conditionally independent, given the value of
the start node a. factorize_both evaluates the first two subpaths
independently and doesn't repeat evaluation of the last subpath.
Factorized query
GRAPH FinGraph
MATCH (a:Account {id: 1})-[e1:Transfers]->(a1:Account),
@{factorized_mode=factorize_both}(a:Account)-[e2:Transfers]->(a2:Account),
(a:Account)-[e3:Transfers]->(a3:Account)
WHERE e1.create_time < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
AND e2.create_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
AND e2.create_time < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND e3.create_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
LET total_amount = e1.amount + e2.amount + e3.amount
RETURN a1.id as a1_id, a2.id as a2_id, a3.id as a3_id, total_amount;
Unoptimized query
GRAPH FinGraph
MATCH (a:Account {id: 1})-[e1:Transfers]->(a1:Account),
(a:Account)-[e2:Transfers]->(a2:Account),
(a:Account)-[e3:Transfers]->(a3:Account)
WHERE e1.create_time < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
AND e2.create_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
AND e2.create_time < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND e3.create_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
LET total_amount = e1.amount + e2.amount + e3.amount
RETURN a1.id as a1_id, a2.id as a2_id, a3.id as a3_id, total_amount;
For a given starting account with id equal to 1, each outgoing Transfers
edge leads to multiple independent destination nodes. Consider the following
example:
Mis the number of accounts (a1) that have transfers in(-inf, 90 DAYS)rangeNis the number of accounts (a2) that have transfers in[90 DAYS, 30 DAYS)rangeKis the number of accounts (a3) that have transfers in[30 DAYS, +inf)range
The unoptimized query generates intermediate results early, computing the
second subpath M times and the third subpath M * N times. In contrast, a
factorized version optimizes query execution by doing the following:
Obtaining
Mdestination nodes (a1) for the first subpath, and temporarily storing a set ofa1nodes.Obtaining
Ndestination nodes (a2) for the second subpath only once and temporarily storing a set ofa2nodes.Obtaining
Kdestination nodes (a3) for the last subpath only once.Performing a cross product between temporary intermediate results to create the product of
MxNxKrecords.
What's next
- Learn how to query property graphs in Spanner Graph.
- Migrate to Spanner Graph.