Graph query best practices

This document describes best practices for optimizing your BigQuery Graph queries.

Start path traversal from low-cardinality nodes

To keep intermediate result sets small and speed up query execution, write your graph queries so that path traversal starts from lower cardinality nodes, regardless of the direction of path traversal. The following MATCH statements use property filters to reduce the number of possible starting nodes instead of computing all matches and then filtering:

MATCH (p:Person {id: 10})-[own:Owns]->(a:Account)
MATCH (a:Account WHERE balance > 10)<-[own:Owns]-(p:Person)

This is especially important for quantified path queries:

MATCH (p:Person {id: 10})-[own:Owns]->{1,3}(a:Account)

Use ANY or ANY SHORTEST syntax for connectivity checks

Quantified path queries can return duplicate paths between source nodes and destination nodes. If your goal is to check for connectivity and you don't require all possible paths, use ANY or ANY SHORTEST to reduce redundant computations and improve path lookup efficiency. For example, the following MATCH statement uses ANY SHORTEST to keep only one path between each pair of nodes:

MATCH ANY SHORTEST (a1:Account)-[t:Transfers]->{1,3}(a2:Account)

Use directional path traversal

BigQuery Graph schemas are directional, which means that each edge has a source node and a destination node. Although graph query syntax allows path traversal in any direction (for example, -[edge]-), we recommend using directional path traversal (for example, -[edge]-> or <-[edge]-) for better performance. Any direction path traversal might cause performance loss.

The following MATCH statement uses any direction path traversal:

-- Avoid.
MATCH (a1:Account {id: 7})-[t:Transfers]-(a2:Account)

Instead, combine two directional traversals with UNION ALL:

MATCH (a1:Account {id: 7})-[t:Transfers]->(a2:Account)
...
UNION ALL
...
MATCH (a1:Account  {id: 7})<-[t:Transfers]-(a2:Account)

Specify labels explicitly

If node or edge labels are omitted in a query, BigQuery Graph enumerates all qualifying node and edge labels. This enumeration might cause more labels to be scanned than necessary. To avoid this, specify labels for all nodes and edges in your query whenever possible.

For example, the following query specifies the Account and Transfers labels:

GRAPH graph_db.FinGraph
MATCH (a1:Account)-[t:Transfers]->(a2:Account)
RETURN COUNT(*) AS num_transfers;

Avoid omitting labels, because it might scan other unneeded relationships between nodes. In the following query, a1 can represent an account or a person, and t can represent a transfer or an account ownership.

GRAPH graph_db.FinGraph
MATCH (a1)-[t]->(a2)
RETURN COUNT(*) AS num_transfers;

Prefer a single MATCH statement

BigQuery Graph lets you include multiple MATCH statements in a single graph query. These statements are connected by multi-declared variables that represent the same node or edge. However, using multiple MATCH statements can diminish cardinality benefits across statements. When possible, use a single MATCH statement for better performance.

For example, the following queries are equivalent, but the first one performs better because it uses a single MATCH statement:

-- Preferred syntax.
GRAPH graph_db.FinGraph
MATCH
  (p:Person {id: 1})-[o:Owns]->
  (a:Account)-[t:Transfers]->(a2:Account)
RETURN o.account_id, t.amount;
-- Avoid this syntax.
GRAPH graph_db.FinGraph
MATCH (p:Person {id: 1})-[o:Owns]->(a:Account)
MATCH (a:Account)-[t:Transfers]->(a2:Account)
RETURN o.account_id, t.amount;

Limit traversed edges from high-cardinality nodes

When you query graphs, some nodes can have a significantly larger number of incoming or outgoing edges compared to other nodes. These high-cardinality nodes are sometimes called super nodes or hub nodes. Super nodes can cause performance issues because traversals through them might involve processing large amounts of data, which leads to data skew and long execution times.

To optimize a query of a graph with super nodes, use the ROW_NUMBER() function within a FILTER clause or WHERE clause in MATCH to limit the number of edges that the query traverses from or to a node. This technique is particularly useful when you don't need a complete enumeration of all connections from or to a super node.

For example, if some accounts in FinGraph have a large number of transactions, you can use ROW_NUMBER() to limit the number of Transfers edges to consider for each Account and avoid an inefficient query:

GRAPH graph_db.FinGraph
MATCH (a1:Account)-[e1:Transfers WHERE e1 IN {
  GRAPH graph_db.FinGraph
  -- Sample 5 edges per source node
  MATCH -[selected_e:Transfers]->
    FILTER ROW_NUMBER() OVER (
      PARTITION BY SOURCE_NODE_ID(selected_e)) < 5
    RETURN selected_e
}]->{1,3}(a2:Account)
RETURN COUNT(*) AS cnt;

Sample intermediate nodes or edges in multi-hop queries

You can also improve query efficiency by using ROW_NUMBER() to sample intermediate nodes in multi-hop queries. This technique improves efficiency by limiting the number of paths that the query considers for each intermediate node. To do this, break a multi-hop query into multiple MATCH statements separated by NEXT, and apply ROW_NUMBER() at the midpoint where you need to sample:

GRAPH graph_db.FinGraph
MATCH (a1:Account)-[e1:Transfers]->(a2:Account)
-- Sample 5 destination nodes per a1 source node
FILTER ROW_NUMBER() OVER (PARTITION BY ELEMENT_ID(a1)) < 5
RETURN a1, a2
NEXT
MATCH (a2)-[e2:Transfers]->(a3:Account)
RETURN a1.id AS src_id, a2.id AS mid_id, a3.id AS dst_id;

What's next