Run Spanner Graph algorithms

This document explains how to run algorithms on Spanner Graph.

Spanner Graph algorithm query structure

A Spanner Graph algorithm query has the following structure:

EXPORT DATA OPTIONS (<export_option_list>) AS
GRAPH graph_name
<match_clause>
<call_statement> algorithm_name(<common_input>, <algorithm_specific_input>)
  YIELD <algorithm_specific_output>
RETURN <results>
  • <export_option_list>: Options that define how to persist algorithm query results. See Cloud Storage Options and Spanner Options.

  • graph_name: The name of the graph.

  • <match_clause>: Optional MATCH statements to define algorithm input elements.

  • <call_statement>: Use CALL when you omit <match_clause> and want to operate on the whole graph. Use CALL PER() when <match_clause> is present and you want to operate on the working table. For more information, see GQL CALL.

  • algorithm_name: The name of the algorithm to run. For available algorithms, see Spanner Graph algorithms.

  • <common_input>: Named input parameters common to all algorithm queries. For more information, see common algorithm input parameters.

  • <algorithm_specific_input>: Named input parameters for the algorithm. For more information, see input parameters defined in Spanner Graph algorithms.

  • <algorithm_specific_output>: The output of the algorithm call. For more information, see the outputs defined in Spanner Graph algorithms and YIELD in CALL statement.

  • <results>: Defines what to return in query results.

The query is composed of an EXPORT DATA statement, which defines how to persist results, and a GRAPH CLAUSE that produces the algorithm query result.

In its simplest form, the graph clause identifies the graph, CALLs an algorithm that yields predefined output, and then specifies what to RETURN from the algorithm's output.

Optionally, the graph clause can use supported MATCH statements to select elements of interest. In this case, use a PER () clause to group all rows returned by MATCH as input to the algorithm. The algorithm operates on a logical subgraph composed of the unique set of nodes and edges selected.

The query doesn't return any data. Results are persisted according to export_option_list.

For more information about Spanner Graph algorithm queries, see the following sections in this document:

Common algorithm input parameters

Specify these named input parameters in the following format: NAME => VALUE, ....

Name Value Type Required Default Value Description
node_labels ARRAY No (none) Only supported when CALL is used. A list of node labels to include in the algorithm input. If specified, only nodes with at least one matching label are included.
edge_labels ARRAY No (none) Only supported when CALL is used. A list of edge labels to include in the algorithm input. If specified, only edges with at least one matching label are included.
edge_weight_property STRING No (none) The name of the edge property that contains the weights. If undefined, the system assigns a default weight of 1 to all edges. The property value type must be numeric.
machine_category STRING No default The machine category to use for the algorithm execution. Supported values are: default, large
zone STRING No (none) The zone where the algorithm execution takes place. Must be one of the zones in the region the query is received in.
max_idle_time STRING No 30m Specifies how long the compute instance should remain active for reuse after the algorithm completes. Format is a sequence of decimal numbers, each with a unit suffix, such as 4m, 1.5h or 1h45m. Valid time units are ns, us (or µs), ms, s, m, h.

Handle algorithm output

You must persist algorithm query results before you can inspect them. Use the export_data_option to describe how to persist the results. You can persist the results to Cloud Storage or back to the same Spanner instance the query originated from.

Persist results to Cloud Storage

To use this option, make sure Storage Object Admin (roles/storage.objectAdmin) role is granted to Google-managed Spanner service account service-PROJECT_NUMBER@gcp-sa-spanner.iam.gserviceaccount.com.

The following EXPORT DATA options are supported when persisting results to Cloud Storage. Specify the options in the following format: NAME=VALUE, ....

Name Value Type Required Description
uri STRING Yes The destination URI for the export, in gs://bucket/path/file format. If you export a large amount of data, use a wildcard in uri to export data into multiple files. For example, gs://bucket/path/file_*.csv.
format STRING Yes The format of exported data. Supported values: CSV, PARQUET, AVRO.
header BOOL No If true, the system prints column headers for the first row of each data file. The default is false. Applies only to CSV.
overwrite BOOL No If true, the system overwrites any existing files with the same URI. Otherwise, if files with the same URI exist, the statement returns an error. The default is false.
field_delimiter STRING No The delimiter that separates fields. Default: , (comma). Applies only to CSV.
compression STRING No Specifies compression format. If you don't specify a compression format, files remain uncompressed.
  • For CSV, supported value is GZIP.
  • For PARQUET, supported values are: SNAPPY, GZIP, ZSTD.
  • For AVRO, supported values are: DEFLATE, SNAPPY.

The column names in RETURN clause define the column names in Cloud Storage output files.

Export data into one or more files

Spanner Graph queries support a single wildcard operator (*) in the uri. The wildcard can appear in the filename component, but not in bucket name, folder name, or file extension. Using the wildcard operator instructs Spanner Graph to create multiple sharded files based on the pattern you supply if the result set is large. The system replaces the wildcard operator with a number, starting at zero, left-padded to 12 digits. For example, a URI gs://my-bucket/file-*.csv creates files like gs://my-bucket/file-000000000000.csv, gs://my-bucket/file-000000000001.csv, and similar files.

If you use a uri without wildcard, the result is a single file, like gs://my-bucket/file.csv.

Data types

When you export data, Spanner graph data types convert as follows, depending on the format:

CSV

All data types convert to their string representation:

  • BOOL values convert to true or false.
  • BYTES values base64-encode.
  • TIMESTAMP values format as YYYY-MM-DD HH:MM:SS.ffffff UTC.
  • NULL values appear as empty strings.

You cannot export nested and repeated data in CSV format.

Avro

Spanner data type Avro data type
BOOL BOOLEAN
INT64 LONG
FLOAT FLOAT
DOUBLE DOUBLE
NUMERIC BYTES with logical type DECIMAL(38,9)
STRING STRING
BYTES BYTES
TIMESTAMP LONG (microseconds since epoch)
NULL null

Parquet

Spanner data type Parquet data type
BOOL BOOLEAN
INT64 INT64
FLOAT FLOAT
DOUBLE DOUBLE
NUMERIC DECIMAL(38,9)
STRING STRING
BYTES BYTE_ARRAY
TIMESTAMP TIMESTAMP_MICROS
NULL null

Persist results to Spanner

The following EXPORT DATA options are supported when persisting results back to your source Spanner instance. Specify the options in the following format: NAME=VALUE, ....

Name Value Type Required Description
format STRING Yes The format of exported data. Must be CLOUD_SPANNER.
table STRING Yes The name of the destination Spanner table to write results to. This can be any table in the Spanner instance.
write_mode STRING Yes The write mode to use. Supported values are:
  • update_ignore_all: Updates existing rows in the destination table.
  • upsert_ignore_all: Inserts new rows or updates existing rows in the destination table.

In both modes, Spanner skips any record that would introduce a constraint violation (for example, missing keys on an update, unique index violation, foreign key constraint violation). However, the write fails for non-constraint violation errors (for example, column type mismatch, missing values for NOT NULL columns).

Requirements

When you persist algorithm results back to Spanner, your algorithm query must satisfy the following:

  • The destination table must exist.
  • Columns must exist with a matching type: All column names specified in the RETURN clause must already exist in the destination table with a matching data type. Use aliases to match the destination table column names if needed. Example: RETURN node.id AS person_id.
  • Include all primary key columns: The RETURN clause must include all primary key columns of the destination table.

Write semantics

Persisting results back to Spanner is a non-transactional operation. It provides row-level atomicity. This means the system either successfully writes all columns from the same row or writes none of them. It follows at-least-once semantics. This means a row can be written to multiple times. Reading from the destination table while the execution is underway might yield incomplete results.

If the overall execution fails, the system does not roll back changes that have already been committed. The write process fails on the first non-retryable error. When a write failure occurs, the ERROR_MESSAGE in GRAPH_OPERATION_EXECUTION_STATUS indicates the primary key of the row that failed along with the specific reason for the failure.

The system writes algorithm results back to Spanner Graph using MEDIUM priority.

Run example algorithm queries

This section shows example Spanner Graph algorithm queries you can run on a test instance. For a full list of algorithms Spanner Graph supports, see Spanner Graph algorithms.

Before you begin

To run the example Spanner Graph algorithm queries, you must first complete the following:

  1. Follow Set up and query Spanner Graph to create a Spanner Graph.
  2. Ensure you have the required permissions.
  3. Optional: Augment the Spanner Graph schema if you are persisting output to Spanner.

Add a new column named page_rank to the Account table. Spanner writes algorithm results to this new column. Then, refresh the graph definition so you can access page_rank as a node property.

-- Add `page_rank` as a column. Data type of this column matches the data type defined in `PageRank` output signature.
ALTER TABLE Account ADD COLUMN page_rank FLOAT64;

-- Rerun the graph definition DDL to pickup `page_rank` as a new property.
CREATE OR REPLACE PROPERTY GRAPH FinGraph
  NODE TABLES (`Account`, `Person`)
  EDGE TABLES (
    `PersonOwnAccount`
      SOURCE KEY (id) REFERENCES `Person` (id)
      DESTINATION KEY (account_id) REFERENCES `Account` (id)
      LABEL `Owns`,
    `AccountTransferAccount`
      SOURCE KEY (id) REFERENCES `Account` (id)
      DESTINATION KEY (to_id) REFERENCES `Account` (id)
      LABEL `Transfers`
  );

Run algorithm on full graph with label filter and persist results to Cloud Storage

This example runs PageRank to rank Accounts based on the Transactions they participate in and persists results to a Cloud Storage in CSV format as "my-bucket-name/my-output.csv"

EXPORT DATA OPTIONS (
  uri = "gs://my-bucket-name/my-output.csv",
  format = "csv"
) AS
GRAPH FinGraph
CALL PageRank(node_labels => ['Account'], edge_labels => ['Transfers']) YIELD node, score
RETURN node.id, score AS page_rank

In Cloud Storage, you should see a CSV file with two columns (id and page_rank) when this query completes successfully.

Run algorithm on subgraph defined by MATCH and persist results to graph

This example uses the MATCH pattern to dynamically match a logical subgraph containing all Account nodes and only Transfer edges with an amount less than 500. This logical subgraph is the input to the PageRank algorithm. Spanner persists algorithm results back to the Account table.

EXPORT DATA OPTIONS (
  format = "CLOUD_SPANNER",
  table = "Account",
  write_mode = 'update_ignore_all'
) AS
GRAPH FinGraph
MATCH (n:Account)
RETURN n
FULL UNION ALL
MATCH -[e:Transfers WHERE e.amount < 500]->
RETURN e
NEXT
CALL PER () PageRank() YIELD node, score
RETURN node.id, score AS page_rank

After the query completes successfully, run the following query:

GRAPH FinGraph
MATCH (n:Account)
RETURN n.id, ROUND(n.page_rank, 2) AS page_rank
ORDER BY page_rank DESC, id ASC

You should see results similar to the following:

id page_rank
20 0.49
16 0.46
7 0.05

Check algorithm execution status

When a graph algorithm query completes successfully, it returns zero rows and Success status. Depending on the input graph size and specific algorithm configurations, the algorithm execution might take a while to complete. You can check the progress and execution status of a graph algorithm query in SPANNER_SYS.GRAPH_OPERATION_EXECUTION_STATUS table. This table retains information for 30 days.

GRAPH_OPERATION_EXECUTION_STATUS schema

Column name Type Description
QUERY_ID STRING The ID for the graph algorithm query.
QUERY_TEXT STRING The query statement text.
START_TIMESTAMP TIMESTAMP The time at which the query started execution.
LAST_UPDATE_TIMESTAMP TIMESTAMP The time at which the status was last updated.
PROGRESS FLOAT Estimated percentage of completion. The value is between 0 and 1, where 0 means started and 1 means completed.
STATUS STRING Current state of execution. Possible values are PENDING, IN_PROGRESS, OK, CANCELLED, DEADLINE_EXCEEDED, UNKNOWN.
ERROR_MESSAGE STRING Error message if the query execution failed.

The following sample query lists graph queries that have not yet completed successfully:

SELECT
  query_id,
  query_text,
  start_timestamp,
  last_update_timestamp,
  progress,
  status,
  error_message
FROM
  SPANNER_SYS.GRAPH_OPERATION_EXECUTION_STATUS
WHERE
  status != "OK"
ORDER BY
  start_timestamp DESC;

Cancel algorithm execution

To cancel an in-flight graph algorithm query, locate the query_id from SPANNER_SYS.GRAPH_OPERATION_EXECUTION_STATUS table, then call cancel_query for that query_id.

What's next