This document explains how to run algorithms on Spanner Graph.
Spanner Graph algorithm query structure
A Spanner Graph algorithm query has the following structure:
EXPORT DATA OPTIONS (<export_option_list>) AS
GRAPH graph_name
<match_clause>
<call_statement> algorithm_name(<common_input>, <algorithm_specific_input>)
YIELD <algorithm_specific_output>
RETURN <results>
<export_option_list>: Options that define how to persist algorithm query results. See Cloud Storage Options and Spanner Options.graph_name: The name of the graph.<match_clause>: Optional MATCH statements to define algorithm input elements.<call_statement>: UseCALLwhen you omit<match_clause>and want to operate on the whole graph. UseCALL PER()when<match_clause>is present and you want to operate on the working table. For more information, see GQL CALL.algorithm_name: The name of the algorithm to run. For available algorithms, see Spanner Graph algorithms.<common_input>: Named input parameters common to all algorithm queries. For more information, see common algorithm input parameters.<algorithm_specific_input>: Named input parameters for the algorithm. For more information, see input parameters defined in Spanner Graph algorithms.<algorithm_specific_output>: The output of the algorithm call. For more information, see the outputs defined in Spanner Graph algorithms andYIELDin CALL statement.<results>: Defines what to return in query results.
The query is composed of an EXPORT DATA statement, which defines how to persist results, and a GRAPH CLAUSE that produces the algorithm query result.
In its simplest form, the graph clause identifies the graph, CALLs an algorithm that yields predefined output, and then specifies what to RETURN from the algorithm's output.
Optionally, the graph clause can use supported MATCH statements to select elements of interest. In this case, use a PER () clause to group all rows returned by MATCH as input to the algorithm. The algorithm operates on a logical subgraph composed of the unique set of nodes and edges selected.
The query doesn't return any data. Results are persisted according to export_option_list.
For more information about Spanner Graph algorithm queries, see the following sections in this document:
Common algorithm input parameters
Specify these named input parameters in the following format: NAME => VALUE, ....
| Name | Value Type | Required | Default Value | Description |
|---|---|---|---|---|
node_labels |
ARRAY |
No | (none) | Only supported when CALL is used. A list of node labels to include in
the algorithm input. If specified, only nodes with at least one matching
label are included.
|
edge_labels |
ARRAY |
No | (none) | Only supported when CALL is used. A list of edge labels to include in
the algorithm input. If specified, only edges with at least one matching
label are included.
|
edge_weight_property |
STRING |
No | (none) | The name of the edge property that contains the weights. If undefined, the system assigns a default weight of 1 to all edges. The property value type must be numeric. |
machine_category |
STRING |
No | default | The machine category to use for the algorithm execution. Supported values are: default, large |
zone |
STRING |
No | (none) | The zone where the algorithm execution takes place. Must be one of the zones in the region the query is received in. |
max_idle_time |
STRING |
No | 30m | Specifies how long the compute instance should remain active for
reuse after the algorithm completes. Format is a sequence of decimal
numbers, each with a unit suffix, such as 4m, 1.5h
or 1h45m. Valid time units are ns,
us (or µs), ms, s,
m, h. |
Handle algorithm output
You must persist algorithm query results before you can inspect them. Use the
export_data_option to describe how to persist the results. You can persist the
results to Cloud Storage or back to the same Spanner instance the
query originated from.
Persist results to Cloud Storage
To use this option, make sure Storage Object Admin (roles/storage.objectAdmin)
role is granted to Google-managed
Spanner service account
service-PROJECT_NUMBER@gcp-sa-spanner.iam.gserviceaccount.com.
The following EXPORT DATA options are supported when persisting results to
Cloud Storage. Specify the options in the following format: NAME=VALUE, ....
| Name | Value Type | Required | Description |
|---|---|---|---|
uri |
STRING |
Yes | The destination URI for the export, in gs://bucket/path/file format. If you export a large amount of data, use a wildcard in uri to export data into multiple files. For example, gs://bucket/path/file_*.csv. |
format |
STRING |
Yes | The format of exported data. Supported values: CSV, PARQUET, AVRO. |
header |
BOOL |
No | If true, the system prints column headers for the first row of each data file. The default is false. Applies only to CSV. |
overwrite |
BOOL |
No | If true, the system overwrites any existing files with the same URI. Otherwise, if files with the same URI exist, the statement returns an error. The default is false. |
field_delimiter |
STRING |
No | The delimiter that separates fields. Default: , (comma). Applies only to CSV. |
compression |
STRING |
No | Specifies compression format. If you don't specify a compression format, files remain uncompressed.
|
The column names in RETURN clause define the column names in Cloud Storage
output files.
Export data into one or more files
Spanner Graph queries support a single wildcard operator (*) in the uri. The wildcard can appear in the filename component, but not in bucket name, folder name, or file extension. Using the wildcard operator instructs Spanner Graph to create multiple sharded files based on the pattern you supply if the result set is large. The system replaces the wildcard operator with a number, starting at zero, left-padded to 12 digits. For example, a URI gs://my-bucket/file-*.csv creates files like
gs://my-bucket/file-000000000000.csv, gs://my-bucket/file-000000000001.csv, and similar files.
If you use a uri without wildcard, the result is a single file, like
gs://my-bucket/file.csv.
Data types
When you export data, Spanner graph data types convert as follows, depending on the format:
CSV
All data types convert to their string representation:
BOOLvalues convert totrueorfalse.BYTESvalues base64-encode.TIMESTAMPvalues format asYYYY-MM-DD HH:MM:SS.ffffff UTC.NULLvalues appear as empty strings.
You cannot export nested and repeated data in CSV format.
Avro
| Spanner data type | Avro data type |
|---|---|
BOOL |
BOOLEAN |
INT64 |
LONG |
FLOAT |
FLOAT |
DOUBLE |
DOUBLE |
NUMERIC |
BYTES with logical type DECIMAL(38,9) |
STRING |
STRING |
BYTES |
BYTES |
TIMESTAMP |
LONG (microseconds since epoch) |
NULL |
null |
Parquet
| Spanner data type | Parquet data type |
|---|---|
BOOL |
BOOLEAN |
INT64 |
INT64 |
FLOAT |
FLOAT |
DOUBLE |
DOUBLE |
NUMERIC |
DECIMAL(38,9) |
STRING |
STRING |
BYTES |
BYTE_ARRAY |
TIMESTAMP |
TIMESTAMP_MICROS |
NULL |
null |
Persist results to Spanner
The following EXPORT DATA options are supported when persisting results back
to your source Spanner instance. Specify the options in the following
format: NAME=VALUE, ....
| Name | Value Type | Required | Description |
|---|---|---|---|
format |
STRING |
Yes | The format of exported data. Must be CLOUD_SPANNER. |
table |
STRING |
Yes | The name of the destination Spanner table to write results to. This can be any table in the Spanner instance. |
write_mode |
STRING |
Yes | The write mode to use. Supported values are:
In both modes, Spanner skips any record that would introduce a constraint violation (for example, missing keys on an update, unique index violation, foreign key constraint violation). However, the write fails for non-constraint violation errors (for example, column type mismatch, missing values for NOT NULL columns). |
Requirements
When you persist algorithm results back to Spanner, your algorithm query must satisfy the following:
- The destination table must exist.
- Columns must exist with a matching type: All column names specified in the
RETURNclause must already exist in the destination table with a matching data type. Use aliases to match the destination table column names if needed. Example:RETURN node.id AS person_id. - Include all primary key columns: The
RETURNclause must include all primary key columns of the destination table.
Write semantics
Persisting results back to Spanner is a non-transactional operation. It provides row-level atomicity. This means the system either successfully writes all columns from the same row or writes none of them. It follows at-least-once semantics. This means a row can be written to multiple times. Reading from the destination table while the execution is underway might yield incomplete results.
If the overall execution fails, the system does not roll back changes that have
already been committed. The write process fails on the first non-retryable
error. When a write failure occurs, the ERROR_MESSAGE in
GRAPH_OPERATION_EXECUTION_STATUS
indicates the primary key of the row that failed along with the specific
reason for the failure.
The system writes algorithm results back to Spanner Graph using
MEDIUM priority.
Run example algorithm queries
This section shows example Spanner Graph algorithm queries you can run on a test instance. For a full list of algorithms Spanner Graph supports, see Spanner Graph algorithms.
Before you begin
To run the example Spanner Graph algorithm queries, you must first complete the following:
- Follow Set up and query Spanner Graph to create a Spanner Graph.
- Ensure you have the required permissions.
- Optional: Augment the Spanner Graph schema if you are persisting output to Spanner.
Add a new column named page_rank to the Account table. Spanner
writes algorithm results to this new column. Then, refresh the graph definition
so you can access page_rank as a node property.
-- Add `page_rank` as a column. Data type of this column matches the data type defined in `PageRank` output signature.
ALTER TABLE Account ADD COLUMN page_rank FLOAT64;
-- Rerun the graph definition DDL to pickup `page_rank` as a new property.
CREATE OR REPLACE PROPERTY GRAPH FinGraph
NODE TABLES (`Account`, `Person`)
EDGE TABLES (
`PersonOwnAccount`
SOURCE KEY (id) REFERENCES `Person` (id)
DESTINATION KEY (account_id) REFERENCES `Account` (id)
LABEL `Owns`,
`AccountTransferAccount`
SOURCE KEY (id) REFERENCES `Account` (id)
DESTINATION KEY (to_id) REFERENCES `Account` (id)
LABEL `Transfers`
);
Run algorithm on full graph with label filter and persist results to Cloud Storage
This example runs PageRank to rank Accounts based on the Transactions they
participate in and persists results to a Cloud Storage in CSV format as
"my-bucket-name/my-output.csv"
EXPORT DATA OPTIONS (
uri = "gs://my-bucket-name/my-output.csv",
format = "csv"
) AS
GRAPH FinGraph
CALL PageRank(node_labels => ['Account'], edge_labels => ['Transfers']) YIELD node, score
RETURN node.id, score AS page_rank
In Cloud Storage, you should see a CSV file with two columns (id and
page_rank) when this query completes successfully.
Run algorithm on subgraph defined by MATCH and persist results to graph
This example uses the MATCH pattern to dynamically match a logical subgraph
containing all Account nodes and only Transfer edges with an amount less than 500.
This logical subgraph is the input to the PageRank algorithm.
Spanner persists algorithm results back to the Account table.
EXPORT DATA OPTIONS (
format = "CLOUD_SPANNER",
table = "Account",
write_mode = 'update_ignore_all'
) AS
GRAPH FinGraph
MATCH (n:Account)
RETURN n
FULL UNION ALL
MATCH -[e:Transfers WHERE e.amount < 500]->
RETURN e
NEXT
CALL PER () PageRank() YIELD node, score
RETURN node.id, score AS page_rank
After the query completes successfully, run the following query:
GRAPH FinGraph
MATCH (n:Account)
RETURN n.id, ROUND(n.page_rank, 2) AS page_rank
ORDER BY page_rank DESC, id ASC
You should see results similar to the following:
| id | page_rank |
|---|---|
| 20 | 0.49 |
| 16 | 0.46 |
| 7 | 0.05 |
Check algorithm execution status
When a graph algorithm query completes successfully, it returns zero rows and
Success status. Depending on the input graph size and specific algorithm
configurations, the algorithm execution might take a while to complete. You can
check the progress and execution status of a graph algorithm query in
SPANNER_SYS.GRAPH_OPERATION_EXECUTION_STATUS table. This table retains
information for 30 days.
GRAPH_OPERATION_EXECUTION_STATUS schema
| Column name | Type | Description |
|---|---|---|
QUERY_ID |
STRING |
The ID for the graph algorithm query. |
QUERY_TEXT |
STRING |
The query statement text. |
START_TIMESTAMP |
TIMESTAMP |
The time at which the query started execution. |
LAST_UPDATE_TIMESTAMP |
TIMESTAMP |
The time at which the status was last updated. |
PROGRESS |
FLOAT |
Estimated percentage of completion. The value is between 0 and 1, where 0 means started and 1 means completed. |
STATUS |
STRING |
Current state of execution. Possible values are PENDING, IN_PROGRESS, OK, CANCELLED, DEADLINE_EXCEEDED, UNKNOWN. |
ERROR_MESSAGE |
STRING |
Error message if the query execution failed. |
The following sample query lists graph queries that have not yet completed successfully:
SELECT
query_id,
query_text,
start_timestamp,
last_update_timestamp,
progress,
status,
error_message
FROM
SPANNER_SYS.GRAPH_OPERATION_EXECUTION_STATUS
WHERE
status != "OK"
ORDER BY
start_timestamp DESC;
Cancel algorithm execution
To cancel an in-flight graph algorithm query, locate the query_id from
SPANNER_SYS.GRAPH_OPERATION_EXECUTION_STATUS table, then call
cancel_query
for that query_id.