This document describes best practices for running algorithms on Spanner Graph. It covers the following topics:
- Graph schema best practices for algorithms
- Handle algorithm output
- Specifying
machine_categoryfor large workloads - Reusing compute for fast iteration
- Differentiating element types in output
Graph schema best practices for algorithms
If you use customized labels and
properties
in
Spanner Graph, we recommend that you always expose element
keys
as properties. This way,
you can access those key properties in the algorithm query's RETURN
clause,
and use them to identify the graph element for which the algorithm generates output.
Handle algorithm output
This section shares recommendations on what to return from algorithm output and where to persist the returned results.
What to return
When the algorithm's output signature contains graph elements, Spanner Graph lets you return any property of those elements. To reduce processing cost, we recommend that you minimize the list of properties to return. For example, only return properties that are keys of an element because keys are necessary to identify the element.
Where to persist
Spanner Graph supports persisting algorithm query result to Cloud Storage or back to the same Spanner Graph instance where you initiate the query.
Persisting back to Spannermakes the algorithm output accessible to all
Spanner operations natively. For example, you can run WeaklyConnectedComponent
, persist cluster_id
back to the graph.
Subsequently, run PageRank
for an input graph with specific cluster_id
.
Spanner features like Change
Stream
propagate new algorithm
outputs to downstream systems. If you want to use algorithm output in
Spanner, we recommend persisting to Spanner.
While Spanner Graph uses efficient ways to batch
and persist algorithm output to Spanner, all writes must go
through the same leader
replica. If you
already have critical workloads taking up leader capacity, you might want to
stagger algorithm runs to avoid competing with critical workloads.
Consider persisting to Cloud Storageif you don't plan to use algorithm output in Spanner or want to evaluate the algorithm output before ingesting it to your primary data source. See supported file formats .
Data modeling considerations for augmenting the graph
This section discusses some data modeling considerations when persisting algorithm output to Spanner.
Properties versus edges
Algorithms can produce a variety of insights, including the following:
- Node-level metrics (for example, centrality score). These can be persisted as new properties on the node.
- Pairwise insights (for example, similarity between two nodes). These can be persisted as new edges between the two nodes with the output value as edge property.
- Community detection algorithms identify logical communities in the graph and can assign one or more communities to a given node. You can model community membership as a new property on the node by tagging each member node with the ID of its assigned community. You can also model community membership as new subgraphs and store the logical communities as a new type of nodes in the graph with edges connecting them to member nodes. Community property might be sufficient if you want to know the community a node belongs to when you access a node. Community subgraph might be more appropriate when you want to navigate by community (for example, find nodes in the same community as a starting node). Depending on how you plan to use community information, you can choose either or both.
Predefined schema versus flexible schema
Before you persist algorithm output back to Spanner, define the destination column or table in the schema in one of the following ways:
- If your algorithm use cases are stable and known, you can add additional columns or tables beforehand.
- If you are iterating and experimenting with different ways to extract insight (for example, trying out different algorithms, parameters, input subgraphs), you might want a flexible way to persist and differentiate output from multiple experimentations without updating the Spanner schema for each run. In this case, you can consider storing new properties and edges produced by algorithms in generic child tables.
To store new properties and edges produced by algorithms in generic child tables, follow these steps:
Step 1: Create generic child tables
-- Create `AccountAlgoProperty` as a child table of `Account`, to store all
-- properties produced by algorithms for `Account`.
-- `algo_run_id`: a unique ID for a given algorithm run.
-- `int_val`: column to store integer algorithm output.
-- `float_val`: column to store float algorithm output.
CREATE
TABLE
AccountAlgoProperty
(
id
INT64
NOT
NULL
,
algo_run_id
STRING
(
200
)
NOT
NULL
,
int_val
INT64
,
float_val
FLOAT64
)
PRIMARY
KEY
(
id
,
algo_run_id
),
INTERLEAVE
IN
PARENT
Account
;
-- Create `AccountAlgoEdge` as a child table of `Account`, to store all
-- outgoing edges produced by algorithms for `Account`.
-- `algo_run_id`: a unique ID for a given algorithm run.
-- `to_id`: destination `Account` id.
-- `int_val`: column to store integer algorithm output.
-- `float_val`: column to store float algorithm output.
CREATE
TABLE
AccountAlgoEdge
(
id
INT64
NOT
NULL
,
algo_run_id
STRING
(
200
)
NOT
NULL
,
to_id
INT64
NOT
NULL
,
int_val
INT64
,
float_val
FLOAT64
,
CONSTRAINT
FK_AccountId
FOREIGN
KEY
(
to_id
)
REFERENCES
Account
(
id
)
NOT
ENFORCED
,
)
PRIMARY
KEY
(
id
,
algo_run_id
,
to_id
),
INTERLEAVE
IN
PARENT
Account
;
Step 2: Store properties and edges produced by algorithms as child table rows,
along with the unique algo_run_id
that produced them.
-- Store PageRank score as property in child table.
EXPORT
DATA
OPTIONS
(
format
=
"CLOUD_SPANNER"
,
table
=
"AccountAlgoProperty"
,
write_mode
=
'upsert_ignore_all'
)
AS
GRAPH
FinGraph
CALL
PageRank
(
node_labels
=
>
[
'Account'
]
,
edge_labels
=
>
[
'Transfers'
]
)
RETURN
node
.
id
,
"page_rank_123"
AS
algo_run_id
,
score
As
float_val
;
-- Store ShortestPath output as edge in child table.
EXPORT
DATA
OPTIONS
(
format
=
"CLOUD_SPANNER"
,
table
=
"AccountAlgoEdge"
,
write_mode
=
'upsert_ignore_all'
)
AS
GRAPH
FinGraph
CALL
ShortestPath
(
source_nodes
=
>
ARRAY
{
MATCH
(
n
:
Account
{
id
:
7
}
)
RETURN
n
}
,
target_nodes
=
>
ARRAY
{
MATCH
(
n
:
Account
{
id
:
16
}
)
RETURN
n
}
)
YIELD
source_node
,
target_node
,
path
,
cost
RETURN
source_node
.
id
AS
id
,
"shortest_path_456"
AS
algo_run_id
,
target_node
.
id
AS
to_id
,
PATH_LENGTH
(
path
)
AS
int_val
,
cost
AS
float_val
;
Step 3: Access algorithm output using GRAPH_TABLE
SELECT
acct
.
id
,
prop
.
float_val
AS
page_rank_score
FROM
GRAPH_TABLE
(
FinGraph
MATCH
(
n
:
Account
)
RETURN
n
.
id
)
acct
JOIN
AccountAlgoProperty
prop
ON
acct
.
id
=
prop
.
id
AND
prop
.
algo_run_id
=
'page_rank_123'
;
SELECT
acct
.
id
,
edge
.
to_id
,
edge
.
int_val
AS
path_len
,
edge
.
float_val
AS
cost
FROM
GRAPH_TABLE
(
FinGraph
MATCH
(
n
:
Account
)
RETURN
n
.
id
)
acct
JOIN
AccountAlgoEdge
edge
ON
acct
.
id
=
edge
.
id
AND
edge
.
algo_run_id
=
'shortest_path_456'
;
Step 4: Optionally, create a logical graph augmented with algorithm generated properties and edges to make accessing algorithm output easier.
-- Create a view that aggregates non-null algorithm properties per node into
-- name-value pairs in JSON.
CREATE
OR
REPLACE
VIEW
AccountWithAlgoProperties
SQL
SECURITY
INVOKER
AS
SELECT
n
.
id
,
ANY_VALUE
(
n
.
create_time
)
AS
create_time
,
ANY_VALUE
(
n
.
is_blocked
)
AS
is_blocked
,
ANY_VALUE
(
n
.
nick_name
)
AS
nick_name
,
JSON_OBJECT
(
IF
(
COUNT
(
c
.
algo_run_id
)
=
0
,
[]
,
ARRAY_AGG
(
c
.
algo_run_id
)),
IF
(
COUNT
(
c
.
algo_run_id
)
=
0
,
[]
,
ARRAY_AGG
(
COALESCE
(
IF
(
c
.
int_val
IS
NULL
,
NULL
,
TO_JSON
(
c
.
int_val
)),
IF
(
c
.
float_val
IS
NULL
,
NULL
,
TO_JSON
(
c
.
float_val
))
))))
AS
algo_props
FROM
Account
AS
n
LEFT
JOIN
AccountAlgoProperty
AS
c
ON
n
.
id
=
c
.
id
GROUP
BY
n
.
id
;
-- Create FinGraphAugmented with algorithm generated properties and edges.
CREATE
OR
REPLACE
PROPERTY
GRAPH
FinGraphAugmented
NODE
TABLES
(
AccountWithAlgoProperties
AS
Account
KEY
(
id
)
LABEL
Account
PROPERTIES
(
create_time
,
id
,
is_blocked
,
nick_name
,
algo_props
),
Person
)
EDGE
TABLES
(
PersonOwnAccount
SOURCE
KEY
(
id
)
REFERENCES
Person
(
id
)
DESTINATION
KEY
(
account_id
)
REFERENCES
Account
(
id
)
LABEL
Owns
,
AccountTransferAccount
SOURCE
KEY
(
id
)
REFERENCES
Account
(
id
)
DESTINATION
KEY
(
to_id
)
REFERENCES
Account
(
id
)
LABEL
Transfers
,
AccountAlgoEdge
SOURCE
KEY
(
id
)
REFERENCES
Account
(
id
)
DESTINATION
KEY
(
to_id
)
REFERENCES
Account
(
id
)
);
You can then directly access algorithm properties and navigate through algorithm generated edges in graph query:
-- Retrieve PageRank score property in graph query.
GRAPH
FinGraphAugmented
MATCH
(
a
:
Account
)
RETURN
a
.
id
,
a
.
algo_props
.
page_rank_123
AS
page_rank
ORDER
BY
page_rank
DESC
LIMIT
5
;
-- Navigate through ShortestPath edge in graph query.
GRAPH
FinGraphAugmented
MATCH
(
s
:
Account
{
id
:
7
}
)
-[
e
:
AccountAlgoEdge
{
algo_run_id
:
'shortest_path_456'
}]-
> (
d
:
Account
)
RETURN
s
.
id
AS
src
,
d
.
id
AS
dst
,
e
.
int_val
AS
path_len
,
e
.
float_val
AS
cost
Specify machine_category
for large workloads
For most workloads, the default
machine_category is appropriate. If your input
graph has > 1 billion nodes and > 10 billion edges, we recommend that you
explicitly choose large
for machine_category
.
Reuse compute for fast iteration
If you are experimenting with different algorithms or different input parameters
on small datasets for fast iteration, we recommend that you use max_idle_time
.
A larger max_idle_time
Spanner Graph lets you reuse compute resources
it already provisioned for more algorithm workloads. This reduces the overhead
of cold starting compute on demand and mitigates the risk of being unable to
provision compute due to temporary lack of capacity. Note that algorithm compute
is billable when it is idle.
Differentiate element types in output
If the algorithm output contains multiple element types, you might want to include ELEMENT_DEFINITION_NAME
in your output to differentiate them. For
example:
EXPORT
DATA
OPTIONS
(
uri
=
"gs://bucket-name/page_rank.csv"
,
format
=
"csv"
,
overwrite
=
TRUE
)
AS
GRAPH
FinGraph
CALL
PageRank
()
YIELD
node
,
score
RETURN
ELEMENT_DEFINITION_NAME
(
node
)
AS
node_type
,
node
.
id
,
score
;

