Connected components

This algorithm computes the connected components of an undirected graph \(G\) (or a directed graph, which the algorithm treats as an undirected one; see details in the following Input graph characteristics section). A connected component is a maximal set of nodes such that there is a path between any two nodes in the set. The algorithm returns a clustering containing one cluster per connected component.

Interface specifications

The following list summarizes the expectations on the graph passed as input to this algorithm. These expectations complement the ones listed in General assumptions on input graphs.

Directed graph or undirected graph: The algorithm ignores edge directions; see details in the following Algorithm description section.

Edge weights: The algorithm does not use edge weights.

While connected components typically apply to undirected graphs only, for performance reasons the implementation also accepts directed graphs. Even when the input is a directed graph, the algorithm still treats edges as though they were undirected. In other words, if the algorithm sees an edge \(xy\), then it assumes that an edge \(yx\) is implicitly present even if it may not actually be present in the graph.

The output is a non-overlapping clustering, with each cluster representing a connected component.

The algorithm does not accept any configuration parameters.

Algorithm

The algorithm uses an asynchronous implementation of a union-find data structure.

Initially, each node belongs to its own (singleton) cluster. Then, for each edge \(xy\) (in parallel), the algorithm merges the cluster containing \(x\) and the cluster containing \(y\) (if they are not the same cluster) into a single cluster. Edge processing happens in parallel with graph loading. The algorithm returns the clustering that results from all those cluster-merging operations.

The algorithm maintains the clustering as a disjoint set forest. While merging two clusters, it uses path compression to reduce the heights of the trees in the forest, so as to speed up future lookup and merge operations.

Computational complexity

The algorithm is a parallel algorithm that runs in \(O(n + m \log n)\) work and \(O(n)\) depth, where \(n\) and \(m\) denote the number of nodes and edges in the input graph, respectively. Running the algorithm on a machine that supports high parallelism tends to lead to a smaller running time.

The algorithm uses \(O(n)\) memory.

Example

Input graph

Consider the following input graph:

An undirected graph with seven nodes and various edges.

Solution

The graph has three connected components, containing nodes (a, b, c, d), (f, g), and (h):

The three connected components discovered in the graph: {a, b, c, d}, {f, g}, and {h}, each enclosed in a dashed outline.

Algorithm details

Initially, the algorithm places each node in its own cluster:

The initial state of the algorithm where each node is in its own individual cluster.

The algorithm then processes each edge of the graph, merging the clusters of its two endpoints (if they are not already in the same cluster).

Note that the order in which the algorithm processes the edges does not affect the result.

Processing edge ac: the algorithm merges clusters (a) and (c).

Merging of nodes a and c into a single cluster after processing edge ac.

Processing edge bc: the algorithm merges clusters (b) and (ac).

Merging of node b into the existing {a, c} cluster after processing edge bc.

Processing edge bd: the algorithm merges clusters (abc) and (d).

Merging of node d into the existing {a, b, c} cluster after processing edge
bd.

Processing edge cd: nodes c and d are already in the same cluster, so the clustering remains unchanged.

Clustering unchanged after processing edge cd, since nodes c and d are already
in the same
cluster.

Processing edge fg: the algorithm merges clusters (f) and (g).

Merging of nodes f and g into a new cluster after processing edge fg.

The clustering on the right is the final result.

External references

ConnectIt: A Framework for Static and Incremental Parallel Graph Connectivity Algorithms (Proceedings of the VLDB Endowment, 2020) first published this implementation.
A Randomized Concurrent Algorithm for Disjoint Set Union (Principles of Distributed Computing, 2016) inspired the asynchronous Union-Find implementation.
An open-source implementation is available on GitHub.