This algorithm computes the connected components of an undirected graph \(G\) (or a directed graph, which the algorithm treats as an undirected one; see details in the following Input graph characteristics section). A connected component is a maximal set of nodes such that there is a path between any two nodes in the set. The algorithm returns a clustering containing one cluster per connected component.
Interface specifications
The following list summarizes the expectations on the graph passed as input to this algorithm. These expectations complement the ones listed in General assumptions on input graphs.
- Directed graph or undirected graph: The algorithm ignores edge directions; see details in the following Algorithm description section.
- Edge weights: The algorithm does not use edge weights.
While connected components typically apply to undirected graphs only, for performance reasons the implementation also accepts directed graphs. Even when the input is a directed graph, the algorithm still treats edges as though they were undirected. In other words, if the algorithm sees an edge \(xy\), then it assumes that an edge \(yx\) is implicitly present even if it may not actually be present in the graph.
The output is a non-overlapping clustering, with each cluster representing a connected component.
The algorithm does not accept any configuration parameters.
Algorithm
The algorithm uses an asynchronous implementation of a union-find data structure.
Initially, each node belongs to its own (singleton) cluster. Then, for each edge \(xy\) (in parallel), the algorithm merges the cluster containing \(x\) and the cluster containing \(y\) (if they are not the same cluster) into a single cluster. Edge processing happens in parallel with graph loading. The algorithm returns the clustering that results from all those cluster-merging operations.
The algorithm maintains the clustering as a disjoint set forest. While merging two clusters, it uses path compression to reduce the heights of the trees in the forest, so as to speed up future lookup and merge operations.
Computational complexity
The algorithm is a parallel algorithm that runs in \(O(n + m \log n)\) work and \(O(n)\) depth, where \(n\) and \(m\) denote the number of nodes and edges in the input graph, respectively. Running the algorithm on a machine that supports high parallelism tends to lead to a smaller running time.
The algorithm uses \(O(n)\) memory.
Example
Input graph
Consider the following input graph:
Solution
The graph has three connected components, containing nodes (a, b, c, d), (f, g), and (h):
Algorithm details
Initially, the algorithm places each node in its own cluster:
The algorithm then processes each edge of the graph, merging the clusters of its two endpoints (if they are not already in the same cluster).
Note that the order in which the algorithm processes the edges does not affect the result.
Processing edge ac: the algorithm merges clusters (a) and (c).
Processing edge bc: the algorithm merges clusters (b) and (ac).
Processing edge bd: the algorithm merges clusters (abc) and (d).
Processing edge cd: nodes c and d are already in the same cluster, so the clustering remains unchanged.
Processing edge fg: the algorithm merges clusters (f) and (g).
The clustering on the right is the final result.
External references
- ConnectIt: A Framework for Static and Incremental Parallel Graph Connectivity Algorithms (Proceedings of the VLDB Endowment, 2020) first published this implementation.
- A Randomized Concurrent Algorithm for Disjoint Set Union (Principles of Distributed Computing, 2016) inspired the asynchronous Union-Find implementation.
- An open-source implementation is available on GitHub.