PageRank

This algorithm computes the approximate PageRank centralities of all nodes of a weighted directed graph with non-negative edge weights. Those centralities provide a ranking of the nodes based on the underlying link structure of the graph.

Originally, the PageRank algorithm was used to rank Web pages for Web search indexing. The computation was based on the idea of a "random surfer" who randomly clicks on links. The PageRank score of a page is proportional to the probability that the random surfer will land on that page.

Formally, the PageRank centralities of nodes in a graph are defined as follows. Let \(S\) be a nonempty set of source nodes chosen by the user; by default, \(S\) consists of all nodes of the graph. Consider a random walk which starts at a node of \(S\) selected uniformly at random, and at each step does one of these actions:

If the current node has at least one outgoing edge with positive weight, do one of two actions:
- With probability \(\gamma\), select one of those outgoing edges and traverse it (that is, move to the other endpoint of the selected edge). The outgoing edge to be traversed is selected at random; for each of the candidate outgoing edges, its probability of being selected is proportional to its weight.
- With probability \(1 - \gamma\), move to one of the nodes in \(S\), which the algorithm chooses uniformly at random.
Otherwise (that is, if all outgoing edges of the current node have weight zero): move to one of the nodes in \(S\), which the algorithm chooses uniformly at random.

Here, \(\gamma \in [0, 1)\) is a parameter of the algorithm, called the damping factor. The (exact) PageRank centrality of a node is defined as the limit as \(s\) goes to infinity of the probability that such a random walk with \(s\) steps ends at that node. Note that the PageRank centrality is defined (and computed by the algorithm) for all nodes in the graph, not just those in the set \(S\) of source nodes.

Interface specifications

The following list summarizes the expectations on the graph passed as input to this algorithm. These expectations complement the ones listed in General assumptions on input graphs.

Directed graph (an undirected graph can be converted into a directed graph by replacing each undirected edge \(xy\) with two directed edges \(xy\) and \(yx\), both with the same weight as the original edge).

Edge weights: Must be non-negative.

The output is a list of \(n\) floating-point numbers (where \(n\) denotes the number of nodes in the input graph), corresponding to the approximate PageRank centralities of all nodes (not just those in \(S\)).

These parameters configure the algorithm:

Name	Type	Description
`damping_factor`	floating-point number	The probability that, at any given iteration, the algorithm chooses to traverse one of the outgoing edges of the current node (if the node has at least one outgoing edge with positive weight). Must be in the range [0, 1). This parameter is usually set to 0.85 in research papers related to PageRank.
`num_iterations`	integer	The maximum number of iterations of the algorithm. Must be positive. The running time of the algorithm scales linearly with the number of iterations executed; a higher number of iterations tends to produce more precise results.
`approx_precision`	floating-point number	Approximation precision threshold for the algorithm. Must be non-negative. The algorithm stops if the L1-distance between the approximate centrality vector that the algorithm obtains in two consecutive iterations is smaller than `approx_precision` times the number of nodes in the graph. Smaller values of `approx_precision` tend to lead to more precise results, but also tend to increase the running time.

Algorithm

The algorithm executes at most num_iterations (possibly fewer than that, if the algorithm achieves convergence earlier—see the documentation of the approx_precision parameter). The \(i\)-th iteration computes the probability distribution for the location of the random walk after \(i\) steps.

The algorithm uses the following equations to compute those distributions. Let \(P_i(v)\) (where \(i = 0, 1, 2...\)) denote the probability that the random walk reaches node \(v\) after \(i\) steps.

For \(i = 0\), since the algorithm chooses the starting point of the random walk as a source node uniformly at random:

\[ P_0(y) = \begin{cases} \frac{1}{|S|} & \text{, if } y \in S ;\\ 0 & \text{, otherwise.} \\ \end{cases} \]

For \(i \ge 1\), the algorithm derives the \(P_i\) values from the \(P_{i-1}\) values using a recurrence relation.

Let:

\(V\) denote the set of nodes of the graph;
\(\deg^+(x)\) denote the weighted out-degree of a node \(x\) (that is, the sum of the weights of \(x\)'s outgoing edges);
\(V_{\textrm{sink}}\) denote the set of nodes of the graph whose weighted out-degree is \(0\);
\(E\) denote the set of the (directed) edges in the graph;
\(w_{xy}\) denote the weight of an edge \(xy\);
\(N^-(y)\) denote the set of nodes \(x\) such that there is an edge \(xy\) with positive weight.

The algorithm computes the \(P_i\) values by this relation:

\[ P_i(y) = \begin{cases} \gamma \sum_{x \in N^-(y)} \frac{w_{xy}P_{i-1}(x)}{\deg^+(x)} + \frac{(1 - \gamma) + \gamma\sum_{x \in V_{\textrm{sink}}}P_{i-1}(x)}{|S|} & \text{, if } y \in S ;\\ \gamma \sum_{x \in N^-(y)} \frac{w_{xy}P_{i-1}(x)}{\deg^+(x)} & \text{, otherwise.} \\ \end{cases}\]

Computational complexity

The algorithm is a parallel algorithm that runs in \(O(\text{num_iterations} \cdot (m + n))\) work and \(O(\text{num_iterations} \cdot \log n)\) depth, where \(n\) and \(m\) denote the number of nodes and edges in the input graph, respectively, and \(\text{num_iterations}\) denotes the number of iterations. Running the algorithm on a machine that supports high amounts of parallelism tends to lead to a smaller running time.

The algorithm uses \(O(n)\) memory.

Examples

Example 1

Input graph

Consider this input graph:

A directed weighted graph with four nodes (a-d) and weighted edges.

For the provided directed weighted graph, assume that:

A number over an edge indicates its weight.
The damping factor \(\gamma\) is 0.85.
The number of iterations is 3.
The approximate precision is 0.01.
No source nodes are provided in the input (that is, the algorithm treats all nodes as source nodes).

Solution

PageRank scores that the algorithm obtains after three iterations for each node \(x\) in the graph, denoted \(P_3(x)\):

\(P_3(a) \approx 0.066\), \(P_3(b) \approx 0.407\), \(P_3(c) \approx 0.138\), \(P_3(d) \approx 0.389\)

Note that the exact PageRank scores are:

\(P(a) \approx 0.067\), \(P(b) \approx 0.414\), \(P(c) \approx 0.137\), \(P(d) \approx 0.382\)

Algorithm details

Since no source nodes were provided in the input, the set of source nodes consists of all nodes: \(S = \{a, b, c, d\}\). Initially, the algorithm assigns a score \(P_0(x) = 1 / |S| = 0.25\) to each node.

\(P_0(a) = 0.25\), \(P_0(b) = 0.25\), \(P_0(c) = 0.25\), \(P_0(d) = 0.25\)

The weighted out-degrees of the nodes are:

\(deg^+(a) = 2 + 3= 5\), \(deg^+(b) = 4 + 1 = 5\), \(deg^+(c) = 0\), \(deg^+(d) = 2\)

Since only node \(c\) has a weighted out-degree of 0, \(V_{sink} = \{c\}\).

Recurrence relation for node \(a\):

Incoming edges to node a highlighted for the recurrence relation.

Nodes that have a positive-weight edge pointing to node a: \(N^-(a) = \{\}\)

\[ \begin{align*} P_i(a) &= [ (1 - \gamma) + \gamma * P_{i-1}(c) ] / |S| \\ &= [ 0.15 + 0.85 * P_{i-1}(c) ] / 4 \end{align*} \]

Recurrence relation for node \(b\):

Incoming edges to node b highlighted for the recurrence relation.

Nodes that have a positive-weight edge pointing to node \(b\): \(N^-(b) = \{a, d\}\)

\[ \begin{align*} P_i(b) &= \gamma * [ P_{i-1}(a) w_{ab} / deg^+(a) + P_{i-1}(d) w_{db} / deg^+(d)] + [ (1 - \gamma) + \gamma * P_{i-1}(c) ] / |S| \\ &= 0.85 * [ 0.4 * P_{i-1}(a) + P_{i-1}(d)] + [ 0.15 + 0.85 * P_{i-1}(c) ] / 4 \end{align*} \]

Recurrence relation for node \(c\):

Incoming edges to node c highlighted for the recurrence relation.

Nodes that have a positive-weight edge pointing to node \(c\): \(N^-(c) = \{b\}\)

\[ \begin{align*} P_i(c) &= \gamma * [ P_{i-1}(b) w_{bc} / deg^+(b)] + [ (1 - \gamma) + \gamma * P_{i-1}(c) ] / |S| \\ &= 0.85 * 0.2 * P_{i-1}(b) + [ 0.15 + 0.85 * P_{i-1}(c) ] / 4 \end{align*} \]

Recurrence relation for node \(d\):

Incoming edges to node d highlighted for the recurrence relation.

Nodes that have a positive-weight edge pointing to node \(d\): \(N^-(d) = \{a, b\}\)

\[ \begin{align*} P_i(d) &= \gamma * [ P_{i-1}(a) w_{ad} / deg^+(a) + P_{i-1}(b) w_{bd} / deg^+(b) ] + [ (1 - \gamma) + \gamma * P_{i-1}(c) ] / |S| \\ &= 0.85 * [ 0.6 * P_{i-1}(a) + 0.8 * P_{i-1}(b) ] + [ 0.15 + 0.85 * P_{i-1}(c) ] / 4 \end{align*} \]

Iteration #1:

The algorithm updates the PageRank values using the recurrence relations.

\[ \begin{align*} P_1(a) &= [ 0.15 + 0.85 * P_0(c) ] / 4 \approx 0.091 \\ P_1(b) &= 0.85 * [ 0.4 * P_0(a) + P_0(d)] + [ 0.15 + 0.85 * P_0(c) ] / 4 \approx 0.388 \\ P_1(c) &= 0.85 * 0.2 * P_0(b) + [ 0.15 + 0.85 * P_0(c) ] / 4 \approx 0.133 \\ P_1(d) &= 0.85 * [ 0.6 * P_0(a) + 0.8 * P_0(b) ] + [ 0.15 + 0.85 * P_0(c) ] / 4 \approx 0.388 \end{align*} \]

The L1-distance between the initial PageRank values and the values from iteration #1 is:

\[|P_1(a) - P_0(a)| + |P_1(b) - P_0(b)| + |P_1(c) - P_0(c)| + |P_1(d) - P_0(d)| \approx 0.553\]

This is larger than 0.01 (the approximate precision in the configuration) x 4 (the number of nodes in the graph), so the algorithm proceeds to the next iteration.

Iteration #2:

The algorithm updates the PageRank values using the recurrence relations.

\[ \begin{align*} P_2(a) &= [ 0.15 + 0.85 * P_1(c) ] / 4 \approx 0.066 \\ P_2(b) &= 0.85 * [ 0.4 * P_1(a) + P_1(d)] + [ 0.15 + 0.85 * P_1(c) ] / 4 \approx 0.427 \\ P_2(c) &= 0.85 * 0.2 * P_1(b) + [ 0.15 + 0.85 * P_1(c) ] / 4 \approx 0.132 \\ P_2(d) &= 0.85 * [ 0.6 * P_1(a) + 0.8 * P_1(b) ] + [ 0.15 + 0.85 * P_1(c) ] / 4 \approx 0.376 \\ \end{align*} \]

The L1-distance between the PageRank values from iteration #1 and iteration #2 is:

\[|P_2(a) - P_1(a)| + |P_2(b) - P_1(b)| + |P_2(c) - P_1(c)| + |P_2(d) - P_1(d)| \approx 0.077\]

This is larger than 0.01 (the approximate precision in the configuration) x 4 (the number of nodes in the graph), so the algorithm proceeds to the next iteration.

Iteration #3:

The algorithm updates the PageRank values using the recurrence relations.

\[ \begin{align*} P_3(a) &= [ 0.15 + 0.85 * P_2(c) ] / 4 \approx 0.066 \\ P_3(b) &= 0.85 * [ 0.4 * P_2(a) + P_2(d)] + [ 0.15 + 0.85 * P_2(c) ] / 4 \approx 0.407 \\ P_3(c) &= 0.85 * 0.2 * P_2(b) + [ 0.15 + 0.85 * P_2(c) ] / 4 \approx 0.138 \\ P_3(d) &= 0.85 * [ 0.6 * P_2(a) + 0.8 * P_2(b) ] + [ 0.15 + 0.85 * P_2(c) ] / 4 \approx 0.389 \end{align*} \]

The algorithm stops because the configuration specifies a maximum of 3 iterations.