The PageRank Problem
Hans Vandierendonck
CSC3021 Concurrent Programming, 2018–’19
PageRank is Google’s algorithm to rank the search results that match the queried keywords [1]. The algorithm
models the internet as a directed graph where web pages are represented as vertices and links between pages are
edges. The PageRank algorithm calculates the likelihood, or rank, that a page will be visited by people surfing
the web. The rank of a page depends on the rank of the pages that link to it, where pages that are frequently
pointed to tend to gain a higher rank. Also, pages pointed to by highly ranked pages tend to receive a higher rank.
The algorithm itself is quite simple, yet in practice it captures well how humans judge the importance of pages.
You will create a Java program that solves the PageRank problem in Assignment 1; and you will create a
concurrent version of that code in Assignment 2. This document provides essential background on the PageRank
problem. You are encouraged to initiate your own reading on this subject.
1 Graphs
Let’s talk first about graphs. A graph is a collection of items that are connected. We call the items vertices and
the connections are called edges or links. An edge connects two vertices. We assume edges are directed, i.e., they
point from a source vertex (also called a tail) to a destination vertex (also called a head).
In mathematical terms, a graph is described as a pair G = (V, E) where V is the set of vertices and E is the set
of edges. We describe an edge as an ordered pair of vertices, i.e., E ⊆ V × V. The number of vertices is given by
|V | and the number of edges is given by |E|.
An edge is described as an ordered pair. As such, the pair (u, v) describes an edge pointing from u to v,
assuming u and v are elements of the set V . We say that u is the source of the edge and v is the destination. We
also say that the edge is incident to u and to v.
The number of edges incident to a vertex is called the degree of the vertex. A vertex with degree 5 will thus
appear in the description of 5 edges, either as source or destination. We further distinguish the in-degree and the
out-degree. The in-degree of vertex v is the number of edges pointing to v; it is denoted deg⁻(v). The out-degree
is the number of edges pointing away from v; it is denoted deg⁺(v). There is a relationship between
the different types of degrees: in the absence of self-edges (edges pointing from a vertex to itself), it is always the
case that deg⁻(v) + deg⁺(v) = deg(v).
A graph may be directed or undirected. In a directed graph, each edge is an ordered pair, pointing from a source
to a destination vertex. Our definition of a graph focusses on directed graphs. In an undirected graph, edges have
no direction. They link two vertices without notion of source and destination. We emulate undirected graphs by
assuming that the connection between two vertices u and v is recorded by a directed edge from u to v and another
directed edge pointing from v to u. As such, in our representation, the number of edges in an undirected graph is
always an even number.
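The degree definitions above can be computed directly from a list of edges. The sketch below, in Java (the language of the assignments), counts in- and out-degrees for a small hypothetical graph; the class and method names are purely illustrative, not part of the assignment.

```java
import java.util.Arrays;

public class Degrees {
    // Count in- and out-degrees of a directed graph with n vertices,
    // given its edges as (source, destination) pairs.
    // Returns { inDegree, outDegree }.
    public static int[][] degrees(int n, int[][] edges) {
        int[] in = new int[n];
        int[] out = new int[n];
        for (int[] e : edges) {
            out[e[0]]++; // e[0] is the source (tail)
            in[e[1]]++;  // e[1] is the destination (head)
        }
        return new int[][] { in, out };
    }

    public static void main(String[] args) {
        // Edges 0->1, 0->2, 1->2: no self-edges, so deg-(v) + deg+(v) = deg(v).
        int[][] d = degrees(3, new int[][] { {0, 1}, {0, 2}, {1, 2} });
        System.out.println("in-degrees:  " + Arrays.toString(d[0])); // [0, 1, 2]
        System.out.println("out-degrees: " + Arrays.toString(d[1])); // [2, 1, 0]
    }
}
```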
2 Problem Statement
The PageRank algorithm assumes that a person surfing the web visits a sequence of web pages. The next page in
the sequence is determined either by following one of the outgoing links of the current page, or by restarting the
sequence in a random web page. The PageRank value P R(d) for a page d is recursively defined by the following
equation:
PR(d) = (1 − α)/n + α · Σ_{s ∈ in(d)} PR(s) / deg⁺(s)        (1)
where n is the number of vertices in the web graph, in(d) is the set of web pages pointing to page d, deg⁺(s) is
the number of outgoing links of page s (the out-degree) and α is the damping factor. This equation has two terms.
The first term states that with probability 1 − α the surfer visits a random web page. The second term models the
likelihood of arriving at page d by following a sequence of links. Note that page s has deg⁺(s) outgoing links
and that its PageRank value is equally distributed among all these links. As such, every link is equally likely to be
followed.
Because of its importance, efficient ways to solve this equation have been extensively investigated. We will
describe an iterative method that is easy to implement and is amenable to concurrent execution.
3 The Power Iteration Method
The power iteration method solves the recursive Equation 1 by feeding estimates for PR(s) into the right-hand side
of the equation and calculating new PageRank values by evaluating the formula. The newly calculated values
are then fed into the right-hand side again and the process is repeated until the PageRank values converge, i.e., they
change only marginally in subsequent steps.
In order to describe the power iteration method unambiguously, we introduce an extra parameter t that indicates
the iteration and let PR(s; t) represent the PageRank of page s in iteration t.
Initially, the page ranks are uniformly initialised to the same value. As they represent a probability, we make
sure that they all add up to 1 and we set them to
PR(s; 0) = 1/n        (2)
Given that the PageRank values at iteration t are known, the power method calculates the PageRank values at
iteration t + 1 using the following formula:

PR(d; t + 1) = (1 − α)/n + α · Σ_{s ∈ in(d)} PR(s; t) / deg⁺(s)        (3)
This formula is identical to Equation 1 with the iteration numbers added.
The power method continues for as many iterations as necessary until the L1-norm of the difference between
PageRank values in subsequent iterations is less than a pre-defined error bound ε:

Σ_p |PR(p; t + 1) − PR(p; t)| < ε        (4)

where |x| denotes the absolute value of x. Often, ε = 10⁻⁷ is selected.
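The convergence test of Equation 4 amounts to a single pass over the two PageRank vectors. A minimal Java sketch (the method names are our own, not prescribed):

```java
public class Convergence {
    // L1-norm of the difference between successive PageRank vectors (Equation 4).
    public static double l1Diff(double[] pr, double[] newpr) {
        double l = 0.0;
        for (int v = 0; v < pr.length; ++v)
            l += Math.abs(newpr[v] - pr[v]);
        return l;
    }

    // The iteration stops once the difference drops below the error bound eps.
    public static boolean hasConverged(double[] pr, double[] newpr, double eps) {
        return l1Diff(pr, newpr) < eps;
    }
}
```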
Algorithm 1 shows how the PageRank problem can be solved using the power iteration method. It uses two
arrays to store PageRank values: pr stores the current estimates of the PageRank values, i.e., PR(v; t), while newpr
stores the new estimates which are being made in the current iteration of the power method, i.e., PR(v; t + 1).
Each iteration of the power method evaluates Equation 3.
First, the initial pagerank values are set up in line 1, in accordance with Equation 2. The while loop starting at
line 3 loops over the iterations of the Power Iteration Method. It checks if the computation has converged. Initially,
it has not. The algorithm then proceeds to initialise the new PageRank estimates in newpr to zero. It then moves
on to evaluate the main part of Equation 3. Lines 5 to 9 are written in strict correspondence with the right-hand
term in Equation 3 (the constants (1 − α)/n are not explicitly added in). That means: for each vertex d, we sum up the
Algorithm 1: Solving the PageRank Problem using the Power Iteration Method.
Data: G = (V, E), a graph; damping factor α; accepted error ε
Result: pr: the PageRank values
1  for v ∈ V do pr[v] = 1/n;
2  converged = false;
3  while not converged do
4      for v ∈ V do newpr[v] = 0;
5      for d ∈ V do
6          for s ∈ in(d) do
7              newpr[d] += α · pr[s] / deg⁺(s);
8          end
9      end
       // Normalise vector to sum to 1
10     l = 0;
11     for v ∈ V do l += newpr[v];
12     for v ∈ V do newpr[v] = newpr[v] / l;
       // Check convergence
13     l = 0;
14     for v ∈ V do l += abs(pr[v] − newpr[v]);
15     converged = (l < ε);
       // Swap to new solution
16     for v ∈ V do pr[v] = newpr[v];
17 end
Algorithm 2: Alternative formulation for lines 5-9 of Algorithm 1
1  for e ∈ E do
2      s = source(e);
3      d = destination(e);
4      newpr[d] += α · pr[s] / deg⁺(s);
5  end
PageRank contributions of all its incoming edges. The set of the source vertices of the incoming edges is in(d).
For each vertex s in this set, we look up the current PageRank estimate in pr, divide by the out-degree of s and
multiply by the damping factor α.
So far, we have performed the main part of evaluating Equation 3, but we have not added in the constants (1 − α)/n.
We could add a loop that adds this constant to each entry of newpr; however, that is actually unnecessary. A key
problem in any algorithm working with real numbers is that the computer cannot always perform the arithmetic
exactly (see the appendix on real number representation). As such, we need to compensate for errors creeping into
the computation. The observable error is that the PageRank values do not add up to 1. We can compensate for
this by scaling the PageRank values such that the sum does add up to 1. Hereto, we calculate the sum of the
PageRank values (l in line 11) and we then multiply the PageRank values by 1/l so that they sum up to 1 in
line 12. As a side effect of this normalisation step, we don't need to add in the constants (1 − α)/n explicitly.
The power iteration then finishes off by checking convergence (line 15) and copying the newly computed estimates
of the PageRank values from newpr to pr (line 16).
The key part of Algorithm 1 is lines 5-9, and this is where we will focus. There are many ways in which
these lines can be encoded. Algorithm 2 shows that, instead of using two loops over destinations d and sources s,
equivalent code can use a single loop over all edges, extracting the source and destination from each edge.
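Putting the pieces together, the following Java sketch implements Algorithm 1 with the edge-loop formulation of Algorithm 2 over a plain edge list. It is a minimal illustration under stated assumptions, not the required solution: out-degrees are pre-computed, the normalisation step stands in for adding the constants (1 − α)/n as described above, and vertices without outgoing edges simply lose their rank mass (no special treatment).

```java
import java.util.Arrays;

public class PowerIteration {
    // edges[k] = { source, destination }; alpha is the damping factor,
    // eps the accepted error bound on the L1-norm (Equation 4).
    public static double[] pageRank(int n, int[][] edges, double alpha, double eps) {
        int[] outDeg = new int[n];                 // pre-compute out-degrees
        for (int[] e : edges) outDeg[e[0]]++;

        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);                  // Equation 2: uniform initialisation
        double[] newpr = new double[n];

        boolean converged = false;
        while (!converged) {
            Arrays.fill(newpr, 0.0);
            for (int[] e : edges)                  // Algorithm 2: one loop over edges
                newpr[e[1]] += alpha * pr[e[0]] / outDeg[e[0]];
            double l = 0.0;                        // normalise so the values sum to 1
            for (double x : newpr) l += x;
            for (int v = 0; v < n; ++v) newpr[v] /= l;
            double diff = 0.0;                     // convergence check (Equation 4)
            for (int v = 0; v < n; ++v) diff += Math.abs(pr[v] - newpr[v]);
            converged = diff < eps;
            double[] tmp = pr; pr = newpr; newpr = tmp;  // swap to new solution
        }
        return pr;
    }
}
```

On a symmetric cycle such as 0→1→2→0, every vertex ends up with the same rank, which is a quick sanity check for any implementation.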
4 Representing Graphs as Sparse Matrices
The discussion so far has considered a description of the algorithm using pseudo-code. To make the algorithm
concrete, we need to choose what data structures we will use to represent the graph. When choosing a data structure,
we need to consider what operations are performed on the graph. We can discern the following:
• It must be possible to iterate over all vertices, e.g., Algorithm 1, line 1.
• It must be possible to iterate over all edges, e.g., Algorithm 1, line 5, or Algorithm 2.
• It must be possible to retrieve the out-degree of a vertex.
The first constraint can be resolved easily if we pre-process the graph and conveniently represent the vertices
as integers in a dense range, e.g., we assign to each vertex an integer in the range 0, …, n − 1 for a graph with n
vertices.
The second constraint is more critical and requires thought. As we will be dealing with very large graphs, it is
best to utilise representations that use a few very large arrays. We will focus on fairly easy representations; much
more complex representations are described in the literature [2].
Depending on the data structure we choose for the graph, it may be possible to retrieve the out-degree of a vertex
directly from the graph data structure, which would satisfy the third constraint. Alternatively, we can calculate the
out-degree of each vertex once at the start of the program and store these values in an array. Then we can look them
up efficiently.
4.1 Adjacency Matrix
The adjacency matrix of a graph G = (V, E) is a square binary matrix with dimension |V |. Assume that the number
of vertices is represented by n (n = |V |) and the number of edges by m (m = |E|). Assume further that vertices
are labelled with integer numbers in the range 0, …, n − 1, i.e., V = {0, 1, 2, …, n − 1}. The element a_ij on row i
and column j in the adjacency matrix is determined as follows:¹
1. a_ij = 1 if an edge exists from vertex j to vertex i;
2. a_ij = 0 if the graph does not contain an edge from vertex j to vertex i.
Note that the adjacency matrix is typically a sparse matrix: most of the entries are 0. Sparse matrices may be
represented as 2-dimensional arrays; however, this would take a lot of space, namely n² elements. Because of the
large number of zeroes in the matrix, there exist more efficient alternatives that typically take space proportional
to the number of non-zeroes, i.e., proportional to the number of edges, which is typically much smaller than n².
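As a concrete illustration of this definition (including the transposition noted in the footnote), the sketch below builds the dense n × n matrix for a tiny hypothetical graph. For real web graphs this dense representation is of course impractical; it is shown only to make the a_ij convention tangible.

```java
public class Adjacency {
    // Build the (transposed) adjacency matrix: a[i][j] == 1 if and only if
    // the graph contains an edge from vertex j to vertex i.
    public static int[][] build(int n, int[][] edges) {
        int[][] a = new int[n][n];
        for (int[] e : edges)
            a[e[1]][e[0]] = 1;   // row = destination i, column = source j
        return a;
    }

    public static void main(String[] args) {
        int[][] a = build(3, new int[][] { {0, 1}, {2, 1} });
        System.out.println(a[1][0]); // 1: edge 0 -> 1 exists
        System.out.println(a[0][1]); // 0: no edge 1 -> 0
    }
}
```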
4.2 Coordinate Format, a.k.a. Edge Lists
A straightforward representation of a sparse matrix is the Coordinate Format, abbreviated to COO: simply list all
edges as pairs of source and destination. As such, an array of length m is required which holds the pairs (i, j) ∈ E.
Alternatively, one may create two arrays of length m, where one array stores the source vertices and the
other stores the destinations. Sources and destinations are stored in corresponding locations: if the source
of an edge is stored at index k in the source array, then its destination is stored at index k in the destination array.²
Note that the COO format does not specify in which order the edges are stored. That makes it inefficient to
answer certain queries, such as retrieving all the edges that are incident to a particular vertex. This is, however,
irrelevant to the PageRank problem (as Algorithm 2 may be used).
¹The definition given is actually the transpose of the adjacency matrix, which is obtained by swapping the position of the indices i and
j: the element at row i and column j in the transpose of a matrix A equals the element at row j and column i in A. It is customary in graph
analytics to operate on the transpose of the adjacency matrix.
²Terminology: the distinction between these two data layouts is known as the distinction between an "array of structures" layout vs a
"structure of arrays" layout.
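A COO graph in the "structure of arrays" layout can be as simple as the following Java sketch; the class and field names are illustrative, not prescribed by the assignment.

```java
public class GraphCOO {
    // "Structure of arrays" layout: source[k] and destination[k] together
    // describe edge k, for k = 0 .. numEdges()-1. Edge order is unspecified.
    public final int numVertices;
    public final int[] source;
    public final int[] destination;

    public GraphCOO(int numVertices, int[] source, int[] destination) {
        assert source.length == destination.length;
        this.numVertices = numVertices;
        this.source = source;
        this.destination = destination;
    }

    public int numEdges() { return source.length; }
}
```

Iterating over all edges, as Algorithm 2 requires, is then a single loop from 0 to numEdges().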
[Figure omitted: a small directed graph over vertices 0-5, together with its arrays in both formats:
CSR format:  index = [0, 5, 5, 6, 8, 9, 14],  destination = [1, 2, 3, 4, 5, 4, 4, 5, 5, 0, 1, 2, 3, 4]
CSC format:  index = [0, 1, 3, 5, 7, 11, 14],  source = [5, 0, 5, 0, 5, 0, 5, 0, 2, 3, 5, 0, 3, 4]]
Figure 1: A small graph with 6 vertices and 14 edges; and its representation in the CSR and CSC formats
4.3 Compressed Sparse Rows
The Compressed Sparse Rows format, abbreviated to CSR, provides an indexed representation to the graph. In
particular, it individually compresses each row of the adjacency matrix and stores only the destination vertices, i.e.,
it stores the values j where aij = 1. This is a compressed representation of the dense matrix, which would store all
values aij for j = 0, · · · , n ? 1 on every row.
The CSR representation requires two arrays: a destination array containing the destination vertex IDs for each
edge, and an index array containing for every source vertex the index of its first destination ID in the destination
array. The destination array has length m. The index array has length n + 1, where an extra element is inserted to
simplify the traversal of the data structure.
Figure 1 shows the CSR format for an exemplar graph.
The CSR representation allows one to quickly recover all outgoing edges of a vertex with index k: (i) look up
the numbers index[k] and index[k+1] in the index array; (ii) the destinations of the outgoing edges for vertex
k are found in the destination array at positions j = index[k] up to but not including j = index[k+1].
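This two-step lookup can be captured in a small Java sketch. The arrays below are the CSR arrays of the graph in Figure 1; the class shape is our own illustration.

```java
public class GraphCSR {
    public final int[] index;       // length n+1
    public final int[] destination; // length m

    public GraphCSR(int[] index, int[] destination) {
        this.index = index;
        this.destination = destination;
    }

    // The out-degree of vertex k falls out of the index array for free.
    public int outDegree(int k) { return index[k + 1] - index[k]; }

    // Print all outgoing edges of vertex k.
    public void printOutEdges(int k) {
        for (int j = index[k]; j < index[k + 1]; ++j)
            System.out.println(k + " -> " + destination[j]);
    }

    public static void main(String[] args) {
        // CSR arrays of the graph in Figure 1.
        GraphCSR g = new GraphCSR(
            new int[] { 0, 5, 5, 6, 8, 9, 14 },
            new int[] { 1, 2, 3, 4, 5, 4, 4, 5, 5, 0, 1, 2, 3, 4 });
        g.printOutEdges(3);                 // prints "3 -> 4" and "3 -> 5"
        System.out.println(g.outDegree(1)); // 0: vertex 1 has no outgoing edges
    }
}
```

Note that outDegree directly satisfies the third constraint from the start of this section, with no pre-computed degree array needed.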
4.4 Compressed Sparse Columns
The Compressed Sparse Columns format, abbreviated to CSC, is an indexed representation to the graph, similar to
CSR. The main difference is that it lists the incoming edges as opposed to the outgoing edges. As such, it uses an
index array of length n + 1 and a source array of length m. The process for traversing the graph is analogous to
that for CSR. Figure 1 graphically illustrates the CSC representation.
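CSC is a natural fit for lines 5-9 of Algorithm 1, since in(d) is exactly one contiguous slice of the source array. A minimal Java sketch, assuming the out-degree array has been pre-computed separately (the method name is our own):

```java
public class CscAccumulate {
    // Lines 5-9 of Algorithm 1 over a CSC graph: for each destination d,
    // its incoming sources sit at source[index[d]] .. source[index[d+1]-1].
    public static void accumulate(int[] index, int[] source, int[] outDeg,
                                  double alpha, double[] pr, double[] newpr) {
        int n = index.length - 1;
        for (int d = 0; d < n; ++d)
            for (int j = index[d]; j < index[d + 1]; ++j) {
                int s = source[j];
                newpr[d] += alpha * pr[s] / outDeg[s];
            }
    }
}
```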
5 Reference Solution
To help you with debugging your code, Table 1 lists the PageRank values as they are calculated by the reference
solution to this exercise. The PageRank values appear to converge from iteration 18 onwards. Note, however, that we
show only 6 significant digits while we aim to calculate 7 converged digits (ε = 10⁻⁷).
References
[1] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web.
Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120.
Available online: http://ilpubs.stanford.edu:8090/422/.
[2] Y. Saad. SPARSKIT: A basic tool for sparse matrix computations. Technical Report NASA-CR-185876,
NASA, May 1990. Available online: https://ntrs.nasa.gov/search.jsp?R=19910023551.
Table 1: PageRank values for the 6-node graph in Figure 1
t     delta        PR(0;t)     PR(1;t)     PR(2;t)     PR(3;t)     PR(4;t)     PR(5;t)
init 0.166666 0.166666 0.166666 0.166666 0.166666 0.166666
1 0.547778 0.0769444 0.105278 0.105278 0.105278 0.317778 0.289444
2 0.181160 0.0891199 0.102200 0.102200 0.102200 0.236429 0.367849
3 0.137640 0.102013 0.117163 0.117163 0.117163 0.247469 0.299029
4 0.0634867 0.0924331 0.109775 0.109775 0.109775 0.259158 0.319083
5 0.0173711 0.0947956 0.110509 0.110509 0.110509 0.250473 0.323204
6 0.0131304 0.0956002 0.111715 0.111715 0.111715 0.252615 0.316639
7 0.00674097 0.0946550 0.110907 0.110907 0.110907 0.253344 0.319280
8 0.00171397 0.0949894 0.111081 0.111081 0.111081 0.252487 0.319281
9 0.00114623 0.0950142 0.111162 0.111162 0.111162 0.252790 0.318708
10 6.61535E-4 0.0949284 0.111081 0.111081 0.111081 0.252813 0.319016
11 2.39131E-4 0.0949692 0.111107 0.111107 0.111107 0.252735 0.318975
12 9.54587E-5 0.0949658 0.111111 0.111111 0.111111 0.252772 0.318930
13 6.58410E-5 0.0949588 0.111103 0.111103 0.111103 0.252769 0.318963
14 2.89733E-5 0.0949633 0.111106 0.111106 0.111106 0.252762 0.318955
15 8.19374E-6 0.0949624 0.111106 0.111106 0.111106 0.252767 0.318952
16 6.49790E-6 0.0949619 0.111106 0.111106 0.111106 0.252766 0.318956
17 3.18692E-6 0.0949624 0.111106 0.111106 0.111106 0.252765 0.318954
18 8.35832E-7 0.0949622 0.111106 0.111106 0.111106 0.252766 0.318954
19 5.90395E-7 0.0949622 0.111106 0.111106 0.111106 0.252766 0.318955
20 3.23357E-7 0.0949623 0.111106 0.111106 0.111106 0.252766 0.318954
21 1.01826E-7 0.0949623 0.111106 0.111106 0.111106 0.252766 0.318955
22 4.92322E-8 0.0949623 0.111106 0.111106 0.111106 0.252766 0.318954
A Real Number Representation
Real numbers are typically represented approximately in a computer using floating-point number representations.
These number representations have a finite number of bits, while real numbers may have an infinite number of digits.
For instance, the number 1/3 has an infinitely long representation 0.33333… As such, numbers are represented
with limited accuracy.
The IEEE-754 standard for floating-point number representation specifies that a float value consists of 32
bits: 1 sign bit to indicate whether the number is positive or negative, 8 exponent bits and 23 mantissa bits.
Similarly, a double consists of 64 bits, with 1 sign bit, 11 exponent bits and 52 mantissa bits. Let s be the sign
bit, e the exponent bits and m the mantissa bits of a floating-point number; then the real number that is represented
is (−1)^s × (1.m) × 2^(e−b), where b = 127 for float and b = 1023 for double. The sign of the number is
retrieved as (−1)⁰ = 1 and (−1)¹ = −1. The constant b ensures that both very large and very small numbers
can be represented. It is furthermore assumed that the exponent is chosen such that it indicates the leading (most
significant) 1-bit of the real number. As every real number that is not zero has such a bit, we don't need to store it.
The zero value is represented by the special case of s = 0, e = 0 and m = 0.
Any real number can be represented with a relative precision of ε = 2⁻ᴹ, where M mantissa bits are used.
The difference between any real number r and its floating-point representation r̂ is thus bounded:
|(r − r̂)/r̂| < 2⁻ᴹ. That is good news: a float can represent a real number with a relative error of at most
2⁻²³ ≈ 1.19 × 10⁻⁷ and a double introduces a relative error of at most 2⁻⁵³ ≈ 1.11 × 10⁻¹⁶.
Because numbers are not always represented exactly, every arithmetic operation using floating-point number
representations may introduce error. The problem results from the representation of the numbers. Assume we have
two real numbers a and b that can be represented exactly using a floating-point representation as â = a and b̂ = b. If
we desire to calculate a + b, then the computer will evaluate â + b̂, i.e., it will perform the addition on the floating-point
representations, calculate an exact result for â + b̂ and then finally convert this exact result to a floating-point
number representation. The final conversion from the exact result to a floating-point number is significant: the
addition may generate more non-zero mantissa bits, e.g., because of a carry taking place. However, only a limited
number of mantissa bits can be stored, so some mantissa bits need to be dropped. This causes an inaccuracy of
floating-point addition, even if the added values can be represented exactly. The error can be bounded by the same
value ε as for the representation of numbers.
A similar observation applies to floating-point subtraction, multiplication and division. In each case, the error
bound is ε.
If a long chain of computations is performed, then the error will propagate and all individual errors may add
up. As such, when n PageRank values are added up in line 11 of Algorithm 1, the error on the value of l
may be as large as n × 2⁻⁵³. Assuming a double representation and a graph with a billion vertices, which is less
than the current size of the World Wide Web, the total error may be as large as 10⁹ × 2⁻⁵³ ≈ 10⁹ × 1.11 × 10⁻¹⁶ ≈
1.11 × 10⁻⁷ > 10⁻⁷. In other words, the error due to floating-point arithmetic may be larger than the desired accuracy
of the computation (typically 10⁻⁷), which means the algorithm would never converge.