
The PageRank Problem

Hans Vandierendonck

CSC3021 Concurrent Programming, 2018–’19

PageRank is Google’s algorithm to rank the search results that match the queried keywords [1]. The algorithm

models the internet as a directed graph where web pages are represented as vertices and links between pages are

edges. The PageRank algorithm calculates the likelihood, or rank, that a page will be visited by people surfing

the web. The rank of a page depends on the rank of the pages that link to it, where pages that are frequently

pointed to tend to gain a higher rank. Also, pages pointed to by highly ranked pages tend to receive a higher rank.

The algorithm is itself quite simple but in practice it captures well the appreciation of the importance of pages by

humans.

You will create a Java program that solves the PageRank problem in Assignment 1; and you will create a

concurrent version of that code in Assignment 2. This document provides essential background on the PageRank

problem. You are encouraged to initiate your own reading on this subject.

1 Graphs

Let’s talk first about graphs. A graph is a collection of items that are connected. We call the items vertices and

the connections are called edges or links. An edge connects two vertices. We assume edges are directed, i.e., they

point from a source vertex (also called a tail) to a destination vertex (also called a head).

In mathematical terms, a graph is described as a pair G = (V, E) where V is the set of vertices and E is the set

of edges. We describe an edge as an ordered pair of vertices, i.e., E ⊆ V × V. The number of vertices is given by

|V | and the number of edges is given by |E|.

An edge is described as an ordered pair. As such, the pair (u, v) describes an edge pointing from u to v,

assuming u and v are elements of the set V . We say that u is the source of the edge and v is the destination. We

also say that the edge is incident to u and to v.

The number of edges incident to a vertex is called the degree of the vertex. A vertex with degree 5 will thus appear in the description of 5 edges, either as source or destination. We further distinguish the in-degree and the out-degree. The in-degree of vertex v is the number of edges pointing to v. It is indicated by deg−(v). The out-degree is the number of edges pointing away from v. It is indicated by deg+(v). There is a relationship between the different types of degrees. In the absence of self-edges (edges pointing from a vertex to itself), it is always the case that deg−(v) + deg+(v) = deg(v).

A graph may be directed or undirected. In a directed graph, each edge is an ordered pair, pointing from a source

to a destination vertex. Our definition of a graph focusses on directed graphs. In an undirected graph, edges have

no direction. They link two vertices without notion of source and destination. We emulate undirected graphs by

assuming that the connection between two vertices u and v is recorded by a directed edge from u to v and another

directed edge pointing from v to u. As such, in our representation, the number of edges in an undirected graph is

always an even number.
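As a tiny illustration (the class and variable names here are our own, not part of the assignment), a directed graph can be held in Java as a list of ordered vertex pairs; one undirected connection between vertices 0 and 1 then appears as the two directed edges (0, 1) and (1, 0):

```java
import java.util.ArrayList;
import java.util.List;

public class TinyGraph {
    public static void main(String[] args) {
        // Each int[] {u, v} is a directed edge from u to v.
        List<int[]> edges = new ArrayList<>();
        // Emulate one undirected connection between 0 and 1
        // with two directed edges, as described above.
        edges.add(new int[]{0, 1});
        edges.add(new int[]{1, 0});
        // A purely directed edge from 1 to 2.
        edges.add(new int[]{1, 2});
        System.out.println("|E| = " + edges.size()); // 3 directed edges
    }
}
```

Counting both directions separately is exactly why the edge count of an emulated undirected graph is always even.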

2 Problem Statement

The PageRank algorithm assumes that a person surfing the web visits a sequence of web pages. The next page in

the sequence is determined either by following one of the outgoing links of the current page, or by restarting the


sequence in a random web page. The PageRank value PR(d) for a page d is recursively defined by the following equation:

    PR(d) = (1 − α)/n + α · Σ_{s ∈ in(d)} PR(s)/deg+(s)        (1)

where n is the number of vertices in the web graph, in(d) is the set of web pages pointing to the page d, deg+(s) is

the number of outgoing links of page s (the out-degree) and α is the damping factor. This equation has two terms.

The first term states that with probability 1 − α the surfer visits a random web page. The second term models the

likelihood of arriving at page d by following a sequence of links. Note that page s has deg+(s) outgoing links

and that its PageRank value is equally distributed among all these links. As such, every link is equally likely to be

followed.

Because of its importance, efficient ways to solve this equation have been extensively investigated. We will

describe an iterative method that is easy to implement and is amenable to concurrent execution.

3 The Power Iteration Method

The power iteration method solves the recursive Equation 1 by feeding estimates for PR(s) into the right-hand side of the equation and calculating new PageRank values by evaluating the formula. The newly calculated values are then fed into the right-hand side again, and the process is repeated until the PageRank values converge, i.e., they change only marginally in subsequent steps.

In order to describe the power iteration method unambiguously, we introduce an extra parameter t that indicates

the iteration and let PR(s; t) represent the PageRank of page s in iteration t.

Initially, the page ranks are uniformly initialised to the same value. As they represent a probability, we make

sure that they all add up to 1 and we set them to

    PR(s; 0) = 1/n        (2)

Given that the PageRank values at iteration t are known, the power method calculates the PageRank values at

iteration t + 1 using the following formula:

    PR(d; t+1) = (1 − α)/n + α · Σ_{s ∈ in(d)} PR(s; t)/deg+(s)        (3)

This formula is identical to Equation 1 with the iteration numbers added.

The power method continues for as many iterations as necessary until the L1-norm of the difference between

PageRank values in subsequent iterations is less than a pre-defined error bound ε:

    Σ_p |PR(p; t+1) − PR(p; t)| < ε        (4)

where |x| produces the absolute value of x. Often, ε = 10^-7 is selected.

Algorithm 1 shows how the PageRank problem can be solved using the power iteration method. It uses two arrays to store PageRank values: pr stores the current estimates of the PageRank values, i.e., PR(v; t). newpr stores the new estimates which are being made in the current iteration of the power method, i.e., PR(v; t+1). Each iteration of the power method evaluates Equation 3.

First, the initial PageRank values are set up in line 1, in accordance with Equation 2. The while loop starting at line 3 loops over the iterations of the power iteration method. It checks if the computation has converged. Initially, it has not. The algorithm then proceeds to initialise the new PageRank estimates in newpr to zero. It then moves on to evaluate the main part of Equation 3. Lines 5 to 9 are written in strict correspondence with the right-hand term in Equation 3 (the constants (1 − α)/n are not explicitly added up). That means: for each vertex d, we sum up the


Algorithm 1: Solving the PageRank Problem using the Power Iteration Method.

Data: G = (V, E), a graph. Damping factor α, accepted error ε
Result: pr: the PageRank values
 1  for v ∈ V do pr[v] = 1/n ;
 2  converged = false;
 3  while not converged do
 4      for v ∈ V do newpr[v] = 0 ;
 5      for d ∈ V do
 6          for s ∈ in(d) do
 7              newpr[d] += α · pr[s] / deg+(s);
 8          end
 9      end
        // Normalise vector to sum to 1
10      l = 0;
11      for v ∈ V do l += newpr[v] ;
12      for v ∈ V do newpr[v] = newpr[v] / l ;
        // Check convergence
13      l = 0;
14      for v ∈ V do l += abs(pr[v] − newpr[v]) ;
15      converged = (l < ε);
        // Swap to new solution
16      for v ∈ V do pr[v] = newpr[v] ;
17  end

Algorithm 2: Alternative formulation for lines 5-9 of Algorithm 1

 1  for e ∈ E do
 2      s = source(e);
 3      d = destination(e);
 4      newpr[d] += α · pr[s] / deg+(s);
 5  end

PageRank contributions of all its incoming edges. The set of the source vertices of the incoming edges is in(d).

For each vertex s in this set, we look up the current PageRank estimate in pr, divide by the out-degree of s and

multiply with the damping factor α.

So far, we have performed the main part of evaluating Equation 3, but we have not added in the constants (1 − α)/n. We could add a loop that adds this constant to each entry of newpr; however, that is actually unnecessary. A key problem in any algorithm working with real numbers is that the computer cannot always perform the arithmetic exactly (see the appendix on real number representation). As such, we need to compensate for errors creeping into the computation. The observable error is that the PageRank values do not add up to 1. We can compensate for this by scaling the PageRank values such that the sum does add up to 1. Hereto, we calculate the sum of the PageRank values (l in line 11) and we then divide the PageRank values by l such that they sum up to 1 in line 12. As a side effect of this normalisation step, we don't need to add in the constants (1 − α)/n explicitly.

The power iteration then finishes off by checking convergence (line 15) and copying the newly computed estimates of the PageRank values from newpr to pr (line 16).

The key part of Algorithm 1 is lines 5-9, and this is where we will focus. There are many ways in which these lines can be encoded. Algorithm 2 shows that, instead of using two loops over destinations d and sources s, an equivalent code is to use one loop over all edges, and to extract source and destination from the edge.
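The steps above can be sketched in Java, combining Algorithm 1 with the edge loop of Algorithm 2. This is a sketch under assumptions of our own (the src/dst array layout and the iteration cap are not part of the pseudocode); the reference solution may differ in details, for instance in how pages without outgoing links interact with the normalisation step:

```java
public class PageRank {
    // A sketch of Algorithm 1, with lines 5-9 replaced by the
    // edge loop of Algorithm 2. Edge k points from src[k] to dst[k].
    static double[] pageRank(int n, int[] src, int[] dst,
                             double alpha, double eps) {
        int m = src.length;
        int[] outDeg = new int[n];              // deg+(v) for each vertex
        for (int k = 0; k < m; k++) outDeg[src[k]]++;

        double[] pr = new double[n];
        double[] newpr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n);     // Equation 2

        boolean converged = false;
        // Iteration cap: a safety net of our own, not in the pseudocode.
        for (int iter = 0; iter < 10000 && !converged; iter++) {
            java.util.Arrays.fill(newpr, 0.0);
            for (int k = 0; k < m; k++)         // Equation 3, edge by edge
                newpr[dst[k]] += alpha * pr[src[k]] / outDeg[src[k]];
            // Normalise to sum 1 (absorbs the (1 - alpha)/n constants).
            double sum = 0;
            for (int v = 0; v < n; v++) sum += newpr[v];
            for (int v = 0; v < n; v++) newpr[v] /= sum;
            // Convergence check on the L1 norm (Equation 4).
            double l = 0;
            for (int v = 0; v < n; v++) l += Math.abs(pr[v] - newpr[v]);
            converged = l < eps;
            System.arraycopy(newpr, 0, pr, 0, n);
        }
        return pr;
    }
}
```

On a symmetric graph such as a 3-cycle (0 → 1 → 2 → 0) this converges immediately to the uniform vector, which makes it easy to sanity-check.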


4 Representing Graphs as Sparse Matrices

The discussion so far has considered a description of the algorithm using pseudo-code. To make the algorithm

concrete, we need to choose what data structures we will use to represent the graph. When choosing a data structure,

we need to consider what operations are performed on the graph. We can discern the following:

  • It must be possible to iterate over all vertices, e.g., Algorithm 1, line 1.
  • It must be possible to iterate over all edges, e.g., Algorithm 1, line 5 or Algorithm 2.
  • It must be possible to retrieve the out-degree of a vertex.

The first constraint can be resolved easily if we pre-process the graph and conveniently represent the vertices

as integers in a dense range, e.g., we assign to each vertex an integer in the range 0, …, n − 1 for a graph with n

vertices.

The second constraint is more critical and requires thought. As we will be dealing with very large graphs, it is

best to utilise representations that use a few very large arrays. We will focus on fairly easy representations; much more complex representations are described in the literature [2].

Depending on the data structure we choose for the graph, it may be possible to retrieve the out-degree of a vertex

directly from the graph data structure, which would satisfy the third constraint. Alternatively, we can calculate the

out-degree of each vertex once at the start of the program and store these values in an array. Then we can look them

up efficiently.
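The pre-computed option is a single pass over the edge list; a minimal sketch (the array names src and deg are our own):

```java
public class OutDegrees {
    // Count deg+(v) for every vertex from an edge list,
    // where src[k] is the source of edge k.
    static int[] outDegrees(int n, int[] src) {
        int[] deg = new int[n];
        for (int s : src) deg[s]++; // each edge adds 1 to its source's out-degree
        return deg;
    }

    public static void main(String[] args) {
        // Edges of a small graph: 0 -> 1, 0 -> 2, 1 -> 2.
        int[] deg = outDegrees(3, new int[]{0, 0, 1});
        System.out.println(java.util.Arrays.toString(deg)); // [2, 1, 0]
    }
}
```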

4.1 Adjacency Matrix

The adjacency matrix of a graph G = (V, E) is a square binary matrix with dimension |V|. Assume that the number of vertices is represented by n (n = |V|) and the number of edges by m (m = |E|). Assume further that vertices are labelled with integer numbers in the range 0, …, n − 1, i.e., V = {0, 1, 2, …, n − 1}. The element aij on row i and column j in the adjacency matrix is determined as follows:1

  1. aij = 1 if an edge exists from vertex j to vertex i

  2. aij = 0 if the graph does not contain an edge from vertex j to vertex i

Note that the adjacency matrix is typically a sparse matrix: most of the entries are 0. Sparse matrices may be represented as 2-dimensional arrays; however, this would take a lot of space, namely n^2 elements. Because of the large number of zeroes in the matrix, there exist more efficient alternatives that typically take space proportional to the number of non-zeroes, i.e., proportional to the number of edges, which is typically much smaller than n^2.

4.2 Coordinate Format, a.k.a. Edge Lists

A straightforward representation of a sparse matrix is the Coordinate Format, abbreviated to COO: simply list all

edges as pairs of source and destination. As such, an array is required of length m which holds the pairs (i, j) ∈ E.

Alternatively, one may also create two arrays of length m where one array stores the source vertices and the

other stores the destinations of an edge. Sources and destinations are stored in corresponding locations: if the source

of an edge is stored at index k in the source array, then its destination is stored also at index k in the destinations.2

Note that the COO format does not specify in which order the edges are stored. That makes it inefficient to

answer certain queries, such as retrieving all the edges that are incident to a particular vertex. This is, however,

irrelevant to the PageRank problem (as Algorithm 2 may be used).
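The two layouts from the preceding paragraphs can be sketched as follows (the concrete edges here are made up for illustration):

```java
public class CooLayouts {
    public static void main(String[] args) {
        // "Array of structures": one array of (source, destination) pairs.
        int[][] pairs = {{0, 1}, {1, 2}, {2, 0}};
        // "Structure of arrays": two parallel arrays; edge k is
        // (src[k], dst[k]) -- corresponding indices describe one edge.
        int[] src = {0, 1, 2};
        int[] dst = {1, 2, 0};
        // Both layouts describe the same edge, e.g. for k = 1:
        System.out.println(pairs[1][0] == src[1] && pairs[1][1] == dst[1]); // true
    }
}
```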

1 The definition given is actually the transpose of the adjacency matrix, which is obtained by swapping the position of the indices i and j: the element at row i and column j in the transpose of a matrix A equals the element at row j and column i in A. It is customary in graph analytics to operate on the transpose of the adjacency matrix.

2 Terminology: the distinction between these two data layouts is known as the distinction between an "array of structures" layout vs a "structure of arrays" layout.


[Figure 1 drawing omitted; its CSR and CSC arrays are:]

    CSR format: index       = 0 5 5 6 8 9 14
                destination = 1 2 3 4 5 4 4 5 5 0 1 2 3 4

    CSC format: index  = 0 1 3 5 7 11 14
                source = 5 0 5 0 5 0 5 0 2 3 5 0 3 4

Figure 1: A small graph with 6 vertices and 14 edges; and its representation in the CSR and CSC formats

4.3 Compressed Sparse Rows

The Compressed Sparse Rows format, abbreviated to CSR, provides an indexed representation to the graph. In

particular, it individually compresses each row of the adjacency matrix and stores only the destination vertices, i.e.,

it stores the values j where aij = 1. This is a compressed representation of the dense matrix, which would store all values aij for j = 0, …, n − 1 on every row.

The CSR representation requires two arrays: a destination array containing the destination vertex IDs for each

edge, and an index array containing for every source vertex the index of its first destination ID in the destination

array. The destination array has length m. The index array has length n + 1, where an extra element is inserted to

simplify the traversal of the data structure.

Figure 1 shows the CSR format for an exemplar graph.

The CSR representation allows one to quickly recover all outgoing edges of a vertex with index k: (i) look up the numbers index[k] and index[k+1] in the index array; (ii) the destinations of the outgoing edges for vertex k are found in the destinations array at positions j = index[k] up to but not including j = index[k+1].
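Using the index and destination arrays of Figure 1, this traversal can be sketched in Java (class and variable names are our own). Note that the out-degree required by PageRank also falls out for free: deg+(k) = index[k+1] − index[k]:

```java
public class CsrTraversal {
    // CSR arrays of the 6-vertex graph in Figure 1.
    static int[] index = {0, 5, 5, 6, 8, 9, 14};
    static int[] destination = {1, 2, 3, 4, 5, 4, 4, 5, 5, 0, 1, 2, 3, 4};

    public static void main(String[] args) {
        int k = 3; // enumerate the outgoing edges of vertex 3
        // Out-degree directly from the index array.
        System.out.println("deg+(" + k + ") = " + (index[k + 1] - index[k]));
        for (int j = index[k]; j < index[k + 1]; j++)
            System.out.println(k + " -> " + destination[j]); // 3 -> 4, 3 -> 5
    }
}
```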

4.4 Compressed Sparse Columns

The Compressed Sparse Columns format, abbreviated to CSC, is an indexed representation to the graph, similar to

CSR. The main difference is that it lists the incoming edges as opposed to the outgoing edges. As such, it uses an

index array of length n + 1 and a source array of length m. The process for traversing the graph is analogous to

the CSR. Figure 1 graphically illustrates the CSC representation.
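A sketch of the analogous CSC traversal, again with the arrays of Figure 1 (the helper name inNeighbours is our own); the sources recovered for a vertex d are exactly the set in(d) visited on line 6 of Algorithm 1:

```java
public class CscTraversal {
    // CSC arrays of the 6-vertex graph in Figure 1.
    static int[] index = {0, 1, 3, 5, 7, 11, 14};
    static int[] source = {5, 0, 5, 0, 5, 0, 5, 0, 2, 3, 5, 0, 3, 4};

    // Return the sources of all edges pointing to vertex d, i.e. in(d).
    static int[] inNeighbours(int d) {
        int[] in = new int[index[d + 1] - index[d]];
        for (int j = index[d]; j < index[d + 1]; j++)
            in[j - index[d]] = source[j];
        return in;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(inNeighbours(4))); // [0, 2, 3, 5]
    }
}
```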

5 Reference Solution

To help you with debugging your code, Table 1 lists the PageRank values as they are calculated by the reference solution to this exercise. PageRank values appear to converge from iteration 18 onwards. Note, however, that we show only 6 significant digits while we aim to calculate 7 converged digits (ε < 10^-7).

References

[1] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web.

Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120.

Available online: http://ilpubs.stanford.edu:8090/422/.

[2] Y. Saad. SPARSKIT: A basic tool for sparse matrix computations. Technical Report NASA-CR-185876,

NASA, May 1990. Available online: https://ntrs.nasa.gov/search.jsp?R=19910023551.


Table 1: PageRank values for the 6-node graph in Figure 1

t delta PR(0;t) PR(1;t) PR(2;t) PR(3;t) PR(4;t) PR(5;t)

init 0.166666 0.166666 0.166666 0.166666 0.166666 0.166666

1 0.547778 0.0769444 0.105278 0.105278 0.105278 0.317778 0.289444

2 0.181160 0.0891199 0.102200 0.102200 0.102200 0.236429 0.367849

3 0.137640 0.102013 0.117163 0.117163 0.117163 0.247469 0.299029

4 0.0634867 0.0924331 0.109775 0.109775 0.109775 0.259158 0.319083

5 0.0173711 0.0947956 0.110509 0.110509 0.110509 0.250473 0.323204

6 0.0131304 0.0956002 0.111715 0.111715 0.111715 0.252615 0.316639

7 0.00674097 0.0946550 0.110907 0.110907 0.110907 0.253344 0.319280

8 0.00171397 0.0949894 0.111081 0.111081 0.111081 0.252487 0.319281

9 0.00114623 0.0950142 0.111162 0.111162 0.111162 0.252790 0.318708

10 6.61535E-4 0.0949284 0.111081 0.111081 0.111081 0.252813 0.319016

11 2.39131E-4 0.0949692 0.111107 0.111107 0.111107 0.252735 0.318975

12 9.54587E-5 0.0949658 0.111111 0.111111 0.111111 0.252772 0.318930

13 6.58410E-5 0.0949588 0.111103 0.111103 0.111103 0.252769 0.318963

14 2.89733E-5 0.0949633 0.111106 0.111106 0.111106 0.252762 0.318955

15 8.19374E-6 0.0949624 0.111106 0.111106 0.111106 0.252767 0.318952

16 6.49790E-6 0.0949619 0.111106 0.111106 0.111106 0.252766 0.318956

17 3.18692E-6 0.0949624 0.111106 0.111106 0.111106 0.252765 0.318954

18 8.35832E-7 0.0949622 0.111106 0.111106 0.111106 0.252766 0.318954

19 5.90395E-7 0.0949622 0.111106 0.111106 0.111106 0.252766 0.318955

20 3.23357E-7 0.0949623 0.111106 0.111106 0.111106 0.252766 0.318954

21 1.01826E-7 0.0949623 0.111106 0.111106 0.111106 0.252766 0.318955

22 4.92322E-8 0.0949623 0.111106 0.111106 0.111106 0.252766 0.318954


A Real Number Representation

Real numbers are typically represented approximately in a computer using floating-point number representations.

These number representations have a finite number of bits, while real numbers may have an infinite number of digits.

For instance, the number 1/3 has an infinitely long representation 0.33333… As such, numbers are represented with limited accuracy.

The IEEE-754 standard for floating-point number representation specifies that a float value consists of 32 bits. These are 1 sign bit to indicate if the number is positive or negative, 8 exponent bits and 23 mantissa bits. Similarly, a double consists of 64 bits, with 1 sign bit, 11 exponent bits and 52 mantissa bits. Let s be the sign bit, e the exponent bits and m the mantissa bits of a floating-point number; then the real number that is represented is (−1)^s × (1.m) × 2^(e−b), where b = 127 for float and b = 1023 for double. The sign of the number is retrieved as (−1)^0 = 1 and (−1)^1 = −1. The constant b ensures that both very large and very small numbers can be represented. It is furthermore assumed that the exponent is decided such that it indicates the leading (most significant) 1-bit of the real number. As every real number that is not zero has such a bit, we don't need to store it. The zero value is represented by the special case of s = 0, e = 0 and m = 0.

Any real number can be represented with a relative precision of ε = 2^-M, where M is the number of significant mantissa bits. The difference between any real number r and its floating-point representation r̂ is thus bounded by ε: |(r − r̂)/r̂| < 2^-M. That is good news: a float can represent a real number with a relative error of at most 2^-23 ≈ 1.19 × 10^-7 and a double introduces a relative error of at most 2^-53 ≈ 1.11 × 10^-16.
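A quick Java check of the float bound stated above (a sketch of our own; 1.2e-7 is used as a slightly loose version of 2^-23):

```java
public class FloatError {
    public static void main(String[] args) {
        double r = 1.0 / 3.0;     // (double approximation of) 1/3
        float  f = (float) r;     // rounded to a 23-bit mantissa
        double relErr = Math.abs((r - f) / r);
        // The relative error stays below 2^-23 ~ 1.19e-7, as claimed above.
        System.out.println(relErr < 1.2e-7); // prints true
    }
}
```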

Because numbers are not always represented exactly, every arithmetic operation using floating-point number representations may introduce error. The problem results from the representation of the numbers. Assume we have two real numbers a and b that can be represented exactly using a floating-point representation as â = a and b̂ = b. If we desire to calculate a + b, then the computer will evaluate â + b̂, i.e., it will perform the addition on the floating-point representations, calculate an exact result for â + b̂ and then finally convert this exact result to a floating-point number representation. The final conversion from the exact result to a floating-point number is significant: the addition may generate more non-zero mantissa bits, e.g., because of carry taking place. However, only a limited number of mantissa bits can be stored. Some mantissa bits need to be dropped. This causes an inaccuracy of floating-point addition, even if the added values can be represented exactly. The error can be bounded by the same ε value as the representation of numbers.

A similar observation applies to floating-point subtraction, multiplication and division. In each case, the error bound is ε.
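The effect is easy to trigger in Java with two exactly-representable doubles (an illustrative example of our own):

```java
public class AdditionError {
    public static void main(String[] args) {
        double a = 1e16;  // exactly representable as a double
        double b = 1.0;   // exactly representable as a double
        // The exact sum 10000000000000001 needs more mantissa bits than
        // a double can store, so the addition rounds the 1 away entirely.
        System.out.println(a + b == a); // prints true
    }
}
```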

If a long chain of computations is performed, then the error will propagate and all individual errors may add up. As such, when n PageRank values are added up in line 11 of Algorithm 1, then the error on the value of l may be as large as n × 2^-53. Assuming a double representation and a graph with a billion edges, which is less than the current size of the World Wide Web, the total error may be as large as 10^9 × 2^-53 ≈ 10^9 × 1.11 × 10^-16 ≈ 1.11 × 10^-7 > 10^-7. In other words, the error due to floating-point arithmetic may be larger than the desired accuracy of the computation (typically 10^-7), which means the algorithm would never converge.
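Error accumulation over a chain of additions is already visible after ten steps (a classic Java demo; 0.1 itself is not exactly representable, so each addition contributes a small error):

```java
public class ErrorAccumulation {
    public static void main(String[] args) {
        double sum = 0.0;
        for (int i = 0; i < 10; i++)
            sum += 0.1; // each addition may round; the errors accumulate
        System.out.println(sum == 1.0); // prints false
        System.out.println(sum);        // 0.9999999999999999
    }
}
```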

