Google/Matrix

Page-ranking Problem

There are billions of web pages accessible over the internet, but how can someone find the ones most relevant to the topic they are interested in? We need a way to compare the importance and/or relevance of all of these pages. A primary concern for the search engine (beyond being able to index every page) is to rank pages with an algorithm that incorporates common usage patterns (e.g. a user who visits web page X is more likely to click the link to page Y than the link to page Z).

Eigenvalue Solution Techniques

Google uses the PageRank algorithm, which ranks pages primarily by the number of links pointing to a web page and the weights of those links. Initially every indexed page is given an equal rank, and an n x n connectivity matrix A is created (where n is the number of indexed pages). Entry A_i,j is 1 if page i links to page j, and 0 otherwise. (Note: Since n is a very large integer and most entries of A are 0, storing and working with A as a full matrix would be very inefficient, so it is usually stored as a sparse matrix.)
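As a minimal sketch of the sparse storage idea, the connectivity matrix can be kept as a mapping from each page to the set of pages it links to, so that only the ones are stored. The four-page link list and the function name build_connectivity are made up for illustration and are not part of any real index.

```python
# Hypothetical link data for a 4-page web: (source page, target page) pairs.
links = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4


def build_connectivity(n, links):
    """Return A as {i: set of j}, meaning A[i][j] == 1 exactly when j is in A[i]."""
    A = {i: set() for i in range(n)}
    for i, j in links:
        A[i].add(j)
    return A


A = build_connectivity(n, links)

# Only the nonzero entries are stored, instead of all n * n entries.
nonzeros = sum(len(targets) for targets in A.values())
print(nonzeros, "stored entries instead of", n * n)
```

For this toy example the saving is small, but the stored size grows with the number of links rather than with n squared.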

Once the initial ranking of every indexed page is set (simply 1/n each), the PageRank of a page is determined by how many links point to it and by the PageRanks of the pages that link to it. The rank of a page is the probability that a web surfer following links at random for an infinite amount of time will be visiting it.^[3]

In order to compute the PageRank of each website, we can use the Power Method to find the dominant eigenvector. This eigenvector is n x 1 and is made up of the PageRanks of all n websites. The connectivity matrix A is not initially in the correct form for the Power Method, so we adjust A by dividing each row by the number of ones in that row and call the resulting matrix B.
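The row scaling that turns A into B might look like the sketch below, which reuses a set-based sparse layout. The example data, the function name row_normalize, and the choice to leave a page without outgoing links as an empty row are all assumptions; the text does not say how such pages are handled.

```python
def row_normalize(A, n):
    """Divide each row of the 0/1 matrix A by its number of ones.

    A is {i: set of pages that i links to}; the result is {i: {j: weight}}.
    """
    B = {}
    for i in range(n):
        targets = A.get(i, set())
        # Assumption: a page with no outgoing links stays an empty (all-zero) row.
        B[i] = {j: 1.0 / len(targets) for j in targets}
    return B


# Hypothetical 4-page connectivity matrix.
A = {0: {1, 2}, 1: {2}, 2: {0}, 3: {2}}
B = row_normalize(A, 4)
print(B[0])  # page 0 splits its weight evenly between pages 1 and 2
```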

Algorithm Outline

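The Power Method x ← B^T x (B^T is the transpose of B, needed because the rows of B describe outgoing links while a page's rank comes from its incoming links) will converge to a unique rank vector if B is: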

Stochastic: each row contains only nonnegative values and the values in each row sum to one
Irreducible: it must be possible to get from any web page to any other web page by following links
Aperiodic: it is sufficient that every web page has a link to itself (or that the user can simply refresh the page)

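In order to calculate our dominant eigenvector, the matrix we iterate with must satisfy the conditions listed above, which the raw link structure in B does not guarantee. Let C be an n x n matrix whose elements are C_i,j = p*B_i,j + δ,

where p is the probability that the user clicks a link on the current page (usually taken to be 0.85) and δ = (1 - p)/n, with n the total number of pages. With probability 1 - p the user instead jumps to a page chosen uniformly at random.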
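The construction of C can be sketched directly from the formula. The four-page matrix B below is a made-up example in which every page has at least one outgoing link, and C is built as a full matrix purely for illustration, since at realistic sizes it would be far too large to store.

```python
def google_matrix(B, n, p=0.85):
    """Build C with C[i][j] = p * B[i][j] + delta, where delta = (1 - p) / n."""
    delta = (1.0 - p) / n
    return [
        [p * B.get(i, {}).get(j, 0.0) + delta for j in range(n)]
        for i in range(n)
    ]


# Hypothetical row-normalized link matrix, stored sparsely as {i: {j: weight}}.
B = {0: {1: 0.5, 2: 0.5}, 1: {2: 1.0}, 2: {0: 1.0}, 3: {2: 1.0}}
C = google_matrix(B, 4)

# Each row of C sums to p * 1 + n * delta = 1, so C is stochastic.
row_sums = [sum(row) for row in C]
# Every entry is at least delta > 0, so any page can be reached from any
# other page in one step, which makes C irreducible and aperiodic.
all_positive = all(entry > 0 for row in C for entry in row)
print(row_sums, all_positive)
```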

C now satisfies the above conditions, provided every page has at least one outgoing link so that each row of B sums to one (C is then the transition matrix of the Markov chain of a random progression through web pages). Next, we iterate the Power Method with C in place of B until it converges to the dominant eigenvector x, which satisfies x = C^T x (equivalently, x^T = x^T C). Once we have x, we can "assign" its components (the PageRanks) to the corresponding web pages.
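A minimal sketch of the iteration, assuming the convention x = C^T x and a small made-up four-page web, is given below. It never forms C explicitly: the δ term contributes δ times the sum of x to every component, so only the sparse B is needed. The function name pagerank, the tolerance, and the iteration cap are illustrative choices, not values taken from the text.

```python
def pagerank(B, n, p=0.85, tol=1e-10, max_iter=1000):
    """Power Method for x = C^T x, where C[i][j] = p * B[i][j] + (1 - p) / n.

    B is a sparse row-stochastic matrix stored as {i: {j: weight}}.
    """
    delta = (1.0 - p) / n
    x = [1.0 / n] * n  # every page starts with equal rank
    for _ in range(max_iter):
        # Contribution of the delta term: delta * sum(x) for every page.
        new = [delta * sum(x)] * n
        # Contribution of the links: page i passes p * x[i] * B[i][j] to page j.
        for i, row in B.items():
            for j, weight in row.items():
                new[j] += p * x[i] * weight
        # Renormalize so the ranks remain a probability distribution.
        total = sum(new)
        new = [value / total for value in new]
        if sum(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new
    return x


# Hypothetical 4-page web in which every page has at least one outgoing link.
B = {0: {1: 0.5, 2: 0.5}, 1: {2: 1.0}, 2: {0: 1.0}, 3: {2: 1.0}}
ranks = pagerank(B, 4)
best = max(range(4), key=lambda i: ranks[i])
print(ranks, "highest-ranked page:", best)
```

In this example page 3 has no incoming links, so its rank stays at δ, while page 2, which three pages link to, ranks highest.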