Google/Matrix
Page-ranking Problem
There are billions of web pages accessible over the internet, but how can someone find the ones with the most relevant information on the topic they are interested in? We need a way to compare the importance and relevance of all of these pages. A primary concern for a search engine (beyond being able to index every page) is to rank pages with an algorithm that incorporates common usage patterns (e.g. a user who visits web page X is more likely to click the link to page Y than the link to page Z).
Similar Problems
A similar problem is the Bowl Championship Series (BCS) rankings for college football. Since many football teams do not play each other during the regular season, how can they be ranked against one another? In this case, the BCS standings are calculated by summing three results: the Harris Interactive Poll[1], the USA Today Coaches Poll, and the average from six computer ranking systems.[1][2]
Eigenvalue Solution Techniques
Google uses the PageRank algorithm, which ranks pages primarily by the number of links pointing to a page and by the weights of those links. Initially every indexed page is given an equal rank, and an n x n connectivity matrix A is created (where n is the number of indexed pages). Entry Ai,j is 1 if page i links to page j; otherwise it is 0. (Note: since n is very large and each page links to only a tiny fraction of the others, storing and working with A as a full matrix would be very inefficient, so it is usually stored as a sparse matrix.)
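As a rough illustration of the sparse storage idea, the sketch below keeps only the positions of the ones in A. The three-page link data and the names `links`, `sparse_A`, and `entry` are assumptions made up for this example; a real index would use a dedicated sparse-matrix library.

```python
# Hypothetical link data: each page maps to the set of pages it links to.
links = {
    "X": {"Y", "Z"},
    "Y": {"X"},
    "Z": {"Y"},
}

pages = sorted(links)                          # fixed ordering of the n pages
index = {page: i for i, page in enumerate(pages)}

# Sparse form of A: store only the positions (i, j) where A[i][j] == 1.
sparse_A = {
    (index[src], index[dst])
    for src, dsts in links.items()
    for dst in dsts
}

def entry(i, j):
    """Return A[i][j] without building the full n x n matrix."""
    return 1 if (i, j) in sparse_A else 0

# Expanding to a dense matrix is only sensible for a tiny n like this one.
n = len(pages)
dense_A = [[entry(i, j) for j in range(n)] for i in range(n)]
print(dense_A)
```

The set holds one item per link, so its size grows with the number of links rather than with n squared.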
Once the initial rank of every indexed page has been set (simply 1/n), the updated PageRank of a page is determined by how many pages link to it and by the PageRanks of those linking pages. The rank of a page is the probability that a web surfer who follows links for an unlimited amount of time is on that page at any given moment.[3]
In order to compute the PageRank of each website, we can use the Power Method to find the dominant eigenvector. This eigenvector will be n x 1 and will be made up of the PageRanks of all n websites. The connectivity matrix A is not initially in the correct form for the Power Method, so we adjust A by dividing each row by the number of ones in that row and call the resulting matrix B. Each row of B then sums to one.
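The row scaling that turns A into B can be sketched as follows. The function name `row_normalize` is an assumption for this example, and because the text does not say what to do with a page that has no outgoing links, the sketch raises an error in that case rather than guessing.

```python
def row_normalize(A):
    """Divide each row of A by the number of ones in that row."""
    B = []
    for row in A:
        out_links = sum(row)
        if out_links == 0:
            # Assumption: pages without outgoing links are not handled here.
            raise ValueError("page with no outgoing links")
        B.append([value / out_links for value in row])
    return B

# A small connectivity matrix chosen for illustration.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
B = row_normalize(A)
print(B)
```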
Algorithm Outline
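The Power Method x ← B^T x (the transpose appears because row i of B holds the outgoing links of page i, so rank flows into page j along column j) will converge to a unique ranking vector if B is: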
- Stochastic: every entry is nonnegative and the entries in each row sum to one
- Irreducible: by following links it must be possible to get from any web page to any other web page
- Aperiodic: the surfer is never restricted to returning to a page only at multiples of some fixed number of steps; this is guaranteed if a page links to itself (or if the user can simply refresh the page)
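For small matrices these three conditions can be checked directly. The helper names below are assumptions made up for this sketch. The self-link test is only a sufficient condition for aperiodicity in an irreducible matrix: a matrix can fail it and still be aperiodic.

```python
def is_row_stochastic(M, tol=1e-12):
    """True if every entry is nonnegative and every row sums to one."""
    return all(
        all(v >= 0 for v in row) and abs(sum(row) - 1.0) <= tol
        for row in M
    )

def reachable_from(M, start):
    """Indices reachable from `start` by following positive entries of M."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j, v in enumerate(M[i]):
            if v > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def is_irreducible(M):
    """True if every page can reach every other page."""
    n = len(M)
    return all(len(reachable_from(M, i)) == n for i in range(n))

def has_self_link(M):
    """Sufficient (not necessary) aperiodicity test for an irreducible M."""
    return any(M[i][i] > 0 for i in range(len(M)))

B = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
print(is_row_stochastic(B), is_irreducible(B), has_self_link(B))
```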
To guarantee that the iteration finds the dominant eigenvector, the matrix must satisfy the conditions listed above. Let C be an n x n matrix whose elements are Ci,j = p*Bi,j + δ, where p is the probability that the user follows a link on the current page (usually taken to be 0.85) and δ = (1 - p)/n, with n the total number of pages. With probability 1 - p the user instead jumps to a page chosen uniformly at random.
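C now satisfies the above conditions: every entry is positive, and each row sums to p + nδ = 1. It is the transition matrix of the Markov chain describing a random progression through web pages. Next, we iterate the Power Method until x stops changing, at which point it is the dominant eigenvector satisfying x = C^T x (the transpose is needed because the rows of C, like those of A, describe outgoing links). Once we have x, we can assign its components (the PageRanks) to the corresponding web pages in A.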
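A minimal sketch of this procedure is given below, assuming B is already row-stochastic. Because row i of C holds the probabilities of leaving page i, the update multiplies by the transpose of C. The function names, the stopping tolerance, and the iteration cap are assumptions chosen for illustration, not part of the algorithm as stated.

```python
def damped_matrix(B, p=0.85):
    """Build C with entries C[i][j] = p * B[i][j] + (1 - p) / n."""
    n = len(B)
    delta = (1.0 - p) / n
    return [[p * B[i][j] + delta for j in range(n)] for i in range(n)]

def power_method(C, tol=1e-12, max_iter=1000):
    """Iterate x <- C^T x from equal ranks until x stops changing."""
    n = len(C)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        # (C^T x)[j] = sum over i of C[i][j] * x[i]
        new_x = [sum(C[i][j] * x[i] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x

# Hypothetical two-page web: page 0 links to itself and page 1, page 1 links to page 0.
B = [[0.5, 0.5],
     [1.0, 0.0]]
C = damped_matrix(B)
ranks = power_method(C)
print(ranks)
```

Starting from equal ranks keeps the components summing to one throughout, so no renormalization step is needed in this sketch.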
Example of Use
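As an example, let's say we have websites X, Y, and Z. X links to both Y and Z. Y links to X. Z links to Y. Ordering the rows and columns as X, Y, Z, this gives us matrix A,

$$A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix},$$

but then we must divide each element of a row by the sum of the elements in that row, giving us matrix B,

$$B = \begin{pmatrix} 0 & 0.5 & 0.5 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix},$$

and then we compute matrix C by taking into account p = 0.85, so that δ = (1 - 0.85)/3 = 0.05,

$$C = \begin{pmatrix} 0.05 & 0.475 & 0.475 \\ 0.9 & 0.05 & 0.05 \\ 0.05 & 0.9 & 0.05 \end{pmatrix},$$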
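and then we use some initial guess, x(0), for our Power Method, and calculate x(1), x(2), and so on. Taking, for instance, equal initial ranks of 1/3 for each page and updating with the transpose of C (its rows describe outgoing links) gives, to four decimal places,

$$x^{(0)} = \begin{pmatrix} 0.3333 \\ 0.3333 \\ 0.3333 \end{pmatrix}, \quad x^{(1)} = C^{T} x^{(0)} = \begin{pmatrix} 0.3333 \\ 0.4750 \\ 0.1917 \end{pmatrix}, \quad x^{(2)} = C^{T} x^{(1)} = \begin{pmatrix} 0.4538 \\ 0.3546 \\ 0.1917 \end{pmatrix}.$$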
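The whole example can be reproduced with a short script. This is a sketch under two assumptions that the text leaves open: the initial guess x(0) is taken to be equal ranks of 1/3, and the update is written as x ← C^T x because the rows of C describe outgoing links.

```python
pages = ["X", "Y", "Z"]
A = [[0, 1, 1],   # X links to Y and Z
     [1, 0, 0],   # Y links to X
     [0, 1, 0]]   # Z links to Y
p = 0.85
n = len(A)
delta = (1 - p) / n

B = [[a / sum(row) for a in row] for row in A]    # row-normalized A
C = [[p * b + delta for b in row] for row in B]   # damped transition matrix

def step(x):
    """One Power Method update, x <- C^T x."""
    return [sum(C[i][j] * x[i] for i in range(n)) for j in range(n)]

x = [1 / n] * n      # assumed initial guess x(0): equal rank for every page
history = [x]
for k in range(1, 101):
    x = step(x)
    history.append(x)

# Show the first few iterates and the value after 100 steps.
for k in (0, 1, 2, 3, 100):
    print(k, dict(zip(pages, (round(v, 4) for v in history[k]))))
```

Printing the early iterates alongside a late one makes it easy to see how quickly the components settle.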
Analysis
Convergence
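To see more easily which component is largest, we can write each vector x(i) as a sum of the unit vectors for X, Y, and Z multiplied by their coefficients, and compare the coefficients from one iteration to the next. As the iterates settle toward the stationary distribution of C, the coefficients approach roughly 0.388 for X, 0.397 for Y, and 0.215 for Z, so Y has the highest PageRank, followed closely by X, and then Z.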