Computer Science

PageRank Algorithm

PageRank is an algorithm used by Google to rank web pages in search engine results. It assigns a numerical weight to each element of a hyperlinked set of documents, with the purpose of measuring its relative importance within the set. The algorithm works by analyzing the link structure of the web and has been a key factor in the development of search engine technology.

Written by Perlego with AI-assistance

12 Key excerpts on "PageRank Algorithm"

  • Practical Discrete Mathematics
    • Ryan T. White, Archana Tikayat Ray (Authors)
    • 2021 (Publication Date)
    • Packt Publishing
      (Publisher)
    These failings prompted the need for a different kind of ranking method to move the "best" web pages to the top of the list when users search. With this in mind, we will continue to see how modern ranking algorithms, PageRank in particular, use linear algebra and probability to find the importance of web pages on the basis of which other websites link to them.

    Google PageRank II

    In this section, we will continue learning about Google's PageRank Algorithm, which we started to look at in Chapter 5, Elements of Discrete Probability. As we discussed in that chapter, two students at Stanford University and later founders of Google, Larry Page and Sergey Brin, along with some researchers at Stanford, Rajeev Motwani and Terry Winograd, tapped into some existing academic literature on information retrieval in linked documents and merged several innovations to adapt the ideas for use in web searches.
    The algorithm they developed, PageRank, was so effective that Google soon became totally dominant in the field of search engines in the late 1990s to early 2000s. This innovative PageRank Algorithm still forms a part of Google's searching methods, although their methods have, of course, progressed significantly in the past 20 years by implementing information from user histories, user location, and the like in determining which websites are most likely relevant to users.
    Without further ado, let's dive into learning just how PageRank works!
    In a basic search engine using the PageRank Algorithm, a user searches some terms and all the web pages with the terms, and perhaps web pages matching according to some fuzzy criteria, are returned. Then, the PageRanks of all the websites are found and sorted from highest to lowest. Finally, the results are displayed to the user in this descending order. Ideally, this means the most relevant websites will be shown to the user first.
    To see how PageRank works under the hood, let's first quickly review what we learned about the PageRank Algorithm in Chapter 5, Elements of Discrete Probability. Suppose we have an "internet" made up of a set of web pages. We will call the "internet" I and assume it has some finite number, N, of distinct web pages. In the real internet, this N
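The search-then-rank flow described in this excerpt can be sketched in a few lines; the pages, their text, and their PageRank scores below are invented purely for illustration:

```python
# Hypothetical example: rank search results by precomputed PageRank.
# The page names, texts, and scores are made up for illustration.
pagerank = {"pageA": 0.38, "pageB": 0.12, "pageC": 0.29, "pageD": 0.21}
page_text = {
    "pageA": "discrete mathematics and graphs",
    "pageB": "cooking with graphs of flavor",
    "pageC": "linear algebra for discrete mathematics",
    "pageD": "probability and statistics",
}

def search(term):
    # Step 1: find all pages containing the search term.
    matches = [p for p, text in page_text.items() if term in text]
    # Step 2: sort the matches by PageRank, highest first.
    return sorted(matches, key=lambda p: pagerank[p], reverse=True)

print(search("mathematics"))  # pageA (0.38) before pageC (0.29)
```

Real engines also apply fuzzy matching and relevance scoring before this sort, as the excerpt notes.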
  • Algorithms for the People

    Democracy in the Age of AI

    Algorithms could use hyperlinks to estimate the relevance of pages in search.26 This was the foundation of PageRank, an algorithm developed by the computer scientists Larry Page and Sergey Brin in 1998. (“PageRank” is a riff on Page’s name and webpages.) Page and Brin were working on a citation analysis project at Stanford University, supported by the National Science Foundation. (Google began as http://google.stanford.edu/). Just as citations were used to estimate the impact of scholarly papers, hyperlinks could be used to develop “an approximation of the overall relative importance of web pages,” encoding a judgment about relevance and authority.27 There are three parts to PageRank. The first is the number of backlinks (the number of hyperlinks to that page). If hyperlinks are affirmations or citations, pages with more backlinks are probably more authoritative than those with few. “The intuition behind PageRank,” write Page and Brin, “is that it uses information which is external to the Web pages themselves—their backlinks, which provide a kind of peer review.” Important websites (Yahoo.com was their example) “will have tens of thousands of backlinks (or citations) pointing to it,” as “many backlinks generally imply that [a page] is quite important.”28 The second is the quality of backlinks. “Backlinks from ‘important’ pages,” write Page and Brin, “are more significant than backlinks from average pages. This is encompassed in the . . . definition of PageRank.” Just as a citation from an important paper with lots of citations might count for more than a citation from a paper that has never been cited, PageRank estimates the quality of links “by counting the number and quality of links to a page to determine a rough estimate of how important the website is.”
  • Data Mining and Data Warehousing

    Principles and Practical Techniques

    This algorithm formed the base for the Google search engine. The PageRank Algorithm works on the basis of backlinks. It can be understood well by considering the social behaviour of human beings. For instance, if person A is very important, then many people will know him. This notion of knowing someone is similar to a web page linking to some other web page (a backlink). If a web page A is very important, then many other web pages will link back to it. This is accompanied by the fact that if very important web pages link back to a certain page, then that web page itself must be very important, just as, if many famous people know a certain person, that person must be very important or famous.

    Let us look into the mathematics of the PageRank Algorithm, which uses probability theory at its core. The probability of a user clicking on a link may be estimated from the number of links present on a web page: the more links a web page has, the smaller the chance that a user will click on any particular one. In other words, the probability of a user clicking a link is inversely proportional to the number of links on the web page. This means that if a user is on web page A, which has two links, the first leading to web page B and the second leading to web page C, then there is a 50% chance of the user visiting web page B from web page A. This 50% chance is multiplied by a damping factor in order to accommodate circumstances such as the user leaving web page A or jumping to another tab in the browser. Let us call this probability P(A→C), that is, the probability of visiting web page C from web page A. So, if three other web pages P, Q, and R link back to web page C, and P(P→C), P(Q→C), and P(R→C) are the probabilities of visiting web page C from the respective web pages, then the total probability of visiting web page C can be defined as the sum of all these probabilities. This allows us to calculate how likely a user is to visit a certain web page.
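The per-link click probabilities and the damping factor described here can be sketched as follows; the pages, their out-link counts, and the 0.85 damping value are illustrative assumptions, and the excerpt's "total probability" for page C is then the sum of the per-link contributions:

```python
# Sketch of the link-click probabilities described above.
d = 0.85  # damping factor (an assumed, typical value)

# Assumed out-degree (number of links) of each page in a made-up web;
# pages A, P, Q, and R all link to page C.
out_links = {"A": 2, "P": 4, "Q": 1, "R": 2}

def click_probability(page):
    # A user on `page` picks each of its links uniformly at random,
    # scaled by the damping factor (the user may instead leave the page).
    return d / out_links[page]

# Page A has two links (to B and to C), so P(A -> C) is d * 1/2.
p_ac = click_probability("A")

# Total contribution to C: sum over every page that links back to C.
p_c = sum(click_probability(p) for p in ["A", "P", "Q", "R"])
print(p_ac, p_c)  # 0.425 and 0.425 + 0.2125 + 0.85 + 0.425 = 1.9125
```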
  • Handbook of Bibliometric Indicators

    Quantitative Tools for Studying and Evaluating Research

    • Roberto Todeschini, Alberto Baccini (Authors)
    • 2016 (Publication Date)
    • Wiley-VCH
      (Publisher)

    PageRank (PR)

    Used in the Google search engine, the PageRank is an algorithm aimed to measure the relative importance of web pages [Brin and Page, 1998; Page, Brin et al., 1999; Bollen, Rodriguez et al., 2006; Sidiropoulos and Manolopoulos, 2006].
    The PageRank equation that governs the iterative transfer of PageRank values (PR) from one web page to another is

    $$\mathrm{PR}(p_i) = \frac{1-d}{n} + d \sum_{p_j \in B_i} \frac{\mathrm{PR}(p_j)}{L(p_j)}$$

    where B_i is the collection of web pages linking to the recipient page p_i and each page p_j transfers a proportion of its PR(p_j) to p_i; L(p_j) is the number of outlinks of web page p_j. It also assumes that PR values are uniformly distributed among the jth page's outlinks, i.e., if a page has three outlinks, each recipient page receives 1/3 of that page's PageRank. The parameter d, ranging between 0 and 1, is a damping factor representing the attenuation of prestige values as they are transferred from one page to another. The first term represents the minimal amount of prestige assigned to each web page, n being the total number of pages in the network. A typical value of the damping factor is 0.85, but different values have also been chosen [Ma, Guan et al., 2008; Boldi, Santini et al., 2009; Fragkiadaki, Evangelidis et al
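The iterative transfer described by this equation can be sketched in a few lines; the four-page link structure below is invented, and 50 iterations is an arbitrary cutoff assumed sufficient for convergence:

```python
# A minimal sketch of the damped PageRank update described above,
# run on a small invented four-page web.
links = {  # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
n = len(pages)
d = 0.85  # typical damping factor

pr = {p: 1.0 / n for p in pages}  # start from a uniform distribution
for _ in range(50):  # iterate until (approximately) converged
    new = {}
    for p in pages:
        # Each backlinking page q passes on pr[q] split over its outlinks.
        backlink_sum = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
        new[p] = (1 - d) / n + d * backlink_sum
    pr = new

# C collects links from A, B, and D, so it ends with the highest score.
print(max(pr, key=pr.get))  # "C"
```

Because no page here is link-free, the scores remain a probability distribution: they sum to 1 at every iteration.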
  • Practical Graph Mining with R
    • Nagiza F. Samatova, William Hendrix, John Jenkins, Kanchana Padmanabhan, Arpan Chakraborty (Authors)
    • 2013 (Publication Date)
    5.4.2 Algorithmic Intuition

    A simple approach to calculate the rank of a page would be to consider only the count of its backlinks. The drawback with this approach is that the credibility of the backlinks cannot be trusted. It is fairly simple for anyone to spam links to a page and thereby increase its backlink count, thus increasing the rank. To avoid such spamming, PageRank also considers the rank of the web pages that provide the backlinks. This iterative procedure makes the ranking process more accurate and reliable. The algorithm associates a numerical value with every web page, representing its relative importance within the Internet. This number is used by Google to index the web pages while displaying the search results. The Web has about a trillion unique URLs. Considering each of these web pages as a vertex and the links between pages as edges, the Web can be modeled as a directed graph. The relative importance of the web pages that provide the backlink, in addition to the number of backlinks for a given page, is used to rank web pages. The computational and mathematical backbone of PageRank is the power method, which computes an eigenvector of a matrix. It is a recursive procedure that can run for days for the larger matrices representing the Web. This technique is explained in Section 5.4.3.1.

    5.4.3 Mathematical Formulation

    Before we proceed, we define some important terminology and notation summarized in Table 5.2.
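The power method mentioned here can be illustrated on a tiny matrix; the 2×2 matrix and the iteration count are arbitrary demonstration choices, not part of the excerpt:

```python
# A minimal sketch of the power method: repeatedly multiplying a vector
# by a matrix (and rescaling) until it settles on the dominant eigenvector.

def mat_vec(M, v):
    # Plain matrix-vector product.
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def power_method(M, iterations=100):
    v = [1.0] + [0.0] * (len(M) - 1)  # arbitrary nonzero starting vector
    for _ in range(iterations):
        v = mat_vec(M, v)
        norm = sum(abs(x) for x in v)  # rescale to avoid overflow
        v = [x / norm for x in v]
    return v

M = [[2.0, 1.0],
     [1.0, 2.0]]  # dominant eigenvalue 3, eigenvector proportional to (1, 1)
print(power_method(M))  # approaches [0.5, 0.5] under this normalization
```

For PageRank, M would be the (stochastic) Google matrix, and the resulting dominant eigenvector holds the rank scores.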
  • Networked Life

    20 Questions and Answers

    Summary Box 3: PageRank computes webpages’ importance scores

    We can view the hyperlinked webpages as a network, and use the connectivity patterns to rank the webpages by their importance scores. If many important webpages point to a given webpage, that webpage may also be important. PageRank uniquely defines and efficiently computes a consistent set of importance scores. This set of scores can be viewed as the dominant eigenvector of a Google matrix that consists of a network connectivity part and a randomization part.

    Further Reading

    The PageRank algorithm is covered in almost every single book on network science these days. Some particularly useful references are as follows.
    1. Back in 1998, the Google founders wrote the following paper explaining the PageRank algorithm: S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,” Computer Networks and ISDN Systems, vol. 33, pp. 107–117, 1998.
    2. The standard reference book devoted to PageRank is A. N. Langville and C. D. Meyer, Google’s PageRank and Beyond, Princeton University Press, 2006.
    3. The following excellent textbook on the computer science and economics of networks has a detailed discussion of navigation on the web: D. Easley and J. Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World, Cambridge University Press, 2010.
    4. Dealing with non-negative matrices and creating linear stationary iterations are well documented, e.g., in the following reference book: A. Berman and R. J. Plemmons, Nonnegative Matrices in the Mathematical Sciences, Academic Press, 1979.
    5. Computational issues in matrix multiplication are treated in textbooks like this one: G. Golub and C. F. Van Loan, Matrix Computations, 3rd edn., The Johns Hopkins University Press, 1996.

    Problems

    3.1 PageRank sink
    [Figure 3.5: A simple network of webpages with a sink node.]
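The "connectivity part plus randomization part" structure of the Google matrix can be sketched as follows; the three-page link structure and the 0.85 weighting are illustrative assumptions:

```python
# A sketch of a "Google matrix": a column-stochastic connectivity part M
# plus a uniform randomization part. The 3-page web is invented.
n = 3
d = 0.85  # weight on the connectivity part

# Column j holds the out-link probabilities of page j:
# page 0 links to 1 and 2; page 1 links to 2; page 2 links to 0.
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]

# G = d*M + (1-d)/n * (all-ones matrix). Each column still sums to 1,
# so G describes a surfer who sometimes jumps to a uniformly random page.
G = [[d * M[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]

for j in range(n):
    print(round(sum(G[i][j] for i in range(n)), 10))  # each column sums to 1.0
```

The dominant eigenvector of G (computable by the power method) is the vector of importance scores the excerpt describes.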
  • Engines of Order

    A Mechanology of Algorithmic Techniques

    Indeed, PageRank is sometimes presented as an example for the field of unsupervised machine learning (Hastie, Tibshirani, and Friedman, 2009, p. 576), a set of techniques briefly discussed in Chapter 5 that are often used to ‘map out’ a dataset, either through forms of classification or through attribution of some numerical value to individual items. In this case, PageRank is the mechanism by which the web is no longer treated exclusively as a document repository, but additionally as a social system that is riddled with the very ‘socionomic hierarchies’ Moreno described. By applying a recursive status index inspired by sociometry and citation analysis to the index of web pages, every document in the corpus can be located in the stratified network/society and ranked according to its ‘status’ before searching even begins. Attributing an ‘importance score’ (Page et al., 1999, p. 3) to every document ‘independent of the page content’ (Bianchini, Gori, and Scarselli, 2005, p. 94) fundamentally means giving up the idea of a ‘flat corpus’ (Chakrabarti, 2003) of documents where relevance is only determined in relation to a query. While information retrieval generally conceives relevance with regard to a specific ‘informational need’, the socio- and scientometric attributions of status, influence, or impact are much closer to forms of aggregate description like mapping or clustering. As an unsupervised machine learning technique that tries to find an optimal description for specific aspects of a dataset, PageRank establishes a global ‘pecking order’ for the web. More precisely, because linking is framed as a form of citation and thus as rational attribution of importance by document authors, the hyperlink network can become a singular, universal network of ‘authority’ that the search system can combine with traditional ranking mechanisms to calculate a final score during the search process.
  • Mining of Massive Datasets
    • The behavior of a random surfer indicates which pages users of the Web are likely to visit. Users are more likely to visit useful pages than useless pages. But regardless of the reason, the PageRank measure has been proved empirically to work, and so we shall study in detail how it is computed.

    5.1.2 Definition of PageRank

    PageRank is a function that assigns a real number to each page in the Web (or at least to that portion of the Web that has been crawled and its links discovered). The intent is that the higher the PageRank of a page, the more “important” it is. There is not one fixed algorithm for assignment of PageRank, and in fact variations on the basic idea can alter the relative PageRank of any two pages. We begin by defining the basic, idealized PageRank, and follow it by modifications that are necessary for dealing with some real-world problems concerning the structure of the Web.

    Think of the Web as a directed graph, where pages are the nodes, and there is an arc from page p1 to page p2 if there are one or more links from p1 to p2. Figure 5.1 is an example of a tiny version of the Web, where there are only four pages. Page A has links to each of the other three pages; page B has links to A and D only; page C has a link only to A, and page D has links to B and C only.

    [Figure 5.1: A hypothetical example of the Web]

    Suppose a random surfer starts at page A in Fig. 5.1. There are links to B, C, and D, so this surfer will next be at each of those pages with probability 1/3, and has zero probability of being at A. A random surfer at B has, at the next step, probability 1/2 of being at A, 1/2 of being at D, and 0 of being at B or C. In general, we can define the transition matrix of the Web to describe what happens to random surfers after one step. This matrix M has n rows and columns, if there are n pages. The element m_ij in row i and column j has value 1/k if page j has k arcs out, and one of them is to page i.
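The transition-matrix definition above, applied to the excerpt's four-page example (A links to B, C, D; B to A, D; C to A; D to B, C), can be sketched as:

```python
# Build the transition matrix M for the four-page example above:
# column j holds 1/k for each of page j's k out-links (m_ij = 1/k).
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
pages = ["A", "B", "C", "D"]
idx = {p: i for i, p in enumerate(pages)}

n = len(pages)
M = [[0.0] * n for _ in range(n)]
for j, src in enumerate(pages):
    k = len(links[src])
    for dst in links[src]:
        M[idx[dst]][j] = 1.0 / k  # page j links to page i

# A surfer starting at A is next at B, C, or D, each with probability 1/3.
column_A = [M[i][idx["A"]] for i in range(n)]
print(column_A)  # [0.0, 1/3, 1/3, 1/3]
```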
  • Google's PageRank and Beyond

    The Science of Search Engine Rankings

    beautiful. We hope the next few chapters will convince you that it just might.
    In Chapter 3, we used words to present the PageRank thesis: a page is important if it is pointed to by other important pages. It is now time to translate these words into mathematical equations. This translation reveals that the PageRank importance scores are actually the stationary values of an enormous Markov chain, and consequently Markov theory explains many interesting properties of the elegantly simple PageRank model used by Google.
    This is the first of the mathematical chapters. Many of the mathematical terms in each chapter are explained in the Mathematics Chapter (Chapter 15). As terms that appear in the Mathematics Chapter are introduced in the text, they are italicized to remind you that definitions and more information can be found in Chapter 15.

    4.1 THE ORIGINAL SUMMATION FORMULA FOR PAGERANK

    Brin and Page, the inventors of PageRank,1 began with a simple summation equation, the roots of which actually derive from bibliometrics research, the analysis of the citation structure among academic papers. The PageRank of a page P_i, denoted r(P_i), is the sum of the PageRanks of all pages pointing into P_i:

    $$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|} \qquad (4.1.1)$$

    where B_{P_i} is the set of pages pointing into P_i (backlinking to P_i in Brin and Page’s words) and |P_j| is the number of outlinks from page P_j. Notice that the PageRank of inlinking pages r(P_j) in equation (4.1.1) is tempered by the number of recommendations made by P_j, denoted |P_j|. The problem with equation (4.1.1) is that the r(P_j) values, the PageRanks of pages inlinking to page P_i, are unknown. To sidestep this problem, Brin and Page used an iterative procedure. That is, they assumed that, in the beginning, all pages have equal PageRank (of say, 1/n, where n is the number of pages in Google’s index of the Web). Now the rule in equation (4.1.1) is followed to compute r(P_i) for each page P_i in the index. The rule in equation (4.1.1) is successively applied, substituting the values of the previous iterate into r(P_j). We introduce some more notation in order to define this iterative procedure. Let r_{k+1}(P_i) be the PageRank of page P_i at iteration k
  • The Power of Networks

    Six Principles That Connect Our Lives

    25. Is the Ranking Robust?

    What we’ve looked at is a method that allows us to find a set of importance scores that is consistent across various equations, thereby achieving a ranking. However, are we always guaranteed a unique ranking for any webgraph? The answer is: not quite yet. In general, we have to make two modifications to our procedure to guarantee a single, unique solution. These involve dealing with cases in which a webgraph has dangling nodes, which are nodes that don’t point to any other nodes, and in which it has multiple connected components. For more information on these special cases and how PageRank accounts for them, you can take a look at Q5.2 and Q5.3 on the book’s website. Suffice it to say that these modifications involve accounting for portions of the random surfing process that we didn’t describe in detail here. If we apply these fixes to our procedure, then PageRank will always have a unique answer. The specific procedures that can be used to make the PageRank computation scalable to trillions of webpages require some advanced mathematics that go beyond our scope. But you now have a conceptual understanding of how Google ranks the webpages on its search results page. Though our focus here has been largely on how to get the importance scores, don’t forget that the relevance scores are also a significant ingredient to the search engine recipe, so that irrelevant pages (i.e., those without content matching the user’s query) can be filtered out in the first place.

    Summary of Part II

    Here in part II of the book, we have explored methods of ranking, which happen to be fundamental to Google’s operation. We have investigated two key cases in which the company sifts through information about particular items—whether of bids or of a webgraph—to find some effective ranking of the items.
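One standard way of handling dangling nodes, patching their empty transition-matrix columns with a uniform distribution, can be sketched as follows. The excerpt itself defers the details to the book's website, so this is an illustrative reconstruction on an invented three-page web, not the book's own procedure:

```python
# A sketch of one common fix for dangling nodes (nodes with no out-links):
# treat a dangling node as if it linked to every page, so its column of
# the transition matrix becomes uniform. Page "C" here is the dangling node.
links = {"A": ["B", "C"], "B": ["C"], "C": []}
pages = ["A", "B", "C"]
n = len(pages)
idx = {p: i for i, p in enumerate(pages)}

M = [[0.0] * n for _ in range(n)]
for j, src in enumerate(pages):
    targets = links[src] or pages  # dangling node: jump anywhere
    for dst in targets:
        M[idx[dst]][j] = 1.0 / len(targets)

# Every column now sums to 1, so the random-surfer process is well defined.
print([sum(M[i][j] for i in range(n)) for j in range(n)])
```

Handling multiple connected components is the job of the randomization (teleportation) part of the Google matrix, which guarantees a unique answer.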
  • An Introduction to Search Engines and Web Navigation
    • Mark Levene (Author)
    • 2011 (Publication Date)
    • Wiley
      (Publisher)
    There is also a problem that each citation database covers resources that the others do not and therefore citation analysis is not consistent across the databases (64). Moreover, Google Scholar contains citations to nonscholarly sources, although, looking at the big picture, this is unlikely to have a big effect on the citation metrics. One could argue that the different databases complement each other, especially as the coverage of each is uneven across different fields of study.
    5.2.15 The Wide Ranging Interest in PageRank
    The topic of PageRank has received a fair amount of attention due to its immense innovation in successfully factoring link analysis into web search engines, and due to the meteoric rise of Google, the favourite search engine of the day. PageRank has been, and is being, covered in the business and technology press as part of Google's success story and the controversy generated by attempts to manipulate it. Computer and information scientists are researching the effects of PageRank on search technology and proposing further innovations to improve its utility in ranking web pages. Mathematicians are looking at PageRank from a linear algebra point of view, and statisticians are investigating PageRank as the result of a stochastic process. PageRank has also generated interest from social scientists who are interested in the social and economic issues related to web links, and PageRank has infiltrated popular science, with various articles explaining its inner workings to a wider audience. Finally, the SEO companies are writing about PageRank in an attempt to advise their customers on how to improve their search engine ranking.
    5.3 Popularity-Based Metrics
    After attaining a bachelor's degree in mechanical engineering, Gary Culliss went on to work as a patent agent. He then entered Harvard Law School in 1995 with the intention of returning to patent law on graduation. Having the idea that search should be user controlled, based on the popularity or number of hits to a web page, he created a prototype search engine with the help of a computer programmer from MIT, which he called Direct Hit
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.