Computer Science
Search Engine Indexing
Search engine indexing is the process of collecting, parsing, and storing data to facilitate fast and accurate information retrieval. It involves creating an index of web pages and their content, which allows a search engine to return relevant results to user queries quickly. Indexing enables search engines to organize the vast amount of information gathered from the web and to retrieve it efficiently at query time.
Written by Perlego with AI-assistance
8 Key excerpts on "Search Engine Indexing"
- eBook - PDF
- B. Barla Cambazoglu, Ricardo Baeza-Yates(Authors)
- 2022(Publication Date)
- Springer(Publisher)
CHAPTER 3: The Indexing System. The indexing system, besides indexing, is responsible for a number of information extraction, filtering, and classification tasks. Here, we overload the term “indexing” to include the tasks related to the processing of web pages as well. This system provides meta-data, metrics, and other kinds of feedback to the crawling system (e.g., various link analysis measures that are used to guide the crawling process) as well as the query processing system (e.g., some query-independent ranking features). It manipulates the output of the crawling system, and the query processing system operates on its output. In this respect, the indexing system acts as a bridge between the crawling and query processing systems. One of the important tasks performed by the indexing system is to convert the pages in the web repository into appropriate index structures that facilitate searching the textual content of pages. These index structures include an inverted index together with some other auxiliary data structures, which are processed each time a query is evaluated. In practice, the time cost of processing these data structures is the dominant factor in the response latency of a search engine. Moreover, these data structures are usually retained in memory, avoiding costly disk accesses as much as possible. Therefore, the compactness and efficient implementation of the index structures is vital. Finally, the index data structures need to be kept up to date by including newly downloaded pages so that the search engine can guarantee a certain level of freshness in its results. This is achieved either by periodically rebuilding the index structures or incrementally reflecting the changes in the web repository to the index structures. Another important objective of the indexing system is to improve the quality of the results served by the search engine. To this end, the indexing system performs two complementary types of task.
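The excerpt above describes an in-memory inverted index that is updated incrementally as newly downloaded pages arrive. The sketch below is a minimal, hypothetical illustration of that idea (the class and method names are ours, not the book's): each new page is tokenized and merged into the posting lists without rebuilding the whole index.

```python
from collections import defaultdict

class IncrementalInvertedIndex:
    """Minimal in-memory inverted index that can absorb newly crawled pages."""

    def __init__(self):
        # term -> {doc_id: term frequency}
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        """Incrementally reflect a newly downloaded page in the index."""
        for term in text.lower().split():
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def lookup(self, term):
        """Return the posting list (doc_id -> frequency) for a term."""
        return self.postings.get(term.lower(), {})

index = IncrementalInvertedIndex()
index.add_document("page-1", "search engine indexing keeps the index fresh")
index.add_document("page-2", "the inverted index maps terms to pages")
print(index.lookup("index"))  # {'page-1': 1, 'page-2': 1}
```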
- Maria Stone(Author)
- 2022(Publication Date)
- Springer(Publisher)
Search Results Abstracts: Short descriptions of results that appear on the SERP with each result. You may also see terms like “Snippets” (Google's name for the abstract) or “Summaries” (Bing's name for the abstract). Search Result: One of the documents or entities included in a SERP. Can be a text document, media file, or listing of structured data (such as a product listing). Document or Entity: A text or media file or a listing that can be indexed, retrieved, ranked, and presented on the SERP by a search engine. Prefix: The first few letters a user typed that were then auto-completed by an autocomplete service. Autocomplete Service: A service that finds the most likely completions of the first few letters that a user types. Query Suggestion: An explicit complete suggestion of an additional or a replacement query based on a user query. Spell Correction: A silent, automatic, or explicit user-visible correction of a likely spelling error in a user's query. Search Aids: Algorithmic and UI innovations designed to assist users in generating good queries (autocomplete service, spell corrections, query suggestions, etc.). 2.2 INDEXING Much like a human, a search engine first needs to “learn” the information it is going to provide in response to queries. It needs to build a memory of the documents (or other entities) and information that it will work with, and organize this memory for easy retrieval. There are many different architectures that designers of a search engine may consider, but there are commonalities in how search engines work. A typical search engine will “crawl” or ingest the collection of documents or entities and associated metadata it needs to index. Sometimes these files and documents are well structured, and other times not. The process of indexing involves deciding which keywords or phrases are important to associate with each document, which metadata is useful to keep, and creating one big table.
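The glossary above mentions an autocomplete service that completes the prefix a user has typed. As a rough illustration of that idea (not the architecture described in the book; the vocabulary and function names are hypothetical), the sketch below completes prefixes against a small sorted list of stored queries using binary search.

```python
import bisect

# Hypothetical vocabulary of previously seen queries, kept sorted for prefix lookup.
QUERIES = sorted([
    "search engine", "search engine indexing", "search engine marketing",
    "semantic web", "spell correction",
])

def autocomplete(prefix, limit=5):
    """Return up to `limit` stored queries that start with the typed prefix."""
    start = bisect.bisect_left(QUERIES, prefix)
    results = []
    for query in QUERIES[start:]:
        if not query.startswith(prefix):
            break
        results.append(query)
        if len(results) == limit:
            break
    return results

print(autocomplete("search en"))
# ['search engine', 'search engine indexing', 'search engine marketing']
```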
- Liyang Yu(Author)
- 2007(Publication Date)
- Chapman and Hall/CRC(Publisher)
However, in some sense, all these search engines are created equal, and in this section, we are going to study how they are created. More importantly, remember the frustration each one of us has experienced when using a search engine? In this section, we will begin to understand the root cause of this frustration as well. In fact, based on the discussion in this section, you can even start building your own search engine and play with it to see whether there is a better way to minimize the frustration. 2.1.1 Building the Index Table. Even before a search engine is made available on the Web, it starts preparing a huge index table for its potential users. This process is called the indexation process, and it will be conducted repeatedly throughout the life of the search engine. The quality of the generated index table to a large extent decides the quality of a query result. The indexation process is conducted by a special piece of software usually called a spider, or crawler. A crawler visits the Web to collect literally everything it can find, constructing the index table during its journey. To initially kick off the process, the main control component of a search engine will provide the crawler with a seed URL (in reality, this will be a set of seed URLs), and the crawler, after receiving this seed URL, will begin its journey by accessing this URL: it downloads the page pointed to by this URL and does the following: Step 1: Build an index table entry for every single word on this page. Let us use URL0 to denote the URL of this page. Once this step is done, the index table will look like this (see Figure 2.1). It reads like this: word1 shows up in this document, which has URL0 as its location on the Web, and the same with word2, word3, and so on. But what if some word shows up in this document more than once? The crawler certainly needs to remember this information.
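As a rough sketch of Step 1 above (our own illustration, not the book's code), the snippet below records, for every word on a downloaded page, the page's URL and how many times the word occurs, so that repeated occurrences are remembered.

```python
from collections import Counter

def index_page(url, page_text):
    """Step 1 (sketch): map every word on the page to (url, occurrence count)."""
    counts = Counter(page_text.lower().split())
    return {word: (url, count) for word, count in counts.items()}

# 'URL0' stands in for the seed page's address.
entry = index_page("URL0", "semantic web search search engine")
print(entry["search"])  # ('URL0', 2)
```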
- Mark Levene(Author)
- 2011(Publication Date)
- Wiley(Publisher)
Figure 4.5 (simplified search engine architecture) shows the main components of a search engine: the crawler, indexer, search index, query engine, and search interface. As I have already mentioned, a web crawler is a software program that traverses web pages, downloads them for indexing, and follows the hyperlinks that are referenced on the downloaded pages; web crawlers will be discussed in detail in the next section. As a matter of terminology, a web crawler is also known as a spider, a wanderer or a software robot. The second component is the indexer, which is responsible for creating the search index from the web pages it receives from the crawler. 4.5.1 The Search Index. The search index is a data repository containing all the information the search engine needs to match and retrieve web pages. The type of data structure used to organize the index is known as an inverted file. It is very much like an index at the back of a book. It contains all the words appearing in the web pages crawled, listed in alphabetical order (this is called the index file), and for each word it has a list of references to the web pages in which the word appears (this is called the posting list). In 1998 Brin and Page reported the Google search index to contain 14 million words, so currently it must be much larger than that, although clearly very much smaller than the reported number of web pages covered, which is currently over 600 billion. (Google reported that after discarding words that appear less than 200 times, there are about 13.6 million unique words in Google's search index.) Consider the entry for “chess” in the search index. Attached to the entry is the posting list of all web pages that contain the word “chess”; for example, the entry for “chess” could be
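To illustrate the inverted file just described (a hypothetical toy example, not Google's actual index), the sketch below keeps an alphabetically ordered index file whose entries point to posting lists, and answers a two-word query by intersecting those lists.

```python
# Toy inverted file: index file (sorted words) pointing to posting lists of page IDs.
posting_lists = {
    "chess": ["pageA", "pageC", "pageF"],
    "club":  ["pageB", "pageC"],
    "rules": ["pageA", "pageF"],
}
index_file = sorted(posting_lists)  # the alphabetically ordered word list

def search(query):
    """Return pages containing every query word (posting-list intersection)."""
    words = query.lower().split()
    if not all(w in posting_lists for w in words):
        return []
    result = set(posting_lists[words[0]])
    for w in words[1:]:
        result &= set(posting_lists[w])
    return sorted(result)

print(index_file)            # ['chess', 'club', 'rules']
print(search("chess club"))  # ['pageC']
```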
- eBook - PDF
- Kotrayya B. Agadi(Author)
- 2023(Publication Date)
- Society Publishing(Publisher)
(b) indexes the retrieved information about the web pages discovered, resulting in the creation of a database, and (c) enables users to search its database/index via an interface that provides searching facilities and options that the user can utilize at his or her discretion (Figure 2.4: StatCounter global stats for search engines). Bots, also known as robots, spiders, (web) crawlers, worms, intelligent agents, knowledge bots, or knowbots, are computer programs (i.e., software) used by search engines for the first task. Whatever name you give them, they all serve the same purpose: they 'surf' or 'crawl' the web by following links from one webpage or website to the next, collecting data for storage in their database. Furthermore, new websites are continually being created, and search engines must ensure that the results they present to their customers are up to date in order to remain competitive in the search engine industry. Spiders do not often work one at a time to meet the needs of a search engine. At its greatest performance, their system could crawl over 100 pages every second, generating over 600 kilobytes of data per second. Spiders collect data that will be evaluated to create indexes that will be stored in the search engine's database. What is indexed is determined by how each search engine decides to use the information available on each of the web pages gathered. Some search engines use the provided full text, others maintain some of the original mark-up tags, and still others consider both content and links when generating indexes based on the three most prominent information retrieval models: Boolean, vector space, and probabilistic.
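As a loose illustration of how such a spider follows links from page to page (a bare-bones sketch using Python's standard library, not any particular engine's crawler; the seed URL is a placeholder), the code below performs a breadth-first crawl that extracts hyperlinks from each fetched page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: fetch a page, record it, queue its outgoing links."""
    queue, seen = deque([seed_url]), set()
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except OSError:
            continue  # skip dead or unreachable links
        seen.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen

# crawl("https://example.com")  # placeholder seed URL
```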
- eBook - ePub
Google's PageRank and Beyond
The Science of Search Engine Rankings
- Amy N. Langville, Carl D. Meyer(Authors)
- 2011(Publication Date)
- Princeton University Press(Publisher)
This user impatience means that search engine precision must increase just as rapidly as the number of documents is increasing. Another dilemma unique to web search engines concerns their performance measurement and comparison. While traditional search engines are compared by running tests on familiar, well-studied, controlled collections, this is not realistic for web engines. Even small web collections are too large for researchers to catalog, count, and create estimates of the precision and recall numerators and denominators for dozens of queries. Comparing two search engines is usually done with user satisfaction studies and market share measures in addition to the baseline comparison measures of speed and storage requirements. 1.3.2 Elements of the Web Search Process
This last section of the introductory chapter describes the basic elements of the web information retrieval process. Their relationship to one another is shown in Figure 1.2. Our purpose in describing the many elements of the search process is twofold: first, it helps emphasize the focus of this book, which is the ranking part of the search process, and second, it shows how the ranking process fits into the grand scheme of search. Chapters 3-12 are devoted to the shaded parts of Figure 1.2, while all other parts are discussed briefly in Chapter 2. Figure 1.2: Elements of a search engine. • Crawler Module. The Web's self-organization means that, in contrast to traditional document collections, there is no central collection and categorization organization. Traditional document collections live in physical warehouses, such as the college's library or the local art museum, where they are categorized and filed. On the other hand, the web document collection lives in a cyber warehouse, a virtual entity that is not limited by geographical constraints and can grow without limit. However, this geographic freedom brings one unfortunate side effect. Search engines must do the data collection and categorization tasks on their own. As a result, all web search engines have a crawler module. This module contains the software that collects and categorizes the web's documents. The crawling software creates virtual robots, called spiders, that constantly scour the Web gathering new information and webpages and returning to store them in a central repository. • Page Repository.
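Since the excerpt refers to precision and recall numerators and denominators, here is a small worked illustration (our own hypothetical numbers, not figures from the book) of how those two measures are computed for a single query.

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / retrieved; recall = relevant retrieved / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# Hypothetical query: 8 of the 10 returned pages are relevant,
# out of 40 relevant pages that exist in the collection.
retrieved_pages = [f"r{i}" for i in range(10)]
relevant_pages = [f"r{i}" for i in range(8)] + [f"x{i}" for i in range(32)]
p, r = precision_recall(retrieved_pages, relevant_pages)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.80, recall=0.20
```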
- eBook - PDF
- Susan Feldman(Author)
- 2022(Publication Date)
- Springer(Publisher)
Usually (but not always) a document that contains all the query terms will be ranked higher than one that contains fewer terms but more appearances of those terms. Variations in how factors are weighted account for some of the differences in how search engines perform. Some of these factors might be the position in the document where the terms occur (title, chapter heading, bold print), how closely the terms appear together in the document (proximity), or whether the emphasis is on matching all the terms or on the number of occurrences of one of the query terms. There is no reason why any system can't return documents sorted by any criterion. A Boolean system could add relevance ranking, and a statistical system could return documents in order by date. 4.5 SEARCH AND CONTENT ANALYTICS We can group the technologies associated with “search” into four broad categories: • Connectors and crawlers • Search engines • Categorizers and clustering engines • Content analytics and natural language processing. Together, these technologies gather information, index it, and provide access to it. Each of these technologies provides different pathways into a collection of information. Taken together, and tuned for a particular application, they provide a user experience, and answers, that is far superior to the common user experience today. The trick is to select the most appropriate technology for each use. Chapter 5 discusses these technologies in terms of the types of uses, sources, and users. 4.6 COLLECTING INFORMATION FOR SEARCHING OR ANALYSIS Crawlers collect documents from sources such as the Web or enterprise repositories by following URLs or addresses. In other words, they crawl along the filaments of the Web that lead from one document to another, collecting the contents of each document along the way. For obvious reasons, this process is also called spidering.
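As a rough, hypothetical sketch of the kind of weighting the excerpt describes (the weights and field names are invented for illustration, not taken from any real engine), the function below scores a document higher when it matches more of the query terms, and adds a bonus when a term appears in the title.

```python
def score(query_terms, title, body, coverage_weight=2.0, title_bonus=1.5):
    """Toy relevance score: term frequency + coverage of query terms + title matches."""
    body_words = body.lower().split()
    title_words = set(title.lower().split())
    s = 0.0
    matched = 0
    for term in query_terms:
        term = term.lower()
        tf = body_words.count(term)
        if tf or term in title_words:
            matched += 1
        s += tf
        if term in title_words:
            s += title_bonus
    return s + coverage_weight * matched  # documents matching all terms rank higher

docs = {
    "doc1": ("Chess openings", "chess openings for club players learning chess"),
    "doc2": ("Club rules", "rules and rules and rules of the club"),
}
query = ["chess", "club"]
for doc_id, (title, body) in docs.items():
    print(doc_id, score(query, title, body))  # doc1 8.5, doc2 4.5
```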
- eBook - PDF
- Phil Bradley(Author)
- 2017(Publication Date)
- Facet Publishing(Publisher)
In fact, it can be quite difficult to identify exactly what you can and cannot do with an engine, since their key demographic is not the sophisticated searcher, but the one who wants to find, not search. Ease of use is therefore far more important in that situation than the bells and whistles of Boolean operators and other advanced search functionality. Data collection. In order to use search engines effectively it is necessary to have some background knowledge of the ways in which they work, and in particular, how they collect the information that they can then give you in the SERPs (search engine results pages). After all, if you don't know where the information comes from, you have no way of knowing if you have got all of it or not, and if you need to continue your search. Free-text search engines will make use of 'spiders', 'robots' or 'crawlers' (all the terms are interchangeable) and these are tools that spend their time exploring web page after web page. It's a simplification, but I hope an acceptable one (otherwise the explanation would drone on for page after technical page) to say that they will look at a page, include it in their index database and index it down to the word level. Once the page has been digested, as it were, the spider will follow a link to another page, do the same exact thing again and continue on. The software will take into account new sites, new pages, changes to the content on existing pages and any dead links that it discovers. The Google crawler, named 'Googlebot', has indexed over 100,000,000 gigabytes, and an unspecified number of actual web pages, but it's certainly in the trillions. At some unspecified time in the future the crawler will return to the site to reindex it. When it does that is based mainly on how often the pages on a site are updated. After all, it's a waste of the crawler's time if it keeps revisiting pages that haven't changed in months or years.
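The passage notes that a crawler decides when to revisit a site based mainly on how often its pages change. A minimal sketch of one way such a policy might look (entirely our own illustration, with made-up intervals and parameter names) is shown below.

```python
from datetime import datetime, timedelta

def next_recrawl(last_crawl, changed_on_last_visit, current_interval):
    """Shorten the revisit interval for pages that changed; back off for stale ones."""
    if changed_on_last_visit:
        interval = max(current_interval / 2, timedelta(hours=6))  # revisit sooner
    else:
        interval = min(current_interval * 2, timedelta(days=90))  # waste less crawl time
    return last_crawl + interval, interval

due, interval = next_recrawl(datetime(2024, 1, 1), changed_on_last_visit=False,
                             current_interval=timedelta(days=7))
print(due, interval)  # 2024-01-15 00:00:00, 14 days
```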
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.







