General words about how Web search works

Because so many questions arose about the general workings of search engines, here is a short introductory article. To make it a little clearer what a search engine is and what it should do, I describe it in general terms. Specialist programmers will probably not find much of interest here, so bear with me.

A search engine, in my humble opinion, should be able to find the most relevant results for a search query. In the familiar case of text search, the query is a set of words; personally, I have limited its length to eight words. The answer is a set of links to the pages most relevant to the query. It is desirable to supply each link with an annotation, so people know what to expect and can pick the results they need; such a summary is called a snippet.

I must say that the search problem in general cannot be solved: for any document with the highest relevance for, say, the word "work", you can create a modified copy that is even more relevant from the search engine's point of view, yet complete nonsense from a human's. It is a question of cost and time, of course. Given the vastness of today's Internet, there are, to put it mildly, a lot of such pages. Different systems fight them in different ways and with varying degrees of success; some day artificial intelligence will beat us all...


Meaning-recognition algorithms would help here, but I am familiar with only one that actually recognizes meaning rather than trusting statistics, and I do not see how it could be applied. So the task is solved empirically: by picking manipulations of the pages that separate the wheat from the chaff.

In the real world it is, for now, impossible to canvass the entire Internet in a second and pick the best-suited results, so a search engine stores a local copy of the piece of the network it has managed to collect and process. To quickly pull, out of a billion pages, only those that contain the desired word, an index database is built in which each word is mapped to the list of pages containing it. Naturally, it is also necessary to store the positions where the words occurred, how they were highlighted in the text, and other numeric page metrics, to use them later during sorting.
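The structure described above is an inverted index. A minimal sketch in Python, with all names illustrative (a real engine would pack highlighting flags and per-page metrics into each posting as well):

```python
from collections import defaultdict
import re

def build_index(pages):
    """Toy inverted index: word -> list of (page_id, positions)."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        positions = defaultdict(list)
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            # A real posting would also carry highlighting flags
            # and the numeric page metrics mentioned above.
            index[word].append((page_id, pos_list))
    return index

pages = {
    1: "search engines index the web",
    2: "the web is large",
}
index = build_index(pages)
# index["web"] -> [(1, [4]), (2, [1])]
```

Storing positions inside the posting is what later allows phrase matching and highlight-aware ranking without re-reading the page text.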

Let's say I have 100 million pages. The average word occurs in 1-1.5% of pages, i.e. about 1 million pages per word (some words occur on every second page, others are much rarer). Say there are 3 million such words; the rest are far rarer and are mostly typos and numbers. Storing one record saying that a particular word occurs on a particular page takes: 4 bytes for the page id, 4 bytes for the site id, 16-32 bytes of packed information on where and how the word was highlighted, 12 bytes for 3 link-ranking factors, and about 12-24 bytes for the remaining metrics. A rough estimate of the index size:
3 million words × 1 million pages per word × record size.
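Plugging the estimate above into actual numbers (taking the figures exactly as given in the text; the byte counts are the author's, the arithmetic is just spelled out):

```python
words = 3_000_000           # distinct common words
pages_per_word = 1_000_000  # ~1% of 100 million pages

# Per-record sizes from the estimate above, in bytes:
page_id, site_id = 4, 4
packed_positions = (16, 32)  # where/how the word was highlighted
link_factors = 12            # 3 link-ranking factors
other_metrics = (12, 24)

record_min = page_id + site_id + packed_positions[0] + link_factors + other_metrics[0]
record_max = page_id + site_id + packed_positions[1] + link_factors + other_metrics[1]

postings = words * pages_per_word
tb = 1024 ** 4
print(f"record size: {record_min}-{record_max} bytes")
print(f"index size: {postings * record_min / tb:.0f}-{postings * record_max / tb:.0f} TB")
```

So each record is 48-76 bytes, and the formula gives an index in the hundreds-of-terabytes range, which is why compact posting formats matter so much.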

To build this index there are 3 mechanisms:

page indexing: receiving pages from the web and their initial processing
building link metrics, such as PageRank, on the basis of the primary information
updating the existing index: merging in the new information and re-sorting by the computed metrics, in particular PageRank.

Additionally, the text of the pages needs to be saved, to create annotations during the search process.
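Annotation building from that saved text can be as simple as cutting a small window around the first query match. A toy sketch (the window size and asterisk highlighting are illustrative choices, not how any particular engine does it):

```python
import re

def make_snippet(text, query_words, window=6):
    """Pick a window of words around the first query match and
    mark matched words with asterisks."""
    words = text.split()
    lowered = [re.sub(r"\W", "", w).lower() for w in words]
    hit = next((i for i, w in enumerate(lowered) if w in query_words), 0)
    start = max(0, hit - window // 2)
    chunk = words[start:start + window]
    return " ".join(f"*{w}*" if re.sub(r"\W", "", w).lower() in query_words
                    else w for w in chunk)

snippet = make_snippet(
    "The search engine stores a local copy of the pages it has crawled.",
    {"copy", "pages"},
)
```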

The search process

One can come up with many relevance metrics: some depend on the "usefulness" of the result to a specific user, others on the total number of results found, and still others on the pages themselves; for example, some search engines have a "benchmark" to aim for.

For the machine, i.e. the server, to sort the results by some metric, it uses a set of numbers mapped to each page. For example: the total number of query words found in the page text, and their weight computed from how these words are highlighted in the page. Other factors do not depend on the query at all, for example the number of pages that link to this one: the higher it is, the more significant the page is in the results. A third kind of coefficient depends on the query itself: how rarely the found words are used, and which of them are so common they can be skipped.
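The three kinds of factors can be combined into a single score. A hypothetical sketch, where the weights, the log-based rarity (an IDF-style term), and the highlighting bonus are all assumptions for illustration, not any engine's actual formula:

```python
import math

def score(page, query_words, total_pages, doc_freq):
    """Combine the three kinds of factors into one illustrative score:
    - query-dependent: occurrences of query words, boosted if highlighted;
    - word rarity: rare words count for more (IDF-style weight);
    - query-independent: popularity, here an inbound-link count."""
    text_score = 0.0
    for w in query_words:
        count = page["word_counts"].get(w, 0)
        if count == 0:
            continue
        rarity = math.log(total_pages / doc_freq.get(w, total_pages))
        bold_bonus = 2.0 if w in page["highlighted"] else 1.0
        text_score += count * bold_bonus * rarity
    popularity = math.log(1 + page["inbound_links"])
    return text_score * (1 + 0.1 * popularity)

page = {"word_counts": {"search": 3, "engine": 1},
        "highlighted": {"search"},
        "inbound_links": 120}
doc_freq = {"search": 1_000_000, "engine": 200_000}
s = score(page, ["search", "engine"], 100_000_000, doc_freq)
```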

When the index is already built, searching goes like this:

split the query into words, pick the pieces of the index corresponding to each word, and intersect them (or do something else, depending on the chosen policy)
compute the coefficients for each page; if desired, there may be well over a thousand of them
build a relevance metric from these coefficients, sort, and select the best results
build the annotations (snippets) and display the result
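The steps above can be sketched end to end. A toy version, where the precomputed per-page score stands in for the full set of ranking coefficients (all names are illustrative):

```python
def search(index, query, page_scores, top_n=10):
    """Toy query processing: intersect the posting lists of the
    query words, then sort the surviving pages by a precomputed
    score that stands in for the full ranking formula."""
    words = query.lower().split()
    postings = [set(index.get(w, ())) for w in words]
    if not postings:
        return []
    candidates = set.intersection(*postings)  # pages containing every word
    return sorted(candidates, key=lambda p: page_scores.get(p, 0.0),
                  reverse=True)[:top_n]

index = {"web": [1, 2, 3], "search": [2, 3], "engine": [3]}
page_scores = {1: 0.2, 2: 0.9, 3: 0.5}
results = search(index, "web search", page_scores)
# intersection of {1, 2, 3} and {2, 3} is {2, 3}; page 2 scores higher
```

Intersection is only one possible policy, as noted above; a union with softer scoring is another common choice.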

The full table of contents and the list of my articles on search engines will be kept up to date here: http://habrahabr.ru/blogs/search_engines/123671/
Article based on information from habrahabr.ru
