Search technology, or what is the problem with writing your own search engine

Once upon a time an idea came into my head: to write my own search engine. That was a long time ago, when I was still in high school. I knew little about the technology of developing large projects, but I had a command of a couple dozen programming languages and protocols, and by then I had cobbled together plenty of websites of my own.

Well, I do have a craving for monstrous projects, yes...

At the time, little was known about how search engines work. The articles were in English and very scarce. Some of my friends who were following my search back then, drawing on the documents and ideas that they and I dug up, including ideas born in our debates, are now producing good courses and coming up with new search technologies. In general, the topic has produced quite a lot of interesting work. This work has led, among other things, to new developments at various large companies, including Google, though I personally have no direct connection to that.

At the moment I have my own trainable search engine with many nuances: computing PageRank (PR), collecting per-topic statistics, a learned ranking function, and know-how for stripping non-essential page content such as menus and advertising. The indexing speed is about half a million pages per day. It all runs on two of my home servers, and I am currently scaling the system to about five servers that I have access to.


Here, for the first time, I publicly describe what I have done. I think many will be interested in how Google, and almost every other search engine I know of, works on the inside.

Building such systems raises many problems that are almost impossible to solve in the general case. However, with a few tricks, some creativity, and a good understanding of how the hardware of your computer works, they can be seriously simplified. One example is recalculating PR: with several tens of millions of pages, the data will not fit into even the largest RAM, especially if you, like me, are greedy for information and want to store many more goodies than a single number per page. Another task is storing and updating the index, which is at minimum a two-dimensional database in which each word is mapped to the list of documents where it occurs.
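To make the "word mapped to a list of documents" idea concrete, here is a minimal in-memory sketch of an inverted index. It is only an illustration and not the storage scheme of the engine described here: the function names `build_inverted_index` and `search`, the whitespace tokenization, and the sample documents are all assumptions made for the example, and a real index of this size would live on disk rather than in a Python dictionary.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of ids of the documents containing it.

    `documents` is a dict {doc_id: text}. Tokenization is a plain
    lowercase whitespace split, which is an assumption for this sketch.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # Sorted posting lists are easier to merge and compress later.
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return the ids of the documents that contain every word of the query."""
    postings = [set(index.get(word, ())) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample data, just to exercise the functions.
docs = {
    1: "search engine index",
    2: "index update is hard",
    3: "search is fun",
}
index = build_inverted_index(docs)
result = search(index, "search index")
```

The hard part mentioned above, updating, is exactly what this sketch ignores: adding a document here means rewriting posting lists in memory, which does not carry over to an index of hundreds of gigabytes.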

Just think about it: according to one estimate, Google keeps more than 500 billion pages in its index. Suppose each page contributed a single word occurrence to the index, and storing it took 1 byte (which is impossible, because you have to store at least a 4-byte page id). Even then the index would take 500 GB. In reality, a word occurs on a page about 10 times on average, and the information about one occurrence rarely takes less than 30-50 bytes, so the whole index grows by thousands of times... So tell me, how do you store that? And how do you update it?
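The arithmetic above can be checked with a few lines of code. This is a back-of-the-envelope sketch using only the figures quoted in the text; the helper name `index_size_bytes` is made up for the example, and decimal gigabytes (10^9 bytes) are assumed.

```python
def index_size_bytes(pages, occurrences_per_page, bytes_per_occurrence):
    """Rough index size: every page contributes the same number of occurrences."""
    return pages * occurrences_per_page * bytes_per_occurrence


PAGES = 500_000_000_000  # the 500 billion pages quoted in the text
GB = 10**9               # decimal gigabyte, an assumption of this sketch

# One occurrence per page at 1 byte each: the "impossible" lower bound.
naive_gb = index_size_bytes(PAGES, 1, 1) // GB

# The same, but with a 4-byte page id per entry.
with_id_gb = index_size_bytes(PAGES, 1, 4) // GB

# 10 occurrences of a word per page, 30 to 50 bytes per occurrence.
low_gb = index_size_bytes(PAGES, 10, 30) // GB
high_gb = index_size_bytes(PAGES, 10, 50) // GB

print(naive_gb, with_id_gb, low_gb, high_gb)
```

With these inputs a single word per page already costs 300 to 500 times the naive estimate. Real pages contain many distinct words, which is where the growth by thousands of times comes from.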

I will explain how all of this works step by step: how to compute PR quickly and incrementally, how to store millions and billions of page texts and their addresses, how to look up addresses quickly, how the different parts of my database are organized, how to incrementally update an index of many hundreds of gigabytes, and, probably, how to build a trainable ranking algorithm.

As of today, the index alone (the part that is actually searched) takes 57 GB and grows by about 1 GB every day. The compressed texts take 25 GB, and I also keep a lot of other useful information whose volume is very hard to calculate because there is so much of it.

Here is the full list of articles related to my project:
0. Search technology, or what is the problem with writing your own search engine
1. How to start a search engine, or a few thoughts about the crawler
2. A few general words about how Web search works
3. Dataflow of the search engine
5. Methods for optimizing application performance when working with relational databases
6. A little about designing databases for search engines
7. AVL trees and the breadth of their application
8. Working with URLs and storing them
9. Building the index for a search engine
Article based on information from habrahabr.ru
