Whose morphology is better? Yandex vs. Google

There is an opinion that Yandex implements Russian morphology better than Google does. In this article, I will show that the opposite is true.

This article is an adaptation of my SeoNews article for Habr.

Russian morphology

Russian has several hundred thousand words, and each of them can take many forms. For example, an adjective can have 100 word forms.

As a result, storing the morphological dictionary naively ("head-on") takes on the order of 500 MB: 500,000 (number of words) * 75 (average number of word forms) * (10 bytes (average word length) + 4 bytes (to store the word number) + 2 bytes (to store the word-form number)). For speed, all of this data has to be kept in memory, and speed is critical for search engines.
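The estimate above can be checked with a few lines of arithmetic. This is only a minimal sketch: the constant names are illustrative, the values are the ones quoted in the text, and one byte per letter is an assumption. Evaluated literally, the formula gives 600 million bytes, which is the same order of magnitude as the 500 MB figure.

```python
# Back-of-the-envelope size of an uncompressed morphological dictionary.
# All figures come from the estimate in the text; the names are illustrative.
WORDS = 500_000          # number of words
FORMS_PER_WORD = 75      # average number of word forms per word
AVG_FORM_LEN = 10        # average word length, assuming 1 byte per letter
WORD_ID_BYTES = 4        # bytes to store the word number
FORM_ID_BYTES = 2        # bytes to store the word-form number


def naive_dictionary_bytes() -> int:
    """One record per word form: the text plus two numeric ids."""
    record = AVG_FORM_LEN + WORD_ID_BYTES + FORM_ID_BYTES
    return WORDS * FORMS_PER_WORD * record


if __name__ == "__main__":
    print(f"{naive_dictionary_bytes() / 1e6:.0f} MB")
```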

There is also a "compressed" representation. Many words take the same endings in the same forms, for example "great" and "mighty". For such words we only need to store the beginning of the word (the stem) and the number of its ending group. In the end, we need about 5 MB: 500,000 * (8 bytes (average stem length) + 2 bytes (group number)). However, a dictionary stored this way will contain artifacts.
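The idea of storing a stem plus a group number can be sketched as follows. This is a toy illustration, not the format of any real dictionary: the English stems and the single ending group are placeholders chosen for readability, and the function names are hypothetical.

```python
# Toy "compressed" dictionary: each word stores only its stem and the number
# of an ending group shared by all words that inflect the same way.
# The English stems and endings are placeholders, not data from the article.
ENDING_GROUPS = {
    0: ["", "s", "ed", "ing"],
}

STEMS = {
    "walk": 0,
    "talk": 0,
}


def word_forms(stem: str) -> list[str]:
    """Expand a stem into all of its word forms via its ending group."""
    return [stem + ending for ending in ENDING_GROUPS[STEMS[stem]]]


def compressed_dictionary_bytes(words: int = 500_000,
                                avg_stem_len: int = 8,
                                group_id_bytes: int = 2) -> int:
    """Size estimate from the text: one stem and one group number per word."""
    return words * (avg_stem_len + group_id_bytes)


if __name__ == "__main__":
    print(word_forms("walk"))
    print(compressed_dictionary_bytes() / 1e6, "MB")
```

The ending lists are stored once per group, so their cost is negligible next to the 500,000 stems.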

Artifacts

There are only a few rules for turning a verb (do) into a participle (doing). So in the compressed dictionary, gerunds and participles are treated as word forms of the verb rather than as separate words.
But the rules that relate imperfective and perfective verbs (to be doing -> to have done, to be buying -> to buy, to search for -> to find) are countless, so in the compressed dictionary the perfective and imperfective verbs are different words.
These artifacts matter only for search, where morphology is used to merge word forms.
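Here is a rough sketch of why this matters for search, assuming an index keyed by word number. The three Russian words (делать, делающий, сделать) are chosen here purely for illustration, and the function names are hypothetical; nothing below describes a real search engine's code.

```python
from collections import defaultdict

# Toy compressed dictionary: word form -> word number.
# The participle shares the verb's number; the perfective verb gets its own.
FORM_TO_WORD = {
    "делать": 1,     # imperfective verb, "to do"
    "делающий": 1,   # participle, stored as a form of the same verb
    "сделать": 2,    # perfective verb, stored as a separate word
}


def build_index(docs: dict[str, str]) -> dict[int, set[str]]:
    """Index documents by word number so that all word forms are merged."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in FORM_TO_WORD:
                index[FORM_TO_WORD[token]].add(doc_id)
    return index


def search(index: dict[int, set[str]], query: str) -> set[str]:
    """Return documents that contain any form of the query word."""
    word_id = FORM_TO_WORD.get(query.lower())
    return set(index.get(word_id, set()))


if __name__ == "__main__":
    docs = {"a": "делающий", "b": "сделать"}
    index = build_index(docs)
    print(search(index, "делать"))
```

A query for the imperfective verb finds the document with the participle but misses the one with the perfective verb, which is exactly the artifact described above.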

Yandex

Yandex highlights not only the query word itself but also its synonyms. However, synonym highlighting can be disabled with the "+" operator.

In Yandex, perfective and imperfective verbs are linked through synonyms, not through morphology.
The link between verbs and participles, however, is implemented through morphology.
image
This picture clearly shows the compression artifacts in the morphological dictionary. In other words, Yandex uses compression.
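The two kinds of links can be pictured as two layers of query expansion. The sketch below is a toy model of the behavior described above, not Yandex's actual implementation: the word list, the `SYNONYMS` table, and the `use_synonyms` flag (standing in for the "+" operator) are all assumptions made for illustration.

```python
# Toy two-layer query expansion. Layer 1 (morphology) maps word forms to a
# word number; layer 2 (synonyms) links word numbers to each other.
FORM_TO_WORD = {"делать": 1, "делающий": 1, "сделать": 2}
SYNONYMS = {1: {2}, 2: {1}}   # aspect pair linked as synonyms (assumption)


def expand(query: str, use_synonyms: bool = True) -> set[str]:
    """Return every word form that the query word may match."""
    word_id = FORM_TO_WORD.get(query)
    if word_id is None:
        return {query}
    word_ids = {word_id}
    if use_synonyms:
        # The synonym layer can be switched off without touching morphology.
        word_ids |= SYNONYMS.get(word_id, set())
    return {form for form, wid in FORM_TO_WORD.items() if wid in word_ids}


if __name__ == "__main__":
    print(sorted(expand("делать")))
    print(sorted(expand("делать", use_synonyms=False)))
```

With synonyms enabled the perfective verb is matched too; with synonyms disabled only the forms merged by morphology remain.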

Difference in the results

Perhaps the highlighting is simply out of step with the search "brains". However, for high-frequency queries synonym highlighting switches itself off. This shows that, in the case of synonyms, the highlighting is tied to the brains: it could not just switch off on its own. The only explanation is that such queries already return enough results, so Yandex saves resources by not searching for synonyms.

The difference in results is easy to observe in queries that contain a verb in both aspects and as a participle. For example, compare the imperfective and perfective forms of "to do an enema" and the participle phrase "enema made" by typing them into Yandex and into Google.

Influence on the quality of search results

We have shown that Yandex's morphology contains artifacts and that they affect ranking, although they may not affect the quality of the results. However, I quickly managed to find a few exceptions in Yandex: the imperfective and perfective forms of "buy", "tweeze", and "send" are merged at the level of morphology. The only hypothesis for why these exceptions appeared is that they were added to improve the results. Consequently, the artifacts worsen the results, at least in particular cases.

Google

Google uses uncompressed morphology. At least, I was not able to find any "compression artifacts".

The only place where Google departs from the formal model of the Russian language is that the positive (good) and superlative (better) degrees of adjectives are separated in its morphology. They are probably linked as synonyms instead; Google, too, highlights synonyms.
This cannot be explained as a compression artifact: there are not many rules for forming the degrees of adjectives (beautiful -> more beautiful, more clever -> cleverest), and neither the AOT.ru database nor Zaliznyak's dictionary separates the degrees of an adjective.

The adjectives are separated by degree in order to optimize the quality of the results. Degree changes the "coloring" of an adjective, which makes the semantic relationship between degrees more like that of synonyms than that of word forms. For example, the query "lovely photo" is much closer in meaning to "beautiful photos" than to the same adjective in a different degree.

This matches the intuitive perception of the language. Several times I have seen "good" and "better" cited as evidence that Google understands synonyms.

Why it happened

Yandex's morphology was written 10 years ago, when 500 MB of memory for hundreds of servers could cost a pretty penny. Memory prices have dropped since then, but changing the morphology would trigger a whole cascade of changes in Yandex's databases. That is why Yandex uses the compressed representation of morphology.
Google started out as an English-language search engine. English words have only a few word forms, so compressing the morphology makes no sense. Apparently that is why Google's Russian morphology is not compressed.

Summary

Google's morphology is organized "more correctly" and is slightly better than Yandex's. The irony is that the reason for this is Google's English-language origin.
However, morphology is only one of many aspects of search results. Saying that Google's results are better than Yandex's on the basis of morphology alone is like judging intellect by the height of the forehead. The purpose of this article was to dispel the belief that Google's morphology is worse than Yandex's.
Article based on information from habrahabr.ru
