Crimson Reason: Enhancing Textual Information Retrieval

Saturday, November 10, 2007

Enhancing Textual Information Retrieval

Information Retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hyper-textually-networked databases such as the World Wide Web.

IR encompasses the subject matter areas of data retrieval, document retrieval, information retrieval, and text retrieval, and each of these has its own bodies of literature, theory, praxis and technologies.

Automated IR systems are used to reduce information overload. Many universities and public libraries use IR systems to provide access to books, journals, and other documents. Web search engines such as Google, Yahoo Search, or Live Search (formerly MSN Search) are the most visible IR applications. Others include MS SQL English Query and similar systems that let users construct queries in English.

A typical IR application, such as a Yahoo or Google search, is initiated by a user who supplies a search string of text to the IR system. The IR system then tries to match the words in this search string, using computational linguistics techniques, against documents, the metadata that describe those documents, or any available databases. Once the system has executed its search algorithms against the target data sources (documents, metadata about those documents, data stores, etc.), it supplies the user with a set of matches for the search string.

It is often the case that the user is interested in retrieving information that is “about” his search string. At other times, the user could be interested in finding information that is relevant to the search string but not necessarily about it. However, the “about-ness” search and the “relevance” search are two operations that do not commute (in the sense of Abstract Algebra). That is, the results of an “about” search followed by a “relevance” search are not, in general, the same as those of a “relevance” search followed by an “about” search. Users often try to work around this by changing the order of the words in their search sentence, by using synonyms, and so on.
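The non-commuting behaviour can be illustrated with a toy model. The sketch below is only an assumption about what the two operations might look like: "about" is modelled as a keyword filter and "relevance" as following links between documents. The corpus, the function names, and both definitions are hypothetical and are not taken from any real IR system.

```python
# Toy corpus: document id -> (text, ids of the documents it links to).
# All data and names here are hypothetical, for illustration only.
CORPUS = {
    "d1": ("lake water quality report", {"d2"}),
    "d2": ("fishing permits and seasons", {"d3"}),
    "d3": ("lake boating regulations", set()),
}

def about(doc_ids, term):
    """Assumed 'about' search: keep documents whose text contains the term."""
    return {d for d in doc_ids if term in CORPUS[d][0].split()}

def relevant(doc_ids):
    """Assumed 'relevance' search: add every document linked from the set."""
    expanded = set(doc_ids)
    for d in doc_ids:
        expanded |= CORPUS[d][1]
    return expanded

start = {"d1", "d2"}
about_then_relevant = relevant(about(start, "lake"))
relevant_then_about = about(relevant(start), "lake")

# The two orders of application give different result sets.
print(sorted(about_then_relevant))
print(sorted(relevant_then_about))
```

In this toy model, filtering first discards a document whose links would otherwise have been followed, so the order in which the two searches are applied changes the final set.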

We can circumvent this limitation by drawing on the following two observations:

1. Textual searches return different results depending on the order of the words in the search string; e.g., a Google search for “Black is lake” will return different (but overlapping) results if one permutes the order of the words. Thus one gets more results by permuting the order of the words in the search string and collecting all the returned “hits”.

2. While word permutations in an English sentence (for example) can lead to meaningless sentences, the same loss of meaning does not occur in all languages. In languages such as Russian, Sanskrit, or Ancient Greek, the sentence retains some meaning after its word order is permuted. The reason is that these languages have preserved the Indo-European synthetic-inflexional structure, in which grammatical roles are marked on the words themselves rather than by their position in the sentence. Inflection (or inflexion) is the modification or marking of a word to reflect grammatical (that is, relational) information, such as gender, tense, number, or person.
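The first observation can be sketched in a few lines of Python. This is a minimal illustration, not a real search engine: `phrase_search` is a hypothetical stand-in that matches a document only when it contains the query as an exact phrase, which makes it deliberately sensitive to word order. The sample documents are invented.

```python
from itertools import permutations

# Invented sample documents, for illustration only.
DOCUMENTS = [
    "The lake is black at night",
    "Black is the new fashion",
    "Is black lake frozen?",
]

def permuted_queries(query):
    """Return every distinct word-order permutation of the query."""
    words = query.split()
    # dict.fromkeys removes duplicates (from repeated words) and keeps order.
    return list(dict.fromkeys(" ".join(p) for p in permutations(words)))

def phrase_search(query, documents):
    """Toy order-sensitive search: exact phrase match, ignoring case."""
    return {doc for doc in documents if query.lower() in doc.lower()}

def search_all_orders(query, documents):
    """Run the search once per permutation and collect all the hits."""
    hits = set()
    for variant in permuted_queries(query):
        hits |= phrase_search(variant, documents)
    return hits

single = phrase_search("Black is lake", DOCUMENTS)
combined = search_all_orders("Black is lake", DOCUMENTS)
print(len(single), len(combined))
```

A query of n distinct words has n! permutations, so a practical system would presumably need to cap the query length or sample the permutations.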

In this approach, the search text string (say, in English) is first translated automatically into one of the three languages mentioned above, i.e. Russian, Sanskrit, or Ancient Greek. One embodiment would use existing Commercial-Off-The-Shelf (COTS) translation software that can translate automatically to and from Russian (an example is the package used at www.babalefish.com). Next, the translated Russian text undergoes automatic permutation of its word order, and the system maintains a list of the permuted Russian search strings. The system then translates each permuted search string back into English (from Russian in this embodiment). The resulting list of permuted English search strings is supplied to the IR system. The results of all these permuted searches are then collected, collated, and delivered to the user.
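The steps just described can be sketched as a small pipeline. This is one possible reading, under stated assumptions: the translator and the IR system are passed in as plain functions because no particular product's API is assumed, the `max_words` cap is an added safeguard against factorial growth, and collation is taken to mean ranking documents by how many permuted queries returned them. The stubs at the bottom are placeholders that perform no real translation.

```python
from collections import Counter
from itertools import permutations

def permuted_search(query, translate_out, translate_back, search, max_words=6):
    """Translate, permute, translate back, search, and collate.

    translate_out, translate_back and search are caller-supplied
    callables (hypothetical interfaces, not a real product's API).
    """
    pivot_words = translate_out(query).split()
    if len(pivot_words) > max_words:
        raise ValueError("too many words to permute exhaustively")
    # The list of permuted pivot-language strings, without duplicates.
    pivot_variants = list(
        dict.fromkeys(" ".join(p) for p in permutations(pivot_words))
    )
    # Different pivot strings may translate back to the same English string.
    english_variants = list(
        dict.fromkeys(translate_back(v) for v in pivot_variants)
    )
    # Count how many of the permuted queries returned each document.
    counts = Counter()
    for variant in english_variants:
        counts.update(set(search(variant)))
    # Most frequently returned first; ties are broken alphabetically.
    return sorted(counts, key=lambda doc: (-counts[doc], doc))

# Placeholder stubs: identity "translation" and a toy phrase search.
def identity(text):
    return text

SAMPLE_DOCS = ["is black lake frozen", "the lake is black at night"]

def toy_search(query):
    return [doc for doc in SAMPLE_DOCS if query in doc]

results = permuted_search("black is lake", identity, identity, toy_search)
print(results)
```

With a real translator in place of the identity stubs, the back-translated variants would generally differ in wording and inflection as well as in word order, which is the effect the approach relies on.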

This approach may be immediately generalized to any language that is a member of the Indo-European family of languages, a family that includes most of the widely spoken modern languages of North and South America, Europe, and South Asia, as well as several languages of the Middle East.
