Lucene is an open source search engine, that one can use on top of custom data and create your own search engine – like your own personal Google! In this post, we will go over the basic math behind Lucene, and how it ranks documents to the input search query.
THE BASICS – TF*IDF
The analysis of language often brings us in situations where we are required to determine the weight or importance of a given word in a document, to determine the relative importance or similarity of a document to another document. In situations such as this, the first step to remove stop words which are basically words that dont contribute to the general focus of a given article. Most common stop words are words like – a, when, who, what. The list of stop words keeps changing based on the domain of discourse. For instance, in a corpus of articles about the human heart, the word heart could potentially be a stop word due to the sheer frequency in which it is mentioned. It is always a good idea to remove stop words in a given text before processing it. However, once these stop words are removed, one still faces the task of determining the relative importance or weights of the remaining words in the document – let’s start talking about TF*IDF.
Observing the words in a document, an intuitive way to discover the importance of a given word is to count the frequency of the word in the document. This is the Term Frequency or the of the word in the given document. is often normalized so as to not introduce a bias because of the document size. A normalized for a given word in a given document is calculated as:
where, is the word count of word in document and is the total count of all words in document .
However, by itself is not enough to capture the importance of a word in a document, as there are numerous words like therefore, however, that have a very high frequency but are not stop words, and carry very little importance. So we need to complement this with the Document Frequency of each word – or . Document Frequency is the total number of documents a given word occurs in. If a word occurs in a large number of documents, it has a high . When this is used to scale the weight of a word in the document, it is called or Inverse Document Frequency of that given word.
If is the total number of documents in a given corpus, then the of a given word in the corpus is:
Notice that is calculated for a given word in a given document, is calculated for a given word over all the documents.
Once we have the and , we can calculate the score for each word to determine the weight of a word in the given document.
The score assigns each word a weight in document. Here are some insights about it.
- is highest when occurs many times within a small number of documents. This means that the word is significant, and can help establish relevance within the small number of documents.
- is lower when the word occurs fewer times in a document, or occurs in many documents. If a word occurs infrequently, it might not be of significance. Similarly if a word occurs in a very large number of documents, it probably will not help in discriminating between the document.
- is lowest when the term occurs in virtually all documents. This covers words like a, an the, when etc. which span a large number of documents and have no contribution towards the semantic composition of the document.
FROM TF*IDF TO DOCUMENT SIMILARITY
Document vectors are created by computing the relative weights of each word in the document. One way to accomplish this is by computing the values of each of the words in the document. When we compute the of each of the words in a document, we end up with documents with a list of features (words) with their values (weights). In a sense, this represents the document vector. The representation of documents as vectors in a common vector space is known as a vector space model, and is the basis of a large number of information retrieval tasks.
A standard way of computing the document similarity is to compute the cosine similarity of the vector representations of the documents. If and are two documents, and and are the vector representations of them respectively, then the similarity of and can be measured as the cosine of the angle between and
In this equation, the numerator is the dot product of vectors and , and the denominator is the product of the Euclidean length. Euclidean Length is the sum of squares of the magnitude of each element of the vector.
If and are the following,
The dot product of and is:
The Euclidean length of :
Similarly, the Euclidean length of :
Applying (6), (7) and (8) to (3) we get:
Ok, we have cosine similarity. Now what? What cosine similarity tells us, is how similar 2 documents are. If the documents are very similar, we have the similarity score closer to 1, but if the documents are completely different, the similarity score is closer to -1.
This is the heart of the scoring mechanism that Lucene uses to retrieve similar documents given a search query. Although Lucene does use a couple of other (mostly user defined) constants to fine tune the results, is the heart of how Lucene operates.
- Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, Chapter on Scoring, term weighting and the vector space model