The Math Behind Lucene
Jun 17, 2020
Note: This article was originally published on June 17, 2020 and has been migrated from our previous blog. Some details — tools, libraries, benchmarks, industry context — may be outdated. For our latest perspective, see our recent posts.
Lucene is an open-source search engine library that you can use on top of your own data to build a custom search engine – like your own personal Google! In this post, we will go over the basic math behind Lucene and how it ranks documents against an input search query.
The analysis of language often puts us in situations where we need to determine the weight or importance of a given word in a document, or the relative importance or similarity of one document to another. In situations like this, the first step is to remove stop words, which are words that don't contribute to the general focus of a given article. The most common stop words are words like a, when, who, and what. The list of stop words changes with the domain of discourse. For instance, in a corpus of articles about the human heart, the word heart could itself be a stop word due to the sheer frequency with which it is mentioned. It is always a good idea to remove stop words from a text before processing it. However, once these stop words are removed, one still faces the task of determining the relative importance, or weights, of the remaining words in the document – so let's start talking about TF*IDF.
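A rough sketch of this preprocessing step in Python (the stop-word set here is a tiny illustrative sample, not a standard list):

```python
# Remove stop words from a tokenized text before further processing.
# This stop-word set is a small illustrative sample; real systems use
# larger, often domain-specific, lists.
STOP_WORDS = {"a", "an", "the", "of", "when", "who", "what", "is", "are"}

def remove_stop_words(tokens):
    """Return only the tokens that are not stop words."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "What is the structure of the human heart".split()
print(remove_stop_words(tokens))  # ['structure', 'human', 'heart']
```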
Observing the words in a document, an intuitive way to discover the importance of a given word is to count the frequency of the word in the document. This is the Term Frequency, or $tf$, of the word in the given document.

$tf$ is often normalized so as to not introduce a bias because of the document size. The normalized $tf$ for a given word $i$ in a given document $j$ is calculated as:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the count of word $i$ in document $j$, and $\sum_{k} n_{k,j}$ is the total count of all words $k$ in document $j$.
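A minimal Python sketch of this normalized term frequency (the tokenized example document is made up for illustration):

```python
from collections import Counter

def term_frequency(tokens):
    """Normalized tf: count of each word divided by the total word count."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

doc = ["lucene", "ranks", "documents", "lucene"]
print(term_frequency(doc))  # {'lucene': 0.5, 'ranks': 0.25, 'documents': 0.25}
```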
However, $tf$ by itself is not enough to capture the importance of a word in a document, as there are numerous words, like therefore and however, that have a very high frequency but are not stop words, and carry very little importance. So we need to complement $tf$ with the Document Frequency of each word, or $df$. Document Frequency is the total number of documents a given word occurs in. If a word occurs in a large number of documents, it has a high $df$. When the inverse of this $df$ is used to scale the weight of a word in a document, it is called the $idf$, or Inverse Document Frequency, of that word.

If $N$ is the total number of documents in a given corpus, then the $idf$ of a given word $i$ in the corpus is:

$$idf_{i} = \log\left(\frac{N}{df_{i}}\right)$$

Notice that while $tf$ is calculated for a given word in a given document, $idf$ is calculated for a given word over all the documents.
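This can be sketched in Python as follows (the toy corpus of tokenized documents is purely illustrative):

```python
import math

def inverse_document_frequency(corpus):
    """idf_i = log(N / df_i), where df_i counts the documents containing word i."""
    n_docs = len(corpus)
    df = {}
    for doc in corpus:
        for word in set(doc):  # set(): count each word once per document
            df[word] = df.get(word, 0) + 1
    return {word: math.log(n_docs / count) for word, count in df.items()}

corpus = [["lucene", "search"], ["search", "engine"], ["lucene", "engine"]]
idf = inverse_document_frequency(corpus)
# Every word here appears in 2 of the 3 documents, so each idf is log(3/2).
```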
Once we have the $tf$ and $idf$, we can calculate the $tf \times idf$ score for each word to determine the weight of the word in the given document:

$$tf\text{-}idf_{i,j} = tf_{i,j} \times idf_{i}$$

The $tf \times idf$ score assigns each word $i$ a weight in document $j$: the weight is high when the word occurs frequently in the document but rarely across the corpus, and low when the word is either rare in the document or common everywhere.
Document vectors are created by computing the relative weights of each word in the document. One way to accomplish this is by computing the $tf \times idf$ values of each of the words in the document. When we do so, we end up with each document represented as a list of features (words) with their values (weights). In a sense, this is the document vector. The representation of documents as vectors in a common vector space is known as the vector space model, and it is the basis of a large number of information retrieval tasks.
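Putting $tf$ and $idf$ together, here is a small sketch of building such document vectors (again over a made-up tokenized corpus):

```python
import math
from collections import Counter

def tf_idf_vectors(corpus):
    """Represent each document as a {word: tf*idf weight} vector."""
    n_docs = len(corpus)
    df = Counter(word for doc in corpus for word in set(doc))
    idf = {w: math.log(n_docs / c) for w, c in df.items()}
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({w: (n / total) * idf[w] for w, n in counts.items()})
    return vectors

corpus = [["lucene", "search", "search"], ["search", "engine"]]
vecs = tf_idf_vectors(corpus)
# "search" appears in both documents, so its idf (and hence its weight) is log(2/2) = 0.
```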
A standard way of computing document similarity is to compute the cosine similarity of the vector representations of the documents. If $d_1$ and $d_2$ are two documents, and $\vec{V}(d_1)$ and $\vec{V}(d_2)$ are their respective vector representations, then the similarity of $d_1$ and $d_2$ can be measured as the cosine of the angle between $\vec{V}(d_1)$ and $\vec{V}(d_2)$:

$$sim(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)| \, |\vec{V}(d_2)|}$$

In this equation, the numerator is the dot product of the vectors $\vec{V}(d_1)$ and $\vec{V}(d_2)$, and the denominator is the product of their Euclidean lengths. The Euclidean length of a vector is the square root of the sum of the squares of its elements.
For example, take the illustrative vectors $\vec{V}(d_1) = (1, 2, 0)$ and $\vec{V}(d_2) = (2, 1, 1)$.

The dot product of $\vec{V}(d_1)$ and $\vec{V}(d_2)$ is:

$$\vec{V}(d_1) \cdot \vec{V}(d_2) = 1 \times 2 + 2 \times 1 + 0 \times 1 = 4$$

The Euclidean length of $\vec{V}(d_1)$:

$$|\vec{V}(d_1)| = \sqrt{1^2 + 2^2 + 0^2} = \sqrt{5}$$

Similarly, the Euclidean length of $\vec{V}(d_2)$:

$$|\vec{V}(d_2)| = \sqrt{2^2 + 1^2 + 1^2} = \sqrt{6}$$

Plugging these into the cosine similarity formula, we get:

$$sim(d_1, d_2) = \frac{4}{\sqrt{5} \times \sqrt{6}} = \frac{4}{\sqrt{30}} \approx 0.73$$
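Cosine similarity over sparse {word: weight} vectors can be sketched in Python like this (the example vectors are illustrative):

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(w * v2.get(word, 0.0) for word, w in v1.items())
    len1 = math.sqrt(sum(w * w for w in v1.values()))  # Euclidean length of v1
    len2 = math.sqrt(sum(w * w for w in v2.values()))  # Euclidean length of v2
    if len1 == 0 or len2 == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (len1 * len2)

v1 = {"a": 1.0, "b": 2.0}
v2 = {"a": 2.0, "b": 1.0, "c": 1.0}
print(round(cosine_similarity(v1, v2), 4))  # dot = 4, lengths sqrt(5) and sqrt(6)
```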
OK, we have cosine similarity. Now what? What cosine similarity tells us is how similar two documents are. If the documents are very similar, the similarity score is close to 1; if the documents are completely different, the score is close to 0 (since $tf \times idf$ weights are never negative, the cosine of two document vectors never drops below 0).
This is the heart of the scoring mechanism that Lucene uses to retrieve documents similar to a given search query. Although Lucene does use a couple of other (mostly user-defined) constants to fine-tune the results, $tf \times idf$ weighting combined with cosine similarity is the heart of how Lucene operates.