Build Your Own Search Engine
Jun 17, 2020
Jun 17, 2020
Note: This article was originally published on June 17, 2020 and has been migrated from our previous blog. Some details — tools, libraries, benchmarks, industry context — may be outdated. For our latest perspective, see our recent posts.
Post by Dr. Rutu Mulkar-Mehta
In this post, I will take you through the steps for calculating the values for all the words in a given document. To implement this, we use a small dataset (or corpus, as NLPers like to call it) form the Project Gutenberg Catalog. This is just a simple toy example on a very small dataset. In real life we use much larger corpora, and need some more sophisticated tools in order to handle large amounts of data. To brush up on the basic concepts of
you might want to check out my post on the The Math behind Lucene.
For this exercise, we will use the Project Gutenberg Selections which are released as part of NLTK Data. NLTK – Natural Language Toolkit – is a Python-based module for text processing. The corpus – Project Gutenberg Selections – contains 18 files. Each file is a complete book. Our task is to calculate the of all the words for each of the documents provided. At the end of this exercise we will have 18 documents, with the
of each word in each of the documents.
Documents with values for each word (or token) are often used (with the vector space model) to compute the similarity between two documents. Such statistics are quite relevant in Information Retrieval, Search Engines, Document Similarity, and so on.
NOTE: All the materials needed for this exercise (code + data) can be downloaded from my GitHub repo.
The code is written in Perl, and is heavy in regular expressions. I have tried my best to document the code, but if you have any issues, or discover bugs, please do not hesitate in contacting me. I will not go into the details of all my code in this post, but will highlight a few features.
This is one of the most important tasks of any NLP application. Discourse often contains non-ASCII characters, non-alphanumeric characters, spaces, and so on. The first and most important task is to remove these unwanted characters from text, to clean it up. Here is a set of regular expressions (regex) that is helpful for this dataset. However, it is not an exhaustive set of regexes. For instance, I have not used any regex to convert UTF-8 to ASCII.
# remove endline character
chomp(
txt =~ s/[\h\v]+/ /g;
# remove caps
txt =~ s/[^a-zA-Z\d\s]//g;
The rest of my code can be downloaded from my GitHub repo. The code is relatively easy to understand if you are familiar with regular expressions, and understand Perl. I have provided my solutions in the output directory, within the folder . The intermediate
and
results are also provided in the output folder.
It is interesting to note here that words occurring in all the documents have an value of 0. This means that their
value will also be 0, deeming them insignificant in contributing to the document vector. These include words like – a, the, when, if, etc. A longer list can be developed by sorting all the values in the file
idf.txt (downloaded from my GitHub repo).
The next step, is to use these metrics to compute document similarity. Discovering similar document in a corpus of documents remains one of the most important problems of information retrieval. This is often done by converting documents into document vectors, and comparing the similarity of these vectors to each other using vector similarity metrics.
This post uses the output generated in the toy example for which is available in this GitHub repo.
After computing the values for each document in a given corpus, we need to go through the exercise to convert these values into a document vector. If you recall some vector basics, a vector constitutes a magnitude and a direction. In our case, the direction is represented by a word in the document and magnitude is the weight or the
value of the word in the document. In order to simplify our vector calculations we pre-specify the locations of each word in the array representing the document, creating a sparse vector.
For instance consider the vocabulary of the entire corpus to be: John, Mary, Susan, Kendra, sang, for, with, plays
We will assign a location to each word in the vocabulary as:
| John | Mary | Susan | Kendra | sang | for | with | dances |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Given this data, for the sentence: John sang for Mary, we create the following boolean document vector:
| John | Mary | Susan | Kendra | sang | for | with | dances |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
Here a 1 represents the existence of a word in the text, and a 0 represents the absence of a word from the text. This example is simply to illustrate the concept of creating a document vector. In our toy example we will replace with 1 values with the values of a given word in the given document.
Computing the dot Product of two Vectors
| John | Mary | Kendra | sang | for | with | dances | Sentence |
|---|---|---|---|---|---|---|---|
| d1 | 0 | 0.27 | 0 | 0 | 0 | 0.1 | 0.1 |
| d2 | 0 | 0 | 0.1 | 0.27 | 0.27 | 0 | 0 |
| d3 | 0 | 0 | 0.1 | 0 | 0 | 0.1 | 0.1 |
Dot product between and
:
The final similarity scores are:
| doc3.txt | doc2.txt | 0.178554901188263 |
| doc3.txt | doc1.txt | 0.377800203993898 |
doc2 similarity with doc1 and doc3:
| doc2.txt | doc3.txt | 0.178554901188263 |
| doc2.txt | doc1.txt | 0 |
doc1 similarity with doc2 and doc3:
| doc1.txt | doc3.txt | 0.377800203993898 |
| doc1.txt | doc2.txt | 0 |
Notice that document 1 is most similar to document 3. Document 2 and document 1 have absolutely no similarities. This is because both these documents have just one word in common – John, which is common to all the documents, so has a weight of 0.
Happy Coding!
Say hello to optimized workflows and enhanced decision-making! Harness the power of AI to unlock differentiated insights
Harness the power of AI - Whether it’s optimizing supply chains in logistics, preventing fraud in healthcare insurance, or leveraging advanced social listening to enhance your portfolio companies.