Post by Dr. Rutu Mulkar-Mehta
In this post, I will take you through the steps for calculating the values for all the words in a given document. To implement this, we use a small dataset (or corpus, as NLPers like to call it) form the Project Gutenberg Catalog. This is just a simple toy example on a very small dataset. In real life we use much larger corpora, and need some more sophisticated tools in order to handle large amounts of data. To brush up on the basic concepts of you might want to check out my post on the The Math behind Lucene.
For this exercise, we will use the Project Gutenberg Selections which are released as part of NLTK Data. NLTK – Natural Language Toolkit – is a Python-based module for text processing. The corpus – Project Gutenberg Selections – contains 18 files. Each file is a complete book. Our task is to calculate the of all the words for each of the documents provided. In the end of this exercise we will have 18 documents, with the of each word in each of the documents.
Documents with values for each word (or token) are often used (with the vector space model) to compute the similarity between two documents. Such statistics are quite relevant in Information Retrieval, Search Engines, Document Similarity, and so on.
NOTE: All the materials needed for this exercise (code + data) can be downloaded from my GitHub repo.
The code is written in Perl, and is heavy in regular expressions. I have tried my best to document the code, but if you have any issues, or discover bugs, please do not hesitate in contacting me. I will not go into the details of all my code in this post, but will highlight a few features.
This is one of the most important tasks of any NLP application. Discourse often contains non-ASCII characters, no alphanumeric characters, spaces, and so on. The first and most important task is to remove these unwanted characters from text, to clean it up. Here is a set of regular expressions (regex) that is helpful for this dataset. However it is not an exhaustive set of regexes. For instance, I have not used any regex to convert UTF-8 to ASCII.
# remove endline character chomp(txt =~ s/[\h\v]+/ /g; # remove caps txt =~ s/[^a-zA-Z\d\s]//g;
The rest of my code can be downloaded from my GitHub repo. The code is relatively easy to understand if you are familiar with regular expressions, and understand Perl. I have provided my solutions in the output directory, within the folder . The intermediate and results are also provided in the output folder.
It is interesting to note here that words occurring in all the documents have an value of 0. This means that their value will also be 0, deeming them insignificant in contributing to the document vector. These include words like – a, the, when, if, etc. A longer list can be developed by sorting all the values in the file
idf.txt (downloaded from my GitHub repo).
The next step, is to use these metrics to compute document similarity. Discovering similar document in a corpus of documents remains one of the most important problems of information retrieval. This is often done by converting documents into document vectors, and comparing the similarity of these vectors to each other using vector similarity metrics.
This post uses the output generated in the toy example for which is available in this GitHub repo.
After computing the values for each document in a given corpus, we need to go through the exercise to convert these values into a document vector. If you recall some vector basics, a vector constitutes a magnitude and a direction. In our case, the direction is represented by a word in the document and magnitude is the weight or the value of the word in the document. In order to simplify our vector calculations we pre-specify the locations of each word in the array representing the document, creating a sparse vector.
For instance consider the vocabulary of the entire corpus to be:
John, Mary, Susan, Kendra, sang, for, with, plays
We will assign a location to each word in the vocabulary as:
Given this data, for the sentence: John sang for Mary, we create the following boolean document vector:
Here a 1 represents the existence of a word in the text, and a 0 represents the absence of a word from the text. This example is simply to illustrate the concept of creating a document vector. In our toy example we will replace with 1 values with the values of a given word in the given document.
Computing the dot Product of two Vectors
Dot product between and :
The final similarity scores are:
doc2 similarity with doc1 and doc3:
doc1 similarity with doc2 and doc3:
Notice that document 1 is most similar to document 3. Document 2 and document 1 have absolutely no similarities. This is because both these documents have just one word in common – John, which is common to all the documents, so has a weight of 0.