Human history has an unfortunate record of discrimination and biases against each other that follow us from ancient times. A conscious effort towards developing inclusive and equal social systems is a necessity. In the increasingly automated world, where computers interact with humans, Artificial Intelligence gives us another shot at making the world a fairer place with equal opportunities.

However, machines are built by people. We are obliged to put conscious effort into making sure the AI solutions won’t carry over our mistakes.

The appeal of AI is tremendous: it can search through millions of pieces of data and use it to make forecasts that are often more accurate than ours. Automating processes with AI also seems more objective than relying on subjective (and slower) human analysis. After all, the AI algorithm will not “dislike” your picture or assume anything based on it, especially when it is taught to ignore it completely.

The problem is, AI algorithms are not necessarily human-bias free. AI is designed and trained on human data, and human thinking is characterized by bias. Therefore bias is also a built-in byproduct of human-designed systems. These AI biases can echo problematic perceptions, such as the perceived superiority of certain groups. Even well-designed AI systems can still end up with a bias, entirely by accident.

The question is: can we prevent AI from being racist and sexist? And if we can, then could machines help us create a fairer society?

People and biases

Let’s start from the beginning: before the machines were biased, people were. Why is that? According to the Cambridge Dictionary, bias is "the action of supporting or opposing a particular person or thing in an unfair way, because of allowing personal opinions to influence your judgment."

Biases can be innate or learned. People may develop biases for or against an individual, a group, or a belief. It does not have to be limited to ethnicity and race; it is also gender, religion, sexual orientation, and many other characteristics that are subject to bias.

Are all biases conscious?

There are two types of bias: conscious (also known as explicit bias) and unconscious (also known as implicit bias). The conscious bias refers to the attitude we have on a conscious level, and most of the time, it arises as to the direct result of a perceived threat.

The unconscious bias is a social stereotype about certain groups of people that a person form outside their conscious awareness. It is automatic, unintentional, deeply ingrained, and able to influence behavior. Unconscious bias is more prevalent than the conscious one.

According to the American Psychological Association, only a small proportion of Americans today are explicitly racist and feel hatred towards other ethnicities and races.

But the majority of Americans, because they have grown up in a culture that has been historically racist in many ways and because they’re exposed to the media that are biased, associate violence, drugs, and poverty with specific groups.

shutterstock img
Source: Hyejin Kang, Shutterstock

People will always be biased to some extent because their opinions are subjective and what is worse - humans tend to generalize. This is partly the fault of the way we are programmed, and partly the failure of the way we programmed our society and culture. Does it mean machines have to be programmed like this as well?

Want to build a racist AI solution? Just hand it a newspaper

Obviously, AI solutions have no political agenda of their own, right? It’s not going to be intentionally racist unless it has explicitly been trained to be. It is also not (most of the time, at least) a political agenda of their creators that is the issue. The problem is that it is very easy to train machines to be racist by accident and without even trying.

Here are a few examples of how algorithms can discriminate on different fields based on race:

    • Image Recognition: Researchers from Georgia Institute of Technology tested eight image-recognition systems used in self-driving cars, after observing higher error rates for specific demographics. They found their accuracy proving five percent less accurate on average for people with darker skin. So a self-driving car is more likely to run over a black person than it is to run a white one.
    • Healthcare: An algorithm used in US hospitals to allocate healthcare for patients has been systematically discriminating against black people. A study concluded that the algorithm was less likely to refer black people than white people who were equally sick to programs that aim to improve care for patients with complex medical needs.
AI solutions
Source: Hyejin Kang, Shutterstock
  • Criminal Cases: In 2016, ProPublica published an investigation on an ML program that is used by courts to predict future criminals, and they found out that the system is biased against black people. The program learned about who is mostly to end up in jail from incarceration data. And historically, the criminal justice system has been unfair to black Americans. So when an AI program was fed with these historical data, it learned from biased decisions historically made by humans.
  • Natural Language Processing (NLP): There is a broad spectrum of cases (for example, work/college admissions or even loan applications), where words can serve as an input - the so-called word embeddings represent words as inputs to machine learning. But there is a fairness problem when an algorithm learns the meaning of words from humans. Our opinions are often subjective and biased, so the meaning of the words (e.g., people names) is then biased. A paper in Science from 2017 found out that when a computer teaches itself English by crawling the internet, it becomes prejudiced against black Americans and women. For example, when the GloVe news dataset is used, the sentiment (how positive a given sentence is) of simple text starting with ‘My name is...’ is significantly lower when it ends with common names of black people than common names for white people.
pasted image 0 2
This is just a glimpse of how sentiment works for different sentences. A machine understands the meaning of these words due to word embeddings.
unnamed karolina
pasted image 0 3
Sentiment values are more positive for stereotypically-white names, and more negative for stereotypically-black names. These examples show how word-embeddings can accidentally learn racist bias from us just by reading internet news.

How can we fight it?

Bias-free AI can soon be a powerful tool to tackle social issues - such as enhancing social mobility through fairer access to the financing/healthcare system, mitigating exclusion and poverty through making the judiciary systems more objective, bias-free testing in university admissions, and much more. We should expect fair AI solutions from both technology companies and authorities.

Military and tactics books say that weapons or tactics should always take into consideration the enemy strategy. That is also true for this particular fairness and human dignity fight. In the case of this war, there are also different ways to fight racial bias, depending on the domain (field) and the data used by the algorithm.

Image Recognition

In the case of image-recognition systems, the reason for bias is that training data that machines use for learning contain mostly samples gathered from white people. These AI solutions can have problems with recognizing people of different races because it simply did not see enough pictures of them. The effect is doubled when it comes to women of color.

Face recognition systems from the leading companies failed during the man/women classification of Oprah Winfrey, Michelle Obama, and Serena Williams.

The biggest problem here is that face-recognition systems are widely used by law enforcement and border control. If the application has high false-positive rates for a specific group (and according to NIST study, a majority of the face recognition algorithms used in the US had worse perform on nonwhite faces), it puts this population at the highest risk for being falsely accused of a crime.


The solution (our weapon) for this problem is quite simple. Still, it requires more attention during dataset preparation: the equal representation of people of color and gender in training datasets (e.g., face recognition) is crucial for algorithms (like face recognition ones) to work with the same precision regardless of race or gender.

Moreover, from the sociological point of view, those bias problems would be hard to overlook and natural to point out if minority representatives would be more encouraged to be a part of the artificial intelligence team.

Biases in AI solutions for Healthcare

The reason millions of black people were affected by racial bias in health-care algorithms? It was based on the historical cost of health care, where higher health-care costs are associated with greater needs. People who did have higher-cost treatment were assumed to need more extensive care, which seemed just about right - or did it?
The biggest problem with this approach was the fact that those less wealthy simply couldn’t afford more extensive treatment, so they chose less expensive options, while their actual needs remain the same as people in the same condition who could opt in the more expensive ones.

The approximation of the healthcare needs by the amount of money spent on treatment was actually an exclusive approach, biased towards more wealthy people. Finding other variables than the cost of treatment to estimate a person's medical needs reduced bias by 84%.

College Admissions

The problem here is often not testing actual skills but rather candidate background (e.g., childhood environment). If you have two candidates and one comes from the wealthy neighborhood and the other from a poor one, there is a possibility that the second one will score lower because she (or he) had the lower probability of gathering the certain knowledge assumed in the question, but it does not mean that she (or he) lacks needed skills.

people woman coffee meeting
Source: Pexels

The solution here is building classifiers powered by NLP which role is early detection of biased questions based on the past applicants’ results. It would allow the preparation of a fair and inclusive set of questions during the admission/recruitment process for testing actual candidates’ skills, not their environmental background.

What exactly makes NLP systems biased?

The word embeddings technique is one of the most popular of Natural Language Processing methods and one of the most powerful ones.

It transforms actual words into mathematical vectors, which are later used as features in many predictive algorithms. It can be used for the analysis of job or loan applications to speed up these processes (and make them less subjective).

The problem is, these word embeddings can be easily biased because of the features of the actual human-generated internet data, from which they learn. Humans’ opinions are often subjective and biased, so the meaning of the words is then biased too.

From the perspective of the word embeddings algorithm, the easiest approach would be to gently ask humans to stop being racists and start producing less exclusive content. But even then, it would be hard to fight some indirect frequent words co-occurrences including historical, language ones, and even benevolent stereotypical ones.

For example, the association of female gender with any word, even a subjectively positive one such as attractive, can cause discrimination against women. The reason is, it can reduce their association with other terms, such as professional.

The actual ML approach to this problem seeks debiasing of words meaning (word embeddings) rather in skin-of-color neutralization (or softening) so that gender-neutral and race-neutral words are semantically equidistant to all human races (similarly as debiasing of gender described in this article).

pasted image 0 4
Visible bias (negative sentiment) against black people's names when using GloVe word embeddings.

A significant difference can be observed between well-established popular word embeddings and the debiased ones when comparing similar texts where the name is changed to provide ethnical connotations. The figure above presents visible bias against people of color, while the usage of debiased word embeddings is presented below.

pasted image 0 5
Weakening of race bias against people's names when using debiased ConceptNet embeddings.


Not only should AI algorithms be fed with inclusive AI datasets where all races are equally represented to avoid discrimination, but also variables should be carefully chosen to avoid any potential bias against people of color.

Selecting less biased representations of our world (like word embeddings) should be incorporated as a standard in all solutions to serve fairness, not only in the NLP field. We should make sure that we create algorithms that serve everyone, not only a small part of society.

Verifying against exclusivity should be one of the most critical steps during the creation of machine learning software for any AI team to take. It is also crucial to focus on building awareness inside technology teams, which are creating these life-changing solutions.

Let's make the future available for all of us.


Word2Vec [1] is a technique for creating vectors of word representations to capture the syntax and semantics of words. The vectors used to represent the words have several interesting features.

Here are a few:

  • Addition and subtraction of vectors show how word semantics are captured: e.g. king - man + woman = queen. This example captures the fact that the semantics of king and queen are nicely captured by the word vectors
  • Similar words have similar word vectors: E.g. king is most similar to queen, duke, duchess

Here is the description of Gensim Word2Vec, and a few blogs that describe how to use it:

One of the issues of the Word2Vec algorithm is that it is not able to add more words to vocabulary after an initial training. This approach to 'freeze vocabulary' might not work for several situations where we need to train the model in an online manner, by adding and training on new words as they are encountered. Here is a quick description of an online algorithm.

In this post, I will discuss an online word2vec implementation that I have developed and how to use it to update the vocabulary and learn new word vectors in an online manner. I maintain the code here.

How to use online word2vec

  1. Download the source code.
  2. On your local machine, browse to the location of the downloaded code, and install it by typing:
# clean already existing install 
sudo rm -rf build dist gensim/*.pyc  
# installation 
sudo python install
  1. Now run the following lines of code from iPython or a separate Python file:
import gensim.models

# setup logging
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# train the basic model with text8-rest, which is all the sentences
# without the word - queen
model = gensim.models.Word2Vec()
sentences = gensim.models.word2vec.LineSentence("text8-rest")

# Evaluation
> model.n_similarity(["king"], ["duke"])
> 0.68208604377750204
> model.n_similarity(["king"], ["queen"])
> KeyError: 'queen'

# text8-rest:
> model.accuracy("questions-words.txt")

2015-08-21 10:56:49,781 : INFO : precomputing L2-norms of word weight vectors
2015-08-21 10:56:56,346 : INFO : capital-common-countries: 33.2% (168/506)
2015-08-21 10:57:12,728 : INFO : capital-world: 17.5% (254/1452)
2015-08-21 10:57:15,807 : INFO : currency: 6.0% (16/268)
2015-08-21 10:57:32,402 : INFO : city-in-state: 15.0% (235/1571)
2015-08-21 10:57:35,197 : INFO : family: 50.0% (136/272)
2015-08-21 10:57:43,378 : INFO : gram1-adjective-to-adverb: 6.0% (45/756)
2015-08-21 10:57:46,406 : INFO : gram2-opposite: 12.4% (38/306)
2015-08-21 10:57:59,972 : INFO : gram3-comparative: 34.6% (436/1260)
2015-08-21 10:58:05,865 : INFO : gram4-superlative: 13.2% (67/506)
2015-08-21 10:58:17,331 : INFO : gram5-present-participle: 18.2% (181/992)
2015-08-21 10:58:31,446 : INFO : gram6-nationality-adjective: 37.5% (514/1371)
2015-08-21 10:58:45,533 : INFO : gram7-past-tense: 18.3% (244/1332)
2015-08-21 10:58:55,660 : INFO : gram8-plural: 30.2% (300/992)
2015-08-21 10:59:02,508 : INFO : gram9-plural-verbs: 20.3% (132/650)
2015-08-21 10:59:02,509 : INFO : total: 22.6% (2766/12234)

OK. So far so good.

You will notice that I did some more evaluation on this data, by testing it against the same dataset that Google released, to compute the sysntactic and semantic relationships between words. As text8 is a small dataset, we don't expect it to achieve very high levels of accuracy on this task, however, it will help us discern the difference in learning words in an online manner, vs learning it all in one sitting. You can download the script that I ran from here.

Now lets update the model with all the sentences containing queen and see if the vector for queen is similar to that of king and duke. Notice that the build_vocab function now has an additional argument update=True that add more words to the existing vocabulary.

sentences2 = gensim.models.word2vec.LineSentence("text8-queen")
model.build_vocab(sentences2, update=True)

# Evaluation
> model.n_similarity(["king"], ["duke"])
> 0.47693305301957223
> model.n_similarity(["king"], ["queen"])
> 0.68197327708244115

# text8-rest + text8-queen (using update model)
> model.accuracy("questions-words.txt")
2015-08-21 11:00:42,571 : INFO : precomputing L2-norms of word weight vectors
2015-08-21 11:00:47,892 : INFO : capital-common-countries: 23.3% (118/506)
2015-08-21 11:01:02,583 : INFO : capital-world: 14.1% (205/1452)
2015-08-21 11:01:05,521 : INFO : currency: 4.5% (12/268)
2015-08-21 11:01:21,348 : INFO : city-in-state: 13.2% (208/1571)
2015-08-21 11:01:24,349 : INFO : family: 46.4% (142/306)
2015-08-21 11:01:31,891 : INFO : gram1-adjective-to-adverb: 6.2% (47/756)
2015-08-21 11:01:34,925 : INFO : gram2-opposite: 13.4% (41/306)
2015-08-21 11:01:47,631 : INFO : gram3-comparative: 32.4% (408/1260)
2015-08-21 11:01:52,768 : INFO : gram4-superlative: 11.7% (59/506)
2015-08-21 11:02:02,831 : INFO : gram5-present-participle: 18.0% (179/992)
2015-08-21 11:02:16,823 : INFO : gram6-nationality-adjective: 35.2% (483/1371)
2015-08-21 11:02:31,937 : INFO : gram7-past-tense: 17.1% (228/1332)
2015-08-21 11:02:42,960 : INFO : gram8-plural: 26.8% (266/992)
2015-08-21 11:02:49,822 : INFO : gram9-plural-verbs: 19.2% (125/650)
2015-08-21 11:02:49,823 : INFO : total: 20.5% (2521/12268)

Bingo! Looks like it learned the weights of the vector queen quite well.

NOTE: text8-rest, text8-queen, text8-all can be downloaded here.

Here is how the files are divided: All sentences from text8 that have queen in them are in
text8-queen, and the remaining sentences are in text8-rest. The file text8-all, is a
concatenation of text8-rest and text8-queen.

Here are the output accuracies that were achieve if we were to train the entire model in one go, as opposed to piecemeal in an online manner. Note that as the amount of data we are using is very little, the accuracy will vary a little due to the initialization parameters.

sentences = gensim.models.word2vec.LineSentence("text8-all")

# text8-all
> model.accuracy("questions-words.txt")

2015-08-21 11:07:53,811 : INFO : precomputing L2-norms of word weight vectors
2015-08-21 11:07:58,595 : INFO : capital-common-countries: 36.0% (182/506)
2015-08-21 11:08:12,343 : INFO : capital-world: 18.9% (275/1452)
2015-08-21 11:08:14,757 : INFO : currency: 4.9% (13/268)
2015-08-21 11:08:28,813 : INFO : city-in-state: 16.4% (257/1571)
2015-08-21 11:08:31,542 : INFO : family: 48.4% (148/306)
2015-08-21 11:08:38,486 : INFO : gram1-adjective-to-adverb: 6.9% (52/756)
2015-08-21 11:08:41,268 : INFO : gram2-opposite: 16.7% (51/306)
2015-08-21 11:08:52,507 : INFO : gram3-comparative: 34.4% (434/1260)
2015-08-21 11:08:57,148 : INFO : gram4-superlative: 12.8% (65/506)
2015-08-21 11:09:06,475 : INFO : gram5-present-participle: 19.1% (189/992)
2015-08-21 11:09:18,681 : INFO : gram6-nationality-adjective: 40.0% (548/1371)
2015-08-21 11:09:30,722 : INFO : gram7-past-tense: 18.2% (243/1332)
2015-08-21 11:09:39,516 : INFO : gram8-plural: 32.7% (324/992)
2015-08-21 11:09:45,498 : INFO : gram9-plural-verbs: 17.1% (111/650)
2015-08-21 11:09:45,499 : INFO : total: 23.6% (2892/12268)

As you can see, the output score does drop a little, when the model is updated in an online manner, as opposed to training everything in one go. See the PR for my code.


[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space

Have you ever had to deal with a lot of data, and don't know where to start? If yes, then this post is for you. In this post I will try to guide you through some basic approaches and operations you can perform to analyze your data, make some basic sense of it, and decide on your approach for deeper analysis of it. I will use Python and a small subset of data from the Kaggle Bikesharing Challenge to illustrate my examples. The code for this work can be found at this location. Please take a minute to download Python and the sample data before we proceed.

Dataset Description

The data provided is a CSV file - bikesharing.csv, with 5 columns - datetime, season, holiday, workingday, and count.

  • datetime: The date and time when the statistics were captured
  • season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • holiday: whether the day is considered a holiday
  • workingday: whether the day is neither a weekend nor holiday
  • count: the total number of bikes rented on that day

Min, Max and Mean

One of the first analyses one can do with their data, is to find the minimum, maximum and the mean. The mean (or average) number of bikes per day rented in this case is the sum of all bikes rented per day divided by the total number of days:

$$ \bar{f} = \frac{\sum_{i=1}^{n} b_i}{n} $$

where $ \bar{f} $ is the mean, $ b_i $ is the number of bikes rented on day $i$ and $n$ are the total number of days. We can compute these in python using the following code:


from datetime import datetime
from collections import defaultdict
import csv
import numpy

data = defaultdict(int)
prevdate = ""
with open("bikesharing.csv", "r") as fin:
    csvreader = csv.reader(fin)
    for row in csvreader:
        date_object = datetime.strptime(row[0].split(" ")[0], '%m/%d/%y')
        currdate = str(date_object.month) + "-" + str(
        # Computing the total number of bikes rented in a day
        data[currdate] += int(row[4])

totals = []
for key, value in data.iteritems():



2752 #min
13446 #max
9146.82 #mean

In this case, the mean is 9146.82. It looks like there are several large values in the data, because the mean is closer to the max than to the min. Maybe the data will provive more insight if we compute the min, max and mean per day, grouped by another factor, like season or weather. Here is some code to compute mean per day grouped by the season:


from datetime import datetime
import csv
import numpy

data = {}

with open("bikesharing.csv", "r") as fin:
    csvreader = csv.reader(fin)
    for row in csvreader:
        date_object = datetime.strptime(row[0].split(" ")[0], '%m/%d/%y')
        currdate = str(date_object.month) + "-" + str(
        if row[1] in data:
            if currdate in data[row[1]]:
                data[row[1]][currdate] += int(row[4])
                data[row[1]].update({ currdate : int(row[4])})
            data[row[1]] = { currdate : int(row[4]) }

for key, value in data.iteritems():
    print key
    totals = []
    for k, val in value.iteritems():
    print numpy.min(totals)
    print numpy.max(totals)
    print numpy.mean(totals)


1 #season
2752 #min
10580 #max
5482.42105263 #mean




As you can see, the mean varies significantly with the season. It intuitively makes sense because we would expect more people to ride a bike in the summer as compared to the winter, which means a higher mean in the summer than the winter, this also means higher min and max values in the summer than the winter. This data also helps us intuitive guess that season 1, is most likely winter, and season 3 is most likely summer.

Variability in Data

The next thing we would like to know, is the variability of the data provided. It is good to know if the data is skewed in a particular direction, or how varied it is. If the data is highly variable, it is hard to determine if the mean changes with different samples of data. Reducing variability is a common goal of designed experiments, and this can be done by finding subsets of data that have low variablity such that samples from each of the subsets produce similar mean value. We already did a little bit of that in the second example above.

There are 2 ways of measuring variability: variance and standard deviation.


Variance is defined as the average of the squared differences from the mean. In most experiments, we take a random sample from a population. In this case, we will compute the population variance, which uses all possible data provided. Population variance can be computed as:

$$ \sigma^2 = \frac{\sum_{i=1}^{n} (f_i - \bar{x})^2}{N} $$

If you needed to compute the sample variance, you can use the following formula:

$$ variance = \frac{\sum_{i=1}^{n} (f_i - \bar{x})^2}{N-1} $$

where $x_i $ is each instance, $\bar{x}$ is the mean, and $N$ is the total number of features. Dividing by $ n - 1 $ gives a better estimate of the population standard deviation for the larger parent population than dividing by $n$, which gives a result which is correct for the sample only. This is known as Bessel's correction.

In our case we will compute population variance using most of the same code as that above, except adding the following line to it:

print numpy.var(totals)


1 #season
2752 #min
10580 #max
5482.42105263 #mean
2812044.87535 #variance




Standard Deviation

Variance by itself is not particularly insightful, as its units are feature squared and it is not possible to plot it on a graph and compare it with the min, max and mean values. The square root of variance is the standard deviation, and it is a much more insightful metric.

The population standard deviation, $\sigma$, is the square root of the variance, $\sigma^2$. In python you can compute variance by adding the following line to the above code:


print numpy.var(totals)


1 #season
2752 #min
10580 #max
5482.42105263 #mean
1676.91528568 #standard deviation




A standard deviation close to 0 indicates that the data points tend to be very close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. In our case, the data is very spread out. Three standard deviations from the mean account for 99.7% of the sample population being studied, assuming the distribution is normal (bell-shaped).

Standard Error of the Mean (SEM)

So far post, we have computed the population mean. However, if one has to compute the sample mean, it is useful to know how accurate this value is in estimating the population mean. SEM is the error in estimating $\mu$.

$$\textbf{SEM} = \frac{\sigma}{\sqrt{N}}$$

however, as we often are unable to compute the population standard deviation, we will use teh sample standard deviation instead:

$$\textbf{SEM} = \frac{s}{\sqrt{N}}$$

The mean of any given sample is an estimate of the population mean number of features. Two aspects of the population and the sample could affect the variability of the mean number of features of those samples.

  • If the population of number of features has very small standard deviation, then the samples from that population will have small sample standard deviation the sample means will be close to the population mean and we will have a small standard error of the mean,
  • If the population of number of features has a large standard deviation, then the samples from that population will have large sample standard deviation the sample means may be far from the population mean and we will have a large standard error of the mean.

So large population variability causes a large standard error of the mean. The estimate of the population mean using 2 observations is less reliable than the estimate using 20 observations, and much less reliable than the estimate of the mean using 100 observations. As $N$ gets bigger, we expect our error in estimating the population mean to get smaller.

If you enjoyed reading this article about basic statistics, please take a second to share it with others!


In this world things keep happening around us. Each event occurring is a Random Variable. A Random Variable is an event, like elections, snow or hail. Random variables have an outcome attached them - the value of which is between 0 and 1. This is the likelihood of that event happening. We hear the outcomes of random variables all the time - There is a 50% chance or precipitation, The Seattle Seahawks have a 90% chance of winning the game


Where do we get these numbers from? From past data.

Year 2008 2009 2010 2011 2012 2013 2014 2015
Rain Rainy Dry Rainy Rainy Rainy Dry Dry Rainy
p(\text{Rain}=\text{Rainy}) = \frac{\sum(\text{Rain}=\text{Rainy})}{\sum(\text{Rain}=\text{Rainy}) + \sum(\text{Rain}=\text{Dry})} = \frac{5}{8}

$$ p(\text{Rain}=\text{Dry}) = \frac{\sum(\text{Rain}=\text{Dry})}{\sum(\text{Rain}=\text{Rainy}) + \sum(\text{Rain}=\text{Dry})} = \frac{3}{8} $$


What is the probability that event A and Event B happening together Consider the following table, with data about the Rain and Sun received by Seattle for the past few years.

Year 2008 2009 2010 2011 2012 2013 2014 2015
Rain Rainy Dry Rainy Rainy Rainy Dry Dry Rainy
Sun Sunny Sunny Sunny Cloudy Cloudy Cloudy Sunny Sunny

Using the above information, can you compute what is the probability that it will be Sunny and Rainy in 2016?

We can get this number easily from the Joint Distribution

Rainy Dry
SUN Sunny 3/8 2/8
Cloudy 2/8 1/8

In 3 out of the 8 examples above, it is Sunny and Rainy at the same time. Similarly, in 1 out of 8 times it is Cloudy and it is Dry. So we can compute the probability of multiple events happening at the same time using the Joint Distribution. If there are more than 2 variables, the table will be of a higher dimension.

We can extend this table further including marginalization. Marginalization is just a fancy word for adding up all the probabilities in each row, and the probabilities in each column respectively.

Rainy Dry Margin
SUN Sunny 0.375 0.25 0.625
Cloudy 0.25 0.125 0.375
Margin 0.625 0.375 1

Why are margins helpful? They remove the effects of one of the two events in the table. So, if we want to know the probability that it will rain (irrespective of other events), we can find it from the marginal table as 0.625. From Table 1, we can confirm this by computing all the individual instances that it rains: $ \frac{5}{8} = 0.625 $


What do we do when one of the outcomes is already given to us? On this new day in 2016, it is very sunny, but what is the probability that it will rain?

$$ P(\text{Rain}=\text{Rainy} \mid \text{Sun}=\text{Sunny}) $$

which is read as - probability that it will rain, given that there is sun.

This is computed in the same way as we compute normal probability, but we will just look at the cases where Sun = Sun from Table 1. There are 5 instances of Sun = Sun in Table 1, and in 3 of those cases Rain = Rain. So the probability of

$$ P(\text{Rain}=\text{Rainy} \mid \text{Sun}=\text{Sunny}) = \frac{3}{5} = 0.6 $$

We can also compute this from Table 3. Total probability of Sun = 0.625 (Row 1 Marginal probability). Probability of Rain and Sun = 0.375

Probability of Rain given Sun = $\frac{0.375}{0.625} = 0.6$


Conditional and Joint probability are often mistaken for each other because of the similarity in their naming convention. So what is the difference between: $ P(AB) $ and $ P(A \mid B) $

The first is Joint Probability and the second is Conditional Probability.

Joint probability computes the probability of 2 events happening together. In the case above - what is the probability that Event $A$ and Event $B$ both happen together? We do not know whether either of these events actually happened, and are computing the probability of both of them happening together.

Conditional probability is similar, but with one difference - We already know that one of the events (e.g. Event $B$) did happen. So we are looking for the probability of Event $A$, when we know the Event $B$ already happened or that the probability of Event $B$ is 1. This is a subtle but a significantly different way of looking at things.


$$P(A B) = P(A) , P(B \mid A)$$

$$ P(B A) = P(B) , P(A \mid B)$$

$$ P(A B) = P(B A) $$

Equating (4) and (5)

$$ P(A \mid B) = \frac{P(B \mid A) , P(A)}{P(B)} $$

This is the Bayes Rule.

Bayes Rule is interesting, and significant, because we can use it to discover the conditional probability of something, using the conditional probability going the other direction. For example: to find the probability $ P(death \mid smoking)$, we can get this unknown from $ P(smoking \mid death) $, which is much easier to collect data for, as it is easier to find out whether the person who died was a smoker or a non smoker.

Lets look at some real examples of probability in action. Consider a prosecutor, who wants to know whether to charge someone with a crime, given the forensic evidence of fingerprints, and town population.

The data we have is the following:

  • One person in a town of 100,000 committed a crime. The probability that is he guilty $ P(G) = 0.00001 $, where $ P(G) $ is the probability of a person being guilty of having committed a crime.
  • The forensics experts tell us, that if someone commits a crime, then they leave behind fingerprints 99% of the time. $ P(F \mid G) = 0.99 $, where $ P(F \mid G) $ is the probability of fingerprints, given crime is commited.
  • There are usually 3 people's fingerprints in any given location. So $ P(F) = 3 \times 0.00001 = 0.00003 $. This is because only 1 in 100,000 people could have their fingerprints.

We need to compute:

$$ P(G \mid F) $$

Using Bayes Rule we know that:

$$ P(G \mid F) = \frac{P(F \mid G) , P(G)}{P(F)} $$

Plugging in the values that we already know:

$$ P(G \mid F) = \frac{0.99 \times 0.00001}{0.00003} $$

$$P(G \mid F) = 0.33$$

This is a good enough probability to get in touch with the suspect, and get his side of the story. However, when the prosecutor talks to the detective, the detective points out that the suspects actually lives at the scrime scene. This makes it highly likely to find the suspect's fingerprints in that location. And the new probability of finding fingerprints becomes : $ P(F) = 0.99 $

Plugging in those values again into (9), we get:

$$ P(G \mid F) = \frac{P(F \mid G) , P(G)}{P(F)} $$

$$ P(G \mid F) = \frac{0.99 \times 0.00001}{0.99} $$

$$ P(G \mid F) = 0.00001 $$

So it completely changes the probability of the suspect being guilty.

This example is interesting because we computed the probability of a $ P(G \mid F) $ using the probability of $ P(F \mid G) $. This is because we have more data from previous solved crimes about how many peple actually leave fingerprints behind, and the correlation of that with them being guilty.

Another motivation for using conditional probability, is that conditional probability in one direction is often less stable that the conditional probability in the other direction. For example, the probability of disease given a symptom $ P(D \mid S) $ is less stable as compared to probability of symptom given disease $ P(S \mid D) $

So, consider a situation where you think that you might have a horrible disease Severenitis. You know that Severenitis is very rare and the probability that someone actually has it is 0.0001. There is a test for it that is reasonably accurate 99%. You go get the test, and it comes back positive. You think, "oh no! I am 99% likely to have the disease". Is this correct? Lets do the Math.

Let $ P(H \leftarrow w) $ be the probability of Health being well, and $ P(H \leftarrow s) $ be the probability of Health being sick. Let and $ P(T \leftarrow p) $ be the probability of the Test being positive and $ P(T \leftarrow n) $ be the probability of the Test being negative.

We know that the probability you have the disease is low $ P(H \leftarrow s) = 0.0001 $. We also know that the test is 99% accurate. What does this mean? It means that if you are sick, then the test will accurately predict it by 99%

$$ P(T \leftarrow n \mid H \leftarrow w) = 0.99 $$

$$ P(T \leftarrow n \mid H \leftarrow s) = 0.01 $$

$$ P(T \leftarrow p \mid H \leftarrow w) = 0.01 $$

$$ P(T \leftarrow p \mid H \leftarrow s) = 0.99 $$

We need to find out the probability that you are sick given that the test is positive or $ P(H \leftarrow s \mid T \leftarrow p) $

Using Bayes Rule:

$$ P(H \leftarrow s \mid T \leftarrow p) = \frac{P(T \leftarrow p \mid H \leftarrow s) , P(H \leftarrow s)}{P(T
\leftarrow p)} $$

We know the numerator, but not the denominator. However, it is easy enough to compute the denominator using some clever math!

We know that the total probability of

$$ P(H \leftarrow s \mid T \leftarrow p) + P(H \leftarrow w \mid T \leftarrow p) = 1 $$

$$ P(H \leftarrow s \mid T \leftarrow p) = \frac{P(T \leftarrow p \mid H \leftarrow s) , P(H \leftarrow s)}{P(T
\leftarrow p)} $$

$$ P(H \leftarrow w \mid T \leftarrow p) = \frac{P(T \leftarrow p \mid H \leftarrow w) , P(H \leftarrow w)}{P(T
\leftarrow p)} $$

Adding (16) and (17), and equating with (15) we get:

$$ \frac{P(T \leftarrow p \mid H \leftarrow s) , P(H \leftarrow s)}{P(T \leftarrow p)} + \frac{P(T \leftarrow p \mid H \leftarrow w) , P(H \leftarrow w)}{P(T \leftarrow p)} = 1 $$


$$ P(T \leftarrow p) = P(T \leftarrow p \mid H \leftarrow s) , P(H \leftarrow s) + P(T \leftarrow p \mid H \leftarrow w) , P(H \leftarrow w) $$

Substituting (7) into (4) we get:

$ P(H \leftarrow s \mid T \leftarrow p) = \frac{P(T \leftarrow p \mid H \leftarrow s) , P(H \leftarrow s)}{P(T \leftarrow p \mid H \leftarrow s) , P(H \leftarrow s) + P(T \leftarrow p \mid H \leftarrow w) , P(H \leftarrow w)} $

$$ P(H \leftarrow s \mid T \leftarrow p) = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} $$

$$ = 0.0098$$

This is the reason why doctors are hesitant to order expensive tests if it is unlikely tht you have the disease. Even though the test is accurate, rare diseases are so rare that the very rarity dominates the accuracy of the test.


When someone applies Naive Bayes to a problem, they are assuming conditional independence of all the events. This means:

$$ P(ABC... \mid Z) = P(A \mid Z) , P(B \mid Z) , P(C \mid Z) ,... $$

When this is plugged into Bayes Rules:

$$ P(A \mid BCD...) = \frac{P(BCD...\mid A ) , P(A)}{P(BCD...)} $$

$$ = \frac{P(B\mid A ) , P(C \mid A) P(D \mid A) ... P(A)}{P(BCD...)} $$

  1. Let $ P(BCD...) = \alpha $ which is the normalization constant. Then,

$$ = \alpha \times P(B\mid A ) , P(C \mid A) P(D \mid A) ... P(A) $$

What we have done here, is assumed that the events $A$, $B$, $C$ etc. are not dependent on each other, thereby reducing a very high dimensional table into several low dimensional tables. If we have 100 features, and each feature can take 2 values, then we would have a table of size $ 2^{100} $. However, assuming independence of events we reduce this to one hundred 4 element tables.

Naive bayes is rarely ever true, but it often works because we are not interested in the right probability, but the fact that the correct class has the highest probability.


  1. A gentle review of Basic Probability

Post by Dr. Rutu Mulkar-Mehta

In this post, I will take you through the steps for calculating the $ tf \times idf $ values for all the words in a given document. To implement this, we use a small dataset (or corpus, as NLPers like to call it) form the Project Gutenberg Catalog. This is just a simple toy example on a very small dataset. In real life we use much larger corpora, and need some more sophisticated tools in order to handle large amounts of data. To brush up on the basic concepts of $ tf \times idf $ you might want to check out my post on the The Math behind Lucene.

For this exercise, we will use the Project Gutenberg Selections which are released as part of NLTK Data. NLTK - Natural Language Toolkit - is a Python-based module for text processing. The corpus - Project Gutenberg Selections - contains 18 files. Each file is a complete book. Our task is to calculate the $ tf \times idf $ of all the words for each of the documents provided. In the end of this exercise we will have 18 documents, with the $ tf \times idf $ of each word in each of the documents.

Documents with $ tf \times idf $ values for each word (or token) are often used (with the vector space model) to compute the similarity between two documents. Such statistics are quite relevant in Information Retrieval, Search Engines, Document Similarity, and so on.

NOTE: All the materials needed for this exercise (code + data) can be downloaded from my GitHub repo.

The code is written in Perl, and is heavy in regular expressions. I have tried my best to document the code, but if you have any issues, or discover bugs, please do not hesitate in contacting me. I will not go into the details of all my code in this post, but will highlight a few features.

This is one of the most important tasks of any NLP application. Discourse often contains non-ASCII characters, no alphanumeric characters, spaces, and so on. The first and most important task is to remove these unwanted characters from text, to clean it up. Here is a set of regular expressions (regex) that is helpful for this dataset. However it is not an exhaustive set of regexes. For instance, I have not used any regex to convert UTF-8 to ASCII.

# remove endline character
# remove extra space characters
$txt =~ s/[\h\v]+/ /g;
# remove caps
$txt =~ tr/[A-Z]/[a-z]/;
# remove non-alphanumeric characters
$txt =~ s/[^a-zA-Z\d\s]//g;

The rest of my code can be downloaded from my GitHub repo. The code is relatively easy to understand if you are familiar with regular expressions, and understand Perl. I have provided my solutions in the output directory, within the folder $ tf \times idf $. The intermediate $ tf $ and $ idf $ results are also provided in the output folder.

It is interesting to note here that words occurring in all the documents have an $ idf $ value of 0. This means that their $ tf \times idf $ value will also be 0, deeming them insignificant in contributing to the document vector. These include words like - a, the, when, if, etc. A longer list can be developed by sorting all the values in the file idf.txt (downloaded from my GitHub repo).

The next step, is to use these metrics to compute document similarity. Discovering similar document in a corpus of documents remains one of the most important problems of information retrieval. This is often done by converting documents into document vectors, and comparing the similarity of these vectors to each other using vector similarity metrics.

This post uses the output generated in the toy example for $ tf \times idf $ which is available in this GitHub repo.

Creating document vectors

After computing the $ tf \times idf $ values for each document in a given corpus, we need to go through the exercise to convert these values into a document vector. If you recall some vector basics, a vector constitutes a magnitude and a direction. In our case, the direction is represented by a word in the document and magnitude is the weight or the $ tf \times idf $ value of the word in the document. In order to simplify our vector calculations we pre-specify the locations of each word in the array representing the document, creating a sparse vector.

For instance consider the vocabulary of the entire corpus to be:
John, Mary, Susan, Kendra, sang, for, with, plays

We will assign a location to each word in the vocabulary as:

John Mary Susan Kendra sang for with dances
0 1 2 3 4 5 6 7

Given this data, for the sentence: John sang for Mary, we create the following boolean document vector:

John Mary Susan Kendra sang for with dances
1 1 0 0 1 1 0 0

Here a 1 represents the existence of a word in the text, and a 0 represents the absence of a word from the text. This example is simply to illustrate the concept of creating a document vector. In our toy example we will replace with 1 values with the $ tf \times idf $ values of a given word in the given document.

Computing the dot Product of two Vectors

John Mary Kendra sang for with dances Sentence
d1 0 0.27 0 0 0 0.1 0.1
d2 0 0 0.1 0.27 0.27 0 0
d3 0 0 0.1 0 0 0.1 0.1

Dot product between $d_1$ and $d_2$:

$$ d_1 \cdot d_2 = d_1.John \times d_2.John + d_1.Mary \times d_2.Mary + ...$$

The final similarity scores are:

doc3.txt doc2.txt 0.178554901188263
doc3.txt doc1.txt 0.377800203993898

doc2 similarity with doc1 and doc3:

doc2.txt doc3.txt 0.178554901188263
doc2.txt doc1.txt 0

doc1 similarity with doc2 and doc3:

doc1.txt doc3.txt 0.377800203993898
doc1.txt doc2.txt 0

Notice that document 1 is most similar to document 3. Document 2 and document 1 have absolutely no similarities. This is because both these documents have just one word in common - John, which is common to all the documents, so has a weight of 0.

Happy Coding!

Lucene is an open source search engine, that one can use on top of custom data and create your own search engine - like your own personal Google! In this post, we will go over the basic math behind Lucene, and how it ranks documents to the input search query.


The analysis of language often brings us in situations where we are required to determine the weight or importance of a given word in a document, to determine the relative importance or similarity of a document to another document. In situations such as this, the first step to remove stop words which are basically words that dont contribute to the general focus of a given article. Most common stop words are words like - a, when, who, what. The list of stop words keeps changing based on the domain of discourse. For instance, in a corpus of articles about the human heart, the word heart could potentially be a stop word due to the sheer frequency in which it is mentioned. It is always a good idea to remove stop words in a given text before processing it. However, once these stop words are removed, one still faces the task of determining the relative importance or weights of the remaining words in the document - let's start talking about TF*IDF.

Observing the words in a document, an intuitive way to discover the importance of a given word is to count the frequency of the word in the document. This is the Term Frequency or the $ tf $ of the word in the given document. $ tf $ is often normalized so as to not introduce a bias because of the document size. A normalized $ tf $ for a given word in a given document is calculated as:

tf_{w,d} = \frac{C_{w,d}}{\sum C_{w,d}}

where, $ C_{w,d} $ is the word count of word $ w $ in document $ d $ and $ \sum C_{w,d} $ is the total count of all words $ w $ in document $ d $.

However, $ tf $ by itself is not enough to capture the importance of a word in a document, as there are numerous words like therefore, however, that have a very high frequency but are not stop words, and carry very little importance. So we need to complement this with the Document Frequency of each word - or $ df $. Document Frequency is the total number of documents a given word occurs in. If a word occurs in a large number of documents, it has a high $ df $. When this $ df $ is used to scale the weight of a word in the document, it is called $ idf $ or Inverse Document Frequency of that given word.

If $ N $ is the total number of documents in a given corpus, then the $ IDF $ of a given word in the corpus is:

$$ idf_w = log \frac{N}{df_w} $$

Notice that $ tf $ is calculated for a given word in a given document, $ IDF $ is calculated for a given word over all the documents.

Once we have the $ tf $ and $ idf $, we can calculate the $ tf \times idf $ score for each word to determine the weight of a word in the given document.

The score $ tf_w \times idf_{w,d} $ assigns each word $ w $ a weight in document. Here are some insights about it.

  • $ tf_w \times idf_{w,d} $ is highest when $ w $ occurs many times within a small number of documents. This means that the word is significant, and can help establish relevance within the small number of documents.
  • $ tf_w \times idf_{w,d} $ is lower when the word occurs fewer times in a document, or occurs in many documents. If a word occurs infrequently, it might not be of significance. Similarly if a word occurs in a very large number of documents, it probably will not help in discriminating between the document.
  • $ tf_w \times idf_{w,d} $ is lowest when the term occurs in virtually all documents. This covers words like a, an the, when etc. which span a large number of documents and have no contribution towards the semantic composition of the document.


Document vectors are created by computing the relative weights of each word in the document. One way to accomplish this is by computing the $tf \times idf$ values of each of the words in the document. When we compute the $tf \times idf$ of each of the words in a document, we end up with documents with a list of features (words) with their values (weights). In a sense, this represents the document vector. The representation of documents as vectors in a common vector space is known as a vector space model, and is the basis of a large number of information retrieval tasks.

A standard way of computing the document similarity is to compute the cosine similarity of the vector representations of the documents. If $ d_1 $ and $ d_2 $ are two documents, and $ V(d_1) $ and $ V(d_2) $ are the vector representations of them respectively, then the similarity of $ d_1 $ and $ d_2 $ can be measured as the cosine of the angle between $ V(d_1) $ and $ V(d_2) $

$$ \textbf{sim}(d_1, d_2) = \frac{V(d_1) \cdot V(d_2)}{\mid V(d_1) \mid \mid V(d_2) \mid} $$

In this equation, the numerator is the dot product of vectors $ V(d1) $ and $ V(d2) $, and the denominator is the product of the Euclidean length. Euclidean Length is the sum of squares of the magnitude of each element of the vector.

If $ V(d_1) $ and $ V(d_2) $ are the following,

$$ V(d_1) = [a_1, a_2, a_3, a_4, ...] $$

$$ V(d_2) = [b_1, b_2, b_3, b_4, ...] $$

The dot product of $ V(d_1) $ and $ V(d_2) $ is:

$$ V(d_1) \cdot V(d_2) = a_1b_1 + a_2b_2 + a_3b_3 + a_4b_4 + ... $$

The Euclidean length of $ V(d_1) $:

$$ \sqrt{a_1^2 + a_2^2 + a_3^2 + a_4^2 + ... } $$

Similarly, the Euclidean length of $ V(d_2) $:

$$ \sqrt{b_1^2 + b_2^2 + b_3^2 + b_4^2 + ... } $$

Applying (6), (7) and (8) to (3) we get:

$$ sim(d_1, d_2) = \frac{(a_1b_1 + a_2b_2 + a_3b_3 + a_4b_4 + ...)}{\sqrt{a_1^2 + a_2^2 + a_3^2 + a_4^2 + ... } \sqrt{b_1^2 + b_2^2 + b_3^2 + b_4^2 + ... }} $$

Ok, we have cosine similarity. Now what? What cosine similarity tells us, is how similar 2 documents are. If the documents are very similar, we have the similarity score closer to 1, but if the documents are completely different, the similarity score is closer to -1.

This is the heart of the scoring mechanism that Lucene uses to retrieve similar documents given a search query. Although Lucene does use a couple of other (mostly user defined) constants to fine tune the results, $ tf \times idf $ is the heart of how Lucene operates.


  1. Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, Chapter on Scoring, term weighting and the vector space model
  2. fuzzy matching a simple trick

    Ever wondered how Google knows what you mean even though you make spelling mistakes in each word of your query? In this short post, we would like to discuss a very simple but efficient method of fuzzy matching. It allows you to find the non-exact matches to your target phrase; not synonyms but rather phrases or terms that are constructed in a slightly different way than expected. For example, it would not find the word interesting for the query fascinating, but it would find United States of America for United States. For more information about the idea behind fuzzy matching, please refer to

    Fuzzy Matching - Introduction

    Exact searching is not an option. Although sometimes we can be lucky enough to have it in our database, more often we aren't. We have to use fuzzy matching! So suppose that we have the following database of terms or phrases as in the figure below.

    fuzzy matching 3
    Figure 1. Query and matching texts

    We need a way to somehow compare records from our database with a given query.
    $$ d(\textbf{some text}, \textbf{query text}) = \textbf{matching score} $$
    There are quite a lot of techniques that are useful for measuring the distance between strings. One of the most popular and for many applications the best one is a function called Levenshtein distance. Its score is a natural number beginning from 0. Its interpretation is a number of edits needed to transform one string into another. By edit we mean a set of operations, e.g. insertion or substitution. We don't want to spend much time on Levenshtein itself here. There are a lot of sources that explain it. One of them is here

    Having decided on the function comparing strings, we still need to figure out when to use it. Theoretically, we could use it for every term in our database, but it would be very time-consuming. If the Levenshtein computation is done N times, and N is quite big, it can be too big from a practical point of view. Let's introduce a simple idea and check if it works. Select a small number of phrases that potentially are good candidates for a match and on this small subset of candidates compute Levenshtein distance. So the procedure would consist of two steps:

    • Select a small number of candidates, let's say n of them, where $n \ll N$
    • Compute Levenshtein distance between each candidate and a query string

    The second step is obvious. After having done it, we can return the string whose Levenshtein distance turned out to be the smallest one. The first step requires an explanation. In order to find candidates, we begin with vectorizing our phrases. We count the number of specific characters and then normalize it. The sum of all components is equal to 1. So is vectorized the query phrase.

    fuzzy matching 1

    When comparing two vectors with each other euclidean distance is a quite popular choice. Let's pick it. We'll select candidates based on the distance between query string vector representation and vector representations from our database. As a quick recap, euclidean distance (to be more precise taken to the power of 2) is computed as below:

    $$ ||x - y||^{2} = \sum{i=1}^{N}(x{i} - y_{i})^{2} $$

    Let's stop here for the moment and think if it makes sense for our application. We'll look at a couple of cases just to have a better feeling.

    • query string is equal to ac, then distances are 0, $\frac{1}{8}$ and $\frac{1}{6}$. - winning phrase ac;
    • query string is equal to c, the distances are $\frac{1}{4}$, $\frac{7}{8}$, $\frac{2}{3}$. - winning phrase ac
    • query string is equal to abca!!!!, the distances are $\frac{15}{32}$, $\frac{11}{32}$, $\frac{3}{32}$. - winning phrase ac!

    What can we say about the above results? The first two of them would be chosen if we used Levenshtein distance directly, which is good. But the third case should pick abca according to Levenshtein distance. It turns out that the bigger the string is (in the database or the query), the less weight can have on different characters, and this can have an impact on the final result. Let's add two comments.

    • In the above examples, we haven't considered the case of a palindrome. A phrase ac and ca are presumably entirely different phrases, but with our approximation, its distance is equal to 0. The problem is more general. Simply speaking, we ignore the order of characters.
    • To have more precise approximation and also to mediate the effect of ignoring the order we should not only take single characters into account but also n-grams (like 2, 3, or 4) when vectorizing.

    The vectorization that works is only slightly more involved, but it pays off.

    1. Use the “bag of words” vectorizer for database strings and the query string and normalize the resulting vectors so that the sum of their components is equal to 1.
    2. Compute the norms for all the vectors.
    3. Compute dot product between all vectors and the query vector and divide it by the multiplication of its norms.

    The last two steps are just cosine similarity, but we described this way so that we can have a mechanical description of calculations that must be done. Besides, we can compute norms just once (for database strings) and then use them without the need for recalculation every time a new query comes.

    The code will speak by itself and if anything has been unclear so far you the code will help.


    import numpy as np
    import scipy.sparse as sparse
    from sklearn.feature_extraction.text import CountVectorizer
    strings = ['ac', 'abca', 'ac!']
    query = 'abca!!!!'
    def create_sparce_from_diagonal(diag: np.ndarray) -> sparse.csr_matrix:
        """Creates sparse matrix with provided elements on the diagonal."""
        n = len(diag)
        return sparse.csr_matrix((diag, (list(range(n)), list(range(n)))), shape=(n, n))
    def normalize_sparse_matrix(matrix: sparse.csr_matrix) -> sparse.csr_matrix:
        """Normalizes each row so that the sum of components is equal to one."""
        sums_for_strings = matrix.sum(axis=1).A.flatten()
        normalization_matrix = create_sparce_from_diagonal(1 / (sums_for_strings + 0.1))
    def compute_norms(matrix: sparse.csr_matrix) -> np.ndarray:
        """Computes norms for each row."""
        return np.sqrt(matrix.multiply(matrix).sum(axis=1).A).flatten()
    vectorizer = CountVectorizer(
        ngram_range=(1, 4),
    X = vectorizer.fit_transform(strings)
    X = normalize_sparse_matrix(X) # step 1
    norms = compute_norms(X) # step 2
    query_vector = vectorizer.transform([query])
    query_vector = normalize_sparse_matrix(query_vector)
    query_norm = compute_norms(query_vector)
    # step 3
    similarities =
    similarities = similarities / (norms.reshape((-1, 1)) * query_norm.reshape((1, -1)))
    best_string = strings[np.argmax(similarities.flatten())]


    In this post we discussed fuzzy matching and the popular technique called the Levenshtein distance. If you would like to know more, or need help with understanding this better, feel free to contact us.

    The recommender systems have a quite long history. The problem of deciding what content should be presented to a user is important to many companies. It can bring more money and what's important in the long run, it makes the experience of users more delightful. It is also one of the biggest motivators for the growth of machine learning. It started with a very famous competition run by Netflix that offered a million dollars to a team that built the best recommender system. It was in 2009, over a decade has passed since this competition. Needless to say that during this time a lot of different solutions have been proposed. Nowadays, Facebook, Spotify, and an endless list of other companies use their proprietary algorithms to enhance their products. There are a lot of posts on the web that address recommender systems very well. They usually start from the problem definition and then move to the collaborative filtering and matrix factorization - techniques that seem to be an integral part of the toolbox. Here we won't have time to discuss every single detail of these techniques. We'll provide essential information and focus more on intuition and will move to different approaches. We have to say that recommender systems apart from being treated as a distinct kind of problem can be (depending on the application) understood as classification, regression, or ranking problem. A lot of factors can influence how you should look at the problem. But there is one distinct feature of the recommendation problem - sparsity of data. It is something they have in common with NLP applications. In the second part of the post how we will make use of it.

    In this post, we'll start by sketching the problem. Then we'll move to explain shortly collaborative filtering and matrix factorization. Finally, we'll talk about Factorization Machine, which although introduced about a decade ago has inspired a lot of new inventions and it is useful to know.

    Basic definitions

    Let's start from the basics. Historically, recommender systems were motivated by movie recommendations.

    So here for the sake of argument, we will talk about users and movies but it can be generalized to any recommendation application. Imagine that we have a database for users, a database for movies, and a database for rankings. We will present the first two tables later when they are needed. Now, let's focus on the ratings matrix. Suppose ratings vary from 1 to 5. As you can see matrix $R$ is filled only partially. Not all users have watched and rated all the movies. Our task is to fill these missing values with reasonable guesses.

    oCh fq5a4Pcy9OBERzvisBgwyDq5

    We have $N$ users and $M$ movies. The dimension of this matrix is $N \times M$. It is sparse because most users can only watch a few films. The problem is to figure out what movie to recommend to a user that hasn't been already seen and it will be probably a good match. If someone is a fan of horrors, it is not a good idea to show him or her a comedy. At least we don't have any signs that it would be a good choice.

    With the problem formulated in such a way, we can ask what to show a user $j$. Let’s use our intuition. Suppose he or she watched a movie (let's say $k$) and rated it high. We would like to display on our recommendation list a movie that would be liked as much as $k$. The simple idea would be to select the most similar movie. How can we measure it? Assume movies are similar if watched by the same users and are rated similarly. In other words, if the movie $k$ and another movie $m$ have some common users that watched them and they rated them similarly, movie $m$ seems to be a good candidate for a recommendation. We could also look at this problem from a different perspective. We have some ratings for user $j$, so let's find the most similar user in terms of watched movies and recommend something that has been highly rated by the latter if they happen to have similar tastes (a common set of movies doesn’t imply they were equally enjoyed by both users). The problems of this kind are the subject of collaborative filtering.

    Collaborative Filtering

    We ended the last section with a simple idea of searching for the most similar movie or user in order to get an educated guess about the movie we should recommend. It seems that we would get even better results if we had taken all users into account and weighed their impact on the final guess according to the strength of similarity to the user we want predictions for.

    This type of strategy is called Collaborative Filtering. In general, it can be divided into two subcategories:

    • User-User Collaborative Filtering
    • Item-Item Collaborative Filtering

    Both approaches are analogical. The first one is easier to think about. We want to find ratings for users, i.e. we ask ourselves what rating a given user would give a movie if he or she had a chance to watch it. In the second approach, we just need to switch to thinking in terms of items. We'll only talk about the first approach. Suppose that we want to find an estimate for $R_{ij}$, i.e. what rating would the user I give to movie j if he or she watched? We use the following formula to find an estimate.

    $$\hat{R_{ij}} = \frac{\sum_{k \in \text{users that watched } j}w_{k}R_{kj}}{\sum_{k \in \text{users that watched } j}w_{k}}$$

    In the preceding equation, $w_{k}$ measures the similarity of user k to user j. There are a couple of ways to find a value for it. As we know the only information we have at our disposal is a set of ratings. We could, for example, compute the correlation coefficient between users. If so, it is important to remember that some users have zero or a few common watched films, so we should set some threshold (e.g 5) of the minimum number of movies before taking the user into account and computing $w_{k}$. The second thing is that different people can have different ways of perceiving numbers, which means that for one person "3" can mean a bad movie and not worth time and for another person who very rarely gives high scores it can mean a good film. To remedy this problem we must account for bias and we don't estimate mean value, but deviation from our average. If we denote by $\bar{R_{i}}$ the average rating given by user i, we would end up with the following equation:

    $$\hat{R_{ij}} = \bar{R_{i}} + \frac{\sum_{k \in \text{users that watched } j}w_{k}(R_{kj} - \bar{R_{k}})}{\sum_{k \in \text{users that watched } j}w_{k}}$$

    Matrix factorization

    In the matrix factorization approach we assume that users can be described by a set of factors, let's say $K$, so are the movies. We can imagine that a user can like action movies, doesn't like horrors, likes seeing celebrities or good music, and so on. We don't explicitly create these features, because it would take an enormous amount of work and we probably won't find all the relevant factors. So we suppose the existence of hidden factors that are found by mathematical optimization. The assumption that the number of hidden factors for both users and movies is the same seems to be less convincing, but it makes the problem feasible and it doesn't hurt if we tune it correctly. Suppose we got these matrices somehow. Each row in the user matrix has $K$ components corresponding to hidden factors or features of a given user. Similarly so is constructed the movie matrix.

    iJqD RBqGea4GvV4HLXT GSKdp49Ow6gkDdQYM1W5V60KFRbUNs9usBhaE2h742SrHOI3kGDnsk3O WhX34TPrrj8bycTRsyEGddJj GqVSA 3pXJR aeEiopFOPUdO5MiRcsy

    How can we say these matrices are good? The purpose of this matrix is to model ratings estimation by taking dot products between users and movies.

    $$r_{ij} \approx u_{i}^{T}w_{j}$$

    or in matrix form

    $$R \approx UW^{T}$$

    Probably we won't do explicit matrix multiplication of $U$ and $W$, because $R$ is sparse, whereas $U$ and $W$ are dense, so is the result of the multiplication. Now suppose that we need to present some content to a user $i$. If the found matrices are good, the process comes down to find the movies, whose dot produces with the vector representation of the user are the biggest. There are some extensions to this simple model. You can add global bias ($mu$), vectors of users' biases ($b$) and vector of movies' biases ($c$). In this case, you would have $N + M + 1$ more parameters to tune. In general, the equation would be as follows:

    $$r_{ij} \approx mu + b_{i} + c_{j} + u_{i}^{T}w_{j} $$

    Let's think about how we can understand this approach more intuitively. Dot products represent, up to scalar multiplication, cosine between angles (for more check the Wikipedia page on cosine similarity).

    In other words, if we have similar vectors, the angle they create is small. Graphically we can represent it as below:

    9FbpJS7UDh27sDaQprtvwEEXSEg5VuUCKbRSld5ITw4gUfu6MiJ1lt 3zIvFT4Voti5 zBh3i9DKEq2dUM9CQ2a46nE3rIIZT4wPJp OY4QLBAzb7vC9VGr0uz YBziLm6HV7sy

    The relation "having small angle" is very often transitive, i.e. if vectors $x$ and $y$ have a small angle, and so do $y$ and $z$, then also $x$ and $z$ do. In a strict mathematical sense it doesn't have to be true, because we can define small angle as being at most $\theta$ and we can plot $x$, $y$ and $z$ vectors in such a way that $angle(x, y) = \theta$, $angle(y, z) = \theta$ and $angle(x, z) = \theta$. But here we just want to build intuition. Suppose we know the ratings for user $i$ and movies $j$ and $m$, and also we know rating for user $k$ and a movie $j$. What is a rating for an unknown pair, i.e. the user $j$ and the movie $m$?

      wj wm    
    ui 5 1    
    uk 5 ?    

    We can say that 5 means a big value of dot product between vectors or a small angle. And the same reasoning can be applied to the value 1, i.e. a big angle in this case. Let's transfer our intuition to the plot below.

    AtSn1Uo3xkwQjVcu6UmoZ4 VRwsoLhCFB5jfHAtiti0wybDRci6LY3GbFJ1tyy7sfs2mDeKedih4FcX0vtgG515yR0cZ0KeGELgXd6DoQ7skT32JOw5A7APSIQPLY1pu uFWnSNn

    In plain English, we could say that both users liked the same movie, so it is logical to expect if one dislikes the other, so would the user that hasn't watched it yet. It is the idea behind collaborative filtering we've discussed earlier.

    Apart from it, you can regularize the weight of these matrices so their performance on unseen data (which is almost all of the entries in the $R$ matrix) is better. We will stick with the simple version. We haven't said yet how these matrices can be obtained. Actually there are a couple of ways. Since the problem is simple you could implement the derivative yourself and run update U in W in an alternating way. But the more convenient way is to use the existing library to do the job for us. Last but not least, remember to keep some part of ratings as a testing data set, otherwise, you won't know if you carefully approximate the R matrix or just entries that happen to be known.

    from tensorflow import keras
    K = 2 # number of hidden factors
    N = 30 # number of users
    M = 10 # number of movies
    user_input = keras.layers.Input(shape=1)
    movie_input = keras.layers.Input(shape=1)
    U = keras.layers.Embedding(N, K)(user_input)
    W = keras.layers.Embedding(M, K)(movie_input)
    UW = keras.layers.Flatten()(
        keras.layers.Dot(axes=2)([U, W]))
    model = keras.Model([user_input, movie_input], UW)
    # Toy example
    # Below we have three ratings (embeddings are indexed from 0, whereas matrices from 1)
    # R_{1, 2} = 5, R_{1, 6} = 5, R_{5, 4} = 3
    users = np.array([0, 0, 4])
    movies = np.array([1, 5, 8])
    ratings = np.array([5, 5, 3])
        optimizer=keras.optimizers.SGD(lr=0.05, momentum=0.9),
    )[users, movies], ratings, batch_size=1, epochs=1)
    estimated_U, estimated_W = model.get_weights()

    Factorization Machine

    The factorization machine is a very powerful model that appeared for the first time in this paper. So far we ignored that fact we might have more data than only information about ratings. In our database each user is characterized by some feature, so are movies.

    To better explain what the Factorization Machine is we must detour a bit. It is the way the article introduces the concept and is more natural. We'll later see how it is connected to our application. We talked about hidden factors for users and for movies and then what interpretation has its dot product, which was the rating or its estimation. Similarly here the simple operation of dot products is of great importance. Now we don't look for hidden factors for entities (like users or movies) but for features, in machine learning sense. We know that the last statement can sound a bit vague. It will be more clear as we get to the example. Suppose that we have a simple dataset representing houses. We have their prices, surface area, a number of rooms, and a binary feature reflecting having a garden or not. And suppose that the price is our target, it is what we want to predict. A simple linear regression would have the following form:

    $$y = a + bx_{surface} + cx_{rooms} + dx_{garden}$$

    We could introduce some features manually by adding, for example, a feature $x_{doors}x_{garden}$ if we believe that having garden influence somehow we appraise the number of rooms. In other words, could add some interactions. Let's add them.

    $$y = a + bx_{surface} + cx_{rooms} + dx_{garden} + ex_{surface}x_{rooms} + fx_{surface}x_{garden} + gx_{rooms}x_{garden}$$

    Normally we would run an optimization algorithm and find values for $a$, $b$, ..., $g$. The difference between this approach and the factorization machine is the way it treats interactions. Interactions are modeled as dot products. Each feature gets a hidden vector, let's say with $K$ components.

    k3zZV8dlKDqKIyL NXZ1ZcmtTu2f1hcOnlVNxHG1Y0Dnf 8BnMZkGvL87cXRz6DLPojmU182DNFrkZAynExl HvSJ1lkVyjjLBUV3g9kgp87qAi U3cazr6yFI CMreTjJMO7Ibl

    Having said that, the model assumes the following form:

    y = a + bx_{surface} + cx_{rooms} + dx_{garden} + v_{surface}^{T}v_{rooms}x_{surface}x_{rooms} + v_{surface}^{T}v_{garden}x_{surface}x_{garden} + v_{rooms}^{T}v_{garden}x_{rooms}x_{garden}

    In the process of fitting the interaction matrix (apart from $a$, $b$, and $c$) will be found. Of course, we wanted only to introduce you gently to the model. Probably it is not the best idea to solve this problem this way. When are these interactions useful then? As we saw in the previous section, dot product measures the similarity between vectors, and this can be helpful to transfer knowledge (transitive property). We somehow proved that it works for users. What about features? Actually we can formulate the problem in such a way that users are categorical features in our model.

    Basically we have a ratings matrix and we can treat each rating as an observation. In this case, our label is the rating itself and two features, categorized users and movies. Simply speaking our feature vector is of length $N + M$.

    If we wanted to include all interactions we could have $\binom{N + M}{2}$ of them. But in our case, in our encoding, there exists only one interaction, between a given user and a given movie. Our factorization

    In the article you can find a simple trick that makes the training of the model easy. We'll omit some details. Here’s the trick:

    \sum_{i=1}^{n}\sum_{j=i+1}^{n}v_{i}^{T}v_{j}=\frac{1}{2}\sum_{f=1}^{K}((\sum_{i=1}^{n}v_{i, f}x_{i})^{2} - (\sum_{i=1}^{n}v_{i, f}^{2}x_{i}^{2}))$$

    We can describe the math in plain English.

    • $(\sum_{i=1}^{n}v_{i, f}x_{i})^{2}$ - It is a linear combination of rows squared. In sparse settings (which is ours) it is simply a sum of hidden vectors corresponding to non-zero features.
    • $\sum_{i=1}^{n}v_{i, f}^{2}x_{i}^{2}$ - The same interpretation as the previous bullet point, but first squaring is performed
    • $\frac{1}{2}\sum_{f=1}^{K}$ - summing all hidden components, collapsing to a scalar

    Below you can find the code so that you can see how easy it is to implement it in tensorflow.

    import tensorflow as tf
    from tensorflow import keras
    K = 2 # number of hidden factors
    N = 30 # number of users
    M = 10 # number of movies
    user_input = keras.layers.Input(shape=1, dtype=tf.int32)
    movie_input = keras.layers.Input(shape=1, dtype=tf.int32)
    i = keras.layers.Concatenate()([user_input, movie_input])
    V = keras.layers.Embedding(
        input_dim=N + M,
    rows = V(i)
    first_component = tf.reduce_sum(rows, axis=1) ** 2
    second_coponent = tf.reduce_sum(rows * rows, axis=1)
    interaction_value = 0.5 * tf.reduce_sum(first_component - second_coponent, axis=1)
    model = keras.Model([user_input, movie_input], interaction_value)
    users = np.array([0, 0, 4])
    movies = np.array([1, 5, 8])
    ratings = np.array([5, 5, 3])
        optimizer=keras.optimizers.SGD(lr=0.05, momentum=0.9),
    )[users, movies], ratings, batch_size=1, epochs=1)
    estimated_V = model.get_weights()[0]

    Of course, it is not a good idea not to add linear and bias terms. We’ll add linear terms in the subsequent snippet (although for simplicity we’ll omit global bias). The real power of this model can be seen if we have sparse data and we want to find generalizable value for interactions between features. But It might be the best choice if we have a dense table filled with numbers. So far we haven't mentioned a word that we have different data at our disposal. It is easy to imagine that in our database of users and a database of movies, and also ratings (like the time the rating was submitted) we have a lot of new information. Let's add for the sake of argument one more feature to our problem. In the code, a little will change, but we'll provide you the code for reference.

    We want to keep this example simple. Let's add just one more feature which is production year.

    UmRsNs9T GL15iaoOo4y05kc7 smOGxB 1Q

    $$\text{rating} = m_{u} + a_{i} + b_{j} + c_{k} + v_{i}^{T}v_{j} + v_{i}^{T}v_{k} + v_{j}^{T}v_{k}$$

    import tensorflow as tf
    from tensorflow import keras
    K = 2 # number of hidden factors
    N = 30 # number of users
    M = 10 # number of movies
    Y = 3 # number of movies, from 2018 to 2020
    user_input = keras.layers.Input(shape=1, dtype=tf.int32)
    movie_input = keras.layers.Input(shape=1, dtype=tf.int32)
    year_input = keras.layers.Input(shape=1, dtype=tf.int32)
    i = keras.layers.Concatenate()([user_input, movie_input, year_input])
    V = keras.layers.Embedding(
        input_dim=N + M + Y,
    a = keras.layers.Embedding(
    b = keras.layers.Embedding(
    c = keras.layers.Embedding(
    rows = V(i)
    first_component = tf.reduce_sum(rows, axis=1) ** 2
    second_coponent = tf.reduce_sum(rows * rows, axis=1)
    interaction_value = 0.5 * tf.reduce_sum(first_component - second_coponent, axis=1)
    rating_prediction = a(user_input) + b(movie_input) + c(year_input) + interaction_value
    model = keras.Model([user_input, movie_input, year_input], rating_prediction)
    users = np.array([0, 0, 4])
    movies = np.array([1, 5, 8])
    years = np.array([0, 0, 2]) # 2018, 2018, 2020
    ratings = np.array([5, 5, 3])
        optimizer=keras.optimizers.SGD(lr=0.05, momentum=0.9),
    )[users, movies, years], ratings, batch_size=1, epochs=1)
    estimated_V, estimated_a, estimated_b, estimated_c = model.get_weights()


    It is the end of the first part of a short series devoted to recommendation systems. We have discussed classical approaches to the problem with the intuition of why they might work. We transfer that intuition to a more complex model, which is a Factorization Machine, and showed that is a good starting point for further improvement for example by including more data. In the next post, we’ll take a look at more advanced approaches. Hope to see you soon!

    With 2020 in full swing, the business world is undergoing major changes in order to accommodate the growth of technology. Artificial intelligence is making waves in modern business operations, forcing organizations to adopt new processes in order to remain relevant. Below is a list of AI business tools that are making an impact this year and how to adopt them into operational strategies to remain forward thinking.

    scott graham 5fNmWej4tAA unsplash scaled 1
    Source: Pexels

    Mobile Technology Improvements

    Mobile technology is a great low cost tool that is easily accessible and convenient for both employees and consumers. While they’re not a new form of AI, mobile applications have come a long way. They now allow the creation of real-time, shareable information using AI-generated data to personalize devices and ensure a higher level of security for employers who utilize them for sharing sensitive information. New and improved cyber defense systems create more demand for authenticity and security measures to prevent hacks into company data (for information on what qualifies as a cyber attack, check out this article here). Machine learning will also continue to expand its ability to collect diverse sets of data to track any irregular patterns that may hint at a cyber attack. These advancements will be hugely impactful to both businesses and consumers, so it’s lucrative to remain up-to-date on newly emerging security measures.

    The smaller AI elements that make up mobile technology advancements are carving their way to the forefront of business strategies. Speech recognition for example, takes efficiency and security a step further, by identifying unique voice tracks that surpass digital barriers to entry on mobile devices. This makes sensitive information hidden to those who do not have granted access. Translation features improve accessibility for organizations who have international offerings as well, with AI technology providing instant translation options with or without internet access. Image recognition is another feature that, when paired with mobile technology, allows consumers to “scan to buy” and also receive suggestions for other products based on the items scanned. This has greatly improved ecommerce sales, by increasing search exposure, consumer engagement, and improving overall conversion rates. Chatbot systems expand upon this feature by responding to consumer inquiries through digitally generated messaging. This ensures consumers are not going unheard, with with 90% of businesses reporting faster complaint resolution with the help of chatbots. Consumers are also able to find basic information without allocating employees for these tasks.

    person holding silver iphone 7 887751 scaled
    Source: Pexels

    Machine Learning, Deep Learning

    2019 was a record year for enterprise’s interest in AI and machine learning technology. Machine learning takes information and data, learns from the processed information, and ultimately produces a complex solution. This acts as a way to improve and streamline overall decision making processes and, with its ability to improve over time as more data is manually added, it continuously produces more informed decisions and projections. In the case of inaccuracies, machine learning programs need to be monitored and adjusted manually. This is where deep learning becomes the competing technology.

    Deep learning stems from machine learning, but requires little to no human involvement. These systems determine inaccuracies using complex algorithms designed as artificial neural networks. In other words, it mimics the biological layered structure of the human brain. Deep learning has the ability to learn, with the help of advances in computing power, from previous data and improves on its own with no manual input. These systems make intelligent decisions to further improve operational processes, and require limited human involvement. Computing power is the fuel for deep learning, and is reported to see major improvements in the coming year to handle more complex issues facing modern businesses. This is the technology of the future, but currently requires highly skilled IT professionals to maintain. Keep this in mind when considering implementing it into an operational structure.

    pexels photo 1936299
    Source: Pexels

    Internal Operations Tools

    Communication tools are a pretty obvious necessity and with the help of AI, they can create a higher level of transparency and a more connected culture by limiting miscommunication. By opening up communication channels and adapting to social platforms, all levels of an organization have access to pertinent information. Social technologies and chat applications, when implemented, also allow for fast-paced, real-time communication to build stronger internal connections between employees. A recent study found that the implementation of these tools could lead to a  20-25% increase in overall productivity. Being able to instantaneously share ideas and expertise creates a more collaborative, productive environment that harbors innovation to continuously improve internal processes.

    AI-based operational tools that are more strategy focused are also making an impact in the coming year. The growth of cloud-based tools has grown exponentially to streamline business strategies. Among the most influential AI systems are enterprise resource planning systems. These artificial intelligence tools manage financial information, project management, procurement, and other operational tasks. Robotic process automation is similar in its ability to organize and simplify processes, and take over manual tasks from workforce members. It delivers automated processes instantaneously across a large number of differing platforms and software tools. The ability to process complex information at scale is an enticing improvement making its way into modern business strategies, with quick implementation also influencing the transformative switch to these software systems.

    adult architect blueprint business 416405 scaled
    Source: Pexels


    Going into 2020 with AI-driven strategies is key to remaining relevant in the business market. AI powered analytics produce informed and personalized solutions to a multitude of newly arising business problems. With multiple sets of metrics and data, managers are better able to optimize and predict future outcomes and forecast for long-term success. Transform and grow a workforce that aligns with future artificial intelligence advancements as well in order to fully adapt to data-driven strategies. Take the above impactful AI tools into consideration as we continue into the new year, and transform business operations into future proofed strategies.

    Before Deep Learning Era

    Back in the days before the era — when a Neural Network was more of a scary, enigmatic mathematical curiosity than a powerful or tool — there were surprisingly many relatively successful applications of classical mining algorithms in the Natural Language Processing (NLP) domain. It seemed that problems like spam filtering or part of speech tagging could be solved using rather straightforward and understandable models.

    But not every problem can be solved this way. Simple models fail to adequately capture linguistic subtleties like context, idioms, or irony (though humans often fail at that one too). Algorithms based on overall summarization (e.g. bag-of-words) turned out to be not powerful enough to capture sequential nature of text, whereas n-grams struggled to model general context and suffered severely from a curse of dimensionality. Even HMM -based models had trouble overcoming these issues due to their Markovian nature (as in, their memorylessness). Of course, these methods were also used when tackling more complex tasks, but not to great success.

    First breakthrough - Word2Vec

    One of the main challenges in language analysis is the method of transforming text into numerical input, which makes modelling feasible. It is not a problem in computer vision tasks due to the fact that in an image, each pixel is represented by three numbers depicting the saturations of three base colors. So for many years, researchers tried numerous algorithms for finding so called embeddings, which refer, in general, to representing text as vectors. At first, most of these methods were based on counting words or short sequences of words (n-grams).

    The initial approach to tackle this problem is one-hot encoding, where each word from the vocabulary is represented as a unique binary vector with only one nonzero entry. A simple generalization is to encode n-grams (sequence of n consecutive words) instead of single words. The major disadvantage to this method is very high dimensionality, each vector has a size of the vocabulary (or even bigger in case of n-grams) which makes modelling difficult. Another drawback to this approach is lack of semantic information. This means that all vectors representing single words are equidistant. In this embedding, space synonyms are just as far from each other as completely unrelated words. Using this kind of word representations unnecessarily makes tasks much more difficult as it forces your model to memorize particular words instead of trying to capture the semantics.

    word2vec pca paper example 1 fs8
    Figure 1: Word2Vec representations of words projected onto a two-dimensional space.

    The first major leap forward in this area came in 2013 with the introduction of Word2Vec - a neural network based model used exclusively for producing embeddings. Imagine starting from a sequence of words, removing the middle one, and having a model predict it only by looking at context words (i.e., Continuous Bag of Words, CBOW). The alternative version of that model is asking to predict the context given the middle word (skip-gram). This idea is counterintuitive, because such a model might be used in information retrieval tasks (a certain word is missing and the problem is to predict it using its context), but that's rarely the case. Instead, it turns out that if you initialize your embeddings randomly and then use them as learnable parameters in training CBOW or a skip-gram model, you obtain a vector representation of each word that can be used for any task. Those powerful representations emerge during training, because the model is forced to recognize words that appear in the same context. This way you avoid memorizing particular words, but rather convey semantic meaning of the word explained not by a word itself, but by its context.

    In 2014, just a year later, Stanford’s research group challenged Word2Vec with a strong competitor: GloVe. They proposed a different approach, arguing that the best way to encode semantic meaning of words in vectors is through global word-word co-occurrence matrix as opposed to local co-occurrences as in Word2Vec. As you can see in Figure 1 that the ratio of co-occurrence probabilities is able to discriminate words when compared to the context word. It is around 1 when both target words co-occur very often or very rarely with the context word. Only when the context word co-occurs with one of the target words is the ratio either very small or very big. This is the intuition behind GloVe. The exact algorithm involves representing words as vectors in a way that their difference, multiplied by a context word, is equal to the ratio of the co-occurrence probabilities.

    Further improvements

    Even though the new powerful Word2Vec representation boosted the performance of many classical algorithms, there was still a need for a solution capable of capturing sequential dependencies in a text (both long- and short-term). The first concept for this problem was so-called vanilla Recurrent Neural Networks (RNNs). Vanilla RNNs take advantage of the temporal nature of by feeding words to the network sequentially while using the information about previous words stored in a hidden-state.

    RNN unrolled fs8
    Figure 2: A recurrent neural network. Image courtesy of an excellent Colah's.

    These networks proved very effective in handling local temporal dependencies, but performed quite poorly when presented with long sequences. This failure was caused by the fact that after each time step, the content of the hidden-state was overwritten by the output of the network. To address this issue, computer scientists and researchers designed a new RNN architecture called long-short term memory (LSTM). LSTM deals with the problem by introducing an extra unit in the network called a memory cell, a mechanism that is responsible for storing long term dependencies and several gates responsible for control of the information flow in the unit. How this works is at each time step, the forget gate generates a fraction which depicts an amount of memory cell content to forget. Next, the input gate determines how much of the input will be added to the content of the memory cell. Finally, the output gate decides how much of the memory cell content to generate as the whole unit's output. All the gates act like regular neural network layers with learnable parameters, which means that over time, the network adapts and is better at deciding what kind of input is relevant for the task and what information can and should be forgotten.

    LSTMs have actually been around since late 1990s, but they are quite expensive computationally and memory wise, so it is only recently, thanks to remarkable advances in hardware, that it became feasible to train LSTM networks in reasonable time. Nowadays, there exist many variations of LSTM such as mLSTM, which introduces multiplicative dependency on the input or GRU which, thanks to an intelligent simplification of the memory cell update mechanism, significantly decreased the number of trainable parameters.

    After a short while it became clear that these models significantly outperform classic approaches, but researchers were hungry for more. They started to study the astounding success of Convolutional Neural Networks in Computer Vision and wondered whether those concepts could be incorporated into . It quickly turned out that a simple replacement of 2D filters (processing a small segment of the image, e.g. regions of 3x3 pixels) with 1D filters (processing a small part of the sentence, e.g. 5 consecutive words) made it possible. Similarly to 2D CNNs, these models learn more and more abstract features as the network gets deeper with the first layer processing raw input and all subsequent layers processing outputs of its predecessor. Of course, a single word embedding (embedding space is usually around 300 dimensions) carries much more information than a single pixel, which means that it not necessary to use such deep networks as in the case of images. You may think of it as the embedding doing the job supposed to be done by first few layers, so they can be skipped. Those intuitions proved correct in experiments on various tasks. 1D CNNs were much lighter and more accurate than RNNs and could be trained even an order of magnitude faster due to an easier parallelization.

    Convolutional Neural Networks were first used to solve Computer Vision problems and remain state-of-the-art in that space. Learn more about their applications and capabilities here.

    Despite incredible contributions made by CNN, the networks still suffered from several drawbacks. In a classic setup, a convolutional network consists of several convolutional layers which are responsible for creating so-called feature maps and a module transforming it into predictions. Feature maps are essentially high level features extracted from text (or image) preserving the location where it emerged in the text (or image). The prediction module performs aggregating operations on feature maps and either ignores the location of the feature (fully convolutional networks) or more commonly: learns where particular features appear most often (fully connected modules). The problem with these approaches arises for example in the Question Answering task, where the model is supposed to produce the answer given the text and a question. In this case, it is difficult and often unnecessary to store all information carried by the text in a single text, as is done by classic prediction modules. Instead, we would like to focus on a particle part of text where the most crucial information is stored for a particular question. This problem is addressed by Attention Mechanism, which weighs parts of the text depending on what may be relevant based on the input. This approach has also been found useful for classic applications like text classification or translation.

    Typical NLP problems

    There are a variety of language tasks that, while simple and second-nature to humans, are very difficult for a machine. The confusion is mostly due to linguistic nuances like irony and idioms. Let’s take a look at some of the areas of NLP that researchers are trying to tackle (roughly in order of their complexity):

    The most common and possibly easiest one is sentiment analysis. This is, essentially, determining the attitude or emotional reaction of a speaker/writer toward a particular topic (or in general). Possible sentiments are positive, neutral, and negative. Check out this great article about using Deep Convolutional Neural Networks for gauging sentiment in tweets. Another interesting experiment showed that a Deep Recurrent Net could the learn sentiment by accident.

    sentiment prediction fs8
    Figure 3: Activation of a neuron from a net used to generate next character of text. It is clear that it learned the sentiment even though it was trained in an entirely unsupervised environment.

    A natural generalization of the previous case is document classification, where instead of assigning one of three possible flags to each article, we solve an ordinary classification problem. According to a comprehensive comparison of algorithms, it is safe to say that Deep Learning is the way to go fortext classification.

    Now, we move on to the real deal: Machine Translation. Machine translation has posed a serious challenge for quite some time. It is important to understand that this an entirely different task than the two previous ones we’ve discussed. For this task, we require a model to predict a sequence of words, instead of a label. Machine translation makes clear what the fuss is all about with Deep Learning, as it has been an incredible breakthrough when it comes to sequential data. In this blog post you can read more about how — yep, you guessed it — Recurrent Neural Networks tackle translation, and in this one you can learn about how they achieve state-of-the-art results.

    Say you need an automatic text summarization model, and you want it to extract only the most important parts of a text while preserving all of the meaning. This requires an algorithm that can understand the entire text while focusing on the specific parts that carry most of the meaning. This problem is neatly solved by attention mechanisms, which can be introduced as modules inside an end-to-end solution.

    Lastly, there is question answering, which comes as close to Artificial Intelligence as you can get. For this task, not only does the model need to understand a question, but it is also required to have a full understanding of a text of interest and know exactly where to look to produce an answer. For a detailed explanation of a question answering solution (using Deep Learning, of course), check out this article.

    attention fs8 1
    Figure 4: Beautiful visualization of an attention mechanism in a recurrent neural network trained to translate English to French.

    Since Deep Learning offers vector representations for various kinds of data (e.g., text and images), you can build models to specialize in different domains. This is how researchers came up with visual question answering. The task is "trivial": Just answer a question about an image. Sounds like a job for a 7-year-old, right? Nonetheless, deep models are the first to produce any reasonable results without human supervision. Results and a description of such a model are in this paper.

    Starving for applications? Get your hands dirty and implement your chatbot using LSTMs.

    Natural Language Generation

    You may have noticed that all of the above tasks share a common denominator. For sentiment analysis an article is always positive, negative or neutral. In document classification each example belongs to one class. This means that these problems belong to a family of problems called supervised learning. Where the model is presented with an example and a correct value associated with it. Things get tricky when you want your model to generate text.

    Andrej Karpathy provides a comprehensive review of how RNNs tackle this problem in his excellent blog post. He shows examples of deep learning used to generate new Shakespeare novels or how to produce source code that seems to be written by a human, but actually doesn't do anything. These are great examples that show how powerful such a model can be, but there are also real life business applications of these algorithms. Imagine you want to target clients with ads and you don't want them to be generic by copying and pasting the same message to everyone. There is definitely no time for writing thousands of different versions of it, so an ad generating tool may come in handy.

    RNNs seem to perform reasonably well at producing text at a character level, which means that the network predicts consecutive letters (also spaces, punctuation and so on) without actually being aware of a concept of word. However, it turned out that those models really struggled with sound generation. That is because to produce a word you need only few letters, but when producing sound in high quality, with even 16kHz sampling, there are hundreds or maybe even thousands points that form a spoken word. Again, researchers turned to CNNs and again with great success. Mathematicians at DeepMind developed a very sophisticated convolutional generative WaveNet model, which deals with a very large receptive field (length of the actual raw input) problem by using a so-called attrous convolutions, which increase the receptive field exponentially with each layer. This is currently the state-of-the-art model significantly outperforming all other available baselines, but is very expensive to use, i.e. it takes 90 seconds to generate 1 second of raw audio. This means that there is still a lot of room for improvement, but we're definitely on the right track.


    So, now you know. Deep Learning appeared in NLP relatively recently due to computational issues, and we needed to learn much more about Neural Networks to understand their capabilities. But once we did, it changed the game forever.



    Information sourcing and research is a common problem for many companies out there.

    Obtaining useful information about market opportunities from huge amounts of data require vast resources, both in terms of time and funds. The good news is there is the AI-based technology that is able to perform some parts automatically, and therefore improve the efficiency of the whole process.

    A system using AI algorithms - Document Classification, Named Entity Recognition, to name a few - can spot certain changes on the market without any human intervention.

    Solution: Detecting Company Management Structure Changes

    Let's focus on a particular, real-life example.
    An investor needed information about changes in the management structure of companies in NYSE. Before leveraging the AI solution, he employed a dedicated team of research analysts traversing thousands of web articles, tweets and social media posts looking for recent changes in the structure of the companies.

    Unlock hidden investment opportunities, automate processes, and more.

    Talk to us to digitally transform your company using AI.

    1. Collect the data from the online and local sources

    sources 1

    The first step we need to do is to aggregate data from the valuable sources. So far, our client's analysts read online business newspapers and monitored selected Twitter accounts looking for structural change information.

    Our system has entirely automated that work by scraping newspapers and tweets, producing a steady feed of information ready to use by the analysts, putting the information in one place.

    2. Filter out the irrelevant materials


    As the next stage in the pipeline, our state-of-the-art AI text classification algorithms precisely filter out all the irrelevant materials, basing their judgment on the context and actual meaning of the text.

    3. Extract knowledge from text

    Screen Shot 2017 10 25 at 15.05.27

    We've already managed to save a lot of time by showing only the most valuable information.

    However, there's still a lot of work to do. Working on raw documents is quite difficult. We need to extract the information that the client needs.

    And AI can help us in that as well!

    Data scientists call this technique named entity extraction. It identifies important information to the investor in the text, based on given criteria. In this case, we extract the company name, position, and the reason, and put it in a table row that we can use later.

    At this point, it can be exported into JSON or CSV.

    4. Generate insights and find correlations in the data


    You probably know that the computer is great in processing tabular data, like Excel tables or databases. We call this type of information structured data.

    The AI now further processes this data by:

    • prioritizing the most important structural change instances
    • filtering out based on certain criteria (e.g. companies outside a geographical region)
    • finding patterns and correlations in data
    • using the collected information to predict other changes in the market and stock movements (generate insights)

    5. Generate a human-sounding, readable report


    Our investor needs human-like, readable reports that he/she can easily relate to and quickly find important information.

    Thanks to another technology called the Natural Language Generation (NLG), the artificial intelligence can generate human-sounding reports and summaries.

    And we're going to leverage this new machine skill to help our investor.

    Our last step is to turn the raw table into an executive summary that the client can use to guide their further decisions.

Integrating all that goodness into the corporate workflow

We've made the NLP technology very accessible.

Depending on your needs, you can either:

  • plug it into your infrastructure using our microservice-based API
  • use it as a standalone cloud-based application, importing and exporting the data in popular formats, like Excel, CSV, or Word.


Although the safety of drugs is tested during clinical trials, many adverse drug reactions (ADR) may only be revealed under certain circumstances, for instance: long-term use, when used in conjunction with other drugs or by people excluded from trials, such as children or pregnant women.

Nowadays consumers often report their experiences with ADR on social media instead of traditional channels, which makes drug safety surveillance systems less efficient.


Thanks to our solution, drug safety surveillance systems can be easily improved by incorporating the knowledge extracted from social media into them.

Our system enables fast scanning of various data resources, such as: Twitter, Facebook or forums posts.

Then, using state-of-the-art Natural Language Processing techniques it filters out irrelevant information. The remaining, relevant data is processed and certain entities, such as: drug name, ADR, pharmaceutical company name, person age and gender, are extracted and analyzed.

The last step is automatic report generation. The executive summary containing all found ADR and insights on how they occurred is created and could be used in further drug development.

AI in pharma steps


  • More efficient identification of novel adverse drug reactions
  • Reducing cost of drug safety surveillance systems
  • Improving drug safety by finding potential ADR faster
  • Identified interactions could be used in further study and drug development
[email protected]
linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram