Attention mechanisms have revolutionized machine learning in applications ranging from NLP through computer vision to reinforcement learning. Attention is the key innovation behind the recent success of Transformer-based language models such as BERT.1 In this blog post, I will look at the first instance of attention that sparked the revolution: additive attention (also known as Bahdanau attention), proposed by Bahdanau et al.2

The idea of attention is quite simple: it boils down to weighted averaging. Let us consider machine translation as an example. When generating a translation of a source text, we first pass the source text through an encoder (an LSTM or an equivalent model) to obtain a sequence of encoder hidden states $\mathbf{s}_1, \dots, \mathbf{s}_n$. Then, at each step of generating a translation (decoding), we selectively attend to these encoder hidden states, that is, we construct a context vector $\mathbf{c}_i$ that is a weighted average of encoder hidden states:

$$\mathbf{c}_i = \sum\limits_j a_{ij}\mathbf{s}_j$$

We choose the weights $a_{ij}$ based both on encoder hidden states $\mathbf{s}_1, \dots, \mathbf{s}_n$ and decoder hidden states $\mathbf{h}_1, \dots, \mathbf{h}_m$ and normalize them so that they encode a categorical probability distribution $p(\mathbf{s}_j|\mathbf{h}_i)$.

$$\mathbf{a}_i = \text{softmax}\left(f_{att}(\mathbf{h}_i, \mathbf{s}_1), \dots, f_{att}(\mathbf{h}_i, \mathbf{s}_n)\right)$$

Intuitively, this corresponds to assigning each word of a source sentence (encoded as $\mathbf{s}_j$) a weight $a_{ij}$ that tells how much the word encoded by $\mathbf{s}_j$ is relevant for generating subsequent $i$-th word (based on $\mathbf{h}_i$) of a translation. The weighting function $f_{att}(\mathbf{h}_i, \mathbf{s}_j)$ (also known as alignment function or score function) is responsible for this credit assignment.
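To make these two equations concrete, here is a toy sketch in plain Python. The attention scores and encoder states are made-up numbers; any scoring function $f_{att}$ could have produced the scores.

```python
import math

# Made-up attention scores f_att(h_i, s_j) for one decoding step i
# over a source sentence of four words.
scores = [2.0, 0.5, -1.0, 0.1]

# Made-up 2-dimensional encoder hidden states s_1, ..., s_4.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

# Softmax turns the scores into a probability distribution a_i over source words.
exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

# The context vector c_i is the weighted average of the encoder states.
context = [sum(w * s[d] for w, s in zip(weights, encoder_states))
           for d in range(2)]
```

The word with the highest score dominates the average, which is exactly the "soft selection" behaviour attention is known for.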

There are many possible implementations of $f_{att}$, including multiplicative (Luong) attention or key-value attention. In this blog post, I focus on the historically first and arguably the simplest one — additive attention.

Additive attention uses a single-layer feedforward neural network with hyperbolic tangent nonlinearity to compute the weights $a_{ij}$:

$$f_{att}(\mathbf{h}_i, \mathbf{s}_j) = \mathbf{v}_a{}^\top \text{tanh}(\mathbf{W}_1 \mathbf{h}_i + \mathbf{W}_2 \mathbf{s}_j)$$

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are matrices corresponding to the linear layer and $\mathbf{v}_a$ is a learned weight vector that projects the output of the nonlinearity down to a scalar score.

## PyTorch Implementation of Additive Attention

```python
import torch


class AdditiveAttention(torch.nn.Module):

    def __init__(self, encoder_dim=100, decoder_dim=50):
        super().__init__()
        self.encoder_dim = encoder_dim
        self.decoder_dim = decoder_dim
        self.v = torch.nn.Parameter(torch.rand(self.decoder_dim))
        self.W_1 = torch.nn.Linear(self.decoder_dim, self.decoder_dim)
        self.W_2 = torch.nn.Linear(self.encoder_dim, self.decoder_dim)

    def forward(self,
                query,   # [decoder_dim]
                values   # [seq_length, encoder_dim]
                ):
        weights = self._get_weights(query, values)             # [seq_length]
        weights = torch.nn.functional.softmax(weights, dim=0)
        return weights @ values                                # [encoder_dim]

    def _get_weights(self,
                     query,   # [decoder_dim]
                     values   # [seq_length, encoder_dim]
                     ):
        query = query.repeat(values.size(0), 1)        # [seq_length, decoder_dim]
        weights = self.W_1(query) + self.W_2(values)   # [seq_length, decoder_dim]
        return torch.tanh(weights) @ self.v            # [seq_length]
```


Here _get_weights corresponds to $f_{att}$, query is a decoder hidden state $\mathbf{h}_i$ and values is a matrix of encoder hidden states $\mathbf{s}$. To keep the illustration clean, I ignore the batch dimension.

In practice, the attention mechanism produces a new context vector at each time step of text generation: the current decoder hidden state serves as the query, and the resulting context vector, corresponding to $\mathbf{c}_i$, is fed into the decoder (typically an LSTM, whose own hidden states are not crucial for our present purposes) along with the previous output.
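To show where the attention call sits inside a decoder, here is a minimal sketch of a single decoding step. It assumes an LSTMCell-based decoder; all names (decode_step, context_vector, h, c) are illustrative, and attention stands for any callable mapping a query and encoder states to a context vector, such as the AdditiveAttention module above.

```python
import torch


def decode_step(attention, decoder_cell, embedding, output_layer,
                prev_token, h, c, encoder_states):
    # Attend over encoder states, using the current decoder state as the query.
    context_vector = attention(h, encoder_states)             # [encoder_dim]
    # The decoder input is the previous token's embedding
    # concatenated with the context vector.
    lstm_input = torch.cat([embedding(prev_token), context_vector])
    h, c = decoder_cell(lstm_input.unsqueeze(0),
                        (h.unsqueeze(0), c.unsqueeze(0)))
    h, c = h.squeeze(0), c.squeeze(0)
    logits = output_layer(h)                                  # [vocab_size]
    return logits, h, c
```

The logits can then be fed to a softmax (or an argmax, for greedy decoding) to pick the next token, and h, c carry the decoder state to the following step.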

Finally, it is now trivial to access the attention weights $a_{ij}$ and plot a nice heatmap.

```python
import seaborn
import torch

attention = AdditiveAttention(encoder_dim=100, decoder_dim=50)
encoder_hidden_states = torch.rand(10, 100)   # 10 source tokens
decoder_hidden_states = torch.rand(13, 50)    # 13 target tokens
weights = torch.zeros(13, 10)
for step in range(decoder_hidden_states.size(0)):
    context_vector = attention(decoder_hidden_states[step], encoder_hidden_states)
    weights[step] = torch.nn.functional.softmax(
        attention._get_weights(decoder_hidden_states[step], encoder_hidden_states),
        dim=0)
seaborn.heatmap(weights.detach().numpy())
```


Here each cell corresponds to a particular attention weight $a_{ij}$. For a trained model and meaningful inputs, we could observe patterns there, such as those reported by Bahdanau et al.: the model learning the differing order of nouns and their adjectives in English and French. Let me end with this illustration of the capabilities of additive attention.

1. Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Annual Conference of the North American Chapter of the Association for Computational Linguistics. ↩︎
2. Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio (2015). Neural Machine Translation by Jointly Learning to Align and Translate. International Conference on Learning Representations. ↩︎

While vanilla linear regression predicts a maximum likelihood estimate of the target variable, Bayesian linear regression predicts a whole distribution over the target variable, offering a natural measure of prediction uncertainty. In this blog post, I demonstrate how to break down this uncertainty measure into two contributing factors: aleatoric uncertainty and epistemic uncertainty. We will dive deeper into the Bayesian framework for doing machine learning and inspect closed-form solutions for training and doing inference with Bayesian linear regression. I will then go on to discuss practical uses of uncertainty estimation: deciding when to stop gathering data, active learning and outlier detection as well as improving model performance by predicting only on a subset of the data.

## Bayesian linear regression

Vanilla linear regression predicts the target value $y$ based on trained weights $\mathbf{w}$ and input features $\mathbf{x}$. Bayesian linear regression predicts the distribution over the target value $y$ by marginalizing over the distribution over weights $\mathbf{w}$. Both training and prediction can be described in terms of inferring $y$, which decomposes into two inference problems: inferring $y$ based on parameters $\mathbf{w}$ and features $\mathbf{x}$ (prediction) and inferring weights based on training data ($\mathbf{X_{train}}, \mathbf{y_{train}}$) (training).

The distribution over targets $p(y|\mathbf{x}, \mathbf{X_{train}}, \mathbf{y_{train}}, \alpha, \beta)$ is known as the predictive distribution and can be obtained by marginalizing over $\mathbf{w}$. Intuitively, we take the average of the predictions of infinitely many models, weighted by how plausible each model is given the training data -- that's the essence of the Bayesian approach to machine learning.

$$\underset{\mathrm{\text{predictive distribution}}} {\underbrace{p(y|\mathbf{x}, \mathbf{X_{train}}, \mathbf{y_{train}}, \alpha, \beta)}} = \int d\mathbf{w} \underset{\mathrm{\text{distribution over targets}}} {\underbrace{p(y|\mathbf{x}, \mathbf{w}, \beta) }} \ \underset{\mathrm{\text{parameter distribution}}} {\underbrace{p(\mathbf{w}|\mathbf{X_{train}}, \mathbf{y_{train}}, \alpha, \beta)}}$$

Here $\mathbf{X_{train}}$ and $\mathbf{y_{train}}$ constitute our training set and $\alpha$ and $\beta$ are two hyperparameters. Both of the distributions of the right-hand side have closed-form and there is also a closed-form solution for the predictive distribution. Let's take a look at those.

### Conditional distribution over targets

The distribution over targets conditioned on weights and features is simply a Gaussian with mean determined by a dot product of weights and features (as in vanilla linear regression) and fixed variance determined by a precision parameter $\beta$.

$$p(y|\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(y|\mathbf{w}^\text{T}\mathbf{x}, \beta^{-1})$$

### Parameter distribution

The parameter distribution is also assumed to be a Gaussian governed by mean $\mathbf{m}_N$ and covariance $\mathbf{S}_N$.

$$p(\mathbf{w}|\mathbf{X_{train}}, \mathbf{y_{train}}, \alpha, \beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N, \mathbf{S}_N)$$

The parameters $\mathbf{m}_N$ and $\mathbf{S}_N$ of the posterior parameter distribution are given by:

$$\mathbf{m}_N = \beta \mathbf{S}_N \mathbf{X^T_{train}} \mathbf{y_{train}}$$

and

$$\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \mathbf{X^T_{train}} \mathbf{X_{train}}$$

where $\alpha$ is a parameter governing the precision of a prior parameter distribution $p(\mathbf{w})$.

### Predictive distribution

The predictive distribution is the actual output of our model: for a given $\mathbf{x}$ it predicts a probability distribution over $y$, the target variable. Because both the distribution over parameters and the conditional distribution over targets are Gaussians, the predictive distribution is a convolution of these two and also Gaussian, taking the following form:

$$p(y|\mathbf{x}, \mathbf{X_{train}}, \mathbf{y_{train}}, \alpha, \beta) = \mathcal{N}(y|\mathbf{m}_N^\text{T}\mathbf{x}, \sigma_N^2(\mathbf{x}))$$

The mean of the predictive distribution is given by a dot product of the mean of the distribution over weights $\mathbf{m}_N$ and features $\mathbf{x}$. Intuitively, we're just doing vanilla linear regression using the average weights and ignoring the variance of the distribution over weights for now. It is accounted for separately in the variance of the predictive distribution:

$$\sigma_N^2(\mathbf{x}) = \underset{\mathrm{\text{aleatoric}}} {\underbrace {\beta^{-1}} }+ \underset{\mathrm{\text{epistemic}}}{\underbrace {\mathbf{x}^\text{T} \mathbf{S}_N \mathbf{x} }}$$

The variance of the predictive distribution, dependent on features $\mathbf{x}$, gives rise to a natural measure of prediction uncertainty: how sure is the model that the predicted value ($\mathbf{m}_N^\text{T}\mathbf{x}$) is the correct one for $\mathbf{x}$. This uncertainty can be further decomposed into aleatoric uncertainty and epistemic uncertainty.

Aleatoric uncertainty represents the noise inherent in the data and is just the variance of the conditional distribution over targets, $\beta^{-1}$. Since the optimal value of $\beta^{-1}$ is just the variance of $p(y|\mathbf{x})$, it will converge to the residual variance of the training set.

Epistemic uncertainty reflects the uncertainty associated with the parameters $\mathbf{w}$. In principle, it can be reduced by moving the parameter distribution towards a better region given more training examples ($\mathbf{X_{train}}$ and $\mathbf{y_{train}}$).
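To see how the closed-form equations fit together, here is a one-dimensional toy sketch in plain Python (scalar features, no bias term; the data and the values of $\alpha$ and $\beta$ are made up for illustration -- in practice they would be tuned or estimated from data):

```python
# Toy 1-D Bayesian linear regression, following the closed-form
# equations above.
alpha, beta = 1.0, 25.0          # prior precision, noise precision

X_train = [0.9, 1.0, 1.1, 2.0]   # scalar features
y_train = [1.8, 2.1, 2.2, 3.9]   # targets, roughly y = 2x plus noise

# S_N^{-1} = alpha + beta * sum x^2  (1x1 case of alpha*I + beta*X^T X)
S_N = 1.0 / (alpha + beta * sum(x * x for x in X_train))

# m_N = beta * S_N * sum x*y  (1x1 case of beta * S_N * X^T y)
m_N = beta * S_N * sum(x * y for x, y in zip(X_train, y_train))


def predict(x):
    """Return predicted mean plus (aleatoric, epistemic) variance parts."""
    aleatoric = 1.0 / beta       # constant noise term
    epistemic = x * S_N * x      # grows as x moves away from the data
    return m_N * x, aleatoric, epistemic
```

Note how the epistemic term $x \, S_N \, x$ grows quadratically as $x$ leaves the region covered by the training data, while the aleatoric term stays constant.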

Can this decomposition be used in practice? I will now proceed to discuss three applications of uncertainty estimation in the context of Bayesian linear regression: a stopping criterion for data collection, active learning, and selecting only a subset of the data to predict targets for.

## What is uncertainty for?

### Bayesian linear regression in scikit-learn

Scikit-learn provides a nice implementation of Bayesian linear regression as BayesianRidge, with fit and predict implemented using the closed-form solutions laid down above. It also automatically takes care of the hyperparameters $\alpha$ and $\beta$, setting them to values maximizing the model evidence $p(\mathbf{y_{train}}|\mathbf{X_{train}}, \alpha, \beta)$.

Just for the sake of experiments, I will override predict to access what scikit-learn abstracts away: my implementation will return aleatoric and epistemic uncertainties rather than just the square root of their sum -- the standard deviation of the predictive distribution.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge


class ModifiedBayesianRidge(BayesianRidge):

    def predict(self, X):
        y_mean = self._decision_function(X)
        if self.normalize:
            X = (X - self.X_offset_) / self.X_scale_
        aleatoric = 1. / self.alpha_
        epistemic = (np.dot(X, self.sigma_) * X).sum(axis=1)
        return y_mean, aleatoric, epistemic
```


Note that while I loosely followed the notation in Bishop's Pattern Recognition and Machine Learning, scikit-learn follows a different convention. self.alpha_ corresponds to $\beta$ while self.sigma_ corresponds to $\mathbf{S}_N$.

We will experiment with ModifiedBayesianRidge on several toy one-dimensional regression problems. Below, on a scatter plot visualizing the dataset, I added the posterior predictive distribution of a fitted model. The red line is the average of the predictive distribution for each $\mathbf{x}$, while the light-red band represents the area within 1 standard deviation (i.e. $\sqrt{\beta^{-1} + \mathbf{x}^\text{T} \mathbf{S}_N\mathbf{x}}$) from the mean. The prediction for each data point $\mathbf{x}$ comes with its own measure of uncertainty. Regions far from training examples are obviously more uncertain for the model. We can exploit these prediction uncertainties in several ways.

### When to stop gathering data?

Acquiring more training data is usually the best thing you can do to improve model performance. However, gathering and labeling data is usually costly and offers diminishing returns: the more data you have, the smaller the improvements new data brings about. It is hard to predict in advance the value of a new batch of data or to develop a stopping criterion for data gathering/labeling. One way is to plot your performance metric (for instance, test set mean squared error) against training set size and look for a trend. It requires, however, multiple evaluations on a held-out test set, ideally a different one than those used for hyperparameter tuning and final evaluation. In small data problems, we may not want to do that.

Uncertainty estimates offer an unsupervised solution. We can plot model uncertainty (on an unlabeled subset) against training set size and see how fast (or slow) epistemic uncertainty is reduced as more data is available. An example of such a plot is below. If data gathering is costly, we might decide to stop gathering more data somewhere around the red line. More data offers diminishing returns.

### Active learning and outlier detection

Some data points $\mathbf{x}$ are more confusing for the model than others. We can identify the most confusing data points in terms of epistemic uncertainty and exploit this in two ways: either focusing on the most confusing data when labeling more data or removing the most confusing data points from the training set.

The first strategy is known as active learning. Here we select the new data for labeling based on prediction uncertainties of a model trained on existing data. We will usually want to focus on the data the model is most uncertain about. A complementary approach is outlier detection. Here we assume that the data points the model is most uncertain about are outliers, artifacts generated by noise in the data generating process. We might decide to remove them from the training set altogether and retrain the model.
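The selection step itself is simple: rank candidate points by epistemic uncertainty and take the top $k$. A minimal sketch (the uncertainty values below are made up, and the helper name is mine):

```python
def select_for_labeling(epistemic, k):
    """Return indices of the k points with the highest epistemic uncertainty.

    For active learning these are the points to label next; for outlier
    detection they are candidates for removal from the training set.
    """
    ranked = sorted(range(len(epistemic)),
                    key=lambda i: epistemic[i], reverse=True)
    return ranked[:k]


# Made-up per-point epistemic uncertainties for four candidate points.
select_for_labeling([0.1, 0.9, 0.3, 0.7], k=2)
```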

Which approach is best depends heavily on multiple circumstances, such as data quality, dataset size and the end user's preferences. Bayesian linear regression is relatively robust against noise in the data and outliers should not be much of a problem for it, but we might want to use Bayesian linear regression just to sanitize the dataset before training a more powerful model, such as a deep neural net. It is also useful to take a look at the ratio between aleatoric and epistemic uncertainty to decide whether uncertainty stems from noise or real-but-not-yet-learned patterns in the data.

Below I illustrate active learning with a simple experiment. We first train a model on 50 data points and then, based on its uncertainty, select 10 out of 100 additional data points for labeling. I compare this active learning scheme with a baseline (randomly selecting 10 out of the 100 data points for labeling) in terms of mean squared error. In the active learning case, it is lower, meaning our carefully selected additional data points reduce the mean squared error more than randomly sampled ones. The effect is small, but sometimes makes a difference.

### Doing inference on a subset of the data

We might also do outlier detection at test time or during production use of a model. In some applications (e.g. in healthcare), the cost of making a wrong prediction is frequently higher than the cost of making no prediction. When the model is uncertain, the right thing to do may be to pass the hardest cases over to a human expert. This approach is sometimes called the reject option.1

For the sake of illustration, I trained a model on 10 data points and computed test set mean squared error on either the whole test set (10 data points), or 5 data points in the test set the model is most certain about.

We can get slightly better performance when refraining from prediction on half of the dataset. Is it worth it? Again, it heavily depends on your use case.

## Conclusions

The goal of this blog post was to present the mathematics underlying Bayesian linear regression, derive the equations for aleatoric and epistemic uncertainty and discuss the difference between these two, and finally, show three practical applications for uncertainty in data science practice. The notebook with code for all the discussed experiments and presented plots is available on GitHub.

1. Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, p. 42.

The relation between syntax (how words are structured in a sentence) and semantics (how words contribute to the meaning of a sentence) is a long-standing open question in linguistics. It happens, however, to have practical consequences for NLP. In this blog post, I review recent work on disentangling the syntactic and the semantic information when training sentence autoencoders. These models are variational autoencoders with two latent variables and auxiliary loss functions specific for semantic and for syntactic representations. For instance, they may require the syntactic representation of a sentence to be predictive of word order and the semantic representation to be predictive of an (unordered) set of words in the sentence. I then go on to argue that sentence embeddings separating syntax from semantics can have a variety of uses in conditional text generation and may provide robust features for multiple downstream NLP tasks.

## Introduction

The ability of word2vec embeddings1 to capture semantic and syntactic properties of words in terms of geometrical relations between their vector representations is almost public knowledge now. For instance, the word embedding for king minus the word embedding for man plus the word embedding for woman will lie close to queen in the embedding space. Similarly, trained word embeddings can do syntactic analogy tasks such as quickly - quick + slow = slowly. But from a purely statistical point of view, the difference between syntax and semantics is arbitrary. Word embeddings themselves do not distinguish between the two: the word embedding for quick will be in the vicinity of both quickly (adverb) and fast (synonym). This is because word embeddings (this applies to word2vec but also to more powerful contextual word embeddings, such as those produced by BERT2) are optimized to predict words based on their context (or vice versa). Context can be semantic (the meaning of neighbouring words) as well as syntactic (the syntactic function of neighbouring words). But from the point of view of a neural language model, learning that a verb must agree with a personal pronoun (do is unlikely when preceded by she) is not fundamentally different from learning that it must maintain coherence with the rest of the sentence (rubble is unlikely when preceded by I ate).
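The analogy trick can be reproduced with a few lines of plain Python. The embeddings below are hand-crafted toy vectors with two interpretable dimensions ("royalty" and "gender"); real word2vec vectors are learned and have hundreds of opaque dimensions.

```python
import math

# Hand-crafted toy embeddings, purely for illustration.
emb = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [0.2,  0.1],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# king - man + woman ...
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# ... lands closest to queen among the remaining words.
nearest = max((w for w in emb if w not in {"king", "man", "woman"}),
              key=lambda w: cosine(emb[w], target))
```

Of course, with vectors designed this way the result is guaranteed; the surprising part about word2vec is that such structure emerges from a purely predictive training objective.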

It seems that we need a more fine-grained training objective to force a neural model to distinguish between syntax and semantics. This is what motivates some recent approaches to learning two separate sentence embeddings for a sentence --- one focusing on syntax, and the other on semantics.

## Training sentence autoencoders to disentangle syntax and semantics

Variational autoencoder (VAE) is a popular architectural choice for unsupervised learning of meaningful representations.3 VAE's training objective is simply to encode an object $x$ into a vector representation (more precisely, a probability distribution over vector representations) such that it is possible to reconstruct $x$ based on this vector (or a sample from the distribution over these vectors). Although VAE research focuses on images, it can also be applied in NLP, where our $x$ is a sentence.4 In such a setting, VAE encodes a sentence $x$ into a probabilistic latent space $q(z|x)$ and then tries to maximize the likelihood of its reconstruction $p(x|z)$ given a sample from the latent space $z \sim q(z|x)$. $p(x|z)$ and $q(z|x)$, usually implemented as recurrent neural networks, can be seen as a decoder and an encoder. The model is regularized to minimize the following loss function:

$$\mathcal{L}_{\mathrm{VAE}}(x):=-\mathbb{E}_{z \sim q(\cdot | x)}[\log p(x | z)]+\mathrm{KL}(q(z | x) \| p(z))$$

where $p(z)$ is assumed to be a Gaussian prior and the Kullback-Leibler divergence $\text{KL}$ between $q(z|x)$ and $p(z)$ is a regularization term.
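When the encoder outputs a diagonal Gaussian $q(z|x) = \mathcal{N}(\boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma}^2))$ and the prior is the standard Gaussian $p(z) = \mathcal{N}(\mathbf{0}, \mathbf{I})$, this regularization term has a well-known closed form, which is what implementations compute in practice:

$$\mathrm{KL}(q(z|x) \,\|\, p(z)) = \frac{1}{2} \sum_{k=1}^{d} \left( \mu_k^2 + \sigma_k^2 - \log \sigma_k^2 - 1 \right)$$

where $d$ is the dimensionality of the latent space.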

Recently, two extensions of the VAE framework have been independently proposed: VG--VAE (von Mises--Fisher Gaussian Variational Autoencoder)5 and DSS--VAE (disentangled syntactic and semantic spaces of VAE).6 These extensions replace $z$ with two separate latent variables encoding the meaning of a sentence ($z_{sem} \sim q_{sem}(\cdot|x)$) and its syntactic structure ($z_{syn} \sim q_{syn}(\cdot|x)$). I will jointly refer to these models as sentence autoencoders disentangling semantics and syntax (SADSS). Disentanglement in SADSS is achieved via a multi-task objective. Auxiliary loss functions $\mathcal{L}_{sem}$ and $\mathcal{L}_{syn}$, separate for semantic and syntactic representations, are added to the VAE loss function with two latent variables:

$$\begin{array}{r} \mathcal{L}_{\mathrm{SADSS}}(x):=\mathbb{E}_{z_{sem} \sim q_{sem}(\cdot | x)} \mathbb{E}_{z_{syn} \sim q_{syn}(\cdot | x)}\left[-\log p\left(x | z_{sem}, z_{syn}\right)\right. \\ \left.+\mathcal{L}_{sem}\left(x, z_{sem}\right)+\mathcal{L}_{syn}\left(x, z_{syn}\right)\right] \\ +\mathrm{KL}\left(q\left(z_{sem} | x\right) \| p\left(z_{sem}\right)\right) \\ +\mathrm{KL}\left(q\left(z_{syn} | x\right) \| p\left(z_{syn}\right)\right) \end{array}$$

There are several choices for the auxiliary loss functions $\mathcal{L}_{sem}$ and $\mathcal{L}_{syn}$. $\mathcal{L}_{sem}$ might require the semantic representation $z_{sem}$ to predict the bag of words contained in $x$ (DSS--VAE) or to discriminate between a sentence $x^+$ paraphrasing $x$ and a dissimilar sentence $x^-$ (VG--VAE). $\mathcal{L}_{syn}$ might require the syntactic representation to predict a linearized parse tree of $x$ (DSS--VAE) or to predict a position $i$ for each word $x_i$ in $x$ (VG--VAE). DSS--VAE also uses adversarial losses, ensuring that (i) $z_{syn}$ is not predictive of the semantic content, (ii) $z_{sem}$ is not predictive of the syntactic structure, and that (iii) neither $z_{syn}$ nor $z_{sem}$ alone is sufficient to reconstruct $x$. Crucially, both auxiliary losses $\mathcal{L}_{sem}$ and $\mathcal{L}_{syn}$ are motivated by the assumption that syntax pertains to the ordering of words, while semantics deals with their lexical meanings.

## What is syntax--semantics disentanglement for?

SADSS allow a number of applications in conditional text generation, including unsupervised paraphrase generation7 and textual style transfer8. Generating a paraphrase $x'$ of $x$ can be seen as generating a sentence that shares the meaning of $x$ but expresses it with different syntax. Paraphrases can be sampled by greedily decoding $x' = p(\cdot|z_{sem}, z_{syn})$, where $z_{sem} = \text{argmax}_{z_{sem}} p(z_{sem}|x)$ and $z_{syn} \sim p(z_{syn}|x)$.

Similarly, one can pose textual style transfer as the problem of producing a new sentence $x_{new}$ that captures the meaning of some sentence $x_{sem}$ but borrows the syntax of another sentence $x_{syn}$.

There is one further application of unsupervised paraphrase generation: data augmentation. Data augmentation means generating synthetic training data by applying label-preserving transformations to available training data. Data augmentation is far less popular in NLP than in computer vision and other applications, partly due to the difficulty of finding task-agnostic transformations of sentences that preserve their meaning. Indeed, Sebastian Ruder lists task-independent data augmentation for NLP as one of the core open research problems in machine learning today. Unsupervised paraphrase generation might be a viable alternative to methods such as backtranslation. Backtranslation produces a synthetic sentence $x'$ capturing the meaning of an original $x$ by first machine translating $x$ into some other language (e.g. French) and then translating $x$ back to English.9 A more principled approach would be to use SADSS and generate synthetic sentences by conditioning on the meaning of $x$ captured in $z_{sem}$ but sampling $z_{syn}$ from a prior distribution to ensure syntactic diversity.

## Beyond conditional text generation

While most research has focused on applying SADSS to natural language generation, representation learning applications remain relatively underexplored. One can imagine, however, using SADSS for producing task-agnostic sentence representations10 that can be used as features in various downstream applications, including document classification and question answering. Syntax--semantics disentanglement seems to bring some additional benefits to the table that even more powerful models, such as BERT, might lack.

First, representations produced by SADSS may be more robust to distribution shift. Assuming that stylistic variation will be mostly captured by $z_{syn}$, we can expect SADSS to exhibit increased generalization across stylistically diverse documents. For instance, we can expect a SADSS model trained on the Wall Street Journal portion of the Penn Treebank to outperform a baseline model on generalizing to Twitter data.

Second, SADSS might be fairer. Raw text is known to be predictive of some demographic attributes of its author, such as gender, race or ethnicity.11 Most approaches to removing information about sensitive attributes from a representation, such as adversarial training,12 require access to these attributes at training time. However, disentanglement of representation has been observed to correlate consistently with increased fairness across several downstream tasks,13 without the need to know the protected attribute in advance. This fact raises the question of whether disentangling semantics from syntax also improves fairness, understood as blindness to demographic attributes. Assuming that most demographic information is captured by syntax, one can conjecture that a disentangled semantic representation would be fairer in this sense.

Finally, learning disentangled representations for language is sometimes conjectured to be part of a larger endeavor of building AI capable of symbolic reasoning. Consider syntactic attention, an architecture separating the flow of semantic and syntactic information inspired by models of language comprehension in computational neuroscience. It was shown to offer improved compositional generalization.14 The authors further argue that the results are due to a decomposition of a difficult out-of-domain (o.o.d.) generalization problem into two separate i.i.d. generalization problems: learning the meanings of words and learning to compose words. Disentangling the two allows the model to refer to particular words indirectly (abstracting away from their meaning), which is a step towards emulating symbol manipulation in a differentiable architecture --- a research direction laid down by Yoshua Bengio in his NeurIPS 2019 keynote From System 1 Deep Learning to System 2 Deep Learning.

## Wrap up

Isn't it naïve to assume that syntax boils down to word order, and the meaning of a sentence is nothing more than a bag of words used in a sentence? Surely, it is. The assumptions embodied in $\mathcal{L}_{sem}$ and $\mathcal{L}_{syn}$ are highly questionable from a linguistic point of view. There are a number of linguistic phenomena that seem to escape these loss functions or occur at the syntax--semantics interface. These include the predicate-argument structure (especially considering the dependence of subject and object roles on context and syntax) or function words (e.g. prepositions). Moreover, what $\mathcal{L}_{sem}$ and $\mathcal{L}_{syn}$ capture may be quite specific to how the grammar of English works. While English indeed encodes the grammatical function of constituents primarily through word order, other languages (such as Polish) manifest much looser word order and mark grammatical function via case inflection, by relying on an array of orthographically different word forms.

Interpreting $z_{sem}$ and $z_{syn}$ as semantic and syntactic is therefore somewhat hand-wavy and seems to provide little insight into the nature of language. Nevertheless, SADSS demonstrate impressive results in paraphrase generation and textual style transfer and show promise for several applications, including data augmentation as well as robust representation learning. They may deserve interest in their own right, despite being a crooked image of how language works.

1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems.
2. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Annual Conference of the North American Chapter of the Association for Computational Linguistics.
3. Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations.

Event-driven (or opportunistic) investing is a strategy that exploits stock mispricings occurring before, during, or after corporate events (also called catalysts or special situations) such as restructurings, M&As, spin-offs, or bankruptcies. Mispricings tend to arise when public companies are involved in special situations because the stock can become artificially inflated or depressed by speculation among market players. Money managers look at events from their own perspective and value the same stock differently from one another. While a company subject to a catalyst may not seem like a good investment opportunity, some sophisticated investors are willing to accept the increased risk.

At Sigmoidal, we look for such opportunities with our MarketMove news analytics engine. Instead of poring over raw numbers, spreadsheets, and balance sheets, the system scans global news. Currently, it analyses 200,000 articles a day to find companies whose market valuation may be wrong in light of recent events that could lift the share price. A reliable, but still speculative, strategy is to follow the actions of activist investors. Their intentions are filed with the SEC under Schedule 13D and can be a source of potential event-driven investment opportunities. However, companies targeted by activist investors are usually in far-from-perfect condition, which poses a relatively high risk when considering an investment.

In light of the recent situation around Twitter and the $1B activist investment from Elliott Management, let's take a look back at similar events in the past. Between 2011 and 2014, the number of activist funds in the US grew from 19 to 162, with their assets increasing from $68B to $205B. Back in the early 2000s, institutional investors were heavily reluctant to back activists, who were viewed as a rather disruptive force. Recent successful activist investments have mitigated those fears, with the realization that a successful campaign can turn a struggling company around. In terms of long-term performance, activist investors increased their capital by 1,400% over the 2004-2016 period. This stands out against the asset growth of alternative investors (hedge funds/private equity) of 304% over the same time.

One recent example of a successful activist campaign was pushing eBay to sell StubHub and its Classifieds ad business, which is still in progress. We saw eBay's CEO step down in September and a significant divestment in the sale of StubHub for almost $4B. Starboard Value, the main activist investor in eBay with a stake of more than 1%, noted a gain of over 4% after increasing the pressure on eBay's management. On January 22nd, 2019, when news websites announced that Elliott Management had taken a $1.4B stake in eBay, the stock also jumped by 7%.

## How Can AI Discover Activist Investments Before or After a Market Catalyst?

With MarketMove we can identify market-moving scoops as they take place in the capital markets. The platform identifies opportunities in shareholder activism across news articles, TV headlines, and more. The system uses Named Entity Recognition models and neural Topic Modelling to identify when the press reports on activist investors.
Over the last month, the MarketMove algorithms identified several events related to new activist campaigns and shareholder activism in real time, including, among others:

- Sachem Head Capital Management building a new stake in Olin Corporation
- Elliott Management unveiling a $3.4B stake in SoftBank
- Tenzing Global Management buying up 5% of Noodle & Co.
- Activist fund Amber raising its stake in Lagardere above 10% from 5.3%

## Other strategies

Other examples of corporate special situations include regulatory changes, earnings announcements, succession issues, divestitures, turnarounds, large layoffs, antitrust proceedings, corporate relocations, pension issues, and more. Identifying such events in real time, as they happen, is often crucial to investors' portfolio strategies.

## Alternative Data In Event-Driven Strategies

Alternative data flows from more and more sources, and for fund managers and individual investors it is increasingly important to leverage uncommon sources in their strategies. As the technology becomes available to more of them, many investors are gradually turning to a quantitative approach; those without a solid data foundation lose ground in a competitive market. The sheer amount of previously unquantifiable data that can now be used in investment strategies presents many new opportunities for money managers. With Artificial Intelligence now present in investment management, management consulting, financial research, and investor relations, there is a growing trend of turning to alternative data sources for insights and recommendations. One of them is processing massive amounts of textual data from global news.

## Building Machine Learning Algorithms

Obtaining useful information about market opportunities from huge amounts of data requires vast resources, both in time and money. The good news is that AI-based technology can perform parts of the process automatically and thereby improve the efficiency of the whole pipeline. A system using AI algorithms, such as Document Classification and Named Entity Recognition, can spot certain changes in the market without human intervention.
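As an illustration of the Document Classification step, here is a minimal from-scratch Naive Bayes sketch that separates "relevant" special-situation headlines from noise. The training headlines and labels are made-up toy data, and a real system would use a trained model over a far larger corpus; this only shows the idea.

```python
# Minimal sketch of a news relevance classifier: a from-scratch
# multinomial Naive Bayes over word counts. Toy data for illustration.
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(docs):
    """docs: list of (text, label). Returns word counts, priors, vocabulary."""
    counts = defaultdict(Counter)
    totals = Counter()
    for text, label in docs:
        counts[label].update(tokenize(text))
        totals[label] += 1
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, vocab

def predict(model, text):
    counts, totals, vocab = model
    n_docs = sum(totals.values())
    best, best_score = None, -math.inf
    for label in totals:
        # log prior + log likelihood with add-one (Laplace) smoothing
        score = math.log(totals[label] / n_docs)
        denom = sum(counts[label].values()) + len(vocab)
        for w in tokenize(text):
            score += math.log((counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

docs = [
    ("activist fund builds stake in retailer", "relevant"),
    ("company files for bankruptcy protection", "relevant"),
    ("board announces merger agreement with rival", "relevant"),
    ("weekend weather expected to be sunny", "noise"),
    ("recipe how to bake sourdough bread", "noise"),
    ("football team wins championship game", "noise"),
]
model = train(docs)
print(predict(model, "hedge fund discloses stake in struggling company"))  # → relevant
```

The classifier scores an unseen headline under each label and keeps the higher-scoring one, which is exactly the filtering decision the analysts previously made by hand.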

### 1. Collect the data from the online and local sources

Let’s focus on a particular, real-life example. An investor needed information about changes in the management structure of companies listed on the NYSE. Before leveraging an AI solution, he employed a dedicated team of research analysts who traversed thousands of web articles, tweets, and social media posts looking for recent changes in company structure. The first step is to aggregate data from valuable sources. Until then, our client’s analysts read online business newspapers and monitored selected Twitter accounts looking for information on structural changes.

### 2. Filter out the irrelevant materials

Our system has entirely automated that work by scraping newspapers and tweets, putting the information in one place and producing a steady feed ready for use by the analysts.
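The aggregate-then-filter step can be sketched roughly as follows. The field names, keyword list, and sample articles are illustrative assumptions, not MarketMove's actual schema:

```python
# Sketch: merge articles scraped from several sources into one feed,
# drop duplicates by URL, and keep only items matching watch keywords.
KEYWORDS = {"ceo", "cfo", "resigns", "appointed", "steps down"}

def build_feed(*sources):
    seen, feed = set(), []
    for source in sources:
        for article in source:
            if article["url"] in seen:
                continue  # same story scraped from two places
            seen.add(article["url"])
            text = article["headline"].lower()
            if any(kw in text for kw in KEYWORDS):
                feed.append(article)
    # newest first, so analysts see fresh items at the top
    return sorted(feed, key=lambda a: a["published"], reverse=True)

newspapers = [
    {"url": "https://news.example/1", "headline": "Acme CEO steps down",
     "published": "2020-03-02"},
    {"url": "https://news.example/2", "headline": "Sunny weather ahead",
     "published": "2020-03-03"},
]
tweets = [
    {"url": "https://news.example/1", "headline": "Acme CEO steps down",
     "published": "2020-03-02"},  # duplicate of the newspaper item
    {"url": "https://t.example/9", "headline": "New CFO appointed at Globex",
     "published": "2020-03-04"},
]
for item in build_feed(newspapers, tweets):
    print(item["headline"])
```

The weather headline is filtered out, the duplicate story is collapsed to one entry, and the remaining items arrive as a single chronological feed.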

### 3. Extract knowledge from text

We’ve already saved a lot of time by surfacing only the most valuable information. However, there is still work to do: working on raw documents is hard, and we need to extract exactly the information the client needs. AI helps with that as well.
Data scientists call this technique named entity extraction. It identifies information important to the investor in the text, based on given criteria. In this case, we extract the company name, the position, and the reason, and put them in a table row for later use. At this point, the data can be exported to JSON or CSV, or made accessible via an API.
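To make the step concrete, here is a deliberately simplified, pattern-based sketch of pulling (company, position, reason) out of a headline and exporting it as JSON. A production system would use a trained Named Entity Recognition model rather than a hand-written regex, and the headline here is invented:

```python
# Sketch: pattern-based extraction of (company, position, reason) from a
# headline, exported as JSON. A trained NER model replaces the regex in
# a real pipeline; this only illustrates the input/output shape.
import json
import re

PATTERN = re.compile(
    r"(?P<position>CEO|CFO|Chairman) of (?P<company>[\w &]+) "
    r"(?:steps down|resigns) (?:amid|after|over) (?P<reason>.+)"
)

def extract(headline):
    """Return a dict of extracted fields, or None if nothing matches."""
    m = PATTERN.search(headline)
    return m.groupdict() if m else None

row = extract("CEO of Acme Corp steps down amid restructuring")
print(json.dumps(row, indent=2))
```

The resulting dict is the "table row" mentioned above; serialising a list of such rows gives the JSON/CSV export, and returning them from a web endpoint gives the API access.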

## Real-time News Analytics For Investors

At Sigmoidal we help investors gather and process crucial unstructured data with AI. The Sigmoidal MarketMove™ engine started as a news analytics platform that used Natural Language Processing to identify fresh investment opportunities. At an early stage, the tool processed textual data from over 250 news sources and searched for specific documents such as new regulations. Today, the engine is versatile and fits various use cases: MarketMove discovers specific investment opportunities for a fund manager, performs due diligence, or identifies fraud for a risk consulting corporation. MarketMove™ scans media sources in 7 languages and adapts to a range of increasingly demanding uses.

Just in the last month, the MarketMove™ platform identified several spin-offs. For example, it discovered GSK’s prospective joint venture with Pfizer and the logistics company Transplace merging with Celtic International. The tool also helps investors identify opportunities among distressed companies. For instance, MarketMove™ flagged the potential takeover of a French furniture manufacturer that had recently been placed in receivership. It also highlighted a jewelry retailer, Links of London, going into administration, and identified the wafer biscuit maker Rivington Biscuits selling its assets and filing for insolvency. In total, MarketMove™ found over 230 distressed companies over the last 30 days.

## Use In Event-Driven Strategies

Michael, Sigmoidal’s product manager, explains that the MarketMove™ engine finds use in stock market forecasting and time series analysis. “The datasets accessible through our API contain multiple data points for each collected news article, and there are 80,000 new articles a day from 1,000+ news sources. Every single record matches a specific company and/or ticker with a sentiment score and an accuracy metric. As a consequence, unquantifiable data becomes quantifiable and accessible to various stock market prediction models,” he says. Leveraging MarketMove™ enables data engineers to run more accurate, better-performing stock market prediction models.
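One way records of the kind Michael describes can feed a prediction model is by collapsing the per-article stream into one signal per ticker. The record layout and numbers below are assumptions based only on the fields he mentions (ticker, sentiment score, accuracy metric):

```python
# Sketch: aggregate per-article records into one sentiment signal per
# ticker, weighting each article's sentiment by its accuracy metric.
from collections import defaultdict

def sentiment_signal(records):
    """Accuracy-weighted mean sentiment per ticker."""
    num, den = defaultdict(float), defaultdict(float)
    for r in records:
        num[r["ticker"]] += r["sentiment"] * r["accuracy"]
        den[r["ticker"]] += r["accuracy"]
    return {t: num[t] / den[t] for t in num}

records = [
    {"ticker": "EBAY", "sentiment": 0.8, "accuracy": 0.9},
    {"ticker": "EBAY", "sentiment": -0.2, "accuracy": 0.3},
    {"ticker": "TWTR", "sentiment": 0.5, "accuracy": 0.6},
]
print(sentiment_signal(records))
```

The accuracy weighting lets low-confidence articles contribute less, so the aggregated number is a steadier feature for a downstream forecasting model.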

“Investors also use MarketMove™ in opportunistic strategies,” he continues. “Money managers often rely on special situations in the market. Events like carve-outs or restructurings cause movements in stock prices, and our algorithms detect such events early. For instance, when a company files an M&A agreement or media sentiment surges prior to a huge layoff, our system can detect it and alert our users. When leveraging such strategies, timing is very important.”
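The sentiment-surge alert he mentions can be sketched, purely illustratively, as a deviation test against a trailing baseline. The window size, threshold, and sample series are assumptions, not MarketMove's actual rule:

```python
# Sketch of a simple surge alert: flag a ticker when the latest daily
# mean sentiment deviates sharply from its trailing baseline.
from statistics import mean, pstdev

def surge_alert(daily_scores, window=7, z_threshold=2.0):
    """daily_scores: chronological daily mean sentiment scores.
    Returns True if the latest day is an outlier vs the trailing window."""
    *history, today = daily_scores[-(window + 1):]
    base, spread = mean(history), pstdev(history)
    if spread == 0:
        return today != base
    return abs(today - base) / spread > z_threshold

calm  = [0.1, 0.0, 0.1, -0.1, 0.0, 0.1, 0.0, 0.05]   # no alert
spike = [0.1, 0.0, 0.1, -0.1, 0.0, 0.1, 0.0, -0.9]   # sudden negative surge
print(surge_alert(calm), surge_alert(spike))  # → False True
```

Triggering on the z-score rather than a fixed sentiment level makes the alert adapt to each company's normal level of news chatter.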

Currently, the MarketMove™ engine tracks the reputation of more than 6,000,000 companies and 10,000,000+ individuals globally. The system assesses their involvement in news around the globe and lets users set up custom alerts for exactly when special situations occur. The MarketMove™ engine is also available as a web-based application; with an easy-to-use interface, the platform enables users to discover new companies or individuals based on recent market events.
