Where Syntax Ends and Semantics Begin. Why Should We Care?
Mar 24, 2020
Note: This article was originally published on March 24, 2020 and has been migrated from our previous blog. Some details — tools, libraries, benchmarks, industry context — may be outdated. For our latest perspective, see our recent posts.
The relation between syntax (how words are structured in a sentence) and semantics (how words contribute to the meaning of a sentence) is a long-standing open question in linguistics. It also happens to have practical consequences for NLP. In this blog post, I review recent work on disentangling syntactic and semantic information when training sentence autoencoders. These models are variational autoencoders with two latent variables and auxiliary loss functions specific to the semantic and the syntactic representation. For instance, they may require the syntactic representation of a sentence to be predictive of word order and the semantic representation to be predictive of the (unordered) set of words in the sentence. I then go on to argue that sentence embeddings separating syntax from semantics can have a variety of uses in conditional text generation and may provide robust features for multiple downstream NLP tasks.
The ability of word2vec embeddings [1] to capture semantic and syntactic properties of words as geometric relations between their vector representations is almost common knowledge now. For instance, the word embedding for "king" minus the word embedding for "man" plus the word embedding for "woman" will lie close to "queen" in the embedding space. Similarly, trained word embeddings can solve syntactic analogy tasks such as "quickly" - "quick" + "slow" = "slowly". But from a purely statistical point of view, the difference between syntax and semantics is arbitrary. Word embeddings themselves do not distinguish between the two: the word embedding for "quick" will be in the vicinity of both "quickly" (adverb) and "fast" (synonym). This is because word embeddings (this applies to word2vec but also to more powerful contextual word embeddings, such as those produced by BERT [2]) are optimized to predict words based on their context (or vice versa). Context can be semantic (the meaning of neighbouring words) as well as syntactic (the syntactic function of neighbouring words). From the point of view of a neural language model, learning that a verb must agree with a personal pronoun ("do" is unlikely when preceded by "she") is not fundamentally different from learning that it must maintain coherence with the rest of the sentence ("rubble" is unlikely when preceded by "I ate").
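To make the analogy arithmetic concrete, here is a minimal sketch in plain Python. The four-dimensional vectors are invented for illustration and are not real word2vec embeddings; `EMBEDDINGS`, `cosine` and `analogy` are hypothetical names introduced only for this example.

```python
import math

# Toy embeddings, invented for illustration only (not trained word2vec vectors).
# The dimensions loosely stand for: royalty, male, female, fruit.
EMBEDDINGS = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word closest to a - b + c, excluding the three query words."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the query words from the candidates is a common convention in analogy evaluation; without it, the nearest neighbour of the target vector is often one of the inputs.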

It seems that we need a more fine-grained training objective to force a neural model to distinguish between syntax and semantics. This is what motivates some recent approaches to learning two separate sentence embeddings for a sentence — one focusing on syntax, and the other on semantics.
The variational autoencoder (VAE) is a popular architectural choice for unsupervised learning of meaningful representations [3]. The VAE's training objective is simply to encode an object $x$ into a vector representation (more precisely, a probability distribution over vector representations) such that it is possible to reconstruct $x$ based on this vector (or a sample from the distribution over these vectors). Although VAE research focuses on images, the model can also be applied in NLP, where our $x$ is a sentence [4]. In such a setting, the VAE encodes a sentence $x$ into a probabilistic latent space $q(z|x)$ and then tries to maximize the likelihood of its reconstruction $p(x|z)$ given a sample from the latent space $z \sim q(z|x)$. The distributions $p(x|z)$ and $q(z|x)$, usually implemented as recurrent neural networks, can be seen as a decoder and an encoder, respectively. The model is trained to minimize the following loss function:

$$\mathcal{L}_{\text{VAE}}(x) = -\mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] + \mathrm{KL}\left(q(z|x) \,\|\, p(z)\right)$$

where $p(z)$ is assumed to be a Gaussian prior and the Kullback-Leibler divergence $\mathrm{KL}$ between $q(z|x)$ and $p(z)$ is a regularization term.
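As a rough illustration of the two terms in the VAE loss, the sketch below assumes a diagonal Gaussian posterior and a standard normal prior, for which the KL term has a closed form. The function names are hypothetical, and the reconstruction log-likelihood is passed in as a number rather than computed by a real decoder.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), with log_var = log(sigma^2)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def vae_loss(reconstruction_log_likelihood, mu, log_var):
    """Negative reconstruction log-likelihood plus the KL regularizer."""
    return -reconstruction_log_likelihood + kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
mu, log_var = [1.0, 0.0], [0.0, 0.0]
z = sample_latent(mu, log_var, rng)          # one sample from the posterior
print(kl_to_standard_normal(mu, log_var))    # 0.5
print(vae_loss(-2.0, mu, log_var))           # 2.5
```

In practice the expectation over the posterior is usually approximated with a single sample of the latent variable per sentence, which is what `sample_latent` would feed to the decoder.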
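Recently, two extensions of the VAE framework have been independently proposed: VG–VAE (von Mises–Fisher Gaussian Variational Autoencoder) [5] and DSS–VAE (disentangled syntactic and semantic spaces of VAE) [6]. These extensions replace the single latent variable $z$ with two separate latent variables encoding the meaning of a sentence ($z_{sem}$) and its syntactic structure ($z_{syn}$). I will jointly refer to these models as sentence autoencoders disentangling semantics and syntax (SADSS). Disentanglement in SADSS is achieved via a multi-task objective. Auxiliary loss functions $\mathcal{L}_{sem}$ and $\mathcal{L}_{syn}$, separate for semantic and syntactic representations, are added to the VAE loss function with two latent variables (up to optional weighting coefficients on the individual terms):

$$\mathcal{L}(x) = -\mathbb{E}_{q(z_{sem}|x)\, q(z_{syn}|x)}\left[\log p(x|z_{sem}, z_{syn})\right] + \mathrm{KL}\left(q(z_{sem}|x) \,\|\, p(z_{sem})\right) + \mathrm{KL}\left(q(z_{syn}|x) \,\|\, p(z_{syn})\right) + \mathcal{L}_{sem} + \mathcal{L}_{syn}$$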
There are several choices for the auxiliary loss functions $\mathcal{L}_{sem}$ and $\mathcal{L}_{syn}$. The semantic loss $\mathcal{L}_{sem}$ might require the semantic representation $z_{sem}$ to predict the bag of words contained in $x$ (DSS–VAE) or to discriminate between a sentence $x^+$ paraphrasing $x$ and a dissimilar sentence $x^-$ (VG–VAE). The syntactic loss $\mathcal{L}_{syn}$ might require the syntactic representation $z_{syn}$ to predict a linearized parse tree of $x$ (DSS–VAE) or to predict a position $i$ for each word $x_i$ in $x$ (VG–VAE). DSS–VAE also uses adversarial losses, ensuring that (i) $z_{syn}$ is a poor predictor of the semantic target, (ii) $z_{sem}$ is a poor predictor of the syntactic target, and that (iii) neither $z_{sem}$ nor $z_{syn}$ alone is sufficient to reconstruct $x$. Crucially, both auxiliary losses $\mathcal{L}_{sem}$ and $\mathcal{L}_{syn}$ are motivated by the assumption that syntax pertains to the ordering of words, while semantics deals with their lexical meanings.
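The sketch below illustrates the kind of prediction targets these auxiliary losses rely on: a bag-of-words target that is blind to word order, and a position target that depends on nothing else. It is a simplified illustration, not the implementation from either paper; the function names and the binary cross-entropy form of the bag-of-words loss are my assumptions.

```python
import math


def bag_of_words_target(tokens, vocab):
    """Multi-hot vector over the vocabulary: 1.0 if the word occurs in the sentence."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocab]


def position_targets(tokens):
    """One (word, position) pair per token, i.e. the word-order information."""
    return [(token, i) for i, token in enumerate(tokens)]


def bow_loss(predicted_probs, target, eps=1e-12):
    """Binary cross-entropy between predicted word probabilities and the target (assumed form)."""
    return -sum(
        t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
        for p, t in zip(predicted_probs, target)
    )


vocab = ["cat", "dog", "mat", "on", "sat", "the"]
tokens = "the cat sat on the mat".split()
shuffled = "mat the on sat cat the".split()

# The semantic target ignores word order entirely...
print(bag_of_words_target(tokens, vocab) == bag_of_words_target(shuffled, vocab))  # True
# ...while the syntactic target changes as soon as the words are reordered.
print(position_targets(tokens) == position_targets(shuffled))  # False
```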
SADSS allow a number of applications in conditional text generation, including unsupervised paraphrase generation [7] and textual style transfer [8]. Generating a paraphrase of $x$ can be seen as generating a sentence $x'$ that shares the meaning of $x$ but expresses it with different syntax. Paraphrases can be sampled by greedily decoding $p(x'|z_{sem}, z_{syn})$, where $z_{sem} \sim q(z_{sem}|x)$ and $z_{syn} \sim p(z_{syn})$.

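Similarly, one can pose textual style transfer as the problem of producing a new sentence that captures the meaning of some sentence $x_1$ but borrows the syntax of another sentence $x_2$. In SADSS terms, this amounts to decoding from $z_{sem} \sim q(z_{sem}|x_1)$ and $z_{syn} \sim q(z_{syn}|x_2)$.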
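The recombination of latent codes behind both applications can be sketched with a deliberately naive toy. It is not a neural model: it takes the simplifying assumption literally, treating the sorted bag of words as the "semantic code" and the word order as the "syntactic code". All names (`encode`, `decode`, `paraphrase`, `transfer`) are hypothetical.

```python
import random


def encode(sentence):
    """Split a sentence into a 'semantic' code (sorted words) and a 'syntactic' code (order)."""
    tokens = sentence.split()
    order = sorted(range(len(tokens)), key=lambda i: (tokens[i], i))
    z_sem = [tokens[i] for i in order]
    z_syn = [0] * len(tokens)
    for rank, i in enumerate(order):
        z_syn[i] = rank  # position i holds the word with this rank in z_sem
    return z_sem, z_syn


def decode(z_sem, z_syn):
    """Rebuild a sentence by placing the words of z_sem in the order given by z_syn."""
    return " ".join(z_sem[rank] for rank in z_syn)


def paraphrase(sentence, rng):
    """Keep the semantic code of the input; sample the syntactic code from a 'prior'."""
    z_sem, _ = encode(sentence)
    z_syn = rng.sample(range(len(z_sem)), len(z_sem))  # uniform random word order
    return decode(z_sem, z_syn)


def transfer(content_sentence, style_sentence):
    """Combine the semantic code of one sentence with the syntactic code of another."""
    z_sem, _ = encode(content_sentence)
    _, z_syn = encode(style_sentence)
    if len(z_sem) != len(z_syn):
        raise ValueError("this toy only supports sentences of equal length")
    return decode(z_sem, z_syn)


print(decode(*encode("the dog chased the cat")))        # the dog chased the cat
print(transfer("alice greets bob", "zebras always run"))  # greets alice bob
print(paraphrase("the dog chased the cat", random.Random(0)))
```

The toy's outputs are mostly ungrammatical, which is the point of using a learned decoder: a real SADSS model has to produce fluent text from continuous codes rather than permute words.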

There is one further application of unsupervised paraphrase generation: data augmentation. Data augmentation means generating synthetic training data by applying label-preserving transformations to available training data. Data augmentation is far less popular in NLP than in computer vision and other fields, partly due to the difficulty of finding task-agnostic transformations of sentences that preserve their meaning. Indeed, Sebastian Ruder lists task-independent data augmentation for NLP as one of the core open research problems in machine learning today. Unsupervised paraphrase generation might be a viable alternative to methods such as backtranslation. Backtranslation produces a synthetic sentence capturing the meaning of an original $x$ by first machine translating $x$ into some other language (e.g. French) and then translating the result back to English [9]. A more principled approach would be to use SADSS and generate synthetic sentences by conditioning on the meaning of $x$ captured in $z_{sem}$ but sampling $z_{syn}$ from a prior distribution to ensure syntactic diversity.
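A minimal sketch of how a paraphraser would plug into data augmentation is given below. The `augment` helper and the stand-in paraphraser are hypothetical; in a real setting `paraphrase_fn` would wrap a trained paraphrase model (or a backtranslation pipeline), and the approach only works if that function really preserves the label.

```python
def augment(dataset, paraphrase_fn, copies=1):
    """Return the original (sentence, label) pairs plus synthetic paraphrases.

    Each synthetic sentence inherits the label of the sentence it was derived from.
    """
    augmented = list(dataset)
    for sentence, label in dataset:
        for _ in range(copies):
            augmented.append((paraphrase_fn(sentence), label))
    return augmented


def swap_clauses(sentence):
    """Stand-in paraphraser for the demo: swaps two clauses joined by ' and '."""
    left, sep, right = sentence.partition(" and ")
    return f"{right} and {left}" if sep else sentence


train = [
    ("the plot was thin and the acting was great", "mixed"),
    ("a wonderful film", "positive"),
]
for sentence, label in augment(train, swap_clauses):
    print(label, "|", sentence)
```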
While most research has focused on applying SADSS to natural language generation, representation learning applications remain relatively underexplored. One can imagine, however, using SADSS to produce task-agnostic sentence representations [10] that can be used as features in various downstream applications, including document classification and question answering. Syntax–semantics disentanglement seems to bring some additional benefits to the table that even more powerful models, such as BERT, might lack.
First, representations produced by SADSS may be more robust to distribution shift. Assuming that stylistic variation will be mostly captured by $z_{syn}$, we can expect the semantic representation $z_{sem}$ to generalize better across stylistically diverse documents. For instance, we can expect a SADSS model trained on the Wall Street Journal portion of the Penn Treebank to outperform a baseline model when generalizing to Twitter data.
Second, SADSS might be fairer. Raw text is known to be predictive of some demographic attributes of its author, such as gender, race or ethnicity [11]. Most approaches to removing information about sensitive attributes from a representation, such as adversarial training [12], require access to these attributes at training time. However, disentanglement of representations has been observed to correlate consistently with increased fairness across several downstream tasks [13] without the need to know the protected attribute in advance. This raises the question of whether disentangling semantics from syntax also improves fairness, understood as blindness to demographic attributes. Assuming that most demographic information is captured by syntax, one can conjecture that a disentangled semantic representation would be fairer in this sense.
Finally, learning disentangled representations for language is sometimes conjectured to be part of a larger endeavor of building AI capable of symbolic reasoning. Consider syntactic attention, an architecture that separates the flow of semantic and syntactic information and is inspired by models of language comprehension in computational neuroscience. It was shown to offer improved compositional generalization [14]. The authors further argue that the results are due to a decomposition of a difficult out-of-domain (o.o.d.) generalization problem into two separate i.i.d. generalization problems: learning the meanings of words and learning to compose words. Disentangling the two allows the model to refer to particular words indirectly (abstracting away from their meaning), which is a step towards emulating symbol manipulation in a differentiable architecture. This is one of the research directions laid down by Yoshua Bengio in his NeurIPS 2019 keynote, From System 1 Deep Learning to System 2 Deep Learning.
Isn’t it naïve to assume that syntax boils down to word order, and that the meaning of a sentence is nothing more than the bag of words used in it? Surely, it is. The assumptions embodied in $\mathcal{L}_{sem}$ and $\mathcal{L}_{syn}$ are highly questionable from a linguistic point of view. There are a number of linguistic phenomena that seem to escape these loss functions or occur at the syntax–semantics interface. These include predicate-argument structure (especially considering the dependence of subject and object roles on context and syntax) and function words (e.g. prepositions). Moreover, what $\mathcal{L}_{sem}$ and $\mathcal{L}_{syn}$ capture may be quite specific to how the grammar of English works. While English indeed encodes the grammatical function of constituents primarily through word order, other languages (such as Polish) exhibit much looser word order and mark grammatical function via case inflection, relying on an array of orthographically different word forms.
Interpreting $z_{sem}$ and $z_{syn}$ as semantic and syntactic is therefore somewhat hand-wavy and seems to provide little insight into the nature of language. Nevertheless, SADSS demonstrate impressive results in paraphrase generation and textual style transfer and show promise for several applications, including data augmentation as well as robust representation learning. They may deserve interest in their own right, despite offering a distorted picture of how language works.