Word embeddings and how they vary

23 Jul 2018

[ natural-language-processing word-embeddings language machine-learning ]

Consider for a moment the complexity of human language. Humans can effortlessly process nuanced expressions such as simile (“cool as a cucumber”), sarcasm (“Just great! I failed that test”), and allusion (“don’t be such a Scrooge”). Our vocabulary is also changing – in March 2018, 850 new words were added to the Merriam-Webster Dictionary, including “cryptocurrency” and “chiweenie” (a cross between a Chihuahua and a dachshund).

*New vocabulary word of the day: chiweenie* (photo credit: thehappypuppysite.com)

Even the same word can mean different things in different contexts. Consider the word bar:

After work, the man went to go get a drink at the bar.
This is a really tasty chocolate bar!
The recent law school graduate just passed his bar exam.
The beginning ballet students held onto the bar for balance.

*A **bar** that serves chocolate **bars**? Oh, the ambiguity!* (photo credit: Mahmoud Azab)

Given all of this complexity, how is a computer system supposed to understand words and language? Answering this question is a major goal within the field of natural language processing (NLP). When we input a word into a computer, we would ideally like the computer to understand the meaning of that word, as well as its common usage and properties. For example, given the word “cat,” we want a computer to understand that a cat is an animal, that it has fur and four legs, that it’s commonly kept as a household pet, etc. We also want it to understand how each word interacts with other words – cats like to sunbathe and catch mice and boss around their personal servants (er, owners). We also want to know that a cat is more similar to a dog, for instance, than it is to a car. A dictionary provides some of this information, but it also omits many practical details, due to its brevity.

How else can we get information about a word? One way is to look at the context of the word (the surrounding words). As an example, let’s suppose that we’ve come across a word we’ve never seen before: theraw. Here are a few sentences for this new word:

What delicious theraw!
I wish that I could make theraw the way that she makes it.
theraw was the perfect ending to a satisfying meal.

After seeing these sentences, we have a pretty good guess about what the word theraw means – it’s probably some kind of special dessert. We know this because theraw is described as “delicious”, a word typically associated with food. We also see theraw discussed in the context of a meal (specifically the end of a meal), and we see an example of someone making theraw, again an action that can be associated with food. Furthermore, we have seen other words that we know appear in very similar sentences (e.g., “What delicious pie!”), so we can deduce that theraw is probably related to these other words.
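To make the idea of "looking at context" concrete, here is a minimal sketch in Python that counts the words appearing near a target word. The function name `context_counts`, the window size of 2, and the simple whitespace tokenization are all assumptions made for illustration; real embedding methods use far larger text collections and more careful preprocessing.

```python
from collections import Counter

def context_counts(sentences, target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        # Crude tokenization: split on whitespace, strip punctuation, lowercase.
        tokens = [t.strip(".,!?").lower() for t in sentence.split()]
        tokens = [t for t in tokens if t]
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

sentences = [
    "What delicious theraw!",
    "I wish that I could make theraw the way that she makes it.",
    "theraw was the perfect ending to a satisfying meal.",
]

theraw_contexts = context_counts(sentences, "theraw")
pie_contexts = context_counts(["What delicious pie!"], "pie")

# Words that show up near both "theraw" and "pie".
shared = set(theraw_contexts) & set(pie_contexts)
print(theraw_contexts)
print(shared)
```

Even with three sentences, theraw and pie already share context words such as "delicious", which is the kind of overlap that lets us guess the two words are related.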

In short, if we want to gather meaningful information about a particular word, the words surrounding that word are a very good place to start. This context-based approach is a key concept behind word embeddings, a popular idea in computer science right now. Essentially, a word embedding is a group of numbers (a vector) that represents a word. There are different ways to generate word embeddings, but almost all of them use context to extract meaningful information about a word. We can think about a word embedding as a single point somewhere in space. When you look at many of these individual “points” together, you can see how they relate to one another. For example, consider this famous illustration:

Here, we can see four points in space, representing king, queen, man, and woman. The location of each point (the word embedding) is determined by the particular group of numbers associated with that word. We can also see that the relationship between king and queen is almost identical to the relationship between man and woman! This tells us that the word embeddings are capturing some analogy information from the words – king is to queen as man is to woman. Here’s another example:

Here, we see word embeddings of countries and capital cities. Lines are drawn between each country and its capital city, and we can see that the relationship between countries and their capitals is very similar across multiple countries. This is exciting because it shows us that word embeddings are capturing meaningful information about the words.
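The analogy relationships described above can be sketched with a little vector arithmetic. The two-dimensional vectors below are invented purely for illustration (real embeddings usually have hundreds of dimensions and are learned from text), and the helper names `cosine` and `analogy` are hypothetical. The sketch computes king − man + woman and looks for the closest remaining word.

```python
import math

# Hypothetical 2-D embeddings, made up for this example.
# Dimension 0 loosely encodes "royalty", dimension 1 loosely encodes "gender".
embeddings = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
    "child": [0.1, 0.5],  # distractor word
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ?" using vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words, then pick the most similar remaining word.
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("man", "king", "woman", embeddings)
print(result)

# The king->queen offset matches the man->woman offset in this toy space.
offset_royal = [q - k for k, q in zip(embeddings["king"], embeddings["queen"])]
offset_plain = [w - m for m, w in zip(embeddings["man"], embeddings["woman"])]
print(offset_royal, offset_plain)
```

In this toy space the two offsets are identical by construction; in embeddings learned from real text they are only approximately parallel, which is why the search uses a similarity measure rather than an exact match.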

So what are the applications of word embeddings? For any word, researchers have at their disposal a word embedding that captures meaningful information drawn from the context of the words. This gives a computer system the ability to better understand word meanings, which is useful in lots of other tasks. For example, word embeddings can be used to perform more precise translations between languages or to more accurately summarize long documents. Researchers use embeddings to predict the sentiment (happy or sad) of a piece of writing and to classify articles into certain categories (e.g., news or sports or entertainment).

To create such word embeddings, we generally use very large collections of texts, which capture how different words occur in different contexts. Of course, this raises the question of whether the embeddings that we create depend in any way on the collection being used. Do we get the same embeddings from a large collection as from a small one? Or from a collection of texts on sports as from a collection of texts on arts?

To answer these questions about different text sources, I, along with my collaborators Rada Mihalcea and Jonathan Kummerfeld, have recently been analyzing word embeddings. One of the things we found is that word embeddings change if you use different collections of text as input. This is because words tend to have different contexts when they are used to discuss different topics. For example, consider the following two sentences using the word power, one from the Arts section of The New York Times (NYT) and one from its Sports section:

Arts: Eric Owens, a bass, offered a fine balance of gentleness and power in Raphael’s music in parts 1 and 3, and as Adam in part 5.

Sports: What has never been in doubt is that Clarett is a special athlete, a thrilling combination of power and speed.

Though the same word, power, is being used in both sentences, it is being used in different ways. In the first sentence, power denotes strength of voice, while in the second sentence, power denotes strength of body. Many words have different connotations when used to talk about different topics. Because of this, changing the collection of text changes the word embeddings produced. This has implications for how word embeddings are used – if a researcher is using embeddings to summarize documents about art, then it will likely be more effective to use embeddings trained on a collection of texts about art, rather than a collection of texts about sports. For more details, as well as results from other embedding experiments, take a look at our paper (listed under Research Papers below).
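One simple way to see how much a word's embedding differs between two text collections is to compare its nearest neighbors in each embedding space. The sketch below is an illustration under stated assumptions, not necessarily the exact measure used in our paper: the vectors are invented, and the names `nearest_neighbors` and `neighbor_overlap` are hypothetical. Neighbors are compared instead of raw numbers because separately trained embedding spaces are not aligned with each other, so their coordinates are not directly comparable.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, vectors, k=2):
    """Return the set of the k words most similar to `word`."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return set(others[:k])

def neighbor_overlap(word, space_a, space_b, k=2):
    """Fraction of the k nearest neighbors shared between two spaces."""
    shared = nearest_neighbors(word, space_a, k) & nearest_neighbors(word, space_b, k)
    return len(shared) / k

# Made-up embeddings standing in for models trained on Arts vs. Sports text.
arts = {
    "power": [0.9, 0.2], "voice": [0.8, 0.3], "gentleness": [0.7, 0.4],
    "speed": [0.1, 0.9], "athlete": [0.2, 0.8],
}
sports = {
    "power": [0.2, 0.9], "voice": [0.9, 0.1], "gentleness": [0.8, 0.3],
    "speed": [0.3, 0.9], "athlete": [0.1, 0.8],
}

arts_neighbors = nearest_neighbors("power", arts)
sports_neighbors = nearest_neighbors("power", sports)
overlap = neighbor_overlap("power", arts, sports)
print(arts_neighbors, sports_neighbors, overlap)
```

In this toy setup, power sits near voice and gentleness in the Arts space but near speed and athlete in the Sports space, so the overlap is zero. A word whose usage is similar across topics would keep most of its neighbors and score closer to 1.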

Word embeddings are a powerful tool to help computers understand language better. Having a better understanding of how these embeddings work will help us create more effective embeddings, and handle more complex language in a computer system!

If you’re interested in learning more about word embeddings, check out the following resources:

Blog Posts
“On Word Embeddings - Part 1” by Sebastian Ruder [link]
“Deep Learning, NLP, and Representations” by Christopher Olah [link]

Textbooks
Speech and Language Processing (3rd ed. draft) by Dan Jurafsky and James H. Martin - Chapter 15, “Vector Semantics”; Chapter 16, “Semantics with Dense Vectors” [link]
Natural Language Processing by Jacob Eisenstein - Chapter 14, “Distributional and Distributed Semantics” [link]
Neural Network Methods for Natural Language Processing by Yoav Goldberg - Chapter 10, “Pre-trained Word Representations”; Chapter 11, “Using Word Embeddings” [link]

Code for Word Embeddings
“Word2vec Tutorial” by Radim Řehůřek [link]
“GloVe: Global Vectors for Word Representation” by Jeffrey Pennington, Richard Socher, and Christopher D. Manning [link]
“Quora: Where can I find some pre-trained word vectors for natural language processing/understanding?” [link]

Research Papers
[our paper] Wendlandt, Laura, Jonathan K. Kummerfeld, and Rada Mihalcea. “Factors Influencing the Surprising Instability of Word Embeddings.” NAACL-HLT (2018). [link]
Camacho-Collados, Jose, and Mohammad Taher Pilehvar. “From Word to Sense Embeddings: A Survey on Vector Representations of Meaning.” arXiv preprint arXiv:1805.04032 (2018). [link]