Word Embedding

Avik Das
4 min read · Aug 22, 2020

Word Embedding => a collective term for models that learn to map a set of words or phrases in a vocabulary to vectors of numerical values.

Neural Networks are designed to learn from numerical data.

Word embedding is really about improving the ability of networks to learn from text data by representing that data as lower-dimensional vectors. These vectors are called embeddings.

This technique is used to reduce the dimensionality of text data, but these models can also learn interesting properties of the words in a vocabulary.

How is it done?

The general approach for dealing with words in text data is to one-hot encode the text. A typical vocabulary contains tens of thousands of unique words, so computations with one-hot encoded vectors are very inefficient: most values in a one-hot vector are 0, and the matrix multiplication between a one-hot vector and the first hidden layer wastes almost all of its work multiplying by zeros.
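As a rough sketch (the tiny vocabulary here is made up for illustration), one-hot encoding looks like this:

```python
import numpy as np

# A toy vocabulary; a real one would have tens of thousands of words.
vocab = ["cold", "cool", "warm", "hot"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size):
    """Return a vector that is all zeros except for a 1 at the word's index."""
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cool", len(vocab)))  # [0. 1. 0. 0.]
```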

We use embeddings to solve this problem and greatly improve the efficiency of our network. An embedding layer is just like a fully-connected layer; we call this layer the embedding layer and its weights the embedding weights.

Now, instead of doing the matrix multiplication between the input and the hidden layer, we directly grab the values from the embedding weight matrix. We can do this because multiplying a one-hot vector by a weight matrix simply returns the row of the matrix corresponding to the index of the ‘1’ input unit.
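A quick sketch shows that the two operations give the same result (the weight values here are random placeholders):

```python
import numpy as np

vocab_size, embedding_dim = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embedding_dim))  # embedding weight matrix

one_hot = np.zeros(vocab_size)
one_hot[1] = 1.0  # the word at index 1 is "on"

# The matrix multiplication and the direct row lookup are identical.
assert np.allclose(one_hot @ W, W[1])
```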

So we use this weight matrix as a lookup table. We encode the words as integers; for example, ‘cool’ is encoded as 512 and ‘hot’ is encoded as 764. Then, to get the hidden layer output for ‘cool’, we simply look up the 512th row of the weight matrix. This process is called an embedding lookup, and the number of dimensions in the hidden layer output is the embedding dimension.
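Sketching the lookup directly (the vocabulary size and weight values are placeholders; the indices 512 and 764 follow the example above):

```python
import numpy as np

vocab_size, embedding_dim = 10_000, 300
embedding_weights = np.random.normal(size=(vocab_size, embedding_dim))

word_to_index = {"cool": 512, "hot": 764}  # integer encoding of the vocabulary

# Embedding lookup: no matrix multiplication, just row indexing.
cool_vector = embedding_weights[word_to_index["cool"]]
print(cool_vector.shape)  # (300,) — the embedding dimension
```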

To reiterate:

a) The embedding layer is just a hidden layer

b) The lookup table is just an embedding weight matrix

c) The lookup is just a shortcut for matrix multiplication

d) The lookup table is trained just like any other weight matrix (see the sketch below)
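For point (d), a framework like PyTorch exposes the lookup table as a layer whose weights receive gradients like any other layer. A minimal sketch, with arbitrary sizes:

```python
import torch
import torch.nn as nn

# An embedding layer: a lookup table of shape (vocab_size, embedding_dim).
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=300)

word_ids = torch.tensor([512, 764])   # integer-encoded words, e.g. "cool", "hot"
vectors = embedding(word_ids)         # shape: (2, 300) — pure lookup, no matmul

# The lookup table is an ordinary weight matrix, so it is trained by backprop.
loss = vectors.sum()
loss.backward()
print(embedding.weight.grad.shape)    # torch.Size([10000, 300])
```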

Popular off-the-shelf word embedding models in use today:

  1. Word2Vec (by Google)
  2. GloVe (by Stanford)
  3. fastText (by Facebook)

Word2Vec:

This model is provided by Google and was trained on Google News data. The vectors have 300 dimensions, and the vocabulary covers about 3 million words and phrases.

The team used the skip-gram architecture with negative sampling to build this model, which was released in 2013.
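If you want to try these vectors, one common route is gensim. A sketch, assuming you have downloaded the pretrained Google News binary to the file name shown:

```python
from gensim.models import KeyedVectors

# Assumes the pretrained Google News vectors have been downloaded locally.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(w2v["cool"].shape)                  # (300,)
print(w2v.most_similar("cool", topn=3))   # nearest neighbours by cosine similarity
```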

GloVe:

Global Vectors for Word Representation (GloVe) is provided by Stanford. They provide several models with 25, 50, 100, 200 and 300 dimensions, trained on 2, 6, 42 and 840 billion tokens.

The team used word-to-word co-occurrence statistics to build this model. In other words, if two words co-occur many times, they likely have some linguistic or semantic relationship.
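GloVe vectors ship as plain text files, one word per line followed by its values, so they can be loaded without any special library. A sketch, assuming the 100-dimensional 6B-token release has been downloaded:

```python
import numpy as np

# Assumes a downloaded GloVe text file, e.g. glove.6B.100d.txt.
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        glove[word] = np.asarray(values, dtype=np.float32)

print(glove["cool"].shape)  # (100,)
```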

fastText:

This model was developed by Facebook. They provide three models, each with 300 dimensions.

fastText is able to achieve good performance for word representations and sentence classification because it makes use of character-level representations.

Each word is represented as a bag of character n-grams, in addition to the word itself. For example, for the word partial with n = 3, the character n-grams are <pa, par, art, rti, tia, ial, al>, where < and > are boundary symbols added at the beginning and end of the word to distinguish prefixes and suffixes from other character sequences.
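A minimal sketch of how the character n-grams for a word can be produced (the helper below is illustrative, not fastText's actual implementation):

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, using < and > as boundary symbols."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("partial"))
# ['<pa', 'par', 'art', 'rti', 'tia', 'ial', 'al>']
```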
