Word2Vec

Welcome! Before getting to know 'what' the Word2Vec neural network algorithm is, we must first understand the 'why' behind it, i.e. its purpose. The following description answers just that by introducing the concept of word embeddings. So stick around!

Word Embeddings

In order to use Word2Vec we must first understand what word embeddings are. Word embeddings are vectors assigned to each word in the vocabulary. These vectors lie in an n-dimensional space, where n typically ranges from about 300 to 1000, and each dimension can be thought of as a feature that describes the words in the vocabulary. For example, words like ‘man’ and ‘woman’ can be converted into vectors that have strongly contrasting values on the ‘gender’ axis (say -1 and 1 respectively), since they primarily describe a difference in gender; on features like ‘food’ or ‘colour’ they would both score low and would have similar, if not equal, values. The following table considers a few more words to illustrate this:

Word	Man	Woman	King	Queen	Apple	Orange	Sky
Gender	-1	1	-0.95	0.97	0.00	0.00	0.005
Colour	0.01	0.009	0.01	0.009	0.98	0.94	-0.88
Food	0.03	0.04	0.04	0.02	0.95	0.956	0.002

As can be seen, words like apple and orange score high on the ‘food’ feature and are also very similar to each other on the ‘colour’ feature, whereas the sky, which has an almost opposite colour (blue), has a large negative value on that feature. Word embeddings help us find analogies among words by minimizing a distance metric between two vectors as follows:

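If man : woman :: king : ?

E_man - E_woman = E_king - E_? + e

The value of E_? that minimizes the magnitude of e turns out to be the embedding of the required word, i.e. 'queen'. Rearranging the equation above gives the point that E_? should be closest to:

Hence: E_queen = argmin over E_? of D(E_king - E_man + E_woman, E_?)

where the distance function D is usually the cosine distance, although a root-mean-squared (Euclidean) distance is also used in some cases. In practice the search runs over the words in the vocabulary other than the three words given in the analogy.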
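
To make the analogy search concrete, here is a minimal sketch that applies it to the toy three-feature vectors from the table above. The function names `cosine_distance` and `solve_analogy` are made up for this illustration, and leaving the three query words out of the candidate set is an assumption of the sketch (without it, 'woman' itself would be the nearest vector to the query).

```python
import math

# Toy embeddings (gender, colour, food), copied from the table above.
embeddings = {
    "man":    [-1.0, 0.01, 0.03],
    "woman":  [1.0, 0.009, 0.04],
    "king":   [-0.95, 0.01, 0.04],
    "queen":  [0.97, 0.009, 0.02],
    "apple":  [0.0, 0.98, 0.95],
    "orange": [0.0, 0.94, 0.956],
    "sky":    [0.005, -0.88, 0.002],
}

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def solve_analogy(a, b, c, embeddings):
    """Return the word d that best completes a : b :: c : d."""
    # Query point: E_c - E_a + E_b
    query = [ec - ea + eb
             for ea, eb, ec in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Assumption: the three input words are not valid answers.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return min(candidates, key=lambda w: cosine_distance(query, embeddings[w]))

answer = solve_analogy("man", "woman", "king", embeddings)
print(answer)
```

With real, learned embeddings the same search would run over the whole vocabulary rather than seven hand-written vectors.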

Word embeddings can be thought of as the means by which an NLP model understands the meaning of a word and carries that understanding over to other, similar words. The embedding matrix can be treated as a parameter, and its values can be learned using the Word2Vec model, which will now be discussed.

Word2Vec Model

Now that you understand why we use Word2Vec, I shall explain what exactly it is. In simple terms, it is a shallow, 2-layer neural network model that helps us construct the embedding matrix. However, it differs from a regular neural network in that the outputs returned by the softmax layer are used only for learning the parameters of the embedding matrix (which turns out to be the weight matrix of the hidden layer); they serve no purpose in and of themselves once the model has been trained.
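
The claim that the hidden-layer weight matrix *is* the embedding matrix can be checked with a small sketch. It assumes NumPy, made-up toy sizes, and a hidden-layer weight W1 of shape (#emb, vocabulary size) as used in the step list further down; multiplying W1 by a one-hot vector simply selects one column, which is that word's embedding.

```python
import numpy as np

vocab_size, emb_dim = 5, 3                   # hypothetical toy sizes
rng = np.random.default_rng(0)
W1 = rng.normal(size=(emb_dim, vocab_size))  # hidden-layer weights: one column per word

word_index = 2                               # hypothetical label of some word
one_hot = np.zeros((vocab_size, 1))
one_hot[word_index] = 1.0

hidden = W1 @ one_hot                        # hidden-layer activation, shape (emb_dim, 1)
embedding = W1[:, word_index]                # direct column lookup

# The matrix product and the column lookup give the same vector.
print(np.allclose(hidden[:, 0], embedding))
```

This is why, after training, the softmax layer can be discarded and W1 kept as a lookup table.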

We will be using the Skip-Gram model, which works on the principle that similar words occur in similar sentences and hence tend to have similar contexts. Given a specified context window, it treats each context-target pair as a new observation. This tends to work better on large datasets than the Continuous Bag-of-Words (CBoW) model, which treats the entire context as one observation. Here is a step-wise guide to help understand a single epoch of the skip-gram algorithm:

1. Obtain a corpus of training data, i.e. a collection of sentences. The larger the corpus, the greater the chance that the model picks up similarities between words, and hence the more accurate the learnt embedding matrix.
2. Create a vocabulary consisting of all the unique words in the corpus and assign each one a label (index).
3. Generate context-target pairs by choosing a suitably sized context window.
4. Choose the number of embedding dimensions (#emb) and randomly initialize the weight matrices for the hidden layer and the softmax output layer as follows:
   - hidden layer weight (W1) has the dimensions (#emb, vocabulary_size)
   - output layer weight (W2) has the dimensions (vocabulary_size, #emb)
5. The input layer is a single one-hot vector corresponding to the 'context' word in the context-target pair. Hence it has the dimensions (vocabulary_size, 1). Multiplying it by W1 gives the (#emb, 1) hidden-layer activation, which is the embedding of that word.
6. The softmax layer outputs a (vocabulary_size, 1) dimensional probability vector.
7. The loss is computed using the softmax (cross-entropy) loss function w.r.t. the target one-hot vector. Calculate the gradients dW1 and dW2, and update the weights using a chosen learning rate.
8. The forward pass, loss and weight update above are repeated for every context-target pair.
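
The procedure above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the two-sentence corpus, the window size, the embedding size, the learning rate and the helper names (`make_pairs`, `softmax`, `run_epoch`) are all made up for the example, and a full softmax is used without the speed-ups (such as negative sampling) that practical implementations rely on.

```python
import numpy as np

# Hypothetical toy corpus; a real corpus would be far larger.
corpus = ["the king rules the land", "the queen rules the land"]

# Vocabulary: unique words, each with an integer label.
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sentence in tokens for word in sentence})
word_to_id = {word: i for i, word in enumerate(vocab)}
V = len(vocab)

def make_pairs(tokens, window=2):
    """Pair every word (the 'context' input) with each neighbour in its window (the target)."""
    pairs = []
    for sentence in tokens:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((word_to_id[word], word_to_id[sentence[j]]))
    return pairs

pairs = make_pairs(tokens)

# Randomly initialised weights with the shapes given in the list above.
emb_dim = 4
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(emb_dim, V))   # hidden layer (the embedding matrix)
W2 = rng.normal(scale=0.1, size=(V, emb_dim))   # softmax output layer

def softmax(z):
    e = np.exp(z - z.max())                     # subtract the max for numerical stability
    return e / e.sum()

def run_epoch(W1, W2, pairs, lr=0.05):
    """One pass over all pairs; updates W1 and W2 in place, returns the mean loss."""
    total_loss = 0.0
    for ctx, tgt in pairs:
        x = np.zeros((V, 1))
        x[ctx] = 1.0                            # one-hot input, shape (V, 1)
        h = W1 @ x                              # hidden activation, shape (emb_dim, 1)
        p = softmax(W2 @ h)                     # predicted probabilities, shape (V, 1)
        total_loss += -np.log(p[tgt, 0])        # cross-entropy against the one-hot target
        dz = p.copy()
        dz[tgt] -= 1.0                          # gradient w.r.t. the logits: p - y
        dW2 = dz @ h.T                          # shape (V, emb_dim)
        dW1 = (W2.T @ dz) @ x.T                 # shape (emb_dim, V)
        W2 -= lr * dW2
        W1 -= lr * dW1
    return total_loss / len(pairs)

losses = [run_epoch(W1, W2, pairs) for _ in range(100)]
print(f"mean loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

After training, the embedding of a word is the corresponding column of W1, for example `W1[:, word_to_id["king"]]`.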

All The Best !! 😃
