
What is Subsampling?

Language models assign probabilities to sequences of words, and they are widely used in many natural language processing applications. The probability of a sequence can be modeled as a product of local conditional probabilities:

P(w_1, w_2, \ldots, w_l) = \prod_{i=1}^{l} P(w_i \mid h_i)

Where:

'w_i' is the i-th word,
'h_i' is the word history w_1, ..., w_{i-1} preceding 'w_i', and
'i' runs from 1 to l.
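As a quick sanity check, the chain-rule product above can be evaluated for a toy sequence. The conditional probabilities below are made-up values for illustration, not outputs of a trained model:

```python
import math

# Hypothetical per-step conditional probabilities P(w_i | h_i)
# for a toy 3-word sequence (made-up values).
cond_probs = [0.2, 0.5, 0.1]

# The probability of the whole sequence is the product of the conditionals.
seq_prob = math.prod(cond_probs)
print(seq_prob)  # approximately 0.01
```

Any real language model differs only in how it estimates each P(w_i | h_i); the product structure stays the same.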

Therefore the task of language modeling reduces to estimating a set of conditional distributions {P(w|h)}.
For complex models, this poses a computational challenge for learning, because the resulting objective functions are expensive to normalize.
Subsampling is a simple way around this constraint on computing resources. For language modeling, it amounts to training the language model on only part of the text corpus. For complex models, it has been shown that subsampling can greatly speed up training, at the cost of some degradation in predictive performance, allowing a trade-off between computational cost and language model quality.
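In its simplest form, taking "only part of the corpus" can be sketched as random sentence selection. This is a minimal illustration, not a prescribed method; the function name, fraction, and toy corpus are assumptions:

```python
import random

def subsample_corpus(sentences, fraction=0.1, seed=0):
    """Keep roughly `fraction` of the sentences, chosen independently at random."""
    rng = random.Random(seed)
    return [s for s in sentences if rng.random() < fraction]

# Toy corpus of 1000 placeholder sentences.
corpus = [f"sentence {i}" for i in range(1000)]

sample = subsample_corpus(corpus, fraction=0.1)
print(len(sample))  # roughly 100 of the 1000 sentences
```

Training then proceeds on `sample` instead of `corpus`, which is where the speed/quality trade-off comes from.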

In simple words:

Subsampling is also one of the techniques used when building word pairs (for example, for word embeddings); it removes the most frequently occurring words.
Generally these words are:

  • Prepositions (e.g. of, on, for)
  • Articles (a, an, the)
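For this word-pair use case, one common heuristic (the frequency-based formula popularized by word2vec, shown here as an illustrative sketch; the token list and threshold t below are made-up) keeps each occurrence of a word w with probability min(1, sqrt(t / f(w))), where f(w) is the word's relative frequency:

```python
import math
import random
from collections import Counter

def subsample_tokens(tokens, t=1e-3, seed=0):
    """Drop frequent tokens: each occurrence of word w is kept with
    probability min(1, sqrt(t / f(w))), where f(w) is w's relative
    frequency in the token stream (word2vec-style heuristic)."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_keep = min(1.0, math.sqrt(t / f))
        if rng.random() < p_keep:
            kept.append(w)
    return kept

# Toy token stream where "the" dominates, mimicking a frequent article.
tokens = ["the"] * 50 + ["cat", "sat", "on", "the", "mat"] * 2
thinned = subsample_tokens(tokens)
print(thinned.count("the"), "of", tokens.count("the"), "occurrences of 'the' kept")
```

Frequent words like "the" are dropped aggressively, while rarer content words survive more often, so the word pairs fed to the model carry more signal per training example.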

Advantages of subsampling:

  • Reduces the training time of the model.
  • Reduces the bias of the model towards highly frequent words in the training data.

Disadvantages of subsampling:

  • May reduce overall model accuracy.