Language models assign probabilities to sequences of words. They are widely used in many natural language processing applications.
The probability of a sequence can be modeled as a product of local probabilities:

P(w1, ..., wN) = P(w1 | h1) * P(w2 | h2) * ... * P(wN | hN),

where 'wi' is the i-th word and 'hi' = w1, ..., w(i-1) is the word history preceding 'wi'. The index 'i' runs from 1 to N, with h1 being the empty history.
Therefore the task of language modeling reduces to estimating a set of conditional distributions {P(w|h)}.
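As a minimal sketch of this decomposition, the snippet below estimates each conditional P(w|h) by counting on a toy corpus, truncating the history to the single previous word (a bigram model); real language models condition on longer histories, and the corpus and function names here are illustrative, not from the original text.

```python
from collections import defaultdict

# Toy corpus used to estimate conditional probabilities P(w | h).
corpus = "the cat sat on the mat the cat ate".split()

# Count how often each word follows each one-word history.
counts = defaultdict(lambda: defaultdict(int))
for prev, word in zip(corpus, corpus[1:]):
    counts[prev][word] += 1

def p(word, history):
    """Maximum-likelihood estimate of P(word | history)."""
    total = sum(counts[history].values())
    return counts[history][word] / total if total else 0.0

def sequence_probability(words):
    """Probability of a sequence as a product of local probabilities."""
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= p(word, prev)
    return prob

print(sequence_probability(["the", "cat", "sat"]))  # P(cat|the) * P(sat|cat)
```

Note that with longer histories the number of distinct (w, h) pairs explodes, which is exactly the normalization cost mentioned below for complex models.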
For complex models, this poses a computational challenge for learning, because the resulting objective functions are expensive to normalize.
Subsampling is a simple way to work around limited computing resources. For language modeling, it amounts to training the model on only part of the text corpus. For complex models, subsampling has been shown to speed up training greatly at the cost of some degradation in predictive performance, allowing a trade-off between computational cost and language model quality.
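A sketch of corpus-level subsampling, assuming a corpus represented as a list of sentences (the function name and fraction are illustrative choices, not from the original text):

```python
import random

def subsample_corpus(sentences, fraction=0.1, seed=0):
    """Keep a random fraction of the corpus for training.

    'fraction' controls the trade-off: smaller values train faster
    but typically yield a worse language model.
    """
    rng = random.Random(seed)
    return [s for s in sentences if rng.random() < fraction]

corpus = [f"sentence {i}" for i in range(10000)]
subset = subsample_corpus(corpus, fraction=0.1)
print(len(subset))  # roughly 1000 of the 10000 sentences
```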
Subsampling is also one of the techniques used when building word pairs: it removes the most frequently used words. Generally these words are:
- Prepositions (e.g. of, on, for)
- Articles (a, an, the)
Removing these frequent words:
- Reduces the training time for the model
- Reduces the bias of the model towards the training data
- May reduce the overall model accuracy
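One common way to implement this frequent-word subsampling is the heuristic from word2vec (Mikolov et al.): each word w is kept with probability sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold (typically around 1e-5). A sketch, with names chosen for illustration:

```python
import math
import random
from collections import Counter

def keep_probability(freq, t=1e-5):
    """Probability of keeping a word under the word2vec subsampling
    heuristic: rare words (freq <= t) are always kept, frequent words
    are kept with probability sqrt(t / freq)."""
    return min(1.0, math.sqrt(t / freq))

def subsample_words(words, t=1e-5, seed=0):
    """Drop frequent words from a token list before building word pairs."""
    rng = random.Random(seed)
    total = len(words)
    freqs = {w: c / total for w, c in Counter(words).items()}
    return [w for w in words if rng.random() < keep_probability(freqs[w], t)]
```

With this rule, words like "the" or "of" (very high f(w)) are discarded most of the time, while content words survive, shrinking the training set and the number of word pairs dominated by function words.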