Dawson-ma/Machine-Translation

Machine Translation (Sequence to Sequence)

This project performs English-to-Chinese machine translation with a sequence-to-sequence model. The model achieves a bilingual evaluation understudy (BLEU) score of 29.36.

Data Preprocessing

Data Cleaning and Normalization

The sentences were cleaned and normalized: excessively long or short sentences were removed and punctuation was normalized. The sentences were then tokenized, and the resulting training, test, and monolingual datasets were stored.
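
As an illustration of the cleaning step, here is a minimal sketch, not the project's actual code. The function names, the length thresholds, and the use of Unicode NFKC normalization for punctuation are all assumptions made for this example; the repository may use different rules.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Normalize punctuation width (assumed: NFKC) and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs, min_len=1, max_len=1000):
    """Keep only sentence pairs whose lengths fall inside [min_len, max_len].

    English length is counted in whitespace-separated words and Chinese
    length in non-space characters. Both thresholds are hypothetical.
    """
    kept = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        src_len = len(src.split())
        tgt_len = len(tgt.replace(" ", ""))
        if min_len <= src_len <= max_len and min_len <= tgt_len <= max_len:
            kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    raw = [("Thank you .", "谢谢 。"), ("", "空"), ("a " * 2000, "长")]
    # Only the first pair survives: the second is empty, the third too long.
    print(clean_pairs(raw))
```

Counting Chinese length in characters rather than words avoids needing a word segmenter before the filter runs.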

Subword Units

To address out-of-vocabulary words, the text was split into subword units:

  • The 'sentencepiece' package was used for subword tokenization.
  • Either the 'unigram' or the 'byte-pair encoding (BPE)' algorithm can be selected.

The data was binarized after subword tokenization.
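
To show why subword units help with out-of-vocabulary words, here is a toy byte-pair encoding learner in plain Python. It is only an illustration of the BPE idea under simplified assumptions (a word list as the corpus, a `</w>` end-of-word marker); it is not the 'sentencepiece' implementation the project uses, and the function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def encode(word, merges):
    """Segment a word, seen or unseen, by replaying the learned merges."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
    rules = learn_bpe(corpus, 10)
    # "lowest" never appears in the corpus but is still segmented into
    # known subword pieces instead of becoming an unknown token.
    print(encode("lowest", rules))
```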

Model Architecture

  • Encoder: either an RNN or a Transformer encoder.
  • Decoder: either an RNN or a Transformer decoder.
  • Attention:
    • Attention gives the decoder more complete information about the input, which matters most for long input sequences.
    • At each timestep, the correlation between the current decoder embedding and each encoder output is computed, and the weighted sum of the encoder outputs is fed to the decoder RNN as input.
    • Common implementations score the query (decoder embedding) against the keys (encoder outputs) with a small neural network or a dot product, apply softmax to turn the scores into a distribution, and then take the weighted sum of the values (encoder outputs) under that distribution.
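
The attention computation described above can be sketched for a single decoder timestep as follows. This is a dependency-free illustration of the dot-product variant only; the function names are hypothetical, and a real model would use batched tensor operations in a deep learning framework.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query, keys, values):
    """Attend over encoder outputs for one decoder timestep.

    query:  decoder embedding, a list of floats
    keys:   encoder outputs, one vector per source position
    values: encoder outputs again (keys and values coincide here)
    """
    # 1. Correlation between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # 2. Softmax gives the attention distribution over source positions.
    weights = softmax(scores)
    # 3. Weighted sum of the values under that distribution.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


if __name__ == "__main__":
    encoder_outputs = [[1.0, 0.0], [0.0, 1.0]]
    context, weights = dot_product_attention([10.0, 0.0], encoder_outputs, encoder_outputs)
    print(weights)  # almost all weight falls on the first source position
    print(context)
```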

Training Techniques

  • Label Smoothing Regularization: reserves a small amount of probability for incorrect labels to prevent overfitting.
  • Learning Rate Scheduling: the learning rate increases linearly during warm-up, then decays in proportion to the inverse square root of the step number. This stabilizes Transformer training in the early stages.
  • Back-Translation: uses monolingual data to create synthetic parallel data:
    1. Train a translation system in the opposite direction.
    2. Collect monolingual data on the target side and translate it with that system.
    3. Use the translated and original monolingual data as additional parallel data to train a stronger translation system.
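
The three techniques above can be sketched in a few small functions. This is a hedged illustration, not the project's training code: the function names, the peak learning rate, the warm-up length, and the smoothing value are all assumed for the example, and `backward_translate` stands in for whatever target-to-source model is trained in step 1.

```python
import math


def smoothed_targets(true_index, vocab_size, epsilon=0.1):
    """Label smoothing: keep 1 - epsilon on the true label and spread
    epsilon evenly over the remaining (incorrect) labels."""
    off_value = epsilon / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[true_index] = 1.0 - epsilon
    return dist


def cross_entropy(log_probs, target_dist):
    """Cross-entropy between model log-probabilities and a target distribution."""
    return -sum(t * lp for t, lp in zip(target_dist, log_probs))


def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000):
    """Linear warm-up to peak_lr, then decay by the inverse square root of the step."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


def make_synthetic_pairs(target_sentences, backward_translate):
    """Back-translation: pair each monolingual target sentence with its
    machine-translated source, giving (synthetic source, real target) pairs."""
    return [(backward_translate(t), t) for t in target_sentences]


if __name__ == "__main__":
    print(smoothed_targets(true_index=2, vocab_size=5))
    for step in (1000, 4000, 16000):
        print(step, inverse_sqrt_lr(step))
    # A placeholder callable stands in for a real Chinese-to-English model.
    print(make_synthetic_pairs(["你好"], lambda zh: "<synthetic English>"))
```

Keeping the real sentence on the target side means the model still learns to produce human-written output, even though the synthetic source side may be noisy.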

Dataset

The Ted2020 dataset served as the primary data source for this project:

  • Raw: 398,066 sentences
  • Processed: 393,980 sentences

For detailed implementation and usage instructions, please refer to the provided code.
