Machine Translation (Sequence to Sequence)

The model translates English to Chinese and achieves a bilingual evaluation understudy (BLEU) score of 29.36.

Data Preprocessing

Data Cleaning and Normalization

Sentences were cleaned and normalized: excessively long or short sentences were removed, and punctuation was normalized. The cleaned sentences were then tokenized, and the training, test, and monolingual datasets were saved.
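The cleaning step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names, the use of Unicode NFKC folding for punctuation, and the length thresholds are all assumptions chosen for the example.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold full-width forms to ASCII and collapse whitespace (assumed rules)."""
    # NFKC maps full-width letters and punctuation such as "，" to "," .
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(pairs, min_len=1, max_len=200):
    """Keep only sentence pairs whose lengths fall inside [min_len, max_len].

    The thresholds are hypothetical defaults, not values from the project.
    """
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        src_len = len(src.split())            # English: whitespace-separated words
        tgt_len = len(tgt.replace(" ", ""))   # Chinese: characters, spaces ignored
        if min_len <= src_len <= max_len and min_len <= tgt_len <= max_len:
            cleaned.append((src, tgt))
    return cleaned
```

Counting words on the English side and characters on the Chinese side is one simple choice; any consistent length measure would work for filtering.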

Subword Units

To address out-of-vocabulary issues, subword units were employed:

- The 'sentencepiece' package was used to learn and apply the subword vocabulary.
- Either the 'unigram' or the 'byte-pair encoding (BPE)' algorithm can be selected.
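A possible way to configure this with the sentencepiece Python package is sketched below. The file names, vocabulary size, and character coverage are assumptions for illustration only; the helper functions are hypothetical and not part of the project. The package is imported inside the training function so the configuration helper can be used without it installed.

```python
def spm_train_args(input_path, model_prefix, vocab_size=8000, model_type="unigram"):
    """Build keyword arguments for SentencePieceTrainer.train (assumed values)."""
    # Only the two algorithms named above are accepted in this sketch.
    if model_type not in ("unigram", "bpe"):
        raise ValueError(f"unsupported model_type: {model_type!r}")
    return {
        "input": input_path,            # one sentence per line
        "model_prefix": model_prefix,   # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": model_type,
        "character_coverage": 1.0,      # assumption: keep every character
    }


def train_and_load(args):
    """Train a subword model and return a processor for encoding text."""
    import sentencepiece as spm  # third-party dependency, imported lazily

    spm.SentencePieceTrainer.train(**args)
    return spm.SentencePieceProcessor(model_file=args["model_prefix"] + ".model")
```

With a trained processor `sp`, calling `sp.encode("some sentence", out_type=str)` returns the subword pieces for that sentence.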

The data was binarized after subword tokenization.

Model Architecture

Encoder: Employed either an RNN or a Transformer Encoder.
Decoder: Utilized either an RNN or a Transformer Decoder.
Attention:
- Attention gives the Decoder more comprehensive information about the input, which helps most with long input sequences.
- At each timestep, the correlation between the current Decoder embedding and every Encoder output is computed, and the resulting weighted sum of the Encoder outputs is fed to the Decoder RNN.
- Common implementations score the correlation between the query (Decoder embedding) and the keys (Encoder outputs) with a small neural network or a dot product. A softmax turns the scores into a distribution, and the values (Encoder outputs) are summed with those weights.
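The dot-product variant described above can be sketched in plain Python. This is a minimal illustration for a single query vector, written with lists for readability; it is not the notebook's implementation, and a real model would use batched tensor operations instead.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query, keys, values):
    """Return (context, weights) for one Decoder query.

    query:  Decoder embedding at the current timestep
    keys:   Encoder outputs, one vector per source position
    values: Encoder outputs again (keys and values coincide here)
    """
    # Correlation between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the values, computed one dimension at a time.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights
```

The returned weights sum to 1, and the Encoder position most similar to the query receives the largest weight.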

Training Techniques

Label Smoothing Regularization: Reserves a small amount of probability for incorrect labels so the model does not become overconfident, which helps prevent overfitting.
Learning Rate Scheduling: The learning rate increases linearly during warmup and then decays with the inverse square root of the step count. The warmup stabilizes Transformer training in its early stages.
Back-Translation: Uses monolingual data to create synthetic parallel data:
1. Train a translation system in the opposite direction (target to source).
2. Collect monolingual data on the target side and translate it with that system.
3. Pair the translated text with the original monolingual data and use the pairs as additional parallel data to train a stronger translation system.
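Each of the three techniques above can be sketched in a few lines. The function names, the smoothing value 0.1, the peak learning rate, and the warmup length are assumptions for illustration, and `backward_model` is a hypothetical callable standing in for the trained target-to-source system. The smoothing shown is one common variant that spreads the reserved probability evenly over the incorrect labels.

```python
def smoothed_targets(gold_index, vocab_size, epsilon=0.1):
    """Target distribution that reserves `epsilon` for the incorrect labels."""
    off_value = epsilon / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[gold_index] = 1.0 - epsilon
    return dist


def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000):
    """Linear warmup to `peak_lr`, then decay with the inverse square root of the step."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


def back_translate(mono_target, backward_model):
    """Pair each target-side sentence with its synthetic source translation."""
    # Result items are (synthetic source, original target), ready to be
    # appended to the real parallel data.
    return [(backward_model(tgt), tgt) for tgt in mono_target]
```

With these defaults the learning rate reaches its peak at step 4000 and has halved by step 16000.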

Dataset

The Ted2020 dataset served as the primary data source for this project:

Raw: 398,066 sentences
Processed: 393,980 sentences

For implementation details and usage instructions, see the provided notebook, Machine_translation.ipynb.
