Created synthetic datasets of 29,800 positive/negative pairs each using the GPT-4o-mini API. Dataset-V1 and Dataset-V2 differ only in how the resume text was preprocessed: V1 contains the full resume, while V2 keeps only the experience, education, and skill sections.
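The V1/V2 difference above can be sketched as a small preprocessing function. This is a hypothetical illustration, not the repo's code: the section keys and the dict-shaped parsed resume are assumptions.

```python
# Assumed section keys kept by the Dataset-V2 preprocessing step
KEEP_SECTIONS = ["experience", "education", "skills"]

def preprocess_v2(resume: dict) -> str:
    # resume: mapping of section name -> section text
    # (a V1-style preprocessing would simply join all sections)
    return "\n\n".join(resume[s] for s in KEEP_SECTIONS if s in resume)
```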
Conducted ablation studies comparing two architectures, a cross-encoder and a single-encoder (bi-encoder), fine-tuning multilingual BERT models on the custom dataset with a contrastive loss.
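For intuition, a minimal numpy sketch of a pairwise (Hadsell-style) contrastive loss over resume/JD embedding pairs is shown below; the exact loss used in the repo may differ, and the margin value is an assumption.

```python
import numpy as np

def contrastive_loss(emb1, emb2, labels, margin=1.0):
    # emb1, emb2: (batch, dim) embeddings of resumes and job descriptions
    # labels: 1 for matching (positive) pairs, 0 for non-matching pairs
    d = np.linalg.norm(emb1 - emb2, axis=1)            # Euclidean distance per pair
    pos = labels * d ** 2                              # pull positives together
    neg = (1 - labels) * np.maximum(0.0, margin - d) ** 2  # push negatives apart, up to the margin
    return float(np.mean(pos + neg))
```

Positive pairs are penalized by their squared distance, while negative pairs incur loss only when they fall inside the margin.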
| Model | dataV1 | dataV2 | # Params |
|---|---|---|---|
| TF-IDF | 0.2214 | 0.2267 | NA |
| OpenAI text-embedding-3-large | ... | 0.2853 | Unknown |
| mBERT | 0.3887 | 0.3449 | 135M |
| mPnet | 0.3936 | 0.3254 | 278M |
| Bge-m3-korean | 0.3621 | 0.3394 | 568M |
| **Ours:** | | | |
| mBERT+MLM | 0.4612 | 0.4250 | 135M |
| mBERT + CLoss | 0.1066 | 0.1056 | 135M |
| mBERT+MLM+CLoss | 0.1084 | ... | 135M |
| mPnet + CLoss | 0.1038 | 0.1024 | 278M |
| Bge-m3-korean + CLoss | 0.1085 | 0.1052 | 568M |

| Model | MSE (dataV1) | MSE (dataV2) | # Params |
|---|---|---|---|
| Bge-m3-korean + MLP + BCE | 0.0016 | 0.0803 | 568M |
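The MLP + BCE variant in the table scores a pair with a small classification head on top of the encoder embeddings. The sketch below is a hypothetical numpy illustration, not the repo's head: the embedding size, hidden width, and concatenation scheme are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN = 1024, 256  # assumed bge-m3 embedding size and hidden width

# Randomly initialized weights stand in for a trained head
W1 = rng.normal(0, 0.02, (2 * DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.02, (HIDDEN, 1));       b2 = np.zeros(1)

def match_probability(resume_emb, jd_emb):
    # Concatenate the two embeddings, pass through a 2-layer MLP
    x = np.concatenate([resume_emb, jd_emb], axis=-1)
    h = np.maximum(0.0, x @ W1 + b1)        # ReLU hidden layer
    logit = (h @ W2 + b2).squeeze(-1)
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> P(match)

def bce_loss(p, y, eps=1e-7):
    # Binary cross-entropy against 0/1 match labels
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```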
- Clone the repository:

  ```bash
  git clone https://github.com/kw-recuse/BERT_ContrastiveLearning.git
  ```

- Navigate to the directory:

  ```bash
  cd BERT_ContrastiveLearning
  ```

- Run the training:

  You can start the training process by importing and using the `Trainer` class, as shown in the example below (e.g., in a Python script or notebook).

  ```python
  import sys
  # Add the repository path if needed (e.g., if running from a notebook outside the main directory)
  # sys.path.append('/path/to/BERT_ContrastiveLearning')

  from scripts.train import Trainer

  # Initialize the Trainer with your configuration
  trainer = Trainer(
      config_file="configs/train/multiling_BERT.json",
      checkpoints_path="checkpoints",
      csv_file_path="output_file.csv",  # path to the CSV file downloaded from Hugging Face
      col_name1="resume",
      col_name2="jd",
      label_col="label",
  )

  # Start training
  trainer.train()
  ```
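Once training finishes, a resume/JD pair can be scored by embedding both texts with the fine-tuned encoder and comparing the vectors. A minimal numpy sketch of the comparison step follows; the embeddings themselves are assumed to come from the trained model, and cosine similarity is one reasonable choice of comparison, not necessarily the repo's.

```python
import numpy as np

def cosine_match_score(resume_emb, jd_emb):
    # Cosine similarity between a resume embedding and a JD embedding;
    # values near 1 indicate a close match, near -1 a strong mismatch.
    num = float(resume_emb @ jd_emb)
    den = float(np.linalg.norm(resume_emb) * np.linalg.norm(jd_emb))
    return num / den
```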
- Knowledge Distillation

