Created synthetic datasets of 29,800 positive/negative pairs each using the GPT-4o-mini API. Dataset-V1 and Dataset-V2 differ only in how the resume text was preprocessed: V1 contains the full resume, while V2 keeps only the experience, education, and skill sections.
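The V1/V2 difference above can be sketched as a small preprocessing function. This is a hypothetical illustration, not the repo's code: the section keys and the dict-shaped parsed resume are assumptions.

```python
# Assumed section keys kept by the Dataset-V2 preprocessing step
KEEP_SECTIONS = ["experience", "education", "skills"]

def preprocess_v2(resume: dict) -> str:
    # resume: mapping of section name -> section text
    # (a V1-style preprocessing would simply join all sections)
    return "\n\n".join(resume[s] for s in KEEP_SECTIONS if s in resume)
```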
Conducted ablation studies comparing two architectures, a cross-encoder and a single-encoder (bi-encoder), fine-tuning multilingual BERT models on the custom dataset with a contrastive loss.
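For intuition, a minimal numpy sketch of a pairwise (Hadsell-style) contrastive loss over resume/JD embedding pairs is shown below; the exact loss used in the repo may differ, and the margin value is an assumption.

```python
import numpy as np

def contrastive_loss(emb1, emb2, labels, margin=1.0):
    # emb1, emb2: (batch, dim) embeddings of resumes and job descriptions
    # labels: 1 for matching (positive) pairs, 0 for non-matching pairs
    d = np.linalg.norm(emb1 - emb2, axis=1)            # Euclidean distance per pair
    pos = labels * d ** 2                              # pull positives together
    neg = (1 - labels) * np.maximum(0.0, margin - d) ** 2  # push negatives apart, up to the margin
    return float(np.mean(pos + neg))
```

Positive pairs are penalized by their squared distance, while negative pairs incur loss only when they fall inside the margin.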
| Model | dataV1 | dataV2 | # Params |
|---|---|---|---|
| TF-IDF | 0.2214 | 0.2267 | NA |
| OpenAI text-embedding-3-large | ... | 0.2853 | Unknown |
| mBERT | 0.3887 | 0.3449 | 135M |
| mPnet | 0.3936 | 0.3254 | 278M |
| Bge-m3-korean | 0.3621 | 0.3394 | 568M |
| **Ours:** | | | |
| mBERT+MLM | 0.4612 | 0.4250 | 135M |
| mBERT + CLoss | 0.1066 | 0.1056 | 135M |
| mBERT+MLM+CLoss | 0.1084 | ... | 135M |
| mPnet + CLoss | 0.1038 | 0.1024 | 278M |
| Bge-m3-korean + CLoss | 0.1085 | 0.1052 | 568M |

| Model | MSE (dataV1) | MSE (dataV2) | # Params |
|---|---|---|---|
| Bge-m3-korean + MLP + BCE | 0.0016 | 0.0803 | 568M |
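The MLP + BCE variant in the table scores a pair with a small classification head on top of the encoder embeddings. The sketch below is a hypothetical numpy illustration, not the repo's head: the embedding size, hidden width, and concatenation scheme are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN = 1024, 256  # assumed bge-m3 embedding size and hidden width

# Randomly initialized weights stand in for a trained head
W1 = rng.normal(0, 0.02, (2 * DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.02, (HIDDEN, 1));       b2 = np.zeros(1)

def match_probability(resume_emb, jd_emb):
    # Concatenate the two embeddings, pass through a 2-layer MLP
    x = np.concatenate([resume_emb, jd_emb], axis=-1)
    h = np.maximum(0.0, x @ W1 + b1)        # ReLU hidden layer
    logit = (h @ W2 + b2).squeeze(-1)
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> P(match)

def bce_loss(p, y, eps=1e-7):
    # Binary cross-entropy against 0/1 match labels
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```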
- Clone the repository:

  ```bash
  git clone https://github.com/kw-recuse/BERT_ContrastiveLearning.git
  ```

- Navigate to the directory:

  ```bash
  cd BERT_ContrastiveLearning
  ```

- Run the training:

  You can start the training process by importing and using the `Trainer` class, as shown in the example below (e.g., in a Python script or notebook).

  ```python
  import sys
  # Add the repository path if needed (e.g., if running from a notebook outside the main directory)
  # sys.path.append('/path/to/BERT_ContrastiveLearning')

  from scripts.train import Trainer

  # Initialize the Trainer with your configuration
  trainer = Trainer(
      config_file="configs/train/multiling_BERT.json",
      checkpoints_path="checkpoints",
      csv_file_path="output_file.csv",  # path to the CSV file downloaded from Hugging Face
      col_name1="resume",
      col_name2="jd",
      label_col="label",
  )

  # Start training
  trainer.train()
  ```
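Once training finishes, a resume/JD pair can be scored by embedding both texts with the fine-tuned encoder and comparing the vectors. A minimal numpy sketch of the comparison step follows; the embeddings themselves are assumed to come from the trained model, and cosine similarity is one reasonable choice of comparison, not necessarily the repo's.

```python
import numpy as np

def cosine_match_score(resume_emb, jd_emb):
    # Cosine similarity between a resume embedding and a JD embedding;
    # values near 1 indicate a close match, near -1 a strong mismatch.
    num = float(resume_emb @ jd_emb)
    den = float(np.linalg.norm(resume_emb) * np.linalg.norm(jd_emb))
    return num / den
```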
- Knowledge Distillation

