GitHub - iraceka/NatGen

NatGen: Generative pre-training by “Naturalizing” source code

Getting Started

Environment Requirements

pytorch==1.7.0
cudatoolkit=11.1
datasets==1.18.3
transformers==4.16.2
tensorboard==2.8.0
tree-sitter==0.19.0
nltk==3.6.7
scipy==1.5.4

To set up the environment, uncomment lines 35 and 36 (or run those commands in your shell), then run:

bash setup.sh
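
Once setup finishes, it can help to confirm that the installed packages match the pinned versions listed above. The following is a minimal sketch, not part of the repository. It assumes the pip distribution name `torch` for the `pytorch` entry, and it skips `cudatoolkit`, which is a conda package and is not visible to `importlib.metadata`.

```python
from importlib import metadata

# Pins copied from the requirements list. The name "torch" is an assumption
# (pip distribution name for the "pytorch" entry); cudatoolkit is omitted
# because it is installed through conda, not pip.
PINNED = {
    "torch": "1.7.0",
    "datasets": "1.18.3",
    "transformers": "4.16.2",
    "tensorboard": "2.8.0",
    "tree-sitter": "0.19.0",
    "nltk": "3.6.7",
    "scipy": "1.5.4",
}


def find_mismatches(pinned):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed at all
        # Some wheels carry a local suffix such as "+cu110"; ignore it.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package was found at the expected version.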

Download and preprocess the training data

cd scripts/pretraining;
bash process_data.sh

Data processing takes several parameters, which are passed through a JSON configuration file. The configuration file should be in the configs/pretraining/data_config directory.
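
Before launching a long preprocessing run, it may be worth loading the configuration file and reviewing its settings. The sketch below is an illustration only: the helper names `load_data_config` and `summarize` are hypothetical, and no particular configuration keys are assumed, since the README does not list them.

```python
import json
from pathlib import Path

# Directory named in the text above, relative to the repository root.
CONFIG_DIR = Path("configs/pretraining/data_config")


def load_data_config(name, config_dir=CONFIG_DIR):
    """Load one JSON config by file name and return it as a dict."""
    path = Path(config_dir) / name
    with path.open(encoding="utf-8") as f:
        config = json.load(f)
    if not isinstance(config, dict):
        raise ValueError(f"{path} should contain a JSON object")
    return config


def summarize(config):
    """Return sorted 'key = value' lines for a quick review of the settings."""
    return [f"{key} = {config[key]!r}" for key in sorted(config)]
```

For example, `print("\n".join(summarize(load_data_config("my_config.json"))))` would list the settings of a file called `my_config.json`, where the file name is a placeholder for whichever configuration you use.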

Pretrain the model

cd scripts/pretraining;
bash train.sh <EXPERIMENT_NAME> <GPUS>

Adjust per_device_train_batch_size and gradient_accumulation_steps in the training arguments JSON file, along with the number of GPUs, to reach the desired effective batch size: per_device_train_batch_size * gradient_accumulation_steps * number of GPUs. We use distributed training for pre-training.
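
The batch size arithmetic above can be sketched in a few lines. The function names and the numbers in the usage note are hypothetical and chosen only for illustration; they are not the settings used to train NatGen.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_gpus):
    """Number of examples that contribute to one optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


def accumulation_steps_for(target_batch_size, per_device_train_batch_size, num_gpus):
    """Solve the same formula for gradient_accumulation_steps."""
    per_forward_pass = per_device_train_batch_size * num_gpus
    if target_batch_size % per_forward_pass != 0:
        raise ValueError(
            f"target {target_batch_size} is not a multiple of "
            f"{per_forward_pass} (per-device batch size * number of GPUs)"
        )
    return target_batch_size // per_forward_pass
```

With a per-device batch size of 8, 4 accumulation steps, and 4 GPUs, `effective_batch_size(8, 4, 4)` gives 128. Going the other way, `accumulation_steps_for(128, 8, 2)` gives 8, so halving the GPU count requires doubling the accumulation steps to keep the same effective batch size.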

We reused source code from several open-source repositories:

CodeT5
Microsoft CodeXGLUE

Our sincere thanks to the authors of these repositories for open-sourcing their work.
