The authors of A Transformer-based Approach for Source Code Summarization shared their code and dataset. Their repo provides the original, runnable code for the Java dataset, so we can generate ASTs with Tree-Sitter.
However, the original code in the Python dataset is not runnable. One possible workaround is to recover runnable Python code from the raw data.
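As a rough illustration of what "runnable" could mean here, the sketch below keeps only the snippets that Python's built-in `ast` module can parse. This is an assumption for illustration, not the procedure used by the paper's authors or by this repo; `is_parsable` and `keep_parsable` are hypothetical helper names. Note that snippets written in Python 2 syntax will be rejected when this runs under Python 3.

```python
import ast


def is_parsable(source):
    """Return True if `source` parses under the running Python version."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # SyntaxError: invalid code; ValueError: e.g. null bytes on some versions.
        return False
    return True


def keep_parsable(snippets):
    """Filter a list of source strings down to those that parse."""
    return [s for s in snippets if is_parsable(s)]
```

Parsing only checks syntax; a snippet that passes may still fail at run time, so this is a lower bar than actually executing the code.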
Download the pre-processed and raw (java_hu) dataset.
bash dataset/java_hu/download.sh
Move code/code_tokens/docstring/docstring_tokens to ~/java_hu/flatten/*.
python -m dataset.java_hu.flatten
Generate raw/bin data with multi-processing. Before generating the datasets, please make sure the config file is set correctly.
# code_tokens/docstring_tokens
python -m dataset.java_hu.preprocess
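For orientation, the multi-processing idea behind this step can be pictured roughly as below. This is a minimal sketch, not the repo's actual `preprocess` implementation: `tokenize_line` and `preprocess_lines` are hypothetical names, and whitespace splitting stands in for whatever tokenizer the config selects.

```python
from multiprocessing import Pool


def tokenize_line(line):
    """Placeholder tokenizer: split on whitespace (the real tokenizer may differ)."""
    return line.split()


def preprocess_lines(lines, workers=2):
    """Tokenize `lines`, in parallel when workers > 1, preserving input order."""
    if workers <= 1:
        # Serial fallback, handy for debugging or very small inputs.
        return [tokenize_line(line) for line in lines]
    with Pool(processes=workers) as pool:
        # Pool.map returns results in the same order as the input list.
        return pool.map(tokenize_line, lines, chunksize=64)
```

When calling `preprocess_lines` with more than one worker from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.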