Automated docstring generation for Python

This project develops an end-to-end system for the automatic generation and improvement of docstrings in Python files using Large Language Models (LLMs). The aim is to add missing docstrings to existing scripts as well as to optimize the content of existing docstrings.

Repository structure

The repository is organized as follows:

Data:
Contains the raw data (Python scripts), filtered scripts, and training and test data for fine-tuning.
Code:
Contains the following scripts and notebooks for dataset processing, analysis, and fine-tuning:
- create_finetune_dataset.py:
  Creates the dataset used for fine-tuning the model.
- docstring_analysis.py:
  Analyzes docstrings in Python scripts. This script is used during dataset creation and is also utilized in the app.
- docstring_generator_app.py:
  A complete application that wraps the fine-tuned model, analyzes Python files, and generates docstrings for functions.
- finetune_LLM_docstring.ipynb:
  A Jupyter notebook for fine-tuning the model using LoRA (Low-Rank Adaptation).
Results:
Contains test and evaluation files, including:
- inference_finetuned_LLM_docstring.ipynb:
  A Jupyter notebook for evaluating the fine-tuned model's performance. It uses BLEU and ROUGE scores to measure similarity with reference docstrings and provides visual comparisons between generated and reference docstrings.
Icon:
Contains the application icon that is displayed in the dock when the program is executed.
Models:
Contains the final checkpoint of the fine-tuned model, which is used in the application for docstring generation.

Background

In software development, docstrings play a decisive role in the readability and maintainability of code. However, manual creation is often neglected. With the increasing availability of LLMs, this process can be automated. This project aims to automatically generate missing docstrings in Python functions and to improve the quality of existing docstrings.

Data Basis

Source Code Corpus (~25,000 GitHub repositories, Python scripts):
- The initial pre-processed training data for fine-tuning was taken from MLRepositories.
- The provided filter_sample_scripts.py script (minimally adapted) was used with adjusted conditions to filter the original ~25,000 scripts down to about ~5,000 scripts. The adapted conditions:
```
# Define conditions for filtering
conditions = {
    "lines": lambda x: 20 <= x <= 2000,  # between 20 and 2000 lines
    "functions": lambda x: x >= 2,       # at least two functions
    "functions_with_docstring": lambda x: x >= 2, # at least two functions with a docstring
    "average_line_length": lambda x: x <= 120,     # max. 120 characters per line
}
```
- After filtering, additional scripts from this reduced dataset were analyzed with our code, resulting in a fine-tuning dataset of approximately 13,000 functions with docstrings.
A typical entry in the final dataset looks like this:

[Function] def function(a, b):\n ... \nreturn list [Docstring] Do X and return a list. [EOS]

Fine-Tuning Large Language Model

In this project, the model CodeLlama from Hugging Face is fine-tuned using LoRA (Low-Rank Adaptation) on an extracted and filtered dataset.

Dataset Preparation:
- The training and test datasets are loaded from text files containing Python functions and corresponding docstrings.
- The dataset is tokenized to prepare it for fine-tuning.
Model Setup:
- The pre-trained model (meta-llama/CodeLlama-7b-Python-hf) is loaded from Hugging Face.
- LoRA is applied to fine-tune the model with efficient parameter adaptation.
Fine-Tuning Configuration:
- The training process is configured with parameters such as learning rate, batch size, and gradient accumulation.
- Additional tools like wandb and mlflow are used for experiment tracking and monitoring.
Evaluation:
- The fine-tuned model is tested using a pipeline on several prompts, and the generated docstrings are evaluated for syntax, semantic clarity, and completeness.
- Evaluation metrics include BLEU and ROUGE scores.
Inference:
- The fine-tuned model is saved and used in the application to generate docstrings for Python functions.

Docstring generation Script

The script requires Python 3.10 and uses the libraries pandas (BSD 3-Clause License), matplotlib (Matplotlib License), and the standard Python libraries json, os.

Installation

Clone the repository and install the required dependencies as follows:

git clone https://github.com/mbernahr/Docstring_generation.git
cd Docstring_generation/Code

Create virtual environment

Linux/Mac:

python3 -m venv venv
source venv/bin/activate

Windows:

python -m venv venv
venv\Scripts\activate

Install the required dependencies:

pip install -r ..\requirements.txt

Usage

Before the first use, create an account at https://huggingface.co and create a token. Next, login with the token from huggingface:

huggingface-cli login

Run the script as follows:

python docstring_generator_app.py

What the Script Does

Provides a graphical user interface (GUI) for selecting a Python script to analyze.
Extracts all functions from the selected script and displays them in a table, including their current docstring quality scores.
Allows the user to:
- Select all functions.
- Focus on functions with low-quality scores for targeted improvements.
- Select specific functions manually for docstring generation.
Generates new docstrings for missing or inadequate ones using the fine-tuned model.
Updates the Python script with the generated docstrings and saves the modified file with a new name in the same directory.

The updated file is saved with the suffix _updated.py.

Contributors

Marius Bernahrndt
Maximilian Karhausen

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
Code		Code
Data		Data
Icon		Icon
Models/meta-llama/CodeLlama-7b-Python-hf_run23086723ea/end_of_training		Models/meta-llama/CodeLlama-7b-Python-hf_run23086723ea/end_of_training
Results		Results
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Automated docstring generation for Python

Repository structure

Background

Data Basis

Fine-Tuning Large Language Model

Docstring generation Script

Installation

Create virtual environment

Install the required dependencies:

Usage

What the Script Does

Contributors

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

mbernahr/Docstring_generation

Folders and files

Latest commit

History

Repository files navigation

Automated docstring generation for Python

Repository structure

Background

Data Basis

Fine-Tuning Large Language Model

Docstring generation Script

Installation

Create virtual environment

Install the required dependencies:

Usage

What the Script Does

Contributors

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages