This Python program utilizes machine learning to predict the readability of source code snippets. It is designed to work with Python 3.11 and uses Poetry for package management.
The most recent implementation of the model is built with Keras (see the `keras` folder).
Previously, we tried to achieve the same with PyTorch (see the `torch` folder), but we did not reach the same classification accuracy for unknown reasons.
The model is based on the implementation of the following paper:
```bibtex
@article{mi2022towards,
  title={Towards using visual, semantic and structural features to improve code readability classification},
  author={Mi, Qing and Hao, Yiqun and Ou, Liwei and Ma, Wei},
  journal={Journal of Systems and Software},
  volume={193},
  pages={111454},
  year={2022},
  publisher={Elsevier}
}
```
You can use either Poetry or pip to install the required dependencies. We recommend using Poetry for a clean and isolated environment on your local machine. Inside the container, we use pip to install the dependencies. In any case, make sure to add the project root to your PYTHONPATH before running the program (see below).
To set up the project and its dependencies, follow these steps:
- Clone this repository to your local machine:

  ```bash
  git clone https://github.com/LuKrO2011/readability-classifier
  cd readability-classifier
  ```

- Install Python 3.11 and pip, if you haven’t already.

- Install wkhtmltopdf:

  - Windows/MacOS: After downloading and installing, make sure to add the `/bin` folder to your PATH.
  - Ubuntu: Install the package using the following commands:

    ```bash
    sudo apt-get update
    sudo apt-get install -y wkhtmltopdf
    ```

  After the installation, restart your IDE/terminal.

- Use git lfs to download the models & datasets:

  ```bash
  git lfs install
  git lfs pull
  ```

- Install Poetry if you haven’t already:

  ```bash
  pip install poetry
  ```

- Create a virtual environment and install the project’s dependencies using Poetry:

  ```bash
  poetry install
  ```

- Activate the virtual environment:

  ```bash
  poetry shell
  ```

- For developers only: Activate the pre-commit hooks:

  ```bash
  pre-commit install
  ```
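After these steps, you can run a quick sanity check inside the Poetry shell. The snippet below is a minimal, hypothetical check (not part of the repository) and assumes TensorFlow/Keras is among the installed dependencies:

```python
# sanity_check.py -- minimal, hypothetical environment check
import sys

# The project targets Python 3.11 (see above).
assert sys.version_info[:2] == (3, 11), f"Expected Python 3.11, got {sys.version}"

import tensorflow as tf  # the Keras implementation of the model builds on TensorFlow

print("Python:", sys.version.split()[0])
print("TensorFlow:", tf.__version__)
```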
When using pip on your local machine, we recommend using a virtual environment, such as venv, to avoid conflicts with other projects.
Install the required dependencies using pip:
- Windows:

  ```powershell
  foreach ($k in Get-Content requirements.txt) { if ($k -ne "#") { pip install $k } }
  ```

- Ubuntu/MacOS:

  ```bash
  cat requirements.txt | xargs -n 1 pip install
  ```
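If neither shell variant fits your setup, the same loop can also be written in plain Python. The sketch below is a hypothetical helper (not part of the repository) and assumes it is run from the project root, next to requirements.txt:

```python
# install_requirements.py -- hypothetical equivalent of the shell loops above
import subprocess
import sys

with open("requirements.txt", encoding="utf-8") as requirements:
    for line in requirements:
        requirement = line.strip()
        # Skip blank lines and comments, install one requirement at a time.
        if requirement and not requirement.startswith("#"):
            subprocess.run(
                [sys.executable, "-m", "pip", "install", requirement],
                check=True,
            )
```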
After the installation, make sure that your PYTHONPATH is set to the root directory of the project:
- Windows:

  ```powershell
  $env:PYTHONPATH = "readability-classifier"
  ```

- Ubuntu:

  ```bash
  export PYTHONPATH="readability-classifier":$PYTHONPATH
  ```
If you are using an IDE, make sure to set your working directory to the root directory of the project (e.g., readability-classifier).
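If you are unsure whether the paths are set correctly, a small hypothetical helper (not part of the repository) can print the relevant information:

```python
# path_check.py -- hypothetical helper to inspect working directory and PYTHONPATH
import os
import sys

project_root = os.path.abspath(".")  # should end in .../readability-classifier
on_path = any(os.path.abspath(p or ".") == project_root for p in sys.path)

print("Working directory:", project_root)
print("Project root on sys.path:", on_path)
```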
Now you’re ready to use the source code readability prediction tool.
To get an overview of all available parameters, you can use -h or --help. You can find an overview of the default parameters in the main.py file. We trained the model using the default parameters, but you can adjust them to your needs.
To predict the readability of a source code snippet, use the following command:
```bash
python src/readability_classifier/main.py PREDICT --model MODEL --input INPUT [--token-length TOKEN_LENGTH]
```

- `--model` or `-m`: Path to the pre-trained machine learning model (.h5 or .keras).
- `--input` or `-i`: Path to the source code snippet you want to evaluate. Alternatively, you can provide a folder with multiple snippets.
- `--token-length` or `-l` (optional): The token length of the snippet (cutting/padding applied).
Example:
```bash
python src/readability_classifier/main.py PREDICT --model tests/res/models/towards.keras --input tests/res/code_snippets/towards.java
```

While the data is processed batch-wise for training, prediction currently handles only one snippet at a time. Batch processing for prediction is not implemented yet; if you need to predict many snippets, you can script one call per file, as sketched below.
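The following is only a convenience sketch that drives the CLI shown above from Python; the model and snippet paths are the test resources from this repository, so adapt them to your own data:

```python
# predict_folder.py -- hypothetical wrapper around the PREDICT command shown above
import subprocess
import sys
from pathlib import Path

MODEL = "tests/res/models/towards.keras"
SNIPPET_DIR = Path("tests/res/code_snippets")

for snippet in sorted(SNIPPET_DIR.glob("*.java")):
    # One CLI invocation per snippet, mirroring the example command above.
    subprocess.run(
        [
            sys.executable,
            "src/readability_classifier/main.py",
            "PREDICT",
            "--model", MODEL,
            "--input", str(snippet),
        ],
        check=True,
    )
```

Note that `--input` can already point to a folder of snippets; the loop above is only useful if you want per-file control over the calls.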
To train a new machine learning model for source code readability prediction, use the following command:
```bash
python src/readability_classifier/main.py TRAIN --input INPUT [--save SAVE] [--intermediate INTERMEDIATE] [--evaluate] [--token-length TOKEN_LENGTH] [--batch-size BATCH_SIZE] [--epochs EPOCHS] [--learning-rate LEARNING_RATE]
```

- `--input` or `-i`: Path to the folder with the raw dataset or the encoded dataset generated using the `intermediate` command.
- `--save` or `-s` (optional): Path to the folder where the trained model should be stored. If not specified, the model is not stored.
- `--intermediate` (optional): Path to the folder where the encoded dataset should be stored as an intermediate result. If not specified, the dataset is not stored after encoding.
- `--evaluate` (optional): Whether to evaluate the model after training.
- `--token-length` or `-l` (optional): The token length of the snippets (cutting/padding applied).
- `--batch-size` or `-b` (optional): The batch size for training.
- `--epochs` or `-e` (optional): The number of epochs for training.
- `--learning-rate` or `-r` (optional): The learning rate for training.
Example:
```bash
python src/readability_classifier/main.py TRAIN --input tests/res/raw_datasets/combined --save output
```

The datasets used for training and evaluation are from the following sources:
- BW: Raymond PL Buse and Westley R Weimer. ‘Learning a metric for code readability’.
- Dorn: Jonathan Dorn. ‘A general software readability model’.
- Scalabrio: Simone Scalabrino et al. ‘Improving code readability models with textual features’.

You can find the three datasets merged into one on Huggingface.

- Krodinger: Lukas Krodinger. ‘Advancing Code Readability: Mined & Modified Code for Dataset Generation’.

You can also find this mined-and-modified dataset on Huggingface. The code for the dataset generation of the mined-and-modified dataset is also available on GitHub.
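If you want to load the merged dataset directly from Huggingface, the `datasets` library can be used. The identifier below is a placeholder; use the dataset name from the Huggingface page linked above:

```python
# load_dataset_sketch.py -- hypothetical example; replace the placeholder identifier
from datasets import load_dataset

# "<huggingface-dataset-id>" stands for the merged readability dataset on Huggingface.
dataset = load_dataset("<huggingface-dataset-id>")
print(dataset)
```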
To prepare your machine for GPU usage with podman, follow these steps.
You can download the pre-built podman container from Docker Hub using this command:

```bash
podman pull lukro2011/rc-gpu:latest
```

Test the container using the following command:

```bash
podman run -it --rm --device nvidia.com/gpu=all lukro2011/rc-gpu:latest python src/readability_classifier/utils/cuda-checker.py
```

Then use the scripts/train.sh script to train the model or the scripts/predict.sh script to predict the readability of a code snippet using the pre-trained model.
Feel free to modify the scripts to your needs. We recommend using the pre-built container and changing the scripts and code, which are mounted into the container, instead of building the container from scratch.
In case you need to modify dependencies, you have to build the container from scratch.
The provided Dockerfile is used to build a podman container with the dependencies from the requirements.txt file.
In case you want to change some versions, change them using poetry and generate the requirements.txt file using this command:
```bash
poetry export --without-hashes -f requirements.txt | awk '{print $1}' > requirements.txt
```

Then build the podman container using the following command:
```bash
podman build -t <your-container-name> .
```

You can debug the container by starting it in interactive mode:
```bash
podman run -it --rm --device nvidia.com/gpu=all <your-container-name>
```

or by using the provided src/readability_classifier/utils/cuda-checker.py script:

```bash
podman run -it --rm --device nvidia.com/gpu=all <your-container-name> python src/readability_classifier/utils/cuda-checker.py
```
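The cuda-checker.py utility essentially verifies that TensorFlow can see the GPU inside the container. A minimal, hypothetical equivalent (not the repository’s actual script) looks like this:

```python
# gpu_check.py -- hypothetical stand-in for src/readability_classifier/utils/cuda-checker.py
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("TensorFlow:", tf.__version__)
print("GPUs visible:", gpus if gpus else "none")
```

For reference, the summary of the pre-trained Keras model is shown below: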
| Layer (type) | Output Shape | Param # | Connected to |
|---|---|---|---|
| struc_input (InputLayer) | [(None, 50, 305)] | 0 | [] |
| struc_reshape (Reshape) | (None, 50, 305, 1) | 0 | ['struc_input[0][0]'] |
| vis_input (InputLayer) | [(None, 128, 128, 3)] | 0 | [] |
| struc_conv1 (Conv2D) | (None, 48, 303, 32) | 320 | ['struc_reshape[0][0]'] |
| vis_conv1 (Conv2D) | (None, 128, 128, 32) | 896 | ['vis_input[0][0]'] |
| struc_pool1 (MaxPooling2D) | (None, 24, 151, 32) | 0 | ['struc_conv1[0][0]'] |
| seman_input_token (InputLayer) | [(None, 100)] | 0 | [] |
| seman_input_segment (InputLayer) | [(None, 100)] | 0 | [] |
| vis_pool1 (MaxPooling2D) | (None, 64, 64, 32) | 0 | ['vis_conv1[0][0]'] |
| struc_conv2 (Conv2D) | (None, 22, 149, 32) | 9248 | ['struc_pool1[0][0]'] |
| seman_bert (BertEmbedding) | (None, 100, 768) | 2342553 | ['seman_input_token[0][0]', 'seman_input_segment[0][0]'] |
| vis_conv2 (Conv2D) | (None, 64, 64, 32) | 9248 | ['vis_pool1[0][0]'] |
| struc_pool2 (MaxPooling2D) | (None, 11, 74, 32) | 0 | ['struc_conv2[0][0]'] |
| seman_conv1 (Conv1D) | (None, 96, 32) | 122912 | ['seman_bert[0][0]'] |
| vis_pool2 (MaxPooling2D) | (None, 32, 32, 32) | 0 | ['vis_conv2[0][0]'] |
| struc_conv3 (Conv2D) | (None, 9, 72, 64) | 18496 | ['struc_pool2[0][0]'] |
| seman_pool1 (MaxPooling1D) | (None, 32, 32) | 0 | ['seman_conv1[0][0]'] |
| vis_conv3 (Conv2D) | (None, 32, 32, 64) | 18496 | ['vis_pool2[0][0]'] |
| struc_pool3 (MaxPooling2D) | (None, 3, 24, 64) | 0 | ['struc_conv3[0][0]'] |
| seman_conv2 (Conv1D) | (None, 28, 32) | 5152 | ['seman_pool1[0][0]'] |
| vis_pool3 (MaxPooling2D) | (None, 16, 16, 64) | 0 | ['vis_conv3[0][0]'] |
| struc_flatten (Flatten) | (None, 4608) | 0 | ['struc_pool3[0][0]'] |
| seman_gru (Bidirectional) | (None, 64) | 16640 | ['seman_conv2[0][0]'] |
| vis_flatten (Flatten) | (None, 16384) | 0 | ['vis_pool3[0][0]'] |
| concatenate (Concatenate) | (None, 21056) | 0 | ['struc_flatten[0][0]', 'seman_gru[0][0]', 'vis_flatten[0][0]'] |
| class_dense1 (Dense) | (None, 64) | 1347648 | ['concatenate[0][0]'] |
| class_dropout (Dropout) | (None, 64) | 0 | ['class_dense1[0][0]'] |
| class_dense2 (Dense) | (None, 16) | 1040 | ['class_dropout[0][0]'] |
| class_dense3 (Dense) | (None, 1) | 17 | ['class_dense2[0][0]'] |
Total params: 24975649 (95.27 MB)
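For orientation, the sketch below reconstructs a Keras model with the same three branches (structural, semantic, visual) and the layer names from the summary above. The kernel sizes, pooling sizes, activations, and the plain embedding standing in for the BertEmbedding layer are assumptions derived from the output shapes, not the repository’s actual configuration:

```python
# model_sketch.py -- hypothetical reconstruction of the architecture summarized above
import tensorflow as tf
from tensorflow.keras import layers

# Structural branch: matrix representation of the snippet.
struc_input = layers.Input(shape=(50, 305), name="struc_input")
s = layers.Reshape((50, 305, 1), name="struc_reshape")(struc_input)
s = layers.Conv2D(32, 3, activation="relu", name="struc_conv1")(s)
s = layers.MaxPooling2D(2, name="struc_pool1")(s)
s = layers.Conv2D(32, 3, activation="relu", name="struc_conv2")(s)
s = layers.MaxPooling2D(2, name="struc_pool2")(s)
s = layers.Conv2D(64, 3, activation="relu", name="struc_conv3")(s)
s = layers.MaxPooling2D(3, name="struc_pool3")(s)
struc_out = layers.Flatten(name="struc_flatten")(s)

# Semantic branch: BERT token and segment ids. Two plain Embedding layers,
# added together, stand in for the BertEmbedding layer of the real model.
seman_token = layers.Input(shape=(100,), dtype="int32", name="seman_input_token")
seman_segment = layers.Input(shape=(100,), dtype="int32", name="seman_input_segment")
token_emb = layers.Embedding(30000, 768, name="seman_token_embedding")(seman_token)
segment_emb = layers.Embedding(2, 768, name="seman_segment_embedding")(seman_segment)
e = layers.Add(name="seman_bert_standin")([token_emb, segment_emb])
e = layers.Conv1D(32, 5, activation="relu", name="seman_conv1")(e)
e = layers.MaxPooling1D(3, name="seman_pool1")(e)
e = layers.Conv1D(32, 5, activation="relu", name="seman_conv2")(e)
seman_out = layers.Bidirectional(layers.GRU(32), name="seman_gru")(e)

# Visual branch: rendered image of the snippet.
vis_input = layers.Input(shape=(128, 128, 3), name="vis_input")
v = layers.Conv2D(32, 3, padding="same", activation="relu", name="vis_conv1")(vis_input)
v = layers.MaxPooling2D(2, name="vis_pool1")(v)
v = layers.Conv2D(32, 3, padding="same", activation="relu", name="vis_conv2")(v)
v = layers.MaxPooling2D(2, name="vis_pool2")(v)
v = layers.Conv2D(64, 3, padding="same", activation="relu", name="vis_conv3")(v)
v = layers.MaxPooling2D(2, name="vis_pool3")(v)
vis_out = layers.Flatten(name="vis_flatten")(v)

# Classifier head on the concatenated features of all three branches.
merged = layers.Concatenate(name="concatenate")([struc_out, seman_out, vis_out])
c = layers.Dense(64, activation="relu", name="class_dense1")(merged)
c = layers.Dropout(0.5, name="class_dropout")(c)
c = layers.Dense(16, activation="relu", name="class_dense2")(c)
readability = layers.Dense(1, activation="sigmoid", name="class_dense3")(c)

model = tf.keras.Model(
    inputs=[struc_input, seman_token, seman_segment, vis_input],
    outputs=readability,
)
model.summary()
```

This sketch reproduces the output shapes of the summary; only the semantic embedding differs, since the real model uses a BERT-based embedding layer.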