
Code Readability Classifier

This Python program utilizes machine learning to predict the readability of source code snippets. It is designed to work with Python 3.11 and uses Poetry for package management.

The most recent implementation of the model uses Keras (see the keras folder). Previously, we tried to achieve the same with PyTorch (see the torch folder), but we did not reach the same classification accuracy for unknown reasons.

The model is based on the implementation of the following paper:

@article{mi2022towards,
  title={Towards using visual, semantic and structural features to improve code readability classification},
  author={Mi, Qing and Hao, Yiqun and Ou, Liwei and Ma, Wei},
  journal={Journal of Systems and Software},
  volume={193},
  pages={111454},
  year={2022},
  publisher={Elsevier}
}

Installation

You can use either Poetry or pip to install the required dependencies. We recommend Poetry for a clean, isolated environment on your local machine; inside the container, we use pip. In either case, make sure to add the project path to your PYTHONPATH (see Project & Python Path below) before running the program.

To set up the project and its dependencies, follow these steps:

  1. Clone this repository to your local machine:

    git clone https://github.com/LuKrO2011/readability-classifier
    cd readability-classifier
  2. Install Python 3.11 and pip, if you haven’t already.

    • Windows: Python, Pip

    • Ubuntu:

      sudo apt-get update
      sudo apt-get install -y python3.11 python3-pip
  3. Install wkhtmltopdf:

    • Windows/MacOS: After downloading and installing, make sure to add the /bin folder to your PATH.

    • Ubuntu: Install the package using the following commands:

      sudo apt-get update
      sudo apt-get install -y wkhtmltopdf
    • After the installation, restart your IDE/terminal.

  4. Use git lfs to download the models & datasets:

    git lfs install
    git lfs pull

Using Poetry

  1. Install Poetry if you haven’t already:

    pip install poetry
  2. Create a virtual environment and install the project’s dependencies using Poetry:

    poetry install
  3. Activate the virtual environment:

    poetry shell
  4. For Developers only: Activate the pre-commit hooks:

    pre-commit install
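
To verify that the environment resolved correctly, you can run a quick import check (this assumes the keras dependency installed without errors):

    poetry run python -c "import keras; print(keras.__version__)"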

Using Pip

When using pip on your local machine, we recommend using a virtual environment, such as venv, to avoid conflicts with other projects.

Install the required dependencies using pip:

  • Windows:

    # Install each non-empty, non-comment line of requirements.txt individually
    foreach ($k in Get-Content requirements.txt) {
        if ($k -and -not $k.StartsWith("#")) {
            pip install $k
        }
    }
  • Ubuntu/MacOS:

    grep -v '^#' requirements.txt | xargs -n 1 pip install

Project & Python Path

After the installation, make sure that your PYTHONPATH is set to the root directory of the project:

  • Windows:

    $env:PYTHONPATH = "readability-classifier"
  • Ubuntu:

    export PYTHONPATH="readability-classifier":$PYTHONPATH

If you are using an IDE, make sure to set your working directory to the root directory of the project (e.g., readability-classifier).

Now you’re ready to use the source code readability prediction tool.

Usage

To get an overview of all available parameters, use -h or --help. You can find the default parameters in the main.py file. We trained the model using the default parameters, but you can adjust them to your needs.

Predict

To predict the readability of a source code snippet, use the following command:

python src/readability_classifier/main.py PREDICT --model MODEL --input INPUT [--token-length TOKEN_LENGTH]
  • --model or -m: Path to the pre-trained machine learning model (.h5 or .keras).

  • --input or -i: Path to the source code snippet you want to evaluate. Alternatively, you can provide a folder with multiple snippets.

  • --token-length or -l (optional): The token length of the snippet (cutting/padding applied).

Example:

python src/readability_classifier/main.py PREDICT --model tests/res/models/towards.keras --input tests/res/code_snippets/towards.java

While the data is processed batch-wise during training, prediction currently supports only one snippet at a time; batched prediction is not implemented yet. If you need to predict multiple snippets, they are processed one after another.
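
Since --input also accepts a folder (see above), you can point the command at a directory of snippets, which are then predicted one after another:

python src/readability_classifier/main.py PREDICT --model tests/res/models/towards.keras --input tests/res/code_snippets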

Train

To train a new machine learning model for source code readability prediction, use the following command:

python src/readability_classifier/main.py TRAIN --input INPUT [--save SAVE] [--intermediate INTERMEDIATE] [--evaluate] [--token-length TOKEN_LENGTH] [--batch-size BATCH_SIZE] [--epochs EPOCHS] [--learning-rate LEARNING_RATE]
  • --input or -i: Path to the folder with the raw dataset or the encoded dataset generated using the intermediate command.

  • --save or -s (optional): Path to the folder where the trained model should be stored. If not specified, the model is not stored.

  • --intermediate (optional): Path to the folder where the encoded dataset as intermediate results should be stored. If not specified, the dataset is not stored after encoding.

  • --evaluate (optional): Whether to evaluate the model after training.

  • --token-length or -l (optional): The token length of the snippets (cutting/padding applied).

  • --batch-size or -b (optional): The batch size for training.

  • --epochs or -e (optional): The number of epochs for training.

  • --learning-rate or -r (optional): The learning rate for training.

Example:

python src/readability_classifier/main.py TRAIN --input tests/res/raw_datasets/combined --save output
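
The optional flags can be combined freely. For example (the concrete values below are illustrative, not the tuned defaults from main.py):

python src/readability_classifier/main.py TRAIN --input tests/res/raw_datasets/combined --save output --evaluate --batch-size 8 --epochs 20 --learning-rate 0.0015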

Dataset

The datasets used for training and evaluation are from the following sources:

  • BW: Raymond P. L. Buse and Westley R. Weimer, ‘Learning a metric for code readability’.

  • Dorn: Jonathan Dorn, ‘A general software readability model’.

  • Scalabrino: Simone Scalabrino et al., ‘Improving code readability models with textual features’.

You can find the three datasets merged into one on Huggingface.

  • Krodinger: Lukas Krodinger, ‘Advancing Code Readability: Mined & Modified Code for Dataset Generation’.

You can also find this mined-and-modified dataset on Huggingface. The code for the dataset generation of the mined-and-modified dataset is also available on GitHub.
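
If you want to inspect these datasets programmatically, the Huggingface datasets library can download them directly. A minimal sketch, assuming the library is installed; the dataset identifier and the 'train' split name are placeholders to replace with the actual values from the Huggingface page:

    from datasets import load_dataset

    # "<user>/<dataset-name>" is a placeholder; copy the real ID from Huggingface.
    dataset = load_dataset("<user>/<dataset-name>")
    print(dataset)              # available splits and their sizes
    print(dataset["train"][0])  # one labeled snippet, assuming a 'train' split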

GPU + Podman

To prepare your machine to use a GPU with Podman, follow the steps below.

Using the Container

You can download the pre-built Podman container from Docker Hub using this command:

podman pull lukro2011/rc-gpu:latest

Test the container using the following command:

podman run -it --rm --device nvidia.com/gpu=all lukro2011/rc-gpu:latest python src/readability_classifier/utils/cuda-checker.py

Then use the scripts/train.sh script to train the model or the scripts/predict.sh script to predict the readability of a code snippet using the pre-trained model.

Feel free to modify the scripts to your needs. We recommend using the pre-built container and changing the mounted scripts and code instead of building the container from scratch.
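
For example, to run your locally modified scripts inside the pre-built container, you might mount the repository into it. A sketch, assuming the container's working directory is /app (check the Dockerfile for the actual mount point):

podman run -it --rm --device nvidia.com/gpu=all -v "$(pwd)":/app lukro2011/rc-gpu:latest bash scripts/train.sh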

Build Container

If you need to modify the dependencies, you have to build the container from scratch. The provided Dockerfile builds a Podman container with the dependencies from the requirements.txt file.

If you want to change some dependency versions, change them using Poetry and regenerate the requirements.txt file using this command:

poetry export --without-hashes -f requirements.txt | awk '{print $1}' > requirements.txt

Then build the podman container using the following command:

podman build -t <your-container-name> .

You can debug the container by starting it in interactive mode:

podman run -it --rm --device nvidia.com/gpu=all <your-container-name>

or by using the provided src/readability_classifier/utils/cuda-checker.py script:

podman run -it --rm --device nvidia.com/gpu=all <your-container-name> python src/readability_classifier/utils/cuda-checker.py

Model Overview

Layer (type)                       Output Shape            Param #     Connected to
struc_input (InputLayer)           [(None, 50, 305)]       0           []
struc_reshape (Reshape)            (None, 50, 305, 1)      0           ['struc_input[0][0]']
vis_input (InputLayer)             [(None, 128, 128, 3)]   0           []
struc_conv1 (Conv2D)               (None, 48, 303, 32)     320         ['struc_reshape[0][0]']
vis_conv1 (Conv2D)                 (None, 128, 128, 32)    896         ['vis_input[0][0]']
struc_pool1 (MaxPooling2D)         (None, 24, 151, 32)     0           ['struc_conv1[0][0]']
seman_input_token (InputLayer)     [(None, 100)]           0           []
seman_input_segment (InputLayer)   [(None, 100)]           0           []
vis_pool1 (MaxPooling2D)           (None, 64, 64, 32)      0           ['vis_conv1[0][0]']
struc_conv2 (Conv2D)               (None, 22, 149, 32)     9248        ['struc_pool1[0][0]']
seman_bert (BertEmbedding)         (None, 100, 768)        2342553     ['seman_input_token[0][0]', 'seman_input_segment[0][0]']
vis_conv2 (Conv2D)                 (None, 64, 64, 32)      9248        ['vis_pool1[0][0]']
struc_pool2 (MaxPooling2D)         (None, 11, 74, 32)      0           ['struc_conv2[0][0]']
seman_conv1 (Conv1D)               (None, 96, 32)          122912      ['seman_bert[0][0]']
vis_pool2 (MaxPooling2D)           (None, 32, 32, 32)      0           ['vis_conv2[0][0]']
struc_conv3 (Conv2D)               (None, 9, 72, 64)       18496       ['struc_pool2[0][0]']
seman_pool1 (MaxPooling1D)         (None, 32, 32)          0           ['seman_conv1[0][0]']
vis_conv3 (Conv2D)                 (None, 32, 32, 64)      18496       ['vis_pool2[0][0]']
struc_pool3 (MaxPooling2D)         (None, 3, 24, 64)       0           ['struc_conv3[0][0]']
seman_conv2 (Conv1D)               (None, 28, 32)          5152        ['seman_pool1[0][0]']
vis_pool3 (MaxPooling2D)           (None, 16, 16, 64)      0           ['vis_conv3[0][0]']
struc_flatten (Flatten)            (None, 4608)            0           ['struc_pool3[0][0]']
seman_gru (Bidirectional)          (None, 64)              16640       ['seman_conv2[0][0]']
vis_flatten (Flatten)              (None, 16384)           0           ['vis_pool3[0][0]']
concatenate (Concatenate)          (None, 21056)           0           ['struc_flatten[0][0]', 'seman_gru[0][0]', 'vis_flatten[0][0]']
class_dense1 (Dense)               (None, 64)              1347648     ['concatenate[0][0]']
class_dropout (Dropout)            (None, 64)              0           ['class_dense1[0][0]']
class_dense2 (Dense)               (None, 16)              1040        ['class_dropout[0][0]']
class_dense3 (Dense)               (None, 1)               17          ['class_dense2[0][0]']

Total params: 24975649 (95.27 MB)
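
For orientation, below is a minimal Keras sketch that rebuilds the same layer graph from the summary above. It is a reconstruction, not the project's source code: the custom BertEmbedding layer is approximated with plain token and segment embeddings, and the activations and dropout rate are assumptions, so parameter counts in the semantic branch differ while all tensor shapes match the table.

from tensorflow import keras
from tensorflow.keras import layers

# Structural branch: a 50x305 character matrix treated as a one-channel image.
struc_in = keras.Input(shape=(50, 305), name="struc_input")
s = layers.Reshape((50, 305, 1), name="struc_reshape")(struc_in)
s = layers.Conv2D(32, 3, activation="relu", name="struc_conv1")(s)
s = layers.MaxPooling2D(2, name="struc_pool1")(s)
s = layers.Conv2D(32, 3, activation="relu", name="struc_conv2")(s)
s = layers.MaxPooling2D(2, name="struc_pool2")(s)
s = layers.Conv2D(64, 3, activation="relu", name="struc_conv3")(s)
s = layers.MaxPooling2D(3, name="struc_pool3")(s)
struc_out = layers.Flatten(name="struc_flatten")(s)  # (None, 4608)

# Semantic branch: BERT-style token and segment IDs of length 100. The custom
# BertEmbedding layer is approximated by token + segment embeddings (vocabulary
# size 30522 is an assumption), so parameter counts differ from the table.
tok_in = keras.Input(shape=(100,), name="seman_input_token")
seg_in = keras.Input(shape=(100,), name="seman_input_segment")
tok_emb = layers.Embedding(30522, 768)(tok_in)
seg_emb = layers.Embedding(2, 768)(seg_in)
m = layers.Add(name="seman_bert")([tok_emb, seg_emb])  # (None, 100, 768)
m = layers.Conv1D(32, 5, activation="relu", name="seman_conv1")(m)
m = layers.MaxPooling1D(3, name="seman_pool1")(m)
m = layers.Conv1D(32, 5, activation="relu", name="seman_conv2")(m)
seman_out = layers.Bidirectional(layers.GRU(32), name="seman_gru")(m)  # (None, 64)

# Visual branch: a 128x128 RGB rendering of the snippet.
vis_in = keras.Input(shape=(128, 128, 3), name="vis_input")
v = layers.Conv2D(32, 3, padding="same", activation="relu", name="vis_conv1")(vis_in)
v = layers.MaxPooling2D(2, name="vis_pool1")(v)
v = layers.Conv2D(32, 3, padding="same", activation="relu", name="vis_conv2")(v)
v = layers.MaxPooling2D(2, name="vis_pool2")(v)
v = layers.Conv2D(64, 3, padding="same", activation="relu", name="vis_conv3")(v)
v = layers.MaxPooling2D(2, name="vis_pool3")(v)
vis_out = layers.Flatten(name="vis_flatten")(v)  # (None, 16384)

# Fuse the three branches and classify readability as a single score.
c = layers.Concatenate(name="concatenate")([struc_out, seman_out, vis_out])
c = layers.Dense(64, activation="relu", name="class_dense1")(c)
c = layers.Dropout(0.5, name="class_dropout")(c)
c = layers.Dense(16, activation="relu", name="class_dense2")(c)
out = layers.Dense(1, activation="sigmoid", name="class_dense3")(c)

model = keras.Model(inputs=[struc_in, tok_in, seg_in, vis_in], outputs=out)
model.summary()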
