Inside Out: Uncovering How Comment Internalization Steers LLMs for Better or Worse

Setup Instructions

  1. Run the following commands to add the current directory to the Python path and disable TensorFlow's oneDNN optimizations:
export PYTHONPATH=$PYTHONPATH:$(pwd)
export TF_ENABLE_ONEDNN_OPTS=0
  2. Install Miniconda or Anaconda. Then create the environments required for the experiments with the following commands:
conda env create -f torch-env.yml
conda env create -f ollm.yml

Due to conflicting dependencies, run RQ1 and RQ2 in the torch-env environment, the RQ3 baselines in the ollm environment, and the RQ3 CA and CD experiments in the torch-env environment.
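As a quick sketch, switching environments for each set of experiments looks like this (environment names assume the defaults defined in the YAML files above):

# RQ1, RQ2, and the RQ3 CA/CD experiments
conda activate torch-env

# RQ3 baselines
conda activate ollm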

Setup Local Inference for Producing Baseline Results in RQ3

If you have not already installed Miniconda, follow the installation instructions for your operating system. Then set up a local inference environment with the following commands:

conda create -n sglang
conda activate sglang
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.4.3.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
python -m sglang.launch_server --model-path <model-name> --port 49999 --host 0.0.0.0

Look for a log message like max_total_num_tokens=15919, max_prefill_tokens=16384, max_running_requests=2049, context_len=12000 in the server logs; max_total_num_tokens is the maximum number of tokens the server can process. Then rename the .env.example file to .env and set the MAX_TOTAL_NUM_TOKENS environment variable to that value. This is required for running the inference tasks.
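For example, assuming the server reported max_total_num_tokens=15919 as in the log line above, the configuration might look like this (the curl check assumes SGLang's default OpenAI-compatible endpoint on the port chosen above):

mv .env.example .env
# Edit .env and set the value reported by the server, e.g.:
# MAX_TOTAL_NUM_TOKENS=15919

# Optional sanity check that the server is up
curl http://localhost:49999/v1/models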

RQ1

Please refer to tcav/README.md for instructions on replicating the results of RQ1 and generating the plots.

RQ2

The figure below shows the median test accuracy for the Javadoc, Inline, and Multiline classifiers across all models:

[Figure: median test accuracy per comment type across all models]

To replicate the results of RQ2, please refer to tcav/README.md for instructions on training the classifiers and generating the plots.

RQ3

The tasks are located in the tasks directory. To replicate the results of RQ3 for each task, refer to the README.md in that task's subdirectory under tasks. The Sheets directory contains the performance results of the experiments for each task, including the baseline, Concept Activation (CA), and Concept Deactivation (CD) results across all LLMs, metrics, and tasks.

NEW (Added as part of revisions): For convenience, we also provide the results of RQ3 for Code Translation to C, C++, and Go below:

SR: Success Rate, $\mathbf{SR^\alpha}$: Commented SR, $\mathbf{SR^\gamma}$: Comment-free SR. Bold: Largest impact of activating or deactivating a concept. Underlined: Largest overall impact from activation and deactivation across concepts and LLMs.
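Although not stated explicitly here, the relative change in parentheses appears to follow from the two SR values in each cell (e.g., for the Comment/Code row in the C table: $(60.47 - 67.44)/67.44 \approx -10.34\%$):

$$\Delta_{rel} = \frac{SR_{\text{steered}} - SR_{\text{unsteered}}}{SR_{\text{unsteered}}} \times 100\%,$$

where the unsteered value is $SR^\alpha$ for CD and $SR^\gamma$ for CA.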

RQ3 results for Code Translation to Python

Refer to Table 3 in the paper.

RQ3 results for Code Translation to C

| Concept | LLM | $\mathbf{SR^\alpha}$ / CD SR ($\Delta_{rel}$) | $\mathbf{SR^\gamma}$ / CA SR ($\Delta_{rel}$) |
|---|---|---|---|
| Comment | Code | 67.44 / 60.47 (-10.34%) | 69.77 / 65.12 (-6.67%) |
| Comment | Generic | 55.81 / 58.14 (4.17%) | 53.49 / 53.49 (0.00%) |
| Comment | Reasoning | 79.07 / 90.70 (14.71%) | 72.09 / 60.47 (**-16.13%**) |
| Javadoc | Code | 66.67 / 58.33 (-12.50%) | 69.77 / 65.12 (-6.67%) |
| Javadoc | Generic | 66.67 / 50.00 (-25.00%) | 53.49 / 55.81 (4.35%) |
| Javadoc | Reasoning | 75.00 / 58.33 (-22.22%) | 72.09 / 72.09 (0.00%) |
| Inline | Code | 87.50 / 87.50 (0.00%) | 69.77 / 69.77 (0.00%) |
| Inline | Generic | 81.25 / 68.75 (-15.38%) | 53.49 / 55.81 (4.35%) |
| Inline | Reasoning | 87.50 / 87.50 (0.00%) | 72.09 / 79.07 (9.68%) |
| Multiline | Code | 46.67 / 46.67 (0.00%) | 69.77 / 62.79 (-10.00%) |
| Multiline | Generic | 20.00 / 20.00 (0.00%) | 53.49 / 51.16 (-4.35%) |
| Multiline | Reasoning | 73.33 / 80.00 (9.09%) | 72.09 / 67.44 (-6.45%) |

RQ3 results for Code Translation to C++

| Concept | LLM | $\mathbf{SR^\alpha}$ / CD SR ($\Delta_{rel}$) | $\mathbf{SR^\gamma}$ / CA SR ($\Delta_{rel}$) |
|---|---|---|---|
| Comment | Code | 69.77 / 60.47 (-13.33%) | 67.44 / 62.79 (-6.90%) |
| Comment | Generic | 53.49 / 65.12 (21.74%) | 62.79 / 62.79 (0.00%) |
| Comment | Reasoning | 65.12 / 74.42 (14.29%) | 65.12 / 51.16 (**-21.43%**) |
| Javadoc | Code | 58.33 / 50.00 (-14.29%) | 67.44 / 60.47 (-10.34%) |
| Javadoc | Generic | 41.67 / 50.00 (20.00%) | 62.79 / 58.14 (-7.41%) |
| Javadoc | Reasoning | 58.33 / 50.00 (-14.29%) | 65.12 / 67.44 (3.57%) |
| Inline | Code | 87.50 / 100.00 (14.29%) | 67.44 / 55.81 (-17.24%) |
| Inline | Generic | 75.00 / 75.00 (0.00%) | 62.79 / 58.14 (-7.41%) |
| Inline | Reasoning | 87.50 / 87.50 (0.00%) | 65.12 / 67.44 (3.57%) |
| Multiline | Code | 60.00 / 40.00 (-33.33%) | 67.44 / 65.12 (-3.45%) |
| Multiline | Generic | 40.00 / 40.00 (0.00%) | 62.79 / 60.47 (-3.70%) |
| Multiline | Reasoning | 46.67 / 53.33 (14.29%) | 65.12 / 65.12 (0.00%) |

RQ3 results for Code Translation to Go

| Concept | LLM | $\mathbf{SR^\alpha}$ / CD SR ($\Delta_{rel}$) | $\mathbf{SR^\gamma}$ / CA SR ($\Delta_{rel}$) |
|---|---|---|---|
| Comment | Code | 53.49 / 58.14 (8.70%) | 53.49 / 53.49 (0.00%) |
| Comment | Generic | 27.91 / 34.88 (25.00%) | 30.23 / 25.58 (-15.38%) |
| Comment | Reasoning | 60.47 / 53.49 (-11.54%) | 51.16 / 48.84 (-4.55%) |
| Javadoc | Code | 41.67 / 41.67 (0.00%) | 53.49 / 55.81 (4.35%) |
| Javadoc | Generic | 16.67 / 25.00 (50.00%) | 30.23 / 34.88 (15.38%) |
| Javadoc | Reasoning | 50.00 / 58.33 (16.67%) | 51.16 / 62.79 (22.73%) |
| Inline | Code | 56.25 / 68.75 (22.22%) | 53.49 / 53.49 (0.00%) |
| Inline | Generic | 50.00 / 56.25 (12.50%) | 30.23 / 32.56 (7.69%) |
| Inline | Reasoning | 87.50 / 93.75 (7.14%) | 51.16 / 60.47 (18.18%) |
| Multiline | Code | 75.00 / 93.75 (25.00%) | 53.49 / 55.81 (4.35%) |
| Multiline | Generic | 43.75 / 50.00 (14.29%) | 30.23 / 32.56 (7.69%) |
| Multiline | Reasoning | 62.50 / 68.75 (10.00%) | 51.16 / 65.12 (27.27%) |

RQ4

The task instructions for each of the 10 SE tasks are located in the rq4/prompts.json file. We list them below for reference:

{
    "code summarization": "Summarize the following function in one sentence.\n{code}",
    "code translation": "Translate this code from Java to Python.\n{code}",
    "test generation": "Write test cases for the following class to check its correctness.\n{code}",
    "code completion": "Complete the last line of the following code snippet.\n{code}",
    "fault localization": "Identify the buggy line(s) in the following code that cause it to fail.\n{code}",
    "program repair": "Fix the bug in the following code snippet to make it work as intended.\n{code}",
    "vulnerability detection": "Does the following code contain any security vulnerabilities? If yes, describe them.\n{code}",
    "code review": "Review the following code change and provide suggestions for improvement.\n{code}",
    "code refactoring": "Optimize the following code to improve performance without changing its output.\n{code}",
    "code documentation": "Generate inline comments and a docstring for the following function.\n{code}"
}

The {code} placeholder is replaced with the actual code snippet for each task, as sketched below.
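As a minimal sketch, here is one way a template might be instantiated (the sample snippet is hypothetical, not taken from the dataset):

python - <<'EOF'
import json

# Load the RQ4 task templates
with open("rq4/prompts.json") as f:
    prompts = json.load(f)

# Hypothetical code snippet, for illustration only
snippet = "public int add(int a, int b) { return a + b; }"

# Fill the {code} placeholder of the summarization template
print(prompts["code summarization"].format(code=snippet))
EOF

To run the RQ4 experiments, run the following commands: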

cd rq4
python run_rq4.py

Citation

If you found this work helpful, please consider citing it using the following:

@misc{imani2025insideoutuncoveringcomment,
      title={Inside Out: Uncovering How Comment Internalization Steers LLMs for Better or Worse}, 
      author={Aaron Imani and Mohammad Moshirpour and Iftekhar Ahmed},
      year={2025},
      eprint={2512.16790},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2512.16790}, 
}

About

Official repository for the ICSE 2026 publication "Inside Out: Uncovering How Comment Internalization Steers LLMs for Better or Worse"
