A research-oriented implementation of the Rebucket clustering algorithm.
This repository provides:
- A C++ shared library (librebucket) implementing a single-pass clustering routine for stack traces
- A Python runner (rebucket/test/test.py) using ctypes to load the shared library and run on datasets
- A dataset generator (generate_dataset.py) to extract Java stack traces from bugrepo CSV and emit JSON
The implementation is intended for research/experimentation, not production.
Features:
- Single-pass clustering API: single_pass_clustering(const char*) -> const char*
- Deterministic behavior (fixed a historical OpenMP data race)
- Robust JSON input validation (invalid input returns an empty string instead of crashing)
- Test entrypoints: CTest + Python --self-test
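Requirements:
- CMake (recommended >= 3.x; minimum supported by this repo: 2.8.12). Note that the cmake -S/-B invocation used in this README needs CMake 3.13 or newer, and ctest --test-dir needs CMake 3.20 or newer.
- A C++ compiler with OpenMP support
- Python 3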
From the repository root:
cmake -S rebucket -B rebucket/build
cmake --build rebucket/build -j
This produces:
- rebucket/build/librebucket.so (Linux)
- rebucket/build/librebucket.dylib (macOS)
- rebucket/build/librebucket.dll (Windows)
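
Since the library file name differs per platform, a loader has to pick the right suffix before calling into ctypes. The following is a minimal sketch of that selection, not the logic used by test.py; the helper names library_filename and library_path are hypothetical.

```python
import os
import sys


def library_filename(platform=sys.platform):
    """Return the expected librebucket file name for a sys.platform value."""
    if platform.startswith("win"):
        return "librebucket.dll"
    if platform == "darwin":
        return "librebucket.dylib"
    # Assumption: every other platform uses the Linux-style .so suffix.
    return "librebucket.so"


def library_path(directory=".", platform=sys.platform):
    """Join a build directory with the platform-specific file name."""
    return os.path.join(directory, library_filename(platform))
```

With the working directory set to rebucket/build, the result of library_path() could be passed to ctypes.CDLL to obtain a handle to the shared library.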
The runner loads ./librebucket.* from the current working directory, so run it inside rebucket/build:
cd rebucket/build
python3 test.py -d ../../dataset/Firefox/df_mozilla_firefox.json
See rebucket/include/rebucket.h:26.
const char* single_pass_clustering(const char* stack_json)
- stack_json format: {"stack_id":"...","stack_arr":["frame1","frame2", ...]}
- Returns: bucket id as a NUL-terminated string
- Invalid input: returns empty string
- Lifetime: valid until the next call in the same thread

void rebucket_reset() clears the internal global buckets.
size_t rebucket_bucket_count() returns the current bucket count.
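
The three functions above can be declared and called from Python through ctypes. This is a minimal sketch based only on the signatures listed here, not a copy of test.py; the helper names bind and cluster_one are hypothetical, and lib is assumed to be a handle obtained from ctypes.CDLL.

```python
import ctypes
import json


def bind(lib):
    """Declare argument and return types for the librebucket functions."""
    lib.single_pass_clustering.argtypes = [ctypes.c_char_p]
    lib.single_pass_clustering.restype = ctypes.c_char_p
    lib.rebucket_reset.argtypes = []
    lib.rebucket_reset.restype = None
    lib.rebucket_bucket_count.argtypes = []
    lib.rebucket_bucket_count.restype = ctypes.c_size_t
    return lib


def cluster_one(lib, stack_id, frames):
    """Send one stack to the library; return the bucket id or None.

    With restype c_char_p, ctypes copies the returned C string into a
    Python bytes object right away, so the "valid until the next call"
    lifetime rule does not affect the value returned here.
    """
    payload = json.dumps({"stack_id": stack_id, "stack_arr": frames})
    raw = lib.single_pass_clustering(payload.encode("utf-8"))
    # An empty string signals invalid input according to the API notes.
    return raw.decode("utf-8") if raw else None
```

A caller would typically run rebucket_reset() before processing a dataset, call cluster_one once per stack, and read rebucket_bucket_count() at the end.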
Raw dataset source: https://github.com/logpai/bugrepo.
This repository stores processed datasets as JSON arrays under dataset/*/*.json:
[
  {
    "stack_id": "...",
    "duplicated_stack": "...",
    "stack_arr": [
      {"symbol": "a.b.C.m", "file": "C.java", "line": 123}
    ]
  }
]
Note: the C++ clustering core uses only the symbol sequence. The Python runner converts each stack_arr into {"stack_arr":["symbol", ...]} before calling into the shared library.
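
The conversion described in the note can be sketched as follows. This is an illustration under assumptions rather than the runner's actual code: the function name to_library_payload is hypothetical, and carrying stack_id through is assumed from the stack_json format documented for single_pass_clustering.

```python
import json


def to_library_payload(record):
    """Reduce a dataset record to the symbol-only JSON the library accepts."""
    # Keep only the symbol of each frame; file and line are dropped.
    symbols = [frame["symbol"] for frame in record["stack_arr"]]
    return json.dumps({"stack_id": record["stack_id"], "stack_arr": symbols})
```

The duplicated_stack field is not forwarded, since the clustering core only consumes the symbol sequence.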
generate_dataset.py extracts Java stack traces from bugrepo CSV and emits the JSON format above.
The CSV must contain at least these columns:
- Issue_id
- Duplicated_issue
- Description
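
To give a feel for what extracting Java frames from a Description value involves, here is a minimal sketch. It is not the method used by generate_dataset.py: the regular expression and the name extract_frames are assumptions chosen for illustration, and the pattern only recognizes frames of the form "at package.Class.method(File.java:123)".

```python
import re

# Assumed frame shape: "at a.b.C.m(C.java:123)".
FRAME_RE = re.compile(r"\bat\s+([\w$.<>]+)\(([^():\s]+\.java):(\d+)\)")


def extract_frames(description):
    """Return the Java frames found in free text, in order of appearance."""
    return [
        {"symbol": m.group(1), "file": m.group(2), "line": int(m.group(3))}
        for m in FRAME_RE.finditer(description)
    ]
```

Frames without a source location, such as native methods, are skipped by this pattern, and ordinary prose that happens to match the frame shape would be picked up, so a real extractor would likely need stricter rules.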
The current script uses fixed input/output path arrays (see generate_dataset.py:186). After placing the CSV files accordingly, run:
python3 generate_dataset.py

Tests:
ctest --test-dir rebucket/build --output-on-failure
cd rebucket/build
python3 test.py --self-test

Troubleshooting:
- test.py cannot find the shared library: run it in rebucket/build so it can load ./librebucket.*
- OpenMP not found: install an OpenMP runtime (libgomp/libomp) and ensure CMake can find_package(OpenMP)

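This repository contains:
- A C++ shared library, librebucket, which provides a single-pass stack-trace clustering interface
- A Python driver script, rebucket/test/test.py, which loads the shared library through ctypes and runs it on datasets
- A dataset generation script, generate_dataset.py, which extracts Java stack traces from bugrepo CSV files and generates JSON
It is intended for research and experimentation, not as a production-grade service.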
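Features:
- Provides single_pass_clustering(const char*) -> const char* as a C ABI
- Fixes a historical OpenMP data race so that results are stable
- Fully validates JSON input (invalid input returns an empty string instead of crashing)
- Provides CTest and Python self-test entry points for regression testing

Requirements:
- CMake (>= 3.x recommended; the minimum supported by this repository is 2.8.12)
- A C++ compiler with OpenMP support
- Python 3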
From the repository root, run:
cmake -S rebucket -B rebucket/build
cmake --build rebucket/build -j
The build artifacts are placed in rebucket/build and typically include:
- Linux: librebucket.so
- macOS: librebucket.dylib
- Windows: librebucket.dll
By default the script loads ./librebucket.* from the current directory, so it must be run from rebucket/build:
cd rebucket/build
python3 test.py -d ../../dataset/Firefox/df_mozilla_firefox.json
The interface is defined in rebucket/include/rebucket.h:26.
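- single_pass_clustering(stack_json): takes one stack trace (JSON) and returns the id of the bucket it belongs to
  - Input format: {"stack_id":"...","stack_arr":["frame1","frame2", ...]}
  - Invalid input: returns an empty string
  - Return value lifetime: valid until the next call in the same thread
- rebucket_reset(): clears the internal bucket state
- rebucket_bucket_count(): returns the current number of buckets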
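Raw dataset source: https://github.com/logpai/bugrepo.
The dataset/*/*.json files in this repository are the processed datasets. Each file is an array, and each element holds the stack trace of one issue: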
[
  {
    "stack_id": "...",
    "duplicated_stack": "...",
    "stack_arr": [
      {"symbol": "a.b.C.m", "file": "C.java", "line": 123}
    ]
  }
]
Note: the C++ clustering core uses only the symbol sequence; rebucket/test/test.py converts the stack_arr of each record into an array of symbol strings before calling the shared library.
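generate_dataset.py extracts the Java stack traces found in the Description column of the bugrepo CSV files and generates the JSON described above.
The CSV needs at least three columns: Issue_id, Duplicated_issue, and Description.
The current script uses fixed input/output path arrays (see generate_dataset.py:186). Place the CSV files at the corresponding paths, then run:
python3 generate_dataset.py

Tests:
ctest --test-dir rebucket/build --output-on-failure
cd rebucket/build
python3 test.py --self-test

Troubleshooting:
- test.py cannot find the shared library: make sure it is run from rebucket/build (the script loads ./librebucket.*)
- OpenMP not found: install an OpenMP runtime (such as libgomp/libomp) and make sure CMake can find_package(OpenMP)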