feat: Add Resource Versioning and Reproducibility Tracking #27
Implement comprehensive reproducibility infrastructure for embedded resources:

**Rust Core:**
- Add resource metadata embedding (resources/metadata.json)
- Implement get_build_info() - Durak version, build date, package name
- Implement get_resource_info() - SHA256 checksums, versions, item counts
- Add serde/serde_json dependencies for metadata deserialization

**Python API:**
- Add durak.info module with reproducibility functions
  - get_build_info() - Query Durak build metadata
  - get_resource_info() - Query embedded resource versions/checksums
  - print_reproducibility_report() - Formatted reproducibility report
  - get_bibtex_citation() - Citation helper with exact versions
- Export all functions in __init__.py
- Update _durak_core.pyi type stubs

**Tooling:**
- Add scripts/generate_resource_metadata.py (auto-generates metadata.json)
  - Computes SHA256 checksums for all embedded resources
  - Counts items in each resource file
  - Tracks versions, sources, and last update dates

**Documentation:**
- resources/CHANGELOG.md - Resource version changelog
- docs/REPRODUCIBILITY.md - Complete reproducibility guide
- Examples for academic papers, model provenance, impact assessment

**Resources Tracked (v1.0.0):**
- stopwords_base (118 items)
- stopwords_social_media (11 items)
- detached_suffixes (14 items)
- apostrophes (2 rules)
- lemma_suffixes (30 suffixes)

**Benefits:**
- Research papers can cite exact resource versions
- Experiments can verify reproducibility via checksums
- Model provenance tracking for production systems
- Impact assessment when upgrading Durak versions

Closes #25
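The metadata generation step listed under Tooling (checksum plus item count per resource) can be sketched roughly as follows. This is not the contents of scripts/generate_resource_metadata.py, which this page does not show. The function names `describe_resource` and `build_metadata`, the `.txt` extension, and the "one item per non-empty line, `#` for comments" format are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def describe_resource(path: Path, version: str) -> dict:
    """Return checksum, item count, and version for one resource file."""
    data = path.read_bytes()
    # Assumption: one item per non-empty line; lines starting with '#' are comments.
    items = [
        line
        for line in data.decode("utf-8").splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return {
        "version": version,
        # Hash the raw bytes so the checksum does not depend on text decoding.
        "sha256": hashlib.sha256(data).hexdigest(),
        "item_count": len(items),
    }


def build_metadata(resource_dir: Path, version: str = "1.0.0") -> dict:
    """Collect metadata for every .txt file in resource_dir, keyed by file stem."""
    return {
        path.stem: describe_resource(path, version)
        for path in sorted(resource_dir.glob("*.txt"))
    }


if __name__ == "__main__":
    # Demo on a throwaway file, not on the real embedded resources.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "stopwords_demo.txt"
        demo.write_bytes(b"# demo list\nve\nile\n\nama\n")
        print(json.dumps(build_metadata(Path(tmp)), indent=2))
```

Hashing the raw bytes rather than the decoded text means any change to the file, including whitespace or line endings, produces a different checksum.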
@copilot can you check any logical issues that might arise with this PR? Report back and check the linting errors as well.
@fbkaragoz I've opened a new pull request, #126, to work on those changes. Once the pull request is ready, I'll request review from you.
@claude @copilot issues need to be replied to in here, after opening up a new branch.
@fbkaragoz I've opened a new pull request, #127, to work on those changes. Once the pull request is ready, I'll request review from you.
Restored missing ResourceMetadata and ResourceInfo structs and get_build_info/get_resource_info functions that were lost during merge. Fixed return types to properly handle mixed Python types.
Co-authored-by: fbkaragoz <59958216+fbkaragoz@users.noreply.github.com>
Replace deprecated pyo3::PyObject with Py<pyo3::types::PyAny> to comply with current PyO3 API.
Co-authored-by: fbkaragoz <59958216+fbkaragoz@users.noreply.github.com>
Replace env! with option_env! for CARGO_PKG_RUST_VERSION to handle cases where the rust-version field is not set in Cargo.toml, improving robustness.
Co-authored-by: fbkaragoz <59958216+fbkaragoz@users.noreply.github.com>
Fix compilation errors from merge conflict in resource versioning
👋 @fbkaragoz - Review complete!

✅ Overall Status
The PR provides a strong foundation, and the resource versioning concept is sound. However, a few fixes are needed:

🔧 Linting Issues
1. Unused import: `use serde::{Deserialize, Serialize};` (❌ `Serialize` is not used). Fix: `use serde::Deserialize;` (✅ import only what is used).
2. Double empty line between `use root_validator::RootValidator;` and the `// Embedded resources...` comment. Fix: reduce it to a single empty line.
Overview
Implements issue #25 - Resource versioning and reproducibility tracking for embedded resources.
What's Included
Core Implementation
- scripts/generate_resource_metadata.py
- python/durak/info.py

Documentation
- resources/CHANGELOG.md
- docs/REPRODUCIBILITY.md

Features
- get_build_info() - version, build date, Rust version
- get_resource_info() - checksums, item counts, versions
- print_reproducibility_report() - full audit trail
- get_bibtex_citation() - research-ready citations

Testing
Added comprehensive test suite covering:
Impact
Research Reproducibility
Researchers can now document exact preprocessing configuration:
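As a minimal sketch of how such a record could be turned into a citation: the helper below formats a BibTeX `@software` entry from two plain dictionaries. It is hypothetical and is not the implementation or output format of `get_bibtex_citation()`, which this page does not show. The name `format_citation`, the dictionary keys (`package`, `version`, `sha256`), and the demo values are assumptions.

```python
def format_citation(build_info: dict, resources: dict) -> str:
    """Return a BibTeX @software entry pinning tool and resource versions.

    Hypothetical helper: the key names used here are assumptions.
    """
    pinned = "; ".join(
        # Escape underscores so the note field compiles under LaTeX.
        f"{name.replace('_', chr(92) + '_')} v{info['version']} "
        f"sha256:{info['sha256'][:12]}"
        for name, info in sorted(resources.items())
    )
    lines = [
        "@software{" + build_info["package"] + ",",
        "  title   = {" + build_info["package"] + "},",
        "  version = {" + build_info["version"] + "},",
        "  note    = {Embedded resources: " + pinned + "},",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only, not real Durak metadata.
    build = {"package": "durak", "version": "0.0.0"}
    res = {"stopwords_base": {"version": "1.0.0", "sha256": "0" * 64}}
    print(format_citation(build, res))
```

Truncating each checksum to 12 hex characters keeps the note readable; the full digest would still need to be stored elsewhere for an exact comparison.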
Production Traceability
Production systems can track resource provenance:
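One way to act on that provenance, sketched under assumptions: store the resource checksums at training or deployment time, then compare them with the checksums reported by the currently installed version. The function `find_drift` and the dictionary shape (`{name: {"sha256": ...}}`) are hypothetical and only mirror the kind of data `get_resource_info()` is described as returning.

```python
import hashlib


def find_drift(recorded: dict, current: dict) -> list:
    """Return names of resources whose checksum changed, appeared, or vanished."""
    names = sorted(set(recorded) | set(current))
    return [
        name
        for name in names
        # A resource missing on either side yields None and counts as drift.
        if recorded.get(name, {}).get("sha256") != current.get(name, {}).get("sha256")
    ]


if __name__ == "__main__":
    # Demo checksums computed from throwaway bytes, not from real resources.
    old = hashlib.sha256(b"ve\nile\n").hexdigest()
    new = hashlib.sha256(b"ve\nile\nama\n").hexdigest()
    recorded = {"stopwords_demo": {"sha256": old}, "suffixes_demo": {"sha256": old}}
    current = {"stopwords_demo": {"sha256": new}, "suffixes_demo": {"sha256": old}}
    print(find_drift(recorded, current))  # ['stopwords_demo']
```

An empty result means every recorded resource is byte-identical to the installed one; any listed name marks a resource to review before trusting old results.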
Closes
Closes #25
Checklist