feat: Add Resource Versioning and Reproducibility Tracking #27

Merged
fbkaragoz merged 9 commits into main from feature/25-resource-versioning
Jan 31, 2026

Conversation

@ada-cinar

Member

Overview

Implements issue #25 - Resource versioning and reproducibility tracking for embedded resources.

What's Included

Core Implementation

  • ✅ Resource metadata generation script (scripts/generate_resource_metadata.py)
  • ✅ Rust API for build info and resource metadata
  • ✅ Python wrapper module (python/durak/info.py)
  • ✅ Comprehensive test suite (329 tests)

Documentation

  • ✅ Resource changelog (resources/CHANGELOG.md)
  • ✅ Reproducibility guide (docs/REPRODUCIBILITY.md)
  • ✅ Type stubs updated

Features

  • Build Info API: get_build_info() - version, build date, Rust version
  • Resource Info API: get_resource_info() - checksums, item counts, versions
  • Reproducibility Report: print_reproducibility_report() - full audit trail
  • BibTeX Citation: get_bibtex_citation() - research-ready citations
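
As a rough illustration of what the checksum and item-count fields could contain, the sketch below computes a SHA256 digest and a line-based item count for one resource file. The helper name `describe_resource`, the one-item-per-line format, and the `#` comment convention are assumptions made for illustration; the actual logic lives in scripts/generate_resource_metadata.py and may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_resource(path: Path) -> dict:
    """Return a SHA256 checksum and item count for one resource file.

    Assumes one item per line; blank lines and lines starting with '#'
    are not counted (a hypothetical convention, not confirmed by the PR).
    """
    raw = path.read_bytes()
    lines = raw.decode("utf-8").splitlines()
    items = [ln for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]
    return {
        "checksum": hashlib.sha256(raw).hexdigest(),
        "item_count": len(items),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Throwaway sample file standing in for an embedded resource.
        sample = Path(tmp) / "stopwords_sample.txt"
        sample.write_bytes("# sample list\nve\nile\n\nama\n".encode("utf-8"))
        print(describe_resource(sample))
```

Hashing the raw bytes rather than the decoded text keeps the digest sensitive to encoding and line-ending changes, which is usually what a reproducibility check wants.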

Testing

Added comprehensive test suite covering:

Impact

Research Reproducibility

Researchers can now document exact preprocessing configuration:

from durak import print_reproducibility_report, get_bibtex_citation

# Document versions in experiments
print_reproducibility_report()

# Get citation with exact versions
print(get_bibtex_citation())

Production Traceability

Production systems can track resource provenance:

from durak import get_resource_info

# expected_checksum: SHA256 digest recorded when the pipeline was last validated
resources = get_resource_info()
assert resources['stopwords_base']['checksum'] == expected_checksum
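
A slightly fuller version of this check might compare every tracked resource against digests recorded earlier and report which ones drifted. The sketch below takes the mapping as a plain argument so that it runs without Durak installed; `find_checksum_mismatches` and the sample digests are hypothetical, and only the `checksum` key is taken from the example above.

```python
def find_checksum_mismatches(resources: dict, expected: dict) -> list:
    """Return sorted names of resources whose checksum is missing or differs."""
    mismatched = []
    for name, want in expected.items():
        got = resources.get(name, {}).get("checksum")
        if got != want:
            mismatched.append(name)
    return sorted(mismatched)


if __name__ == "__main__":
    # Stand-in for the dict returned by get_resource_info(); digests are placeholders.
    current = {
        "stopwords_base": {"checksum": "aaa"},
        "apostrophes": {"checksum": "bbb"},
    }
    # Digests recorded when the pipeline was last validated (placeholders).
    recorded = {"stopwords_base": "aaa", "apostrophes": "ccc"}
    print(find_checksum_mismatches(current, recorded))
```

Returning the list of mismatched names, instead of asserting on the first failure, lets a deployment log every drifted resource in one pass.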

Closes

Closes #25

Checklist

  • Implementation complete
  • Tests added (329 test cases)
  • Documentation updated
  • Type stubs updated
  • Resource changelog created
  • Reproducibility guide added
  • CI passing (will check after push)

ada-cinar and others added 3 commits January 26, 2026 11:34
Implement comprehensive reproducibility infrastructure for embedded resources:

**Rust Core:**
- Add resource metadata embedding (resources/metadata.json)
- Implement get_build_info() - Durak version, build date, package name
- Implement get_resource_info() - SHA256 checksums, versions, item counts
- Add serde/serde_json dependencies for metadata deserialization

**Python API:**
- Add durak.info module with reproducibility functions
- get_build_info() - Query Durak build metadata
- get_resource_info() - Query embedded resource versions/checksums
- print_reproducibility_report() - Formatted reproducibility report
- get_bibtex_citation() - Citation helper with exact versions
- Export all functions in __init__.py
- Update _durak_core.pyi type stubs

**Tooling:**
- Add scripts/generate_resource_metadata.py (auto-generates metadata.json)
- Computes SHA256 checksums for all embedded resources
- Counts items in each resource file
- Tracks versions, sources, and last update dates

**Documentation:**
- resources/CHANGELOG.md - Resource version changelog
- docs/REPRODUCIBILITY.md - Complete reproducibility guide
- Examples for academic papers, model provenance, impact assessment

**Resources Tracked (v1.0.0):**
- stopwords_base (118 items)
- stopwords_social_media (11 items)
- detached_suffixes (14 items)
- apostrophes (2 rules)
- lemma_suffixes (30 suffixes)

Benefits:
- Research papers can cite exact resource versions
- Experiments can verify reproducibility via checksums
- Model provenance tracking for production systems
- Impact assessment when upgrading Durak versions

Closes #25
- Test get_build_info() and get_resource_info() APIs
- Verify checksums and item counts
- Test reproducibility report and BibTeX citation
- Validate semantic versioning
- Check file existence and completeness
- Meta-test for issue #25 checklist completion

Addresses #25
@fbkaragoz

Member

@copilot can you check for any logical issues that might arise with this PR?

Report back and check the linting errors as well.

Copilot AI commented Jan 28, 2026

Contributor

@fbkaragoz I've opened a new pull request, #126, to work on those changes. Once the pull request is ready, I'll request review from you.

@fbkaragoz

Member

> @fbkaragoz I've opened a new pull request, #126, to work on those changes. Once the pull request is ready, I'll request review from you.

Issues need to be replied to here, @claude @copilot, after opening a new branch.

Copilot AI commented Jan 28, 2026

Contributor

@fbkaragoz I've opened a new pull request, #127, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 5 commits January 28, 2026 09:54
Restored missing ResourceMetadata and ResourceInfo structs and get_build_info/get_resource_info functions that were lost during merge. Fixed return types to properly handle mixed Python types.

Co-authored-by: fbkaragoz <59958216+fbkaragoz@users.noreply.github.com>
Replace deprecated pyo3::PyObject with Py<pyo3::types::PyAny> to comply with current PyO3 API

Co-authored-by: fbkaragoz <59958216+fbkaragoz@users.noreply.github.com>
Replace env! with option_env! for CARGO_PKG_RUST_VERSION to handle cases where rust-version field is not set in Cargo.toml, improving robustness

Co-authored-by: fbkaragoz <59958216+fbkaragoz@users.noreply.github.com>
Fix compilation errors from merge conflict in resource versioning
@fbkaragoz fbkaragoz merged commit b5489d1 into main Jan 31, 2026
0 of 4 checks passed
@ada-cinar

Member Author

👋 @fbkaragoz - Review complete!

✅ Overall Status

The PR provides a strong foundation and the resource versioning concept is sound. However, a few fixes are needed:


🔧 Linting Issues

1. Unused Import

use serde::{Deserialize, Serialize};  // ❌ Serialize is unused

Fix:

use serde::Deserialize;  // ✅ Import only what is used

2. Double Empty Line

There is an extra blank line at lines 11-12 of src/lib.rs:

use root_validator::RootValidator;


// Embedded resources...

Fix: Reduce to a single blank line.


⚠️ Logical Issues

1. Panic Risk in get_resource_info()

let metadata: ResourceMetadata = serde_json::from_str(RESOURCE_METADATA)
    .expect("Failed to parse embedded resource metadata");  // ❌ Panic!

Problem: If the metadata cannot be parsed, the binary panics.

Fix: Return a PyResult and propagate the error:

fn get_resource_info(py: Python) -> PyResult<HashMap<String, Py<pyo3::types::PyAny>>> {
    let metadata: ResourceMetadata = serde_json::from_str(RESOURCE_METADATA)
        .map_err(|e| PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(
            format!("Failed to parse resource metadata: {}", e)
        ))?;
    
    // ... rest
}
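
Seen from Python, the effect of this change is that malformed metadata surfaces as a `RuntimeError` that callers can catch and handle. The pure-Python analogue below only illustrates the pattern; `parse_resource_metadata` is a hypothetical stand-in and not part of Durak's API.

```python
import json


def parse_resource_metadata(raw: str) -> dict:
    """Parse metadata JSON, converting parse failures into RuntimeError."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Mirrors the map_err(...) call: wrap the low-level error with context.
        raise RuntimeError(f"Failed to parse resource metadata: {exc}") from exc


if __name__ == "__main__":
    try:
        parse_resource_metadata("{not valid json")
    except RuntimeError as err:
        # The caller decides how to recover instead of the process failing hard.
        print(f"recoverable: {err}")
```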

2. Build-time Environment Variable Handling

let rust_version = option_env!("CARGO_PKG_RUST_VERSION").unwrap_or("1.70");

Problem: CARGO_PKG_RUST_VERSION may not always be set. The fallback is correct, but this variable should be set in build.rs.

Suggestion: add a build.rs:

// build.rs
// Note: requires chrono under [build-dependencies] in Cargo.toml.

fn main() {
    // Capture build timestamp
    println!("cargo:rustc-env=BUILD_TIMESTAMP={}",
        chrono::Utc::now().to_rfc3339());

    // Rust version already available via CARGO_PKG_RUST_VERSION
    println!("cargo:rerun-if-changed=build.rs");
}

Then use it directly in get_build_info():

info.insert("build_date".to_string(), env!("BUILD_TIMESTAMP").to_string());

📋 About the Copilot Issue

It makes more sense for Copilot to comment here instead of opening a separate PR (#127). You're right! If the fixes are made in this PR, there will be a single merge process.


🎯 Next Steps

  1. ✅ Remove the unused import
  2. ✅ Linting fixes (blank lines, comments)
  3. ⚠️ expect() → PyResult pattern
  4. 🔧 Add build.rs (optional but ideal)
  5. ✅ Test coverage is already strong (329 tests!)

The PR is close to merge-ready and will be solid after these small fixes! 🚀

Development

Successfully merging this pull request may close these issues.

[Enhancement] Add Resource Versioning and Reproducibility Tracking

3 participants