Skip to content

Harden tier bucketing and PDF fallback text#2

Open
Moshbbab wants to merge 1 commit intotrunkfrom
codex/generate-pdf-report-for-valuation-mbq70y
Open

Harden tier bucketing and PDF fallback text#2
Moshbbab wants to merge 1 commit intotrunkfrom
codex/generate-pdf-report-for-valuation-mbq70y

Conversation

@Moshbbab
Copy link
Owner

@Moshbbab Moshbbab commented Feb 2, 2026

Motivation

  • Make district-tier bucketing robust when quantile edges are non-unique or distributions are small to avoid qcut failures during feature engineering.
  • Ensure PDF report generation does not fail on systems without Arabic TrueType fonts by providing safe Latin-1 fallbacks while preserving Arabic output when fonts are available.
  • Provide a reliable end-to-end CSV → model → PDF flow for automated testing and demos.

Description

  • Hardened district tiering by computing quantile bins with pd.qcut(..., duplicates='drop') and added an assign_tiers helper that adapts the number of labels to available bins and falls back to ranking when necessary; changes made in hemmah_pro_ivs_2025.py (location feature engineering).
  • Reworked PDF rendering to detect whether Arabic fonts (Amiri) were added and to use a safe_latin1 fallback plus a render helper so text written with FPDF will not raise UnicodeEncodeError when Arabic fonts are missing.
  • Kept bilingual (Arabic / English) fallback strings in report sections so generated reports remain readable when Arabic fonts are unavailable.
  • Minor robustness and logging additions to surface progress during data cleaning, model training and report generation.

Testing

  • Installed runtime dependencies and executed an end-to-end automated script that: generated a synthetic sample_real_estate.csv, loaded it with HemmahDataEngine, ran ivs_quality_check(), executed clean_and_engineer(), prepared modeling data, trained multiple models with HemmahMLEngine.train_multiple_models(), performed a predict() on a sample input, and generated a PDF via HemmahReportGenerator.generate_pdf(); this automated test completed successfully.
  • The final automated run produced these key outputs: total_records: 200, unique_pct: 100.0, predicted_price_per_sqm: 576.2103499146549, confidence_interval: {"lower": 489.77879742745665, "upper": 662.641902401853}, model_used: "Random Forest", r2_score: 0.43099558871266896, and generated test_hemmah_report.pdf.
  • During iteration there were failing runs (qcut duplicate-edge errors and a Latin-1 encoding error) which were fixed by the changes above; the final automated test passed without errors.

Codex Task

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant