🔧 Rewire distribution_generator to external/latest catalog URLs #22

Open
paarriagadap wants to merge 33 commits into main from paarriagadap/rewire-distribution-generator-external

@paarriagadap
Collaborator

Written by Claude Code — @paarriagadap at the wheel.

Summary

Rewires PabloArriagada/distribution_generator/distribution_generator.py
to load its inputs from the public external/poverty_inequality/latest/... catalog paths instead of
versioned garden/wb/.../world_bank_pip_legacy and garden/poverty_inequality/.../historical_poverty URLs.

  • The two historical thousand-bins URLs (thousand_bins_interpolated_ginis, thousand_bins_interpolated_ginis_all_lognormal)
    point at the new external/poverty_inequality/latest/historical_poverty step added in owid/etl#6160.
  • The modern thousand_bins URL switches to the existing external/poverty_inequality/latest/thousand_bins_distribution.
  • The PIP percentiles and main-indicators reads switch from the wide-flat legacy tables to the dimensional
    external/poverty_inequality/latest/world_bank_pip/{percentiles,complete_series} tables (also added in owid/etl#6160, "Add poverty/inequality external steps and region columns").
    A small filter/merge block in run() rebuilds the per-(country, year) flat shape that the existing plot
    functions expect. Underlying values are identical to the legacy tables when versions match — spot-checked
    World 2020: mean=19.83, median=8.2, top1_thr=157.20, decile9_thr=50.0, headcount_ratio_3000=82.98.
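
The filter/merge step described in the last bullet could look roughly like the sketch below. It is not the code in run(): the column names (`ppp_version`, `poverty_line`, `headcount_ratio`), the helper `flatten_complete_series`, and the dummy 2017 row are assumptions made for illustration. Only the World 2020 values come from the spot-check above.

```python
import pandas as pd

# Hypothetical long-format stand-in for the dimensional complete_series table.
# Column names are assumptions; the real schema is not shown in this PR.
df_complete = pd.DataFrame(
    [
        {"country": "World", "year": 2020, "ppp_version": 2021, "poverty_line": None,
         "mean": 19.83, "median": 8.2, "headcount_ratio": None},
        {"country": "World", "year": 2020, "ppp_version": 2021, "poverty_line": "3000",
         "mean": None, "median": None, "headcount_ratio": 82.98},
        # Dummy row for another PPP version, only there to be filtered out.
        {"country": "World", "year": 2020, "ppp_version": 2017, "poverty_line": None,
         "mean": 1.0, "median": 1.0, "headcount_ratio": None},
    ]
)


def flatten_complete_series(df, ppp_version=2021, poverty_line="3000"):
    """Rebuild one flat row per (country, year) from the long table."""
    df = df[df["ppp_version"] == ppp_version]

    # Indicators that do not depend on a poverty line.
    base = df[df["poverty_line"].isna()][["country", "year", "mean", "median"]]

    # Poverty-line-specific indicator, renamed to the legacy wide column name.
    headcount = df[df["poverty_line"] == poverty_line][
        ["country", "year", "headcount_ratio"]
    ].rename(columns={"headcount_ratio": f"headcount_ratio_{poverty_line}"})

    return base.merge(headcount, on=["country", "year"], how="left")


df_flat = flatten_complete_series(df_complete)
print(df_flat)
```

A left merge keeps every (country, year) row from the base indicators even when no headcount exists for the chosen poverty line.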

NATIONAL_LINES_URL (harmonized_national_poverty_lines) stays on garden/... because there's no external
mirror for it.

Net effect: three pinned version constants disappear (PIP_VERSION, THOUSAND_BINS_VERSION,
THOUSAND_BINS_HISTORICAL_VERSION); only NATIONAL_LINES_VERSION remains.

Test plan

  • Once owid/etl#6160 lands and the catalog refreshes, run:
    ```bash
    uv run python PabloArriagada/distribution_generator/distribution_generator.py
    ```
  • Confirm the generated SVGs match the previous run (spot-check the World pen parade and the stacked-by-region distribution — both exercise the rebuilt `df_main_indicators`).

paarriagadap and others added 30 commits May 21, 2026 11:49
- Switch THOUSAND_BINS_URL, THOUSAND_BINS_HISTORICAL_URL and
  THOUSAND_BINS_HISTORICAL__ALL_LOGNORMAL_URL to
  external/poverty_inequality/latest/* (versionless).
- Switch the PIP percentiles and main-indicators reads from the legacy
  wide-flat world_bank_pip_legacy tables to the dimensional
  external/poverty_inequality/latest/world_bank_pip/{percentiles,complete_series} tables, with a
  small filter/merge block in run() to rebuild the flat shape the plot
  code expects.
- Drop PIP_VERSION, THOUSAND_BINS_VERSION and THOUSAND_BINS_HISTORICAL_VERSION.

Underlying values are identical to the legacy tables when versions match
(spot-checked World 2020: mean=19.83, median=8.2, top1_thr=157.20,
decile9_thr=50.0, headcount_ratio_3000=82.98).

External historical_poverty and world_bank_pip URLs only become live once
owid/etl#6160 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New cut_percentile option on pen_parade (default 95): caps y above the
  cut so the curve plateaus at the top, leaving room for labels on the
  right. The p99 label moves to a top-of-chart annotation anchored at
  x=cut_percentile when the cut is below 99.
- Replace default $-amount y-tick labels with the reference-line labels
  themselves (IPL, World mean/median, p90, p99, country medians for
  Norway/US/Sweden/UK), with per-label collision handling.
- Country median rows pulled from complete_series, preferring consumption
  over income where both exist; pre-merged into df_main_indicators.
- New copy: "→ The richest 10% have an income of more than $X per {period}"
  and "↑ The richest 1% live on more than $X per {period}".
- Dollar formatting: 2 decimals for daily values, 0 for monthly/yearly.
- Pen-parade figure aspect 1:1.25 (1000x1250), right margin reserved
  for labels.
- Other pen_parade/disability_plots blocks temporarily skipped via
  triple-quoted strings to iterate on pen_parade only.
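
The dollar-formatting rule in this commit (2 decimals for daily values, 0 for monthly and yearly) can be sketched as below. The helper name `format_dollars` and the thousands separator are assumptions, not taken from the script.

```python
def format_dollars(value: float, period: str) -> str:
    """Format a dollar amount: 2 decimals for daily values, 0 otherwise."""
    decimals = 2 if period == "day" else 0
    # The thousands separator is an illustrative choice.
    return f"${value:,.{decimals}f}"


print(format_dollars(3, "day"))        # $3.00
print(format_dollars(289.4, "month"))  # $289
print(format_dollars(1234.6, "year"))  # $1,235
```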

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip the default so pen_parade plots the full distribution unchanged
unless the caller passes cut_percentile. The World/month example call
sets cut_percentile=95 explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Prepend a (x=0, y=0) row per hue group so each curve starts at the
  origin instead of at percentile 1.
- When cut_percentile is set, extend the plateau to p100 by appending a
  (x=100, y=y_at_cut) anchor after the y-clamp.
- Add a vertical white-to-transparent gradient band across the top of
  the chart (opaque at ~0.85*y_at_cut, clear at y_at_cut) so the line
  and fill dissolve into white instead of ending hard.
- Move the p99 annotation to the right-margin column at the top-right
  axes corner so it aligns with the other y-tick labels.
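
The first two bullets (origin anchor, y-clamp, plateau extended to p100) amount to a small reshaping of the curve's points. The sketch below is a simplified, single-curve version; the function name `shape_pen_parade` is hypothetical, and it assumes the cut percentile is one of the x values.

```python
def shape_pen_parade(percentiles, values, cut_percentile=None):
    """Return (xs, ys) with an origin anchor and an optional plateau."""
    xs, ys = list(percentiles), list(values)

    # Start the curve at the origin instead of at percentile 1.
    xs.insert(0, 0)
    ys.insert(0, 0.0)

    if cut_percentile is not None:
        # Assumes cut_percentile is present in xs (integer percentiles).
        y_at_cut = ys[xs.index(cut_percentile)]
        # Clamp everything above the cut so the curve plateaus.
        ys = [min(y, y_at_cut) for y in ys]
        # Extend the plateau to p100 after the clamp.
        if xs[-1] < 100:
            xs.append(100)
            ys.append(y_at_cut)

    return xs, ys


xs, ys = shape_pen_parade([1, 50, 95, 99], [1.0, 10.0, 40.0, 150.0], cut_percentile=95)
print(xs)  # [0, 1, 50, 95, 99, 100]
print(ys)  # [0.0, 1.0, 10.0, 40.0, 40.0, 40.0]
```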

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Make the fade piecewise: opaque white from the bottom of the band up
  through the plateau line at y_at_cut, fading to transparent only above
  it. This fully covers the faint plateau line that was peeking through
  the old linear gradient.
- Stop the gradient at x=99.75 so it doesn't bleed over the right-hand
  y-axis line at x=100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- When cut_percentile is in effect, draw the right-hand y-axis line with
  an explicit plot() up to y_at_cut * 1.10 (with clip_on=False) so it
  reaches slightly above the cap, hinting that data continues above.
- Bump the curve linewidth from the default to 2.5 for better presence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Flip the right-margin arrows from → to ← so they point into the chart.
- Drop the World mean reference and add two new reference lines at the
  equivalent of $900/month and $500/month, defined in monthly terms and
  rescaled by period_factor / month_factor so the same real-world
  amounts shift correctly between day/month/year periods.
- Centralize the reference-line color in a REFERENCE_LINE_COLOR constant
  (slate #6c7a89) so all axhlines pick it up.
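
The rescaling of the monthly reference lines can be illustrated as follows. The factor values in `PERIOD_FACTORS` are assumptions (the script's actual day/month/year factors are not shown here), and `rescale_monthly_line` is a hypothetical helper.

```python
# Assumed multipliers from a daily amount to each period; the real script
# may use different values (for example 365 / 12 for a month).
PERIOD_FACTORS = {"day": 1.0, "month": 30.0, "year": 365.0}


def rescale_monthly_line(amount_per_month: float, period: str) -> float:
    """Convert an amount defined per month into the chart's period."""
    period_factor = PERIOD_FACTORS[period]
    month_factor = PERIOD_FACTORS["month"]
    return amount_per_month * period_factor / month_factor


for period in ("day", "month", "year"):
    print(period, rescale_monthly_line(900, period))
```

Defining the lines once in monthly terms keeps a single source of truth, so the same real-world amount is drawn regardless of the period the chart uses.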

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Merge country-median pairs (currently Sweden+UK) whose medians are
  within 5% into a single averaged line; assert if a configured pair
  exceeds the tolerance so the chart author notices the divergence.
- If the merged country median is also within 5% of the world p90,
  fold both into one label re-using the existing p90 axhline, no
  duplicate line. Combined wording: "median income in {countries}, and
  the income above which the richest 10% of the world live".
- Display-name map renders "United States" → "the USA" and
  "United Kingdom" → "the UK" in the labels.
- Per-label anchor overrides position "median income in Sweden..." above
  its tick and "richest 10%..." below (no-op when both fragments are
  merged into one label).
- All reference lines dotted (linestyle=":"), drop the old high-income
  poverty-line block and "poorest 50%" narrative.
- Reword the IPL and World median labels to the same shape as the others:
  "← $X per {period}" / "← $X per {period} — the global median income".
- $900/month and $500/month reference lines defined in monthly terms and
  rescaled by period_factor / month_factor.
- Pen parade now 1000x1000 (square), wider right margin (right=0.55),
  wrap_width=36, curve linewidth=2.5, slate-blue reference color via a
  central REFERENCE_LINE_COLOR constant.
- Fade band on the top tail of the chart is piecewise (opaque through
  the plateau line, fades only above), with y_at_fade_floor at
  y_at_cut * 0.88 by default. Right-hand y-axis line extends to
  y_at_cut * 1.10 with clip_on=False, hinting at data continuing above.
- World pen parade call passes cut_percentile=95.
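
The median-merging rule from the first bullets of this commit can be sketched as below. How the 5% gap is measured (here, relative to the smaller value) is an assumption, the helper names are hypothetical, and the numbers are dummy values, not real medians.

```python
def within_tolerance(a: float, b: float, tolerance: float = 0.05) -> bool:
    """True when a and b differ by at most `tolerance`, relative to the smaller."""
    return abs(a - b) / min(a, b) <= tolerance


def merge_country_medians(medians: dict, pair: tuple, tolerance: float = 0.05) -> float:
    """Average a configured pair of medians, failing loudly if they diverge."""
    a, b = medians[pair[0]], medians[pair[1]]
    assert within_tolerance(a, b, tolerance), (
        f"{pair[0]} and {pair[1]} medians differ by more than {tolerance:.0%}"
    )
    return (a + b) / 2


# Dummy values for illustration only.
medians = {"Sweden": 60.0, "United Kingdom": 58.0}
merged = merge_country_medians(medians, ("Sweden", "United Kingdom"))
world_p90 = 60.5  # dummy value

# If the merged median is also close to the world p90, one label covers both.
fold_into_p90 = within_tolerance(merged, world_p90)
print(merged, fold_into_p90)  # 59.0 True
```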

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous setup positioned the labels via tick label offsets, but
matplotlib's tick-label renderer ignores transform offsets so the
per-line-count vertical nudge for multi-line wrapped labels never
actually applied (the ← arrow was misaligned with the tick).

Hide the default tick labels (set_yticklabels with empty strings) and
place each label manually with ax.annotate, using xytext in offset
points so the per-n offset is honored. Default labels: va="center" with
offset -(n-1)*line_height/2 so the first line lines up with the tick
for both single- and multi-line wrapped labels. The above/below
overrides (Sweden+UK, richest 10%) use va="bottom" / va="top" with no
offset. Collision detection is disabled for now.
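
A minimal sketch of the manual placement described above, assuming a line height of 12 points and a hypothetical helper `place_tick_label`. The label texts are placeholders. It hides the default tick labels and uses `ax.annotate` with `textcoords="offset points"` so the vertical nudge is applied.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

LINE_HEIGHT_PT = 12.0  # assumed line height in points


def place_tick_label(ax, y, text, anchor="center"):
    """Place a (possibly multi-line) label to the right of the axes at data y."""
    n = text.count("\n") + 1
    if anchor == "center":
        # Shift down so the first line, not the block centre, sits on the tick.
        va, dy = "center", -(n - 1) * LINE_HEIGHT_PT / 2
    elif anchor == "above":
        va, dy = "bottom", 0.0
    else:  # "below"
        va, dy = "top", 0.0
    return ax.annotate(
        text,
        xy=(1.0, y),
        xycoords=("axes fraction", "data"),  # x in axes fraction, y in data
        xytext=(6, dy),
        textcoords="offset points",
        ha="left",
        va=va,
        annotation_clip=False,
    )


fig, ax = plt.subplots()
ax.set_ylim(0, 100)
ax.set_yticks([10, 50])
ax.set_yticklabels(["", ""])  # hide the default tick labels
one_line = place_tick_label(ax, 10, "← placeholder label")
two_lines = place_tick_label(ax, 50, "← placeholder label\nsecond line")
plt.close(fig)
```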

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a small red square bracket above IPL, the global median, and the
high-income poverty line, spanning from x=0 out to where the curve
crosses that y value. Marks the share of the world earning less than
each threshold. Bracket color matches the deep-red from
sns.color_palette("deep")[3]; height ~1.2% of the visible y-range.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce an axhline_over_curve helper that draws the dotted reference
line only from where the curve crosses that y value out to the right
edge (xmin = x_crossing / 100 in axes fraction). Apply it to all the
reference lines (IPL, $900, $500, World median, p90, p99, country
medians, merged Sweden+UK groups). The white space above the curve —
where the red square brackets live — now stays clean of dotted lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three labeled annotations above the IPL, global median, and high-
income poverty brackets, each in 2-3 lines via VPacker / TextArea /
AnnotationBbox so the bold title sits above the regular-weight body
and both share a right edge anchored at the bracket's right tip.

- Poverty: $900-bracket gets the World headcount share at $900/month
  (headcount_ratio_3000 pulled from complete_series and merged into
  df_main_indicators).
- Deep poverty: global-median bracket gets the "poorer half of the
  world — 4 billion people — live on less than \$X per period" copy in
  three lines.
- Extreme poverty: IPL bracket gets the World headcount share at the
  IPL ($3/day → poverty_line "300", new df_pov_ipl merge) in three
  lines, right-aligned at the bracket end.

Also add PPP_VERSION = 2021 constant and route df_complete filters
through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Shade the area under the curve below each poverty line in the same
  blue as the main fill (alpha 0.3, stacked) so the IPL / median /
  high-income bands deepen progressively.
- Annotation copy: spaced em dashes used consistently; "Deep poverty"
  reads "The poorer half of the world population — 4 billion people —
  live on less than \$X per period".
- Right-align the headers ("Poverty", "Deep poverty", "Extreme
  poverty") via ha="right"/multialignment="right" so each block has a
  single clean right edge across the bold title and the body lines.
- Anchor the AnnotationBbox's right edge at the bracket's left edge
  (box_alignment=(1.0, 0.0)) with xybox=(-4, 0) so the label's right
  edge sits just before the bracket — text → bracket flow without an
  external gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Guard the (x=0, y=0) per-hue anchor row behind `not log_scale` so
  log-scaled pen parades (Chile/Peru/Uruguay) actually render again —
  log(0) = -inf would otherwise blank the line geometry.
- Split the Poverty annotation into 3 lines (bold title, then
  "X% of the world population" + "live on less than $Y per period")
  to match the Deep/Extreme poverty annotations' shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- styled_annotation now wraps the body text at wrap_width=32 chars via
  textwrap.fill, so callers don't need to embed \n breaks manually.
- All annotation boxes use box_alignment=(0.0, 0.0) with a uniform
  xybox=(-130, 0) leftward offset, so every block's LEFT edge sits at
  the same x position regardless of line count or text length.
- Drop the manual line splits from the three annotation call sites
  (text passed as one logical sentence each).
- Black-format the file.
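
The wrapping behaviour in the first bullet relies on the standard-library `textwrap.fill`. The sketch below shows only that call; `wrap_body` is a hypothetical stand-in for the relevant part of styled_annotation, and the sentence is a placeholder.

```python
import textwrap


def wrap_body(text: str, wrap_width: int = 32) -> str:
    """Wrap one logical sentence into lines of at most wrap_width characters."""
    return textwrap.fill(text, width=wrap_width)


body = "The poorer half of the world population live on less than $X per period"
wrapped = wrap_body(body)
print(wrapped)
```

Callers pass one sentence and the helper decides the line breaks, which keeps the call sites free of hard-coded newlines.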

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nd fills

Replaces the integer-percentile snap (above[x].min()) with a shared
interpolate_x_at_y helper, so red bracket end-caps and dotted reference
lines meet the curve exactly. Adds interpolate=True to KDE area fills.
Use raw (unrounded) reference values for positioning so the world median
bracket lands at p=50 and matches the rounded $289 shown in labels. Build
each poverty-band fill polygon with the exact crossing point appended,
because matplotlib's interpolate=True doesn't apply when y1=0 is constant.
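
The difference between the integer-percentile snap and an interpolated crossing can be shown with a small pure-Python sketch. The real interpolate_x_at_y helper is not shown in this PR, so the signature and the linear interpolation below are assumptions. The sketch also assumes a non-decreasing curve.

```python
def interpolate_x_at_y(xs, ys, y_target):
    """Linearly interpolate the x where a non-decreasing curve reaches y_target."""
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if y0 <= y_target <= y1:
            if y1 == y0:
                return float(x0)
            return x0 + (x1 - x0) * (y_target - y0) / (y1 - y0)
    return None  # the curve never reaches y_target


xs = [0, 50, 100]
ys = [0.0, 10.0, 30.0]

snapped = min(x for x, y in zip(xs, ys) if y >= 5.0)  # old behaviour: 50
exact = interpolate_x_at_y(xs, ys, 5.0)               # interpolated: 25.0
print(snapped, exact)
```

The interpolated x can then feed both the bracket end-cap and the reference line's start, so the two meet the curve at the same point.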
…ti-year plots

- Replace explicit x_axis_range tuple with share_x_axis: bool that auto-computes
  the union x-range across years in a pre-pass.
- Add share_y_axis: bool that locks every per-year SVG to the same peak density.
- Add row_by="year" to distributional_plots_per_row for one-country / multi-year
  stacked layouts.
- Add filename_suffix to distinguish the Sweden lognormal output set.

Previously share_x_axis set xlim = union of data extents, which clipped the
KDE taper on the right. Now extend each year's bounds by cut*bw in log10
space (matching seaborn's default cut=3 and Scott's bandwidth) before taking
the union, so the axis right edge lands exactly where the curve tapers to
near-zero. Tick filter still caps visible labels at the largest log tick
inside that range — for Sweden, axis runs to $263 but the last label is $200.
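
The extension by cut*bw can be sketched as follows, assuming unweighted samples and Scott's rule for a one-dimensional KDE (bandwidth = sample standard deviation × n^(-1/5)). The helper names and the input values are hypothetical; the script's own pre-pass may differ, for example if the data are weighted.

```python
import math
import statistics


def kde_log_extent(values, cut=3.0):
    """Return (lo, hi) in log10 space where a Gaussian KDE tapers off."""
    logs = [math.log10(v) for v in values]
    # Scott's rule for one dimension, unweighted samples (assumption).
    bw = statistics.stdev(logs) * len(logs) ** (-1 / 5)
    return min(logs) - cut * bw, max(logs) + cut * bw


def shared_xlim(values_by_year, cut=3.0):
    """Union of the per-year KDE extents, converted back to dollar units."""
    extents = [kde_log_extent(v, cut) for v in values_by_year.values()]
    lo = min(e[0] for e in extents)
    hi = max(e[1] for e in extents)
    return 10**lo, 10**hi


# Dummy incomes for two hypothetical years.
values_by_year = {"year_a": [1.0, 10.0, 100.0], "year_b": [2.0, 5.0, 20.0]}
xlim = shared_xlim(values_by_year)
print(xlim)
```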
…SVGs

When share_x_axis or share_y_axis is on, switch from bbox_inches="tight"
(which varies the SVG bounding box per figure based on visible content) to
fixed subplots_adjust margins, so per-year SVGs can be stacked pixel-aligned.
…abels to integers

Removes the add_world_mean parameter, the world_mean_year computation, and
the line/label/area branches from distributional_plots, distributional_plots_per_row,
and pen_parade (where it was computed but never displayed). Also rounds
reference line labels via dollar_decimals (2 for day, 0 for month/year) to
match the pen parade style.
… IPL/high-income/world-median controls

- Add `add_high_income_pl`, `add_world_median` (per-year via df_main_indicators)
  and `filename_suffix` to distributional_plots_per_row.
- Forward fill, add_ipl, add_multiple_lines_day to the year-rows helper, and
  swap the axvline loop for draw_area_under_curve so reference thresholds become
  shaded regions.
- Draw constant-x reference lines (IPL, high-income, world-median when shared)
  with a figure-spanning Line2D in a blended transform so the line is continuous
  across stacked subplots (the per-axes axvline left a visible gap between rows).
- Share y across rows in distributional_plots_per_row(row_by="country").
- Apply share_x_axis-style range, KDE clip, set_xlim, and tick filtering inside
  the year-rows helper.

Pick between hanging the rotated reference-line label inside axes[0] from
its top edge (when the curve at the label's x leaves ≥50% headroom — e.g.
Sweden 1820 at the IPL) or floating it just above axes[0] in the figure
margin (when the curve fills axes[0] — e.g. Ethiopia at the IPL). Drops
the previous "always above figure margin" placement that felt too far
from the chart for low-density rows. Also: World mean kwarg removed,
reference-line defaults flipped to None, scipy imports cleaned up.
…s years

The per-year loop in distributional_plots was multiplying add_multiple_lines_day
by period_factor in place. On the second year the values were already in
period units and got multiplied again, pushing both threshold fills past the
data extent so they collapsed into the same polygon. Use a fresh local
`scaled_lines` per iteration. Also: revert add_ipl/add_world_median defaults
to "line" and set them explicitly to None for the Sweden lognormal separate
call so the line stays out of those charts.
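
The in-place scaling bug and its fix can be reproduced in isolation. This is a simplified illustration, not the actual loop in distributional_plots: the function names, the threshold values, and the years are placeholders.

```python
def scale_in_place(lines_day, period_factor, years):
    """Buggy pattern: the caller's list is rescaled again on every year."""
    seen = []
    for _ in years:
        for i in range(len(lines_day)):
            lines_day[i] *= period_factor
        seen.append(list(lines_day))
    return seen


def scale_fresh_copy(lines_day, period_factor, years):
    """Fixed pattern: build a fresh local scaled_lines per iteration."""
    seen = []
    for _ in years:
        scaled_lines = [value * period_factor for value in lines_day]
        seen.append(scaled_lines)
    return seen


years = ["first year", "second year"]
buggy = scale_in_place([3.0, 30.0], 30, years)
fixed = scale_fresh_copy([3.0, 30.0], 30, years)
print(buggy)  # [[90.0, 900.0], [2700.0, 27000.0]]
print(fixed)  # [[90.0, 900.0], [90.0, 900.0]]
```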