GEDI L2A + Sentinel-2 → XGBoost — an agentic data pipeline that ingests NASA lidar shots, extracts cloud-masked Sentinel-2 spectral features, validates spatial CV readiness, and prepares training-ready data for canopy height prediction.
Forest canopy height is a key metric for carbon stock estimation, wildfire risk, and biodiversity monitoring. This pipeline fuses two NASA/ESA datasets — GEDI L2A lidar shots (ground-truth canopy height at 25m footprint) and Sentinel-2 multispectral imagery — into a spatially cross-validated training dataset ready for XGBoost regression.
The entire ingest → transform → QA workflow is driven by a multi-agent LLM system that self-diagnoses failures, replans parameters, and aborts gracefully when data conditions are unresolvable.
The Orchestrator is the single entry point. It calls sub-agents as pydantic-ai tools, inspects the passed and recommended_action fields of their typed output, and applies adaptive replanning (replan_widen_date, replan_relax_thresholds) before retrying, up to max_replans times. All intermediate state is persisted to SQLite so any stage can be inspected or replayed without re-running expensive GEE queries.
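The retry-and-replan behaviour described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual orchestrator: `Params`, `Decision`, `replan`, `run_with_replans`, the 30-day widening step, and the sensitivity values are all hypothetical. Only the action names `replan_widen_date` and `replan_relax_thresholds` and the `max_replans` limit come from the description.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Params:
    """Hypothetical run parameters the orchestrator is allowed to adjust."""
    date_start: date
    date_end: date
    min_sensitivity: float = 0.95  # assumed default, for illustration only


@dataclass
class Decision:
    """Stand-in for a typed agent output (passed / recommended_action)."""
    passed: bool
    recommended_action: str = "none"
    rationale: str = ""


def replan(params: Params, action: str) -> Params:
    """Return adjusted parameters for a known action; raise on anything else."""
    if action == "replan_widen_date":
        pad = timedelta(days=30)  # assumed policy: widen 30 days on each side
        return replace(params,
                       date_start=params.date_start - pad,
                       date_end=params.date_end + pad)
    if action == "replan_relax_thresholds":
        # Assumed policy: lower the cutoff by 0.05, never below 0.90.
        relaxed = max(0.90, params.min_sensitivity - 0.05)
        return replace(params, min_sensitivity=relaxed)
    raise ValueError(f"no replan strategy for action: {action}")


def run_with_replans(
    stage: Callable[[Params], Decision],
    params: Params,
    max_replans: int = 2,
) -> Tuple[Decision, Params, List[Decision]]:
    """Run a stage, replanning on failure until it passes or the budget is spent."""
    log: List[Decision] = []
    for attempt in range(max_replans + 1):
        decision = stage(params)
        log.append(decision)
        if decision.passed or attempt == max_replans:
            break
        try:
            params = replan(params, decision.recommended_action)
        except ValueError:
            break  # unresolvable: stop and report the last failing decision
    return log[-1], params, log
```

Returning the full decision log alongside the final decision mirrors the idea that every attempt stays inspectable after the run.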
| Agent | Responsibility | Key Tools |
|---|---|---|
| Orchestrator | Drive the pipeline, replan on failure, abort when unresolvable | run_ingestor, run_transformer, run_qa, replan, abort |
| Ingestor | Query GEDI shots + S2 scene availability; filter by quality flags, sensitivity, slope | query_gedi_earthengine, query_sentinel2, apply_quality_filters, write_raw_shots_to_db |
| Transformer | Sample S2 bands at shot locations; estimate spatial autocorrelation; assign spatial CV folds | extract_sentinel_bands, compute_variogram, generate_spatial_blocks, assign_folds |
| QA | Validate feature completeness, EVI/NDVI range, fold balance before training | check_feature_distributions, check_fold_balance, check_target_range |
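To make the QA row concrete, here is a small sketch of the kind of checks it names: vegetation index ranges and fold balance. The NDVI and EVI formulas are the standard ones (for Sentinel-2, NIR is B8, red is B4, blue is B2, with reflectance scaled to 0..1). The function names, the `[-1, 1]` bounds applied to both indices, and the `max_ratio` threshold are assumptions for illustration, not the project's actual checks.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index; NaN when the denominator is zero."""
    denom = nir + red
    return (nir - red) / denom if denom else float("nan")


def evi(nir: float, red: float, blue: float) -> float:
    """Enhanced Vegetation Index with the standard coefficients (reflectance in 0..1)."""
    denom = nir + 6.0 * red - 7.5 * blue + 1.0
    return 2.5 * (nir - red) / denom if denom else float("nan")


def out_of_range(values: Sequence[float], lo: float = -1.0, hi: float = 1.0) -> List[int]:
    """Return positions of values that are NaN or fall outside [lo, hi]."""
    # A NaN fails both comparisons, so it is flagged as well.
    return [i for i, v in enumerate(values) if not (lo <= v <= hi)]


def fold_balance(fold_ids: Sequence[Hashable], max_ratio: float = 3.0) -> Tuple[bool, Dict[Hashable, int]]:
    """Pass if the largest fold is at most max_ratio times the smallest (assumed rule)."""
    counts = Counter(fold_ids)
    ok = max(counts.values()) / min(counts.values()) <= max_ratio
    return ok, dict(counts)
```

A QA agent could feed the flagged positions and fold counts into its rationale field so a failing decision explains which rows or folds caused it.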
Each agent returns a typed Pydantic output model (IngestorDecision, TransformerDecision, etc.) with passed, rationale, recommended_action, and warnings fields — the LLM cannot return a structurally invalid decision. GEE tools catch all exceptions and return {"error": "..."} dicts so the orchestrator receives a structured failure signal rather than triggering an unintended LLM retry.
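The error-dict pattern for tools can be illustrated with a small decorator. This is a sketch under assumptions: `safe_tool` and `query_gedi_stub` are hypothetical names, the stub stands in for a real Earth Engine call, and the exact wording of the error string is invented. Only the `{"error": "..."}` shape comes from the description above.

```python
import functools
from typing import Any, Callable, Dict, Tuple


def safe_tool(fn: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
    """Wrap a tool so any exception becomes a structured {"error": ...} result."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # deliberately broad: the caller wants a signal, not a crash
            return {"error": f"{type(exc).__name__}: {exc}"}
    return wrapper


@safe_tool
def query_gedi_stub(bbox: Tuple[float, ...]) -> Dict[str, Any]:
    """Hypothetical stand-in for a GEE query; fails on a malformed bbox."""
    west, south, east, north = bbox  # raises ValueError if bbox has the wrong length
    if west >= east or south >= north:
        raise ValueError("bbox must be (west, south, east, north)")
    return {"n_shots": 0, "bbox": bbox}


def is_failure(result: Dict[str, Any]) -> bool:
    """The calling agent can branch on this instead of catching exceptions."""
    return "error" in result
```

Because the wrapper always returns a dict, the calling agent sees a normal tool result and can decide whether to replan or abort.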
GEDI shots colored by rh98 (canopy height in metres) over the Sierra Nevada, CA. Spatial CV folds are assigned using a variogram-estimated block size (≥1.5× autocorrelation range) so that training and validation sets are spatially decorrelated, a critical guard against optimistic CV scores in geospatial ML.
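As a rough illustration of block-based fold assignment, the sketch below derives a block size from an autocorrelation range and maps points to folds through square grid blocks. It assumes projected coordinates in metres, a regular grid anchored at the origin, and round-robin assignment of blocks to folds. None of these details are confirmed by the project, and `block_size_m` and `assign_folds` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]   # projected (x, y) in metres (assumption)
BlockId = Tuple[int, int]     # integer grid cell index


def block_size_m(variogram_range_m: float, factor: float = 1.5) -> float:
    """Block edge length as a multiple of the autocorrelation range."""
    return factor * variogram_range_m


def assign_folds(points: Sequence[Point], block_m: float, n_folds: int = 5) -> List[int]:
    """Assign each point the fold of the grid block it falls in."""
    block_ids: List[BlockId] = [
        (math.floor(x / block_m), math.floor(y / block_m)) for x, y in points
    ]
    # Sort the occupied blocks so the assignment is deterministic, then deal
    # them out to folds round-robin (an assumed, simple strategy).
    fold_of_block: Dict[BlockId, int] = {
        block: i % n_folds for i, block in enumerate(sorted(set(block_ids)))
    }
    return [fold_of_block[b] for b in block_ids]
```

Because the fold is decided per block rather than per shot, neighbouring shots inside one block never end up split between training and validation.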
| Tool | Role |
|---|---|
| pydantic-ai | Agent framework — typed deps, tool registration, structured LLM output |
| Google Earth Engine | GEDI L2A index queries + Sentinel-2 median composites, cloud masking via SCL |
| Logfire | Real-time observability — per-agent spans, tool-call events, LLM traces |
| SQLAlchemy + SQLite | ORM-backed persistence for raw shots, cleaned shots, agent decisions |
| pytest | 88 tests across unit (mocked tools), integration (live GEE), and runner layers |
| geopandas + contextily | Spatial DataFrames + tile basemaps for diagnostic figures |
| uv | Fast dependency management and virtual environments |

Run the pipeline:

    uv run python -m canopy_height_prediction.main \
      --bbox "-120.5,38.5,-119.5,39.5" \
      --date-start "2022-06-01" \
      --date-end "2022-09-01" \
      --output output/run.json \
      --save-figs output/

Example output:

    Run ID : run-9af5a9fc
    AOI : (-120.5, 38.5, -119.5, 39.5)
    Dates : 2022-06-01 → 2022-09-01
    ------------------------------------------------------------
    Ingestor : PASS (128369/177310 shots accepted)
    Transformer: PASS (block=28.75 km, folds=5)
    QA : PASS issues=[]
    Replans : 0

Run the tests:

    uv run pytest tests/ -v # 88 tests: unit + integration (live GEE)

Tests are layered: unit tests mock agent tools and assert decision logic; integration tests hit live GEE endpoints with small seeded datasets; runner tests verify PipelineState writeback for each agent.

- GEE extraction: `sampleRegions` batches 500 shots per request (payload limit) and runs batches concurrently via `ThreadPoolExecutor`. Shots are capped at 5,000 before extraction, which is sufficient for XGBoost without 100+ sequential GEE calls. Scale is 20 m with `tileScale=4`.
- Spatial CV block size: geometrically capped so the AOI always yields ≥ `n_folds+1` occupied blocks, preventing fold imbalance when the variogram range is large relative to the AOI.
- Replanning audit trail: every agent decision (rationale, action taken, parameters) is written to the `agent_decisions` table and the `PipelineState.decision_log`, making post-run debugging reproducible without re-running GEE.
- Logfire observability: `logfire.instrument_pydantic_ai()` auto-instruments all tool calls; `logfire.info()` events fire at agent entry so each stage is visible in real time rather than only on completion.
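
The capped, batched, concurrent extraction from the first bullet can be sketched like this. It is an illustration only: `sample_batch` is a hypothetical callable standing in for the per-batch `sampleRegions` request, truncating to the first `max_shots` items is an assumed capping strategy (the real pipeline may subsample differently), and `max_workers=4` is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], size: int) -> List[Sequence[T]]:
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def extract_in_batches(
    shots: Sequence[T],
    sample_batch: Callable[[Sequence[T]], List[R]],
    batch_size: int = 500,    # per-request payload limit from the note above
    max_shots: int = 5000,    # cap applied before extraction
    max_workers: int = 4,     # assumed concurrency level
) -> List[R]:
    """Cap the shots, then run one request per batch on a thread pool."""
    capped = shots[:max_shots]  # assumption: simple head truncation
    batches = chunked(capped, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order, so rows stay aligned
        # with the input shots even though requests run concurrently.
        per_batch = list(pool.map(sample_batch, batches))
    return [row for rows in per_batch for row in rows]
```

Threads fit here because each batch spends most of its time waiting on a remote request rather than doing local computation.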
The following components are planned but not yet implemented:
| Component | Description |
|---|---|
| modeling/train.py | XGBoost regression on spatially CV-validated folds; hyperparameter tuning via Optuna |
| modeling/predict.py | Inference on new AOIs using trained model + S2 feature extraction |
| modeling/evaluate.py | Per-fold RMSE / R² reporting; feature importance plots |
| Training agent | LLM-guided hyperparameter selection and early stopping decisions |
| Model registry | MLflow or similar for experiment tracking and model versioning |

