Binary file modified .DS_Store
Binary file not shown.
51 changes: 51 additions & 0 deletions notebooks/README.md
@@ -0,0 +1,51 @@
# Lamwo Composite Pipeline

This folder contains the Jupyter notebooks used to build monthly Sentinel-2 composites and related biomass statistics for Lamwo district. The notebooks are meant to be run in Google Colab with access to the Sunbird AI shared Google Drive and Google Earth Engine (EE).

## Prerequisites
- Google account with access to the `Sunbird AI/Projects` shared drive.
- Google Colab runtime (a high-RAM instance is recommended).
- Earth Engine account authorized for `ee-isekalala` (adjust if needed); a setup sketch follows this list.
- Python packages installed within the Colab session, including `pystac-client`, `rio-tiler`, `boto3`, `python-dateutil`, `geopandas`, `shapely`, `rasterio`, `earthengine-api`, `rasterstats`, and plotting libraries.
- Expected directory structure in Drive:
- `/content/drive/Shareddrives/Sunbird AI/Projects/suntrace/suntrace-multimodal/data` for outputs.
- Administrative boundaries in `/content/drive/Shareddrives/Sunbird AI/Projects/GIZ Mini-grid Identification/Phase II/Data/administrative areas/`.
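
A minimal Colab setup cell covering the Drive and EE prerequisites (a sketch; swap in your own project if you do not use `ee-isekalala`):

```python
# Mount the shared Drive and authenticate Earth Engine inside Colab.
import ee
from google.colab import drive

drive.mount("/content/drive")
ee.Authenticate()                      # standard EE auth prompt in Colab
ee.Initialize(project="ee-isekalala")  # adjust if you use a different project
```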

## Recommended Run Order
1. **`Pull S2 imagery for Lamwo_monthly composites.ipynb`**
   Creates a 1 km tile grid, pulls monthly Sentinel-2 L2A composites from the AWS STAC catalog, and saves per-tile `.npy` cubes plus the grid GeoJSON to Drive.

2. **`Precomputed Aggregates.ipynb`**
   Authenticates with EE, maps the tile geometries over several EE datasets (NDVI, elevation, climate, buildings) to compute per-tile statistics, and exports them as CSV/GeoJSON.

3. **`get_lwb.ipynb`**
Downloads Aboveground Live Woody Biomass rasters per tile from ArcGIS, clips them to the Lamwo AOI, and stores GeoTIFFs locally in the Colab session.

4. **`merge_lwb_tile_stats.ipynb`**
Loads the biomass rasters and EE tile statistics, computes total biomass per tile, and merges results into GeoJSON/CSV artifacts ready for downstream analysis.

## Inputs and Outputs (by step)
- `Pull S2 imagery …`
- Inputs: Lamwo AOI GeoPackage, Sentinel-2 STAC (`earth-search.aws.element84.com`).
  - Outputs: `grid.geojson`, monthly `.npy` files per tile/year in Drive (a loading sketch follows this list).
- `Precomputed Aggregates`
- Inputs: Tile grid (`grid.geojson`), EE datasets (S2_SR, SRTM, PAR, CHIRPS, VIIRS, building layers).
- Outputs: EE Drive export (CSV) or downloaded GeoJSON with tile-level statistics.
- `get_lwb`
- Inputs: Lamwo AOI, ArcGIS feature service `Aboveground_Live_Woody_Biomass_Density`.
- Outputs: Downloaded `.tif` rasters and clipped versions for intersecting tiles.
- `merge_lwb_tile_stats`
- Inputs: Biomass raster (`aglwb.tif` or clipped stack), tile grid GeoJSON, tile stats CSV.
- Outputs: Merged GeoJSON/CSV with biomass totals and calculated metrics per tile.
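
A quick sanity check on these artifacts, as a sketch (`tile_0001_2024.npy` follows the `tile_<id>_<year>.npy` pattern used by the download code; adjust names and Drive paths to your run):

```python
import geopandas as gpd
import numpy as np

# Tile grid produced in step 1.
grid = gpd.read_file("grid.geojson")
print(len(grid), "tiles,", grid.crs)

# One per-tile monthly cube; expect 12 months x the selected bands.
cube = np.load("tile_0001_2024.npy")
print(cube.shape)
```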

## Operational Tips
- Verify Drive paths at the top of each notebook; update to your Drive mount if different.
- When running EE exports, monitor the EE Tasks tab for completion before proceeding.
- `merge_lwb_tile_stats.ipynb` requires selecting the correct biomass raster unit (`Mg_per_pixel` vs `Mg_per_ha`); check the dataset metadata before merging (a conversion sketch follows this list).
- Large raster downloads may time out; re-run the `get_lwb` download cell for failed tiles.
- Keep secrets (API keys, service accounts) out of notebooks; rely on EE auth prompts in Colab.
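
For the unit tip above, a minimal conversion sketch (assumes an `Mg_per_ha` raster in a metric projected CRS such as UTM 36N; the path is a placeholder):

```python
import numpy as np
import rasterio

with rasterio.open("aglwb_clipped/tile.tif") as src:
    density = src.read(1).astype("float64")  # Mg per hectare
    # Pixel area in hectares from the affine transform (CRS units are metres).
    pixel_area_ha = abs(src.transform.a * src.transform.e) / 10_000.0
    nodata = src.nodata

valid = density != nodata if nodata is not None else np.isfinite(density)
total_mg = float(density[valid].sum() * pixel_area_ha)  # tile total in Mg
print(f"Tile total: {total_mg:.1f} Mg")
```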

## Troubleshooting Checklist
- Missing imports: re-run the initial install cells or add `!pip install` for any new packages.
- CRS mismatches: the merge notebook reprojects to UTM 36N (EPSG:32636); ensure custom rasters share this CRS or adjust the code (a reprojection sketch follows this list).
- Drive quota: exported artifacts land in `ee_exports` or the `data` folder—clean up old runs to stay within limits.
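
A reprojection sketch for the CRS note above, using rasterio's warp API (paths are placeholders; EPSG:32636 is UTM 36N):

```python
import rasterio
from rasterio.warp import Resampling, calculate_default_transform, reproject

with rasterio.open("custom.tif") as src:
    transform, width, height = calculate_default_transform(
        src.crs, "EPSG:32636", src.width, src.height, *src.bounds
    )
    meta = src.meta.copy()
    meta.update(crs="EPSG:32636", transform=transform, width=width, height=height)
    with rasterio.open("custom_32636.tif", "w", **meta) as dst:
        for band in range(1, src.count + 1):
            reproject(
                source=rasterio.band(src, band),
                destination=rasterio.band(dst, band),
                resampling=Resampling.nearest,  # nearest preserves categorical values
            )
```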
752 changes: 679 additions & 73 deletions notebooks/analyzer_playground.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion notebooks/merge_lwb_tile_stats.ipynb
@@ -2333,7 +2333,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
"version": "3.9.16"
}
},
"nbformat": 4,
11 changes: 11 additions & 0 deletions scripts/README.md
@@ -0,0 +1,11 @@
# Lamwo Data Pipeline Scripts

These CLI utilities mirror the Lamwo notebook workflow so teammates can refresh inputs without opening Colab. Update the placeholder paths/project IDs before running.

1. `generate_tile_grid.py` — Build a 1 km grid from an AOI GeoPackage. Outputs a GeoJSON grid (`lamwo_grid.geojson`).
2. `download_s2_monthly.py` — Download monthly Sentinel-2 composites per tile as `.npy` cubes using Earth Engine.
3. `export_tile_stats.py` — Start an Earth Engine export (Drive or direct download) with vegetation, terrain, climate, and building stats per tile.
4. `download_lwb_rasters.py` — Fetch Aboveground Live Woody Biomass rasters per tile from the ArcGIS FeatureServer and clip them to the AOI.
5. `merge_lwb_tile_stats.py` — Combine biomass totals with Earth Engine tile statistics and save merged GeoJSON/CSV outputs.

Typical run order follows the notebook sequence: generate the grid → (optional) fetch Sentinel-2 composites → export the EE aggregates → download biomass rasters → merge outputs.
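
As a sketch, the two download steps can be chained from Python; the flags shown are the ones defined by `download_s2_monthly.py` and `download_lwb_rasters.py` (paths and the project ID are placeholders to update):

```python
import subprocess

steps = [
    ["python", "scripts/download_s2_monthly.py",
     "--project-id", "your-ee-project-id",
     "--grid-path", "lamwo_grid.geojson",
     "--output-dir", "s2_tiles", "--year", "2024"],
    ["python", "scripts/download_lwb_rasters.py",
     "--aoi-path", "lamwo_aoi.gpkg",
     "--download-dir", "aglwb_tiles",
     "--clip-dir", "aglwb_clipped"],
]
for cmd in steps:
    subprocess.run(cmd, check=True)  # stop at the first failing step
```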
112 changes: 112 additions & 0 deletions scripts/download_lwb_rasters.py
@@ -0,0 +1,112 @@
"""Download and optionally clip Aboveground Live Woody Biomass rasters per tile."""
from __future__ import annotations

import argparse
import logging
import sys
from pathlib import Path

import geopandas as gpd
import requests
import rasterio
from rasterio.mask import mask

ROOT = Path(__file__).resolve().parents[1]
if str(ROOT) not in sys.path:
sys.path.insert(0, str(ROOT))

LOGGER = logging.getLogger(__name__)

DEFAULT_AOI_PATH = Path("PATH/TO/lamwo_aoi.gpkg")
DEFAULT_SERVICE_URL = (
"https://services2.arcgis.com/g8WusZB13b9OegfU/arcgis/rest/services/"
"Aboveground_Live_Woody_Biomass_Density/FeatureServer/0/query?where=1%3D1&"
"outFields=tile_id,Mg_px_1_download,Shape__Area,Shape__Length&outSR=4326&f=json"
)
DEFAULT_DOWNLOAD_DIR = Path("PATH/TO/output/aglwb_tiles")
DEFAULT_CLIP_DIR = Path("PATH/TO/output/aglwb_clipped")


def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--aoi-path", type=Path, default=DEFAULT_AOI_PATH, help="AOI footprint (GeoPackage/GeoJSON)")
parser.add_argument("--service-url", default=DEFAULT_SERVICE_URL, help="ArcGIS FeatureServer query URL")
parser.add_argument("--download-dir", type=Path, default=DEFAULT_DOWNLOAD_DIR, help="Directory for downloaded rasters")
parser.add_argument("--clip-dir", type=Path, default=DEFAULT_CLIP_DIR, help="Directory for AOI-clipped rasters")
parser.add_argument("--tile-field", default="tile_id", help="Field name containing tile identifier")
parser.add_argument(
"--url-field",
default="Mg_px_1_download",
help="Field containing download URLs for each tile raster",
)
parser.add_argument("--overwrite", action="store_true", help="Overwrite existing files")
parser.add_argument("--http-timeout", type=int, default=600, help="HTTP timeout (seconds)")
return parser.parse_args()


def download_raster(url: str, output_path: Path, timeout: int, overwrite: bool) -> bool:
if output_path.exists() and not overwrite:
LOGGER.info("Skipping existing %s", output_path)
return False

response = requests.get(url, stream=True, timeout=timeout)
response.raise_for_status()
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "wb") as f:
for chunk in response.iter_content(chunk_size=8192):
if chunk:
f.write(chunk)
LOGGER.info("Downloaded %s", output_path)
return True


def clip_raster_to_aoi(raster_path: Path, clip_path: Path, geometry) -> Path:
clip_path.parent.mkdir(parents=True, exist_ok=True)
with rasterio.open(raster_path) as src:
out_image, out_transform = mask(src, [geometry], crop=True)
out_meta = src.meta.copy()
out_meta.update(
{
"height": out_image.shape[1],
"width": out_image.shape[2],
"transform": out_transform,
}
)
with rasterio.open(clip_path, "w", **out_meta) as dest:
dest.write(out_image)
LOGGER.info("Clipped %s", clip_path)
return clip_path


def main() -> None:
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
args = parse_args()

aoi = gpd.read_file(args.aoi_path)
if aoi.empty:
raise ValueError(f"AOI at {args.aoi_path} is empty")
    # Dissolve all AOI features into a single WGS84 geometry.
    aoi_geom = aoi.to_crs(epsg=4326).geometry.unary_union

tiles = gpd.read_file(args.service_url)
tiles = tiles.to_crs(epsg=4326)
intersects = tiles[tiles.intersects(aoi_geom)]
LOGGER.info("Found %d biomass tiles intersecting AOI", len(intersects))

for _, row in intersects.iterrows():
tile_id = row.get(args.tile_field)
url = row.get(args.url_field)
if not url:
LOGGER.warning("Row %s missing URL field %s", tile_id, args.url_field)
continue

download_path = args.download_dir / f"{tile_id}.tif"
downloaded = download_raster(url, download_path, args.http_timeout, args.overwrite)

        clip_path = args.clip_dir / f"{tile_id}.tif"
        # Clip when the tile was just downloaded, overwrite is set, or the
        # clipped copy is missing (so skipped downloads still get clipped).
        if downloaded or args.overwrite or not clip_path.exists():
            clip_raster_to_aoi(download_path, clip_path, aoi_geom)


if __name__ == "__main__":
main()
149 changes: 149 additions & 0 deletions scripts/download_s2_monthly.py
@@ -0,0 +1,149 @@
"""Download monthly Sentinel-2 composites per tile using Earth Engine."""
from __future__ import annotations

import argparse
import io
import logging
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Iterable, Tuple

import geopandas as gpd
import numpy as np
import requests

ROOT = Path(__file__).resolve().parents[1]
if str(ROOT) not in sys.path:
sys.path.insert(0, str(ROOT))

import ee # type: ignore

from utils.lamwo_pipeline import (
ee_bytes_to_img,
geom_to_ee,
init_ee,
reshape_monthly_stack,
)

LOGGER = logging.getLogger(__name__)

DEFAULT_PROJECT_ID = "your-ee-project-id"
DEFAULT_GRID_PATH = Path("PATH/TO/input/lamwo_grid.geojson")
DEFAULT_OUTPUT_DIR = Path("PATH/TO/output/s2_tiles")
S2_DATASET_ID = "COPERNICUS/S2_SR_HARMONIZED"
DEFAULT_BANDS = ["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B11", "B12"]


def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--project-id", default=DEFAULT_PROJECT_ID, help="Earth Engine project ID (update placeholder)")
parser.add_argument("--grid-path", type=Path, default=DEFAULT_GRID_PATH, help="Path to the tile grid GeoJSON")
parser.add_argument("--output-dir", type=Path, default=DEFAULT_OUTPUT_DIR, help="Directory for tile numpy stacks")
parser.add_argument("--year", type=int, default=2024, help="Calendar year to composite")
parser.add_argument("--dataset-id", default=S2_DATASET_ID, help="Earth Engine dataset to sample (default: Sentinel-2 SR)")
parser.add_argument("--bands", nargs="+", default=DEFAULT_BANDS, help="Band list to include per month")
parser.add_argument("--max-cloud", type=float, default=20.0, help="Cloud percentage threshold")
parser.add_argument("--scale", type=float, default=10.0, help="Pixel scale in metres for downloads")
parser.add_argument("--format", default="NPY", choices=["NPY"], help="Download format (NPY only)")
parser.add_argument("--max-workers", type=int, default=1, help="Concurrent download workers")
parser.add_argument("--http-timeout", type=int, default=600, help="HTTP timeout per tile download (seconds)")
return parser.parse_args()


def monthly_stack(region: ee.Geometry, *, year: int, dataset_id: str, bands: Iterable[str], max_cloud: float) -> ee.Image:
collection = ee.ImageCollection(dataset_id)
composites = []
for month in range(1, 13):
start = ee.Date.fromYMD(year, month, 1)
end = start.advance(1, "month")
filtered = (
collection
.filterBounds(region)
.filterDate(start, end)
.filter(ee.Filter.lt("CLOUDY_PIXEL_PERCENTAGE", max_cloud))
.select(list(bands))
)
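        # Median composite for the month; if no scenes pass the date/cloud filters,
        # the composite has no bands and the download for this tile/year will fail.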
image = filtered.median()
month_str = ee.Number(month).format("%02d")
renamed = image.rename(image.bandNames().map(lambda b: ee.String(b).cat("_m").cat(month_str)))
composites.append(renamed)
return ee.Image.cat(composites)


def resolve_tile_label(row, fallback_index: int) -> str:
for candidate in ("tile_id", "tileID", "id"):
value = row.get(candidate)
if value not in (None, ""):
return str(value)
return f"{fallback_index:04d}"


def process_tile(
idx: int,
row,
*,
args: argparse.Namespace,
band_names: Iterable[str],
) -> Tuple[str, Path]:
tile_label = resolve_tile_label(row, idx)
geom = row.geometry
region = geom_to_ee(geom)
image = monthly_stack(
region,
year=args.year,
dataset_id=args.dataset_id,
bands=args.bands,
max_cloud=args.max_cloud,
)
url = image.getDownloadURL(
{
"bands": list(band_names),
"region": region,
"scale": args.scale,
"format": args.format,
}
)
response = requests.get(url, timeout=args.http_timeout)
response.raise_for_status()
payload = io.BytesIO(response.content)
pixel_data = np.load(payload, allow_pickle=True)
array = ee_bytes_to_img(pixel_data)
cube = reshape_monthly_stack(array, len(args.bands))

output_dir = args.output_dir
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / f"tile_{tile_label}_{args.year}.npy"
np.save(output_path, cube)
LOGGER.info("Saved %s", output_path)
return tile_label, output_path


def main() -> None:
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
args = parse_args()

init_ee(args.project_id)

tiles = gpd.read_file(args.grid_path).to_crs(epsg=4326).reset_index(drop=True)
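    # Month-major band names (all bands for m01, then m02, ...), matching the
    # order in which monthly_stack concatenates the monthly composites.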
band_names = [f"{band}_m{month:02d}" for month in range(1, 13) for band in args.bands]

if args.max_workers > 1:
with ThreadPoolExecutor(max_workers=args.max_workers) as executor:
futures = [
executor.submit(process_tile, idx, row, args=args, band_names=band_names)
for idx, row in tiles.iterrows()
]
for future in as_completed(futures):
try:
future.result()
except Exception as exc: # pragma: no cover - surfacing errors
LOGGER.error("Tile download failed: %s", exc)
raise
else:
for idx, row in tiles.iterrows():
process_tile(idx, row, args=args, band_names=band_names)


if __name__ == "__main__":
main()