Query a cozip archive as a SQL table — locally, over HTTPS, S3, GCS, Azure, or HuggingFace — without downloading it. cozip places a Parquet manifest at byte 0 of the ZIP, so a multi-gigabyte archive becomes a queryable table with one or two HTTP range requests. No central-directory scan, no full download.
The archive is still a valid ZIP. unzip, zipfile.ZipFile, your OS file preview — all unchanged.
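Why can a reader grab the first entry without touching the central directory? The sketch below illustrates the general ZIP property involved, using only the Python standard library. It is not the cozip writer and assumes nothing about the real cozip layout: the entry names and the stand-in manifest bytes are hypothetical. The first entry's local header starts at byte 0, and a stored (uncompressed) entry's bytes follow it verbatim, so the head of the file is enough to recover it.

```python
import io
import struct
import zipfile

# Hypothetical names and contents, chosen only for this illustration.
MANIFEST_NAME = "manifest.bin"
MANIFEST_BYTES = b"stand-in for a Parquet manifest"


def build_archive() -> bytes:
    """Write a ZIP whose first entry is stored uncompressed."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(MANIFEST_NAME, MANIFEST_BYTES, compress_type=zipfile.ZIP_STORED)
        zf.writestr("tiles/a.bin", b"A" * 1000, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def read_first_entry(head: bytes) -> bytes:
    """Parse the local file header at byte 0 and return the stored payload."""
    # Local header: 30 fixed bytes, then the file name, then the extra field.
    fields = struct.unpack("<IHHHHHIIIHH", head[:30])
    signature, method, size = fields[0], fields[3], fields[7]
    name_len, extra_len = fields[9], fields[10]
    if signature != 0x04034B50 or method != zipfile.ZIP_STORED:
        raise ValueError("expected a stored entry at byte 0")
    start = 30 + name_len + extra_len
    return head[start:start + size]


archive = build_archive()
# Only the first 256 bytes are needed, as a single range request would return.
manifest = read_first_entry(archive[:256])
# Ordinary ZIP tooling still sees every entry.
names = zipfile.ZipFile(io.BytesIO(archive)).namelist()
```

Over HTTP, `archive[:256]` would correspond to one range request for the head of the file; how many bytes cozip actually asks for is defined by its spec, not by this sketch.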
```sql
INSTALL cozip FROM community;
LOAD cozip;
```

Works on every DuckDB target: Linux, macOS, Windows, and WebAssembly (try it on shell.duckdb.org).
```sql
SELECT *
FROM read_cozip('https://huggingface.co/datasets/Major-TOM/Core-VIIRS-Nighttime-Light/resolve/main/2024/MAJORTOM-VIIRS-NTL_2024_median_000.zip')
LIMIT 10;
```

One row per entry inside the archive: `name`, `offset`, `size`, plus whatever columns the writer included in `__metadata__` (split, label, geometry, …). Filter, join, sample, then fetch only the payloads you actually need.
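Fetching a single payload from a row's `offset` and `size` can be sketched as follows. This is a minimal illustration, assuming that `offset` marks the first payload byte and that the payload is stored uncompressed; the source does not state either, so check the cozip spec before relying on it. The helper names and entry names are hypothetical, and the demo reads from an in-memory archive instead of a real URL.

```python
import io
import struct
import zipfile


def range_header(offset: int, size: int) -> dict:
    """Build an HTTP Range header for `size` bytes starting at `offset`."""
    # Range end positions are inclusive, hence the -1.
    return {"Range": f"bytes={offset}-{offset + size - 1}"}


def read_slice(fileobj, offset: int, size: int) -> bytes:
    """Local equivalent of a range request: seek, then read `size` bytes."""
    fileobj.seek(offset)
    return fileobj.read(size)


def stored_payload_offset(archive: bytes, info: zipfile.ZipInfo) -> int:
    """Compute where a stored entry's bytes begin (used by the demo only)."""
    # The local header is 30 fixed bytes followed by the name and extra field.
    name_len, extra_len = struct.unpack_from("<HH", archive, info.header_offset + 26)
    return info.header_offset + 30 + name_len + extra_len


# Demo archive with hypothetical entry names, all stored uncompressed.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("train/0001.bin", b"first payload")
    zf.writestr("train/0002.bin", b"second payload")
archive = buf.getvalue()

with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    info = zf.getinfo("train/0002.bin")

offset = stored_payload_offset(archive, info)
payload = read_slice(io.BytesIO(archive), offset, info.file_size)
```

For a remote archive, the dictionary from `range_header(offset, size)` can be passed as request headers to any HTTP client, and the response body is the payload.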
```sql
-- Sample 32 training tiles from a remote archive without downloading it.
SELECT name, "cozip:gdal_vsi"
FROM read_cozip('s3://my-bucket/dataset.cozip')
WHERE split = 'train'
USING SAMPLE 32 ROWS;
```

| Scheme | Backend | Notes |
|---|---|---|
| `/local/path` | local FS | No extension required. |
| `https://` | httpfs | Autoloads. Supports Range requests. |
| `s3://` | httpfs | S3-compatible (R2, MinIO) via `SET s3_*` settings. |
| `gs://`, `gcs://` | httpfs | Google Cloud Storage. |
| `azure://` | azure | Autoloads. |
| `hf://datasets/...` | hf | Autoloads. |
Every row includes a synthetic cozip:gdal_vsi column with a ready-made /vsisubfile/<offset>_<size>,/vsi.../<url> path. Hand it to GDAL, rasterio, or anything that speaks VSI to open the inner file without re-downloading the archive.
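The shape of such a path can be sketched in a few lines. This is a hypothetical helper, not the extension's code: `gdal_vsi_path` is a name invented here, it handles only local paths, HTTP(S), and S3, and the exact strings the extension emits may differ. It relies on GDAL's documented `/vsisubfile/<offset>_<size>,<filename>` syntax together with the `/vsicurl/` and `/vsis3/` handlers.

```python
def gdal_vsi_path(location: str, offset: int, size: int) -> str:
    """Compose a /vsisubfile/ path for a byte range inside an archive."""
    if location.startswith(("https://", "http://")):
        # /vsicurl/ takes the full URL, scheme included.
        inner = "/vsicurl/" + location
    elif location.startswith("s3://"):
        # /vsis3/ takes bucket/key without the scheme.
        inner = "/vsis3/" + location[len("s3://"):]
    elif "://" in location:
        # Other schemes are left out of this sketch on purpose.
        raise ValueError(f"scheme not covered by this sketch: {location}")
    else:
        # Local files need no outer handler.
        inner = location
    return f"/vsisubfile/{offset}_{size},{inner}"


example = gdal_vsi_path("https://example.com/dataset.zip", 1024, 2048)
```

The result is an ordinary string, so it can be passed to `rasterio.open` or `gdal.Open` like any other path.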
```python
import duckdb, rasterio

rows = duckdb.sql("""
    SELECT name, "cozip:gdal_vsi"
    FROM read_cozip('https://.../dataset.zip')
    WHERE split = 'val'
    LIMIT 8
""").fetchall()

for name, vsi in rows:
    with rasterio.open(vsi) as src:
        ...  # GDAL issues range requests against the archive
```

Skip it with `gdal_vsi := false`:
```sql
SELECT name, offset, size
FROM read_cozip('dataset.zip', gdal_vsi := false);
```

- cozip: the format, the spec, and bindings for Python, R, Julia, JavaScript, and C.
MIT