Skip to content

Latest commit

 

History

History
59 lines (47 loc) · 2.15 KB

File metadata and controls

59 lines (47 loc) · 2.15 KB

Wine AI Dataset

The largest publicly available wine bottle image + text dataset, refreshed for 2025-era multimodal research.

Quick Start

# Install dependencies (requires Python 3.11+ and uv)
uv pip install -e .

# Dataset auto-downloads from HuggingFace on first use
uv run python -c "from wine_ai.data.loaders import WineDataset; ds = WineDataset.load_latest(); print(f'{len(ds.train)} train samples')"

# Quick training pipeline test
uv run wine-train --config configs/test_training.yaml --no-wandb

# Full production training
uv run wine-train --config configs/train_description.yaml

Dataset Highlights

  • 125,787 unique wines with complete metadata
  • 107,821 validated bottle images (17,966 missing locally)
  • Rich tasting descriptions, URLs, prices, and HTML archives
  • Deterministic train/validation/test splits stored in data/processed/train_val_test_splits.json

Repository Layout

data/
  raw/                # Immutable Parquet exports + metadata
  processed/          # Clean datasets and flat image storage
  external/           # Synthetic GPT-2 outputs and augmentation artefacts
src/wine_ai/          # Python package for data, models, training, inference
configs/              # Training configurations (YAML)
experiments/          # Historical and modern research runs
notebooks/            # Exploratory analysis and prototyping (includes quick-start)

Python API

from wine_ai.data.loaders import WineDataset, WineImageLoader

# Load dataset with train/val/test splits
dataset = WineDataset.load_latest()
print(f"Train: {len(dataset.train)} samples")

# Access images
loader = WineImageLoader()
sample = dataset.train.iloc[0]
if loader.exists(sample['image_filename']):
    image_path = loader.path_for(sample['image_filename'])

Why This Matters

  • Irreplaceable data – wine.com changed structure post-2020; this scrape cannot be repeated.
  • Research value – largest paired wine image + text corpus for multimodal modelling.
  • Commercial potential – recommendation engines, tasting note generation, virtual label design.

See DATASET.md for detailed documentation and CONTRIBUTING.md for development guidelines.