This document reflects the on-disk data under `./data/` and the exported metrics in `./archive/results/` (written at the end of `python train.py`).
| Item | Value |
|---|---|
| Base checkpoint | google/vit-base-patch16-224 |
| Custom classes | my_cat, my_dog, my_car, my_house, my_phone |
| Train / validation split | 80% / 20%, stratified, random_state=42 (sklearn.model_selection.train_test_split) |
| Validation accuracy (Trainer) | 79.59% (eval_accuracy in archive/results/eval_summary.json: 0.795918…) |
| Reported accuracy (sklearn) | 80% (two-decimal rounding on the same 49 validation samples) |
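The two figures agree: 0.795918… corresponds to 39 correct predictions out of 49 validation samples (an arithmetic reading of the published ratio, not a value taken from the repo), and two-decimal rounding yields the 80% headline. A minimal check:

```python
correct, total = 39, 49            # 49 validation samples, 39 predicted correctly
eval_accuracy = correct / total    # 0.7959183673... -> matches eval_summary.json
print(round(eval_accuracy, 2))     # 0.8 -> the "80%" from two-decimal rounding
```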
Several downloaded or scraped images mixed visual cues between `my_house` and `my_dog`: e.g., dogs in front of buildings, wide outdoor façades that looked like pet-centric thumbnails, or house exteriors loosely tagged as “animal” scenes. That label noise inflated confusion between the two classes and hurt validation metrics until the folders were manually reviewed.
What we did
- Opened suspect files in `data/my_house/` and `data/my_dog/` and removed or re-filed mislabeled examples.
- Applied the same discipline to the `my_phone` and `my_house` buckets (removing off-topic or broken downloads), which had previously helped lift validation accuracy from roughly 51% to about 80%, before the dataset reached its current size.
After cleanup, counts are closer to balanced across most classes; `my_house` and `my_phone` remain somewhat smaller (see the tables below).
These counts are the number of image files per class (extensions: `.jpg`, `.jpeg`, `.png`, `.bmp`, `.gif`). They are mirrored in `archive/results/dataset_split.csv`.
| Class | Images on disk |
|---|---|
| my_car | 60 |
| my_cat | 60 |
| my_dog | 60 |
| my_house | 46 |
| my_phone | 16 |
| Total | 242 |
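For reference, per-class counts like these can be computed with a few lines. This is a minimal sketch assuming a flat `data/<class>/` layout; `class_counts` is a hypothetical helper, not part of the repository:

```python
from pathlib import Path

EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}

def class_counts(root: str = "data") -> dict:
    """Count image files per class folder, mirroring dataset_split.csv."""
    return {
        d.name: sum(1 for f in d.iterdir() if f.suffix.lower() in EXTS)
        for d in sorted(Path(root).iterdir()) if d.is_dir()
    }

print(class_counts())  # e.g. {'my_car': 60, 'my_cat': 60, ...}
```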
- There are 16 curated phone images in `data/my_phone/` in this repository snapshot.
- With the stratified 80/20 split and `random_state=42`, 13 of those 16 are assigned to training and 3 to validation (see the next table and the split sketch below). So “13” corresponds to training samples for `my_phone`, not the folder count.
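To illustrate how the 13/3 counts fall out, here is a minimal sketch of the stratified split under the same `data/<class>/` layout assumption; this is a reconstruction, not the exact code in `train.py`:

```python
from collections import Counter
from pathlib import Path
from sklearn.model_selection import train_test_split

EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}
paths = [p for p in Path("data").rglob("*") if p.suffix.lower() in EXTS]
labels = [p.parent.name for p in paths]  # class = parent folder name

train_p, val_p, train_y, val_y = train_test_split(
    paths, labels, test_size=0.2, stratify=labels, random_state=42)

print(Counter(train_y)["my_phone"], Counter(val_y)["my_phone"])  # 13 3
```

With `test_size=0.2` and stratification, each class contributes roughly 20% of its images to validation (16 × 0.2 ≈ 3 for `my_phone`).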
Same numbers as `archive/results/dataset_split.csv` (generated automatically when you train).
| Class | Total | Train | Validation |
|---|---|---|---|
| my_car | 60 | 48 | 12 |
| my_cat | 60 | 48 | 12 |
| my_dog | 60 | 48 | 12 |
| my_house | 46 | 36 | 10 |
| my_phone | 16 | 13 | 3 |
| ALL | 242 | 193 | 49 |
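A sketch of the export step that could produce `dataset_split.csv` from the split above; the column names are assumptions, since the actual writer lives in `src/models/train.py`:

```python
import csv
from collections import Counter

def write_dataset_split(train_y, val_y, out="archive/results/dataset_split.csv"):
    """Write per-class train/val counts plus an ALL row, as in the table above."""
    tr, va = Counter(train_y), Counter(val_y)
    with open(out, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["class", "total", "train", "validation"])
        for cls in sorted(set(tr) | set(va)):
            w.writerow([cls, tr[cls] + va[cls], tr[cls], va[cls]])
        w.writerow(["ALL", len(train_y) + len(val_y), len(train_y), len(val_y)])
```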
Values below are from a full training run (default 30 epochs, batch size 8, learning rate capped at 2e-5), matching the sklearn `classification_report` printed by `src/models/train.py` and saved to `archive/results/validation_per_class.csv`.
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| my_car | 0.78 | 0.58 | 0.67 | 12 |
| my_cat | 0.86 | 1.00 | 0.92 | 12 |
| my_dog | 0.82 | 0.75 | 0.78 | 12 |
| my_house | 0.77 | 1.00 | 0.87 | 10 |
| my_phone | 0.50 | 0.33 | 0.40 | 3 |
| macro avg | 0.74 | 0.73 | 0.73 | 49 |
| weighted avg | 0.79 | 0.80 | 0.78 | 49 |
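The table can be regenerated with `sklearn.metrics.classification_report`. Below is a hedged sketch; `report_validation`, `trainer`, and `val_dataset` are assumed names, and the `TrainingArguments` values mirror the hyperparameters stated above rather than the exact contents of `train.py`:

```python
import numpy as np
from sklearn.metrics import classification_report
from transformers import Trainer, TrainingArguments

CLASSES = ["my_car", "my_cat", "my_dog", "my_house", "my_phone"]

# Hyperparameters as documented above (values in train.py may differ).
args = TrainingArguments(
    output_dir="archive/results",
    num_train_epochs=30,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)

def report_validation(trainer: Trainer, val_dataset) -> str:
    """Predict on the validation split and format the per-class table above."""
    pred = trainer.predict(val_dataset)            # logits + true labels
    y_pred = np.argmax(pred.predictions, axis=-1)
    return classification_report(pred.label_ids, y_pred,
                                 target_names=CLASSES, digits=2)
```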
Interpretation
- `my_phone` has only 3 validation images, so precision and recall swing heavily with a single mistake. Treat phone metrics as high-variance unless you add more phone images or use k-fold / more validation data (see the sketch below).
- `my_cat` and `my_house` show strong recall on this split; `my_car` recall is lower, so it is worth checking the remaining confusion pairs (car vs. house façade, etc.).
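If you want to follow the k-fold suggestion, here is a minimal `StratifiedKFold` sketch (reusing `paths` and `labels` from the split sketch above; fold-level retraining is left as a stub):

```python
from sklearn.model_selection import StratifiedKFold

# 5 folds keep ~3 my_phone images per validation fold; averaging metrics
# across folds damps the variance a single 3-image split suffers from.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(paths, labels)):
    # retrain and evaluate here, then average per-class precision/recall/F1
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```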
| Path | Purpose |
|---|---|
| `archive/results/dataset_split.csv` | Per-class and total train/val counts |
| `archive/results/validation_per_class.csv` | Per-class precision, recall, F1, support + averages |
| `archive/results/eval_summary.json` | `eval_accuracy`, sample counts, split metadata |
| `src/models/train.py` | Writes the above files after each training run |
To refresh all metrics after changing data or hyperparameters:
```bash
python model_custom.py                    # if classes/counts changed
python train.py
python test.py --image path/to/image.jpg
python test.py --directory ./some_folder
python main.py                            # Gradio UI
```

Last aligned with the exported `archive/results/*` files and the `data/` layout as documented above.