Measures the bits/sec of information that a matched TinyGPT (~19 M non-embedding parameters) extracts from different tokenizations of the same speech, under an iso-data, single-epoch budget. Reports bits/token, bits/sec, bits/byte (BPB), channel capacity, utilization, training FLOPs, and inference FLOPs per bit/sec.
Two iterations, two datasets. Text BPE is in both as a same-data linguistic floor.
v1 (Emilia):

| Repr | bits/tok | tok/sec | bits/sec | capacity (bps) | util | train tokens | trunk PFLOPs |
|---|---|---|---|---|---|---|---|
| Text BPE (GPT-2) | 6.14 | 3.58 | 22 | 56 | 39% | 257 M | 29 |
| NeuCodec (FSQ, 50 fps) | 10.17 | 50.0 | 508 | 800 | 64% | 3,591 M | 407 |

v2 (LibriTTS-R):

| Repr | bits/tok | tok/sec | bits/sec | capacity (bps) | util | train tokens | trunk PFLOPs |
|---|---|---|---|---|---|---|---|
| Text BPE (GPT-2) (under-trained; see writeup §6) | 7.55 | 4.02 | 30 | 63 | 48% | 7.8 M | 0.9 |
| Mimi semantic (cb 0, 12.5 fps) | 4.93 | 12.5 | 62 | 138 | 45% | 24 M | 2.7 |
| Mimi all-flat (8 cb, 100 fps) | 6.45 | 100 | 645 | 1,400 | 46% | 194 M | 22 |
| SNAC Orpheus-flat (3 lvl, 84 fps) | 7.80 | 84 | 653 | 1,138 | 57% | 162 M | 18 |
trunk PFLOPs = 6 × N × tokens with N = 18.88 M non-embedding params (Kaplan-2020 convention; transformer trunk only, excludes the vocab × dim embedding/output-projection matrix). We report trunk-only because (1) it is vocab-independent, so cross-representation ratios are clean, and (2) at production scale (~1 B params, d ≈ 4 k) the output projection shrinks to ~2% of trunk, so the trunk number is the scale-relevant quantity and the per-vocab output-projection overhead is a small-model artifact. We deliberately omit wall-clock time here: GPU utilization is uniformly low (~6–11% of H100 peak) and our batch-size rule (bs=32 for vocab > 16 k, else bs=64) wasn't uniform across runs, so wall-clock ratios are noisy. Full discussion in writeup.md §0.
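As a quick consistency check, the trunk PFLOPs column can be reproduced from the formula above. This is a minimal sketch, not code from `results.py`: the function names are hypothetical, the token counts are the rounded values from the tables, and the parameter-count helper assumes a standard block layout (4× MLP, no biases) that this README does not state explicitly.

```python
# Sketch: reproduce the "trunk PFLOPs" column from 6 * N * tokens.
# Function names are illustrative and are NOT taken from results.py.


def trunk_params(n_layers: int, d_model: int) -> int:
    """Non-embedding parameter count of the transformer trunk.

    ASSUMPTION (not stated in the README): each block holds 4*d^2 attention
    weights, a 4x MLP with 8*d^2 weights, no biases, and two RMSNorm gain
    vectors; one extra RMSNorm sits after the last block.
    """
    per_block = 12 * d_model * d_model + 2 * d_model
    return n_layers * per_block + d_model


def trunk_pflops(n_params: float, train_tokens: float) -> float:
    """Training compute of the trunk in PFLOPs, using 6 * N * tokens."""
    return 6 * n_params * train_tokens / 1e15


N = 18.88e6  # non-embedding params, as quoted in the text

# Rounded train-token counts copied from the two tables.
TRAIN_TOKENS = {
    "v1 text BPE": 257e6,
    "v1 NeuCodec": 3591e6,
    "v2 text BPE": 7.8e6,
    "v2 Mimi semantic": 24e6,
    "v2 Mimi all-flat": 194e6,
    "v2 SNAC Orpheus-flat": 162e6,
}

print(f"assumed-layout trunk params: {trunk_params(6, 512):,}")
for name, tokens in TRAIN_TOKENS.items():
    print(f"{name:22s} {trunk_pflops(N, tokens):7.1f} PFLOPs")
```

Under the stated layout assumption, 6 layers at 512 dim give a parameter count that rounds to the quoted 18.88 M, and the per-run PFLOPs match the table to the precision shown.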
Three regimes, not a smooth curve:
- linguistic floor (text BPE) ≈ 22 bps (Emilia, well-trained)
- content floor with semantic codec (Mimi cb 0) ≈ 62 bps → the "non-linguistic content surcharge" over the linguistic floor is only ~40 bps (62 − 22)
- codec ceiling under flat LM (RVQ codecs) ≈ 650 bps → mostly codec reconstruction overhead, not content
Mimi-all and SNAC converge on the same bits/sec (645 / 653) despite different fps and vocab, suggesting a robust "raw acoustic RVQ codec ceiling" for English speech under a flat LM. NeuCodec at 508 bps sits at a different point on the same Pareto frontier — fewer / denser tokens, more aggressive lossy quantization.
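The derived columns in the tables can be cross-checked from the raw ones. The sketch below assumes that bits/sec = bits/tok × tok/sec, that util = bits/sec ÷ capacity, and that bits/tok is a cross-entropy loss converted from nats; these relations are inferred from the column names rather than confirmed against `results.py`, and all function names are hypothetical.

```python
import math

# Sketch: re-derive bits/sec and util from the other table columns.
# The relations below are ASSUMED from the column names, not read from results.py.

# name: (bits/tok, tok/sec, reported bits/sec, capacity bps, reported util %)
ROWS = {
    "text_bpe_emilia": (6.14, 3.58, 22, 56, 39),
    "neucodec": (10.17, 50.0, 508, 800, 64),
    "text_bpe_libritts": (7.55, 4.02, 30, 63, 48),
    "mimi_semantic": (4.93, 12.5, 62, 138, 45),
    "mimi_all_flat": (6.45, 100.0, 645, 1400, 46),
    "snac_orpheus_flat": (7.80, 84.0, 653, 1138, 57),
}


def nats_to_bits(loss_nats: float) -> float:
    """Convert a cross-entropy loss in nats/token to bits/token."""
    return loss_nats / math.log(2)


def bits_per_sec(bits_per_tok: float, tok_per_sec: float) -> float:
    """Information rate: bits per token times tokens per second of audio."""
    return bits_per_tok * tok_per_sec


def utilization(bps: float, capacity_bps: float) -> float:
    """Fraction of the nominal channel capacity that the model still pays for."""
    return bps / capacity_bps


for name, (bpt, tps, rep_bps, cap, rep_util) in ROWS.items():
    bps = bits_per_sec(bpt, tps)
    util = utilization(rep_bps, cap)
    print(f"{name:18s} bits/sec={bps:6.1f} (table {rep_bps})  "
          f"util={util:5.1%} (table {rep_util}%)")

# Gaps between the three regimes quoted in the list above.
print("semantic surcharge over text floor:", 62 - 22, "bps")
print("flat-RVQ ceiling over semantic floor: ~", 650 - 62, "bps")
```

Small mismatches (for example SNAC at 655 vs. the reported 653) are what one would expect from recomputing with the two-decimal values printed in the table.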
📈 Full narrative: notes/writeup.md · methodology rationale: notes/methodology.md · v0 archive (LibriTTS-R + Mimi/SNAC coarse, overfit iso-token): notes/v0/.
🤗 Public datasets: Trelis/libritts-tokens-codecs — LibriTTS-R tokenized with GPT-2 BPE, Mimi (semantic + all-flat), and SNAC, as per-split parquet.
- Compute: single H100 per run on Modal, bf16, plain PyTorch.
- Modal env name is parameterized via `$MODAL_ENV`. Create one with `modal environment create <name>` and `export MODAL_ENV=<name>`.
- W&B logging (optional but recommended): create a Modal Secret called `wandb-secret` containing `WANDB_API_KEY`. The training functions read it automatically. Without it, runs work but skip W&B.
- HuggingFace dependencies are downloaded via `hf_transfer`. No HF token needed for the public datasets used (`neuphonic/emilia-yodas-english-neucodec`, `parler-tts/libritts_r_filtered`).
| File | Purpose |
|---|---|
| `modal_app.py` | Modal functions: v1 (Emilia download + build_bins + train) and v2 (LibriTTS download + 3 codec encoders + train) |
| `model.py` | TinyGPT (6 L / 512 d / 8 h, RMSNorm, RoPE, weight-tied) |
| `train_loop.py` | bf16 training; WSD schedule (warmup → constant → cosine decay); max_epochs cap; W&B integration |
| `results.py` | bits/tok, bits/sec, BPB, info/storage, capacity, util, FLOPs/bit |
| `plots.py` | headline, bits/sec, BPB, FLOPs/bit, loss-curves, decomposition |
| `sanity.py` | frame-rate, ballpark, token-diversity checks |
v1 (Emilia):

    modal run --env=$MODAL_ENV modal_app.py::download_emilia
    modal run --env=$MODAL_ENV modal_app.py::build_bins
    modal run --env=$MODAL_ENV modal_app.py::train --repr-name text --max-epochs 1.0
    modal run --env=$MODAL_ENV modal_app.py::train --repr-name neucodec --max-epochs 1.0
    # Pull bins + runs locally for analysis
    modal volume get audio-bits-data /bins ./local_data/ --env=$MODAL_ENV
    modal volume get audio-bits-data /runs ./local_data/ --env=$MODAL_ENV
    AUDIO_BITS_DATA=./local_data python results.py && python plots.py && python sanity.py

v2 (LibriTTS-R):

    modal run --env=$MODAL_ENV modal_app.py::v2_download_libritts
    modal run --env=$MODAL_ENV modal_app.py::v2_tokenize_text
    modal run --env=$MODAL_ENV modal_app.py::v2_encode_mimi      # produces mimi_sem + mimi_all
    modal run --env=$MODAL_ENV modal_app.py::v2_encode_snac      # produces snac_orpheus
    modal run --env=$MODAL_ENV modal_app.py::v2_encode_neucodec  # warning: slow; consider --only-clean-100 first
    for r in text mimi_sem mimi_all snac_orpheus neucodec; do
      modal run --env=$MODAL_ENV modal_app.py::v2_train --repr-name $r --max-epochs 1.0
    done
    modal volume get audio-bits-data /v2_bins ./local_data/ --env=$MODAL_ENV
    modal volume get audio-bits-data /v2_runs ./local_data/ --env=$MODAL_ENV
    AUDIO_BITS_DATA=./local_data python results.py && python plots.py && python sanity.py

The pipeline creates two Modal volumes if they don't exist (free tier covers this):
- `audio-bits-data`: datasets, bins, training runs
- `audio-bits-hf-cache`: HuggingFace model + dataset cache (shared across runs)
- All training is single-epoch, iso-data. Random seeds default to 0.
- The 4-process Modal limit during v2 is a personal usage rule; the pipeline itself doesn't enforce it.
- `parler-tts/libritts_r_filtered` is a filtered version of LibriTTS-R that drops low-quality utterances; the actual duration is ~538 h, not the nominal 960 h.
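The ~538 h figure can be sanity-checked against the v2 table: under a single epoch, train tokens divided by tok/sec should give roughly the same audio duration for every representation. The snippet below is an illustrative sketch using the rounded table values; the helper name is hypothetical.

```python
# Sketch: back out the audio duration implied by each v2 row.
# Inputs are the rounded "train tokens" and "tok/sec" values from the v2 table.

V2_ROWS = {
    # name: (train tokens, tokens per second of audio)
    "text_bpe": (7.8e6, 4.02),
    "mimi_semantic": (24e6, 12.5),
    "mimi_all_flat": (194e6, 100.0),
    "snac_orpheus_flat": (162e6, 84.0),
}


def implied_hours(train_tokens: float, tok_per_sec: float) -> float:
    """Hours of audio covered by one epoch of `train_tokens` tokens."""
    return train_tokens / tok_per_sec / 3600.0


for name, (tokens, tps) in V2_ROWS.items():
    print(f"{name:18s} ~{implied_hours(tokens, tps):5.0f} h")
```

All four rows land within about 1% of 538 h, which is consistent with every v2 run seeing the same filtered corpus once.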
TinyGPT: 6 layers, 512 dim, 8 heads, RMSNorm, RoPE, weight-tied embeddings. 18.9 M non-embedding params. AdamW (lr 3e-4 → 3e-5, β 0.9/0.95, wd 0.1), bf16, batch 32 (vocab > 16 k) or 64 (vocab ≤ 16 k), seq 1024. WSD schedule in v2 (warmup 5% → constant 85% → cosine decay 10%); v1 used pure cosine.
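The v2 WSD schedule described above can be sketched as a step-to-learning-rate function. This is an illustration under stated assumptions, not the code in `train_loop.py`: it assumes linear warmup starting from 0, a cosine decay that ends exactly at the 3e-5 floor, and phase boundaries rounded to whole steps; `wsd_lr` and its parameters are hypothetical names.

```python
import math

PEAK_LR = 3e-4  # from the config above
MIN_LR = 3e-5   # from the config above


def wsd_lr(step: int, total_steps: int,
           warmup_frac: float = 0.05, decay_frac: float = 0.10) -> float:
    """Warmup -> stable -> decay learning rate at a given optimizer step.

    ASSUMPTIONS: linear warmup from 0 to PEAK_LR, constant plateau,
    then cosine decay from PEAK_LR down to MIN_LR over the last 10%.
    """
    warmup_steps = round(warmup_frac * total_steps)
    decay_steps = round(decay_frac * total_steps)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear ramp; never reached when warmup_steps == 0.
        return PEAK_LR * step / warmup_steps
    if step < decay_start:
        # Constant plateau (85% of training with the default fractions).
        return PEAK_LR
    # Cosine decay; t runs from 0 at decay_start to 1 at total_steps.
    t = min(1.0, (step - decay_start) / max(1, decay_steps))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * t))


for s in (0, 25, 50, 500, 900, 950, 1000):
    print(f"step {s:4d}: lr = {wsd_lr(s, 1000):.2e}")
```

A function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning `wsd_lr(step, total_steps) / PEAK_LR` as the multiplier; whether the repository does so is not stated here.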
Code is MIT. Datasets retain their own licenses (Emilia-YODAS: CC-BY-NC, LibriTTS-R: CC-BY-4.0). Codec model weights (Mimi, SNAC, NeuCodec) follow their respective upstream licenses.