audio-bits

Measures how many bits per second of information a matched TinyGPT (~19 M non-embedding parameters) extracts from different tokenizations of the same speech, under an iso-data, single-epoch training budget. Reports bits/token, bits/sec, bits/byte (BPB), channel capacity, utilization, training FLOPs, and inference FLOPs per bit/sec.
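As a rough illustration of how the reported quantities relate, the sketch below converts a mean cross-entropy loss into bits/token, bits/sec, and bits/byte. It assumes the loss is in nats per token (the PyTorch `cross_entropy` default). The function names and the example loss value are hypothetical, not taken from `results.py`, and the byte count used for BPB is whatever reference stream the writeup normalizes by.

```python
import math


def bits_per_token(loss_nats: float) -> float:
    """Convert mean cross-entropy in nats/token to bits/token."""
    return loss_nats / math.log(2)


def bits_per_second(bits_tok: float, tok_per_sec: float) -> float:
    """Information rate: bits/token times tokens per second of audio."""
    return bits_tok * tok_per_sec


def bits_per_byte(bits_tok: float, total_tokens: int, total_bytes: int) -> float:
    """Total bits spent on the corpus divided by its size in bytes.

    Which byte count to use (UTF-8 text, raw audio, ...) is an assumption
    left to the caller.
    """
    return bits_tok * total_tokens / total_bytes


if __name__ == "__main__":
    loss = 4.256  # hypothetical eval loss in nats/token
    bpt = bits_per_token(loss)
    print(f"bits/tok = {bpt:.2f}")
    print(f"bits/sec = {bits_per_second(bpt, 3.58):.1f}")
```

Multiplying bits/token by the token rate is what makes representations with very different frame rates comparable on a single bits/sec axis.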

Headline result (same English speech, matched model)

Two iterations, two datasets. Text BPE is in both as a same-data linguistic floor.

v1 — Emilia-YODAS (20,000 h, pre-tokenized)

| Repr | bits/tok | tok/sec | bits/sec | capacity (bps) | util | train tokens | trunk PFLOPs |
|---|---|---|---|---|---|---|---|
| Text BPE (GPT-2) | 6.14 | 3.58 | 22 | 56 | 39% | 257 M | 29 |
| NeuCodec (FSQ, 50 fps) | 10.17 | 50.0 | 508 | 800 | 64% | 3,591 M | 407 |

v2 — LibriTTS-R (538 h, encoded ourselves)

| Repr | bits/tok | tok/sec | bits/sec | capacity (bps) | util | train tokens | trunk PFLOPs |
|---|---|---|---|---|---|---|---|
| Text BPE (GPT-2; under-trained, see writeup §6) | 7.55 | 4.02 | 30 | 63 | 48% | 7.8 M | 0.9 |
| Mimi semantic (cb 0, 12.5 fps) | 4.93 | 12.5 | 62 | 138 | 45% | 24 M | 2.7 |
| Mimi all-flat (8 cb, 100 fps) | 6.45 | 100 | 645 | 1,400 | 46% | 194 M | 22 |
| SNAC Orpheus-flat (3 lvl, 84 fps) | 7.80 | 84 | 653 | 1,138 | 57% | 162 M | 18 |
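The capacity and util columns can be approximately reproduced from bits/tok and tok/sec if capacity is taken to be log2(vocab) × tok/sec. The sketch below does that for the v2 rows. The vocabulary sizes are assumptions inferred from the capacity column (GPT-2's 50,257 BPE tokens, 2,048 entries per Mimi codebook, 4,096 per SNAC level, with flattened vocabularies formed by giving each codebook its own index range); they are not stated in this README.

```python
import math


def capacity_bps(vocab_size: int, tok_per_sec: float) -> float:
    """Upper bound on bits/sec: a uniform distribution over the vocabulary."""
    return math.log2(vocab_size) * tok_per_sec


def utilization(bits_per_tok: float, vocab_size: int) -> float:
    """Fraction of per-token capacity the model's cross-entropy occupies."""
    return bits_per_tok / math.log2(vocab_size)


# (name, bits/tok, tok/sec, ASSUMED vocab size)
rows = [
    ("Text BPE (GPT-2)", 7.55, 4.02, 50257),
    ("Mimi semantic", 4.93, 12.5, 2048),
    ("Mimi all-flat", 6.45, 100.0, 8 * 2048),
    ("SNAC Orpheus-flat", 7.80, 84.0, 3 * 4096),
]

for name, bpt, tps, vocab in rows:
    print(
        f"{name:20s} bits/sec={bpt * tps:7.1f} "
        f"capacity={capacity_bps(vocab, tps):7.1f} "
        f"util={utilization(bpt, vocab):.1%}"
    )
```

Under these assumptions the values land within rounding of the table. The SNAC row comes out a few bps off, presumably because 84 tok/sec is itself a rounded frame rate.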

trunk PFLOPs = 6 × N × tokens, with N = 18.88 M non-embedding parameters (the Kaplan et al. 2020 convention: transformer trunk only, excluding the vocab × dim embedding/output-projection matrix). We report trunk-only FLOPs because (1) they are vocab-independent, so cross-representation ratios are clean, and (2) at production scale (~1 B params, d ≈ 4 k) the output projection shrinks to ~2% of the trunk, so the trunk number is the scale-relevant quantity and the per-vocab output-projection overhead is a small-model artifact. We deliberately omit wall-clock time here: GPU utilization is uniformly low (~6–11% of H100 peak) and batch size was not uniform across runs (bs=32 for vocab > 16 k, else bs=64), so wall-clock ratios are noisy. Full discussion in writeup.md §0.
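For reference, the trunk PFLOPs column follows directly from the 6 × N × tokens rule. A minimal sketch, using the rounded token counts from the tables (the function and dictionary names are illustrative only):

```python
N_NON_EMB = 18.88e6  # non-embedding parameters, as stated above


def trunk_pflops(train_tokens: float, n_params: float = N_NON_EMB) -> float:
    """Training compute for the transformer trunk: 6 * N * tokens, in PFLOPs."""
    return 6 * n_params * train_tokens / 1e15


# Rounded train-token counts copied from the v1 and v2 tables.
train_tokens = {
    "v1 Text BPE": 257e6,
    "v1 NeuCodec": 3591e6,
    "v2 Text BPE": 7.8e6,
    "v2 Mimi semantic": 24e6,
    "v2 Mimi all-flat": 194e6,
    "v2 SNAC Orpheus-flat": 162e6,
}

for name, tokens in train_tokens.items():
    print(f"{name:22s} {trunk_pflops(tokens):8.1f} PFLOPs")
```

Because N is fixed across runs, the PFLOPs ratio between any two representations is just the ratio of their token counts.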

Three regimes, not a smooth curve:

linguistic floor (text BPE) ≈ 22 bps (Emilia, well-trained)
content floor with semantic codec (Mimi cb 0) ≈ 62 bps → the "non-linguistic content surcharge" is only ~40 bps
codec ceiling under flat LM (RVQ codecs) ≈ 650 bps → mostly codec reconstruction overhead, not content
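The three regimes can be read as a rough additive decomposition of the codec ceiling. The sketch below works through that arithmetic with the rounded headline figures. Note that the linguistic floor comes from Emilia-YODAS while the other two numbers come from LibriTTS-R, so this is back-of-envelope accounting rather than a same-data measurement.

```python
# Rounded headline figures from the tables above, in bits/sec.
linguistic_floor = 22  # text BPE on Emilia-YODAS
content_floor = 62     # Mimi cb 0 on LibriTTS-R
codec_ceiling = 650    # approx. Mimi all-flat / SNAC on LibriTTS-R

surcharge = content_floor - linguistic_floor    # non-linguistic content
codec_overhead = codec_ceiling - content_floor  # reconstruction detail

parts = {
    "linguistic": linguistic_floor,
    "non-linguistic content": surcharge,
    "codec overhead": codec_overhead,
}

for name, bps in parts.items():
    print(f"{name:24s} {bps:4d} bps ({bps / codec_ceiling:.1%} of ceiling)")
```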

Mimi-all and SNAC converge on the same bits/sec (645 / 653) despite different fps and vocab, suggesting a robust "raw acoustic RVQ codec ceiling" for English speech under a flat LM. NeuCodec at 508 bps sits at a different point on the same Pareto frontier — fewer / denser tokens, more aggressive lossy quantization.

📈 Full narrative: notes/writeup.md · methodology rationale: notes/methodology.md · v0 archive (LibriTTS-R + Mimi/SNAC coarse, overfit iso-token): notes/v0/.

🤗 Public datasets: Trelis/libritts-tokens-codecs — LibriTTS-R tokenized with GPT-2 BPE, Mimi (semantic + all-flat), and SNAC, as per-split parquet.

Compute / setup

Compute: single H100 per run on Modal, bf16, plain PyTorch.
The Modal environment name is parameterized via `$MODAL_ENV`. Create one with `modal environment create <name>`, then `export MODAL_ENV=<name>`.
W&B logging (optional but recommended): create a Modal Secret called wandb-secret containing WANDB_API_KEY. The training functions read it automatically. Without it, runs work but skip W&B.
Hugging Face models and datasets are downloaded via hf_transfer. No HF token is needed for the public datasets used (neuphonic/emilia-yodas-english-neucodec, parler-tts/libritts_r_filtered).

Layout

| File | Purpose |
|---|---|
| `modal_app.py` | Modal functions: v1 (Emilia download + build_bins + train) and v2 (LibriTTS download + 3 codec encoders + train) |
| `model.py` | TinyGPT (6 L / 512 d / 8 h, RMSNorm, RoPE, weight-tied) |
| `train_loop.py` | bf16 training; WSD schedule (warmup → constant → cosine decay); `max_epochs` cap; W&B integration |
| `results.py` | bits/tok, bits/sec, BPB, info/storage, capacity, util, FLOPs/bit |
| `plots.py` | headline, bits/sec, BPB, FLOPs/bit, loss-curves, decomposition |
| `sanity.py` | frame-rate, ballpark, token-diversity checks |
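The TinyGPT shape listed for `model.py` is enough to sanity-check the 18.88 M non-embedding parameter count quoted in this README. The sketch below assumes a standard pre-norm block with bias-free linear layers, a 4× two-matrix MLP, two RMSNorm gains per block plus a final norm, and parameter-free RoPE. These are assumptions about `model.py`, not facts read from it.

```python
def trunk_params(n_layers: int = 6, d_model: int = 512, mlp_ratio: int = 4) -> int:
    """Non-embedding parameter count under the assumptions stated above."""
    attn = 4 * d_model * d_model             # Wq, Wk, Wv, Wo (no biases)
    mlp = 2 * mlp_ratio * d_model * d_model  # up- and down-projection
    norms = 2 * d_model                      # two RMSNorm gain vectors per block
    final_norm = d_model
    return n_layers * (attn + mlp + norms) + final_norm


def embedding_params(vocab_size: int, d_model: int = 512) -> int:
    """Weight-tied embedding / output projection: a single vocab x dim matrix."""
    return vocab_size * d_model


if __name__ == "__main__":
    print(f"trunk:     {trunk_params() / 1e6:.2f} M")
    # GPT-2 BPE vocabulary of 50,257 tokens.
    print(f"embedding: {embedding_params(50257) / 1e6:.2f} M")
```

At this scale the tied embedding for a 50 k vocabulary is larger than the whole trunk, which illustrates why the README reports trunk-only FLOPs.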

v1 pipeline (text BPE vs NeuCodec on Emilia-YODAS)

modal run --env=$MODAL_ENV modal_app.py::download_emilia
modal run --env=$MODAL_ENV modal_app.py::build_bins
modal run --env=$MODAL_ENV modal_app.py::train --repr-name text     --max-epochs 1.0
modal run --env=$MODAL_ENV modal_app.py::train --repr-name neucodec --max-epochs 1.0

# Pull bins + runs locally for analysis
modal volume get audio-bits-data /bins ./local_data/ --env=$MODAL_ENV
modal volume get audio-bits-data /runs ./local_data/ --env=$MODAL_ENV
export AUDIO_BITS_DATA=./local_data   # export so all three scripts see it, not only the first
python results.py && python plots.py && python sanity.py

v2 pipeline (5-rep LibriTTS-R study)

modal run --env=$MODAL_ENV modal_app.py::v2_download_libritts
modal run --env=$MODAL_ENV modal_app.py::v2_tokenize_text
modal run --env=$MODAL_ENV modal_app.py::v2_encode_mimi      # produces mimi_sem + mimi_all
modal run --env=$MODAL_ENV modal_app.py::v2_encode_snac      # produces snac_orpheus
modal run --env=$MODAL_ENV modal_app.py::v2_encode_neucodec  # warning: slow; consider --only-clean-100 first
for r in text mimi_sem mimi_all snac_orpheus neucodec; do
  modal run --env=$MODAL_ENV modal_app.py::v2_train --repr-name $r --max-epochs 1.0
done
modal volume get audio-bits-data /v2_bins ./local_data/ --env=$MODAL_ENV
modal volume get audio-bits-data /v2_runs ./local_data/ --env=$MODAL_ENV
export AUDIO_BITS_DATA=./local_data   # export so all three scripts see it, not only the first
python results.py && python plots.py && python sanity.py

Modal volumes

The pipeline creates two Modal volumes if they don't exist (free tier covers this):

audio-bits-data — datasets, bins, training runs
audio-bits-hf-cache — HuggingFace model + dataset cache (shared across runs)

Reproducibility notes

All training is single-epoch, iso-data. Random seeds default to 0.
The 4-process Modal limit during v2 is a personal usage rule; the pipeline itself doesn't enforce it.
parler-tts/libritts_r_filtered is a filtered version of LibriTTS-R that drops low-quality utterances — the actual hours are ~538 h, not the nominal 960 h.
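One quick way to cross-check the ~538 h figure against the v2 table is to multiply hours by each representation's token rate and compare with the reported train-token counts. This is a rough sketch in the spirit of a frame-rate check, not the code in `sanity.py`; the helper name is hypothetical.

```python
def expected_tokens(hours: float, tok_per_sec: float) -> float:
    """Tokens produced by `hours` of audio at a given token rate."""
    return hours * 3600 * tok_per_sec


HOURS = 538  # filtered LibriTTS-R, as noted above

# (name, tok/sec, reported train tokens in millions) from the v2 table.
rows = [
    ("Text BPE", 4.02, 7.8),
    ("Mimi semantic", 12.5, 24),
    ("Mimi all-flat", 100.0, 194),
    ("SNAC Orpheus-flat", 84.0, 162),
]

for name, tps, reported_m in rows:
    est_m = expected_tokens(HOURS, tps) / 1e6
    print(f"{name:20s} expected {est_m:7.1f} M, reported {reported_m} M")
```

With 960 h instead of 538 h the expected counts would be nearly 80% higher than the table, so the token counts are consistent with the filtered dataset size.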

Architecture (all runs)

TinyGPT: 6 layers, 512 dim, 8 heads, RMSNorm, RoPE, weight-tied embeddings. 18.9 M non-embedding params. AdamW (lr 3e-4 → 3e-5, β 0.9/0.95, wd 0.1), bf16, batch 32 (vocab > 16 k) or 64 (vocab ≤ 16 k), seq 1024. WSD schedule in v2 (warmup 5% → constant 85% → cosine decay 10%); v1 used pure cosine.
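The v2 WSD schedule described above can be sketched as a plain function of the step index. This is a minimal illustration, not the code in `train_loop.py`: it assumes linear warmup, treats 3e-5 as the floor reached at the end of the cosine decay, and measures the 5% / 85% / 10% phases in optimizer steps.

```python
import math


def wsd_lr(
    step: int,
    total_steps: int,
    peak: float = 3e-4,
    floor: float = 3e-5,
    warmup_frac: float = 0.05,
    decay_frac: float = 0.10,
) -> float:
    """Warmup -> constant -> cosine-decay learning rate at a given step."""
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear warmup (assumed), reaching peak on the last warmup step.
        return peak * (step + 1) / warmup_steps
    if step < decay_start:
        # Constant phase at the peak learning rate.
        return peak
    # Cosine decay from peak down to floor over the final decay_steps.
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000
    for s in (0, 49, 500, 900, 950, 1000):
        print(f"step {s:4d}: lr = {wsd_lr(s, total):.2e}")
```

A function of this shape can be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning `wsd_lr(step, total) / peak` as the multiplier.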

License

Code is MIT. Datasets retain their own licenses (Emilia-YODAS: CC-BY-NC, LibriTTS-R: CC-BY-4.0). Codec model weights (Mimi, SNAC, NeuCodec) follow their respective upstream licenses.
