TrelisResearch/audio-bits

audio-bits

Measures the bits/sec of information that a matched TinyGPT (~19 M non-embedding params) extracts from different tokenizations of the same speech, under iso-data, single-epoch training. Reports bits/token, bits/sec, bits/byte (BPB), channel capacity, utilization, training FLOPs, and inference FLOPs per bit/sec.

Headline result (same English speech, matched model)

Two iterations, two datasets. Text BPE is in both as a same-data linguistic floor.

v1 — Emilia-YODAS (20,000 h, pre-tokenized)

| Repr | bits/tok | tok/sec | bits/sec | capacity (bps) | util | train tokens | trunk PFLOPs |
|---|---|---|---|---|---|---|---|
| Text BPE (GPT-2) | 6.14 | 3.58 | 22 | 56 | 39% | 257 M | 29 |
| NeuCodec (FSQ, 50 fps) | 10.17 | 50.0 | 508 | 800 | 64% | 3,591 M | 407 |

v2 — LibriTTS-R (538 h, encoded ourselves)

| Repr | bits/tok | tok/sec | bits/sec | capacity (bps) | util | train tokens | trunk PFLOPs |
|---|---|---|---|---|---|---|---|
| Text BPE (GPT-2) (under-trained; see writeup §6) | 7.55 | 4.02 | 30 | 63 | 48% | 7.8 M | 0.9 |
| Mimi semantic (cb 0, 12.5 fps) | 4.93 | 12.5 | 62 | 138 | 45% | 24 M | 2.7 |
| Mimi all-flat (8 cb, 100 fps) | 6.45 | 100 | 645 | 1,400 | 46% | 194 M | 22 |
| SNAC Orpheus-flat (3 lvl, 84 fps) | 7.80 | 84 | 653 | 1,138 | 57% | 162 M | 18 |
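
The derived columns in both tables appear to follow from bits/tok, tok/sec, and vocabulary size. The sketch below is a minimal reconstruction, not the code in results.py: it assumes bits/sec = bits/tok × tok/sec, capacity = log2(vocab) × tok/sec, and util = bits/sec ÷ capacity. The vocabulary sizes are assumptions inferred from the capacity column (50,257 for GPT-2 BPE, 65,536 for NeuCodec, 2,048 per Mimi codebook, 8 × 2,048 for the flattened Mimi stream), and the function and dictionary names are hypothetical.

```python
import math

def speech_metrics(bits_per_tok, tok_per_sec, vocab_size):
    """Return (bits/sec, capacity in bps, utilization) for one representation."""
    bits_per_sec = bits_per_tok * tok_per_sec
    # Capacity: the rate if every token carried log2(vocab) bits (uniform tokens).
    capacity_bps = math.log2(vocab_size) * tok_per_sec
    return bits_per_sec, capacity_bps, bits_per_sec / capacity_bps

# (bits/tok, tok/sec) are copied from the tables; vocab sizes are ASSUMED.
ROWS = {
    "text_bpe_v1": (6.14, 3.58, 50257),
    "neucodec_v1": (10.17, 50.0, 65536),
    "mimi_sem_v2": (4.93, 12.5, 2048),
    "mimi_all_v2": (6.45, 100.0, 8 * 2048),
}

if __name__ == "__main__":
    for name, (bpt, tps, vocab) in ROWS.items():
        bps, cap, util = speech_metrics(bpt, tps, vocab)
        print(f"{name:12s} bits/sec={bps:7.1f} capacity={cap:7.1f} util={util:.0%}")
```

Under these assumptions the four rows land within rounding of the table values. SNAC is left out because its listed 84 fps is itself rounded, so the product would not match exactly.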

trunk PFLOPs = 6 × N × tokens, with N = 18.88 M non-embedding params (Kaplan-2020 convention: transformer trunk only, excluding the vocab × dim embedding/output-projection matrix). We report trunk-only FLOPs because (1) they are vocab-independent, so cross-rep ratios are clean, and (2) at production scale (~1 B params, d ≈ 4 k) the output projection shrinks to ~2% of trunk, so the trunk number is the scale-relevant quantity and the per-vocab output-projection overhead is a small-model artifact.

We deliberately omit wall-clock time here: GPU utilization is uniformly low (~6–11 % of H100 peak) and batch size wasn't uniform across runs (bs=32 for vocab > 16 k, else bs=64), so wall-clock ratios are noisy. Full discussion in writeup.md §0.
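
As a quick check of the 6 × N × tokens rule, the sketch below recomputes the trunk PFLOPs column from the train-token counts in the tables. The function name is hypothetical; only the formula and N = 18.88 M come from the text above.

```python
N_NON_EMB = 18.88e6  # non-embedding (trunk) parameters, as stated above

def trunk_pflops(train_tokens, n_params=N_NON_EMB):
    """Training compute in PFLOPs under the 6 * N * tokens approximation."""
    return 6 * n_params * train_tokens / 1e15

if __name__ == "__main__":
    # Token counts copied from the v1 and v2 tables.
    for name, tokens in [("text v1", 257e6), ("neucodec v1", 3591e6),
                         ("text v2", 7.8e6), ("mimi_all v2", 194e6)]:
        print(f"{name:12s} {trunk_pflops(tokens):7.1f} PFLOPs")
```

Because N is fixed across representations, the PFLOPs ratio between two rows equals their token-count ratio.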

Three regimes, not a smooth curve:

  • linguistic floor (text BPE) ≈ 22 bps (Emilia, well-trained)
  • content floor with semantic codec (Mimi cb 0) ≈ 62 bps → the "non-linguistic content surcharge" is only ~40 bps
  • codec ceiling under flat LM (RVQ codecs) ≈ 650 bps → mostly codec reconstruction overhead, not content

Mimi-all and SNAC converge on the same bits/sec (645 / 653) despite different fps and vocab, suggesting a robust "raw acoustic RVQ codec ceiling" for English speech under a flat LM. NeuCodec at 508 bps sits at a different point on the same Pareto frontier — fewer / denser tokens, more aggressive lossy quantization.

📈 Full narrative: notes/writeup.md · methodology rationale: notes/methodology.md · v0 archive (LibriTTS-R + Mimi/SNAC coarse, overfit iso-token): notes/v0/.

🤗 Public datasets: Trelis/libritts-tokens-codecs — LibriTTS-R tokenized with GPT-2 BPE, Mimi (semantic + all-flat), and SNAC, as per-split parquet.

Compute / setup

  • Compute: single H100 per run on Modal, bf16, plain PyTorch.
  • Modal env name is parameterized via $MODAL_ENV. Create one with modal environment create <name> and export MODAL_ENV=<name>.
  • W&B logging (optional but recommended): create a Modal Secret called wandb-secret containing WANDB_API_KEY. The training functions read it automatically. Without it, runs work but skip W&B.
  • Hugging Face datasets and model weights are downloaded via hf_transfer. No HF token is needed for the public datasets used (neuphonic/emilia-yodas-english-neucodec, parler-tts/libritts_r_filtered).

Layout

| File | Purpose |
|---|---|
| modal_app.py | Modal functions: v1 (Emilia download + build_bins + train) and v2 (LibriTTS download + 3 codec encoders + train) |
| model.py | TinyGPT (6 L / 512 d / 8 h, RMSNorm, RoPE, weight-tied) |
| train_loop.py | bf16 training; WSD schedule (warmup → constant → cosine decay); max_epochs cap; W&B integration |
| results.py | bits/tok, bits/sec, BPB, info/storage, capacity, util, FLOPs/bit |
| plots.py | headline, bits/sec, BPB, FLOPs/bit, loss-curves, decomposition |
| sanity.py | frame-rate, ballpark, token-diversity checks |

v1 pipeline (text BPE vs NeuCodec on Emilia-YODAS)

modal run --env=$MODAL_ENV modal_app.py::download_emilia
modal run --env=$MODAL_ENV modal_app.py::build_bins
modal run --env=$MODAL_ENV modal_app.py::train --repr-name text     --max-epochs 1.0
modal run --env=$MODAL_ENV modal_app.py::train --repr-name neucodec --max-epochs 1.0

# Pull bins + runs locally for analysis
modal volume get audio-bits-data /bins ./local_data/ --env=$MODAL_ENV
modal volume get audio-bits-data /runs ./local_data/ --env=$MODAL_ENV
# export so that all three scripts see the variable, not just the first
export AUDIO_BITS_DATA=./local_data
python results.py && python plots.py && python sanity.py

v2 pipeline (5-rep LibriTTS-R study)

modal run --env=$MODAL_ENV modal_app.py::v2_download_libritts
modal run --env=$MODAL_ENV modal_app.py::v2_tokenize_text
modal run --env=$MODAL_ENV modal_app.py::v2_encode_mimi      # produces mimi_sem + mimi_all
modal run --env=$MODAL_ENV modal_app.py::v2_encode_snac      # produces snac_orpheus
modal run --env=$MODAL_ENV modal_app.py::v2_encode_neucodec  # warning: slow; consider --only-clean-100 first
for r in text mimi_sem mimi_all snac_orpheus neucodec; do
  modal run --env=$MODAL_ENV modal_app.py::v2_train --repr-name $r --max-epochs 1.0
done
modal volume get audio-bits-data /v2_bins ./local_data/ --env=$MODAL_ENV
modal volume get audio-bits-data /v2_runs ./local_data/ --env=$MODAL_ENV
# export so that all three scripts see the variable, not just the first
export AUDIO_BITS_DATA=./local_data
python results.py && python plots.py && python sanity.py

Modal volumes

The pipeline creates two Modal volumes if they don't exist (free tier covers this):

  • audio-bits-data — datasets, bins, training runs
  • audio-bits-hf-cache — HuggingFace model + dataset cache (shared across runs)

Reproducibility notes

  • All training is single-epoch, iso-data. Random seeds default to 0.
  • We capped v2 at 4 concurrent Modal processes as a personal usage rule; the pipeline itself doesn't enforce any limit.
  • parler-tts/libritts_r_filtered is a filtered version of LibriTTS-R that drops low-quality utterances, so the actual duration is ~538 h, not the ~585 h of the full LibriTTS-R release.
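
The ~538 h figure can be cross-checked against the v2 train-token counts, since tokens ≈ hours × 3600 × tok/sec. The sketch below is illustrative only: the helper name is hypothetical, and it assumes every second of audio ends up in the training split.

```python
def expected_tokens(hours, tok_per_sec):
    """Approximate token count for a corpus of the given duration."""
    return hours * 3600 * tok_per_sec

if __name__ == "__main__":
    hours = 538  # filtered LibriTTS-R duration stated above
    for name, tps in [("text", 4.02), ("mimi_sem", 12.5), ("mimi_all", 100.0)]:
        print(f"{name:9s} ~{expected_tokens(hours, tps) / 1e6:6.1f} M tokens")
```

These come out close to the 7.8 M, 24 M, and 194 M reported in the v2 table.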

Architecture (all runs)

TinyGPT: 6 layers, 512 dim, 8 heads, RMSNorm, RoPE, weight-tied embeddings. 18.9 M non-embedding params. AdamW (lr 3e-4 → 3e-5, β 0.9/0.95, wd 0.1), bf16, batch 32 (vocab > 16 k) or 64 (vocab ≤ 16 k), seq 1024. WSD schedule in v2 (warmup 5% → constant 85% → cosine decay 10%); v1 used pure cosine.
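
The sketch below illustrates two of the numbers above; it is not the code in model.py or train_loop.py. The parameter count assumes a standard block (bias-free attention with four d × d projections, a bias-free MLP with 4× expansion, two RMSNorm gains per block, and a final RMSNorm), which happens to reproduce 18.88 M. The schedule assumes linear warmup from near zero and a cosine decay from 3e-4 to 3e-5 over the last 10% of steps. All function names are hypothetical.

```python
import math

def non_embedding_params(n_layers=6, d_model=512, mlp_ratio=4):
    """Trunk parameter count under the ASSUMED block layout described above."""
    attn = 4 * d_model * d_model              # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model   # up and down projections
    norms = 2 * d_model                       # two RMSNorm gain vectors per block
    return n_layers * (attn + mlp + norms) + d_model  # plus final RMSNorm

def wsd_lr(step, total_steps, peak=3e-4, floor=3e-5,
           warmup_frac=0.05, decay_frac=0.10):
    """Warmup -> constant -> cosine-decay learning rate at a given step."""
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps   # linear warmup (assumed)
    if step < decay_start:
        return peak                               # constant phase
    progress = (step - decay_start) / max(1, decay_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

if __name__ == "__main__":
    print(f"non-embedding params: {non_embedding_params() / 1e6:.2f} M")
    for s in (0, 49, 500, 900, 950, 1000):
        print(s, f"{wsd_lr(s, 1000):.2e}")
```

With 1,000 total steps, the rate reaches the peak at step 49, holds it through step 899, and ends at the floor at step 1000.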

License

Code is MIT. Datasets retain their own licenses (Emilia-YODAS: CC-BY-NC, LibriTTS-R: CC-BY-4.0). Codec model weights (Mimi, SNAC, NeuCodec) follow their respective upstream licenses.
