[Non-Record] QAT + NTK-4096 Eval + Cosine Warmdown + Aggressive SWA #326
Open. crony-io wants to merge 1 commit into openai:main from
## QAT + NTK-4096 Eval + Cosine Warmdown + Aggressive SWA
Status: Incomplete. RunPod terminated the pod during evaluation on all 8xH100 attempts. The best run completed training (6606 steps, 600s) but produced no final roundtrip val_bpb.
Pre-quant val_bpb: 1.1702 (step 6606) | 1xH100 roundtrip val_bpb: 1.2890 (872 steps)
## Approach & Changes from Baseline
This submission modifies the baseline `train_gpt.py`, integrating proven architectural optimizations from the community alongside my own quantization and training strategies.

### 1. Architecture Updates
### 2. Training & Optimization
Switched the learning-rate warmdown to a cosine schedule (`0.5 * (1 + cos(πt))`) to sustain higher learning rates for longer.
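A minimal sketch of such a cosine warmdown (names are illustrative, not the submission's actual code; the linear warmup phase and its length are my assumptions, not stated in the PR):

```python
import math

def lr_scale(step: int, total_steps: int, warmup_steps: int = 256) -> float:
    """Linear warmup, then cosine warmdown 0.5 * (1 + cos(pi * t))."""
    if step < warmup_steps:
        return step / warmup_steps          # linear ramp 0 -> 1
    # t goes 0 -> 1 over the warmdown phase
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * t))
```

Compared with a linear warmdown, the cosine curve stays close to the peak rate for much of the schedule before decaying, which matches the "sustain higher learning rates longer" goal.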
### 3. Quantization-Aware Training (QAT)
Rather than relying on post-hoc `int8` quantization, I implemented QAT using a Straight-Through Estimator (STE). `CastedLinear` layers fake-quantize weights during the forward pass (int5 for MLPs, int6 for attention). This forces the model to learn robustness to quantization noise during training, minimizing the final compression penalty.
### 4. Evaluation & Compression
Replaced `zlib` with `lzma` (`PRESET_EXTREME`). Applied 5% magnitude pruning and packed weights using the mixed int5/int6 QAT scheme, fitting the artifact well under 16 MB.
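The export path might look roughly like this (a sketch with assumed names; for brevity it compresses float16 values rather than the submission's actual packed int5/int6 codes):

```python
import lzma
import numpy as np

def compress_weights(w: np.ndarray, prune_frac: float = 0.05) -> bytes:
    """Magnitude-prune the smallest weights, then LZMA-compress at PRESET_EXTREME."""
    flat = w.astype(np.float16).ravel()
    cutoff = np.quantile(np.abs(flat), prune_frac)
    # 5% magnitude pruning: zero out the smallest weights by absolute value,
    # which makes the byte stream far more compressible.
    flat = np.where(np.abs(flat) < cutoff, np.float16(0), flat)
    return lzma.compress(flat.tobytes(), preset=9 | lzma.PRESET_EXTREME)
```

`lzma` generally achieves a noticeably better ratio than `zlib` on weight tensors at the cost of slower compression, which matters little for a one-shot artifact export.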
### Feature Comparison

`train_gpt.py`

### Run Attempts