forked from huggingface/transformers
moving pre_ln from layer 0 to before the blocks #2
Open: peter-sk wants to merge 6,550 commits into BBuf:main from schneiderkamplab:main
Conversation
What does this PR do?
I have been reviewing your model implementation. The special case that applies pre_ln only inside layer 0 should be avoided; layer-index special cases are not common practice in the transformers library. This PR moves pre_ln out of layer 0 so that it is applied once, before the decoder blocks.
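A minimal sketch of the suggested structure, assuming a standard PyTorch module layout (the `Block` and `PreLNModel` names below are hypothetical, not taken from the actual implementation): the pre-layernorm is owned by the parent model and applied once before the loop over blocks, so every layer stays identical.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """One transformer block; no layer-index special cases."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.mlp = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.mlp(self.norm(hidden_states))


class PreLNModel(nn.Module):
    def __init__(self, hidden_size: int, num_layers: int):
        super().__init__()
        # Instead of `if layer_idx == 0: apply pre_ln` inside the first
        # block, the norm lives on the model and runs exactly once.
        self.pre_ln = nn.LayerNorm(hidden_size)
        self.layers = nn.ModuleList([Block(hidden_size) for _ in range(num_layers)])

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.pre_ln(hidden_states)  # applied once, before the blocks
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states
```

Keeping all layers identical matches how transformers models usually hoist one-off operations (such as a final `model.norm`) to the parent module, and it avoids branching on the layer index inside the decoder layer itself.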
Fixes # (issue)
Before submitting
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.