feat: multi-lora training by mathewjhan · Pull Request #1141 · radixark/miles

mathewjhan · 2026-05-16T02:28:50Z

Summary

Allow training multiple LoRAs (all linear layers, excluding experts) per step in Miles using megatron-bridge (related PR)

Feature

Train multiple LoRAs in a single training step (currently only colocated until [WIP][lora] support disaggregate model lora training #988 is merged)
Use as a long running service, supporting online loading and unloading
Use as normal training, stopping when no more LoRAs are left to train

Running

see: examples/multi_lora

For normal training (not as a service):

run examples/multi_lora/provision.sh
configure W&B credentials and settings in examples/multi_lora/single_run.sh
run examples/multi_lora/single_run.sh |& tee run.log

For multi-lora training as a long running service:

run examples/multi_lora/provision.sh
configure W&B credentials and settings in examples/multi_lora/start_service.sh
run examples/multi_lora/start_service.sh |& tee run.log in one shell
run examples/multi_lora/submit_schedule.sh in a separate shell

Model checkpoints and LoRA safetensors are saved in `examples/adapters/*/checkpoints`.

Changes to existing code (backwards compatible)

Support --custom-generate-state flag to allow users to define their own GenerateState
Add GenerateState hooks to allow custom generate states to access lifecycle of rollout, used for rollout request tracking
Add AdapterRef and RewardSpec to Sample type so individual samples can access their own reward functions and adapter names during rollout

Notes

Optimizer and scheduler state are not checkpointed yet; this can be added in a future PR
Per-adapter dataset checkpoint loading doesn't fully work yet, since the data source API doesn't support per-adapter loading
MultiLoRA is not applied to experts for now because of the more complex bookkeeping required (the [adapter index, routed experts] pair must be tracked together)
The sglang LoRA csgmv kernel has some problems with CUDA graphs, so we default to triton for now
Loading from *.bin/*.safetensors isn't supported yet and can be added in a future PR; for now, only resuming from a megatron checkpoint or training from scratch is supported

Tests

e2e Qwen3-4B test using 2 LoRA adapters trained on gsm8k and dapo_math
MultiLoRAController tests
AdapterConfig tests

[feat] add adapter args and adapter config [fix] clean up config and unused logic [feat] add multiloracontroller [feat] add multilora state to actor and model [feat] add adapter lock [feat] support setting the controller [feat] add multi lora data source [feat] improve training + data [feat] add sglang config settings [feat] update weight sync logic [fix] support input label keys [feat] deregister run after completion [fix] deregister the adapters [misc] add example [fix] hide adapters on loading checkpoint for multilora [misc] temp example [fix] clear cached params [debug] logging [fix] typo debug [fix] use lora [fix] colocated engine [fix] [fix] simplify update [fix] [fix] [fix] [fix] skip list [fix] override lora adapter name [fix] keep track of previously loaded loras [fix] [debug] [fix] [fix] use lora configs to sync [fix] revert [fix] support mixed adapter ranks [fix] sync adapter alpha as well [feat] support individual reward fn [feat] per-adapter metrics [fix] adapter name prefix [fix] optimizer state refresh [fix] examples use dapo and gsm8k [fix] clean up [fix] name the metric raw_reward instead of reward [fix] correct split sample tokens [fix] possible fix? [fix]

[fix] update to support new miles lora changes [fix] assert all gather cp [debug] [fix] sync base weights [fix] revert [fix] [fix] [debug] [misc] add provision script [fix] full rank dapo [fix] [fix] keep this [debug] [temp] [fix] remove [debug] [fix] [fix] [fix] [fix] [fix] [fix] [fix] [test] [fix] [fix] [fix] [fix] [fix] [fix] [fix] [test] [fix] [fix] [fix] [fix] [fix] [fix] fix [test] [cleanup] remove extraneous loggging [cleanup] [fix] use a global controller actor and avoid setting controller into args [misc] clean up log utils [refactor] multilora controller [fix] [refactor] remove excess dataclasses [misc] update example script [fix] multilora checkpointing [fix] clone the tensor [fix] checkpointing saving [fix] checkpoint saving [fix] logging [refactor] naming registeration for adapters [feat] update lifecycle for register/deregister [feat] service multilora [fix] [fix] [fix] [fix] [fix] [fix] [fox] [fix] [test] [fox] [fix] [fix] [fix] [fix] [fix] [fix] [fix] naming [feat] new dataflow [fix] remove rollout id arg [fix] [fix] return a value [fix] [fix] [fix] [fix] [fix] use .items() [fix] [fix] [fix] [fix] [fix] [fix] testing [fix] [fix] [misc] shorten the cycles [fix] async lifecycle hooks + fix states [fix] [fix] [fix] [fix] metadata key [fix] [fix] [fix] [fix] [fix] [fix] [fix] remove simplemultiloralinear [fix] use num_row [fix] use pop [chore] clean up ai comments [fix] skip using contiguous [fix] [fix] [fix] [fix] [fix] step_counts -> train_steps [fix] checkpointing [fix] logging bug [refactor] part 1: move all to multi_lora file [fix] dataset saving and round robin fix [fix] imports [refactor] remove multi_lora sync and rename to multi_lora_utils [misc] update comments [misc] copy over train_multi_lora [misc] update comments

[fix] [fix] correctly step checkpoint step [feat] support service mode + one time mode [misc] clean up scripts [misc] add submit_schedule [fix] share namespaces [fix] wait for ray to be up [fix] logging [fix] use exception instead of connection error [fix] print idle [fix] update print [fix] wait in submit [fix] schedule [fix] name clash [refactor] controller functionality [feat] support cli multilora [fix] [fix] add mkdir to directory [misc] comments [example] single run example [fix] typo in script [fix] remove print [fix] remove dead code

gemini-code-assist

Code Review

This pull request introduces Multi-LoRA training, enabling the concurrent training of multiple LoRA adapters against a shared base model with slot-based hot swapping. Key components include a central MultiLoRAController for lifecycle management, a MultiLoRADataSource for interleaved sampling, and backend updates to Megatron and SGLang for weight synchronization. Review feedback identifies several high-severity issues: a bug in the adapter registration sequence that prevents successful checkpoint resumption, a missing import causing a NameError during weight updates, and logic errors in step tracking for inactive adapters. Additionally, improvements were suggested for the efficiency of the singleton meta-class, the robustness of checkpoint path parsing, and the implementation of port-based liveness checks for Ray cluster initialization.

gemini-code-assist · 2026-05-16T02:34:53Z

+    if ckpt is None:
+        logger.info(f"{log_prefix} no checkpoint under {ckpt_root}, starting from random init")
+    else:
+        state_dict = torch.load(ckpt, map_location="cpu", weights_only=True)
+        loaded = load_adapter(model, config.slot, state_dict)
+        assert loaded > 0, (
+            f"{log_prefix} loaded 0 tensors from {ckpt} "
+            f"(state_dict has {len(state_dict)} entries) — name mismatch?"
+        )
+        logger.info(f"{log_prefix} loaded from {ckpt} ({loaded} tensors)")
+
+    init_adapter_slot(model, config.slot, rank=config.rank, alpha=config.alpha)


load_adapter is called before init_adapter_slot. In the megatron-bridge implementation, init_adapter_slot is responsible for allocating the parameters for the specific LoRA slot. If it is called after load_adapter, the loading process will fail to find the target parameters, resulting in zero tensors being loaded and triggering the assertion failure on line 372. This effectively breaks checkpoint resumption for multi-LoRA.

Suggested change

-    if ckpt is None:
-        logger.info(f"{log_prefix} no checkpoint under {ckpt_root}, starting from random init")
-    else:
-        state_dict = torch.load(ckpt, map_location="cpu", weights_only=True)
-        loaded = load_adapter(model, config.slot, state_dict)
-        assert loaded > 0, (
-            f"{log_prefix} loaded 0 tensors from {ckpt} "
-            f"(state_dict has {len(state_dict)} entries) — name mismatch?"
-        )
-        logger.info(f"{log_prefix} loaded from {ckpt} ({loaded} tensors)")
-
-    init_adapter_slot(model, config.slot, rank=config.rank, alpha=config.alpha)
+    init_adapter_slot(model, config.slot, rank=config.rank, alpha=config.alpha)
+
+    if ckpt is None:
+        logger.info(f"{log_prefix} no checkpoint under {ckpt_root}, starting from random init")
+    else:
+        state_dict = torch.load(ckpt, map_location="cpu", weights_only=True)
+        loaded = load_adapter(model, config.slot, state_dict)
+        assert loaded > 0, (
+            f"{log_prefix} loaded 0 tensors from {ckpt} "
+            f"(state_dict has {len(state_dict)} entries) — name mismatch?"
+        )
+        logger.info(f"{log_prefix} loaded from {ckpt} ({loaded} tensors)")
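The ordering issue described above can be illustrated with a toy model. This is a minimal sketch, not the megatron-bridge implementation: `ToyModel`, `toy_init_slot`, and `toy_load_adapter` are hypothetical stand-ins, and the only assumption carried over from the review comment is that loading matches checkpoint entries against parameters that must already be allocated for the slot.

```python
class ToyModel:
    """Hypothetical stand-in for a model holding per-slot LoRA parameters."""

    def __init__(self):
        self.slots = {}  # slot index -> {parameter name: value}


def toy_init_slot(model, slot, param_names):
    # Allocate zero-initialised parameters for the slot if it does not exist yet.
    model.slots.setdefault(slot, {name: 0.0 for name in param_names})


def toy_load_adapter(model, slot, state_dict):
    # Copy only entries whose names match allocated parameters; return the count.
    params = model.slots.get(slot, {})
    loaded = 0
    for name, value in state_dict.items():
        if name in params:
            params[name] = value
            loaded += 1
    return loaded


state_dict = {"lora_A": 1.0, "lora_B": 2.0}
names = ["lora_A", "lora_B"]

# Load before init: nothing is allocated yet, so no entry matches.
early = ToyModel()
loaded_before_init = toy_load_adapter(early, 0, state_dict)
toy_init_slot(early, 0, names)

# Init before load: both entries find their target parameters.
late = ToyModel()
toy_init_slot(late, 0, names)
loaded_after_init = toy_load_adapter(late, 0, state_dict)

print(loaded_before_init, loaded_after_init)
```

Under these assumptions the first ordering loads zero tensors, which is the condition the `assert loaded > 0` guard in the hunk would trip on.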

gemini-code-assist · 2026-05-16T02:34:53Z

+    def update_multi_lora_weights(self, adapter_configs: dict[str, dict], active_slots: set[int] | None = None) -> None:
+        """Sync multiple LoRA adapters. Pause/resume once, loop export+send per adapter."""
+        from megatron.bridge.peft.multi_lora_layers import expose_adapter_slot


slice_lora_to_rank is used on line 302 but is not imported within this function or at the top level of the file. This will cause a NameError when update_multi_lora_weights is invoked.

Suggested change

     def update_multi_lora_weights(self, adapter_configs: dict[str, dict], active_slots: set[int] | None = None) -> None:
         """Sync multiple LoRA adapters. Pause/resume once, loop export+send per adapter."""
         from megatron.bridge.peft.multi_lora_layers import expose_adapter_slot
+        from miles.backends.megatron_utils.multi_lora_utils import slice_lora_to_rank

gemini-code-assist · 2026-05-16T02:34:53Z

+        for name in self.train_steps.keys():
+            self.train_steps[name] += 1


Incrementing train_steps for all registered adapters in every training step is incorrect. In multi-LoRA training, an adapter might be skipped in a particular iteration (e.g., due to round-robin distribution or an empty data source). Since MultiLoRADataSource uses these steps to calculate data offsets for resumption and skipping, incrementing steps for adapters that did not contribute samples will cause them to skip data they never actually processed.

        # Increment the step count upon training completion.
        # TODO: Only increment for adapters that were actually included in the training batch
        # to avoid incorrect data skipping in MultiLoRADataSource.
        for name in self.train_steps.keys():
            self.train_steps[name] += 1
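One way the TODO could be addressed is sketched below. It assumes the trainer can see which adapter each sample in the batch belongs to; `advance_train_steps` and `batch_adapter_names` are hypothetical names introduced for illustration and are not part of the PR.

```python
def advance_train_steps(train_steps, batch_adapter_names):
    """Increment the step count only for adapters that contributed samples.

    train_steps: dict mapping adapter name -> completed training steps.
    batch_adapter_names: iterable of adapter names, one per sample in the batch.
    """
    contributed = set(batch_adapter_names)
    for name in train_steps:
        if name in contributed:
            train_steps[name] += 1
    return train_steps


train_steps = {"gsm8k": 3, "dapo_math": 3}
# Only gsm8k samples were in this batch, so dapo_math keeps its count.
advance_train_steps(train_steps, ["gsm8k", "gsm8k"])
print(train_steps)
```

With this shape, an adapter skipped in a round keeps its step count, so a data source that derives offsets from the count would not skip samples the adapter never saw.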

gemini-code-assist · 2026-05-16T02:34:53Z

+            ray.init(address=args.ray_address)
+            break
+        except Exception:
+            print("Waiting for Ray cluster to start...")


When waiting for a server process like the Ray cluster to start, simply catching broad exceptions or checking process liveness is insufficient. According to repository guidelines, the check must verify that the server is actively listening for connections on its designated port (e.g., via a socket connection or HTTP request) to ensure it is truly ready.

Suggested change

-            print("Waiting for Ray cluster to start...")
+        except Exception as e:
+            # TODO: Implement port-based liveness check per repository rules
+            print(f"Waiting for Ray cluster to start... ({e})")

References

When waiting for a server process to start, verify it is listening on its port, not just process liveness.
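A port-based readiness check of the kind the reference describes might look like the following. This is a sketch using only the standard library; `wait_for_port` is a hypothetical helper, and it assumes the cluster's host and port are known separately (it does not parse `args.ray_address`).

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connection to (host, port) succeeds, else False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: the server is not ready yet.
            time.sleep(interval_s)
    return False
```

A caller could run `wait_for_port(host, port)` before `ray.init(...)` and only fall back to the retry loop if it returns False. A successful TCP connect shows only that the port is open, not that the cluster has finished initialising, so the existing retry around `ray.init` may still be worth keeping.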

gemini-code-assist · 2026-05-16T02:34:53Z

+    tp_rank = mpu.get_tensor_model_parallel_rank()
+    pp_rank = mpu.get_pipeline_model_parallel_rank()
+
+    get_step = lambda d: int(d.name.split("_")[1])


The logic for extracting the step number from directory names is fragile. It assumes that every directory starting with step_ has an integer as its second component. If a directory like step_latest or a system-generated folder exists, this will raise a ValueError or IndexError.

Suggested change

-    get_step = lambda d: int(d.name.split("_")[1])
+    get_step = lambda d: int(d.name.split("_")[1]) if len(d.name.split("_")) > 1 and d.name.split("_")[1].isdigit() else -1
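An alternative to the one-line lambda is a small regex-based helper that filters non-matching directories out instead of mapping them to -1. This is a sketch under the assumption that checkpoint directories are named exactly `step_<integer>`; `latest_step_dir` is a hypothetical helper, not a function from the PR.

```python
import re
from pathlib import Path

# Assumed naming scheme: "step_" followed only by digits.
_STEP_DIR = re.compile(r"^step_(\d+)$")


def get_step(path):
    """Return the step number encoded in the directory name, or None if it does not match."""
    match = _STEP_DIR.match(Path(path).name)
    return int(match.group(1)) if match else None


def latest_step_dir(dirs):
    """Return the directory with the highest step number, ignoring non-matching names."""
    numbered = [(get_step(d), d) for d in dirs]
    numbered = [(step, d) for step, d in numbered if step is not None]
    return max(numbered, key=lambda pair: pair[0])[1] if numbered else None


candidates = [Path("ckpt/step_2"), Path("ckpt/step_10"), Path("ckpt/step_latest")]
print(latest_step_dir(candidates))
```

Returning None rather than a sentinel such as -1 keeps a malformed directory from being mistaken for a real, very old checkpoint.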

gemini-code-assist · 2026-05-16T02:34:53Z

    def __call__(cls, *args, **kwargs):
-        if cls not in cls._instances:
-            instance = super().__call__(*args, **kwargs)
-            cls._instances[cls] = instance
-        return cls._instances[cls]
+        # Note: this creates uninitialized instance copy of the
+        # class even if it has been cached, for getting the type,
+        # but the real instance will always be returned
+        obj = cls.__new__(cls, *args, **kwargs)
+        real_cls = type(obj)
+        if real_cls not in cls._instances:
+            obj.__init__(*args, **kwargs)
+            cls._instances[real_cls] = obj
+
+        return cls._instances[real_cls]


The implementation of SingletonMeta.__call__ is inefficient. It invokes cls.__new__ on every call, which in the case of GenerateState triggers dynamic class loading and uninitialized object creation even when the instance is already cached. The cache should be checked before calling __new__.

Suggested change

     def __call__(cls, *args, **kwargs):
-        # Note: this creates uninitialized instance copy of the
-        # class even if it has been cached, for getting the type,
-        # but the real instance will always be returned
-        obj = cls.__new__(cls, *args, **kwargs)
-        real_cls = type(obj)
-        if real_cls not in cls._instances:
-            obj.__init__(*args, **kwargs)
-            cls._instances[real_cls] = obj
-
-        return cls._instances[real_cls]
+        if cls in cls._instances:
+            return cls._instances[cls]
+        obj = cls.__new__(cls, *args, **kwargs)
+        real_cls = type(obj)
+        if real_cls not in cls._instances:
+            obj.__init__(*args, **kwargs)
+            cls._instances[real_cls] = obj
+        if real_cls != cls:
+            cls._instances[cls] = obj
+        return cls._instances[real_cls]
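The cache-first idea can be tried in isolation with a self-contained sketch. The classes below are hypothetical: `Base.__new__` stands in for whatever dynamic class loading `GenerateState` does, and this variant aliases the requested class to the already initialised instance, which is one possible reading of the suggestion rather than the PR's code.

```python
class SingletonMeta(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Fast path: a cached class never reaches __new__ again.
        if cls in cls._instances:
            return cls._instances[cls]
        obj = cls.__new__(cls, *args, **kwargs)
        real_cls = type(obj)
        if real_cls not in cls._instances:
            obj.__init__(*args, **kwargs)
            cls._instances[real_cls] = obj
        # Alias the requested class to the initialised instance for later calls.
        cls._instances[cls] = cls._instances[real_cls]
        return cls._instances[cls]


new_calls = []  # records every __new__ invocation


class Base(metaclass=SingletonMeta):
    def __new__(cls, *args, **kwargs):
        new_calls.append(cls.__name__)
        # Hypothetical dynamic dispatch: asking for Base builds a Custom.
        target = Custom if cls is Base else cls
        return object.__new__(target)

    def __init__(self):
        self.ready = True


class Custom(Base):
    pass


first = Base()
second = Base()
third = Custom()
print(type(first).__name__, first is second, first is third, len(new_calls))
```

Here `__new__` runs once, on the first `Base()` call; the later `Base()` and `Custom()` calls both return the cached object without constructing anything.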

maocheng23 and others added 15 commits May 12, 2026 16:16

Merge branch 'main' into feat/multilora-rebase

93d1e22

[fix] minor incorrect merge conflict fix during rebase

b983c2b

[fix] typo

ea20deb

[misc] update scripts

e20b2b2

[misc] change controller to controllerimpl for easier testing

70fe3fe

[test] add tests for adapter config and controller

582eea7

[misc] add cache

7d2529e

[refactor] data source multilora

782f87d

[test] e2e test qwen3-4b

29170d8

[doc] update readme

f4db33f

[misc] use args as defaults

e2b788b

[misc] update the comments

235e321

mathewjhan requested review from Zhichenzzz, fzyzcjy, guapisolo, jybsuper, maocheng23, yueming-yuan and yushengsu-thu as code owners May 16, 2026 02:28

gemini-code-assist Bot reviewed May 16, 2026

mathewjhan added 7 commits May 15, 2026 19:42

Merge branch 'main' into feat/multilora-rebase

b8425a1

[ci] run pre-commit

3222e9e

[ci] update based on pre-commit

6a536e9

[fix] use ParallelState

e0fe167

[test] use pairwise instead of zip

de745d5

[fix] merge_sample_pair

35f9569

[fix] resolve the reward from metadata

16f17e2

yushengsu-thu self-assigned this May 17, 2026

mathewjhan added 7 commits May 19, 2026 01:15

Merge branch 'main' into feat/multilora-rebase

7229c49

[fix] remove rollout_id from unload drained

9e13809

[misc] add metadata field

fcd5592

[refactor] decouple state from config [test] fix tests [refactor] use updated active adapters [refactor] rename ACTIVE to RUNNING for clarity [fix] tests [chore] clean up comments [fix] pre-commit + ruff [misc] remove

[misc] add validation to register

eb38629

[misc] add controller info

82833d8

[misc] add adapter meta

c99b752

[misc] validate checkpoint path collision

e553a4d

mathewjhan marked this pull request as draft May 22, 2026 20:59

mathewjhan added 3 commits May 22, 2026 15:36

[fix] don't require rm and custom rm path to be set

48b6019

[fix] return correct samples filtering

5dd2fe5

Merge branch 'fix/sample-filtering' into feat/multilora-rebase

6359633

mathewjhan wants to merge 32 commits into radixark:main from mathewjhan:feat/multilora-rebase

3 participants
