
Add torch_xla TPU support for ZeRO-1/2#7917

Open
PKUWZP wants to merge 9 commits into master from codex/tpu-xla-zero12

Conversation

@PKUWZP
Collaborator

@PKUWZP PKUWZP commented Mar 21, 2026

Summary

  • add an XLA accelerator for TPU execution through torch_xla
  • initialize DeepSpeed distributed with the torch.distributed xla backend and xla:// init flow
  • make ZeRO-1/2 gradient bucketing backend-agnostic and add TPU/XLA-focused tests and docs

Testing

  • git diff --check
  • python3 -m py_compile accelerator/xla_accelerator.py accelerator/real_accelerator.py deepspeed/comm/constants.py deepspeed/comm/comm.py deepspeed/comm/torch.py deepspeed/runtime/zero/stage_1_and_2.py tests/accelerator/test_ds_init.py tests/unit/common.py tests/unit/accelerator/test_accelerator.py tests/unit/comm/test_xla_backend.py

Notes

  • the local Python environment in this session did not have pytest or pre-commit installed, so those checks could not be run here
  • scope is intentionally limited to TPU support for ZeRO-1/2 via torch_xla

Signed-off-by: PKUWZP <zhipeng.rainbowserie@gmail.com>
@PKUWZP PKUWZP requested review from delock and removed request for GuanhuaWang March 21, 2026 18:44

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8acc916e4e


Comment on lines +49 to +52
def device_name(self, device_index=None):
    if device_index is None:
        return 'xla'
    return f'xla:{device_index}'


P1 Badge Map XLA device names to addressable device indices

device_name(device_index) is used by DeepSpeedEngine._set_distributed_vars() (deepspeed/runtime/engine.py:1287-1290) to build self.device from LOCAL_RANK, and the module is then moved there at line 1427. Under PJRT, LOCAL_RANK is not the per-process XLA device index: each worker only sees its own addressable XLA devices, which is often just xla:0. Returning xla:{LOCAL_RANK} here means nonzero ranks on a multi-chip TPU will try to place the model on xla:1, xla:2, etc., which are not addressable in that process and causes multi-rank launches to fail or target the wrong chip.


Collaborator Author


Addressed in 7f82c20. device_name(device_index) now resolves through the process's addressable XLA device list instead of treating LOCAL_RANK as a raw XLA ordinal, so single-device-per-process workers map nonzero local ranks back to their local xla:0 device correctly.
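The behavior described above can be sketched as follows. This is an illustrative helper, not the PR's exact code; `addressable_devices` stands in for the per-process device list that torch_xla reports:

```python
def resolve_xla_device_name(device_index, addressable_devices):
    """Map a launcher-provided local rank to a device this process can address.

    Under PJRT each worker typically sees only its own chip as 'xla:0',
    so a nonzero LOCAL_RANK must be clamped back to a local ordinal.
    """
    if not addressable_devices:
        raise RuntimeError("No addressable XLA devices are available in the current process.")
    if device_index is None:
        return 'xla'
    # Clamp to the last addressable ordinal, mirroring the min() logic above.
    local_ordinal = min(device_index, len(addressable_devices) - 1)
    return f'xla:{local_ordinal}'
```

With a single addressable device per process, `resolve_xla_device_name(3, ['xla:0'])` yields `'xla:0'` rather than the unaddressable `'xla:3'`.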

Comment on lines +58 to +59
def set_device(self, device_index):
    os.environ['LOCAL_RANK'] = str(device_index)


P1 Badge Actually bind each TPU worker in set_device()

DeepSpeed's launcher gives every local process the same TPU visibility mask (deepspeed/launcher/launch.py:182-183) and relies on get_accelerator().set_device(local_rank) from DeepSpeedEngine._set_distributed_vars() to pin each worker to its chip. This implementation only rewrites LOCAL_RANK; it never calls a torch_xla/PJRT device-selection API or sets the PJRT process-rank env that torch_xla uses to derive local ordinals. On a host with multiple TPU chips, multiple ranks can therefore attach to the same default XLA device, which breaks distributed initialization and ZeRO synchronization.


Collaborator Author


Addressed in 7f82c20. set_device() now calls into xm.xla_device() to select the XLA default device for the current process before DeepSpeed moves the model, and it preserves the launcher-provided rank information in the environment.
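The revised flow can be sketched like this (illustrative, not the PR's exact code): preserve the launcher-provided rank in the environment, then bind the process to its default XLA device when torch_xla is importable.

```python
import os

def set_device(device_index):
    """Sketch: record the launcher's rank, then pin this process to its
    default XLA device via torch_xla when a PJRT runtime is present."""
    os.environ['LOCAL_RANK'] = str(device_index)
    try:
        # Requires torch_xla with a TPU/PJRT runtime; without it we only
        # record the rank and return None.
        import torch_xla.core.xla_model as xm
        # xm.xla_device() returns the default XLA device for this process.
        return xm.xla_device()
    except ImportError:
        return None
```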

PKUWZP added 6 commits March 21, 2026 15:46
Signed-off-by: PKUWZP <zhipeng.rainbowserie@gmail.com>
Collaborator

@tohtana tohtana left a comment


Thank you for submitting a great PR, @PKUWZP!
I left two comments, though they are not the core part of this PR.

        raise RuntimeError("No addressable XLA devices are available in the current process.")
    if device_index is None:
        return 0
    return min(device_index, len(devices) - 1)
Collaborator

@tohtana tohtana Mar 23, 2026


This assumes device_index is an int, but it can actually be

  • a string (from device_name)
  • a torch.device (from set_device)

For these, the function would throw a TypeError. I suggest handling these types in this function.

Collaborator Author


@tohtana good catch! I made some changes; it now handles str (e.g., "0" or "xla:0" from os.environ["LOCAL_RANK"]) and torch.device (from torch.device(get_accelerator().device_name(...))) in addition to int.
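The handling can be sketched as below. This is illustrative, not the PR's exact implementation; a torch.device is recognized by duck typing (its `.type`/`.index` attributes) so the sketch runs without torch installed:

```python
def normalize_device_index(device_index, num_devices):
    """Normalize int / str / torch.device inputs to a local XLA ordinal."""
    if device_index is None:
        return 0
    if hasattr(device_index, 'type') and hasattr(device_index, 'index'):
        # torch.device('xla', 0) -> .index == 0; torch.device('xla') -> .index is None
        device_index = device_index.index if device_index.index is not None else 0
    elif isinstance(device_index, str):
        # Accept both "0" (from LOCAL_RANK) and "xla:0" forms.
        device_index = int(device_index.rsplit(':', 1)[-1])
    if not isinstance(device_index, int):
        raise TypeError(f"Unsupported device index type: {type(device_index)}")
    # Clamp to the last addressable ordinal, as in the snippet above.
    return min(device_index, num_devices - 1)
```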

Collaborator


Thank you, I see you added handling in _normalize_device_index. But I wonder if set_device still has an issue with the path from partition_parameters.py.

In set_device, we write device_index back to the env vars. If device_index is a torch.device, LOCAL_RANK and PJRT_LOCAL_PROCESS_RANK will include the device type (e.g. xla:0), not only a number.

accelerator = get_accelerator()
dtype_order = (torch.float16, torch.float32, torch.float64, torch.bfloat16)
for dtype in dtype_order:
    bucket = [tensor for tensor in tensors if tensor.dtype == dtype and accelerator.on_accelerator(tensor)]
Collaborator


Why does xla require special handling given the prior code worked for other accelerators?

Collaborator


Another issue is that it seems on_accelerator is only defined for xla and so other accelerators will break here.

Collaborator Author


@sfc-gh-truwase Thanks for the comments:) I removed the accelerator.on_accelerator(tensor) filter that would have changed behavior for all backends. I also kept the dtype-based comparison (replacing the old string-based type names like torch.cuda.HalfTensor) since that's the actual fix needed for XLA compatibility without breaking other accelerators.
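The dtype-based grouping can be illustrated without torch; `FakeTensor` and the dtype strings below stand in for real tensors and torch dtypes, so this is a pure-logic sketch rather than the PR's code:

```python
from collections import namedtuple

# Stand-in for a real tensor; only .dtype matters for the bucketing logic.
FakeTensor = namedtuple('FakeTensor', ['name', 'dtype'])

# Mirrors the dtype_order of (float16, float32, float64, bfloat16) above.
DTYPE_ORDER = ('float16', 'float32', 'float64', 'bfloat16')

def split_by_dtype(tensors):
    """Bucket tensors by comparing dtypes directly, replacing the old
    string-based type() names such as 'torch.cuda.HalfTensor'."""
    return [
        bucket for bucket in
        ([t for t in tensors if t.dtype == dtype] for dtype in DTYPE_ORDER)
        if bucket
    ]
```

Comparing `tensor.dtype` works identically for CUDA and XLA tensors, whereas type-name strings are backend-specific.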

utils.logger.info(f'cdb={cdb}')
if cdb is None and torch.distributed.is_initialized():
    # The user initialized torch.dist themselves, create cdb and short-circuit
    if dist_backend is None:
Collaborator


Why do we need this behavior? Is it specific to xla?

Collaborator Author


@sfc-gh-truwase Thanks for the call out! I added a comment clarifying that it's a general fix (not XLA-specific) — it prevents passing None to TorchBackend when the user pre-initialized torch.distributed without specifying dist_backend.
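The fallback amounts to the small decision below; `get_backend` stands for `torch.distributed.get_backend()`, which reads the backend of an already-initialized default process group, and the wrapper itself is illustrative:

```python
def resolve_dist_backend(dist_backend, torch_dist_initialized, get_backend):
    """If the user pre-initialized torch.distributed without telling
    DeepSpeed which backend to use, read it back from the existing
    process group instead of passing None to TorchBackend."""
    if dist_backend is None and torch_dist_initialized:
        # e.g. torch.distributed.get_backend() -> 'nccl', 'gloo', or 'xla'
        return get_backend()
    return dist_backend
```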

    selected_device["index"] = n
    return FakeDevice(selected_device["index"])

torch_xla.devices = lambda: [FakeDevice(idx) for idx in range(device_count)]
Collaborator


I don't think using a fake TPU device is useful since computation cannot be tested. Rather, I think we should condition these tests to run only when a TPU is available. Another possibility is to set up CI tests specifically for TPU using the cloud credits.

Collaborator Author


@sfc-gh-truwase Great suggestions. I initially thought the fake TPU device was useful for testing; I have now removed all three fake TPU device tests and the _install_fake_torch_xla helper. I agree these tests should be conditioned on real TPU availability instead of using faked devices.
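Gating tests on real hardware can look like the sketch below. unittest is shown for a self-contained example (the repo's pytest suite would use `pytest.mark.skipif` with the same probe), and the `tpu_available` helper name is hypothetical:

```python
import unittest

def tpu_available():
    """Probe for a usable TPU; returns False when torch_xla is missing or
    no PJRT runtime is configured (broad except is deliberate here)."""
    try:
        import torch_xla.core.xla_model as xm
        return xm.xla_device_hw(xm.xla_device()) == 'TPU'
    except Exception:
        return False

class TestXlaAccelerator(unittest.TestCase):

    @unittest.skipUnless(tpu_available(), "requires a real TPU/PJRT runtime")
    def test_device_name_is_xla(self):
        from deepspeed.accelerator import get_accelerator
        self.assertTrue(get_accelerator().device_name().startswith('xla'))
```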

PKUWZP added 2 commits March 26, 2026 00:02
- split_half_float_double: use dtype comparison instead of string-based
  type names, without adding on_accelerator filtering that would change
  behavior for all backends
- comm.py: clarify that dist_backend fallback is not XLA-specific
- Remove fake TPU device tests per reviewer guidance; XLA accelerator
  tests should run on real TPU hardware

Signed-off-by: PKUWZP <zhipeng.rainbowserie@gmail.com>
- _normalize_device_index: handle str and torch.device types in addition
  to int, since callers pass LOCAL_RANK strings and torch.device objects
- real_accelerator: catch RuntimeError from get_xla_supported_devices()
  when torch_xla is installed but no TPU/PJRT runtime is available

Signed-off-by: PKUWZP <zhipeng.rainbowserie@gmail.com>
