Simple and stable Inference APIs #4697

Open
YangFei1990 wants to merge 18 commits into NVIDIA:main from YangFei1990:inference_apis

Conversation

@YangFei1990
Contributor

What does this PR do?

Motivation And Goals

The current Megatron inference APIs expose many internal building blocks:

  • InferenceConfig
  • DynamicInferenceContext
  • GPTInferenceWrapper
  • TextGenerationController
  • DynamicInferenceEngine
  • InferenceClient
  • DataParallelInferenceCoordinator
  • dynamic text generation server helpers

This is powerful, but it makes simple usage verbose. A user who wants to run offline generation or serve requests must understand engine construction, context selection, tokenizer setup, coordinator lifecycle, and per-rank behavior.

APIs

Inspired by vLLM, we propose two dimensions for the API design.

Sync and Async APIs

We propose two major APIs for Megatron inference:

  • MegatronLLM: synchronous offline inference API. Calls block for final
    outputs.
  • MegatronAsyncLLM: asyncio-native generation and online serving
    (OpenAI-compatible).

Both classes support offline inference, lifecycle control (pause/unpause/suspend/resume), and access to the underlying engine for expert use. The differentiators are: MegatronAsyncLLM exposes async methods and the online HTTP server (serve(...)); MegatronLLM exposes sync methods.

The underlying primitive APIs can also be accessed through corresponding
property attributes (engine, context, controller).
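
For illustration, a minimal sketch of that escape hatch. The property names come from this PR; the construction and the inspection step are hypothetical:

from megatron.inference import MegatronLLM

llm = MegatronLLM(...)  # construction elided, as in the examples below

engine = llm.engine          # underlying inference engine
context = llm.context        # underlying inference context
controller = llm.controller  # underlying text generation controller

# Hypothetical expert use: inspect which primitives back this instance.
print(type(engine).__name__, type(context).__name__, type(controller).__name__)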

Coordinator

Both sync and async APIs support direct mode and coordinator mode, specified by the use_coordinator argument in the API constructor. We also provide an is_primary_rank property to help users understand which rank should feed data and collect outputs.

Without coordinator, all ranks are treated as user-managed ranks, and users need to handle load balancing between different DP/EP ranks. Every rank's is_primary_rank returns true: the API does not decide which rank should receive which prompts or which rank should emit output. Users must split data across different DP/EP ranks, ensure consistent inputs across TP/PP/CP ranks, and gather/aggregate results from different DP/EP ranks. If users do not shard inputs correctly, they may duplicate work or violate TP/PP/EP/DP group expectations.
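
A minimal sketch of direct mode under these rules; see below. The parallel-state queries exist in megatron.core, but their use here and the round-robin sharding are illustrative assumptions, not part of this PR:

from megatron.core import parallel_state
from megatron.inference import MegatronLLM, SamplingParams

llm = MegatronLLM(..., use_coordinator=False)

# Every rank reports is_primary_rank == True in direct mode, so prompts are
# sharded manually across data-parallel ranks; TP/PP/CP ranks within a DP
# group must all see the same shard.
dp_rank = parallel_state.get_data_parallel_rank()
dp_size = parallel_state.get_data_parallel_world_size()
my_prompts = prompts[dp_rank::dp_size]  # simple round-robin shard

outputs = llm.generate(my_prompts, SamplingParams(num_tokens_to_generate=128))

# Gathering results across DP ranks is also user-managed, e.g. via
# torch.distributed collectives (omitted here).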

With coordinator, the coordinator manages load balancing. Users feed data on the coordinator (primary) rank and collect output on that rank. is_primary_rank returns true only on the coordinator rank, which is global rank 0. Online serving mode requires use_coordinator=True when DP/EP size is greater than 1.

Lifecycle methods (pause/unpause/suspend/resume) are only meaningful in coordinator mode. They raise RuntimeError in direct mode.
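
For instance, in coordinator mode (method names come from this PR; calling them from the primary rank is an assumption for illustration):

llm = MegatronLLM(..., use_coordinator=True)

if llm.is_primary_rank:
    llm.pause()    # stop scheduling new requests
    # ... e.g. inspect state or reconfigure between batches ...
    llm.unpause()  # resume scheduling

# With use_coordinator=False the same calls raise RuntimeError.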

Examples

Here we list some common examples; for details, see examples/inference.

Offline Sync Generation With Coordinator

from megatron.inference import MegatronLLM, SamplingParams

llm = MegatronLLM(
    ...,
    use_coordinator=True,
)

if llm.is_primary_rank:
    outputs = llm.generate(prompts, SamplingParams(num_tokens_to_generate=128))
    for output in outputs:
        print(output.generated_text)

# shutdown() runs on every rank (it sits outside the primary-rank guard).
llm.shutdown()

Concurrent Async Generation With Multiple Prompts

from megatron.inference import MegatronAsyncLLM, SamplingParams

llm = MegatronAsyncLLM(
    ...,
    use_coordinator=True,
    coordinator_host="10.0.0.1",
    coordinator_port=6000,
)
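# MegatronAsyncLLM.generate is awaited below, so this snippet assumes it
# runs inside an async function (e.g. driven by asyncio.run(main())).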

if llm.is_primary_rank:
    sampling_params = SamplingParams(num_tokens_to_generate=64)
    results = await llm.generate(prompts, sampling_params)
    for result in results:
        print(result.generated_text)

await llm.shutdown()

Programmatic OpenAI-Compatible Server

from megatron.inference import MegatronAsyncLLM, ServeConfig

llm = MegatronAsyncLLM(
    ...,
    use_coordinator=True,
    coordinator_host="10.0.0.1",  # Internal/routable host for coordinator ZMQ.
    coordinator_port=6000,
)

# All ranks enter llm.serve, but only the primary rank hosts the HTTP server.
# `blocking=True` (default) keeps serve() awaiting until the server stops.
await llm.serve(
    ServeConfig(
        host="0.0.0.0",  # HTTP bind host.
        port=5000,
    ),
)

# Users can send OpenAI-compatible requests to the primary rank's HTTP endpoint.
# For example, from another process:
#
# from openai import OpenAI
#
# client = OpenAI(api_key="EMPTY", base_url="http://<primary-host>:5000/v1")
# response = client.chat.completions.create(
#     model="megatron-gpt",
#     messages=[{"role": "user", "content": "Explain Megatron inference."}],
#     max_tokens=128,
#     temperature=0.8,
#     top_p=0.95,
#     extra_body={"top_k": 40},  # Sent to the server as top-level {"top_k": 40}.
# )
# print(response.choices[0].message.content)
#
# Equivalent raw HTTP request:
#
# curl http://<primary-host>:5000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{
#     "model": "megatron-gpt",
#     "messages": [{"role": "user", "content": "Explain Megatron inference."}],
#     "max_tokens": 128,
#     "temperature": 0.8,
#     "top_p": 0.95,
#     "top_k": 40
#   }'

PR review

The major files to review are the newly added examples in examples/inference and the high-level implementations in megatron/inference; the rest are mostly test coverage and doc changes.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or tag @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

YangFei1990 and others added 14 commits May 6, 2026 20:03
@YangFei1990 YangFei1990 requested a review from a team as a code owner May 8, 2026 06:30
@copy-pr-bot

copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft May 8, 2026 06:30
@github-actions
Contributor

github-actions Bot commented May 8, 2026

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@YangFei1990 YangFei1990 marked this pull request as ready for review May 8, 2026 06:31
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team May 8, 2026 06:31
@YangFei1990
Contributor Author

/ok to test 5f651b7

@YangFei1990
Contributor Author

/ok to test 963f663
