Add ChatVLLM wrapper for proper chat template support #8
Merged
ferreirafabio merged 1 commit into OpenEuroLLM:main on Jan 12, 2026
Conversation
The default LangChain VLLM wrapper uses vllm.LLM.generate() which does *not* apply the model's chat template. This is problematic for models that rely on chat templates (e.g., with <|im_start|>, <|im_end|>, <think> tags). This PR introduces ChatVLLM which uses vllm.LLM.chat() instead, correctly formatting prompts with the model's native chat template. Also updates default max_tokens from 200 to 8192 to avoid truncation.
geoalgo (Collaborator) approved these changes on Jan 12, 2026 and left a comment:
Thanks again for the catch. Just have two minor comments.
Comment on lines +106 to +111
```python
def __init__(self, model: str, max_tokens: int = 8192, **vllm_kwargs):
    from vllm import LLM, SamplingParams

    self.model_path = model
    self.max_tokens = max_tokens
    self.llm = LLM(model=model, trust_remote_code=True, **vllm_kwargs)
```
We probably want to leave the possibility to change trust_remote_code.
Suggested change
```diff
-def __init__(self, model: str, max_tokens: int = 8192, **vllm_kwargs):
-    from vllm import LLM, SamplingParams
-    self.model_path = model
-    self.max_tokens = max_tokens
-    self.llm = LLM(model=model, trust_remote_code=True, **vllm_kwargs)
+def __init__(self, model: str, max_tokens: int = 8192, trust_remote_code: bool = True, **vllm_kwargs):
+    from vllm import LLM, SamplingParams
+    self.model_path = model
+    self.max_tokens = max_tokens
+    self.llm = LLM(model=model, trust_remote_code=trust_remote_code, **vllm_kwargs)
```
```python
if model_provider == "VLLM":
    return ChatVLLM(
        model=model_name,
        max_tokens=max_tokens if max_tokens else 8192,
```
Why do we need the else 8192? The default is already 8192 at l168, right?
Suggested change
```diff
-max_tokens=max_tokens if max_tokens else 8192,
+max_tokens=max_tokens,
```
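Putting the two review suggestions together, the constructor would look roughly like this. This is a sketch, not the merged diff: the enclosing ChatVLLM class name is taken from the PR description, and the comments are editorial.

```python
class ChatVLLM:
    # Constructor with the reviewer's suggestion applied:
    # trust_remote_code is exposed as a parameter instead of hard-coded.
    def __init__(self, model: str, max_tokens: int = 8192,
                 trust_remote_code: bool = True, **vllm_kwargs):
        # Deferred import: vLLM is heavyweight and needs a GPU at load time.
        from vllm import LLM, SamplingParams
        self.model_path = model
        self.max_tokens = max_tokens
        self.llm = LLM(model=model, trust_remote_code=trust_remote_code,
                       **vllm_kwargs)
```

Because the import is deferred into `__init__`, the class can be defined (and its signature inspected) without vLLM installed; only instantiation requires it.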
Description
The default LangChain VLLM wrapper uses vllm.LLM.generate(), which does not apply the model's chat template (see reference [1]). This causes issues for models that rely on chat templates (e.g., with <|im_start|>, <|im_end|>, <think> tags), leading to malformed inputs, truncated outputs, and biased judge evaluations.

This PR implements a ChatVLLM wrapper that uses vllm.LLM.chat() instead, which automatically applies the model's native chat template from the tokenizer_config.json stored in each model directory. The ChatVLLM wrapper converts LangChain prompts to OpenAI-style messages, then vLLM applies the chat template automatically.

Example Impact
Without Chat Template (LangChain default):
With Chat Template (ChatVLLM):
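For illustration, a minimal sketch of the difference, using the ChatML-style tags named above. Both helper functions and their names are assumptions for this example, not the PR's code: real chat templates are defined per model in tokenizer_config.json and applied by vllm.LLM.chat(), not hand-rolled like this.

```python
def to_messages(prompt: str):
    # Hypothetical helper: a flat LangChain prompt becomes OpenAI-style
    # messages, the input format that vllm.LLM.chat() expects.
    return [{"role": "user", "content": prompt}]

def apply_chatml_template(messages):
    # Illustrative ChatML-style formatting with <|im_start|>/<|im_end|> tags.
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Trailing generation prompt so the model answers as the assistant.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

# Without chat template: the raw prompt string is sent verbatim.
# With chat template: the prompt is wrapped in role markers first.
print(apply_chatml_template(to_messages("What is 2+2?")))
```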
Changes
- Added a ChatVLLM class in utils.py that uses vllm.LLM.chat() instead of generate()
- Updated make_model() to use ChatVLLM for the VLLM provider
- Increased the default max_tokens from 200 to 8192 to avoid truncation of long responses

References
[1] https://docs.vllm.ai/en/latest/getting_started/quickstart/#offline-batched-inference:~:text=r%7D%22)-,Note,same%20format%20as%20those%20passed%20to%20OpenAI%27s%20client.chat.completions%3A,-Code
"The llm.generate method does not automatically apply the model's chat template to the input prompt. Therefore, if you are using an Instruct model or Chat model, you should manually apply the corresponding chat template to ensure the expected behavior. Alternatively, you can use the llm.chat method..."