Is your feature request related to a problem? Please describe.
In multi-modal deployment for accuracy evaluation, some inconsistencies were noticed between LLM and VLM deployment in pytriton. The biggest difference is that for LLMs the chat template is applied on the server side (here), while for VLMs there is no such method and everything needs to happen on the client side (here). Would it be possible to move this processing to the server side for VLMs too? Without this we don't have an OpenAI-like API, and the server cannot be used for evaluation.
Solution:
Operations like applying the chat template should be moved to the server side.
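
For illustration, a minimal sketch of what server-side chat-template handling for a VLM could look like, assuming a Hugging Face `AutoProcessor` whose `apply_chat_template` accepts multi-modal messages (available in recent `transformers` releases). The model id, function name, and message layout are placeholders, not part of the existing deployment code.

```python
# Sketch only: apply the chat template on the server instead of the client.
# Assumes a recent `transformers` where processors expose `apply_chat_template`;
# the model id and message structure below are illustrative placeholders.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

def build_vlm_prompt(user_text: str) -> str:
    # OpenAI-style chat messages arriving from the client; the server,
    # not the client, turns them into the model-specific prompt string.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": user_text},
            ],
        }
    ]
    return processor.apply_chat_template(messages, add_generation_prompt=True)
```

With something like this inside the server-side inference callable, a client could send plain chat messages (plus image data) as it already does for LLMs, and the VLM deployment would expose the same OpenAI-like interface.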