xPyD-hub · hlin99 · Apr 6, 2026 · Apr 6, 2026 · Apr 6, 2026
diff --git a/docs/DESIGN.md b/docs/DESIGN.md
@@ -131,10 +131,88 @@ Real models produce variable-length outputs. The simulator mimics this:
 - `ignore_eos: true` → always output full max_tokens
 - Works in both streaming and non-streaming modes
 
-## OpenAI API Compliance
+## API Compliance
+
+xPyD-sim targets two levels of API compatibility:
+
+### Level 1: OpenAI API Spec (Required)
+
+All endpoints must:
+1. Accept ALL parameters defined in the OpenAI API spec without errors
+2. Produce responses that match the spec format — content can be dummy, but structure must be correct
+3. Validate parameter ranges per spec (e.g., temperature 0-2, top_p 0-1) and return 400 on invalid values
+4. Support all parameter behaviors that affect response format (not just accept and ignore)
+
+Specific requirements:
+
+| Feature | Endpoint | Behavior |
+|---|---|---|
+| response_format (json_object) | /v1/chat/completions | Return valid JSON string as content |
+| response_format (json_schema) | /v1/chat/completions | Return JSON conforming to provided schema |
+| max_completion_tokens | /v1/chat/completions | Fallback when max_tokens not set (already implemented) |
+| encoding_format: base64 | /v1/embeddings | Return base64-encoded float vector |
+| Parameter range validation | All | temperature [0,2], top_p (0,1], frequency_penalty [-2,2], presence_penalty [-2,2] |
+
+### Level 2: vLLM Backend Extensions (Required)
+
+xPyD-sim must also accept vLLM-specific parameters so it can serve as a drop-in replacement when testing xPyD-proxy against vLLM backends. These parameters should be accepted without error; behavior can be simulated where practical.
+
+#### Sampling Parameters (accept, simulate where noted)
+
+| Parameter | Type | Behavior |
+|---|---|---|
+| best_of | int | Accept; generate n candidates and return best (or simulate: just return n=1) |
+| use_beam_search | bool | Accept; ignore (sim doesn't do real search) |
+| top_k | int | Accept; ignore |
+| min_p | float | Accept; ignore |
+| repetition_penalty | float | Accept; ignore |
+| length_penalty | float | Accept; ignore |
+| stop_token_ids | list[int] | Accept; ignore (sim uses stop strings) |
+| include_stop_str_in_output | bool | Accept; ignore |
+| min_tokens | int | Accept; ignore |
+| skip_special_tokens | bool | Accept; ignore |
+| spaces_between_special_tokens | bool | Accept; ignore |
+| truncate_prompt_tokens | int | Accept; ignore |
+| prompt_logprobs | int | Accept; return null (sim doesn't track prompt logprobs) |
+| allowed_token_ids | list[int] | Accept; ignore |
+| bad_words | list[str] | Accept; ignore |
+
+#### Extra Parameters (accept and ignore)
+
+| Parameter | Type | Notes |
+|---|---|---|
+| echo | bool | Already implemented for completions; accept for chat too |
+| add_generation_prompt | bool | Accept; ignore |
+| continue_final_message | bool | Accept; ignore |
+| add_special_tokens | bool | Accept; ignore |
+| documents | list[dict] | Accept; ignore (RAG) |
+| chat_template | str | Accept; ignore |
+| chat_template_kwargs | dict | Accept; ignore |
+| mm_processor_kwargs | dict | Accept; ignore |
+| structured_outputs | dict | Accept; ignore (use response_format instead) |
+| priority | int | Accept; ignore |
+| request_id | str | Accept; ignore |
+| return_tokens_as_token_ids | bool | Accept; ignore |
+| return_token_ids | bool | Accept; ignore |
+| cache_salt | str | Accept; ignore |
+| kv_transfer_params | dict | Accept; ignore |
+| vllm_xargs | dict | Accept; ignore |
+| repetition_detection | dict | Accept; ignore |
+| reasoning_effort | str | Accept; ignore |
+| thinking_token_budget | int | Accept; ignore |
+| include_reasoning | bool | Accept; ignore |
+| prompt_embeds | bytes | Accept; ignore |
+
+#### Response Extensions (include in responses)
+
+| Field | Where | Behavior |
+|---|---|---|
+| stop_reason | choices[].stop_reason | null (sim never stops on token IDs) |
+| service_tier | response.service_tier | null |
+| kv_transfer_params | response.kv_transfer_params | null |
+
+### Legacy Notes
 
-- Accept ALL OpenAI API parameters without errors
-- Response JSON format must exactly match OpenAI spec
 - Streaming: first chat chunk delta must include `role: "assistant"`
 - Streaming: final chunk includes `usage` when `stream_options.include_usage` is set
 - All responses include `system_fingerprint`
@@ -560,3 +638,34 @@ scheduling:
 | TC13.10 | /debug/batch shows correct state | All fields accurate |
 | TC13.11 | Request log captures batch events | All events logged correctly |
 | TC13.12 | E2E with proxy: PD disaggregation | Full flow works, TTFT/TPOT correct |
+
+### 14. OpenAI Spec Compliance — Response Format
+| ID | Test | Expected |
+|---|---|---|
+| TC14.1 | response_format: json_object | Content is valid JSON string |
+| TC14.2 | response_format: json_schema with schema | Content conforms to provided JSON schema |
+| TC14.3 | response_format in streaming | Streamed content assembles into valid JSON |
+
+### 15. OpenAI Spec Compliance — Parameter Validation
+| ID | Test | Expected |
+|---|---|---|
+| TC15.1 | temperature=3.0 | HTTP 400, clear error message |
+| TC15.2 | top_p=-0.5 | HTTP 400 |
+| TC15.3 | frequency_penalty=5.0 | HTTP 400 |
+| TC15.4 | presence_penalty=-3.0 | HTTP 400 |
+| TC15.5 | n=0 or n=-1 | HTTP 400 |
+| TC15.6 | best_of < n | HTTP 400 |
+
+### 16. OpenAI Spec Compliance — Embedding base64
+| ID | Test | Expected |
+|---|---|---|
+| TC16.1 | encoding_format=float | Returns list of floats (current behavior) |
+| TC16.2 | encoding_format=base64 | Returns base64-encoded float vector |
+
+### 17. vLLM Backend Compatibility
+| ID | Test | Expected |
+|---|---|---|
+| TC17.1 | All vLLM sampling params accepted | No 422/400 error |
+| TC17.2 | All vLLM extra params accepted | No 422/400 error |
+| TC17.3 | Response includes stop_reason field | null in choices |
+| TC17.4 | Response includes service_tier field | null or absent |
diff --git a/docs/GAP_ANALYSIS.md b/docs/GAP_ANALYSIS.md
@@ -0,0 +1,36 @@
+# xPyD-sim Gap Analysis
+
+Generated: 2026-04-06
+
+## OpenAI Spec Gaps (Must Fix)
+
+| # | Feature | Current State | Required | Difficulty |
+|---|---|---|---|---|
+| 1 | Parameter range validation (temperature, top_p, frequency_penalty, presence_penalty) | No validation — any value accepted silently | Return HTTP 400 for out-of-range values (temperature [0,2], top_p (0,1], frequency_penalty [-2,2], presence_penalty [-2,2]) | Easy |
+| 2 | `n` validation (n≤0) | No validation | Return HTTP 400 for n≤0 | Easy |
+| 3 | `response_format: json_object` | Field accepted but ignored — content is plain dummy text | Return valid JSON string as content | Medium |
+| 4 | `response_format: json_schema` | Field accepted but ignored | Return JSON conforming to provided schema | Complex |
+| 5 | `response_format` in streaming | Not handled | Streamed content must assemble into valid JSON | Medium |
+| 6 | `encoding_format: base64` for embeddings | Field accepted but always returns float array | Return base64-encoded float vector when `encoding_format=base64` | Easy |
+| 7 | `best_of < n` validation | `best_of` exists on CompletionRequest but no cross-field validation | Return HTTP 400 when best_of < n | Easy |
+
+## vLLM Backend Gaps (Must Add)
+
+| # | Feature | Current State | Required | Difficulty |
+|---|---|---|---|---|
+| 1 | Accept vLLM sampling params on ChatCompletionRequest | `ChatCompletionRequest` has no `extra="allow"` — unknown fields cause 422 | Add `model_config = {"extra": "allow"}` or explicit Optional fields for all vLLM sampling params (top_k, min_p, repetition_penalty, use_beam_search, etc.) | Easy |
+| 2 | Accept vLLM sampling params on CompletionRequest | `CompletionRequest` has no `extra="allow"` — unknown fields cause 422 | Add `model_config = {"extra": "allow"}` or explicit Optional fields | Easy |
+| 3 | Accept vLLM extra params (chat_template, documents, add_generation_prompt, priority, request_id, etc.) | Not accepted — 422 error | Accept without error on all request models | Easy |
+| 4 | `best_of` on ChatCompletionRequest | Only defined on CompletionRequest | Add `best_of` field to ChatCompletionRequest | Easy |
+| 5 | `echo` on ChatCompletionRequest | Only defined on CompletionRequest | Accept on chat endpoint too | Easy |
+| 6 | `stop_reason` in response choices | Not present in Choice/CompletionChoice models | Add `stop_reason: Optional[str] = None` to Choice, CompletionChoice, StreamChoice, CompletionStreamChoice | Easy |
+| 7 | `service_tier` in response objects | Not present in response models | Add `service_tier: Optional[str] = None` to ChatCompletionResponse, CompletionResponse, ChatCompletionChunk, CompletionChunk | Easy |
+| 8 | `kv_transfer_params` in response objects | Not present | Add `kv_transfer_params: Optional[dict] = None` to response models | Easy |
+| 9 | `prompt_logprobs` support | Not present — would 422 | Accept and return null in response | Easy |
+
+## Summary
+
+- **OpenAI Spec Gaps**: 7 items (4 Easy, 2 Medium, 1 Complex)
+- **vLLM Backend Gaps**: 9 items (all Easy)
+- **Highest risk**: `response_format: json_schema` — requires parsing JSON Schema and generating conforming dummy data
+- **Quick wins**: Parameter validation, `extra="allow"`, response field additions — can all be done in one PR