Connect your local LLMs directly to VS Code for a private and powerful AI coding experience.
A VS Code extension that connects your editor to self‑hosted or local LLMs via any OpenAI‑compatible server (vLLM, Ollama, TGI, llama.cpp, LocalAI, etc.). Keep source code on your infrastructure while using AI for coding, refactoring, analysis, and more.
- Works with any OpenAI Chat Completions–compatible endpoint
- Function calling tools with optional parallel execution
- Safe token budgeting based on model context window
- Built‑in retries with exponential backoff and detailed logging
- Model list caching for fewer network calls
- API keys securely stored in VS Code SecretStorage
- Status bar health monitor with quick actions
- Server presets for quick switching between endpoints
- Usage statistics tracking (requests, tokens, response times)
- Default model selection for consistent workflow
- vLLM (recommended)
- LM Studio
- Ollama
- llama.cpp
- Text Generation Inference (Hugging Face)
- LocalAI
- Any other OpenAI‑compatible server
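In practice, "OpenAI-compatible" means the server accepts standard Chat Completions requests such as the one below; the URL and model name are placeholders for your own server and model:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```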
- Install “Local Model Provider” from the VS Code Marketplace.
- Reload VS Code if prompted.
- Start a server
vLLM example (gpt-oss-120b)

```bash
vllm serve openai/gpt-oss-120b \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser openai \
  --reasoning-parser openai_gptoss \
  --tensor-parallel-size 2 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.8 \
  --disable-log-requests \
  --enable-prefix-caching \
  --async-scheduling
```
Options explained (brief):
- `--trust-remote-code`: allow custom model repo code to run (required by some model repos)
- `--enable-auto-tool-choice`: let the model/server automatically pick and call tools
- `--tool-call-parser openai`: use the OpenAI function-calling format
- `--reasoning-parser openai_gptoss`: reasoning parser compatible with GPT-OSS
- `--tensor-parallel-size 2`: split the model across 2 GPUs (tensor parallelism)
- `--host 0.0.0.0`: listen on all network interfaces
- `--port 8000`: server port
- `--max-model-len 131072`: max context length (tokens)
- `--gpu-memory-utilization 0.8`: VRAM usage ratio per GPU
- `--disable-log-requests`: reduce request logging noise
- `--enable-prefix-caching`: enable prefix/KV caching for repeated prompts
- `--async-scheduling`: schedule requests asynchronously for better throughput
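Before connecting the extension, you can sanity-check the server from a shell. With the example command above it listens on port 8000, and a healthy server should return a JSON list of served models:

```bash
# Expected shape: {"object": "list", "data": [{"id": "openai/gpt-oss-120b", ...}]}
curl http://localhost:8000/v1/models
```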
LM Studio example

- Download and install LM Studio
- Load a model in the LM Studio UI
- Start the local server (default: `http://localhost:1234`)
- Set `local.model.provider.serverUrl` to `http://localhost:1234`

Tip: If tool calling causes errors with your model, disable `local.model.provider.enableToolCalling` in settings.
Ollama example
```bash
ollama run qwen3:8b
```
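Ollama's OpenAI-compatible API is normally served at `http://localhost:11434` (this assumes a default Ollama install), so the corresponding `settings.json` entry would be:

```json
{
  "local.model.provider.serverUrl": "http://localhost:11434"
}
```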
- Configure the extension
- Open VS Code Settings and search for “Local Model Provider”.
- Required: set `local.model.provider.serverUrl` (e.g. `http://localhost:8000`)
- Optional: run "Local Model Provider: Set API Key (Secure)" to store a key in SecretStorage
- Use your models
- Open the model manager and enable models from the “Local Model Provider”.
- Model configuration
- Model selection
- Test execution
- Feature menu
- Server preset
All settings are under the `local.model.provider.*` namespace.

- `serverUrl` (string): base URL, e.g. `http://localhost:8000`
- `serverPresets` (array): saved server configurations for quick switching
- `defaultModel` (string): default model ID to use (leave empty for auto-select)
- `requestTimeout` (number, ms): default 60000
- `defaultMaxTokens` (number): estimated context window (default 32768). If your model/server supports a larger context, consider increasing this for better continuity (e.g. 65k–128k).
- `defaultMaxOutputTokens` (number): max generation tokens (default 4096). Increase when you need longer answers; ensure input + output stays within the model's context window.
- `enableToolCalling` (boolean): enable function calling (default true)
- `parallelToolCalling` (boolean): allow parallel tool calls (default true)
- `agentTemperature` (number): temperature when tools are enabled (default 0.0)
- `topP` (number): nucleus sampling (default 1.0)
- `frequencyPenalty` (number): repetition penalty (default 0.0)
- `presencePenalty` (number): topic-shift encouragement (default 0.0)
- `maxRetries` (number): retry attempts (default 3)
- `retryDelayMs` (number): backoff base delay (default 1000)
- `modelCacheTtlMs` (number): model list cache TTL (default 300000)
- `logLevel` ("debug" | "info" | "warn" | "error")
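For reference, here is a sketch of how these might appear in `settings.json`, using the defaults listed above and an example server URL (adjust both to your setup):

```json
{
  "local.model.provider.serverUrl": "http://localhost:8000",
  "local.model.provider.defaultModel": "",
  "local.model.provider.requestTimeout": 60000,
  "local.model.provider.defaultMaxTokens": 32768,
  "local.model.provider.defaultMaxOutputTokens": 4096,
  "local.model.provider.enableToolCalling": true,
  "local.model.provider.parallelToolCalling": true,
  "local.model.provider.agentTemperature": 0.0,
  "local.model.provider.maxRetries": 3,
  "local.model.provider.retryDelayMs": 1000,
  "local.model.provider.modelCacheTtlMs": 300000
}
```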
API keys are not stored in settings. Use the command palette:
- “Local Model Provider: Set API Key (Secure)”
- "Local Model Provider: Set API Key (Secure)" — Store/remove API key in SecretStorage
- "Local Model Provider: Show Server Status" — Open the status bar menu with quick actions
- "Local Model Provider: View Models & Set Default" — Browse available models and set a default
- "Local Model Provider: Switch Server Preset" — Quick switch between configured server endpoints
- "Local Model Provider: View Usage Statistics" — Display session statistics (requests, tokens, response times)
- "Local Model Provider: Refresh Model Cache" — Clear cache and fetch models from server
See connection status at a glance. Click to open quick actions:
- View and set default model
- Switch server presets
- View usage statistics
- Refresh model cache
- Set API key
- Open settings
- Show logs
The status bar displays:
- Connection status (connected/error/unknown)
- Number of available models
- Session statistics when available
Models don’t appear
- Run `curl http://HOST:PORT/v1/models` and confirm the server responds
- Verify `serverUrl` is correct (protocol/port included)
- Run "Local Model Provider: Test Server Connection"
Empty response
- Ensure the correct tool-call parser for your model family (e.g. vLLM `--tool-call-parser`)
- Disable `enableToolCalling` to test plain chat
- Large conversations are truncated automatically; try with fewer messages
Tool call formatting issues
- Disable `parallelToolCalling` for unstable models
- Set `agentTemperature` to 0.0 for more consistent formatting
LM Studio connection errors ("terminated" / "request failed")
- Set `serverUrl` to `http://localhost:1234` (LM Studio default port)
- Do NOT include `/v1` in the URL; the extension adds it automatically
- Disable `parallelToolCalling` (LM Studio may not support this parameter)
- If tool calling causes crashes, disable `enableToolCalling` in settings
- Increase `requestTimeout` if the model takes a long time to load (e.g. 120000)
- Check the LM Studio server logs for detailed error messages
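Combining the adjustments above, a `settings.json` sketch for a troublesome LM Studio setup could look like this (only apply the tool-calling changes if you actually hit those errors):

```json
{
  "local.model.provider.serverUrl": "http://localhost:1234",
  "local.model.provider.parallelToolCalling": false,
  "local.model.provider.enableToolCalling": false,
  "local.model.provider.requestTimeout": 120000
}
```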
Out‑of‑memory (OOM)
- Reduce `--max-model-len`, use a quantized model (AWQ/GPTQ/FP8), or pick a smaller model
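As one illustration, the vLLM command from the installation section could be restarted with a smaller context window; other flags are omitted here for brevity, and the value is only a starting point:

```bash
# Reducing --max-model-len shrinks the KV cache and lowers memory pressure
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 8000
```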
- Requests are sent only to the server you configure.
- If authentication is required, API keys are stored securely via VS Code SecretStorage.
- Sensitive data (like API keys) is never written to logs.
Licensed under the MIT license.
- Issues & Feature Requests: https://github.com/krevas/local-model-provider/issues
