Skip to content

Optimize start time#2138

Open
dgageot wants to merge 2 commits intodocker:mainfrom
dgageot:board/run-docker-agent-run-examples-gopher-yam-ecbcff53
Open

Optimize start time#2138
dgageot wants to merge 2 commits intodocker:mainfrom
dgageot:board/run-docker-agent-run-examples-gopher-yam-ecbcff53

Conversation

@dgageot
Copy link
Member

@dgageot dgageot commented Mar 16, 2026

No description provided.

dgageot added 2 commits March 16, 2026 21:44
Make modelsdev.Store a process-wide singleton via sync.OnceValues so the
models.dev database is loaded at most once and shared across all callers
(teamloader, runtime, bedrock, embedding, config tests).

Key changes:
- NewStore() returns the same *Store instance for the entire process
- Add ETag support: store the ETag from models.dev responses and send
  If-None-Match on refresh; a 304 Not Modified skips the download
- Switch from json.Decoder to ReadFile/ReadAll + json.Unmarshal to
  avoid the decoder's intermediate buffering overhead
- Hoist store creation out of per-model loops in teamloader to make
  the singleton intent explicit and handle errors gracefully
- Persist ETag in the cache file for reuse across process restarts

This eliminates ~50-60MB of redundant heap allocations per run (the 3MB
JSON was being parsed into ~66MB of Go objects up to 6 times).

Assisted-By: docker-agent
Replace json.NewDecoder streaming decode with os.ReadFile + string
splitting + strconv.Unquote. The history file stores one JSON-quoted
string per line; Unquote handles the same escapes without the full
JSON state machine and reflection overhead.

Pre-size slices by counting newlines upfront and use map[string]struct{}
for deduplication to reduce allocation count.

Benchmarked on a 658KB / 7348-line history file:
  Before: 3.05ms, 15703 allocs, 2.32MB
  After:  1.41ms,  2606 allocs, 2.96MB

Assisted-By: docker-agent
@dgageot dgageot requested a review from a team as a code owner March 16, 2026 21:11
Copy link

@docker-agent docker-agent bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Assessment: 🔴 CRITICAL

This PR introduces performance optimizations through singleton patterns and optimized parsing. While the optimizations are well-intentioned, there are several issues that need attention:

Critical Issues (1):

  • Singleton initialization failure is permanent and cannot recover

Notable Issues (5):

  • Silent error handling may cause data loss in history
  • Cache inconsistency if disk write fails
  • Store initialization errors are silently ignored (2 instances)
  • No validation of cached data structure

Key Findings

The singleton pattern for the models.dev store introduces a critical availability issue: if initialization fails during a transient condition (e.g., network mount not ready), the error is cached permanently and the application cannot recover without a restart.

Additionally, error handling has been weakened in several places - the history loader now silently skips malformed lines, and store initialization errors are never logged or returned to callers.

Recommendations

  1. Consider making the singleton recoverable or adding retry logic
  2. Add logging for skipped history entries
  3. Log store initialization errors at appropriate levels
  4. Validate cached data structure before returning it
  5. Handle saveToCache failures more gracefully

continue
}

message, err := strconv.Unquote(line)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM: Silent error handling in strconv.Unquote may cause data loss

When strconv.Unquote fails on line 243, the error is silently ignored with continue. This means corrupt or malformed lines will be skipped without logging. If the history file contains invalid JSON strings due to disk corruption, application bugs, or manual editing, users will silently lose history entries. The previous implementation with json.Decoder.Decode would have failed and returned an error on the first invalid line, alerting the user.

Recommendation: Consider at minimum logging the skipped line or counting skipped entries and returning an error if too many are skipped.

// on the first call to NewStore. All subsequent calls return the same value.
var singleton = sync.OnceValues(func() (*Store, error) {
homeDir, err := os.UserHomeDir()
if err != nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 HIGH: Singleton initialization failure is permanent and cannot recover

The singleton pattern using sync.OnceValues means that if os.UserHomeDir() or os.MkdirAll fails during the first call, the error is cached permanently. If the first call happens during startup when the home directory is temporarily unavailable (e.g., network mount not ready, Docker volume not attached), ALL future calls will fail with the same error even after the condition is resolved. The singleton cannot be reset, requiring a process restart.

For a long-running process, this is a severe availability issue.

Recommendation: Consider making the singleton recoverable by using a mutex-protected initialization that can retry on failure, or document this limitation clearly and ensure proper error handling at startup.


s.db = db
s.etag = etag
return db, nil
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM: Cache inconsistency if saveToCache fails after updating in-memory fields

If saveToCache fails after updating s.db and s.etag in memory, the in-memory cache and disk cache are out of sync. Subsequent calls will return the cached db but the disk cache wasn't updated, so after a restart, the old cache is loaded. This leads to unnecessary API calls after restart.

Recommendation: Consider updating the in-memory cache only after successful disk write, or add logic to detect and recover from this inconsistency.

thinkingConfigured := false

// Obtain the singleton store once, outside the loop.
modelsStore, modelsStoreErr := modelsdev.NewStore()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM: Singleton store initialization error is silently ignored

The modelsStore is obtained once before the loop. If modelsStoreErr is not nil, it's stored but never returned to the caller or logged. The code continues and skips the models.dev lookup when modelsStoreErr != nil. If the user hasn't configured max_tokens and modelsStoreErr is not nil, the code silently falls back to defaultMaxTokens without informing the user that the models.dev lookup failed.

This makes debugging harder when users encounter unexpected token limit behavior.

Recommendation: Log the error at an appropriate level (at least debug, preferably warn) so users can diagnose issues.

slog.Warn("Warning: failed to save to cache", "error", err)
// Save the fresh data to cache.
if saveErr := saveToCache(cacheFile, database, newETag); saveErr != nil {
slog.Warn("Failed to save to cache", "error", saveErr)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM: No validation of cached data before returning after 304 response

When handling a 304 Not Modified response, the code returns cached.Database without validating that it's structurally sound. If the cache file was corrupted or partially written, json.Unmarshal might succeed but produce a Database with nil Providers, causing nil pointer panics in GetModel.

Recommendation: Add validation of the cached data structure (e.g., check that Providers is not nil) before returning it. This is good defensive programming against cache corruption.

var fallbackModels []provider.Provider

// Obtain the singleton store once, outside the loop.
modelsStore, modelsStoreErr := modelsdev.NewStore()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM: Store initialization error in fallback models is never logged

In getFallbackModelsForAgent, modelsStore initialization error is silently ignored. If NewStore() fails, the error is never logged. If the user has configured fallback models that depend on models.dev for max_tokens, they'll silently get defaultMaxTokens without knowing why.

This could lead to unexpected behavior where models hit token limits earlier than expected.

Recommendation: Log the error at an appropriate level (at least debug, preferably warn) to help users diagnose issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant