
CI: build-agents jobs recompile Rust from scratch despite cache-from #1224

Description

@chaodu-agent

Problem

Each build-agents matrix job in build-images.yml recompiles the entire Rust binary from scratch, despite the cache-from + BUILDER_IMAGE mechanism designed to reuse the build-core output.

Evidence

From run #28303304486:

  • build-core (arm64) completes and pushes the builder image to registry
  • build-agents jobs start after build-core completes (correct needs dependency)
  • But each variant job still shows full cargo build compilation in the "Build and push by digest" step

Example: hermes arm64 job — log shows compiling 2000+ crates from scratch (fdeflate, webpki-roots, image-webp, etc.)

Timing (arm64 jobs)

Variant    Duration
grok       6 min
cursor     10 min
copilot    11 min
claude     11 min
codex      11 min
kiro       12 min
hermes     14+ min

All are spending the majority of time on Rust compilation.

Root Cause

In Dockerfile.unified:

ARG BUILDER_IMAGE=rust:1-bookworm
FROM ${BUILDER_IMAGE} AS builder
WORKDIR /build
COPY Cargo.toml Cargo.lock ./
COPY crates/openab-core/Cargo.toml crates/openab-core/Cargo.toml
COPY crates/openab-gateway/Cargo.toml crates/openab-gateway/Cargo.toml
RUN ... cargo build --release --features unified ...
COPY crates/ crates/
COPY src/ src/
RUN ... cargo build --release --features unified

Even though BUILDER_IMAGE is set to the pre-built builder from registry, Docker buildx still evaluates the COPY + RUN layers. If the build context (file hashes) does not exactly match the cached layers, all subsequent layers are invalidated and recompiled.

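The cache-from: type=registry setting should let buildx match these layers against the ones pushed by build-core, but in practice the full recompile still happens. The likely cause is that the context sent to buildx differs slightly between the build-core and build-agents jobs: both use the same checkout, but timing or file-metadata differences can change the layer hashes, and a single mismatch on an early COPY is enough to invalidate every layer after it.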

Impact

  • N variants × 2 architectures = ~28 redundant Rust compilations
  • Each takes 6-14 min → total CI time much higher than necessary
  • Wastes GitHub Actions minutes

Suggested Fix

Instead of relying on Docker layer cache for the builder stage, have build-agents directly copy the pre-built binary:

# Option A: Multi-stage with explicit image reference
ARG BUILDER_IMAGE
FROM ${BUILDER_IMAGE} AS prebuilt-builder
FROM debian:bookworm-slim AS hermes
COPY --from=prebuilt-builder /build/target/release/openab /usr/local/bin/openab
# ... install runtime deps ...

Or separate the Dockerfile so variant targets do NOT include the builder stage at all — they just COPY --from=<registry>/builder:<tag>-<arch>.

This would reduce each variant job from 10+ min to under 1 min (just pulling image + adding thin layer).
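
The second option (a variant Dockerfile with no builder stage at all) could look roughly like the sketch below. The file name Dockerfile.agents, the BUILDER_REPO and BUILDER_TAG args, and the registry path are all hypothetical placeholders; the binary path is the one used in Option A, and the <tag>-<arch> image naming follows the description above. This is a sketch under those assumptions, not a tested replacement for Dockerfile.unified.

```dockerfile
# Hypothetical Dockerfile.agents (file name is an assumption): there is no
# builder stage here, so a variant build cannot reach a cargo build step.

# Assumed build args; the repository and tag values are placeholders.
ARG BUILDER_REPO=registry.example.com/openab/builder
ARG BUILDER_TAG=latest

# TARGETARCH is one of BuildKit's automatic platform args (amd64, arm64, ...).
# It is used here to select the per-arch builder image pushed by build-core.
FROM ${BUILDER_REPO}:${BUILDER_TAG}-${TARGETARCH} AS prebuilt

FROM debian:bookworm-slim AS hermes
# Binary path taken from Option A above.
COPY --from=prebuilt /build/target/release/openab /usr/local/bin/openab
# ... install runtime deps ...
```

One consequence of this layout: if the builder image for a given tag or architecture is missing, the variant build fails at the FROM line instead of silently falling back to a full compile, which makes cache problems like this one visible immediately.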
