CI: Code Coverage job (cargo-tarpaulin) times out at 90m on main, masking coverage signal #622

@bpowers

Summary

The Code Coverage job in the CI GitHub Actions workflow (.github/workflows/ci.yaml, the coverage: job, step Run cargo-tarpaulin) is failing on main. The Run cargo-tarpaulin step has timeout-minutes: 90, and recent runs hit that cap: the job runs ~1h30m and then errors without producing/uploading coverage XML.

This has been failing across multiple recent main commits, so it is not specific to any single PR. (It is unrelated to the recent cargo-deny failure, which was already fixed in 113204c.)

Affected runs (both failure on main)

All other CI jobs (build, frontend, pysimlin, simlin-serve smoke, Lint/cargo-deny) pass.

Why it matters

  • Masks real coverage signal: when the step times out, no coverage XML is produced, so the Upload Coverage step has nothing to send to Codecov. Coverage data on main goes stale.
  • Wastes CI minutes: every main push burns ~90 minutes of runner time on a job that always fails.
  • Erodes trust in CI: a perpetually-red job trains people to ignore the CI status.

Hypotheses (not yet confirmed)

  1. The WebAssembly simulation backend added in #620 (engine: WebAssembly simulation backend with full VM parity) may have introduced slow test(s) that, under tarpaulin's instrumentation, push the whole suite past the 90-minute outer cap. Note that the older of the two cited runs is the one for #620, which is consistent with the regression entering around then.
  2. cargo-tarpaulin's instrumentation may interact badly with the WASM build and/or the wasm-interpreter git dependency, inflating wall-clock time.

The existing inline comment in ci.yaml (just above the step) already documents prior history of this same job blowing past a 45-minute cap once the simlin-serve suite was added, which is why the cap was raised to 90 minutes. That budget now appears to be exhausted again.

Component(s) affected

Possible approaches for resolution

  • Confirm the root cause first: compare per-test timings and total tarpaulin wall-clock time before vs. after #620, and check whether specific WASM-backend tests dominate.
  • The inline comment already names the intended next move: switch to source-based coverage via cargo tarpaulin --engine llvm (or cargo-llvm-cov) so coverage runs at native test speed instead of ptrace-instrumented speed.
  • Alternatively, exclude/limit the slow WASM-backend test(s) from the coverage run, or split coverage into faster shards.
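
As a rough illustration of the second approach, the step could look something like the fragment below. This is a hypothetical sketch, not copied from ci.yaml: the runner image, the checkout step, and the --out Xml flag are assumptions, and it presumes cargo-tarpaulin is already installed earlier in the job. Only the step name, the 90-minute cap, and the --engine llvm flag come from the discussion above.

```yaml
# Hypothetical fragment of the coverage: job; the real ci.yaml may differ.
coverage:
  runs-on: ubuntu-latest        # assumption: runner image not stated in this issue
  steps:
    - uses: actions/checkout@v4 # assumption: existing checkout step
    # Assumes cargo-tarpaulin was installed by an earlier step.
    - name: Run cargo-tarpaulin
      timeout-minutes: 90       # current cap; could be lowered once timings are known
      # --engine llvm selects source-based coverage instead of ptrace.
      # --out Xml asks for an XML report for the Upload Coverage step.
      run: cargo tarpaulin --engine llvm --out Xml
```

Whether this actually brings the job back under the cap still needs to be measured; if the WASM-backend tests are slow even at native speed, excluding or sharding them may be needed as well.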

How it was discovered

Identified while fixing an unrelated cargo-deny failure on main; noticed the Code Coverage job had been red across several recent main runs. Not yet investigated beyond gh run view output (no local reproduction of the tarpaulin run).
