[CI debug - DO NOT MERGE] Diagnose ninja cascade-rebuild on test side #195

Closed
lamb-j wants to merge 1 commit into amd-staging from users/lambj/spirv-ci-debug-ninja-rebuild

lamb-j (Collaborator) commented May 10, 2026

⚠️ Diagnostic PR — do not merge

After #194 landed, the codegen and Comgr test jobs still cascade-rebuild even though the build trees are extracted with `tar -xmf`. Codegen rebuilds 2926 targets (~10 min); Comgr rebuilds 228 targets (~4 min, which fits within the time budget, so that job still appears green).

Adds diagnostic steps after the untar in both jobs to capture:

  • `ls -la build/.ninja_deps build/.ninja_log build/build.ninja` — confirms whether the ninja state files survived the tar round-trip
  • `wc -c` of the same — detects corruption (e.g., truncated to 0 bytes)
  • mtimes of source CMakeLists.txt — settles whether mtime is the trigger
  • `ninja -d explain -n check-X` — the gold signal: ninja prints exactly why it would rebuild each target, without actually building
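
As a rough illustration of the first two checks, the state-file report could be wrapped in a small shell function. This is only a sketch: the function name `report_ninja_state` and the output format are invented here for illustration and are not taken from the workflow; the real step uses plain `ls -la` and `wc -c`.

```shell
# Print one line per ninja state file: name, status, size in bytes.
# Hypothetical helper; the build directory is passed as the first argument.
report_ninja_state() {
    build_dir=$1
    for name in .ninja_deps .ninja_log build.ninja; do
        path="$build_dir/$name"
        if [ ! -e "$path" ]; then
            # File did not survive the tar round-trip
            echo "$name missing 0"
        elif [ ! -s "$path" ]; then
            # File exists but was truncated to 0 bytes
            echo "$name empty 0"
        else
            # wc -c < file prints only the byte count; tr strips any padding
            size=$(wc -c < "$path" | tr -d ' ')
            echo "$name ok $size"
        fi
    done
}

# In the CI step this would be followed by the dry-run explain, for example:
#   ninja -C build -d explain -n check-X 2>&1 | head -n 50
```

A `missing` or `empty` line for `.ninja_deps` would point at the tar or artifact round-trip rather than at mtimes.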

Once the diagnostic comes back we'll know which of these is the cause:

  • `.ninja_deps` missing → tar excludes it, or upload-artifact corrupts the tar
  • `.ninja_deps` truncated → corruption somewhere in upload/download
  • mtime trigger → `tar -xmf` not behaving as expected
  • command-hash mismatch → cmake regen produces different commands than original build
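
To tell these causes apart at a glance, the explain output could be tallied by message type. The sketch below assumes the message wordings used by recent ninja releases (`older than most recent input`, `command line changed for`, `deps for '...' are missing`, `doesn't exist`); the wording may differ between ninja versions, and the function name `classify_explain` is hypothetical.

```shell
# Tally `ninja -d explain` lines read from stdin by likely cause.
# Lines that do not start with "ninja explain: " are ignored.
classify_explain() {
    mtime=0 cmdline=0 deps=0 absent=0 other=0
    while IFS= read -r line; do
        case $line in
            "ninja explain: "*" older than most recent input "*)
                mtime=$((mtime + 1)) ;;
            "ninja explain: command line changed for "*)
                cmdline=$((cmdline + 1)) ;;
            "ninja explain: deps for "*" are missing")
                deps=$((deps + 1)) ;;
            "ninja explain: "*" doesn't exist")
                absent=$((absent + 1)) ;;
            "ninja explain: "*)
                # Anything else, e.g. "X is dirty" propagated from an input
                other=$((other + 1)) ;;
        esac
    done
    echo "mtime=$mtime command-hash=$cmdline deps-missing=$deps output-missing=$absent other=$other"
}
```

Typical use would be `ninja -C build -d explain -n check-X 2>&1 | classify_explain`, since ninja writes explain lines to stderr. The first few explain lines usually matter most, because later `is dirty` lines are consequences of an earlier trigger.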

Then we land the targeted fix in a separate PR and revert this debug change.

Codegen and Comgr test jobs cascade-rebuild even with tar -xmf. Add
diagnostic steps after untar in both jobs to capture:

  - mtime + size of build/.ninja_deps, .ninja_log, build.ninja
    (corruption or absence would explain "ninja thinks everything is
    dirty")
  - mtime of source CMakeLists.txt (for comparison vs build.ninja
    mtime — settles the "is mtime the trigger" question)
  - ninja -d explain -n output, which prints why ninja decides to
    rebuild each target without actually building (this is the gold
    signal — tells us if it's mtime, command-hash, depfile-missing,
    or something else)

One-shot debug PR. Once we see the explain output we'll know what to
fix and revert this. Do not merge.

github-actions (Contributor) commented

⚠️ 21 pre-existing translator lit failures on baseline; not caused by this PR (see run).

🔴 New failures (0) — likely caused by this PR

(none)

🟢 Fixed by this PR (0) — failing on baseline, passing here

(none)

⚠️ Pre-existing on `amd-staging` (21)
FAIL: LLVM_SPIRV :: constant/local-float-point-constants.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_float8/conversions_matrix.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_float8/conversions_scalar_vector.ll
FAIL: LLVM_SPIRV :: extensions/EXT/SPV_EXT_shader_atomic_float_/atomicrmw_fsub_half.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_float4/conversions_packed.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_float4/conversions_scalar_vector.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_fp_conversions/spv_intel_fp_conversions.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_int4/conversions_packed.ll
FAIL: LLVM_SPIRV :: extensions/INTEL/SPV_INTEL_sigmoid/sigmoid_f16.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_bfloat16/cooperative_matrix_bfloat16.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_cooperative_matrix/conversion_instructions.ll
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_subgroup_rotate/SPV_KHR_subgroup_rotate.cl
FAIL: LLVM_SPIRV :: extensions/KHR/SPV_KHR_uniform_group_instructions/group-instructions.ll
FAIL: LLVM_SPIRV :: transcoding/float16.ll
FAIL: LLVM_SPIRV :: transcoding/image_signedness_spv_ir.ll
FAIL: LLVM_SPIRV :: transcoding/OpImageSampleExplicitLod_arg.cl
FAIL: LLVM_SPIRV :: transcoding/spec_const.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_clustered_reduce.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_non_uniform_arithmetic.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_shuffle.ll
FAIL: LLVM_SPIRV :: transcoding/sub_group_shuffle_relative.ll

lamb-j added a commit that referenced this pull request May 11, 2026
Diagnostic in #195 showed test-side ninja invocations cascade-rebuild
because tar -m does not restore archived mtimes, so each file gets its
extraction time and mtimes follow extraction order. That leaves
build.ninja older than CMakeCache.txt within the same build dir:

  ninja explain: output build.ninja older than most recent input
                 CMakeCache.txt (1778450813199019645 vs 1778450820955083819)

That triggers ninja's auto-regen rule, cmake regenerates build.ninja
with subtly different command lines, and ninja's command-hash check
then rebuilds every .o it has on file.

Fix: explicitly touch build.ninja after untar so it's the newest file
in the build tree. Trivial 3-file touch, applies to all 3 test jobs
via the shared "Untar build trees" step pattern.
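
A minimal sketch of that touch, plus a check for the condition ninja complained about, might look like the following. The function names and the directory arguments are hypothetical; the actual step in #197 is not shown in this thread.

```shell
# Return 0 if build.ninja exists and is at least as new as CMakeCache.txt.
build_ninja_is_fresh() {
    # find prints the path only if CMakeCache.txt is strictly newer
    [ -f "$1/build.ninja" ] &&
        [ -z "$(find "$1/CMakeCache.txt" -newer "$1/build.ninja" 2>/dev/null)" ]
}

# Touch build.ninja in each given build tree so it becomes the newest
# file there. Must run after all extraction into that tree has finished.
freshen_build_ninja() {
    for dir in "$@"; do
        if [ -f "$dir/build.ninja" ]; then
            touch "$dir/build.ninja"
        fi
    done
}
```

After untar, a call such as `freshen_build_ninja build-a build-b build-c` (placeholder directory names) would cover the three test jobs. `build_ninja_is_fresh` can then confirm the regen trigger is gone before ninja is invoked.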
lamb-j closed this in #197 on May 11, 2026