
Migration runner sorts files lexically but gates on numeric id — 3-digit migrations silently skip earlier ones #197

@Lutherwaves


Migration files are named NN__description.yaml, and the runner derives each migration id by parsing the numeric prefix (strconv.Atoi(strings.Split(k, "__")[0])). Applied state is tracked as a single numeric high-water mark, and a file runs only when its parsed id is strictly greater than that mark.
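A minimal sketch of the id derivation and the apply gate described above. Only the `strconv.Atoi(strings.Split(k, "__")[0])` expression comes from the runner; the function names `migrationID` and `shouldApply` and the sample file names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// migrationID parses the numeric prefix of a file name such as
// "07__add_users.yaml", using the expression quoted in this issue.
// The function name is hypothetical.
func migrationID(name string) (int, error) {
	return strconv.Atoi(strings.Split(name, "__")[0])
}

// shouldApply is a hypothetical name for the gate: a file runs only
// when its id is strictly greater than the high-water mark.
func shouldApply(id, mark int) bool {
	return id > mark
}

func main() {
	mark := 9 // assume migrations 1..9 were already applied
	for _, name := range []string{"09__a.yaml", "10__b.yaml", "100__c.yaml"} {
		id, err := migrationID(name)
		if err != nil {
			fmt.Println("bad migration name:", name, err)
			continue
		}
		fmt.Printf("%s id=%d apply=%v\n", name, id, shouldApply(id, mark))
	}
}
```

Note that the gate looks only at the numeric id, never at the position of the file in the run order.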

But the run order is decided by sort.Strings(keys), i.e. lexical order over the filenames, while the apply gate is numeric. These two orderings agree only as long as every id has the same number of digits (e.g. all two-digit, 00–99).

The bug

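As soon as a project adds its first three-digit migration (100__…), lexical and numeric order diverge, because "100__" sorts before every one of "10__" through "99__". Against "10__" the comparison is decided at the third character: "100__" has 0 there while "10__" has _ (0x5F), which is greater than any digit 0–9 (0x30–0x39), so the shorter prefix sorts later. Against "11__" through "99__" it is decided earlier, at the first digit that differs.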
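The divergence can be seen with `sort.Strings` alone. This is an illustrative sketch; the helper `lexicalOrder` and the file names are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// lexicalOrder returns a sorted copy of names, using the same
// byte-wise string ordering that sort.Strings applies.
func lexicalOrder(names []string) []string {
	out := append([]string(nil), names...)
	sort.Strings(out)
	return out
}

func main() {
	names := []string{
		"09__a.yaml", "10__b.yaml", "11__c.yaml",
		"99__d.yaml", "100__e.yaml", "101__f.yaml",
	}
	fmt.Println(lexicalOrder(names))
	// prints: [09__a.yaml 100__e.yaml 101__f.yaml 10__b.yaml 11__c.yaml 99__d.yaml]
}
```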

On a fresh database (high-water mark = 0) the runner therefore:

  1. applies 01__ through 09__ (mark advances to 9),
  2. then hits 100__, 101__ next in lexical order and applies them, advancing the numeric mark to 101,
  3. then reaches 10__ through 99__, whose ids are all now less than the mark, so every one of them is silently skipped.

No error is raised — the runner just records 100/101 as applied and reports success. The result is a database missing every object created in migrations 10–99 (tables, columns, indexes), which then surfaces as confusing downstream runtime errors rather than a migration failure.

Databases migrated incrementally (mark already ≥ 99 before 100 was introduced) are unaffected, which makes this especially nasty: it only bites fresh environments — new developer setups, CI, ephemeral test databases — while existing/long-lived databases look fine. So it passes locally for whoever has been migrating all along and fails only for everyone starting clean.

Reproduction (conceptual)

Given migration files 01__… through 101__… and an empty database:

  • expected: ids applied in order 1, 2, …, 99, 100, 101
  • actual: 1..9, then 100, 101, then 10..99 skipped (mark jumped to 101)
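The reproduction can be simulated without a database. The sketch below models the behaviour described in this issue (lexical sort, numeric gate, single high-water mark). It is not the runner's actual code, and `migrationNames` and `simulate` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// migrationNames builds hypothetical file names 01__m.yaml .. n__m.yaml,
// zero-padded to two digits only (so 100 becomes "100__m.yaml").
func migrationNames(n int) []string {
	names := make([]string, 0, n)
	for i := 1; i <= n; i++ {
		names = append(names, fmt.Sprintf("%02d__m.yaml", i))
	}
	return names
}

// simulate models the described behaviour: iterate in lexical order,
// apply a file only if its numeric id exceeds the high-water mark.
func simulate(names []string, mark int) (applied, skipped []int, newMark int, err error) {
	keys := append([]string(nil), names...)
	sort.Strings(keys)
	for _, k := range keys {
		id, perr := strconv.Atoi(strings.Split(k, "__")[0])
		if perr != nil {
			return nil, nil, mark, fmt.Errorf("migration %q: %w", k, perr)
		}
		if id > mark {
			applied = append(applied, id)
			mark = id
		} else {
			skipped = append(skipped, id)
		}
	}
	return applied, skipped, mark, nil
}

func main() {
	applied, skipped, mark, err := simulate(migrationNames(101), 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("applied=%d skipped=%d mark=%d\n", len(applied), len(skipped), mark)
	// prints: applied=11 skipped=90 mark=101
}
```

Starting the same simulation with `mark` set to 99 applies only 100 and 101, which matches the observation that incrementally migrated databases are unaffected.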

Suggested fix

Order the keys by their parsed numeric id (the same value the apply gate uses), not by sort.Strings. A stable numeric sort with a lexical tie-break keeps ordering deterministic and makes the run order match the gate regardless of digit count, so zero-padding is no longer load-bearing.
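A minimal sketch of the numeric-sort approach, assuming the same prefix parsing as above. The `migration` type and `sortByID` function are hypothetical names, and the real change would have to fit the runner's existing data structures.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// migration pairs a file name with its parsed numeric id.
type migration struct {
	id   int
	name string
}

// sortByID returns names ordered by parsed numeric id, with the file
// name as a lexical tie-break so the result is deterministic.
func sortByID(names []string) ([]string, error) {
	ms := make([]migration, 0, len(names))
	for _, n := range names {
		id, err := strconv.Atoi(strings.Split(n, "__")[0])
		if err != nil {
			return nil, fmt.Errorf("migration %q: bad numeric prefix: %w", n, err)
		}
		ms = append(ms, migration{id: id, name: n})
	}
	sort.SliceStable(ms, func(i, j int) bool {
		if ms[i].id != ms[j].id {
			return ms[i].id < ms[j].id
		}
		return ms[i].name < ms[j].name
	})
	out := make([]string, len(ms))
	for i, m := range ms {
		out[i] = m.name
	}
	return out, nil
}

func main() {
	ordered, err := sortByID([]string{"100__e.yaml", "10__b.yaml", "09__a.yaml", "99__d.yaml", "101__f.yaml"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ordered)
	// prints: [09__a.yaml 10__b.yaml 99__d.yaml 100__e.yaml 101__f.yaml]
}
```

Returning an error for an unparsable prefix, rather than ignoring the file, is a design choice of this sketch and not something the issue specifies.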

Alternatively (or additionally), if lexical ordering is intended to remain authoritative, the requirement that all migration ids be fixed-width / zero-padded should be documented and ideally validated at startup (fail fast if a prefix width regresses), since the failure is otherwise silent.
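If lexical ordering stays authoritative, a startup check along these lines could make the failure loud instead of silent. This is a sketch under the assumption that "fixed-width" means every numeric prefix has the same number of characters. `checkFixedWidth` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// checkFixedWidth returns an error if any migration prefix is not an
// integer or if the prefixes do not all have the same width.
func checkFixedWidth(names []string) error {
	width := -1
	for _, n := range names {
		prefix := strings.Split(n, "__")[0]
		if _, err := strconv.Atoi(prefix); err != nil {
			return fmt.Errorf("migration %q: bad numeric prefix: %w", n, err)
		}
		if width == -1 {
			width = len(prefix) // first file sets the expected width
			continue
		}
		if len(prefix) != width {
			return fmt.Errorf("migration %q: prefix width %d differs from %d; zero-pad all ids to the same width",
				n, len(prefix), width)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkFixedWidth([]string{"01__a.yaml", "99__b.yaml"}))
	fmt.Println(checkFixedWidth([]string{"01__a.yaml", "100__b.yaml"}))
}
```

With such a check, adding 100__… to a directory of two-digit files would stop the runner at startup until the older files are renamed to 001__…, 002__…, and so on.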

I am happy to open a PR implementing the numeric-sort approach.

Labels: bug
