Migration files are named NN__description.yaml, and the runner derives each migration id by parsing the numeric prefix (strconv.Atoi(strings.Split(k, "__")[0])). Applied state is tracked as a single numeric high-water mark, and a file runs only when its parsed id is strictly greater than that mark.
But the run order is decided by sort.Strings(keys) — i.e. lexical order over the filenames — while the apply gate is numeric. These two orderings agree only as long as every id has the same number of digits (e.g. all two-digit 00–99).
The bug
As soon as a project adds its first three-digit migration (100__…), lexical and numeric order diverge, because "100__" sorts before every one of "10__"…"99__". Against "10__" the first two characters match and the third decides it: the 0 (0x30) in "100__" is less than the _ (0x5F) in "10__". Against "11__"…"99__" the comparison is already settled at the first or second character, where 1 or 0 is less than the corresponding digit.
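To make the divergence concrete, here is a minimal sketch that runs sort.Strings over a handful of filenames. The filenames and the lexicalOrder helper are made up for illustration; only the numeric prefixes matter.

```go
package main

import (
	"fmt"
	"sort"
)

// lexicalOrder returns a sorted copy of names using sort.Strings,
// i.e. the byte-wise comparison described above.
// Hypothetical helper, not part of the runner.
func lexicalOrder(names []string) []string {
	out := append([]string(nil), names...)
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative filenames only.
	names := []string{"09__a.yaml", "10__b.yaml", "11__c.yaml", "99__d.yaml", "100__e.yaml"}
	for _, n := range lexicalOrder(names) {
		fmt.Println(n)
	}
	// Prints, one per line:
	// 09__a.yaml, 100__e.yaml, 10__b.yaml, 11__c.yaml, 99__d.yaml
}
```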
On a fresh database (high-water mark = 0) the runner therefore:
- applies 01__…09__ (mark advances to 9),
- then hits 100__, 101__ next in lexical order and applies them, advancing the numeric mark to 101,
- then reaches 10__…99__, whose ids are all now less than the mark, so every one of them is silently skipped.
No error is raised — the runner just records 100/101 as applied and reports success. The result is a database missing every object created in migrations 10–99 (tables, columns, indexes), which then surfaces as confusing downstream runtime errors rather than a migration failure.
Databases migrated incrementally (mark already ≥ 99 before 100 was introduced) are unaffected, which makes this especially nasty: it only bites fresh environments — new developer setups, CI, ephemeral test databases — while existing/long-lived databases look fine. So it passes locally for whoever has been migrating all along and fails only for everyone starting clean.
Reproduction (conceptual)
Given migration files 01__… through 101__… and an empty database:
- expected: ids applied in order 1, 2, …, 99, 100, 101
- actual: 1..9, then 100, 101, then 10..99 skipped (mark jumped to 101)
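The reproduction can be modelled without a database. The sketch below is not the runner's actual code: simulate is a hypothetical stand-in that combines the two behaviours described in this issue (sort.Strings ordering, strictly-greater-than gate on the parsed id), and the NN__step.yaml names are invented.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// simulate models the described behaviour: walk the filenames in
// sort.Strings order and apply a file only when its parsed id is
// strictly greater than the high-water mark.
func simulate(keys []string, mark int) (applied, skipped []int, newMark int) {
	sorted := append([]string(nil), keys...)
	sort.Strings(sorted)
	for _, k := range sorted {
		id, err := strconv.Atoi(strings.Split(k, "__")[0])
		if err != nil {
			continue // this model ignores names without a numeric prefix
		}
		if id > mark {
			applied = append(applied, id)
			mark = id
		} else {
			skipped = append(skipped, id)
		}
	}
	return applied, skipped, mark
}

func main() {
	var keys []string
	for i := 1; i <= 101; i++ {
		// Two-digit zero padding, so 100 and 101 come out three digits wide.
		keys = append(keys, fmt.Sprintf("%02d__step.yaml", i))
	}
	applied, skipped, mark := simulate(keys, 0)
	fmt.Println("applied:", len(applied), "skipped:", len(skipped), "mark:", mark)
	// Prints: applied: 11 skipped: 90 mark: 101
}
```

Calling simulate(keys, 99) instead applies only 100 and 101, which matches the observation that incrementally migrated databases are unaffected.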
Suggested fix
Order the keys by their parsed numeric id (the same value the apply gate uses), not by sort.Strings. A stable numeric sort with a lexical tie-break keeps ordering deterministic and makes the run order match the gate regardless of digit count, so zero-padding is no longer load-bearing.
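A rough sketch of what the numeric sort could look like. The names migrationID and sortByID are hypothetical, and returning an error on a non-numeric prefix is my assumption rather than something the runner does today.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// migrationID parses the numeric prefix the same way the apply gate does.
func migrationID(key string) (int, error) {
	return strconv.Atoi(strings.Split(key, "__")[0])
}

// sortByID returns a copy of keys ordered by parsed id, using the
// filename as a tie-break so the result is deterministic.
func sortByID(keys []string) ([]string, error) {
	ids := make(map[string]int, len(keys))
	for _, k := range keys {
		id, err := migrationID(k)
		if err != nil {
			return nil, fmt.Errorf("migration %q: bad numeric prefix: %w", k, err)
		}
		ids[k] = id
	}
	out := append([]string(nil), keys...)
	sort.SliceStable(out, func(i, j int) bool {
		a, b := ids[out[i]], ids[out[j]]
		if a != b {
			return a < b
		}
		return out[i] < out[j]
	})
	return out, nil
}

func main() {
	// Illustrative filenames only.
	sorted, err := sortByID([]string{"100__e.yaml", "10__b.yaml", "09__a.yaml", "99__d.yaml"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sorted)
	// Prints: [09__a.yaml 10__b.yaml 99__d.yaml 100__e.yaml]
}
```

Parsing every id up front means a malformed filename is reported before anything is applied, rather than partway through a run.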
Alternatively (or additionally), if lexical ordering is intended to remain authoritative, the requirement that all migration ids be fixed-width and zero-padded should be documented and ideally validated at startup (fail fast if prefix widths are inconsistent), since the failure is otherwise silent.
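If the validation route is preferred, a startup check could be as small as the sketch below. checkPrefixWidths is a hypothetical name, and where it would be called from is an assumption on my part.

```go
package main

import (
	"fmt"
	"strings"
)

// checkPrefixWidths returns an error unless every key has a "__"
// separator and all prefixes before it have the same length.
// With equal-width numeric prefixes, lexical and numeric order agree.
func checkPrefixWidths(keys []string) error {
	width := -1
	first := ""
	for _, k := range keys {
		prefix, _, found := strings.Cut(k, "__")
		if !found {
			return fmt.Errorf("migration %q: missing \"__\" separator", k)
		}
		if width == -1 {
			width, first = len(prefix), k
			continue
		}
		if len(prefix) != width {
			return fmt.Errorf("migration %q: prefix width %d differs from %q (width %d)",
				k, len(prefix), first, width)
		}
	}
	return nil
}

func main() {
	// Illustrative filenames only.
	err := checkPrefixWidths([]string{"01__a.yaml", "99__b.yaml", "100__c.yaml"})
	fmt.Println(err)
}
```

With this check in place, adding 100__… to a two-digit series would fail at startup until the existing files are renamed to 001__…099__.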
I am happy to open a PR implementing the numeric-sort approach.