
Potential Test-Time Data Leakage in Evaluation Protocol #1

@JaeeonPark


Hi @LabRAI team — congrats on the SIGSPATIAL 2025 Best Short Paper award.

While attempting to reproduce the reported numbers for a comparison in our own work, I noticed a pattern in the evaluation code that may cause test-time leakage of the ground-truth target into the model input. I wanted to flag it for your review.

The issue

In eval_typhoformer.py (lines 57-58):

y_last = Y[:, 0, :]  # original comment: "上一步真实坐标" ("true coordinate of the previous step")
output, gate = model(X_num, X_text, y_last, pred_steps=Y.shape[1])

But in prepare_typhoformer_data.py (around lines 148-151):

for i in range(num_seq - INPUT_LEN - PRED_LEN + 1):
    X_seq = X_full[i : i + INPUT_LEN]
    Y_seq = coords[i + INPUT_LEN : i + INPUT_LEN + PRED_LEN]

Y_seq starts at i + INPUT_LEN, which is the first future timestep — one step beyond the input window. So Y[:, 0, :] is the ground-truth future target, not the last observed coordinate as the comment suggests.
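The indexing can be checked in isolation. Here is a minimal sketch that mirrors the slicing in the loop above on a toy sequence where each "coordinate" is just its own timestep index. The values INPUT_LEN = 4 and PRED_LEN = 1 and the helper name make_windows are assumptions for illustration, not taken from the repository.

```python
INPUT_LEN = 4   # assumed value, for illustration only
PRED_LEN = 1    # assumed value, for illustration only


def make_windows(timesteps, input_len=INPUT_LEN, pred_len=PRED_LEN):
    """Hypothetical helper that mirrors the slicing in the loop quoted above."""
    windows = []
    num_seq = len(timesteps)
    for i in range(num_seq - input_len - pred_len + 1):
        x_seq = timesteps[i : i + input_len]
        y_seq = timesteps[i + input_len : i + input_len + pred_len]
        windows.append((x_seq, y_seq))
    return windows


# Use the timestep index itself as the value so positions are easy to read.
windows = make_windows(list(range(10)))
x0, y0 = windows[0]
print("last observed step:", x0[-1])
print("first target step: ", y0[0])
```

In this toy setup the first target is always one index past the last input, so seeding a decoder with y_seq[0] hands it the value it is supposed to predict.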

Why it matters

In model/TyphoFormer.py (decoder forward, lines ~36-52):

def forward(self, h_enc, y_prev, pred_steps):
    preds = []
    y_t = y_prev   # y_prev = y_last = Y[:, 0, :] = ground-truth future
    for _ in range(pred_steps):
        z_t = torch.cat([h_enc, y_t], dim=-1)
        y_t = self.fc2(F.relu(self.fc1(z_t)))
        preds.append(y_t)

With PRED_LEN = 1, the decoder loop runs exactly once. The prediction is computed as MLP([h_enc, GT_future_target]), meaning the target coordinate is available as decoder input at test time.

The same pattern appears in train_typhoformer.py (line ~93), so the model is trained to reproduce y_last with a small MLP correction from h_enc, which is effectively an identity function on y_last plus a residual. At inference, feeding the ground-truth y_last means the model only needs to approximate this small correction, so the error can be very low regardless of actual predictive skill.
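One way to confirm or rule this out is a target-perturbation check: shift only the targets and see whether the predictions move. If they do, the targets are reaching the model input. The sketch below uses a scalar toy decoder as a stand-in. toy_decoder, depends_on_target and the two protocol wrappers are hypothetical names, not the repository's code.

```python
def toy_decoder(h_enc, y_seed):
    """Hypothetical stand-in for a decoder that learned 'seed + small residual'."""
    residual = 0.01 * h_enc
    return y_seed + residual


def depends_on_target(predict, h_enc, last_obs, target, delta=5.0):
    """Return True if the prediction changes when only the target is shifted."""
    base = predict(h_enc, last_obs, target)
    shifted = predict(h_enc, last_obs, target + delta)
    return abs(shifted - base) > 1e-9


def leaky_protocol(h_enc, last_obs, target):
    # Seeds the decoder with the target itself (the pattern described above).
    return toy_decoder(h_enc, target)


def corrected_protocol(h_enc, last_obs, target):
    # Seeds the decoder with the last observed value only.
    return toy_decoder(h_enc, last_obs)


print(depends_on_target(leaky_protocol, h_enc=1.0, last_obs=20.0, target=21.0))
print(depends_on_target(corrected_protocol, h_enc=1.0, last_obs=20.0, target=21.0))
```

The same idea should carry over to the real pipeline: run the evaluation once with the true Y and once with a perturbed Y, then compare the raw predictions (not the errors). Under a leak-free protocol they would be identical.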

Expected impact on reported numbers:

  • 6h = 31.5 km is suspiciously close to what a pure persistence baseline would achieve when anchored to GT_target (essentially zero prediction horizon).
  • 24h = 49.56 km likely benefits from the same anchoring (model output ≈ GT + small residual).
  • The CLIPER baseline in Table 1 (58.3 km at 24h) appears abnormally low vs. CLIPER's typical 24h AR error (~200+ km), suggesting all baselines may have been evaluated under the same leaky protocol.
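To put the numbers in context once the protocol is corrected, a plain persistence baseline could be computed on the same test split. This is a minimal sketch, assuming a track is a list of (lat, lon) pairs in degrees at a fixed time step. The function names and the toy track are made up for illustration and are not data from the paper or the repository.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def persistence_error_km(track, lead_steps):
    """Mean error when track[t + lead_steps] is forecast as track[t]."""
    if not 0 <= lead_steps < len(track):
        raise ValueError("lead_steps must be in [0, len(track))")
    errors = [
        haversine_km(*track[t], *track[t + lead_steps])
        for t in range(len(track) - lead_steps)
    ]
    return sum(errors) / len(errors)


# Made-up track for illustration only: (lat, lon) at a fixed interval.
toy_track = [(15.0, 130.0), (15.5, 129.2), (16.1, 128.5), (16.8, 127.9), (17.6, 127.4)]
print(persistence_error_km(toy_track, lead_steps=1))
print(persistence_error_km(toy_track, lead_steps=4))
```

Note that lead_steps=0 returns exactly 0 km. That is the degenerate case of a forecast anchored to the target itself.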

Suggested fix

Replace the eval seed with the last observation from the input window:

In eval_typhoformer.py:

y_last = X_num[:, -1, :2]   # last observed lat/lon (input, not target)

Apply the same change at train_typhoformer.py line ~93.

This way the decoder is seeded with genuinely observed data and the prediction becomes an actual forecast. The reported numbers should be rerun under this corrected protocol before comparison with other methods.

I don't want to be alarmist - it could be that I'm missing something in the data pipeline (e.g., if Y_seq is actually constructed with overlap with the last input step somewhere I didn't see). Could you confirm whether Y[:, 0, :] at eval time is indeed the first future target, or an observed step?

If the leakage is confirmed, I'd be happy to coordinate on reporting corrected baselines in any follow-up or erratum.

Happy to discuss.

A separate concern, temporal overlap with LLM training: the 2020-2023 time range partially overlaps with GPT-4o's training data. LLM-generated descriptions of real historical TCs from this period may leak memorized forecast information rather than providing novel synoptic reasoning.
