Potential Test-Time Data Leakage in Evaluation Protocol

Hi @LabRAI team — congrats on the SIGSPATIAL 2025 Best Short Paper award.

While attempting to reproduce the reported numbers for a comparison in our own work, I noticed a pattern in the evaluation code that may cause test-time leakage of the ground-truth target into the model input. I wanted to flag it for your review. 

### The issue

In `eval_typhoformer.py` (lines 57-58):

```python
y_last = Y[:, 0, :] # commented: "上一步真实坐标" (last real coordinate)
output, gate = model(X_num, X_text, y_last, pred_steps=Y.shape[1])
```

But in `prepare_typhoformer_data.py` (around lines 148-151):

```python
for i in range(num_seq - INPUT_LEN - PRED_LEN + 1):
    X_seq = X_full[i : i + INPUT_LEN]
    Y_seq = coords[i + INPUT_LEN : i + INPUT_LEN + PRED_LEN]
```

`Y_seq` starts at `i + INPUT_LEN`, which is the first future timestep, one step beyond the last input index `i + INPUT_LEN - 1`. So `Y[:, 0, :]` is the ground-truth future target, not the last observed coordinate as the comment suggests.
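The indexing can be checked on a toy track. This is a minimal sketch in plain Python that reuses the loop above; `INPUT_LEN = 4`, `PRED_LEN = 1`, and the integer stand-ins for coordinates are illustrative assumptions, not the repository's actual settings or data.

```python
# Toy reproduction of the windowing loop; all values are illustrative assumptions.
INPUT_LEN = 4   # assumed, not necessarily the repo's setting
PRED_LEN = 1    # assumed

coords = list(range(10))   # stand-in track: timestep t has "coordinate" t
X_full = coords            # stand-in for the full feature matrix
num_seq = len(coords)

windows = []
for i in range(num_seq - INPUT_LEN - PRED_LEN + 1):
    X_seq = X_full[i : i + INPUT_LEN]
    Y_seq = coords[i + INPUT_LEN : i + INPUT_LEN + PRED_LEN]
    windows.append((X_seq, Y_seq))

X_seq, Y_seq = windows[0]
last_observed = X_seq[-1]   # timestep 3: last step inside the input window
first_target = Y_seq[0]     # timestep 4: first step after the input window
print(last_observed, first_target)
```

In this toy setup the first element of `Y_seq` is always one timestep later than the last element of `X_seq`, which is the relationship the issue is asking the authors to confirm for the real pipeline.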

### Why it matters

In `model/TyphoFormer.py` (decoder forward, lines ~36-52):

```python
def forward(self, h_enc, y_prev, pred_steps):
    preds = []
    y_t = y_prev   # y_prev = y_last = Y[:, 0, :] = ground-truth future
    for _ in range(pred_steps):
        z_t = torch.cat([h_enc, y_t], dim=-1)
        y_t = self.fc2(F.relu(self.fc1(z_t)))
        preds.append(y_t)
```

With `PRED_LEN = 1`, the decoder loop runs exactly once. The prediction is computed as `MLP([h_enc, GT_future_target])`, meaning the target coordinate is available as decoder input at test time.

The same pattern appears in `train_typhoformer.py` (line ~93), so the model is trained to reproduce `y_last` with a small MLP correction from `h_enc`: effectively an identity function on `y_last` plus a residual. At inference, feeding the ground-truth `y_last` means the model only needs to approximate this small correction, which can yield near-zero error regardless of actual predictive skill.
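To make the concern concrete, here is a deliberately simplified sketch. `toy_decoder` is a hypothetical stand-in for a decoder that has learned "seed plus a small correction"; it is not TyphoFormer, and the coordinate values are assumptions chosen only for illustration.

```python
def toy_decoder(seed, residual=0.125):
    """Hypothetical stand-in: a decoder that learned 'seed + small correction'."""
    return seed + residual

last_observed = 4.0   # final coordinate inside the input window (assumed value)
future_target = 5.0   # first future coordinate, i.e. the role of Y[:, 0, :]

# Leaky protocol: the decoder is seeded with the very value it must predict.
leaky_error = abs(toy_decoder(future_target) - future_target)

# Corrected protocol: the decoder is seeded with the last observed value.
honest_error = abs(toy_decoder(last_observed) - future_target)

print(leaky_error, honest_error)
```

Under the leaky seed, the error collapses to the size of the residual, whatever the true displacement between steps. Under the observed seed, the same decoder is off by almost the full displacement.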

Expected impact on reported numbers:
- 6h = 31.5 km is suspiciously close to what a pure persistence baseline would achieve when anchored to the ground-truth target (essentially a zero prediction horizon).
- 24h = 49.56 km likely benefits from the same anchoring (model output ≈ ground truth + small residual).
- The CLIPER baseline in Table 1 (58.3 km at 24h) appears abnormally low compared with CLIPER's typical 24h AR error (~200+ km), suggesting all baselines may have been evaluated under the same leaky protocol.
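One way to put these numbers in context is to compute a genuine persistence baseline (forecast = last observed position) on the same test tracks. The sketch below uses only the standard library; `persistence_error_km`, the `(lat, lon)` track layout, and the toy track are assumptions for illustration and do not come from the repository.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def persistence_error_km(track, input_len, horizon_steps):
    """Mean error when the forecast is simply the last observed position.

    `track` is a list of (lat, lon) tuples at a fixed time step (hypothetical layout).
    """
    errors = []
    for i in range(len(track) - input_len - horizon_steps + 1):
        last_lat, last_lon = track[i + input_len - 1]
        tgt_lat, tgt_lon = track[i + input_len - 1 + horizon_steps]
        errors.append(haversine_km(last_lat, last_lon, tgt_lat, tgt_lon))
    return sum(errors) / len(errors)

# Toy track: moves 1 degree of longitude per step along the equator.
toy_track = [(0.0, float(t)) for t in range(10)]
toy_error = persistence_error_km(toy_track, input_len=4, horizon_steps=1)
print(round(toy_error, 1))
```

On the toy track the result is roughly 111 km, the length of one degree of longitude at the equator. Running the same function on the real test split at the 6h and 24h horizons would give a leak-free reference point for the figures above.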

### Suggested fix

Replace the eval seed with the last observation from the input window:

In `eval_typhoformer.py`:

```python
y_last = X_num[:, -1, :2]   # last observed lat/lon (input, not target); assumes columns 0-1 of X_num hold lat/lon
```

Apply the same change in `train_typhoformer.py` (line ~93).

This way the decoder is seeded with genuinely observed data and the prediction becomes an actual forecast. The reported numbers should be rerun under this corrected protocol before comparison with other methods.
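A small guard in the evaluation script could also catch this class of bug in future. The following is a minimal sketch; `leaky_fraction` and the toy batch are hypothetical and written in plain Python rather than against the repository's tensors.

```python
def leaky_fraction(seeds, first_targets, tol=1e-9):
    """Fraction of samples whose decoder seed equals the first target coordinate.

    `seeds` and `first_targets` are lists of (lat, lon) tuples (hypothetical layout).
    """
    hits = 0
    for seed, target in zip(seeds, first_targets):
        if all(abs(s - t) <= tol for s, t in zip(seed, target)):
            hits += 1
    return hits / len(seeds)

# Toy batch: last observed positions and the first future targets (assumed values).
last_observed = [(10.0, 120.0), (12.0, 125.0)]
first_targets = [(10.5, 119.5), (12.4, 124.2)]

leaky = leaky_fraction(first_targets, first_targets)   # seed taken from Y
fixed = leaky_fraction(last_observed, first_targets)   # seed taken from X
print(leaky, fixed)
```

A value near 1.0 would indicate the seed is being drawn from the target tensor. A stationary storm can legitimately produce occasional matches, so the check is a heuristic, not proof.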

I don't want to be alarmist; I may be missing something in the data pipeline (e.g., `Y_seq` might be constructed to overlap with the last input step somewhere I didn't see). Could you confirm whether `Y[:, 0, :]` at eval time is indeed the first future target, or an observed step?

If the leakage is confirmed, I'd be happy to coordinate on reporting corrected baselines in any follow-up or erratum.

Happy to discuss.

**Temporal overlap with LLM training**: The 2020-2023 time range partially overlaps with GPT-4o's training data. LLM-generated descriptions of real historical TCs from this period may leak memorized forecast information rather than providing novel synoptic reasoning.
