Hi @LabRAI team — congrats on the SIGSPATIAL 2025 Best Short Paper award.
While attempting to reproduce the reported numbers for a comparison in our own work, I noticed a pattern in the evaluation code that may cause test-time leakage of the ground-truth target into the model input. I wanted to flag it for your review.
The issue
In eval_typhoformer.py (lines 57-58):
    y_last = Y[:, 0, :]  # original comment: "上一步真实坐标" ("true coordinate of the previous step")
    output, gate = model(X_num, X_text, y_last, pred_steps=Y.shape[1])
But in prepare_typhoformer_data.py (around lines 148-151):
    for i in range(num_seq - INPUT_LEN - PRED_LEN + 1):
        X_seq = X_full[i : i + INPUT_LEN]
        Y_seq = coords[i + INPUT_LEN : i + INPUT_LEN + PRED_LEN]
Y_seq starts at i + INPUT_LEN, which is the first future timestep — one step beyond the input window. So Y[:, 0, :] is the ground-truth future target, not the last observed coordinate as the comment suggests.
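To make the indexing concrete, here is a minimal sketch that mirrors the slicing quoted above using plain index lists. The window sizes and the helper name `make_windows` are made up for illustration and are not taken from the repository.

```python
# Hypothetical window sizes, chosen only for illustration.
INPUT_LEN = 4
PRED_LEN = 1


def make_windows(n_steps, input_len, pred_len):
    """Return (input_indices, target_indices) for each sliding window.

    Mirrors X_full[i : i + INPUT_LEN] and
    coords[i + INPUT_LEN : i + INPUT_LEN + PRED_LEN] from the quoted loop.
    """
    windows = []
    for i in range(n_steps - input_len - pred_len + 1):
        x_idx = list(range(i, i + input_len))
        y_idx = list(range(i + input_len, i + input_len + pred_len))
        windows.append((x_idx, y_idx))
    return windows


windows = make_windows(8, INPUT_LEN, PRED_LEN)
for x_idx, y_idx in windows:
    # The first target index is one step past the last input index,
    # so it never coincides with an observed input step.
    assert y_idx[0] == x_idx[-1] + 1
    assert y_idx[0] not in x_idx

print(windows[0])
```

Under this slicing, the element at position 0 of every target window lies strictly after the input window, which is the basis for the concern.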
Why it matters
In model/TyphoFormer.py (decoder forward, lines ~36-52):
    def forward(self, h_enc, y_prev, pred_steps):
        preds = []
        y_t = y_prev  # y_prev = y_last = Y[:, 0, :] = ground-truth future
        for _ in range(pred_steps):
            z_t = torch.cat([h_enc, y_t], dim=-1)
            y_t = self.fc2(F.relu(self.fc1(z_t)))
            preds.append(y_t)
With PRED_LEN = 1, the decoder loop runs exactly once. The prediction is computed as MLP([h_enc, GT_future_target]), meaning the target coordinate is available as decoder input at test time.
The same pattern appears in train_typhoformer.py (line ~93), so the model is trained to reproduce y_last with a small MLP correction from h_enc — effectively an identity function on y_last plus a residual. At inference, feeding GT y_last means the model only needs to approximate this small correction, yielding near-zero error regardless of actual predictive skill.
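A toy illustration of the effect, deliberately not using TyphoFormer itself: the "decoder" below is a stand-in that simply copies its seed, and the track is a synthetic 1-D random walk. Both are assumptions made for this sketch. It shows how the seed choice alone determines the measured error.

```python
import random

random.seed(0)

# Synthetic 1-D "track": a random walk with unit-variance steps.
track = [0.0]
for _ in range(2000):
    track.append(track[-1] + random.gauss(0.0, 1.0))


def identity_decoder(y_prev):
    """Stand-in for a decoder that has learned to copy its seed."""
    return y_prev


def mean_abs_error(seed_fn):
    """Mean absolute one-step error when the decoder is seeded by seed_fn."""
    errors = []
    for t in range(1, len(track)):
        last_obs, target = track[t - 1], track[t]
        pred = identity_decoder(seed_fn(last_obs, target))
        errors.append(abs(pred - target))
    return sum(errors) / len(errors)


# Leaky protocol: the seed is the target itself.
leaky_mae = mean_abs_error(lambda last_obs, target: target)
# Corrected protocol: the seed is the last observed value.
honest_mae = mean_abs_error(lambda last_obs, target: last_obs)

print(f"leaky seed MAE:  {leaky_mae:.3f}")
print(f"honest seed MAE: {honest_mae:.3f}")
```

With the leaky seed the copy-decoder scores exactly zero error without any forecasting ability, while the honest seed yields the ordinary persistence error of the random walk.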
Expected impact on reported numbers:
- 6h = 31.5 km is suspiciously close to what a pure persistence baseline would achieve when anchored to GT_target (essentially zero prediction horizon).
- 24h = 49.56 km likely benefits from the same anchoring (model output ≈ GT + small residual).
- The CLIPER baseline in Table 1 (58.3 km at 24h) appears abnormally low compared with CLIPER's typical 24h AR error (roughly 200 km or more), suggesting all baselines may have been evaluated under the same leaky protocol.
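One way to put the numbers above in context is to compute a position-persistence baseline (predict that the storm stays at its last observed position) on the same test tracks. The sketch below uses the standard haversine formula; the helper names and the toy track are hypothetical, and real tracks from the dataset would need to be substituted.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def persistence_error_km(track, horizon_steps):
    """Mean error of predicting 'no movement' over horizon_steps time steps."""
    errors = [
        haversine_km(*track[t], *track[t + horizon_steps])
        for t in range(len(track) - horizon_steps)
    ]
    return sum(errors) / len(errors)


# Made-up (lat, lon) positions, one per time step; not real best-track data.
toy_track = [(15.0, 130.0), (15.5, 129.5), (16.0, 129.0), (16.5, 128.5)]

err_1_step = persistence_error_km(toy_track, 1)
err_2_step = persistence_error_km(toy_track, 2)
print(f"1-step: {err_1_step:.1f} km, 2-step: {err_2_step:.1f} km")
```

If a model's error at a given lead time is far below this kind of baseline for every method in the table, the evaluation protocol is worth checking before the comparison is trusted.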
Suggested fix
Replace the eval seed with the last observation from the input window:
    # eval_typhoformer.py
    # Assumes the first two numeric features of X_num are lat/lon, on the same scale as Y.
    y_last = X_num[:, -1, :2]  # last observed lat/lon (input, not target)
Apply the same change at train_typhoformer.py line ~93.
This way the decoder is seeded with genuinely observed data and the prediction becomes an actual forecast. The reported numbers should be rerun under this corrected protocol before comparison with other methods.
I don't want to be alarmist: I may be missing something in the data pipeline (e.g., Y_seq may be constructed to overlap the last input step somewhere I didn't see). Could you confirm whether Y[:, 0, :] at eval time is indeed the first future target rather than an observed step?
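A quick way to settle the question empirically would be to compare the first target step against the last input step across the test set. The sketch below uses nested lists and a hypothetical helper, and it assumes that the coordinate columns of the numeric input are columns 0 and 1 and share the same normalization as Y. Neither assumption is verified against the repository.

```python
def fraction_seed_equals_last_obs(X_num, Y, coord_cols=(0, 1), tol=1e-6):
    """Fraction of samples where Y[b][0] matches the last input step's coordinates.

    X_num: batch of input windows, shape [batch][input_len][features]
    Y:     batch of target windows, shape [batch][pred_len][2]
    """
    matches = 0
    for x_seq, y_seq in zip(X_num, Y):
        last_obs = [x_seq[-1][c] for c in coord_cols]
        first_target = y_seq[0]
        if all(abs(a - b) <= tol for a, b in zip(last_obs, first_target)):
            matches += 1
    return matches / len(X_num)


# Toy data: two windows of three steps, with features (lat, lon, pressure).
X_toy = [
    [[15.0, 130.0, 990.0], [15.5, 129.5, 985.0], [16.0, 129.0, 980.0]],
    [[20.0, 125.0, 975.0], [20.4, 124.6, 970.0], [20.8, 124.2, 965.0]],
]
# Targets that lie strictly after the input window (no overlap).
Y_future = [[[16.5, 128.5]], [[21.2, 123.8]]]
# Targets that repeat the last observed step (overlap).
Y_overlap = [[[16.0, 129.0]], [[20.8, 124.2]]]

frac_future = fraction_seed_equals_last_obs(X_toy, Y_future)
frac_overlap = fraction_seed_equals_last_obs(X_toy, Y_overlap)
print(frac_future, frac_overlap)
```

A fraction near 1.0 on the real test set would mean the seed is an observed step and the concern does not apply; a fraction near 0.0 would be consistent with the seed being a future target.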
If the leakage is confirmed, I'd be happy to coordinate on reporting corrected baselines in any follow-up or erratum.
Happy to discuss.
Temporal overlap with LLM training: The 2020-2023 time range partially overlaps with GPT-4o's training data. LLM-generated descriptions of real historical TCs from this period may leak memorized forecast information rather than providing novel synoptic reasoning.