Harmonise sources' existing file checking

Related to #204 

Review what information each source needs to check what to (re)process. #204 sets this up so that we can easily bring in either resolution file ids or full resolution files.

Before #204, the setup looked like this:

For a source such as infer, we [load](https://github.com/nikbpetrov/forecastbench/blob/main/src/orchestration/func_infer_fetch/main.py#L26C5-L30C67) `files_in_storage`:
```python
files_in_storage = gcp.storage.list_with_prefix(
    bucket_name=env.QUESTION_BANK_BUCKET, prefix=SOURCE
)

dff = source.fetch(dfq=dfq, files_in_storage=files_in_storage)
```

but those are essentially used as an [id filter](https://github.com/nikbpetrov/forecastbench/blob/e1cde0742deb50c25ecaf49d225d1a52b244bdb9/src/sources/infer.py#L64-L66):
```python
resolved_ids_without_files = [id for id in resolved_ids if f"{self.name}/{id}.jsonl" not in files_in_storage]
```

Crucially, this operation (because of how `list_with_prefix` is set up) will list a file even if it's empty.

This is NOT the same behaviour as the `load_existing_resolution_files` function, which explicitly [drops empty files](https://github.com/nikbpetrov/forecastbench/blob/e1cde0742deb50c25ecaf49d225d1a52b244bdb9/src/orchestration/_source_io.py#L82-L83) from its result:
```python
if not df.empty:
    result[question_id] = df
```
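
To make the mismatch concrete, here is a minimal sketch that uses an in-memory dict in place of the bucket. The helper names (`ids_by_listing`, `ids_by_loading`) and the sample data are hypothetical, and plain lists of rows stand in for DataFrames; this is an illustration, not the code in the repository.

```python
import json

# Hypothetical stand-in for the bucket: file name -> raw JSONL content.
FAKE_BUCKET = {
    "infer/q1.jsonl": '{"id": "q1", "date": "2024-01-01", "value": 0.4}\n',
    "infer/q2.jsonl": "",  # exists in storage but has no rows
}


def ids_by_listing(bucket, source):
    """Mimic the name-based filter: every listed file counts, empty or not."""
    prefix = f"{source}/"
    suffix = ".jsonl"
    return {
        name[len(prefix):-len(suffix)]
        for name in bucket
        if name.startswith(prefix) and name.endswith(suffix)
    }


def ids_by_loading(bucket, source):
    """Mimic the load-based check: files that parse to zero rows are dropped."""
    result = {}
    for question_id in ids_by_listing(bucket, source):
        raw = bucket[f"{source}/{question_id}.jsonl"]
        rows = [json.loads(line) for line in raw.splitlines() if line.strip()]
        if rows:  # analogous to `if not df.empty`
            result[question_id] = rows
    return set(result)


listed = ids_by_listing(FAKE_BUCKET, "infer")
loaded = ids_by_loading(FAKE_BUCKET, "infer")
only_listed = listed - loaded  # ids on which the two checks disagree
```

In this sketch `q2` counts as already present for the listing-based filter but as missing for the loading-based one, so the two paths would disagree on whether it needs (re)processing.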

A few options exist:
- listing file names only
- listing and checking file sizes
- fully loading the actual files
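
The middle option could look roughly like the sketch below. It assumes the listing can return objects that expose a `name` and a `size` in bytes (as `google-cloud-storage` blobs do); `BlobInfo` and `non_empty_file_names` are hypothetical names, not existing helpers in the repository.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class BlobInfo:
    """Hypothetical stand-in for one listed storage object."""

    name: str
    size: Optional[int]  # size in bytes; None if unknown


def non_empty_file_names(blobs: Iterable[BlobInfo]) -> Set[str]:
    """Return the names of listed files that have at least one byte."""
    return {blob.name for blob in blobs if blob.size}


# Illustrative listing: q2 exists but is empty, q3 does not exist at all.
listing = [
    BlobInfo(name="infer/q1.jsonl", size=58),
    BlobInfo(name="infer/q2.jsonl", size=0),
]
files_in_storage = non_empty_file_names(listing)

resolved_ids = ["q1", "q2", "q3"]
resolved_ids_without_files = [
    question_id
    for question_id in resolved_ids
    if f"infer/{question_id}.jsonl" not in files_in_storage
]
```

A size check avoids downloading file contents, but it is not strictly equivalent to loading: a file with a non-zero byte size (for example, whitespace only) could still parse to an empty DataFrame.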

Harmonise sources' existing file checking #210
