Related to #204
Review what information each source needs in order to decide what to (re)process. #204 sets this up so that we can easily bring in either resolution file ids or full resolution files.
Before #204, the setup looked like this:
For a source, like infer, we load files_in_storage:
files_in_storage = gcp.storage.list_with_prefix(
bucket_name=env.QUESTION_BANK_BUCKET, prefix=SOURCE
)
dff = source.fetch(dfq=dfq, files_in_storage=files_in_storage)
but those are essentially used as an id filter:
resolved_ids_without_files = [id for id in resolved_ids if f"{self.name}/{id}.jsonl" not in files_in_storage]
Crucially, this operation (because of how list_with_prefix is set up) will list a file even if it's empty.
This is NOT the same behaviour as the load_existing_resolution_files function, which explicitly drops empty files from its result:
if not df.empty:
    result[question_id] = df
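A few options exist for deciding which resolution files count as already present:
- listing file names only (the current behaviour, which also counts empty files)
- listing files and checking their sizes
- fully loading the actual files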
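
As a rough illustration of the second option, here is a minimal sketch of filtering on file size. It assumes the listing can return objects exposing `name` and `size` attributes (as `Blob` objects from `google-cloud-storage`'s `Client.list_blobs` do); whether `list_with_prefix` can expose this is not checked here. The helper names `non_empty_file_names` and `resolved_ids_without_files` are hypothetical and not part of the codebase.

```python
from types import SimpleNamespace


def non_empty_file_names(blobs):
    """Return the names of listed objects whose size is greater than zero.

    `blobs` is any iterable of objects with `name` and `size` attributes
    (an assumption about what the listing helper could return).
    """
    return {blob.name for blob in blobs if (blob.size or 0) > 0}


def resolved_ids_without_files(source_name, resolved_ids, existing_files):
    """Same id filter as before, but against the non-empty file names only."""
    return [
        question_id
        for question_id in resolved_ids
        if f"{source_name}/{question_id}.jsonl" not in existing_files
    ]


# Stand-in listing: one real file and one empty file (fake data for illustration).
listing = [
    SimpleNamespace(name="infer/a.jsonl", size=128),
    SimpleNamespace(name="infer/b.jsonl", size=0),
]
existing = non_empty_file_names(listing)
missing = resolved_ids_without_files("infer", ["a", "b", "c"], existing)
```

With this filter, the empty file for `b` is treated as missing, which is closer to what load_existing_resolution_files does. It is only a proxy, though: a file with a non-zero size that still parses to an empty DataFrame would be counted as present here but dropped by the full load.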