Harmonise sources' existing file checking #210

@nikbpetrov

Description

Related to #204

Review what information each source needs in order to decide what to (re)process. #204 sets this up so that we can easily bring in either resolution file ids or full resolution files.

Before #204, the setup looked like this:

For a source, like infer, we load files_in_storage:

    files_in_storage = gcp.storage.list_with_prefix(
        bucket_name=env.QUESTION_BANK_BUCKET, prefix=SOURCE
    )

    dff = source.fetch(dfq=dfq, files_in_storage=files_in_storage)

but those are essentially used as an id filter:

    resolved_ids_without_files = [id for id in resolved_ids if f"{self.name}/{id}.jsonl" not in files_in_storage]

Crucially, this operation (because of how list_with_prefix is set up) will list a file even if it's empty.

This is NOT the same behaviour as the load_existing_resolution_files function, which explicitly drops empty files from its result:

    if not df.empty:
        result[question_id] = df
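To make the mismatch concrete, here is a minimal sketch rather than the project's actual code. It assumes an in-memory dictionary (`FAKE_BUCKET`) standing in for the bucket, and two hypothetical helpers, `ids_missing_by_listing` and `ids_missing_by_loading`, that mimic the two checks described above. An empty file is treated as present by the first and as missing by the second.

```python
# Hypothetical in-memory stand-in for the bucket: blob name -> file contents.
FAKE_BUCKET = {
    "infer/q1.jsonl": '{"id": "q1", "resolved_to": 1}\n',
    "infer/q2.jsonl": "",  # exists in storage but is empty
}


def ids_missing_by_listing(source, resolved_ids, bucket):
    """Mimic the listing-based filter: any listed file counts as present."""
    files_in_storage = list(bucket)  # names only, empty files included
    return [
        qid for qid in resolved_ids
        if f"{source}/{qid}.jsonl" not in files_in_storage
    ]


def ids_missing_by_loading(source, resolved_ids, bucket):
    """Mimic the loading-based check: files with no rows count as missing."""
    loaded = {}
    for qid in resolved_ids:
        content = bucket.get(f"{source}/{qid}.jsonl", "")
        rows = [line for line in content.splitlines() if line.strip()]
        if rows:  # analogous to `if not df.empty`
            loaded[qid] = rows
    return [qid for qid in resolved_ids if qid not in loaded]


resolved_ids = ["q1", "q2", "q3"]
by_listing = ids_missing_by_listing("infer", resolved_ids, FAKE_BUCKET)
by_loading = ids_missing_by_loading("infer", resolved_ids, FAKE_BUCKET)
print(by_listing)  # ['q3']
print(by_loading)  # ['q2', 'q3']
```

In this sketch the empty `q2` file would never be re-fetched under the listing-based filter, while the loading-based check would flag it for reprocessing.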

A few options exist:

  • listing the files only
  • listing the files and checking their sizes
  • fully loading the actual files
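The three options can be sketched side by side. This is an illustration under stated assumptions, not the repository's API: `BUCKET` is a hypothetical in-memory mapping of blob names to bytes, and the three function names are made up for the example. A real implementation would go through the storage client instead.

```python
import json

# Hypothetical in-memory bucket: blob name -> raw bytes.
BUCKET = {
    "infer/a.jsonl": b'{"id": "a", "resolved_to": 0.0}\n',
    "infer/b.jsonl": b"",
}


def existing_ids_by_listing(bucket, source):
    """Option 1: names only; an empty file still counts as existing."""
    prefix = f"{source}/"
    return {
        name[len(prefix):-len(".jsonl")]
        for name in bucket
        if name.startswith(prefix) and name.endswith(".jsonl")
    }


def existing_ids_by_size(bucket, source):
    """Option 2: names plus sizes; zero-byte files are skipped."""
    return {
        qid for qid in existing_ids_by_listing(bucket, source)
        if len(bucket[f"{source}/{qid}.jsonl"]) > 0
    }


def existing_files_by_loading(bucket, source):
    """Option 3: parse every file; keep only those with at least one row."""
    result = {}
    for qid in existing_ids_by_listing(bucket, source):
        text = bucket[f"{source}/{qid}.jsonl"].decode("utf-8")
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
        if rows:
            result[qid] = rows
    return result
```

In this sketch, option 1 reports both `a` and `b`, while options 2 and 3 report only `a`. Options 2 and 3 can still disagree: a file containing only whitespace passes the size check but yields no rows when loaded. Option 3 is also the only one that pays the cost of reading every file.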

Labels: data (pulling in data from various APIs)