Fix: Accumulator overflow during aggregation (#6793) #7641
dsotirho-ucsc wants to merge 53 commits into develop
Force-pushed: d47920c to 8248dae
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #7641     +/-   ##
===========================================
+ Coverage    84.82%   84.85%   +0.02%
===========================================
  Files          157      157
  Lines        23060    23117     +57
===========================================
+ Hits         19561    19616     +55
- Misses        3499     3501      +2
Force-pushed: be95d6d to 061cc73
Force-pushed: 2c1ed6b to 8733d33
Force-pushed: 8403e69 to b4de0da
        pass

    def _accumulator(self, field: str) -> Accumulator | None:
        # dataset.document_id aggregation is required for creating of manifests
Either "creating manifests" or "creation of manifests", not "creating of"
Also it would be useful to record why the field is required, i.e., what the symptom of failure would be if it were to be removed.
        if field in ('activity_id', 'document_id', 'source_datarepo_row_ids'):
            # Added to the files aggregate in order to be included in manifests
            if self.outer_entity_type == 'files':
                return super()._accumulator(field)
            else:
                return None
        else:
            return super()._accumulator(field)
Nitpick: each of these could be rewritten as:
-        if field in ('activity_id', 'document_id', 'source_datarepo_row_ids'):
-            # Added to the files aggregate in order to be included in manifests
-            if self.outer_entity_type == 'files':
-                return super()._accumulator(field)
-            else:
-                return None
-        else:
-            return super()._accumulator(field)
+        # Added to the files aggregate in order to be included in manifests
+        if field in ('activity_id', 'document_id', 'source_datarepo_row_ids') and self.outer_entity_type != 'files':
+            return None
+        else:
+            return super()._accumulator(field)
Which I personally think is cleaner since it reduces the number of return pathways and keeps a consistent, flat level of nesting. But it does cause issues with line length, and there's also an argument to be made that keeping the return statements separate is the more resilient approach.
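For illustration, here is a minimal sketch that checks the nested and flat forms make the same decision for every combination of field and outer entity type. The classes `Accumulator`, `BaseStrategy`, `NestedStrategy` and `FlatStrategy` and the constant `MANIFEST_FIELDS` are hypothetical stand-ins, not the project's actual code; only the branching logic is taken from the snippets above.

```python
from typing import Optional


class Accumulator:
    """Hypothetical placeholder for the project's accumulator type."""

    def __init__(self, field: str) -> None:
        self.field = field


class BaseStrategy:
    """Hypothetical parent class that always provides an accumulator."""

    def __init__(self, outer_entity_type: str) -> None:
        self.outer_entity_type = outer_entity_type

    def _accumulator(self, field: str) -> Optional[Accumulator]:
        return Accumulator(field)


# Hoisting the tuple into a constant is one way around the line-length
# concern raised for the flat form (an assumption, not part of the PR).
MANIFEST_FIELDS = ('activity_id', 'document_id', 'source_datarepo_row_ids')


class NestedStrategy(BaseStrategy):
    # The original form: two levels of nesting, three return statements.
    def _accumulator(self, field: str) -> Optional[Accumulator]:
        if field in MANIFEST_FIELDS:
            if self.outer_entity_type == 'files':
                return super()._accumulator(field)
            else:
                return None
        else:
            return super()._accumulator(field)


class FlatStrategy(BaseStrategy):
    # The suggested form: one level of nesting, two return statements.
    def _accumulator(self, field: str) -> Optional[Accumulator]:
        if field in MANIFEST_FIELDS and self.outer_entity_type != 'files':
            return None
        else:
            return super()._accumulator(field)


def same_decision(outer_entity_type: str, field: str) -> bool:
    """True if both forms agree on whether the field gets an accumulator."""
    nested = NestedStrategy(outer_entity_type)._accumulator(field)
    flat = FlatStrategy(outer_entity_type)._accumulator(field)
    return (nested is None) == (flat is None)
```

Under these assumptions the two forms differ only in shape, so the choice between them comes down to readability and line length rather than behavior.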

    def _accumulator(self, field: str) -> Accumulator | None:
        if field in ('document_id', 'donor_id', 'source_datarepo_row_ids'):
            # Added to the files aggregate in order to be included in manifests
Grammar nit
-            # Added to the files aggregate in order to be included in manifests
+            # Add to the files aggregate so that it can be included in manifests
Adding fixup commits for this would have nearly doubled the count of commits, so I've squashed this change into the existing commits.
            return SetAccumulator(max_size=200)
        elif field == 'organism_age_range':
            return SetAccumulator(max_size=200)
        elif field == 'organism_age':
Was it intentional to keep these separate?
No. I've combined them in the fixup commit.
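As background for the `max_size=200` cap in the snippet above, the following is a rough sketch of what a size cap on a set accumulator could look like. `CappedSetAccumulator` is a hypothetical class written for this illustration and is not the project's `SetAccumulator`; in particular, the policy of dropping values once the cap is reached and recording that in an `overflowed` flag is an assumption.

```python
from typing import Optional


class CappedSetAccumulator:
    """
    Hypothetical illustration only, not the project's SetAccumulator.
    Collects distinct values and stops growing once max_size is reached.
    """

    def __init__(self, max_size: Optional[int] = None) -> None:
        self.max_size = max_size
        self.values = set()
        self.overflowed = False

    def accumulate(self, value) -> bool:
        """Return True if the value is held by the accumulator afterwards."""
        if value in self.values:
            # Duplicates never count against the cap
            return True
        if self.max_size is not None and len(self.values) >= self.max_size:
            # Assumed policy: drop the value and remember the overflow
            self.overflowed = True
            return False
        self.values.add(value)
        return True

    def get(self) -> list:
        """Return the accumulated values in sorted order."""
        return sorted(self.values)
```

With such a cap, two fields that share the same limit can share one branch, which is the consolidation agreed on in the thread above.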
        pass

    def _accumulator(self, field) -> Accumulator | None:
        return super()._accumulator(field)
What's the point of adding these here? Surely the diff on subsequent commits would be cleaner if they were removed?
Force-pushed: abca22d to 1ee87e7
hannes-ucsc left a comment
Until further notice, please push commits individually. Ask on Slack for advice on how to automate that.
            'size': file.get('size'),
            'fileSource': file.get('file_source'),
            self.plugin.special_fields.file_uuid.name_in_hit: file.get('uuid'),
            'version': file.get('version'),
Why are we removing the version?
            'document_id',
            'source_datarepo_row_ids'
        } and self.outer_entity_type != 'files':
            # These fields are removed from all aggregations, except file
In the context of indexing, aggregation is a process, and aggregate is the output of that process. Make sure you don't conflate these two anywhere.
-            # These fields are removed from all aggregations, except file
+            # These fields are only aggregated for files, where they are needed for manifests
PL: Which manifests
            # If any dataset IDs are missing from the aggregate, those datasets
            # will be omitted during the verbatim handover. Datasets are a "hot"
            # entity type, and we can't track their hubs in replica documents,
            # so we rely on the inner entity IDs instead. We also need to
            # aggregate document_id to allow filtering by the value on
            # non-dataset endpoints.
PL: hypothetical; datasets don't have hubs, replicas do, "the value"
@hannes-ucsc: "I think I'm wrong about one thing: If replicas have hubs, then replicas of datasets have hubs, therefore datasets have hubs."
Also find a good place to include this ASCII chart
┏━━━━━━━━━━━━━━━━━━┓
┃ Project replica ┃
┃ ┃
┣━━━━━━━━━━━━━━━━━━┫ ┏━━━━━━━━━━━━━━━━━━┓
┃ entity_id ┃◀─┐ ┃ File replica ┃
┣━━━━━━━━━━━━━━━━━━┫ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ ┃
┃ hub_ids ┃ │ ┃ File aggregate ┃ ┣━━━━━━━━━━━━━━━━━━┫
┗━━━━━━━━━━━━━━━━━━┛ │ ┃ ┃ ┌─▶┃ entity_id ┃
│ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ │ ┣━━━━━━━━━━━━━━━━━━┫
┏━━━━━━━━━━━━━━━━━━┓ │ ┃ entity_id ┃──┼─▶┃ hub_ids ┃
┃ Donor replica ┃ │ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ │ ┗━━━━━━━━━━━━━━━━━━┛
┃ ┃ └──┃ contents.projects.document_id ┃ │
┣━━━━━━━━━━━━━━━━━━┫ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ │ ┏━━━━━━━━━━━━━━━━━━┓
┃ entity_id ┃◀────┃ contents.donors.document_id ┃ │ ┃ Specimen replica ┃
┣━━━━━━━━━━━━━━━━━━┫ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ │ ┃ ┃
┃ hub_ids ┃ ┌──┃ contents.protocols.document_id ┃ │ ┣━━━━━━━━━━━━━━━━━━┫
┗━━━━━━━━━━━━━━━━━━┛ │ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ │ ┃ entity_id ┃
│ │ ┣━━━━━━━━━━━━━━━━━━┫
┏━━━━━━━━━━━━━━━━━━┓ │ └─▶┃ hub_ids ┃
┃ Protocol replica ┃ │ ┗━━━━━━━━━━━━━━━━━━┛
┃ ┃ │
┣━━━━━━━━━━━━━━━━━━┫ │
┃ entity_id ┃◀─┘
┣━━━━━━━━━━━━━━━━━━┫
┃ hub_ids ┃
┗━━━━━━━━━━━━━━━━━━┛
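The relationships in the chart can be sketched roughly as follows: a replica belongs to a file aggregate either because its `entity_id` appears among the aggregate's `contents.<type>.document_id` values (projects, donors, protocols), or because the file's `entity_id` appears in the replica's `hub_ids` (file and specimen replicas). The function `replicas_for_file` and the document shapes below are assumptions made for this illustration, not the project's actual schema or code.

```python
def replicas_for_file(aggregate: dict, replicas: list) -> list:
    """
    Return the entity IDs of all replicas reachable from one file aggregate.

    Assumed shapes (hypothetical):
      aggregate: {'entity_id': str,
                  'contents': {entity_type: [{'document_id': [str, ...]}]}}
      replica:   {'entity_id': str, 'hub_ids': [str, ...]}
    """
    file_id = aggregate['entity_id']
    # Left side of the chart: IDs of inner entities tracked via document_id
    inner_ids = set()
    for inner_entities in aggregate.get('contents', {}).values():
        for inner_entity in inner_entities:
            # An inner entity may lack document_id, contributing nothing
            inner_ids.update(inner_entity.get('document_id', []))
    return [
        replica['entity_id']
        for replica in replicas
        if replica['entity_id'] in inner_ids  # matched by document_id
        or file_id in replica['hub_ids']  # matched by hub (right side)
    ]


# Made-up example data mirroring the chart
example_aggregate = {
    'entity_id': 'file-1',
    'contents': {
        'projects': [{'document_id': ['project-1']}],
        'donors': [{'document_id': ['donor-1']}],
        'protocols': [{}],
    },
}
example_replicas = [
    {'entity_id': 'project-1', 'hub_ids': []},
    {'entity_id': 'donor-1', 'hub_ids': []},
    {'entity_id': 'file-1', 'hub_ids': ['file-1']},
    {'entity_id': 'specimen-1', 'hub_ids': ['file-1']},
    {'entity_id': 'specimen-2', 'hub_ids': ['file-9']},
]
```

In this sketch, a `document_id` that is missing from the aggregate means the corresponding replica is silently not found, which is the failure mode the code comment above describes.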
                                  key=compose_keys(none_safe_tuple_key(none_last=True),
                                                   itemgetter('lte', 'gte')))
        elif field == 'disease':
            return SetAccumulator(max_size=17700)

    def _accumulator(self, field: str) -> Accumulator | None:
        if field in ('count', 'file_size'):
        if field in {

        document_ids = [
            document_id
            for entity_type in self.hot_entity_types
            # Some "hot" entity types may be missing from hit['contents']
PL: What's an empty inner entity, why is it "empty" and which types can be "empty"?

Linked issues: #6793
Shorthand for review comments
L line is too long
W line wrapping is wrong
Q bad quotes
F other formatting problem