Problem: Package.files() can cause DB errors for packages with tens of thousands of files #1758

@vmn296

Please describe the problem you'd like to be solved
The Package.files() function iterates over all files in a package using queryset.iterator(). For very large packages (tens of thousands of files), this can cause memory spikes, long-running database queries, OperationalError exceptions (MySQL error 2006, “MySQL server has gone away”) or dropped connections, and job failures during ingestion.

Describe the solution you'd like to see implemented
Add a configurable chunk_size parameter to the Package.files() method so that users can reduce the number of rows fetched at a time.
packages.py#L663
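
A minimal sketch of how the proposed parameter could be passed through to the queryset, under stated assumptions: `iter_files` and `FakeQuerySet` are hypothetical names used only for illustration, and this is not the actual Package.files() code. The only real API relied on is Django's `QuerySet.iterator(chunk_size=...)`. When `chunk_size` is left unset, the call is unchanged, so existing behaviour is preserved.

```python
from typing import Any, Iterator, Optional


def iter_files(queryset: Any, chunk_size: Optional[int] = None) -> Iterator[Any]:
    """Return an iterator over a Django-style queryset.

    Hypothetical helper: forwards an optional chunk_size to
    queryset.iterator() so callers can fetch fewer rows per batch.
    """
    if chunk_size is None:
        # No override given: call iterator() exactly as before,
        # letting Django apply its own default batch size.
        return queryset.iterator()
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return queryset.iterator(chunk_size=chunk_size)


class FakeQuerySet:
    """Stand-in for a Django QuerySet so the sketch runs without a database."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.seen_chunk_size = None

    def iterator(self, chunk_size=None):
        # Record what the caller asked for; a real QuerySet would use this
        # value to size the batches it retrieves from the database driver.
        self.seen_chunk_size = chunk_size
        return iter(self.rows)


if __name__ == "__main__":
    qs = FakeQuerySet(["a.txt", "b.txt", "c.txt"])
    for name in iter_files(qs, chunk_size=2):
        print(name)
    print("chunk_size passed to iterator():", qs.seen_chunk_size)
```

Making the parameter optional keeps current callers working while letting deployments with very large packages choose a smaller batch size; whether a smaller value resolves the 2006 error would still need to be confirmed against an affected package.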

Describe alternatives you've considered
Leaving the default chunk size unchanged still causes issues for large packages (file count > 20,000).

Additional context
Example error encountered when iterating over a very large package. The traceback shows the failure in package.files() during queryset iteration:

  File "/usr/lib/archivematica/MCPServer/server/jobs/client.py", line 215, in submit_tasks
    for file_replacements in self.package.files(
  File "/usr/lib/archivematica/MCPServer/server/packages.py", line 663, in files
    for file_obj in queryset.iterator():
  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/django/db/models/query.py", line 516, in _iterator
    yield from iterable
  File "/usr/share/archivematica/virtualenvs/archivematica/lib64/python3.9/site-packages/MySQLdb/cursors.py", line 95, in _discard
    while con.next_result() == 0:  # -1 means no more data.
MySQLdb.OperationalError: (2006, '')

For Artefactual use:

Before you close this issue, you must check off the following:

  • All pull requests related to this issue are properly linked
  • All pull requests related to this issue have been merged
  • A testing plan for this issue has been implemented and passed (testing plan information should be included in the issue body or comments)
  • Documentation regarding this issue has been written and merged
  • Details about this issue have been added to the release notes
