Approach 2: Asynchronous Staged Deletion (Recommended)
Enhance the existing collection_delete_task to handle document cleanup in stages:
Stage 1: Mark collection as DELETED (current behavior)
Stage 2: Mark all documents as DELETED
Stage 3: Delete document indexes
Stage 4: Clean up source files
Stage 5: Final cleanup of database records
Pros:
Non-blocking
Better failure handling
Progress tracking
Resource cleanup can be retried
Scalable to large collections
Cons:
Temporary inconsistency
More complex implementation
Requires careful status tracking
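The careful status tracking this approach requires can be as simple as an ordered stage marker persisted per collection. Below is a minimal sketch; DeletionStage and advance_stage are hypothetical names for illustration, not existing code:

```python
from enum import Enum

class DeletionStage(Enum):
    """Hypothetical per-collection deletion progress marker."""
    COLLECTION_MARKED = 1   # Stage 1: collection row marked DELETED
    DOCUMENTS_MARKED = 2    # Stage 2: all documents marked DELETED
    INDEXES_DELETED = 3     # Stage 3: document indexes removed
    FILES_CLEANED = 4       # Stage 4: source files removed
    RECORDS_PURGED = 5      # Stage 5: database records purged

def advance_stage(current: DeletionStage) -> DeletionStage:
    """Move to the next stage; raise if deletion is already complete."""
    if current is DeletionStage.RECORDS_PURGED:
        raise ValueError("deletion already complete")
    return DeletionStage(current.value + 1)
```

Persisting the stage alongside the collection record lets a retried task resume from the last completed stage instead of starting over.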
Approach 3: Lazy Cleanup with Background Job
Keep current deletion behavior but add a periodic cleanup job:
Mark resources as DELETED (current behavior)
Run periodic job to find and clean up orphaned files
Use timestamp-based cleanup strategy
Pros:
Simple implementation
Low impact on main operations
Can batch cleanup operations
Cons:
Delayed cleanup
Resources held longer than necessary
More complex monitoring needed
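The timestamp-based strategy above can be sketched as a filter that only purges resources whose DELETED mark has aged past a grace period. The resource shape and GRACE_PERIOD_SECONDS are assumptions for illustration:

```python
import time

# Hypothetical grace period: only purge resources marked DELETED
# at least this many seconds ago.
GRACE_PERIOD_SECONDS = 24 * 3600

def find_purge_candidates(resources, now=None):
    """Return resources whose DELETED mark is older than the grace period.

    Each resource is assumed to be a dict with 'status' and
    'deleted_at' (epoch seconds) keys.
    """
    now = time.time() if now is None else now
    return [
        r for r in resources
        if r["status"] == "DELETED"
        and now - r["deleted_at"] >= GRACE_PERIOD_SECONDS
    ]
```

A periodic job would run this query, delete the files for each candidate, and then purge the database records.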
Implementation Details (for Approach 2)
Enhance collection_delete_task:
```python
@app.task(bind=True)
def collection_delete_task(self, collection_id: str) -> Any:
    # Stage 1: Current collection deletion logic
    # Stage 2: Get all documents and mark as deleted
    documents = document_service.get_collection_documents(collection_id)
    for doc in documents:
        document_service.delete_document(doc.user, collection_id, doc.id)
    # Stage 3 & 4: Delete indexes and source files (handled by delete_document)
    # Stage 5: Final cleanup
    cleanup_collection_records(collection_id)
```
Add status tracking for deletion stages
Implement retry mechanisms for each stage
Add monitoring and logging for cleanup progress
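The per-stage retry mechanism can be framework-agnostic; Celery tasks also get this behavior natively via self.retry together with max_retries. A minimal sketch with a hypothetical retry_stage helper:

```python
import time

def retry_stage(fn, attempts=3, delay=0.0):
    """Run one cleanup stage, retrying on any failure.

    Re-raises the last exception if all attempts fail, so the
    surrounding task can mark the stage as failed.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # real code would catch narrower errors
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```

Combined with a persisted stage marker, a retried stage is safe as long as each stage's cleanup is idempotent.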
Questions to Consider
Consistency Requirements:
How strict should the consistency between collection and document status be?
Should we allow querying deleted collections/documents during deletion?
Recovery Strategy:
How to handle partial failures during deletion?
Should we implement an "undelete" feature within a time window?
Performance Impact:
How to handle deletion of large collections?
Should we implement batching for large deletions?
Compliance:
Are there regulatory requirements for data deletion timing?
Do we need to maintain deletion audit logs?
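If batching is adopted for large collections, it could look like the following sketch; batched and delete_documents_in_batches are hypothetical helpers, and delete_one stands in for document_service.delete_document:

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def delete_documents_in_batches(document_ids, delete_one, batch_size=100):
    """Delete documents in fixed-size batches; return the number deleted.

    `delete_one` stands in for document_service.delete_document.
    """
    deleted = 0
    for batch in batched(document_ids, batch_size):
        for doc_id in batch:
            delete_one(doc_id)
        deleted += len(batch)
        # A real task could checkpoint progress here for resumability.
    return deleted
```

Fixed-size batches bound memory use and give natural checkpoints for progress tracking and retries.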
Success Criteria
No orphaned files remain after collection deletion
Current Behavior
When a collection is deleted:
collection_delete_task is triggered asynchronously.
Source files remain in place (the objects directory, or object storage like S3).
Problem Impact
Orphaned documents, indexes, and source files accumulate after collections are deleted.
Proposed Solutions
Approach 1: Synchronous Cascade Delete
Delete all child resources (documents, indexes, files) within the same transaction as the collection deletion.
Pros:
Simple to implement and immediately consistent
Cons:
Blocks the request; a single large transaction does not scale to large collections
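Under this approach, deletion could look like the following sketch, assuming a context-manager transaction API; db and every delete_* helper here are hypothetical names, not the existing service layer:

```python
def delete_collection_sync(db, collection_id: str) -> None:
    """Delete a collection and all child resources in one transaction."""
    with db.transaction():
        for doc in db.get_collection_documents(collection_id):
            db.delete_document_index(doc.id)
            db.delete_source_files(doc.id)
            db.delete_document_record(doc.id)
        db.delete_collection_record(collection_id)
```

Everything succeeds or rolls back together, which is where both the consistency benefit and the long-transaction cost come from.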
Related Issues
Next Steps