Bugfix: Resolve race condition with FIFO cleanup and child process termination #2636
Open
robsyme wants to merge 2 commits into alexdobin:master
Author
@felixschlesinger - this is my best guess at the race condition bug. If you have an opportunity - would it be possible to compile STAR from this branch and try it in your container?
Contributor
Thanks @robsyme, we'll try this. AFAIK relying on closing the fifo is not sufficient, because the samtools command might only start writing to it after main STAR already closed it (race condition).
Problem Description
Users encountered a race condition in which STAR deleted FIFO files while child processes (running samtools view via --readFilesCommand) were still writing to them. The symptom was unclosed file descriptors: in lsof output, the child process holds a file descriptor to a deleted FIFO, indicating improper synchronization between process termination and file cleanup.
Root Cause Analysis
Primary Issue: Incorrect Cleanup Ordering
The main problem was in STAR.cpp: the cleanup sequence was backwards, so sysRemoveDir() could run before the child processes had released their file descriptors.
Secondary Issue: Inadequate Process Synchronization
In Parameters_closeReadsFiles.cpp, the process cleanup was too aggressive.
The Race Condition Window
Timeline:
Solution
Fix 1: Correct Cleanup Ordering (STAR.cpp)
Fix 2: Proper Process Synchronization (Parameters_closeReadsFiles.cpp)
Why This Fixes The Race Condition
Impact
The combination of correct ordering and proper synchronization ensures that sysRemoveDir() only executes after all child processes have properly released their file descriptors, eliminating the race condition entirely.
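For readers who want to see the ordering in isolation, here is a minimal POSIX sketch of the pattern described above: close the parent's end of the FIFO, wait for the child to exit, and only then remove the FIFO and its directory. This is not STAR's code. The function name runOrderedCleanupDemo, the /tmp path, and the child that stands in for the --readFilesCommand process are all assumptions made for illustration.

```cpp
#include <cerrno>
#include <string>

#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo, not STAR source. Returns 0 when the child was reaped
// before the FIFO and its directory were removed, nonzero on any failure.
int runOrderedCleanupDemo() {
    char dirTemplate[] = "/tmp/fifo_demo_XXXXXX";  // assumed scratch location
    if (mkdtemp(dirTemplate) == nullptr) return 1;
    const std::string dir = dirTemplate;
    const std::string fifoPath = dir + "/reads.fifo";
    if (mkfifo(fifoPath.c_str(), 0600) != 0) {
        rmdir(dir.c_str());
        return 2;
    }

    const pid_t child = fork();
    if (child < 0) {
        unlink(fifoPath.c_str());
        rmdir(dir.c_str());
        return 3;
    }
    if (child == 0) {
        // Child: stand-in for the external command that writes into the FIFO.
        signal(SIGPIPE, SIG_IGN);  // write() then fails with EPIPE instead of killing us
        const int wfd = open(fifoPath.c_str(), O_WRONLY);  // blocks until a reader opens
        if (wfd < 0) _exit(1);
        char block[4096] = {0};
        for (int i = 0; i < 1024; ++i) {
            if (write(wfd, block, sizeof block) < 0) break;  // reader went away
        }
        close(wfd);
        _exit(0);
    }

    // Parent: read a little, then stop early, as a consumer that is done might.
    const int rfd = open(fifoPath.c_str(), O_RDONLY);  // blocks until the child opens
    if (rfd < 0) {
        kill(child, SIGKILL);
        waitpid(child, nullptr, 0);
        unlink(fifoPath.c_str());
        rmdir(dir.c_str());
        return 4;
    }
    char buf[4096];
    const ssize_t got = read(rfd, buf, sizeof buf);

    // Step 1: close our end so the child's next write fails and it exits.
    close(rfd);

    // Step 2: wait until the child has exited and released its descriptors.
    int status = 0;
    pid_t reaped;
    do {
        reaped = waitpid(child, &status, 0);
    } while (reaped < 0 && errno == EINTR);

    // Step 3: only now remove the FIFO and the directory that held it.
    const bool removed =
        unlink(fifoPath.c_str()) == 0 && rmdir(dir.c_str()) == 0;
    const bool childOk =
        reaped == child && WIFEXITED(status) && WEXITSTATUS(status) == 0;
    return (got > 0 && childOk && removed) ? 0 : 5;
}
```

Calling runOrderedCleanupDemo() from a small main() should return 0 on a typical Linux system. Moving step 3 ahead of step 2 would recreate the situation from the problem description, where a live child still holds a descriptor to a deleted FIFO. The sketch does not cover the case raised in the conversation, where the child has not yet opened the FIFO when the parent closes it.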