Address a deadlock issue in multi-GPU scenario#1407
Address a deadlock issue in multi-GPU scenario#1407mangupta wants to merge 1 commit intoROCm:clang_tot_upgradefrom
Conversation
mangupta
commented
Mar 4, 2020
- Fixes SWDEV-219322
|
The error message for the failing tests is curious. |
|
I kicked off a unit test run on my local system to see if I could reproduce the failures. |
| Kalmar::HSADevice* device = static_cast<Kalmar::HSADevice*>(getDev()); | ||
|
|
||
| wait(); | ||
|
|
There was a problem hiding this comment.
Some comment would help. It seems calling 'wait()' is equivalent to holding qmutex and wait_no_lock() which is what is lines 4185 below after locking rocrQueuesMutex. Does this fix imply that rocrQueuesMutex should NOT be held before qmutex as it may cause deadlock? If so, shouldn't line 4185 locking of qmutex be removed?
Or, could moving acquiring of rocrQueuesMutex to just before calling removeRocrQueue() help?
There was a problem hiding this comment.
wait_no_lock() could potentially call EnqueueMarkerNoLock(). If the HSAQueue does not hold a rocr queue, it will end up calling createOrstealRocrQueue() and lock rocrQueuesMutex.
Moving the lock on rocrQueuesMutex until just before line 4206 might work, too.
That's a known problem that @david-salinas will address |
|
@jeffdaily do we still need an explicit wait if we clear the asyncops vector? |
Isn't the |