Parallel merger mpimpi2prv hangs in a deadlock #141

@valentin-seitz

When merging a trace with the parallel merger of Extrae 4.2.12 (and probably newer and older versions), e.g. with mpimpi2prv -f TRACE.mpits on 308 processors, the merge gets stuck.

I used MUST to confirm the observed deadlock and to get some hints on where it's happening:

mpi2prv: Error! Found unmatched communication! Continuing...
mpi2prv: Progress ... 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% done
mpi2prv: Error! Found 2874 unmatched communications. Resulting tracefile may be inconsistent.
mpi2prv: Error! Found 94610 pending communications. Resulting tracefile may be inconsistent.
[MUST-RUNTIME] ============MUST===============
[MUST-RUNTIME] ERROR: MUST detected a deadlock, writing output.===============================
[MUST-RUNTIME] ============MUST===============
[MUST-RUNTIME] ERROR: MUST detected a deadlock, detailed information is available in the MUST output file. You should either investigate details with a debugger or abort, the operation of MUST will stop from now.
[MUST-RUNTIME] ===============================
[MUST-RUNTIME] ----Deadlock detection timing ----
[MUST-RUNTIME] syncTime=3500366
[MUST-RUNTIME] wfgGatherTme=323
[MUST-RUNTIME] preparationTime=1448
[MUST-RUNTIME] wfgCheckTime=1797
[MUST-RUNTIME] outputTime=25416
[MUST-RUNTIME] dotTime=0

The offending MPI_Recv is located in

res = MPI_Recv (&tmp, 1, MPI_INT, my_master, ASK_MERGE_REMOTE_BLOCK_TAG, MPI_COMM_WORLD, &s);

For this execution, it seems that process 26 was waiting in that MPI_Recv for a message from process 0, which was itself waiting in a barrier:

res = MPI_Barrier (MPI_COMM_WORLD);
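
To make the reported wait state easier to reason about, here is a minimal sketch in plain C (no MPI needed) that models it. Everything in it is hypothetical and for illustration only: the `rank_state` type, `is_deadlocked` and `build_reported_scenario` are not part of MUST or Extrae, and the check is far simpler than a real wait-for-graph analysis. It also assumes that every rank other than 26 has reached the barrier, which the MUST output only confirms for process 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what a rank is currently doing.
   This is not MUST or Extrae code. */
typedef enum { RUNNING, IN_RECV, IN_BARRIER } op_kind;

typedef struct {
    op_kind kind;
    int peer; /* source rank for IN_RECV, -1 otherwise (informational only) */
} rank_state;

/* Simplified check: if any rank is still RUNNING, it might yet send a
   message or enter the barrier, so nothing is reported. If every rank is
   blocked, the only way out is a barrier that all ranks have entered;
   any other combination can never make progress. */
bool is_deadlocked(const rank_state *ranks, int n)
{
    int in_barrier = 0;
    for (int i = 0; i < n; i++) {
        if (ranks[i].kind == RUNNING)
            return false;
        if (ranks[i].kind == IN_BARRIER)
            in_barrier++;
    }
    return n > 0 && in_barrier != n;
}

/* Assumed scenario: rank 26 blocks in a receive from rank 0, while all
   other ranks (including rank 0) wait in the barrier. */
void build_reported_scenario(rank_state *ranks, int n)
{
    assert(n > 26);
    for (int i = 0; i < n; i++) {
        ranks[i].kind = IN_BARRIER;
        ranks[i].peer = -1;
    }
    ranks[26].kind = IN_RECV;
    ranks[26].peer = 0;
}
```

With 308 ranks set up by `build_reported_scenario`, `is_deadlocked` reports a deadlock: the barrier cannot complete without rank 26, and rank 26 cannot leave its receive because rank 0 is stuck in the barrier and will not send. If rank 26 were in the barrier instead, the check would no longer report one.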

I attached the whole MUST output in case you need it to debug the case :)

extrae-merger-hangs.zip
