Loci communicator refactor #335
… for LOCI_STRICT_COMM define to force users to pass communicator into Loci APIs
|
@cfdrcmpgale These are the changes to Loci to allow scheduling under an MPI communicator other than MPI_COMM_WORLD. You have to be careful with the solver, as you might be using MPI_COMM_WORLD without realizing it. To keep the API compatible, some calls have a final argument that is the MPI communicator but defaults to MPI_COMM_WORLD. You can define LOCI_STRICT_COMM to change the API to require explicit communicators, which you can use to figure out what changes you need to make to your code. You can look at the changes to the quickTest/FVM code to see how that works. I have only tested that this works with the FVM code; I am not sure whether there are problems with advanced features like FVMOverset or FVMAdapt. Those haven't been tested using the sub communicator. |
|
@EdwardALuke I presume you also went and removed all of the MPI_COMM_WORLD's from chem? I see it in the latest version we have from you all in the linear solvers/etc. |
|
@rlfontenot I have not done that for CHEM or flowPsi yet. Some care may need to be taken to make sure we maintain compatibility with the 4.1 stable release of Loci. This is very much a work in progress. But since the changes touch so much of the code base, we will probably want several pull requests, to avoid developing an orphaned branch that is too difficult to merge back into the dev branch. Right now I want feedback on whether this is going to meet the requirements. |
|
@EdwardALuke I reviewed the changes and the plan for adoption of this feature. We should not have any issue with the changes, since the default API behavior is to use MPI_COMM_WORLD and LOCI_STRICT_COMM points to all the locations in the solvers where a change is needed. This meets the initial requirements. Now, how do you envision the fact_db working for loaded applications with separate communicators? |
|
@cfdrcmpgale Ah, so I suspected that this might not be what you want. If you want Loci to coordinate between different meshes loaded on different processors, then communicators are not what you are after. What this would allow is for Loci to make independent schedules on independent communicators, but there would be no way for Loci to directly coordinate communication between them. If you wanted the different codes to talk to each other through maps or similar constructs, then all of the data would need to live in the same communicator. In that case you just want a facility that can read a mesh data structure onto a subset of the processors. Note that if the applications don't follow the same iteration behavior, you might be idling processors while Loci schedules one phase of computations, because of an implicit synchronization that happens at the level of iterations. Where the current feature set would be useful is if we wanted to create one subset of processors to run an independent application, say a structures simulation, and another subset to run, say, chem. Then the two codes could run independently until some external-to-Loci solver coordination infrastructure allows inter-communicator communication. |
|
@EdwardALuke Yes, the use case that we discussed was more around Loci solvers driving Loci applications on a subset of CPUs. But I do see the fundamental issue: for communication to occur, there would need to be a way of communicating contact surface information across groups. Face contact maps would be significantly more difficult, if possible at all. I see now that the target of the current feature is synchronization handled externally, with non-Loci-based solvers driving simulations with Loci-based solvers (as with Loci/RTE). Thanks! |
|
@cfdrcmpgale Would just having a feature where the vog file reader could be directed to read the database onto only a subset of processors be sufficient to meet your needs, or do you want to decouple models in a more significant way? It should be relatively simple to add that feature, and it would probably be the most effective way to deal with running a coupled problem where one part is very small but the other, larger part needs to run on a large number of processors. But this wouldn't solve other coordination problems. |
|
@EdwardALuke You are suggesting keeping the same communicator for the loaded application but having the vog information distributed only on a subset of processors? In this case, would it be equivalent to having empty domain partitions on the "extra" cores? The face contact maps would likely work with that approach. |
This is a major refactoring to allow Loci schedules to be generated inside a localized MPI communicator. The changes are broad-ranging but straightforward. To use this, you create a communicator under which you want to run Loci schedules, and set it as the default communicator for Loci with the call
Loci::SetDefaultComm(sub_comm) ;
When you create the fact database set the communicator it uses with code such as
fact_db facts ;
facts.set_comm(sub_comm) ;
Now when you generate schedules they will do communication in the sub_comm communicator. Note, however, that you need to be careful if you use MPI_COMM_WORLD anywhere in your code, as this could cause deadlocks. To check whether you need to change any parts of your program, you can define LOCI_STRICT_COMM, which disables the API calls whose communicator argument defaults to MPI_COMM_WORLD. Set this by changing the MISC line in sys.conf to include -DLOCI_STRICT_COMM
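For readers unfamiliar with the pattern, here is a minimal sketch of how a compile-time define can turn a defaulted communicator argument into a required one. It is not Loci's actual implementation: `Comm`, `WORLD_COMM`, `DEMO_STRICT_COMM`, `DEMO_COMM_DEFAULT`, `comm_used`, and `demo` are all hypothetical names, and the communicator is a plain string stand-in so that the sketch compiles without MPI.

```cpp
#include <cassert>
#include <string>

// Stand-in for MPI_Comm so the sketch builds without MPI (assumption).
using Comm = std::string;
static const Comm WORLD_COMM = "MPI_COMM_WORLD";

// Hypothetical macro, not taken from Loci: it supplies the default
// communicator argument unless strict mode is requested at compile time
// (for example with -DDEMO_STRICT_COMM).
#ifdef DEMO_STRICT_COMM
#define DEMO_COMM_DEFAULT
#else
#define DEMO_COMM_DEFAULT = WORLD_COMM
#endif

// Hypothetical API call whose final argument is the communicator.
// It simply reports which communicator the call would use.
Comm comm_used(const Comm &comm DEMO_COMM_DEFAULT) {
  return comm;
}

// Exercises both call styles.
void demo() {
  const Comm sub_comm = "sub_comm";
  // Passing the communicator explicitly compiles in both modes.
  assert(comm_used(sub_comm) == sub_comm);
#ifndef DEMO_STRICT_COMM
  // Relying on the default compiles only when strict mode is off.
  // In strict mode this call would be a compile error, which is how
  // implicit uses of the world communicator get flagged.
  assert(comm_used() == WORLD_COMM);
#endif
}
```

Under these assumptions, building with the strict define makes every call that omits the communicator fail to compile, so the compiler's error list doubles as a checklist of call sites to update.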