ISP: A Tool for Model Checking MPI Programs
School of Computing, University of Utah

1  Overview

ISP stands for In-Situ Partial order. Instead of building a model of MPI programs, ISP directly executes the MPI programs to detect deadlocks and assertion failures. ISP reduces the number of interleavings it must explore using DPOR (Dynamic Partial Order Reduction).

2  ISP Poster

isp.pdf

3  Downloads

Here is the link to the tool and the test cases. Test cases can be found in ISP/tests, ISP/Umpire_tests, and ISP/misc_tests.

isp.tar.gz

4  Deadlocks Detected

ISP was successful in detecting deadlocks in three MPI programs. The buggy MPI programs with deadlocks and the fixed programs are available with the software distribution of ISP.

4.1  Parallel Trapezoidal Rule

The buggy code is available here.
trap
The deadlock in this program occurs at the MPI_Recv on line 74, where the process with rank 0 posts an MPI_Recv from itself without a matching MPI_Send.
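
The pattern that triggers the deadlock can be sketched as follows (a minimal illustration of the self-receive, not the trap source itself; buffer and tag values are placeholders):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Blocking receive from rank 0 itself: no matching MPI_Send
               is ever posted, so this call never returns. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }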

The fixed code is available here.
trap_fixed

4.2  Monte Carlo computation of PI

The buggy code is available here.
MonteCarlo
This MPI program requires 4 or more processes to execute. For n processes, the processes with ranks n−1 and n−2 are the coordinating processes, the process with rank n−3 is the server process, and the remaining processes are worker processes. The coordinating processes and the server process communicate only with the worker processes and never with each other. Coordinating process n−2 decides program termination randomly and communicates this decision to the worker processes. The worker processes in turn communicate it to the server process and to coordinating process n−1.
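
This structure can be summarized with the following role-dispatch sketch (the function names are hypothetical stand-ins, not the MonteCarlo source):

    #include <mpi.h>

    /* Hypothetical role stubs; in the real program each contains the
       communication loop described in the text below. */
    static void coordinator(void) { /* receive/send loop with workers */ }
    static void server(void)      { /* receive/send loop with workers */ }
    static void worker(void)      { /* sampling plus messaging */ }

    int main(int argc, char **argv)
    {
        int nprocs, rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == nprocs - 1 || rank == nprocs - 2)
            coordinator();            /* coordinating processes */
        else if (rank == nprocs - 3)
            server();                 /* server process */
        else
            worker();                 /* worker processes */

        MPI_Finalize();
        return 0;
    }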

The first deadlock arose because the two coordinating processes never exited their infinite loop, in which they first do an MPI_Recv from the worker processes and then an MPI_Send back to them. When coordinating process n−2 randomly decides to terminate, it communicates this to the worker processes, which pass it on to the server. The worker and server processes then terminate, but the coordinating processes deadlock on MPI_Recv. This was fixed by having the worker processes communicate the termination decision to coordinating process n−1 so that it can exit its loop; coordinating process n−2 can exit on its own since it generated the termination decision.
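
In outline, the coordinator loop that hangs looks like the following (a sketch filling in the coordinator() stub from the earlier skeleton, with hypothetical tag and variable names):

    /* Coordinator loop sketch (hypothetical names).  Once every worker
       has exited, nothing will ever match the blocking MPI_Recv below,
       so the coordinator hangs. */
    static void coordinator(void)
    {
        for (;;) {
            int request, reply = 0;
            MPI_Status status;
            MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
            MPI_Send(&reply, 1, MPI_INT, status.MPI_SOURCE, 0,
                     MPI_COMM_WORLD);
            /* Fix: break out of this loop when a worker forwards the
               termination decision to coordinating process n-1. */
        }
    }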

The second deadlock arose because the worker process with rank 0 sends the termination value even when that value is false. This unbalances the MPI_Send and MPI_Recv calls between the server and the workers, causing a deadlock. It was fixed by changing the if condition in the worker process to check the termination condition before invoking the MPI_Send.
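
The worker-side pattern and its fix can be sketched as follows (hypothetical names; server_rank stands for rank n−3 in the description above):

    /* Worker-side sketch (hypothetical names).  The buggy version sent
       the termination value unconditionally; the fixed version sends
       only when termination has actually been decided, keeping sends
       and receives balanced. */
    static void notify_server(int terminate, int server_rank)
    {
        if (terminate)
            MPI_Send(&terminate, 1, MPI_INT, server_rank, 0,
                     MPI_COMM_WORLD);
    }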

The third deadlock arose because the server performs only a single wildcard receive and a single send. This mismatches sends and receives when more than one worker is present (i.e., n > 4), causing a deadlock. It was fixed by adding a for loop in the server process code similar to the ones in the coordinating processes.
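
The fixed server loop is along these lines (a sketch with hypothetical names; nworkers would be n−3 here):

    /* Server-side sketch (hypothetical names).  The buggy version issued
       a single wildcard receive and a single send; with more than one
       worker the remaining sends have no matching receive.  The fix
       loops over every worker, mirroring the coordinator code. */
    static void server(int nworkers)   /* nworkers = n - 3 */
    {
        int i, request, reply = 0;
        MPI_Status status;
        for (i = 0; i < nworkers; i++) {
            MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
            MPI_Send(&reply, 1, MPI_INT, status.MPI_SOURCE, 0,
                     MPI_COMM_WORLD);
        }
    }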

The code with the fixed deadlocks is available here. MonteCarlo_fixed

4.3  Byte-range Locking

The deadlock and its fixes are provided in detail here.

EuroPVM06.pdf
