MPI collective debugger extension proposal

Overview

The current MPI debugger interface is used to export information from a running application to a debugger. It allows the debugger to look at an MPI process, to iterate over the communicators within that process, and to view the message queues associated with a communicator.

I propose an extension to export additional information about individual communicators within a process, in particular information about collective operations (MPI_Bcast, MPI_Reduce et al.).

Implementation

Specifically, I propose adding a communicator-specific counter for each possible collective, where the counter simply records the number of times the collective has been called on this communicator. Alongside this, a second piece of data is kept: whether the process is still performing the collective operation.
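
As a minimal sketch of the bookkeeping (the structure and field names below are hypothetical illustrations, not part of the proposal), one workable layout keeps an entry counter and an exit counter per collective class, from which both exported values can be derived:

/* Hypothetical per-communicator bookkeeping; names are illustrative
 * only.  One slot per mqs_comm_class value (14 in the enum below). */
#define MQS_COLL_MAX 14

struct comm_coll_state {
  unsigned int entered[MQS_COLL_MAX];  /* incremented on collective entry */
  unsigned int exited[MQS_COLL_MAX];   /* incremented on collective exit  */
};

/* Exported values: the call count is entered[c]; the operation is
 * still active on this process whenever entered[c] != exited[c]. */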

A new enum, mqs_comm_class, is added to the interface with a value for each collective call.

A single extra callback function, mqs_get_comm_coll_state, is added to the interface; it queries the current communicator in the same way as mqs_next_operation. The function takes the standard process parameter, an mqs_comm_class value selecting which collective to query, and two int pointers: the first should be set to the number of calls made to the collective, and the second should be set to zero or one depending on whether the collective operation is still active.

A successful call to mqs_get_comm_coll_state should return mqs_ok, with mqs_no_information being returned where the information isn't available. This allows further enum values to be added in the future, should the MPI Forum approve new collective functions, without needing to change the debugger function interface.
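
As an illustrative, non-normative sketch of how a debugger might consume this (the helper name show_coll_state is mine, and it assumes the communicator of interest has already been selected via the existing communicator iterator calls):

#include <stdio.h>
#include "mpi_interface.h"

/* Hypothetical debugger-side helper: print the state of every
 * collective class on the currently selected communicator. */
static void show_coll_state (mqs_process *process)
{
  int cls;

  for (cls = mqs_comm_barrier; cls <= mqs_comm_scatterv; cls++) {
    int count  = 0;
    int active = 0;
    int res = mqs_get_comm_coll_state (process, cls, &count, &active);

    if (res == mqs_ok)
      printf ("collective %d: %d call(s)%s\n", cls, count,
              active ? ", in progress" : "");
    /* mqs_no_information: nothing recorded for this class; skip it. */
  }
}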

Performance Impact

Maintaining this data does add code to the "critical path" of the MPI stack. In its simplest form, all it requires is a pair of counter increments per collective call, one on function entry and one on function exit, so whilst there is a non-zero run-time cost associated with maintaining this information, it is a minimal one.
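
To make the cost concrete, here is a sketch of where the two increments might sit inside a collective entry point. comm_get_coll_state() and internal_barrier() are hypothetical stand-ins for an implementation's own communicator lookup and existing barrier path, using the struct sketched earlier; none of these names are real OpenMPI or MPICH2 functions.

#include <mpi.h>
/* Assumes struct comm_coll_state and mqs_comm_class as sketched above. */

int MPI_Barrier (MPI_Comm comm)
{
  struct comm_coll_state *s = comm_get_coll_state (comm); /* hypothetical */
  int ret;

  s->entered[mqs_comm_barrier]++;   /* increment on function entry */
  ret = internal_barrier (comm);    /* the existing collective code */
  s->exited[mqs_comm_barrier]++;    /* increment on function exit   */

  return ret;
}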

mpi_interface.h

The additions required to mpi_interface.h are shown below.

typedef enum
{
  mqs_comm_barrier,
  mqs_comm_broadcast,
  mqs_comm_allgather,
  mqs_comm_allgatherv,
  mqs_comm_allreduce,
  mqs_comm_alltoall,
  mqs_comm_alltoallv,
  mqs_comm_reduce_scatter,
  mqs_comm_reduce,
  mqs_comm_gather,
  mqs_comm_gatherv,
  mqs_comm_scan,
  mqs_comm_scatter,
  mqs_comm_scatterv
} mqs_comm_class;

/***********************************************************************
 * Collective extension
 *
 * This extension should be considered optional and the debugger should
 * correctly handle the case where it doesn't exist.
 *
 */

/*
 * Return the state of collective operations for the currently active
 * communicator, that is, the number of times the collective has been
 * called and whether the operation is still in progress.
 * 
 * The first int is *really* mqs_comm_class.
 */
extern int mqs_get_comm_coll_state (mqs_process *process, int coll_class,
                                    int *count, int *active);

Benefits

The extension allows a debugger or external program to know the state of collective calls within the parallel program. In the typical scenario of debugging a hung application, this knowledge lets the debugger and programmer see instantly which processes are stuck in collective calls and which aren't, either because they have successfully made the collective call and returned, or because they haven't made the calls that other ranks in the communicator have. This allows swift identification of the problem areas within the job where further investigation may be required.
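
For example (illustrative output only, not a format the interface mandates), a hung four-rank job might be summarised as:

  rank 0: MPI_Bcast count 42, in progress
  rank 1: MPI_Bcast count 42, in progress
  rank 2: MPI_Bcast count 42, in progress
  rank 3: MPI_Bcast count 41, not in progress

Here rank 3 has never entered the 42nd broadcast, immediately pointing the investigation at that rank.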

This extension was originally developed in early 2007 whilst I was working at Quadrics, and it has proved its value numerous times in real-life cases.

Sample Implementation

At this time a sample implementation is available for OpenMPI only, although work is being done on an MPICH2 version.

Patch for OpenMPI v1.7