Stall in some multi-node parallel calculations

Hello,

I am trying to solve a problem in parallel with FreeFem++ 4.9, installed on a cluster where each node has 2 Intel Broadwell processors with 14 cores each and either 128 or 256 GB of RAM.
Everything works fine when I run the code in parallel on several processors of the same single node (e.g. 1 node, 16 procs), but sometimes a problem occurs when I run the exact same code on several nodes (e.g. 2 nodes, 8 procs each).

The code starts like this:

verbosity = 0;
load "msh3"
load "PETSc"        
load "gmsh"

int[int] n2oSaved;
int[int] n2oLoaded;

macro dimension()3// EOM         
include "macro_ddm.idp"             

macro def(i)[i, i#B, i#C, i#D]//    
macro init(i)[i, i, i, i]// 
macro grad(u)[dx(u), dy(u), dz(u)]// 
macro div(u)(dx(u) + dy(u#B) + dz(u#C))//
macro div1(u)(dx(u#1) + dy(u#2) + dz(u#3))// 
macro UgradV(u, v)[[u#1, u#2, u#3]' * [dx(v#1), dy(v#1), dz(v#1)],
                   [u#1, u#2, u#3]' * [dx(v#2), dy(v#2), dz(v#2)],
                   [u#1, u#2, u#3]' * [dx(v#3), dy(v#3), dz(v#3)]]//

func Pk = [P1b,P1b,P1b, P1];

mesh3 Th = readmesh3("mesh.mesh");

Mat A;
macro ThN2O()n2oSaved//
createMat(Th, A, Pk);

(All the rest of the code is commented out.)

If I run this code on 2 nodes, the calculation seems to stall and no output is written to the log, but if I remove the last line (createMat(…);), the calculation runs correctly.
On 1 node, it runs correctly both with and without the last line.

Do you think the problem has to do with the code, with the cluster, or with something else?

Thank you!

Hi,
My naive guess is that the mesh.mesh file is not accessible from one of the nodes. You should check whether that is the case.

If the problem comes from reading the mesh, you can either make sure the file is accessible from all nodes, or read it on the master process and then broadcast it to the others… Something like

mesh3 Th;
if(!mpirank) { Th = readmesh3("mesh.mesh"); }
broadcast(processor(0, mpiCommWorld), Th);

I hope this helps

Also, are you using the same MPI implementation for launching the code (e.g., mpirun) and compiling FreeFEM (e.g., mpic++)?
If you share your mesh, I can try to launch the code and confirm whether the problem comes from your side (or not).

Hi,
Thank you very much for the prompt reply.

  • During installation, I think FF was compiled with mpicc. Now I launch jobs with these commands:
dirff=/path_to_FreeFem_sources/src/mpi/FreeFem++-mpi
MV2_ENABLE_AFFINITY=0 srun $dirff mycode.edp > log.txt
  • Thank you for proposing to try on your side, but I don’t seem to be able to upload any attachment (new user).

Thank you for your suggestion.

I tried it, but the behavior is still the same.

  • Which MPI implementation is used by the wrapper mpicc? Which MPI implementation is invoked by the launcher srun?
  • Either upload the mesh somewhere, send it in a private message (which may not be possible either), or send it via email.

Thank you.

  • I am not quite sure about mpicc. The installation was done by the people who manage the cluster. Would it be useful if I gave you the full list of commands used to install FreeFem? Or is there something I can do to check?
  • srun is the command used to run parallel jobs with the workload manager Slurm (i.e. to submit parallel jobs to the queuing system). So in my case I think it runs /path_to_FreeFem_sources/src/mpi/FreeFem++-mpi.

If the installation was not done by you, you need to forward my questions to the sysadmins to make sure that there is indeed no mismatch.
I know about srun, but again, you are not telling us which underlying MPI implementation it uses (srun can be used with OpenMPI/MPICH/IntelMPI/…), so we can’t tell whether there is indeed a mismatch between link time and run time.
Also, you could try to pass the number of MPI processes explicitly to srun, i.e., srun -n XY FreeFem++-mpi your.edp.

I forwarded your questions to the system administrators / support.

In the meantime I tried to force the number of processes with the -n option but it did not change the behavior.

I will send you the mesh via email.

Also, I forgot something. Could you just try a simple program such as:

mpiBarrier(mpiCommWorld);
cout << mpirank << "/" << mpisize << endl; 

If this runs fine on a single node but not on multiple nodes, then the issue definitely comes from your installation. We can try to figure this out alongside the sysadmins; you’ll need to send FreeFem-sources/config.log.

I can confirm I was able to run your .edp on up to 256 processes. Please try the above code and let me know if this deadlocks as well.
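
If the barrier test runs fine everywhere, it may also be worth trying a test that actually exchanges data between the ranks, since createMat communicates much more than a simple barrier. A minimal sketch, assuming the standard mpiAllReduce syntax:

mpiBarrier(mpiCommWorld);
real loc = mpirank;   // each rank contributes its own rank number
real glob = 0;
mpiAllReduce(loc, glob, mpiCommWorld, mpiSUM);  // actual data exchange, not just a synchronization
if(!mpirank) cout << "sum of ranks = " << glob << " (expected " << mpisize*(mpisize - 1)/2 << ")" << endl;

If this also hangs only when spanning two nodes, that would point to the MPI/interconnect setup rather than to the script itself.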

I tried this simple program, and it seems to run correctly on both 1 and 2 nodes. The output on 1 node * 4 processes is the following:

-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
    1 : mpiBarrier(mpiCommWorld);
    2 : cout << mpirank << "/" << mpisize << endl; sizestack + 1024 =1072  ( 48 )

0/4
times: compile 0s, execution 0.01s,  mpirank:0
 CodeAlloc : nb ptr  3814,  size :519944 mpirank: 0
Ok: Normal End
1/4
times: compile 0s, execution 0s,  mpirank:1
 CodeAlloc : nb ptr  3814,  size :519944 mpirank: 1
2/4
times: compile 0.01s, execution 0s,  mpirank:2
 CodeAlloc : nb ptr  3814,  size :519944 mpirank: 2
3/4
times: compile 0s, execution 0s,  mpirank:3
 CodeAlloc : nb ptr  3814,  size :519944 mpirank: 3

And the output on 2 nodes * 2 processes is the following:

-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
    1 : mpiBarrier(mpiCommWorld);
    2 : cout << mpirank << "/" << mpisize << endl; sizestack + 1024 =1072  ( 48 )

0/4
times: compile 0.01s, execution 0.15s,  mpirank:0
 ######## We forget of deleting   0 Nb pointer,   0Bytes  ,  mpirank 0, memory leak =526176
 CodeAlloc : nb ptr  3814,  size :519944 mpirank: 0
Ok: Normal End
2/4
times: compile 0.02s, execution 0.15s,  mpirank:2
 ######## We forget of deleting   0 Nb pointer,   0Bytes  ,  mpirank 2, memory leak =484688
 CodeAlloc : nb ptr  3814,  size :519944 mpirank: 2
1/4
times: compile 0.01s, execution 0.16s,  mpirank:1
 ######## We forget of deleting   0 Nb pointer,   0Bytes  ,  mpirank 1, memory leak =526176
 CodeAlloc : nb ptr  3814,  size :519944 mpirank: 1
3/4
times: compile 0.01s, execution 0.17s,  mpirank:3
 ######## We forget of deleting   0 Nb pointer,   0Bytes  ,  mpirank 3, memory leak =484656
 CodeAlloc : nb ptr  3814,  size :519944 mpirank: 3

OK, thank you for trying it and confirming that the problem comes from my installation. I will send you the config.log separately.

Could you please try to saturate the nodes? E.g., if you can run up to X processes per node, then run with -n X on one node and -n 2*X on two nodes. Also, if you add load "PETSc" at the beginning of the trivial .edp, is everything still running fine?
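
For reference, the trivial script with that extra line would simply be:

load "PETSc"
mpiBarrier(mpiCommWorld);
cout << mpirank << "/" << mpisize << endl;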

I added load "PETSc" at the beginning, and the code still ran fine on:

  • 1 node * 4 processes,
  • 2 nodes * 2 processes,
  • 1 node * 28 processes,
  • 2 nodes * 28 processes,
    where 28 is the max number of processes per node.

In the original code, does the problem still appear with 2 nodes and 2 processes?

The problem still appears on 2 nodes * 2 processes each (= 4 processes), but the code runs fine on 2 nodes * 1 process each (= 2 processes)!

In the meantime I got the reply from the sysadmin:

  • FF was compiled with the gcc compiler, using library files from MVAPICH2,
  • when I submit a FF job, the script contains a command that loads the same modules as those used during the installation (including gcc and mvapich2).

Do you have access to another MPI implementation? Could you launch an interactive job and attach a debugger to see where the code is deadlocking?

  • In principle, the cluster supports 3 MPI implementations: Intel MPI, MVAPICH2, and Open MPI. Would you recommend trying another one?
  • Yes, it’s possible to run interactive jobs; please let me check how to do it properly in parallel. What do you mean exactly by “attach a debugger”?

PETSc has debugging capabilities to help troubleshoot such issues. Once you have access to an interactive shell, you can add the command line argument -start_in_debugger:

MV2_ENABLE_AFFINITY=0 srun $dirff mycode.edp -start_in_debugger > log.txt

This will spawn terminal windows running gdb. Once everything is loaded, type c (as in “continue”) in both terminals (just use two processes on two nodes), and after a while, once the processes are stuck in the deadlock, press Ctrl+C in one of the windows and type bt (as in “backtrace”) to see where the code is stuck.