Stall in some multi-node parallel calculations

By the way, in case it helps: the example hpddm/stokes-2d-SLEPc.edp runs fine on both 2 nodes * 1 process and 2 nodes * 2 processes.

That does not make any kind of sense :grinning_face_with_smiling_eyes:
What if you replace readmesh by a simple cube(10, 10, 10)?

Indeed this is surprising.

So, if I replace readmesh by cube(10,10,10) in my code, it runs fine on 2 nodes * 1 process, but on neither 2 nodes * 2 processes nor 2 nodes * 8 processes.

The behavior with readmesh is slightly different: it runs fine on 2 nodes * 1 process and 2 nodes * 2 processes, but not on 2 nodes * 8 processes…

I will let you know when I manage to run interactively with a debugger.

Out of curiosity, is there a way to use gdb without a graphical display?

-start_in_debugger noxterm, but it gets slightly more tricky to decipher…
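For instance (just a sketch, assuming the usual srun/mpirun launcher on your cluster; PETSc runtime options like this one go after the .edp file):

srun FreeFem++-mpi yourcode.edp -start_in_debugger noxterm

With noxterm, each rank attaches a gdb in the current terminal instead of opening one xterm per process, so the output of all ranks ends up interleaved, hence harder to decipher.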

I realized that the behavior depends on how much memory I request, but also on whether I use the debugger or not.

Here is the current status:

  • launching non-interactively with a script, the code runs fine on 2x2 and 2x4 processes (whatever the memory requested, at least up to 250’000 MB/node), runs fine on 2x8 processes up to 190’000 MB/node, but freezes on 2x8 processes above 200’000 MB/node;
  • launching interactively without the debugger, the code runs fine on 2x4 processes (whatever the memory requested), runs fine on 2x8 processes with 150’000 MB/node, but freezes on 2x8 processes with 250’000 MB/node;
  • launching interactively with the debugger (and noxterm), the code runs fine on 2x8, 2x16 and 2x28 processes whatever the memory requested (at least up to 250’000 MB/node)…

And the mystery continues… I can’t do much more remotely, apart from saying that this looks like a configuration mismatch. You mentioned that you have access to other MPI implementations, so maybe, if it is not too much trouble for you, you could try to recompile from scratch with MPICH?

Yes, very mysterious.

I will see with the sysadmin if we can recompile from scratch with MPICH.

In the meantime he asked me to check, if the code uses MPI communication, that “every send has a matching and reachable recv”. Do you know how to do that?

You could check that… using a debugger :slight_smile: Maybe you have access to Arm DDT on your machine?
But this code is tested with hundreds of examples, and again, runs perfectly fine on multiple machines of mine with various numbers of processes. I must admit I’m not an MVAPICH2 user, so maybe there is a very weird corner case, but I’m using MPICH and IntelMPI (MVAPICH2 and IntelMPI are both based on MPICH), and I don’t have any issue with your code.
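For what it’s worth, a rough manual check (a sketch, assuming you can log in to a compute node while the job is stuck): attach gdb to each FreeFem++-mpi process and compare where the ranks are blocked; if one rank sits in a different MPI call than the others, that is your mismatch.

gdb -p <pid of a stuck FreeFem++-mpi process>
(gdb) bt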

Hi,
The sysadmin tested “my” code

// 
verbosity = 0;
load "msh3"
load "PETSc"        
load "gmsh"
int[int] n2oSaved;
int[int] n2oLoaded;
macro dimension()3// 
include "macro_ddm.idp"             // additional DDM functions
macro def(i)[i, i#B, i#C, i#D]//   
macro init(i)[i, i, i, i]// 
macro grad(u)[dx(u), dy(u), dz(u)]//
macro div(u)(dx(u) + dy(u#B) + dz(u#C))// 
macro div1(u)(dx(u#1) + dy(u#2) + dz(u#3))//
macro UgradV(u, v)[[u#1, u#2, u#3]' * [dx(v#1), dy(v#1), dz(v#1)],
                   [u#1, u#2, u#3]' * [dx(v#2), dy(v#2), dz(v#2)],
                   [u#1, u#2, u#3]' * [dx(v#3), dy(v#3), dz(v#3)]]//
func Pk = [P1b,P1b,P1b, P1];  // P1b (bubble) velocity components + P1 pressure
mesh3 Th = cube(10,10,10);    // unit cube mesh instead of readmesh
Mat A;                        // distributed PETSc matrix
macro ThN2O()n2oSaved//
createMat(Th, A, Pk);         // partition Th across processes and build the distributed structure of A

on 2 nodes * 2 processes and checked the backtrace of each process.
Here is the result for 3 of the 4 tasks:

#0  0x00002af01e80af30 in MPIDI_CH3I_MRAILI_Cq_poll_ib () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#1  0x00002af01e7e1350 in MPIDI_CH3I_read_progress () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#2  0x00002af01e7e0c24 in MPIDI_CH3I_Progress () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#3  0x00002af01e773efd in MPIC_Wait () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#4  0x00002af01e774651 in MPIC_Sendrecv () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#5  0x00002af01e4e967f in MPIR_Allgather_RD_MV2 () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#6  0x00002af01e4ecf7c in MPIR_Allgather_index_tuned_intra_MV2 () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#7  0x00002af01e4edb02 in MPIR_Allgather_MV2 () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#8  0x00002af01e4b6159 in MPIR_Allgather_impl () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#9  0x00002af01e4b6a66 in PMPI_Allgather () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#10 0x00002af026c61ae3 in HPDDM::Subdomain<double>::globalMapping<(char)67, int*, int> (this=0x382c2c0, first=0x391bc10, last=0x39240d8, start=@0x3750560: 0, end=@0x3750564: 0, 
    global=@0x7ffda4864e38: 58000672, d=0x39314a0, list=0x0) at /home/lanti/tmp/FF_install_develop/ff-petsc/r/include/HPDDM_subdomain.hpp:476
#11 0x00002af026cc9259 in distributedNumbering<int> (global=<optimized out>, last=@0x3750564: 0, first=@0x3750560: 0, in=<optimized out>, this=<optimized out>) at /home/lanti/tmp/FF_install_develop/ff-petsc/r/include/HPDDM_schwarz.hpp:1309
#12 PETSc::initPETScStructure<true, PETSc::DistributedCSR<HPDDM::Schwarz<double> >, (void*)0> (ptA=ptA@entry=0x3750520, bs=bs@entry=1, symmetric=symmetric@entry=PETSC_FALSE, ptD=ptD@entry=0x399ad98) at PETSc-code.hpp:126
#13 0x00002af026cc9f74 in PETSc::initCSR<HPDDM::Schwarz<double>, true>::E_initCSR::operator() (this=0x382b680, stack=0x37502f0) at PETSc-code.hpp:1272
#14 0x0000000000a14921 in ListOfInst::operator() (this=0x38192d0, s=0x37502f0) at AFunction2.cpp:794
#15 0x0000000000a14921 in ListOfInst::operator() (this=0x382a2f0, s=0x37502f0) at AFunction2.cpp:794
#16 0x0000000000a14921 in ListOfInst::operator() (this=0x3752820, s=0x37502f0) at AFunction2.cpp:794
#17 0x0000000000930255 in eval (this=0x7ffda48653f8, s=0x37502f0) at ./../fflib/AFunction.hpp:1486
#18 lgparse () at lg.ypp:360
#19 0x0000000000930a95 in Compile () at lg.ypp:848
#20 0x000000000093138c in mainff (argc=<optimized out>, argv=<optimized out>) at lg.ypp:1022
#21 0x00002af020a95495 in __libc_start_main () from /lib64/libc.so.6
#22 0x00000000009270c7 in _start ()

and for the 4th task:

#0  #   0x00002ab57d423f2d in MPIDI_CH3I_MRAILI_Cq_poll_ib () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#1  #   0x00002ab57d3fa350 in MPIDI_CH3I_read_progress () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#2  #   0x00002ab57d3fa0f5 in MPIDI_CH3I_Progress_test () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#3  #   0x00002ab57d36507c in PMPI_Waitany () from /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_E5v4_Mellanox/gcc-7.4.0/mvapich2-2.3.1-nvnz73udfwp4zqg47ttdrpodvvc2zvc5/lib/libmpi.so.12
#4  #   0x00002ab5858dee10 in HPDDM::Schwarz<double>::restriction (this=<optimized out>, D=D@entry=0x29b64c0) at /home/lanti/tmp/FF_install_develop/ff-petsc/r/include/HPDDM_schwarz.hpp:255
#5  #12 0x00002ab5858e2525 in PETSc::initPETScStructure<true, PETSc::DistributedCSR<HPDDM::Schwarz<double> >, (void*)0> (ptA=ptA@entry=0x27970f0, bs=bs@entry=1, symmetric=symmetric@entry=PETSC_FALSE, ptD=ptD@entry=0x2860898) at PETSc-code.hpp:113
#6  #13 0x00002ab5858e2f74 in PETSc::initCSR<HPDDM::Schwarz<double>, true>::E_initCSR::operator() (this=0x2871b90, stack=0x2796ec0) at PETSc-code.hpp:1272
#7  #14 0x0000000000a14921 in ListOfInst::operator() (this=0x285f7e0, s=0x2796ec0) at AFunction2.cpp:794
#8  #15 0x0000000000a14921 in ListOfInst::operator() (this=0x2870800, s=0x2796ec0) at AFunction2.cpp:794
#9  #16 0x0000000000a14921 in ListOfInst::operator() (this=0x27993f0, s=0x2796ec0) at AFunction2.cpp:794
#10 #17 0x0000000000930255 in eval (this=0x7ffed834a0d8, s=0x2796ec0) at ./../fflib/AFunction.hpp:1486
#11 #18 lgparse () at lg.ypp:360
#12 #19 0x0000000000930a95 in Compile () at lg.ypp:848
#13 #20 0x000000000093138c in mainff (argc=<optimized out>, argv=<optimized out>) at lg.ypp:1022
#14 #21 0x00002ab57f6ae495 in __libc_start_main () from /lib64/libc.so.6
#15 #22 0x00000000009270c7 in _start ()

So, according to the sysadmin:

As you can see, up to #12 (in reverse order) they are all doing the same thing. Then they diverge: 3 of the 4 tasks call an MPI_Allgather while the last task calls the restriction routine and then an MPI_Waitany. There are thus communications that never complete, hence the deadlock.
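(To illustrate the pattern only, this is a minimal plain-MPI C sketch, not the actual FreeFEM/HPDDM code: one rank waits on a message that is never sent while the others enter a collective, so no rank can make progress, which is the shape of the backtraces above.)

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int *recv = malloc((size_t)size * sizeof(int));
    if (rank != size - 1) {
        /* these ranks block inside the collective, like the 3 tasks stuck in MPI_Allgather */
        MPI_Allgather(&rank, 1, MPI_INT, recv, 1, MPI_INT, MPI_COMM_WORLD);
    } else {
        /* the last rank waits for a message with no matching send,
           like the task stuck in MPI_Waitany */
        MPI_Request req;
        int dummy;
        MPI_Irecv(&dummy, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    printf("rank %d done\n", rank); /* never reached: the program hangs */
    free(recv);
    MPI_Finalize();
    return 0;
}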

All the signs seem to point to a problem with [FF] (uninitialized variable maybe?) rather than with [the installation]. The MPI implementation used here is the same as the one used to compile the code, so it is not the problem. If it is a problem of uninitialized values, then it could also explain why you had your code working on a different machine. The compilers don’t treat uninitialized variables the same way, and there are also flags to change that… In the [hpddm/stokes-2d-PETSc.edp example], there was a PETSc error while attempting to allocate an array of size 2147483647, which once again is a typical value for an uninitialized variable.

(A comment about the hpddm/stokes-2d-PETSc.edp example: I tried it with

  • 2 nodes * 24 processes on the debug partition, it works,
  • 2 nodes * 24 processes on the parallel partition, it does NOT work,
  • 1 node * 24 processes on the parallel partition, it works.)

Is the result of the backtrace useful? Do you understand what the problem is?

uninitialized variable maybe?

Unlikely, this part of the code is Valgrind-clean.

Is the result of the backtrace useful?

Yes, it helps to understand where the problem is, but see below.

Do you understand what the problem is?

Not a clue. This part of the code is heavily tested and independent of the number of nodes. So if it’s running on 1 node with 4 processes, it doesn’t make much sense that it’s not running on 2 nodes with 2 processes per node. If it’s not too much trouble, I’ve prepared a small patch (patch-HPDDM_schwarz.log). Could you please apply it to /home/lanti/tmp/FF_install_develop/ff-petsc/r/include/HPDDM_schwarz.hpp, and then recompile the plugin PETSc (via cd FreeFem-sources/plugin/mpi && ../seq/ff-c++ -auto PETSc.cpp && make install)?
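Concretely, something like this (just a sketch; adjust the paths and, if needed, the patch options to your setup):

patch /home/lanti/tmp/FF_install_develop/ff-petsc/r/include/HPDDM_schwarz.hpp < patch-HPDDM_schwarz.log
cd FreeFem-sources/plugin/mpi && ../seq/ff-c++ -auto PETSc.cpp && make install
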
Then, re-run the code and please send me the output. You should see something like:

rank 0 received from 1 (1) with tag 0 and count 3233
rank 1 received from 0 (0) with tag 0 and count 3233
rank 0 received from 2 (2) with tag 0 and count 2500
rank 2 received from 0 (0) with tag 0 and count 2500
rank 0 received from 3 (3) with tag 0 and count 50
rank 3 received from 0 (0) with tag 0 and count 50
rank 0 sending/receiving 2500 to 2
rank 2 sending/receiving 2500 to 0
rank 0 sending/receiving 3233 to 1
rank 1 sending/receiving 3233 to 0
rank 0 sending/receiving 50 to 3
rank 3 sending/receiving 50 to 0
rank 1 received from 2 (2) with tag 0 and count 2138
rank 2 received from 1 (1) with tag 0 and count 2138
rank 1 sending/receiving 2400 to 3
rank 3 sending/receiving 2400 to 1
rank 3 received from 2 (2) with tag 0 and count 2897
rank 2 received from 3 (3) with tag 0 and count 2897
rank 2 sending/receiving 2138 to 1
rank 1 sending/receiving 2138 to 2
rank 3 received from 1 (1) with tag 0 and count 2400
rank 1 received from 3 (3) with tag 0 and count 2400
rank 3 sending/receiving 2897 to 2
rank 2 sending/receiving 2897 to 3

and we will be able to see if there is a mismatch (it’s the only reason why one process would be stuck in the given MPI_Waitany). Furthermore, does the code still hang if you add -Dpartitioner=scotch to your command line arguments?

Thank you for your patience!

In the [hpddm/stokes-2d-PETSc.edp example], there was a PETSc error while attempting to allocate an array of size 2147483647, which once again is a typical value for an uninitialized variable.

What’s the stack? The value of 2147483647 (2^31 - 1, i.e. INT_MAX) is used as a sentinel value inside HPDDM, so that comment, without further context, is wrong.

No, thank you for your patience. We will try the patch and let you know.

Hi,

  1. Here is the “interesting part” of the result of the patch for my code (full output attached FreeFem++_HPDDM_patch.log (350.1 KB)):
rank 1 sending/receiving 2747 to 0
rank 3 sending/receiving 2618 to 0
rank 3 sending/receiving 2950 to 2

Define matrix A
rank 0 sending/receiving 2747 to 1
rank 1 received from 0 (0) with tag 0 and count 2747
rank 0 sending/receiving 2047 to 2
rank 0 received from 1 (1) with tag 0 and count 2747
rank 2 sending/receiving 2047 to 0
rank 2 sending/receiving 2950 to 3
rank 3 received from 2 (2) with tag 0 and count 2950
rank 2 received from 3 (3) with tag 0 and count 2950
rank 0 received from 2 (2) with tag 0 and count 2047
rank 2 received from 0 (0) with tag 0 and count 2047
  2. Regarding the -Dpartitioner=scotch option, where do you mean it should be used?

  3. Regarding the hpddm/stokes-2d-PETSc.edp example, I don’t have the stack but here is attached the error log for 2 nodes * 2 processes: FreeFem++_HPDDM_patch_stokes_2d_example.log (289.2 KB). Do you think you could try to reproduce it?

Thank you.

  1. Please re-run the job with the -ns parameter, it’s unreadable right now with all the parsed .edp being printed to the file.
  2. On the command line.
  3. Same as 1., please, always use -ns when sending outputs.
  1. Here is the output for my code with the -ns option:
    -- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
    Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi
    load: init metis (v 5 )
    (already loaded: msh3) sizestack + 1024 =9104 ( 8080 )

rank 0 sending/receiving 2747 to 1
rank 0 sending/receiving 2047 to 2
rank 3 sending/receiving 2618 to 0
rank 3 sending/receiving 2950 to 2
rank 0 received from 1 (1) with tag 0 and count 2747
rank 1 sending/receiving 2747 to 0
rank 1 received from 0 (0) with tag 0 and count 2747
rank 0 received from 2 (2) with tag 0 and count 2047
rank 2 sending/receiving 2047 to 0
rank 2 sending/receiving 2950 to 3
rank 3 received from 2 (2) with tag 0 and count 2950
rank 2 received from 3 (3) with tag 0 and count 2950
rank 2 received from 0 (0) with tag 0 and count 2047

  2. Trying to add the -Dpartitioner=scotch option as
MV2_ENABLE_AFFINITY=0   srun /home/FreeFem/FreeFem-sources/src/mpi/FreeFem++-mpi   -ns   -Dpartitioner=scotch   mycode.edp > log.txt

yields the following output:

 Error opening file -Dpartitioner=scotch in: 
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
 Error opening file -Dpartitioner=scotch in: 
  -- try  :"-Dpartitioner=scotch"
  -- try  :"/home/.../FreeFem/lib/ff++/4.9/idp/-Dpartitioner=scotch"

 Error line number 1, in file -- unkown --, before  token 
lex: Error input opening file 
  current line = -1 mpirank 0 / 4
Compile error : lex: Error input opening file 
	line number :1, 
 Error opening file -Dpartitioner=scotch in: 
  -- try  :"-Dpartitioner=scotch"
  -- try  :"/home/.../FreeFem/lib/ff++/4.9/idp/-Dpartitioner=scotch"
  current line = -1 mpirank 1 / 4
 Error opening file -Dpartitioner=scotch in: 
  -- try  :"-Dpartitioner=scotch"
  -- try  :"/home/.../FreeFem/lib/ff++/4.9/idp/-Dpartitioner=scotch"
  current line = -1 mpirank 2 / 4
  -- try  :"-Dpartitioner=scotch"
  -- try  :"/home/.../FreeFem/lib/ff++/4.9/idp/-Dpartitioner=scotch"
  current line = -1 mpirank 3 / 4
  3. Please give me a moment to rerun hpddm/stokes-2d-PETSc.edp.

The flag -Dpartitioner=scotch must be put after the .edp file.
I can now see why there is a deadlock: there is no matching receive on rank 0 for the message from rank 3. But I don’t know where that comes from. I’ll try to send you another patch to get to the root cause of this.
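For example, the command line you used earlier would become (same launcher and paths, only the flag moves after the script):

MV2_ENABLE_AFFINITY=0   srun /home/FreeFem/FreeFem-sources/src/mpi/FreeFem++-mpi   -ns   mycode.edp   -Dpartitioner=scotch > log.txt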

Here is another small patch (patch-macro_ddm.log) that does not require any recompilation. First find macro_ddm.idp, likely in /home/.../FreeFem/lib/ff++/4.9/idp/macro_ddm.idp, and apply the patch. Then, rerun the code and send the updated output, please. You should get some extra text, like the following:

0:
3
	  1	  2	  3
2747	2047	2618
1:
2
	  0	  2
2747	2995
2:
3
	  0	  1	  3
2047	2995	2950
3:
2
	  0	  2
2618	2950

Thank you.

  1. Here is the output with the additional patch:
-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
 load: init metis (v  5 )
 (already loaded: msh3) sizestack + 1024 =9120  ( 8096 )

0: 
2	
	  1	  2	
2747	2047	
rank 0 sending/receiving 2747 to 1
1: 
1	
	  0	
2747	
rank 1 sending/receiving 2747 to 0
3: 
1	
	  2	
2950	
rank 3 sending/receiving 2950 to 2
rank 1 received from 0 (0) with tag 0 and count 2747
rank 0 sending/receiving 2047 to 2
rank 0 received from 1 (1) with tag 0 and count 2747
rank 0 received from 2 (2) with tag 0 and count 2047
2: 
2	
	  0	  3	
2047	2950	
rank 2 sending/receiving 2047 to 0
rank 2 sending/receiving 2950 to 3
rank 3 received from 2 (2) with tag 0 and count 2950
rank 2 received from 3 (3) with tag 0 and count 2950
rank 2 received from 0 (0) with tag 0 and count 2047
  2. With the -Dpartitioner=scotch option, the output is the following, and the code hangs (whereas without this option, it simply stops):
-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
 (already loaded: msh3) sizestack + 1024 =9120  ( 8096 )

0: 
3	
	  1	  2	  3	
3233	2500	50	
rank 0 sending/receiving 3233 to 1
1: 
2	
	  0	  3	
3233	2400	
rank 1 sending/receiving 3233 to 0
rank 0 sending/receiving 2500 to 2
rank 0 sending/receiving 50 to 3
2: 
2	
	  0	  3	
2500	2897	
rank 2 sending/receiving 2500 to 0
rank 2 sending/receiving 2897 to 3
rank 1 sending/receiving 2400 to 3
rank 0 received from 1 (1) with tag 0 and count 3233
rank 1 received from 0 (0) with tag 0 and count 3233
rank 0 received from 2 (2) with tag 0 and count 2500
rank 1 received from 3 (3) with tag 0 and count 2400
3: 
2	
	  1	  2	
2400	2897	
rank 3 sending/receiving 2400 to 1
rank 3 sending/receiving 2897 to 2
rank 2 received from 3 (3) with tag 0 and count 2897
rank 3 received from 2 (2) with tag 0 and count 2897
rank 2 received from 0 (0) with tag 0 and count 2500
rank 3 received from 1 (1) with tag 0 and count 2400

What do you mean by “simply stops”? Does the program run to completion, or does it crash? I thought it was hanging w/o the parameter -Dpartitioner?