Stall in some multi-node parallel calculations

Sorry, it was not clear, but I meant “it runs to completion” (I added a cout print to make sure).

Three days ago, when I tried non-interactively (without -start_in_debugger noxterm and without -Dpartitioner=scotch), the code ran fine on 2x2 and 2x4 processes (regardless of the requested memory, at least up to 250’000 MB/node), ran fine on 2x8 processes up to 190’000 MB/node, but froze on 2x8 processes above 200’000 MB/node.

Now, still non-interactively (and still without -start_in_debugger noxterm and without -Dpartitioner=scotch), the code:

  • runs fine on 2x2 processes with 10 and 100 GB/node but hangs with 200 GB/node,
  • hangs on 2x4 and 2x8 processes with 10, 100 and 200 GB/node.

Could you send the output with 2x2 processes and 200 GB/node, please?

Sure, here it is:

-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
 load: init metis (v  5 )
 (already loaded: msh3) sizestack + 1024 =9120  ( 8096 )

0: 
2	
	  1	  2	
2747	2047	
rank 0 sending/receiving 2747 to 1
1: 
1	
	  0	
2747	
rank 1 sending/receiving 2747 to 0
rank 1 received from 0 (0) with tag 0 and count 2747
rank 0 sending/receiving 2047 to 2
2: 
2	
	  0	  3	
2047	2950	
rank 2 sending/receiving 2047 to 0
rank 2 sending/receiving 2950 to 3
rank 0 received from 1 (1) with tag 0 and count 2747
rank 0 received from 2 (2) with tag 0 and count 2047
3: 
2	
	  0	  2	
2618	2950	
rank 3 sending/receiving 2618 to 0
rank 3 sending/receiving 2950 to 2
rank 2 received from 3 (3) with tag 0 and count 2950
rank 3 received from 2 (2) with tag 0 and count 2950
rank 2 received from 0 (0) with tag 0 and count 2047

OK, even if it doesn’t look like it, we are moving forward. There is indeed a clear mismatch (rank 3 sees rank 0 as a neighbor, but rank 0 does not see rank 3 as a neighbor). If you still have some gas left in the tank, please apply the following patch: patch-macro_ddm_bis.log
and re-run the job (keep the existing changes, please). There is no need to recompile. Only send logs of hanging or crashing jobs, always with -ns. Thank you!
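(Assuming the jobs are launched through ff-mpirun, a run with that flag would look something like

ff-mpirun -np 4 your_script.edp -ns

with your_script.edp standing in for the actual script.)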

Thank you for your support and optimism. 🙂

Here is the output of the same code after applying your latest patch (same submit options, still hanging):

-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
 load: init metis (v  5 )
 (already loaded: msh3) sizestack + 1024 =9120  ( 8096 )

0 true neighbor is 1 (1)
0 true neighbor is 2 (2)
0 true neighbor is 3 (3)
0: 
3	
	  1	  2	  3	
2747	2047	2618	
rank 0 sending/receiving 2747 to 1
1 true neighbor is 0 (1)
1 false neighbor bis is 2 (1)
1: 
1	
	  0	
2747	
rank 1 sending/receiving 2747 to 0
rank 1 received from 0 (0) with tag 0 and count 2747
3 false neighbor bis is 0 (0)
3 true neighbor is 2 (1)
3: 
1	
	  2	
2950	
rank 3 sending/receiving 2950 to 2
rank 0 sending/receiving 2047 to 2
rank 0 sending/receiving 2618 to 3
rank 0 received from 1 (1) with tag 0 and count 2747
rank 0 received from 2 (2) with tag 0 and count 2047
2 true neighbor is 0 (1)
2 false neighbor bis is 1 (1)
2 true neighbor is 3 (2)
2: 
2	
	  0	  3	
2047	2950	
rank 2 sending/receiving 2047 to 0
rank 2 sending/receiving 2950 to 3
rank 3 received from 2 (2) with tag 0 and count 2950
rank 2 received from 3 (3) with tag 0 and count 2950
rank 2 received from 0 (0) with tag 0 and count 2047

OK, now this is getting interesting. I won’t bother you with another patch; could you simply edit the file macro_ddm.idp and change the line

} else cout << mpirank << " false neighbor bis is " << intersection[0][i] << " (" << numberIntersection << ")\n";

to

} else cout << mpirank << " false neighbor bis is " << intersection[0][i] << " (" << numberIntersection << "), eps = " << eps << ", epsTab = " << epsTab[i] << "\n";

Thank you. Here is the output:

-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
 load: init metis (v  5 )
 (already loaded: msh3) sizestack + 1024 =9120  ( 8096 )

0 true neighbor is 1 (1)
0 true neighbor is 2 (2)
0 false neighbor bis is 3 (2), eps = 0.502167, epsTab = 0
0: 
2	
	  1	  2	
2747	2047	
rank 0 sending/receiving 2747 to 1
1 true neighbor is 0 (1)
1 false neighbor bis is 2 (1), eps = 0.459333, epsTab = 0
1: 
1	
	  0	
2747	
rank 1 sending/receiving 2747 to 0
rank 1 received from 0 (0) with tag 0 and count 2747
rank 0 sending/receiving 2047 to 2
2 true neighbor is 0 (1)
2 false neighbor bis is 1 (1), eps = 0.529, epsTab = 0
2 true neighbor is 3 (2)
2: 
2	
	  0	  3	
2047	2950	
rank 2 sending/receiving 2047 to 0
rank 2 sending/receiving 2950 to 3
rank 0 received from 1 (1) with tag 0 and count 2747
rank 0 received from 2 (2) with tag 0 and count 2047
3 true neighbor is 0 (1)
3 true neighbor is 2 (2)
3: 
2	
	  0	  2	
2618	2950	
rank 3 sending/receiving 2618 to 0
rank 3 sending/receiving 2950 to 2
rank 2 received from 3 (3) with tag 0 and count 2950
rank 3 received from 2 (2) with tag 0 and count 2950
rank 2 received from 0 (0) with tag 0 and count 2047

Excellent! Now it should be rather straightforward to debug. At line 251, could you please replace

mpiWaitAll(rq);

by

for(int debugI = 0; debugI < rq.n; ++debugI) mpiWaitAny(rq);

Does the code still hang?

Thank you. Here is the output:

-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
 No operator .n for type <P2KNIiE>
 load: init metis (v  5 )
 (already loaded: msh3) No operator .n for type <P2KNIiE>

 Error line number 213, in file  macro: partitionPrivate in /home/boujo/FreeFem/lib/ff++/4.9/idp/macro_ddm.idp, before  token n

  current line = 213 mpirank 0 / 4
Compile error : 
	line number :213, n
 No operator .n for type <P2KNIiE>
  current line = 213 mpirank 1 / 4
error Compile error : 
	line number :213, n
 code = 1 mpirank: 0
 No operator .n for type <P2KNIiE>
  current line = 213 mpirank 3 / 4
  current line = 213 mpirank 2 / 4

Oops, it should not be rq.n but rather 2 * intersection[0].n, sorry about that…
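So, combining the two suggestions, the corrected replacement for line 251 would be:

for(int debugI = 0; debugI < 2 * intersection[0].n; ++debugI) mpiWaitAny(rq);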

No worries. So here is the new output:

-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
 load: init metis (v  5 )
 (already loaded: msh3) sizestack + 1024 =9128  ( 8104 )

0 true neighbor is 1 (1)
0 true neighbor is 2 (2)
0 false neighbor bis is 3 (2), eps = 0.502167, epsTab = 0
0: 
2	
	  1	  2	
2747	2047	
rank 0 sending/receiving 2747 to 1
1 true neighbor is 0 (1)
1 false neighbor bis is 2 (1), eps = 0.459333, epsTab = 0
1: 
1	
	  0	
2747	
rank 1 sending/receiving 2747 to 0
3 false neighbor bis is 0 (0), eps = 0.445833, epsTab = 0
3 true neighbor is 2 (1)
3: 
1	
	  2	
2950	
rank 3 sending/receiving 2950 to 2
rank 1 received from 0 (0) with tag 0 and count 2747
rank 0 sending/receiving 2047 to 2
rank 0 received from 1 (1) with tag 0 and count 2747
rank 0 received from 2 (2) with tag 0 and count 2047
Done
Done
2 true neighbor is 0 (1)
2 false neighbor bis is 1 (1), eps = 0.529, epsTab = 0
2 true neighbor is 3 (2)
2: 
2	
	  0	  3	
2047	2950	
rank 2 sending/receiving 2047 to 0
rank 2 sending/receiving 2950 to 3
rank 3 received from 2 (2) with tag 0 and count 2950
rank 2 received from 3 (3) with tag 0 and count 2950
rank 2 received from 0 (0) with tag 0 and count 2047
Done
Done

(“Done” comes from the cout print I added at the very end of the code.)
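It is just something like this, as a sketch, on the last line of the script:

cout << "Done" << endl;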

And now it seems to run to completion with 2x2, 2x4 and 2x8 processes!

This is exactly the same output as before, so I’m not sure why that would change anything. But if switching from mpiWaitAll to mpiWaitAny makes things run smoothly, that’s good to know. I’d try with another MPI implementation, because really, that should not cause such behavior…

(Maybe I should create a new discussion, but I will first ask here; please let me know.)

Have you ever seen such a thing when saving .vtu files with

savevtk("file.vtu", Th, [U,V,W], order = [1,1,1], dataname = "Ub");

and visualizing them with ParaView?

[two ParaView screenshots of the same field]

The two images correspond to the same code and the same parameters (except for the number of processes, but always on a single node); the first one was computed before we applied the patch yesterday, and the second one after.

Actually this behavior is not consistent: that is indeed what I observed for my own code, but when I tried the example hpddm/stokes-2d-PETSc.edp again, it still hangs on 2x2 processes.

For ParaView, I don’t know what you are saving, but that could make sense.
For stokes-2d-PETSc.edp, could you send the output, please?

Yes, here is the output for stokes-2d-PETSc.edp:

-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
 load: init metis (v  5 )
 sizestack + 1024 =9800  ( 8776 )

  -- Square mesh : nb vertices  =1681 ,  nb triangles = 3200 ,  nb boundary edges 160
  -- Square mesh : nb vertices  =1681 ,  nb triangles = 3200 ,  nb boundary edges 160
  -- Square mesh : nb vertices  =1681 ,  nb triangles = 3200 ,  nb boundary edges 160
  -- Square mesh : nb vertices  =1681 ,  nb triangles = 3200 ,  nb boundary edges 160
 --- global mesh of 4800 elements (prior to refinement) partitioned with metis  --metisA: 4-way Edge-Cut:       3, Balance:  1.01 Nodal=0/Dual 1
 (in 4.325628e-03)
 --- partition of unity built (in 2.602980e-01)
rank 0 sending/receiving 480 to 1
rank 2 sending/receiving 426 to 3
rank 1 sending/receiving 480 to 0
rank 1 sending/receiving 964 to 3
rank 3 sending/receiving 964 to 1
rank 3 sending/receiving 426 to 2
rank 0 received from 1 (1) with tag 0 and count 480
rank 2 received from 3 (3) with tag 0 and count 426
rank 1 received from 0 (0) with tag 0 and count 480
rank 1 received from 3 (3) with tag 0 and count 964
rank 3 received from 2 (2) with tag 0 and count 426
rank 3 received from 1 (1) with tag 0 and count 964
 --- global numbering created (in 3.905296e-04)
 --- global CSR created (in 5.421638e-04)
 Warning: -- Your set of boundary condition is incompatible with the mesh label.
 Warning: -- Your set of boundary condition is incompatible with the mesh label.
 Warning: -- Your set of boundary condition is incompatible with the mesh label.
 Warning: -- Your set of boundary condition is incompatible with the mesh label.

In that example, if you replace

buildMat(Th, getARGV("-split", 1), A, Pk, mpiCommWorld)

by

createMat(Th, A, Pk)

do you still have the same behavior?

With 2x1 processes:

-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
 load: init metis (v  5 )
 sizestack + 1024 =10800  ( 9776 )

  -- Square mesh : nb vertices  =1681 ,  nb triangles = 3200 ,  nb boundary edges 160
  -- Square mesh : nb vertices  =1681 ,  nb triangles = 3200 ,  nb boundary edges 160
 --- global mesh of 4800 elements (prior to refinement) partitioned with metis  --metisA: 2-way Edge-Cut:       3, Balance:  1.00 Nodal=0/Dual 1
 (in 4.200935e-03)
 --- partition of unity built (in 1.551914e-02)
rank 0 sending/receiving 995 to 1
rank 0 received from 1 (1) with tag 0 and count 995
rank 1 sending/receiving 995 to 0
rank 1 received from 0 (0) with tag 0 and count 995
 --- global numbering created (in 1.698017e-02)
 --- global CSR created (in 7.243156e-04)
 Warning: -- Your set of boundary condition is incompatible with the mesh label.
 Warning: -- Your set of boundary condition is incompatible with the mesh label.
 --- system solved with PETSc (in 2.283962e-01)
times: compile 1.100000e-01s, execution 4.700000e-01s,  mpirank:0
 ######## We forget of deleting   0 Nb pointer,   0Bytes  ,  mpirank 0, memory leak =948496
 CodeAlloc : nb ptr  7147,  size :650304 mpirank: 0
Ok: Normal End
times: compile 0.12s, execution 0.46s,  mpirank:1
 ######## We forget of deleting   0 Nb pointer,   0Bytes  ,  mpirank 1, memory leak =948448
 CodeAlloc : nb ptr  7147,  size :650304 mpirank: 1

And with 2x2 processes (hanging):

-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
 Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi 
  -- Square mesh : nb vertices  =1681 ,  nb triangles = 3200 ,  nb boundary edges 160
 load: init metis (v  5 )
 sizestack + 1024 =10800  ( 9776 )

  -- Square mesh : nb vertices  =1681 ,  nb triangles = 3200 ,  nb boundary edges 160
  -- Square mesh : nb vertices  =1681 ,  nb triangles = 3200 ,  nb boundary edges 160
  -- Square mesh : nb vertices  =1681 ,  nb triangles = 3200 ,  nb boundary edges 160
rank 2 sending/receiving 426 to 3
rank 1 sending/receiving 480 to 0
rank 1 sending/receiving 964 to 3
 --- global mesh of 4800 elements (prior to refinement) partitioned with metis  --metisA: 4-way Edge-Cut:       3, Balance:  1.01 Nodal=0/Dual 1
 (in 4.354715e-03)
 --- partition of unity built (in 2.464640e-01)
rank 0 sending/receiving 480 to 1
rank 1 received from 0 (0) with tag 0 and count 480
rank 0 received from 1 (1) with tag 0 and count 480
rank 1 received from 3 (3) with tag 0 and count 964
rank 3 sending/receiving 964 to 1
rank 3 sending/receiving 426 to 2
rank 2 received from 3 (3) with tag 0 and count 426
rank 3 received from 2 (2) with tag 0 and count 426
rank 3 received from 1 (1) with tag 0 and count 964
 --- global numbering created (in 5.839586e-03)
 --- global CSR created (in 8.101463e-04)
 Warning: -- Your set of boundary condition is incompatible with the mesh label.
 Warning: -- Your set of boundary condition is incompatible with the mesh label.
 Warning: -- Your set of boundary condition is incompatible with the mesh label.
 Warning: -- Your set of boundary condition is incompatible with the mesh label.

Did you revert the latest patches? Because it’s supposed to print more output, like

0 true neighbor is 1 (1)
0 true neighbor is 2 (2)
0 false neighbor bis is 3 (2), eps = 0.502167, epsTab = 0