Sorry, it was not clear, but I meant “it runs to completion” (I added a cout print to make sure).
Three days ago, when I tried non-interactively (without -start_in_debugger noxterm and without -Dpartitioner=scotch), the code ran fine on 2x2 and 2x4 processes (regardless of the memory requested, at least up to 250’000 MB/node), ran fine on 2x8 processes up to 190’000 MB/node, but froze on 2x8 processes above 200’000 MB/node.
Now, still non-interactively (and still without -start_in_debugger noxterm and without -Dpartitioner=scotch), the code:
runs fine on 2x2 processes with 10 and 100 GB/node but hangs with 200 GB/node,
hangs on 2x4 and 2x8 processes with 10, 100 and 200 GB/node.
-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi
load: init metis (v 5 )
(already loaded: msh3) sizestack + 1024 =9120 ( 8096 )
0:
2
1 2
2747 2047
rank 0 sending/receiving 2747 to 1
1:
1
0
2747
rank 1 sending/receiving 2747 to 0
rank 1 received from 0 (0) with tag 0 and count 2747
rank 0 sending/receiving 2047 to 2
2:
2
0 3
2047 2950
rank 2 sending/receiving 2047 to 0
rank 2 sending/receiving 2950 to 3
rank 0 received from 1 (1) with tag 0 and count 2747
rank 0 received from 2 (2) with tag 0 and count 2047
3:
2
0 2
2618 2950
rank 3 sending/receiving 2618 to 0
rank 3 sending/receiving 2950 to 2
rank 2 received from 3 (3) with tag 0 and count 2950
rank 3 received from 2 (2) with tag 0 and count 2950
rank 2 received from 0 (0) with tag 0 and count 2047
OK, even if it looks like it doesn’t, we are moving forward. There is indeed a clear mismatch (rank 3 sees rank 0 as a neighbor, but rank 0 does not see rank 3 as a neighbor). If you still have some gas left in the tank, please apply the following patch (macro_ddm_bis.log)
and re-run the job (keep the existing changes, please). There is no need to recompile. Only send logs of hanging or crashing jobs, always with -ns. Thank you!
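To make the mismatch concrete, here is a minimal standalone sketch of this kind of point-to-point exchange (this is not the macro_ddm.idp code, just illustrative names; run it on 3 MPI processes, e.g. ff-mpirun -np 3). Rank 2 lists rank 0 as a neighbor while rank 0 does not list rank 2, the same kind of asymmetry as rank 3 / rank 0 in your logs, so rank 2 blocks forever in mpiWaitAll and the whole job hangs:
int neighbor;
if (mpirank == 0) neighbor = 1;
else if (mpirank == 1) neighbor = 0;
else neighbor = 0; // rank 2 wrongly believes rank 0 is a neighbor
real[int] snd(10), rcv(10);
snd = mpirank;
mpiRequest[int] rq(2);
Irecv(processor(neighbor, rq[0]), rcv);
Isend(processor(neighbor, rq[1]), snd);
mpiWaitAll(rq); // ranks 0 and 1 return, rank 2 waits forever on its unmatched Irecv
cout << "rank " << mpirank << " done" << endl;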
Here is the output of the same code after applying your latest patch (same submit options, still hanging):
-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi
load: init metis (v 5 )
(already loaded: msh3) sizestack + 1024 =9120 ( 8096 )
0 true neighbor is 1 (1)
0 true neighbor is 2 (2)
0 true neighbor is 3 (3)
0:
3
1 2 3
2747 2047 2618
rank 0 sending/receiving 2747 to 1
1 true neighbor is 0 (1)
1 false neighbor bis is 2 (1)
1:
1
0
2747
rank 1 sending/receiving 2747 to 0
rank 1 received from 0 (0) with tag 0 and count 2747
3 false neighbor bis is 0 (0)
3 true neighbor is 2 (1)
3:
1
2
2950
rank 3 sending/receiving 2950 to 2
rank 0 sending/receiving 2047 to 2
rank 0 sending/receiving 2618 to 3
rank 0 received from 1 (1) with tag 0 and count 2747
rank 0 received from 2 (2) with tag 0 and count 2047
2 true neighbor is 0 (1)
2 false neighbor bis is 1 (1)
2 true neighbor is 3 (2)
2:
2
0 3
2047 2950
rank 2 sending/receiving 2047 to 0
rank 2 sending/receiving 2950 to 3
rank 3 received from 2 (2) with tag 0 and count 2950
rank 2 received from 3 (3) with tag 0 and count 2950
rank 2 received from 0 (0) with tag 0 and count 2047
-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi
load: init metis (v 5 )
(already loaded: msh3) sizestack + 1024 =9120 ( 8096 )
0 true neighbor is 1 (1)
0 true neighbor is 2 (2)
0 false neighbor bis is 3 (2), eps = 0.502167, epsTab = 0
0:
2
1 2
2747 2047
rank 0 sending/receiving 2747 to 1
1 true neighbor is 0 (1)
1 false neighbor bis is 2 (1), eps = 0.459333, epsTab = 0
1:
1
0
2747
rank 1 sending/receiving 2747 to 0
rank 1 received from 0 (0) with tag 0 and count 2747
rank 0 sending/receiving 2047 to 2
2 true neighbor is 0 (1)
2 false neighbor bis is 1 (1), eps = 0.529, epsTab = 0
2 true neighbor is 3 (2)
2:
2
0 3
2047 2950
rank 2 sending/receiving 2047 to 0
rank 2 sending/receiving 2950 to 3
rank 0 received from 1 (1) with tag 0 and count 2747
rank 0 received from 2 (2) with tag 0 and count 2047
3 true neighbor is 0 (1)
3 true neighbor is 2 (2)
3:
2
0 2
2618 2950
rank 3 sending/receiving 2618 to 0
rank 3 sending/receiving 2950 to 2
rank 2 received from 3 (3) with tag 0 and count 2950
rank 3 received from 2 (2) with tag 0 and count 2950
rank 2 received from 0 (0) with tag 0 and count 2047
Excellent! Now it should be rather straightforward to debug. At line 251, could you please replace mpiWaitAll(rq); with for(int debugI = 0; debugI < rq.n; ++debugI) mpiWaitAny(rq);? Does the code still hang?
-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi
No operator .n for type <P2KNIiE>
load: init metis (v 5 )
(already loaded: msh3) No operator .n for type <P2KNIiE>
Error line number 213, in file macro: partitionPrivate in /home/boujo/FreeFem/lib/ff++/4.9/idp/macro_ddm.idp, before token n
current line = 213 mpirank 0 / 4
Compile error :
line number :213, n
No operator .n for type <P2KNIiE>
current line = 213 mpirank 1 / 4
error Compile error :
line number :213, n
code = 1 mpirank: 0
No operator .n for type <P2KNIiE>
current line = 213 mpirank 3 / 4
current line = 213 mpirank 2 / 4
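(The .n error above is just that an mpiRequest array does not expose .n. For reference, here is a minimal standalone sketch of the same wait-any loop with the request count kept in a plain int; the names are purely illustrative, this is not the macro_ddm.idp code, and it assumes 2 MPI processes.)
int nbRq = 2; // request count kept explicitly, since rq.n does not compile
int other = (mpirank + 1) % 2; // with 2 processes: 0 <-> 1
real[int] snd(10), rcv(10);
snd = mpirank;
mpiRequest[int] rq(nbRq);
Irecv(processor(other, rq[0]), rcv);
Isend(processor(other, rq[1]), snd);
for (int debugI = 0; debugI < nbRq; ++debugI)
    mpiWaitAny(rq); // completes the pending requests one at a time instead of mpiWaitAll(rq)
cout << "rank " << mpirank << " received " << rcv[0] << endl;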
-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi
load: init metis (v 5 )
(already loaded: msh3) sizestack + 1024 =9128 ( 8104 )
0 true neighbor is 1 (1)
0 true neighbor is 2 (2)
0 false neighbor bis is 3 (2), eps = 0.502167, epsTab = 0
0:
2
1 2
2747 2047
rank 0 sending/receiving 2747 to 1
1 true neighbor is 0 (1)
1 false neighbor bis is 2 (1), eps = 0.459333, epsTab = 0
1:
1
0
2747
rank 1 sending/receiving 2747 to 0
3 false neighbor bis is 0 (0), eps = 0.445833, epsTab = 0
3 true neighbor is 2 (1)
3:
1
2
2950
rank 3 sending/receiving 2950 to 2
rank 1 received from 0 (0) with tag 0 and count 2747
rank 0 sending/receiving 2047 to 2
rank 0 received from 1 (1) with tag 0 and count 2747
rank 0 received from 2 (2) with tag 0 and count 2047
Done
Done
2 true neighbor is 0 (1)
2 false neighbor bis is 1 (1), eps = 0.529, epsTab = 0
2 true neighbor is 3 (2)
2:
2
0 3
2047 2950
rank 2 sending/receiving 2047 to 0
rank 2 sending/receiving 2950 to 3
rank 3 received from 2 (2) with tag 0 and count 2950
rank 2 received from 3 (3) with tag 0 and count 2950
rank 2 received from 0 (0) with tag 0 and count 2047
Done
Done
(“Done” comes from the cout print I added on the last line of the code.)
This is exactly the same output as before, so I’m not sure why that would change anything. But if changing from mpiWaitAll to mpiWaitAny makes things run smoothly, that’s good to know. I’d try with another MPI implementation, because really, that should not cause such behavior…
(Maybe I should open a new discussion, but I’ll ask here first; please let me know.)
Have you ever seen anything like this when saving .vtu files with savevtk("file.vtu", Th, [U,V,W], order = [1,1,1], dataname="Ub"); and visualizing them in ParaView?
The two images correspond to the same code and the same parameters (except the number of processes, but always on a single node); the first one was computed before we applied the patch yesterday, the second one after.
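In case it helps to reproduce, here is a self-contained toy version of that export call on a small cube mesh (Th, U, V, W below are just stand-ins for my actual mesh and velocity field, and the order flags are kept as in my call above):
load "msh3"
load "iovtk"
mesh3 Th = cube(10, 10, 10); // toy mesh standing in for the real one
fespace Vh(Th, P1);
Vh U = x; // dummy velocity components
Vh V = y;
Vh W = z;
int[int] fforder = [1, 1, 1]; // same order flags as in the call above
savevtk("file.vtu", Th, [U, V, W], order = fforder, dataname = "Ub");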
Actually, this behavior is not consistent: that is indeed what I observed with my own code, but when I tried the example hpddm/stokes-2d-PETSc.edp again, it still hangs on 2x2 processes.
-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi
load: init metis (v 5 )
sizestack + 1024 =9800 ( 8776 )
-- Square mesh : nb vertices =1681 , nb triangles = 3200 , nb boundary edges 160
-- Square mesh : nb vertices =1681 , nb triangles = 3200 , nb boundary edges 160
-- Square mesh : nb vertices =1681 , nb triangles = 3200 , nb boundary edges 160
-- Square mesh : nb vertices =1681 , nb triangles = 3200 , nb boundary edges 160
--- global mesh of 4800 elements (prior to refinement) partitioned with metis --metisA: 4-way Edge-Cut: 3, Balance: 1.01 Nodal=0/Dual 1
(in 4.325628e-03)
--- partition of unity built (in 2.602980e-01)
rank 0 sending/receiving 480 to 1
rank 2 sending/receiving 426 to 3
rank 1 sending/receiving 480 to 0
rank 1 sending/receiving 964 to 3
rank 3 sending/receiving 964 to 1
rank 3 sending/receiving 426 to 2
rank 0 received from 1 (1) with tag 0 and count 480
rank 2 received from 3 (3) with tag 0 and count 426
rank 1 received from 0 (0) with tag 0 and count 480
rank 1 received from 3 (3) with tag 0 and count 964
rank 3 received from 2 (2) with tag 0 and count 426
rank 3 received from 1 (1) with tag 0 and count 964
--- global numbering created (in 3.905296e-04)
--- global CSR created (in 5.421638e-04)
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi
load: init metis (v 5 )
sizestack + 1024 =10800 ( 9776 )
-- Square mesh : nb vertices =1681 , nb triangles = 3200 , nb boundary edges 160
-- Square mesh : nb vertices =1681 , nb triangles = 3200 , nb boundary edges 160
--- global mesh of 4800 elements (prior to refinement) partitioned with metis --metisA: 2-way Edge-Cut: 3, Balance: 1.00 Nodal=0/Dual 1
(in 4.200935e-03)
--- partition of unity built (in 1.551914e-02)
rank 0 sending/receiving 995 to 1
rank 0 received from 1 (1) with tag 0 and count 995
rank 1 sending/receiving 995 to 0
rank 1 received from 0 (0) with tag 0 and count 995
--- global numbering created (in 1.698017e-02)
--- global CSR created (in 7.243156e-04)
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
--- system solved with PETSc (in 2.283962e-01)
times: compile 1.100000e-01s, execution 4.700000e-01s, mpirank:0
######## We forget of deleting 0 Nb pointer, 0Bytes , mpirank 0, memory leak =948496
CodeAlloc : nb ptr 7147, size :650304 mpirank: 0
Ok: Normal End
times: compile 0.12s, execution 0.46s, mpirank:1
######## We forget of deleting 0 Nb pointer, 0Bytes , mpirank 1, memory leak =948448
CodeAlloc : nb ptr 7147, size :650304 mpirank: 1
And with 2x2 processes (hanging):
-- FreeFem++ v4.9 (Fri Jun 18 14:45:02 CEST 2021 - git v4.9)
Load: lg_fem lg_mesh lg_mesh3 eigenvalue parallelempi
-- Square mesh : nb vertices =1681 , nb triangles = 3200 , nb boundary edges 160
load: init metis (v 5 )
sizestack + 1024 =10800 ( 9776 )
-- Square mesh : nb vertices =1681 , nb triangles = 3200 , nb boundary edges 160
-- Square mesh : nb vertices =1681 , nb triangles = 3200 , nb boundary edges 160
-- Square mesh : nb vertices =1681 , nb triangles = 3200 , nb boundary edges 160
rank 2 sending/receiving 426 to 3
rank 1 sending/receiving 480 to 0
rank 1 sending/receiving 964 to 3
--- global mesh of 4800 elements (prior to refinement) partitioned with metis --metisA: 4-way Edge-Cut: 3, Balance: 1.01 Nodal=0/Dual 1
(in 4.354715e-03)
--- partition of unity built (in 2.464640e-01)
rank 0 sending/receiving 480 to 1
rank 1 received from 0 (0) with tag 0 and count 480
rank 0 received from 1 (1) with tag 0 and count 480
rank 1 received from 3 (3) with tag 0 and count 964
rank 3 sending/receiving 964 to 1
rank 3 sending/receiving 426 to 2
rank 2 received from 3 (3) with tag 0 and count 426
rank 3 received from 2 (2) with tag 0 and count 426
rank 3 received from 1 (1) with tag 0 and count 964
--- global numbering created (in 5.839586e-03)
--- global CSR created (in 8.101463e-04)
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Warning: -- Your set of boundary condition is incompatible with the mesh label.