Problem when increasing the number of procs

Hi,

Thanks to the nice tutorial held in December, I am now able to run parallel computations with FreeFem. However, I can only do so with up to 7 processes. If I go beyond 7 processes, FreeFem gets stuck when running createMat.

Below is a minimal script that reproduces my problem:

load "PETSc"                          // PETSc interface
macro dimension()2// EOM              // 2D problem
include "macro_ddm.idp"               // distributed-mesh macros (buildDmesh, createMat, ...)

func PuPuPp = [P2, P2, P1];           // vector finite element space

macro def(i)[i, i#B, i#C]// EOM       // vector field definition
macro init(i)[i, i, i]// EOM          // vector field initialization

mesh Th = square(80, 80);             // global mesh: 2 * 80 * 80 = 12800 triangles
buildDmesh(Th);                       // distribute the mesh over the processes

Mat minusJacobien;
createMat(Th, minusJacobien, PuPuPp); // hangs here with 8 procs or more

With fewer than 8 procs, the script runs normally and the output is:

   -- Square mesh : nb vertices  =6561 ,  nb triangles = 12800 ,  nb boundary edges 320
 --- global mesh of 12800 elements (prior to refinement) partitioned with metis  --metisA: 7-way Edge-Cut:       3, Balance:  1.01 Nodal=0/Dual 1
 (in 1.279660e-02)
 --- partition of unity built (in 2.170192e-02)
 --- global numbering created (in 2.978065e-02)
 --- global CSR created (in 4.831478e-04)

But with 8 procs or more, FreeFem does not go beyond the line "--- partition of unity built (in XXX)".

Has anybody experienced a similar problem? Any help or suggestion is welcome!

Lucas

The problem is that your mesh is tiny, so there may be some really badly shaped subdomains that cause a deadlock. However, I cannot reproduce this issue on my machine (for example, with 10 processes it runs fine on my macOS machine). You could try another partitioner by adding the flag -Dpartitioner=scotch on the command line, but overall I’d suggest not using too many processes for such tiny problems.
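
For reference, the full command would then look something like this (just a sketch: I am assuming your script is saved as EssaiHPDDM.edp and that your FreeFem installation was built with SCOTCH support):

mpirun -n 10 FreeFem++-mpi EssaiHPDDM.edp -v 0 -Dpartitioner=scotch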

Thank you for your answer.

In my previous example, everything is fine: there are 12800 triangles for 7 procs, which corresponds to a ratio of ~1828 triangles per process.

If I refine the mesh to have 100 times more triangles, the deadlock is still present with 8 procs, even though the ratio is about 100 times higher. This suggests that it is not a matter of mesh size.
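
For instance, multiplying the mesh resolution by 10 in each direction gives 100 times more triangles (an illustrative sketch of the kind of refinement I mean):

mesh Th = square(800, 800); // 2 * 800 * 800 = 1280000 triangles, vs 12800 for square(80, 80)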

With "macro partitioner()", I tried metis, scotch and parmetis. I observe the same deadlock with all three partitioners.
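
Concretely, I define the macro along these lines, before including macro_ddm.idp (as far as I understand, the default metis is kept otherwise):

macro partitioner()scotch// EOM  // or metis, or parmetis
include "macro_ddm.idp"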

Could you please share your script in full? Again, I don’t see this kind of behavior on my machine with just the lines you provided; maybe it comes from something else.

The script corresponds exactly to the lines given in my first message.

EssaiHPDDM.edp (327 Bytes)

I can’t reproduce your issue. I’ve run your example with 4 to 24 processes and it runs smoothly.
What is your configuration? Are you running on a cluster? With which MPI implementation?
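
For instance, the output of commands along these lines would help pin down the environment (just a sketch; the exact commands depend on your MPI installation and on how FreeFem was installed):

$ mpirun --version
$ which FreeFem++-mpi
$ echo $LD_LIBRARY_PATH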

$ for ((i = 4; i < 25; i++)); do mpirun -n $i FreeFem++-mpi EssaiHPDDM.edp -v 0 && echo $?; done
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.124607e-01)
 --- partition of unity built (in 2.983628e+00)
 --- global numbering created (in 2.545270e-02)
 --- global CSR created (in 1.557464e-02)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 7.701178e-01)
 --- partition of unity built (in 2.779283e+00)
 --- global numbering created (in 2.403119e-02)
 --- global CSR created (in 1.373224e-02)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.016183e-01)
 --- partition of unity built (in 2.436816e+00)
 --- global numbering created (in 2.245745e-02)
 --- global CSR created (in 1.625757e-02)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.158428e-01)
 --- partition of unity built (in 2.383430e+00)
 --- global numbering created (in 2.207405e-02)
 --- global CSR created (in 1.168836e-02)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 7.821187e-01)
 --- partition of unity built (in 2.194874e+00)
 --- global numbering created (in 2.186057e-02)
 --- global CSR created (in 9.923214e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 7.894702e-01)
 --- partition of unity built (in 2.185621e+00)
 --- global numbering created (in 1.947581e-02)
 --- global CSR created (in 1.124519e-02)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 7.968843e-01)
 --- partition of unity built (in 2.016374e+00)
 --- global numbering created (in 1.930942e-02)
 --- global CSR created (in 9.936541e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.249298e-01)
 --- partition of unity built (in 2.057022e+00)
 --- global numbering created (in 1.885168e-02)
 --- global CSR created (in 1.095425e-02)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 7.960227e-01)
 --- partition of unity built (in 1.849867e+00)
 --- global numbering created (in 1.778322e-02)
 --- global CSR created (in 9.653186e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.432965e-01)
 --- partition of unity built (in 1.789337e+00)
 --- global numbering created (in 1.694442e-02)
 --- global CSR created (in 9.601373e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.488112e-01)
 --- partition of unity built (in 1.793308e+00)
 --- global numbering created (in 1.680793e-02)
 --- global CSR created (in 9.469281e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.278477e-01)
 --- partition of unity built (in 1.691208e+00)
 --- global numbering created (in 1.558425e-02)
 --- global CSR created (in 1.196884e-02)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.444813e-01)
 --- partition of unity built (in 1.836711e+00)
 --- global numbering created (in 1.539397e-02)
 --- global CSR created (in 8.134508e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.079114e-01)
 --- partition of unity built (in 1.683846e+00)
 --- global numbering created (in 1.415324e-02)
 --- global CSR created (in 7.245136e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.428335e-01)
 --- partition of unity built (in 1.626412e+00)
 --- global numbering created (in 1.353985e-02)
 --- global CSR created (in 9.047549e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.441713e-01)
 --- partition of unity built (in 1.597590e+00)
 --- global numbering created (in 1.234946e-02)
 --- global CSR created (in 7.290195e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.219871e-01)
 --- partition of unity built (in 1.567110e+00)
 --- global numbering created (in 1.154948e-02)
 --- global CSR created (in 6.599104e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.446810e-01)
 --- partition of unity built (in 1.536520e+00)
 --- global numbering created (in 1.158284e-02)
 --- global CSR created (in 7.373002e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.582563e-01)
 --- partition of unity built (in 1.455584e+00)
 --- global numbering created (in 1.035004e-02)
 --- global CSR created (in 6.298294e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.122391e-01)
 --- partition of unity built (in 1.433167e+00)
 --- global numbering created (in 1.021385e-02)
 --- global CSR created (in 5.918284e-03)
0
 --- global mesh of 1280000 elements (prior to refinement) partitioned with metis (in 8.502283e-01)
 --- partition of unity built (in 1.454966e+00)
 --- global numbering created (in 9.467571e-03)
 --- global CSR created (in 6.059158e-03)
0

Hi,

It does indeed appear that the problem comes from my configuration. Until now, I had been running FreeFem in an interactive batch session on a supercomputer. This morning I tried running FreeFem through job submission instead, and it ran successfully with 24 procs.
So I can’t say that the problem is solved, but at least I have found a workaround!

Thanks a lot for your help!

There is probably something wrong with your interactive batch session (e.g., different modules loaded than when submitting a job, a missing source ~/.profile…). I’m glad this works through standard job submission.
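
For what it’s worth, a minimal job script could look something like the sketch below. It is purely illustrative: I am assuming a SLURM scheduler, and the module names are hypothetical placeholders to replace with whatever your site provides.

#!/bin/bash
#SBATCH --job-name=EssaiHPDDM
#SBATCH --ntasks=24
module purge
module load freefem petsc   # hypothetical module names, adapt to your cluster
mpirun -n 24 FreeFem++-mpi EssaiHPDDM.edp -v 0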