Dear Prj, I recently ran into a strange error with my FreeFEM code when running it on my school's cluster:
" Fatal error in PMPI_Waitall: Other MPI error, error stack: "
It seems to happen somewhat randomly.
I believe this problem only occurs when the code runs across multiple machines (nodes). I am not sure whether it is related to my FreeFEM code at all, but I still hope you can give me some clues about it.
Which reproducer do you want, the code or the file with the error output? Since I didn't run with “export I_MPI_DEBUG=6”, the error message is just the one above. I am currently rerunning my code with the debug flag.
Sorry Prj, according to the policy I cannot distribute the code without permission. Let me talk with my professor to see if I can simplify the code and send it to you.
Forgot to mention that I actually don't want the full code, just the piece that exhibits the error. So if it's OK for you to send just that piece, that's all I need.
I am not sure; I have never seen this problem with small meshes, but maybe that is because I only run those on a single machine. The problem happens somewhat randomly. I suspect it is caused by the Intel MPI or by the network of my school's cluster.
I ran your code on 256 processes. It was OK up until line 763
[...]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Max of meshsize: is 1.820027e-01 Min of meshsize: is 4.936972e-02
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
R1 = 1.250000e+00, R2 = 5.000000e-01, tref = 2.000000e-01, dext = 1.500000e+00, 2*hmin = 9.873944e-02
betamax = 7.216878e+00, beta = 1.000000e-01
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1/0 : d d d
current line = 763 mpirank 24 / 256
[...]
Are you at least reaching this point of the script, or do you have an error before that?
No, but I think I know what is happening here: I should solve the elasticity equation first and then run this adjoint, otherwise the input is wrong, sorry for this. Let me send you another one: FREE_FATAL_ERROR_DEBUG_correct.edp (45.0 KB). At line 67, “hmincsd = .1”, you can change it to “hmincsd = .2, 0.25, or 0.5” to make the mesh coarser.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Max of meshsize: is 1.825596e-01 Min of meshsize: is 4.936972e-02
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
R1 = 9.000000e-01, R2 = 5.000000e-01, tref = 2.000000e-01, dext = 1.080000e+00, 2*hmin = 9.873944e-02
betamax = 5.196152e+00, beta = 1.000000e-01
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Volume of nonpaded design domain is = 3.243770e+02
KSP final norm of residual 0.0111526
KSP final norm of residual 0.000112754
Solved Adjoint
Yes, I do get the same thing. The problem is like this: in my code there is a loop around this process, and when I run it on my school's cluster, at some iteration (for example 738 or 1141) it gets terminated with that error in the adjoint solve. Then I rerun the code and the issue shows up again at a random iteration. But thank you for the help, Prj. I am trying to reinstall FreeFEM with OpenMPI, but the only available module on my cluster is "openmpi/gcc/64/3.1.6", which seems too old for FreeFEM; at least for now, Intel MPI is the only one I can compile FreeFEM with over there.
In that case, here is what you could do: at each iteration, save the state of your script before partitioning, overwriting what was written at the previous iteration. Then, if the crash occurs at iteration XYZ, we can simply restart from there and troubleshoot the issue.
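A minimal checkpoint/restart sketch, assuming hypothetical names (Th for the global mesh before partitioning, u for the field carried through the loop, iter for the loop counter); these are placeholders, so adapt them to the actual objects and finite element space in your script:

// --- inside the optimization loop, before partitioning/solving ---
if (mpirank == 0) {                       // one process writes the checkpoint
    savemesh(Th, "checkpoint.msh");       // overwrites the previous mesh file
    ofstream f("checkpoint.txt");
    f << iter << endl;                    // current iteration number
    f << u[] << endl;                     // degrees of freedom of the field
}

// --- on restart, at the top of the script instead of the usual setup ---
mesh Th = readmesh("checkpoint.msh");
fespace Vh(Th, P1);                       // same finite element space as before
Vh u;
int iter;
{
    ifstream f("checkpoint.txt");
    f >> iter;                            // resume the loop at this iteration
    f >> u[];
}

That way, a crash at iteration 738 or 1141 can be reproduced in a single iteration instead of rerunning the whole loop.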