Fatal error with PMPI_WaitAny

Dear Prj, I recently ran into a strange error with my FreeFEM code when running it on my school's cluster:
" Fatal error in PMPI_Waitall: Other MPI error, error stack: "
It happens more or less randomly.

I believe this problem only happens when the code is running across multiple machines (nodes). I am not sure whether it has anything to do with my FreeFEM code, but I still hope you can give me some clues about it.

Thank you very much

“Abort(807022991) on node 24 (rank 24 in comm 0): Fatal error in PMPI_Waitany: Other MPI error, error stack:
PMPI_Waitany(295)…: MPI_Waitany(count=122, req_array=0x2d7320d0, index=0x7fffffff972c, status=0x7fffffff9860) failed
PMPI_Waitany(263)…:
MPIDI_Progress_test_impl(195).:
MPIDI_OFI_handle_cq_error(991): OFI poll failed (ofi_events.h:991:MPIDI_OFI_handle_cq_error:Input/output error)”

More information: the cluster is a Cray CS400 using InfiniBand for communication, and I am using Intel MPI.
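Since the failure seems random, it may help to check whether the same rank is involved every time. Below is a minimal Python sketch that pulls the abort code, rank, and failing MPI call out of log text and tallies the ranks over several runs. It assumes the abort line has exactly the format shown in the message above; the function names `parse_aborts` and `tally_ranks` are hypothetical helpers, not part of FreeFEM or Intel MPI.

```python
import re
from collections import Counter

# Matches lines such as:
#   Abort(807022991) on node 24 (rank 24 in comm 0): Fatal error in PMPI_Waitany: ...
# Assumption: the format is the one shown in the error message above.
ABORT_RE = re.compile(
    r"Abort\((\d+)\) on node \d+ \(rank (\d+) in comm \d+\): Fatal error in (\w+):"
)


def parse_aborts(log_text):
    """Return one dict per abort line found in the log text."""
    return [
        {"code": int(m.group(1)), "rank": int(m.group(2)), "call": m.group(3)}
        for m in ABORT_RE.finditer(log_text)
    ]


def tally_ranks(logs):
    """Count how often each rank appears in abort lines across several logs."""
    counts = Counter()
    for text in logs:
        counts.update(a["rank"] for a in parse_aborts(text))
    return counts
```

If one rank (or the ranks of one node) dominates the tally, that would point toward a specific machine or network link rather than the script, though this is only a heuristic.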

Could you please share a reproducer?

Which reproducer do you want: the code, or the file with the error output? Since I didn’t run with “export I_MPI_DEBUG=6”, the error message is only what is shown above. I am currently running my code again with the debug flag set.

A code that I could use to try to reproduce this, to see whether the problem comes from your installation or from within FreeFEM.

Sorry Prj, according to our policy I cannot distribute the code without permission. Let me talk with my professor to see whether I can simplify the code and send it to you.

One more question: do you think this issue might go away if I switch from Intel MPI to Open MPI?

Maybe, can’t tell you for sure.

Forgot to mention that I actually don’t want the full code, just the piece that exhibits the error. So if it’s OK for you to send just this piece, that’s all I need.

Sorry for the long wait; I will send you that portion.

FREE_FATAL_ERROR_DEBUG.edp (36.3 KB)

Here is the code; the error comes from the adjoint equation.

Does the error occur on smaller meshes too? Debugging on such a large problem will probably be tricky…

I am not sure. I have never seen this problem with small meshes, but that may be because I only ran those on a single machine. The problem happens more or less randomly. I suspect it is caused by Intel MPI or by the network of my school’s cluster.

I ran your code on 256 processes. It was OK up until line 763

[...]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Max of meshsize: is 1.820027e-01 Min of meshsize: is 4.936972e-02
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
R1 = 1.250000e+00,  R2 = 5.000000e-01,  tref = 2.000000e-01,  dext = 1.500000e+00,  2*hmin = 9.873944e-02
betamax = 7.216878e+00, beta = 1.000000e-01
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1/0 : d d d
  current line = 763 mpirank 24 / 256
[...]

Are you at least reaching this point of the script, or do you have an error before that?

No, but I think I know what is happening here: I should solve the elasticity equation first and then run the adjoint, otherwise the input is wrong. Sorry about this, let me send you another one. At line 67, you can change “hmincsd = .1” to “hmincsd = .2”, “0.25”, or “0.5” to make the mesh smaller.

FREE_FATAL_ERROR_DEBUG_correct.edp (45.0 KB)

Here is what I get.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Max of meshsize: is 1.825596e-01 Min of meshsize: is 4.936972e-02
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
R1 = 9.000000e-01,  R2 = 5.000000e-01,  tref = 2.000000e-01,  dext = 1.080000e+00,  2*hmin = 9.873944e-02
betamax = 5.196152e+00, beta = 1.000000e-01
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Volume of nonpaded design domain is = 3.243770e+02
KSP final norm of residual 0.0111526
KSP final norm of residual 0.000112754
Solved Adjoint

The code ends fine.

Yes, I get the same thing. The problem is this: my code runs this process in a loop, and when I run it on my school’s cluster, it is sometimes terminated with that error at the adjoint equation at some iteration, for example 738 or 1141. When I rerun the code, the error again shows up at a random iteration. But thank you for the help, Prj. I am trying to reinstall FreeFEM with Open MPI, but the module available on my cluster is "openmpi/gcc/64/3.1.6", which seems too old for FreeFEM. At least for now, Intel MPI is the only MPI I can compile FreeFEM with over there.

In that case, what you could do is the following. At each iteration, save the state of your script before partitioning. Each iteration, overwrite what was previously written. Then, if the crash occurs at iteration XYZ, we could just restart from there and troubleshoot the issue.
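The overwrite-and-restart pattern described here can be sketched in a language-agnostic way. The Python example below is only an illustration, not FreeFEM code: the names `save_checkpoint` and `load_checkpoint`, the JSON format, and the `state` dictionary are all assumptions. In the actual script, the state would be whatever is needed to resume (for example the global mesh and the design variable before partitioning), and only one process should write it.

```python
import json
import os
import tempfile


def save_checkpoint(path, iteration, state):
    """Overwrite the single checkpoint file with the latest iteration.

    The data is first written to a temporary file in the same directory and
    then renamed, so a crash during the write cannot corrupt the previous
    checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"iteration": iteration, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def load_checkpoint(path):
    """Return (iteration, state), or (0, None) if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        data = json.load(f)
    return data["iteration"], data["state"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "checkpoint.json")
        start, state = load_checkpoint(ckpt)
        for it in range(start, start + 3):
            # Hypothetical placeholder for the state before partitioning.
            state = {"design": [0.1 * it, 0.2 * it]}
            save_checkpoint(ckpt, it, state)
            # ... the solve that may crash would go here ...
        print(load_checkpoint(ckpt))
```

Because the file is written before the solve, a crash at iteration XYZ leaves a checkpoint for exactly that iteration, which is the one needed to reproduce the failure.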

Yes, that’s what I am going to do next. Thank you, I will keep you updated.