Fatal error with PMPI_Waitany

Dear Prj, I recently ran into a strange error with my FreeFEM code when running it on my school's cluster:
" Fatal error in PMPI_Waitall: Other MPI error, error stack: "
It happens somewhat randomly.

I believe this problem only happens when the code runs across multiple machines (nodes). I am not sure whether it is related to my FreeFEM code, but I still hope you can give me some clues about it.

Thank you very much

“Abort(807022991) on node 24 (rank 24 in comm 0): Fatal error in PMPI_Waitany: Other MPI error, error stack:
PMPI_Waitany(295)…: MPI_Waitany(count=122, req_array=0x2d7320d0, index=0x7fffffff972c, status=0x7fffffff9860) failed
MPIDI_OFI_handle_cq_error(991): OFI poll failed (ofi_events.h:991:MPIDI_OFI_handle_cq_error:Input/output error)”

More information: the cluster is a Cray CS400 using InfiniBand for communication, and I am using Intel MPI.

Could you please share a reproducer?

Which reproducer do you want, the code or the file the error generated? Since I didn't run “export I_MPI_DEBUG=6”, the error message is just what I pasted above. I am currently rerunning my code with the debug flag set.

Some code that I could use to try to reproduce this, to see whether the problem comes from your installation or from within FreeFEM.

Sorry Prj, according to the policy I cannot distribute the code without permission. Let me talk with my professor to see if I can simplify the code and send it to you.

One more question: do you think that if I switch from Intel MPI to Open MPI, this issue might go away?

Maybe, can’t tell you for sure.

Forgot to mention that I don’t actually want the full code, just the piece that exhibits the error. So if it’s OK for you to send just that piece, that’s all I need.

Sorry for the long wait; I will send you that portion.


Here is the code; the error comes from the adjoint equation.

Does the error occur on smaller meshes too? Debugging such a large problem will probably be tricky…

I am not sure; I have never seen this problem with small meshes, maybe because I only run those on a single machine. The problem happens somewhat randomly. I suspect it is caused by Intel MPI or by the network of my school’s cluster.

I ran your code on 256 processes. It was OK up until line 763:

Max of meshsize: is 1.820027e-01 Min of meshsize: is 4.936972e-02
R1 = 1.250000e+00,  R2 = 5.000000e-01,  tref = 2.000000e-01,  dext = 1.500000e+00,  2*hmin = 9.873944e-02
betamax = 7.216878e+00, beta = 1.000000e-01
1/0 : d d d
  current line = 763 mpirank 24 / 256

Are you at least reaching this point of the script, or do you have an error before that?

No, but I think I know what is happening here: I should solve the elasticity equation first and then run this adjoint, otherwise the input is wrong; sorry for this. Let me send you another one. Also, at line 67, “hmincsd = .1”, you can change it to “hmincsd = .2, 0.25, or 0.5” to make the mesh coarser.

FREE_FATAL_ERROR_DEBUG_correct.edp (45.0 KB)

Here is what I get.

Max of meshsize: is 1.825596e-01 Min of meshsize: is 4.936972e-02
R1 = 9.000000e-01,  R2 = 5.000000e-01,  tref = 2.000000e-01,  dext = 1.080000e+00,  2*hmin = 9.873944e-02
betamax = 5.196152e+00, beta = 1.000000e-01
Volume of nonpaded design domain is = 3.243770e+02
KSP final norm of residual 0.0111526
KSP final norm of residual 0.000112754
Solved Adjoint

The code ends fine.

Yes, I do get the same thing. The problem is like this: my code has a loop that repeats this process; when I run it on my school’s cluster, it is sometimes terminated with that error at the adjoint equation at some iteration, for example 738 or 1141. Then I rerun the code, and the issue happens again at random. But thank you for the help, Prj. I am trying to reinstall FreeFEM with Open MPI, but the available module on my cluster is “openmpi/gcc/64/3.1.6”, which seems too old for FreeFEM. At least for now, Intel MPI is the only one I can compile FreeFEM with over there.

In that case, here is what you could do: at each iteration, save the state of your script before partitioning, overwriting what was previously written. Then, if the crash occurs at iteration XYZ, we can just restart from there and troubleshoot the issue.
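A minimal FreeFEM sketch of this checkpoint idea, under stated assumptions: the outer loop counter is `iter`, the global (unpartitioned) mesh is `ThGlobal`, and the design variable is the FE function `u` — all of these names are placeholders for whatever the actual script uses.

```freefem
// Hedged sketch: overwrite a single checkpoint each outer iteration so that
// a crashed run can be restarted near the failing iteration.
// maxIter, ThGlobal, and u are placeholder names, not from the real script.
for (int iter = 0; iter < maxIter; iter++) {
    if (mpirank == 0) {                  // only one rank writes the global state
        savemesh(ThGlobal, "ckpt.msh");  // mesh saved before partitioning
        ofstream f("ckpt-state.txt");
        f << iter << endl;               // iteration number to restart from
        f << u[] << endl;                // degrees of freedom of u
    }
    // ... partition the mesh, solve elasticity, solve the adjoint, update ...
}
```

On restart, one would read `ckpt-state.txt` back with an `ifstream`, reload the mesh with `readmesh`, rebuild `u` on it, and jump straight to the iteration that crashed.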

Yes, that’s what I am going to do next. Thank you, I will keep you updated.