I have noticed some new conflicts between PETSc and MPI after upgrading to v4.9. In particular, I am getting segmentation violations in parts of my codes (real matrix factorizations using MUMPS via KSPSolve) that did not previously produce this error.
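For context, these solves follow the standard FreeFEM/PETSc pattern, roughly like this minimal sketch (with an illustrative problem, not my actual code):

```freefem
// minimal sketch of the pattern that now segfaults (illustrative problem only)
load "PETSc"
macro dimension()2// EOM
include "macro_ddm.idp"
macro def(i)i// EOM
macro init(i)i// EOM

mesh Th = square(40, 40);
Mat A;
buildDmesh(Th)
createMat(Th, A, P1)

fespace Vh(Th, P1);
varf vLap(u, v) = int2d(Th)(dx(u)*dx(v) + dy(u)*dy(v)) + int2d(Th)(v) + on(1, u = 0);
real[int] rhs = vLap(0, Vh);
Vh u;

A = vLap(Vh, Vh, tgv = -1);
// exact LU factorization with MUMPS, applied through the KSP
set(A, sparams = "-ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps");
u[] = A^-1 * rhs;
```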
I built FF v4.9 following the tutorial here. However, I noticed the following line in the changelog on GitHub, which made me wonder whether the recommended compilation process has changed.
Now the order to find MPI in configure is: first, if you have PETSc, take MPI from PETSc; otherwise, use the previous method.
Please upload your config.log. This change actually made things slightly more robust. If you have two MPI implementations lying around (a terrible idea, but it happens), one could previously be used for PETSc and another for FreeFEM. Now, FreeFEM sticks to the one used by PETSc, so that such a conflict does not appear.
Thank you @prj, understood. Yes, unfortunately many of the machines I am setting up do have multiple MPI implementations on them (Open MPI and MPICH). I expect the issue is my own fault, though I cannot change the system setup. I have attached the config.log below. config.log (343.1 KB)
There is nothing shocking in the log. Could you please post a stack trace of the error by running the problematic script with the additional flag -on_error_attach_debugger? Does it only happen with MUMPS, or any time PETSc is called, e.g., for a distributed matrix–vector product?
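For instance, does something as basic as the following already crash? (A rough sketch, only to illustrate what I mean by a plain distributed matrix–vector product; adapt it to your setting.)

```freefem
// plain distributed matrix-vector product, no factorization involved
load "PETSc"
macro dimension()2// EOM
include "macro_ddm.idp"
macro def(i)i// EOM
macro init(i)i// EOM

mesh Th = square(40, 40);
Mat A;
buildDmesh(Th)
createMat(Th, A, P1)

fespace Vh(Th, P1);
varf vLap(u, v) = int2d(Th)(dx(u)*dx(v) + dy(u)*dy(v)) + on(1, u = 0);
A = vLap(Vh, Vh, tgv = -1);

Vh u = 1.0;
real[int] Au = A * u[]; // distributed MatMult only
```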
I will try to get an MWE going soon to answer all of your questions, but below is an example of the error I have been getting, from a test case run with a single MPI process. The code is a Newton-type nonlinear solver that performs two separate linear factorizations in each Newton step and reuses them to solve four linear systems per step (two with each matrix). The code executes the first Newton step correctly and fails in the second. The exact same code runs correctly on another system running FF v4.8.
-- mesh: Nb of Triangles = 134270, Nb of Vertices 67899
--- partition of unity built (in 2.01979000e-04)
-- mesh: Nb of Triangles = 134270, Nb of Vertices 67899
-- mesh: Nb of Triangles = 134270, Nb of Vertices 67899
Loading solution...
--- global numbering created (in 1.75119500e-03)
--- global CSR created (in 1.52139850e-02)
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Step 0: ||R|| = 4.95207315e-03.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
--- global numbering created (in 2.74660000e-04)
--- global CSR created (in 1.56980511e+00)
--- global numbering created (in 2.66629000e-04)
--- global CSR created (in 1.57308449e+00)
--- global numbering created (in 2.40907000e-04)
--- global CSR created (in 1.62662566e+00)
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
Step 1: ||R|| = 5.23150203e-06.
Warning: -- Your set of boundary condition is incompatible with the mesh label.
--- global numbering created (in 2.83546000e-04)
--- global CSR created (in 1.55207416e+00)
--- global numbering created (in 2.70395000e-04)
--- global CSR created (in 1.57533331e+00)
--- global numbering created (in 2.54864000e-04)
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.15.0-647-ge728e0e494 GIT Date: 2021-05-27 10:32:01 -0500
[0]PETSC ERROR: FreeFem++-mpi on a arch-FreeFem named ae-tlrg-405002 by cdouglas33 Mon Jul 5 10:01:42 2021
[0]PETSC ERROR: Configure options --download-mumps --download-parmetis --download-metis --download-hypre --download-superlu --download-slepc --download-hpddm --download-ptscotch --download-suitesparse --download-scalapack --download-tetgen --with-fortran-bindings=no --with-scalar-type=real --with-debugging=no
[0]PETSC ERROR: #1 User provided function() at unknown file:0
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
--- global CSR created (in 1.57046952e+00)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: User provided function() at unknown file:0
PETSC: Attaching gdb to FreeFem++-mpi of pid 3358843 on display localhost:10.0 on machine ae-tlrg-405002
Unable to start debugger in xterm: No such file or directory
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
[ae-tlrg-405002:3358843] *** Process received signal ***
[ae-tlrg-405002:3358843] Signal: Aborted (6)
[ae-tlrg-405002:3358843] Signal code: User function (kill, sigsend, abort, etc.) (0)
[ae-tlrg-405002:3358843] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fc6a8c3f210]
[ae-tlrg-405002:3358843] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fc6a8c3f18b]
[ae-tlrg-405002:3358843] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fc6a8c1e859]
[ae-tlrg-405002:3358843] [ 3] /home/cdouglas33/.local/petsc/arch-FreeFem/lib/libpetsc.so.3.015(+0xf2218)[0x7fc6a098e218]
[ae-tlrg-405002:3358843] [ 4] /home/cdouglas33/.local/petsc/arch-FreeFem/lib/libpetsc.so.3.015(+0xf2251)[0x7fc6a098e251]
[ae-tlrg-405002:3358843] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fc6a8c3f210]
[ae-tlrg-405002:3358843] [ 6] /home/cdouglas33/.local/FreeFem-sources/plugin/mpi/PETSc.so(_ZNK5PETSc22initCSRfromBlockMatrixIN5HPDDM7SchwarzIdEEEclEPv+0x301)[0x7fc6a26627b1]
[ae-tlrg-405002:3358843] [ 7] FreeFem++-mpi(_ZNK10ListOfInstclEPv+0xb7)[0x563064fd53d7]
[ae-tlrg-405002:3358843] [ 8] FreeFem++-mpi(_ZNK9E_RoutineclEPv+0x2a9)[0x563064fd6de9]
[ae-tlrg-405002:3358843] [ 9] FreeFem++-mpi(_ZNK10ListOfInstclEPv+0xb7)[0x563064fd53d7]
[ae-tlrg-405002:3358843] [10] FreeFem++-mpi(_Z6FWhilePvP4E_F0S1_+0x7d)[0x563064f4f72d]
[ae-tlrg-405002:3358843] [11] FreeFem++-mpi(_ZNK11E_F0_CFunc2clEPv+0x2f)[0x563064ebf4ff]
[ae-tlrg-405002:3358843] [12] FreeFem++-mpi(_ZNK10ListOfInstclEPv+0xb7)[0x563064fd53d7]
[ae-tlrg-405002:3358843] [13] FreeFem++-mpi(_Z6FWhilePvP4E_F0S1_+0x7d)[0x563064f4f72d]
[ae-tlrg-405002:3358843] [14] FreeFem++-mpi(_ZNK11E_F0_CFunc2clEPv+0x2f)[0x563064ebf4ff]
[ae-tlrg-405002:3358843] [15] FreeFem++-mpi(_ZNK10ListOfInstclEPv+0xb7)[0x563064fd53d7]
[ae-tlrg-405002:3358843] [16] FreeFem++-mpi(_Z7lgparsev+0x42c8)[0x563064ebc2a8]
[ae-tlrg-405002:3358843] [17] FreeFem++-mpi(_Z7Compilev+0x46)[0x563064ebe746]
[ae-tlrg-405002:3358843] [18] FreeFem++-mpi(_Z6mainffiPPc+0x50a)[0x563064ebee8a]
[ae-tlrg-405002:3358843] [19] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fc6a8c200b3]
[ae-tlrg-405002:3358843] [20] FreeFem++-mpi(_start+0x2e)[0x563064eb08be]
[ae-tlrg-405002:3358843] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ae-tlrg-405002 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Dear @prj, thank you again for the support and sorry for the delay. Here is a simplified script which reproduces the bug for me. The first iteration runs smoothly, but I get the segmentation violation right after the first matrix solve on the second iteration. The error occurs even with a single MPI process. bug.edp (2.2 KB)
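Schematically, the reproducer loop has the following shape (an illustrative outline only, with placeholder bilinear forms; the attached bug.edp is the actual script):

```freefem
// outline of the failing pattern: reassemble, refactorize with MUMPS,
// and reuse each factorization for several solves inside a loop
load "PETSc"
macro dimension()2// EOM
include "macro_ddm.idp"
macro def(i)i// EOM
macro init(i)i// EOM

mesh Th = square(40, 40);
Mat A;
buildDmesh(Th)
createMat(Th, A, P1)

fespace Vh(Th, P1);
varf vLap(u, v) = int2d(Th)(dx(u)*dx(v) + dy(u)*dy(v)) + on(1, u = 0);
varf vRHS(u, v) = int2d(Th)(v) + on(1, u = 0);
real[int] rhs = vRHS(0, Vh);
Vh u;

for (int step = 0; step < 3; step++) {
    A = vLap(Vh, Vh, tgv = -1); // reassembly triggers a new MUMPS factorization
    set(A, sparams = "-ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps");
    u[] = A^-1 * rhs;           // first solve with the fresh factorization
    u[] = A^-1 * rhs;           // second solve reusing the same factorization
}
```

In my runs, the first pass through such a loop completes, and the crash appears right after the first solve of the second pass.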