Potential memory access bug?

Hello,

I am finding non-deterministic behavior when solving “large” problems using PETSc-complex on macOS. For the exact same code, sometimes the run completes normally, and sometimes I get a SEGV error in PETSc.

I am able to reproduce the error with the following approach:

  1. Make the following change to increase the problem size. (The key issue seems to be that the error does not occur unless the problem is sufficiently large.)
diff --git a/examples/hpddm/navier-stokes-2d-PETSc.edp b/examples/hpddm/navier-stokes-2d-PETSc.edp
index 7a794908..719ddee2 100644
--- a/examples/hpddm/navier-stokes-2d-PETSc.edp
+++ b/examples/hpddm/navier-stokes-2d-PETSc.edp
@@ -11,7 +11,7 @@ mesh Th;
     border c(t=0, 1) {x=t^0.9*L; y=-R; label=3;}
     border d(t=1, 0) {x=t^0.9*L; y=R; label=3;}
     border e(t=-R, R) {x=L; y=t; label=0;}
-    real ratio = 1.0;
+    real ratio = 4.0;
     Th = buildmesh(a(-70*ratio) + b(30*ratio) + c(80*ratio) + d(80*ratio) + e(20*ratio));
     plot(Th);
 }
  2. Then run
ff-mpirun -np 4 examples/hpddm/navier-stokes-2d-PETSc.edp -v 0
ff-mpirun -np 4 examples/hpddm/navier-stokes-2d-SLEPc-complex.edp -v 0

Sometimes the SLEPc line needs the additional argument -st_mat_mumps_icntl_13 1, sometimes it does not. Either way, I have also encountered this error when using SuperLU_dist instead of MUMPS. For me, the segmentation violation occurs roughly 50% of the time.
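
For completeness, the same MUMPS control can also be set programmatically through PETSc’s options database instead of on the command line; ICNTL(13) > 0 tells MUMPS to factor the root node sequentially, i.e. without ScaLAPACK. A minimal C++ sketch (not taken from my actual script, and it assumes a PETSc build configured with MUMPS):

#include <petscsys.h>

int main(int argc, char **argv) {
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  // Same effect as passing -st_mat_mumps_icntl_13 1 on the command line: the
  // SLEPc ST picks the option up from the global options database when it
  // builds the MUMPS factorization.
  PetscCall(PetscOptionsSetValue(NULL, "-st_mat_mumps_icntl_13", "1"));
  // ... assemble the matrices and set up the EPS/ST as in the .edp example ...
  PetscCall(PetscFinalize());
  return 0;
}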

I am using the latest version of PETSc/SLEPc from the main branch with a fresh compile of the FreeFEM develop branch. make check is clean.

Can anyone let me know if they are able to reproduce this error?

This is a ScaLAPACK bug (see the trace below). I don’t understand this part: “Sometimes the SLEPc line needs the additional argument -st_mat_mumps_icntl_13 1, sometimes it does not.” If you do use -st_mat_mumps_icntl_13 1, do you still get occasional segfaults?

Assertion failed in file src/mpi/datatype/typerep/src/typerep_yaksa_pack.c at line 339: element_size < 0 || *actual_unpack_bytes % element_size == 0
0   libpmpi.12.dylib                    0x000000010750fc74 MPL_backtrace_show + 52
1   libpmpi.12.dylib                    0x00000001073234e8 MPIR_Assert_fail + 92
2   libpmpi.12.dylib                    0x0000000107201c7c typerep_do_unpack + 920
3   libpmpi.12.dylib                    0x00000001072020f0 MPIR_Typerep_unpack + 96
4   libpmpi.12.dylib                    0x00000001072a9534 MPIDIG_recv_copy_seg + 172
5   libpmpi.12.dylib                    0x00000001072a8f18 MPIDI_POSIX_progress_recv + 572
6   libpmpi.12.dylib                    0x00000001072a8bb0 MPIDI_POSIX_progress + 104
7   libpmpi.12.dylib                    0x00000001072a8850 MPIDI_SHM_progress + 36
8   libpmpi.12.dylib                    0x00000001072a7fa8 MPIDI_progress_test + 4152
9   libpmpi.12.dylib                    0x00000001072a04cc MPID_Progress_test + 108
10  libpmpi.12.dylib                    0x00000001072a24d4 MPID_Progress_wait + 24
11  libpmpi.12.dylib                    0x00000001072a23ec MPIR_Wait_state + 68
12  libpmpi.12.dylib                    0x00000001072a287c MPID_Wait + 68
13  libpmpi.12.dylib                    0x00000001072a2708 MPIR_Wait + 228
14  libmpi.12.dylib                     0x0000000104e88574 internal_Recv + 3448
15  libmpi.12.dylib                     0x0000000104e877f0 MPI_Recv + 72
16  libscalapack.2.2.2.dylib            0x0000000103e9d774 BI_Srecv + 88
17  libscalapack.2.2.2.dylib            0x0000000103e89390 Czgerv2d + 236
18  libscalapack.2.2.2.dylib            0x0000000103fc1820 PB_CInOutV2 + 7708
19  libscalapack.2.2.2.dylib            0x000000010401e7a8 PB_CptrsmB + 2732
20  libscalapack.2.2.2.dylib            0x0000000103f7b3b0 pztrsm_ + 7144
21  libscalapack.2.2.2.dylib            0x0000000104396948 pzgetrf_ + 2596
22  libpetsc.3.023.5.dylib              0x000000011b12ea2c zmumps_facto_root_ + 2672
23  libpetsc.3.023.5.dylib              0x000000011afac5d0 __zmumps_fac_par_m_MOD_zmumps_fac_par + 18916
24  libpetsc.3.023.5.dylib              0x000000011b0bb480 zmumps_fac_par_i_ + 4112
25  libpetsc.3.023.5.dylib              0x000000011b0bde7c zmumps_fac_b_ + 10712
26  libpetsc.3.023.5.dylib              0x000000011b0e5eb4 zmumps_fac_driver_ + 78260
27  libpetsc.3.023.5.dylib              0x000000011b14d87c zmumps_ + 18508
28  libpetsc.3.023.5.dylib              0x000000011b152f50 zmumps_f77_ + 9256
29  libpetsc.3.023.5.dylib              0x000000011b144c30 zmumps_c + 5400

Thanks @prj. I am aware of that bug. I get the issue whether or not I use -st_mat_mumps_icntl_13 1, and indeed whether or not I use MUMPS at all. The error I receive is the following:

'/Users/cmdouglas/local/petsc/real/bin/mpiexec' -np 4 /Users/cmdouglas/local/FreeFem-sources/src/mpi/FreeFem++-mpi -nw 'examples/hpddm/navier-stokes-2d-SLEPc-complex.edp' -v 0 -st_mat_mumps_icntl_13 1
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[1]PETSC ERROR: to get more information on the crash.
[1]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
Abort(59) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1

Other times, it is successful and I get the expected output:

'/Users/cmdouglas/local/petsc/real/bin/mpiexec' -np 4 /Users/cmdouglas/local/FreeFem-sources/src/mpi/FreeFem++-mpi -nw 'examples/hpddm/navier-stokes-2d-SLEPc-complex.edp' -v 0 -st_mat_mumps_icntl_13 1
  1 EPS nconv=0 Values (Errors) 0.0987875+0.815584i (5.61776316e-05) -0.251228+0.644158i (1.04571032e-04) -0.2547+0.57598i (1.39984687e-04) -0.233419+0.711143i (1.89146077e-04) -0.0985848+0.841307i (1.24164980e-04) -0.247737+0.511791i (2.80710937e-04) -0.200688+0.774565i (2.66071523e-04) -0.148836+0.842466i (3.02551958e-04) -0.236908+0.441695i (6.06919178e-04) -0.210188+0.362264i (1.98038863e-03) -0.0819866+0.95825i (5.18008855e-03) -0.173597+0.271578i (7.43501790e-03) -0.120251+0.146039i (3.48309146e-02) -0.0526763-0.027885i (1.23986992e-01) -2.09975-0.622206i (1.51829294e+00) 5.50329+2.74698i (1.33410940e+00)
  2 EPS nconv=2 Values (Errors) 0.0987871+0.815584i (2.53653442e-08) -0.100274+0.839395i (4.94285776e-07) -0.288566+0.540013i (9.31353927e-05) -0.273603+0.485691i (2.39510914e-04) -0.298624+0.598384i (3.76451479e-05) -0.224729+0.401639i (1.74613982e-03) -0.295685+0.658633i (2.41145358e-05) -0.284019+0.717808i (2.21096931e-05) -0.263384+0.777432i (2.43134520e-05) -0.233998+0.836721i (3.41902425e-05) -0.125876+0.287301i (2.92165504e-02) -0.196284+0.909277i (1.45746551e-04) -0.130657+1.04038i (5.69855044e-03) -0.286577+1.11312i (2.17181041e-02) -0.500938+0.221255i (2.70689324e-01) -0.174703-0.284642i (7.01646203e-01)
WARNING! There are options you set that were not used!
WARNING! could be spelling mistake, etc!
There are 2 unused database options. They are:
Option left: name:-nw value: examples/hpddm/navier-stokes-2d-SLEPc-complex.edp source: command line
Option left: name:-v value: 0 source: command line

Here is my config.log (148.4 KB).

I found an error with Valgrind; I will give you an update when it is fixed.

Glad I’m not going crazy 🙂 Appreciate your support!

Does this fix your issue?

diff --git a/plugin/mpi/PETSc-code.hpp b/plugin/mpi/PETSc-code.hpp
index 6e8aad87..b1d4416b 100644
--- a/plugin/mpi/PETSc-code.hpp
+++ b/plugin/mpi/PETSc-code.hpp
@@ -3858,11 +3858,19 @@ namespace PETSc {
         if (ptB->_A->getScaling()) {
             ptA->_D = new KN<PetscReal>(dA->HPDDM_n);
             for (int i = 0; i < dA->HPDDM_n; ++i) ptA->_D->operator[](i) = ptB->_A->getScaling()[i];
+#if defined(PETSC_USE_REAL_SINGLE)
+            empty = new KN<double>(dA->HPDDM_n);
+            for (int i = 0; i < dA->HPDDM_n; ++i) empty->operator[](i) = ptB->_A->getScaling()[i];
+#else
+            empty = ptA->_D;
+#endif
         }
-        empty = new KN< double >(dA->HPDDM_n, (double*)(ptA->_D));
         initPETScStructure<D>(ptA, bs,
           nargs[1] && GetAny< bool >((*nargs[1])(stack)) ? PETSC_TRUE : PETSC_FALSE,
           empty);
+#if !defined(PETSC_USE_REAL_SINGLE)
+        empty = nullptr;
+#endif
       } else {
         int n = ptB->_A->getDof();
         ffassert(dA->HPDDM_n == n);
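
For context, the line this patch removes, empty = new KN< double >(dA->HPDDM_n, (double*)(ptA->_D));, casts the pointer to the KN<PetscReal> object itself rather than pointing at the scaling values it owns, so the array handed to initPETScStructure points at the object’s internals instead of the data. A standalone C++ sketch of that pattern, with std::vector standing in for KN (nothing below comes from the FreeFEM sources):

#include <cstdio>
#include <vector>

int main() {
  // A container that owns eight doubles, analogous to the KN that holds the scaling.
  std::vector<double> *D = new std::vector<double>(8, 2.0);

  // Wrong: this reinterprets the container *object* (a handful of internal
  // pointers) as an array of doubles; reading eight elements from it runs
  // past the object into unrelated memory.
  const double *wrong = reinterpret_cast<const double *>(D);

  // Right: ask the container for the buffer it owns.
  const double *right = D->data();

  std::printf("object at %p, data at %p, right[0] = %g\n",
              static_cast<const void *>(wrong),
              static_cast<const void *>(right), right[0]);
  delete D;
  return 0;
}

With the patch, a double-precision build simply reuses ptA->_D (and resets empty afterwards, since it now only aliases ptA->_D), while a single-precision build makes a converted copy; that is what the two #if branches above do.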

I believe it does. I’ve tried a dozen or so times with no errors. Thank you for the patch!

I’ll commit this later today, thanks for the follow-up.
