I am finding non-deterministic behavior when solving “large” problems using PETSc-complex on macOS. With the exact same code, sometimes the run completes, and sometimes I get a SEGV error in PETSc.
I am able to reproduce the error with the following approach:
Making the following change to increase the problem size. (The key issue here seems to be that this error does not occur unless the problem is sufficiently large.)
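Roughly, the kind of edit in question is to bump a size/resolution parameter in the script; purely as a hypothetical sketch (the actual lines in navier-stokes-2d-SLEPc-complex.edp differ), something like:
// hypothetical sketch only: the real parameter names and values in the example differ
int n = 200; // mesh resolution, increased from something like 100
mesh Th = square(n, n); // a finer mesh makes the eigenproblem large enough to hit the issue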
Sometimes the SLEPc line needs the additional argument -st_mat_mumps_icntl_13 1, sometimes it does not. Either way, I have encountered this error even when I use SuperLU_dist instead of MUMPS. For me, the segmentation violation occurs roughly 50% of the time.
I am using the latest version of PETSc/SLEPc from the main branch with a fresh compile of the FreeFEM develop branch. make check is clean.
Can anyone let me know if they are able to reproduce this error?
This is a ScaLAPACK bug (see trace below). I don’t understand this line: “Sometimes the SLEPc line needs the additional argument -st_mat_mumps_icntl_13 1, sometimes it does not.” If you indeed use -st_mat_mumps_icntl_13 1 (which should keep MUMPS from using ScaLAPACK on the root node), do you still get occasional segfaults?
Thanks @prj. I am aware of that bug. I get the issue whether or not I use -st_mat_mumps_icntl_13 1. Indeed, I get the issue whether or not I use MUMPS. The error I receive is the following:
'/Users/cmdouglas/local/petsc/real/bin/mpiexec' -np 4 /Users/cmdouglas/local/FreeFem-sources/src/mpi/FreeFem++-mpi -nw 'examples/hpddm/navier-stokes-2d-SLEPc-complex.edp' -v 0 -st_mat_mumps_icntl_13 1
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[1]PETSC ERROR: to get more information on the crash.
[1]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
Abort(59) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
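As the message suggests, more information could presumably be obtained by re-running with PETSc's debugging options; a possible invocation (assuming lldb is available on macOS) would be:
'/Users/cmdouglas/local/petsc/real/bin/mpiexec' -np 4 /Users/cmdouglas/local/FreeFem-sources/src/mpi/FreeFem++-mpi -nw 'examples/hpddm/navier-stokes-2d-SLEPc-complex.edp' -v 0 -st_mat_mumps_icntl_13 1 -start_in_debugger lldb -malloc_debug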
Other times, it is successful and I get the expected output:
'/Users/cmdouglas/local/petsc/real/bin/mpiexec' -np 4 /Users/cmdouglas/local/FreeFem-sources/src/mpi/FreeFem++-mpi -nw 'examples/hpddm/navier-stokes-2d-SLEPc-complex.edp' -v 0 -st_mat_mumps_icntl_13 1
1 EPS nconv=0 Values (Errors) 0.0987875+0.815584i (5.61776316e-05) -0.251228+0.644158i (1.04571032e-04) -0.2547+0.57598i (1.39984687e-04) -0.233419+0.711143i (1.89146077e-04) -0.0985848+0.841307i (1.24164980e-04) -0.247737+0.511791i (2.80710937e-04) -0.200688+0.774565i (2.66071523e-04) -0.148836+0.842466i (3.02551958e-04) -0.236908+0.441695i (6.06919178e-04) -0.210188+0.362264i (1.98038863e-03) -0.0819866+0.95825i (5.18008855e-03) -0.173597+0.271578i (7.43501790e-03) -0.120251+0.146039i (3.48309146e-02) -0.0526763-0.027885i (1.23986992e-01) -2.09975-0.622206i (1.51829294e+00) 5.50329+2.74698i (1.33410940e+00)
2 EPS nconv=2 Values (Errors) 0.0987871+0.815584i (2.53653442e-08) -0.100274+0.839395i (4.94285776e-07) -0.288566+0.540013i (9.31353927e-05) -0.273603+0.485691i (2.39510914e-04) -0.298624+0.598384i (3.76451479e-05) -0.224729+0.401639i (1.74613982e-03) -0.295685+0.658633i (2.41145358e-05) -0.284019+0.717808i (2.21096931e-05) -0.263384+0.777432i (2.43134520e-05) -0.233998+0.836721i (3.41902425e-05) -0.125876+0.287301i (2.92165504e-02) -0.196284+0.909277i (1.45746551e-04) -0.130657+1.04038i (5.69855044e-03) -0.286577+1.11312i (2.17181041e-02) -0.500938+0.221255i (2.70689324e-01) -0.174703-0.284642i (7.01646203e-01)
WARNING! There are options you set that were not used!
WARNING! could be spelling mistake, etc!
There are 2 unused database options. They are:
Option left: name:-nw value: examples/hpddm/navier-stokes-2d-SLEPc-complex.edp source: command line
Option left: name:-v value: 0 source: command line