Sure! I think I did it. I attached to a process and printed out the full backtrace. I attach three of them: two from sleeping processes related to the srun command, and one from a running process related to the FreeFem++-mpi command. I am not sure which one you were referring to. All the parallel FreeFEM processes seem to be running.
gdb_srun_1059945.log (1.9 KB)
gdb_srun_1059944.log (4.5 KB)
gdb-backtrace.log (61.3 KB)
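In case it matters, backtraces like these can be captured non-interactively with something along these lines, where <pid> is a placeholder for the process being inspected:
gdb -p <pid> -batch -ex "thread apply all bt full" > backtrace.log 2>&1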
Are those the back traces when processes are hanging? They look like the back traces when processes are starting (before the KSPSolve()).
Maybe I did not explain myself well enough. I mean the process is still running. It just stops doing anything useful, I assume right before KSPSolve().
Edit: as GDB starts, it prints the warning Could not load shared library symbols for 2 libraries, e.g. ../../plugin/mpi/PETSc.so.
I am not sure if that is meaningful.
Sorry, I didn’t read carefully enough. I now see that one process is stuck in MUMPS. Could you try to run with something trivial, like PCJACOBI, for just 10 iterations? It will of course not converge, but it would be nice to check whether the deadlock is specific to MUMPS.
I tried out -pc_type jacobi -ksp_max_it 10 as you suggested, and the program seems to run fine. The same applies to -pc_type bjacobi. However, -pc_type lu -pc_factor_mat_solver_type superlu_dist gets stuck just like MUMPS. Maybe the issue is somehow related to the conversion between matrix types (which is required by PCLU) behind the scenes?
I attach the backtrace of the superlu run.
gdb_265145_superlu.log (27.1 KB)
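To make the matrix-conversion hypothesis a bit more concrete: a trivial standalone check would be to build a distributed AIJ matrix and perform a single collective MatConvert() with the same srun layout as the failing job. The sketch below is only an illustration of a collective type conversion (PCLU does not literally call MatConvert() like this), and the matrix size and target type are arbitrary:
#include <petsc.h>

int main(int argc, char **argv)
{
  Mat      A, B;
  PetscInt i, rstart, rend, n = 1000;

  PetscInitialize(&argc, &argv, NULL, NULL);
  /* trivial diagonal MPIAIJ matrix, distributed over all ranks */
  MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, 1, NULL, 0, NULL, &A);
  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = rstart; i < rend; i++) MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  /* collective conversion: every rank must reach and complete this call */
  MatConvert(A, MATDENSE, MAT_INITIAL_MATRIX, &B);
  PetscPrintf(PETSC_COMM_WORLD, "MatConvert finished\n");
  MatDestroy(&B);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}
If this also hangs at 384 processes, that would point below the solver packages rather than at MUMPS/SuperLU_DIST specifically.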
Can you reproduce this on a standalone PETSc example?
I do not think so. If I run the previously mentioned PETSc example, ex15.c, with srun -n 384 ex15 -m 200 -n 200 -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps -log_view -options_left, it seems like all the options are detected, and the program completes successfully.
Maybe try to edit src/fflib/ffapi.cpp: in ffapi_mpi_init(), remove everything and just keep the call to PetscInitialize() (you may have to add #include <petsc.h> at the top of the file), then try to recompile in src.
Thanks for the tip. Unfortunately, that did not work.
The corresponding function looks like the following:
static void ffapi_mpi_init(int &argc, char** &argv){
#ifdef WITH_PETSCxxxxx
  PetscInitialize(&argc, &argv, 0, "");
#endif
}
The compilation is successful, but the tests do not even start properly - I get the message Attempting to use an MPI routine before initializing MPICH; then the program exits.
You need to remove the #ifdef (the WITH_PETSCxxxxx macro is never defined, so the PetscInitialize() call is compiled out and MPI never gets initialized).
Unfortunately, that does not work either. The following function
static void ffapi_mpi_init(int &argc, char** &argv){
  PetscInitialize(&argc, &argv, 0, "");
}
fails to compile with the error
../fflib/ffapi.cpp: In function ‘void ffapi::ffapi_mpi_init(int&, char**&)’:
../fflib/ffapi.cpp:258:5: error: ‘PetscInitialize’ was not declared in this scope
PetscInitialize(&argc, &argv, 0, "");
^~~~~~~~~~~~~~~
At the top of the file, petsc.h is included like this:
#ifdef WITH_PETSC
#include <petsc.h>
#endif
If I remove the macro here, I get a different error.
Well, you need to remove the macro and fix the error.
Unfortunately, I just cannot figure it out. I think removing the macro inside the function mentioned causes problems in the non-PETSc part of FreeFEM. I understand that you do not have much time to dedicate to this issue (I already appreciate very much the help you have given me), but without further guidance the only solution I have in mind is to export the FreeFEM matrices in PETSc format and do the linear algebra in a separate PETSc code, which, given the convenience the FreeFEM-PETSc interface offers, would be rather underwhelming.
Right, you just want to compile the files needed for FreeFem++-mpi, and you probably need to fix the compile line/link line to include the proper PETSc flags. Anyway, if you can’t get it to compile (my idea was to let PETSc initialize MPI, since the code seems to be working when it’s PETSc-only), you should ask your sysadmin what may cause the deadlock. I’m not of much help without access to the machine.
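For concreteness, a minimal sketch of what I mean (the rest of ffapi.cpp, including the ffapi namespace, is omitted here):
#include <petsc.h>   // included unconditionally, so PetscInitialize() is in scope

static void ffapi_mpi_init(int &argc, char** &argv) {
  // let PETSc, rather than FreeFEM, initialize MPI
  PetscInitialize(&argc, &argv, 0, "");
}
The compile line for this file then has to carry the PETSc include directories, typically something like -I$PETSC_DIR/include -I$PETSC_DIR/$PETSC_ARCH/include, plus the matching PETSc libraries on the link line.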
Thanks for the reply.
The sysadmin suggested a different route: a spack + MPICH (with libfabric) installation. Unfortunately, the FreeFEM compilation gets stuck, and I am not sure why. I attach the config.log and the stdout/stderr of the make command. Could you please take a look at them? Maybe you can spot what goes wrong.
makeout.log (283.7 KB)
makeerr.log (41.3 KB)
config.log (381.1 KB)
From the output of ps aux, I think this is what gets stuck:
/sw/pkgs/arc/intel/2022.1.2/compiler/2022.0.2/linux/bin-llvm/clang++ -cc1 -triple x86_64-unknown-linux-gnu -emit-obj --mrelax-relocations -disable-free -disable-llvm-verifier -discard-value-names -main-file-name cmaes.cpp -mrelocation-model pic -pic-level 2 -fhalf-no-semantic-interposition -fveclib=SVML -mframe-pointer=none -menable-no-infs -menable-no-nans -menable-unsafe-fp-math -fno-signed-zeros -mreassociate -freciprocal-math -fdenormal-fp-math=preserve-sign,preserve-sign -ffp-contract=fast -fno-rounding-math -ffast-math -ffinite-math-only -mconstructor-aliases -munwind-tables -target-cpu x86-64 -target-feature +mmx -target-feature +avx2 -target-feature +avx -target-feature +sse4.2 -target-feature +sse2 -target-feature +sse -mllvm -x86-enable-unaligned-vector-move=true -tune-cpu generic -debug-info-kind=limited -dwarf-version=4 -debugger-tuning=gdb -fcoverage-compilation-dir=/home/andrasz/FFinstall_mpich/FreeFem-sources/plugin/mpi -resource-dir /sw/pkgs/arc/intel/2022.1.2/compiler/2022.0.2/linux/lib/clang/14.0.0 -I ../seq/include -I ../seq -I /home/andrasz/.spack/opt/spack/intel-2022.0.2/mpich/4.2.3-akkh/include -I /home/andrasz/FFinstall_mpich/petsc/arch-FreeFem/include/suitesparse -I /home/andrasz/FFinstall_mpich/petsc/arch-FreeFem/include -I /home/andrasz/.spack/opt/spack/intel-2022.0.2/mpich/4.2.3-akkh/include -D NDEBUG -D BAMG_LONG_LONG -D NCHECKPTR -I/sw/pkgs/arc/intel/2022.1.2/tbb/2021.5.1/include -internal-isystem /sw/pkgs/arc/intel/2022.1.2/compiler/2022.0.2/linux/bin-llvm/../compiler/include -internal-isystem /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8 -internal-isystem /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/x86_64-redhat-linux -internal-isystem /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/backward -internal-isystem /sw/pkgs/arc/intel/2022.1.2/compiler/2022.0.2/linux/lib/clang/14.0.0/include -internal-isystem /usr/local/include -internal-isystem /usr/lib/gcc/x86_64-redhat-linux/8/../../../../x86_64-redhat-linux/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -std=gnu++14 -fdeprecated-macro -fdebug-compilation-dir=/home/andrasz/FFinstall_mpich/FreeFem-sources/plugin/mpi -ferror-limit 19 -fheinous-gnu-extensions -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -mllvm -enable-gvn-hoist -vectorize-loops -vectorize-slp -D__GCC_HAVE_DWARF2_CFI_ASM=1 -fintel-compatibility -mllvm -disable-hir-generate-mkl-call -mllvm -intel-libirc-allowed -mllvm -loopopt=0 -floopopt-pipeline=none -mllvm -enable-lv -o cmaes.o -x c++ ../seq/cmaes.cpp
This will fix things with some compilers:
diff --git a/plugin/mpi/mpi-cmaes.cpp b/plugin/mpi/mpi-cmaes.cpp
index c80ee72e..2d03f0e2 100644
--- a/plugin/mpi/mpi-cmaes.cpp
+++ b/plugin/mpi/mpi-cmaes.cpp
@@ -28,3 +28,3 @@
-//ff-c++-LIBRARY-dep: mpi
+//ff-c++-LIBRARY-dep: mpi broken
//ff-c++-cpp-dep: ../seq/cmaes.cpp -I../seq
diff --git a/plugin/seq/cmaes.cpp b/plugin/seq/cmaes.cpp
index b8196b79..92847e45 100644
--- a/plugin/seq/cmaes.cpp
+++ b/plugin/seq/cmaes.cpp
@@ -23,3 +23,3 @@
/* clang-format off */
-//ff-c++-LIBRARY-dep:
+//ff-c++-LIBRARY-dep: broken
//ff-c++-cpp-dep:
Thanks for the quick help! This indeed fixed the problem. I will test whether this version can work on multiple nodes.
With MPICH, I experience similar unreliability at a high number of processes as with OpenMPI. I managed to reproduce the issue with a PETSc-only script. I wrote to the PETSc developers; hopefully, they can offer some help with the investigation.
As recommended by the PETSc developers, I am trying out the newest Intel compilers/MPI with spack. Unfortunately, the FreeFEM configure command returns an error. Could you please take a look at the config.log to see what goes wrong?
config.log (53.9 KB)
You need an up-to-date autoconf library.
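For example (just a sketch, assuming spack is already available on this machine and that the build system is regenerated from the FreeFem-sources directory before configure is rerun):
spack install autoconf automake libtool
spack load autoconf automake libtool
autoreconf -i     # regenerate the configure script with the newer autotools
./configure ...   # rerun with the same options as before
The exact spack spec (compiler, versions) may need adjusting to match the rest of the toolchain.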