FreeFEM compilation: issues on HPC

Hello everyone,

I am having trouble installing FreeFEM + PETSc on an HPC cluster. The installation seems fine at first glance, and everything runs smoothly when I run the codes on a single node; however, when I try to run on multiple nodes, the code just stops at a random point, similar to a previously reported problem.

First, I tried the installation with the following modules:

module load git/2.39.1
module load intel/2022.1.2
module load openmpi/5.0.3

With these modules, the installation is successful, but on multiple nodes the program stops without any error message.
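For reference, the build followed roughly the standard FreeFEM + PETSc sequence from the documentation (a sketch; my exact configure options and paths may differ):

git clone https://github.com/FreeFem/FreeFem-sources
cd FreeFem-sources
autoreconf -i
./configure --enable-download
cd 3rdparty/ff-petsc && make petsc-slepc   # builds PETSc/SLEPc with the loaded MPI
cd - && ./reconfigure                      # pick up the freshly built PETSc
make -j8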

Then, as suggested in a previous post, I tried switching to the only available Intel MPI module: module load impi/2021.5.1.
With it, the PETSc compilation was successful, while the FreeFEM compilation failed with the following message:

mv -f $depbase.Tpo $depbase.Po
depbase=`echo ../lglib/lg.tab.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
mpicxx -DHAVE_CONFIG_H -I. -I../..  -DPARALLELE -I./../fflib -I./../Graphics -I./../femlib -I./../bamglib/ -I"/sw/pkgs/arc/intel/2022.1.2/mpi/2021.5.1/include" -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker -Xlinker -rpath -Xlinker -Xlinker --enable-new-dtags   -I/home/andrasz/FFinstall/petsc/arch-FreeFem/include/suitesparse -I/home/andrasz/FFinstall/petsc/arch-FreeFem/include -I./../../3rdparty/include/BemTool/ -I./../../3rdparty/boost/include -I../../plugin/mpi  -DPARALLELE -DHAVE_ZLIB -g  -DNDEBUG -O3 -mmmx -mavx2 -mavx -msse4.2 -msse2 -msse -std=gnu++14 -DBAMG_LONG_LONG  -DNCHECKPTR -fPIC -MT ../lglib/lg.tab.o -MD -MP -MF $depbase.Tpo -c -o ../lglib/lg.tab.o ../lglib/lg.tab.cpp &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo compositeFESpace.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
mpicxx -DHAVE_CONFIG_H -I. -I../..  -DPARALLELE -I./../fflib -I./../Graphics -I./../femlib -I./../bamglib/ -I"/sw/pkgs/arc/intel/2022.1.2/mpi/2021.5.1/include" -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker -Xlinker -rpath -Xlinker -Xlinker --enable-new-dtags   -I/home/andrasz/FFinstall/petsc/arch-FreeFem/include/suitesparse -I/home/andrasz/FFinstall/petsc/arch-FreeFem/include -I./../../3rdparty/include/BemTool/ -I./../../3rdparty/boost/include -I../../plugin/mpi  -DPARALLELE -DHAVE_ZLIB -g  -DNDEBUG -O3 -mmmx -mavx2 -mavx -msse4.2 -msse2 -msse -std=gnu++14 -DBAMG_LONG_LONG  -DNCHECKPTR -fPIC -MT compositeFESpace.o -MD -MP -MF $depbase.Tpo -c -o compositeFESpace.o compositeFESpace.cpp &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo parallelempi.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
mpicxx -DHAVE_CONFIG_H -I. -I../..  -DPARALLELE -I./../fflib -I./../Graphics -I./../femlib -I./../bamglib/ -I"/sw/pkgs/arc/intel/2022.1.2/mpi/2021.5.1/include" -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker -Xlinker -rpath -Xlinker -Xlinker --enable-new-dtags   -I/home/andrasz/FFinstall/petsc/arch-FreeFem/include/suitesparse -I/home/andrasz/FFinstall/petsc/arch-FreeFem/include -I./../../3rdparty/include/BemTool/ -I./../../3rdparty/boost/include -I../../plugin/mpi  -DPARALLELE -DHAVE_ZLIB -g  -DNDEBUG -O3 -mmmx -mavx2 -mavx -msse4.2 -msse2 -msse -std=gnu++14 -DBAMG_LONG_LONG  -DNCHECKPTR -fPIC -MT parallelempi.o -MD -MP -MF $depbase.Tpo -c -o parallelempi.o parallelempi.cpp &&\
mv -f $depbase.Tpo $depbase.Po
g++: error: unrecognized command line option ‘-rpath’
make[3]: *** [Makefile:710: ffapi.o] Error 1
make[3]: *** Waiting for unfinished jobs....
g++: error: unrecognized command line option ‘-rpath’
g++: error: unrecognized command line option ‘-rpath’
g++: error: unrecognized command line option ‘-rpath’
g++: error: unrecognized command line option ‘-rpath’
make[3]: *** [Makefile:710: ../lglib/mymain.o] Error 1
make[3]: *** [Makefile:710: ../lglib/lg.tab.o] Error 1
make[3]: *** [Makefile:710: parallelempi.o] Error 1
make[3]: *** [Makefile:710: compositeFESpace.o] Error 1
make[3]: Leaving directory '/home/andrasz/FFinstall/FreeFem-sources/src/mpi'
make[2]: *** [Makefile:554: all-recursive] Error 1
make[2]: Leaving directory '/home/andrasz/FFinstall/FreeFem-sources/src'
make[1]: *** [Makefile:896: all-recursive] Error 1
make[1]: Leaving directory '/home/andrasz/FFinstall/FreeFem-sources'

I am attaching the config.log files for both cases. Any help with finding the issue is much appreciated.

config_intel_openmpi.zip (643.0 KB)

config_intel_intelmpi.zip (631.5 KB)

I don’t know of a better fix than manually removing -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker -Xlinker -rpath -Xlinker -Xlinker --enable-new-dtags from the various Makefiles in the source tree.

Thanks for the tip. Removing the corresponding string from the Makefiles is not enough: there is also an occurrence of it in the file plugin/seq/WHERE_LIBRARY-config, which is created by the ./configure command. Once I remove it there as well, the compilation completes. However, the original issue remains: the code runs fine on a single node, but stalls when I run it on two nodes. Do you have any ideas how to figure out a solution?
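For anyone hitting the same g++: error: unrecognized command line option ‘-rpath’ failure, a sketch of the cleanup (the flag string is copied from the failing mpicxx command above; whitespace in the generated files may differ, so adjust the pattern if nothing matches):

cd ~/FFinstall/FreeFem-sources
BAD='-Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker -Xlinker -rpath -Xlinker -Xlinker --enable-new-dtags'
# Strip the flags from every generated Makefile...
find . -name Makefile -exec sed -i "s/$BAD//g" {} +
# ...and from the configure-generated file mentioned above.
sed -i "s/$BAD//g" plugin/seq/WHERE_LIBRARY-config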

Are you sure you are not mixing MPI implementations at compile- and run-time?
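A quick way to check (a sketch; paths will differ on your cluster):

which mpicxx mpirun            # both should live under the loaded MPI module
mpicxx -show                   # Intel MPI: reveals the underlying compiler and libraries
# (with Open MPI the equivalent is: mpicxx -showme)
ldd $(which FreeFem++-mpi) | grep -i mpi   # the libmpi resolved at run time should match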

I do not think so. I load either the impi/2021.5.1 module or the openmpi/5.0.3 module, and I set the compiler flags/environment variables myself. Still, the issue persists in both cases.

I will try to contact the HPC admins about this. In the meantime, if you have any ideas where I should investigate further, I would appreciate it.

How are you launching the FreeFEM code?

I tried the following:
ff-mpirun -n 384 Nonlinear-solver.edp -Re 50 -v 0
srun --mpi=pmi2 -n 384 FreeFem++-mpi Nonlinear-solver.edp -Re ${Re} -v 0
Neither worked.
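For context, the srun line sits inside a Slurm batch script along these lines (a sketch; node/task counts and module names are assumptions based on my setup above):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=384
module load intel/2022.1.2 openmpi/5.0.3
Re=50
srun --mpi=pmi2 -n 384 FreeFem++-mpi Nonlinear-solver.edp -Re ${Re} -v 0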

Can you do ldd FreeFem++-mpi and send the output, please? Does a simple cout << mpirank << endl; run?
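For instance (a sketch; hello-mpi.edp is a hypothetical filename):

echo 'cout << "rank " << mpirank << " of " << mpisize << endl;' > hello-mpi.edp
ff-mpirun -n 4 hello-mpi.edp -v 0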

Sure, here is the log.
Note that in the meantime I switched back to Open MPI, as its compilation was easier.
ldd.log (4.6 KB)
Yes, the program displays all the MPI ranks.

What is really odd (in the multi-node case, of course) is that with the current Open MPI installation, the code stalls when I launch it with ff-mpirun -n NPROC, but it seems to work when I launch it with srun -n NPROC FreeFem++-mpi (without --mpi=pmi2). However, it displays the following error message: [lh1659:3771795] Error: coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
I suppose this happens every time communication occurs between the nodes.
I am not sure how serious an issue this is, if it is one at all. From what I found online, it may cause suboptimal performance.
Anyway, I would appreciate any insight into this behavior, which would help me (and the community) sort out similar issues.

This error comes from your machine, not the installation.
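If the message bothers you, a possible workaround is to disable the hcoll collective component of Open MPI (a sketch; I have not verified that your cluster's Open MPI was built with this component):

export OMPI_MCA_coll_hcoll_enable=0      # or: mpirun --mca coll_hcoll_enable 0 ...
ff-mpirun -n 384 Nonlinear-solver.edp -Re 50 -v 0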

Ok, thank you so much for your help!!