Number of processes and memory usage

Dear FreeFem users,

When I run FreeFem in parallel on, e.g., 5 processes, it also starts 10 sleeping processes. These processes seem to consume memory. Should I be worried about this? What is the reason for this? See also the attached image.

I experience this behaviour on Ubuntu 18.04, both in an Oracle VirtualBox virtual machine and on a machine running only Ubuntu.
I use a recently compiled version of FreeFem (develop branch). I run FreeFem with mpirun --allow-run-as-root -np 5 FreeFem++-mpi .....
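A possible first check, offered as a sketch rather than a diagnosis: tools such as htop list threads as separate entries by default, so the extra sleeping entries may be helper threads of the 5 MPI ranks rather than independent processes. That is an assumption, not something confirmed in this thread. On Linux, /proc/<pid>/status reports the thread count (Threads) and the resident memory (VmRSS) of each process, and the script below prints both for every process whose name starts with FreeFem++. The function names parse_proc_status and summarize are hypothetical.

```python
import os


def parse_proc_status(text):
    """Extract name, state, thread count and resident memory (kB)
    from the text of a Linux /proc/<pid>/status file."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    # Kernel threads have no VmRSS entry, so fall back to 0 kB.
    rss = fields.get("VmRSS", "0 kB").split()[0]
    return {
        "name": fields.get("Name", ""),
        "state": fields.get("State", ""),
        "threads": int(fields.get("Threads", "1")),
        "rss_kb": int(rss),
    }


def summarize(name_prefix="FreeFem++"):
    """Print one line per running process whose name starts with name_prefix."""
    if not os.path.isdir("/proc"):
        return  # not a Linux-style /proc filesystem
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/status") as f:
                info = parse_proc_status(f.read())
        except OSError:
            continue  # process exited (or is unreadable) while scanning
        if info["name"].startswith(name_prefix):
            print(pid, info)


if __name__ == "__main__":
    summarize()
```

If each of the 5 ranks reports a Threads value of 3, that would account for 10 extra entries; threads share the address space of their process, so the memory shown next to them would not be additional memory.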

What BLAS are you using?

I must admit, I have no idea what the FreeFem compilation process does, so it is likely that I am not looking in the right place. I am sending you the config.log file from the FreeFem-sources directory. Could you please help me check the BLAS version?
config.log (347.1 KB)

It’s in the file you sent or in PETSc configure.log. But it looks alright. I think it’s OK, are you getting bad performance?
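For readers who want to find the relevant entries themselves, here is a minimal sketch that scans a configure log for lines mentioning a BLAS or LAPACK library. It makes no assumption about the log format beyond plain text; the keyword list and the function name blas_lines are illustrative choices, and config.log is the file name mentioned above.

```python
import os
import re

# Keywords that commonly appear in BLAS/LAPACK library names (illustrative list).
BLAS_PATTERN = re.compile(r"blas|lapack|mkl|atlas", re.IGNORECASE)


def blas_lines(log_text, limit=20):
    """Return up to `limit` distinct lines of a configure log
    that mention a BLAS/LAPACK-like library name."""
    seen = []
    for line in log_text.splitlines():
        line = line.strip()
        if BLAS_PATTERN.search(line) and line not in seen:
            seen.append(line)
            if len(seen) == limit:
                break
    return seen


if __name__ == "__main__":
    path = "config.log"  # file name taken from the post above
    if os.path.exists(path):
        with open(path, errors="replace") as f:
            print("\n".join(blas_lines(f.read())))
```

The same function can be pointed at the PETSc configure.log by changing the path.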

The speed of the calculation is fine, I suppose. I am trying out the steady-state mAL-preconditioned NS solver, and it seems to run out of memory quite fast (at least compared to commercial CFD software with about the same mesh size).

I found that changing the last part of string paramsV = "-ksp_type gmres -ksp_pc_side right " + "-ksp_rtol " + tolV + " -ksp_gmres_restart 50 -pc_type asm " + "-pc_asm_overlap 1 -sub_pc_type lu -sub_pc_factor_mat_solver_type mumps";
to -pc_asm_overlap 1 -sub_pc_type ilu helps.

Do you have any suggestion maybe to decrease the memory requirement (even if the computational speed is slowed down)?

What’s the size of the problem?

This run only just avoids running out of RAM on a virtual machine with 12 GB of RAM.
number of triangles: 244925, number of vertices: 47183
total ndof: 1183721 (in FreeFem numbering)
I am also sending you the output of the -log_view option.
Nonlinear-solver_Re350.log (14.3 KB)
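A back-of-the-envelope estimate can help separate the memory held by GMRES itself from the memory held by the subdomain factorizations. The sketch below assumes double-precision real scalars (8 bytes each) and treats the total ndof reported above as the size of the system solved with -ksp_gmres_restart 50. Since those options are set for the velocity block, this is an upper bound. PETSc's additional work vectors are ignored. All of these are assumptions.

```python
def gmres_basis_bytes(n_dof, restart, bytes_per_scalar=8):
    """Approximate memory for `restart` Krylov basis vectors of length n_dof."""
    return n_dof * restart * bytes_per_scalar


n_dof = 1183721   # total ndof quoted above
restart = 50      # from -ksp_gmres_restart 50
n_ranks = 5       # from mpirun -np 5

total = gmres_basis_bytes(n_dof, restart)
per_rank = total / n_ranks
print(f"Krylov basis: {total / 1e9:.2f} GB in total, "
      f"{per_rank / 1e6:.0f} MB per process")
```

Under these assumptions the Krylov basis takes roughly 0.47 GB in total, a small fraction of the 12 GB available. That would suggest most of the memory goes elsewhere, for instance into the LU factors of the subdomain solves, which fits the observation that switching to -sub_pc_type ilu helps.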

This is a tiny problem, why not use plain LU?

For this mesh size, the mAL-preconditioned GMRES method seems to outperform plain LU factorization in terms of memory usage, although LU is slightly faster. The thing is, we want to run this on our desktop machine, and even if we use a cluster, we would like to use only a few processes (4-6 per case) and run several cases in parallel (the scaling of other parts of the workflow, which take up most of the computation, determines the number of processes). This is why we are concerned with the memory usage.

Do you maybe have other suggestions for altering the numerics? E.g., using -pc_type gamg instead of -pc_type asm? Or changing the inner iteration tolerance?

And referring back to the original question regarding the additional sleeping processes: do you think I can ignore these, or should I investigate the cause? If the latter, where should I start?

If you’ll stick to low process counts, you could try PCASM + PCLU for subdomain solvers (maybe with MUMPS instead of the default PETSc LU factorisation). For the sleeping processes, I don’t know, sorry.

Thanks for the helpful comments. Just to be sure, you mean by PCASM+PCLU for the subdomain solver the following: -pc_type asm -pc_asm_overlap 1 -sub_pc_type asm -sub_sub_pc_type lu -sub_sub_pc_factor_mat_solver_type mumps? (this seems to run and converge)

No, it should be -pc_type asm -pc_asm_overlap 1 -sub_pc_type lu -sub_pc_factor_mat_solver_type [mumps,petsc]. Always check on a small problem with -ksp_view to make sure that what you think you are feeding to PETSc is what it is actually using.
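The prefix rule behind this correction can be sketched as follows. Each level of nesting adds one sub_ to the option prefix, so -sub_pc_type asm followed by -sub_sub_pc_type lu requests another ASM inside each ASM subdomain, whereas -sub_pc_type lu places the direct solver directly on the subdomains. The helper below only assembles such strings so that the two variants can be compared side by side; its name asm_options and its arguments are hypothetical and not part of FreeFem or PETSc.

```python
def asm_options(sub_pc, solver=None, overlap=1, depth=1):
    """Build a PETSc option string with `depth` nested ASM levels and
    `sub_pc` as the innermost subdomain preconditioner."""
    opts = [f"-pc_type asm -pc_asm_overlap {overlap}"]
    prefix = "sub_"
    # Every extra ASM level pushes the inner options one "sub_" deeper.
    for _ in range(depth - 1):
        opts.append(f"-{prefix}pc_type asm")
        prefix += "sub_"
    opts.append(f"-{prefix}pc_type {sub_pc}")
    if solver is not None:
        opts.append(f"-{prefix}pc_factor_mat_solver_type {solver}")
    return " ".join(opts)


# Variant recommended in the reply above: LU (MUMPS) on each ASM subdomain.
print(asm_options("lu", solver="mumps"))
# Variant from the question: a second ASM level wrapped around the LU solve.
print(asm_options("lu", solver="mumps", depth=2))
```

The strings are only text; running a small case with -ksp_view, as advised above, remains the way to confirm what PETSc actually uses.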

I see what I mixed up. Thanks for the tip.

Just two more questions: as I said previously, for PCASM, -sub_pc_type ilu seems to outperform -sub_pc_type lu. Do you think it is worth experimenting with the -sub_ksp_type option? Do you have any suggestions for other methods that could be used on the subdomains?

Finding the most efficient solver is a job in itself. There are too many options to just list them here; it takes time to adjust and fine-tune such a method.

Ok, thanks for the tips!