# Slow MUMPS on 3D thermal parallel problem

Hello,

I am writing to you because I am encountering problems using MUMPS for 3D simulations. The calculations themselves run fine, but increasing the number of processors does not speed up the solution of the system. Going from 5 processors to 10 or 20 is actually slower (with a 5-fold increase in computation time when going from 5 to 20 processors).

I have seen on this forum that using PETSc might solve my problem; however, I run my calculations on a supercomputer where I have no control over the FreeFEM installation, so I am stuck with MUMPS only. Below is one of my test codes, which solves a simple thermal problem with a heat source.

```
load "msh3"
load "MUMPS" // distributed MUMPS solver, as in examples/mpi/MUMPS.edp
int master = 0;

/* Creation of the cylinder */

mesh3 Th;
int Dirichlet = 3; int Neumann = 2; int fixe = 1; int interface = 0;

int N = 200; //20
real Longueur = 3.3;

/* NB: the border cc was missing from the posted snippet; a unit circle is
   assumed here (labels 0 to 3 in labelmidvec below all map to Dirichlet
   anyway). */
border cc(t=0, 2*pi){ x = cos(t); y = sin(t); label = 0; }

mesh Th2 = buildmesh(cc(N));
real z0 = 0.;
real z1 = Longueur;
int[int] labelmidvec = [0,Dirichlet, 1,Dirichlet, 2,Dirichlet, 3,Dirichlet];
int[int] labelupvec = [0,fixe, 1,fixe, 2,fixe, 3,fixe];
int[int] labeldownvec = [0,fixe, 1,fixe, 2,fixe, 3,fixe];

Th = buildlayers(Th2, N, zbound=[z0,z1],
    labelmid=labelmidvec, labelup=labelupvec, labeldown=labeldownvec);

// assign each element to one process, as in the MUMPS.edp example
Th = change(Th, fregion = nuTriangle%mpisize);

fespace ScalarField0(Th, P0);
fespace ScalarField1(Th, P1);
fespace ScalarField2(Th, P1);

/* Characteristic function and measure of the observation area */
real area = int3d(Th)(1.);

/* Functional variables */
ScalarField1 T;       // temperature
ScalarField1 Trenorm; // renormalized temperature
ScalarField1 v;       // test function

/* Volume function */
macro volume(hr) (int3d(Th)(hr)) // EOM

// Operators
real sqrt2 = sqrt(2.);

// 3D gradient of a scalar function
macro Grad(u) [dx(u), dy(u), dz(u)] // EOM

/* NB: the conduction (stiffness) term was missing from the posted varf;
   it is restored here with unit conductivity, each process assembling
   only its own region. */
varf heatvarf(Trenorm, v) =
      int3d(Th, mpirank)(Grad(Trenorm)' * Grad(v))
    + int3d(Th, mpirank)(1. * v)
    + on(Dirichlet, Trenorm = 0.);

matrix A = heatvarf(ScalarField1, ScalarField1);

set(A, solver = sparsesolver, master = -1); // master = -1: distributed MUMPS

real[int] bheat = heatvarf(0, ScalarField1);

Trenorm[] = A^-1 * bheat;
```

The command to launch the program is:

```
mpirun -np N FreeFem++-mpi Thermic.edp
```

The code is based on the example MUMPS.edp on GitHub: https://github.com/FreeFem/FreeFem-sources/blob/master/examples/mpi/MUMPS.edp. Are there any improvements or changes I could make to this code to improve the speed-up, or to reduce the time spent building and solving my system?

Thank you !

PS: the number of DoF of my problem during my tests on this program was 4,198,800.

If you look online, it's not hard to find performance-versus-processor-count curves that are non-monotonic, including for MUMPS. This is why I hate parallel processing. However, it may help to look at the nature of your particular problem and see if the domain can be decomposed in a way that minimizes "overhead" and lets each processor do its thing. There may be better, more optimized FreeFEM tactics you can use; I don't know them, but ultimately you may need to dig into it a little. FWIW.

Hi Timothee!

Overall answer is, yes, there is a lot you can do to accelerate your code.

Firstly, you have not actually parallelized your code yet. When you call `mpirun`, you are simultaneously executing `N` copies of the same program. This is why your code gets slower with more processes. You will need to partition the problem and break up the complexity in order to benefit from more processors.
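As a quick sanity check that MPI is actually wired up (and not just launching independent copies of the script), every rank should report a distinct `mpirank` out of the same `mpisize`. A minimal sketch using FreeFEM's built-in MPI variables (the file name `check.edp` is just illustrative):

```
// save as e.g. check.edp and run:
//   mpirun -np 4 FreeFem++-mpi check.edp
// with a working MPI setup you should see ranks 0..3, each printed once;
// with a broken setup every process reports "rank 0 of 1"
cout << "rank " << mpirank << " of " << mpisize << endl;
```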

Secondly, LU factorization is certainly not the best solution approach for this problem. You will achieve much faster performance with iterative methods.

Thirdly, even though you don't have root access, you can build FreeFEM+PETSc in a local user directory on your supercomputer without sudo privileges. I would certainly recommend you take this route. Then you can modify examples from Pierre's tutorial and implement a much faster code leveraging PETSc.
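For reference, here is a minimal sketch of a distributed solve of the same kind of thermal/Poisson problem, loosely adapted from the `diffusion-3d-PETSc.edp` example shipped with FreeFEM. The mesh, boundary label, and PETSc options are illustrative assumptions, not your original cylinder problem:

```
load "PETSc"
load "msh3"
macro dimension()3// EOM
include "macro_ddm.idp"

mesh3 Th = cube(20, 20, 20);  // illustrative mesh, not the cylinder
buildDmesh(Th)                // distribute the mesh among the processes

fespace Vh(Th, P1);
macro Grad(u) [dx(u), dy(u), dz(u)]// EOM
varf vPb(u, v) = int3d(Th)(Grad(u)' * Grad(v)) + int3d(Th)(1. * v)
               + on(1, u = 0.); // Dirichlet condition on one face of the cube

Mat A;
createMat(Th, A, P1)          // distributed PETSc matrix
A = vPb(Vh, Vh);
real[int] b = vPb(0, Vh);

set(A, sparams = "-ksp_type cg -pc_type hypre"); // CG + BoomerAMG instead of LU
Vh u;
u[] = A^-1 * b;
```

The point of the sketch is the structure: `buildDmesh` partitions the mesh, `createMat` builds a distributed operator, and the solver/preconditioner is then a runtime choice via `sparams` rather than a full LU factorization.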

Hope that helps! Best of luck with your parallel endeavors.


> This is why I hate parallel processing

This is a very strange remark… You will certainly get better performance by breaking up the complexity and parallelizing the solution in a problem like his, even with MUMPS.

It's easy to hate on things that you don't understand.

If you want decent performance, you should not use MUMPS for such an easy problem. If you are running things at TGCC, then you can install PETSc and FreeFEM no problem; I could also let you use my installation. Just FYI, on my machine with 8 processes, I solve this problem with BoomerAMG in 24 seconds, and in 40 seconds with MUMPS…


Thank you for all your answers ! I finally found the origin of my problem.

It was due to a different version of mpirun (there is a specific version for my machine), which led to a problem similar to https://community.freefem.org/t/windows-cannot-run-ff-mpirun/104/4 : copies of my program were being executed instead of a true parallel use of the different processors (as mentioned by @cmd).

I now obtain the same execution time as on your machine for the MUMPS solver, @prj. For now I do not require a more sophisticated parallel solution (I am already glad to see my program running correctly), but if I need to explore the field of parallel computation in more depth, I will be happy to have your expertise, @prj.

Thank you again !

I guess it may help to back up a little and see if there are useful diagnostics to turn on, such as the amount of time spent in overhead versus actual calculations. From your original post, it sounded like you may have seen improvements when adding a few cores or processors, suggesting something was going right.

Indeed this is complicated, and I don't think anyone sees it as an ideal solution, because it is easy to get into pathological conditions where adding more processors makes things slower; actual measurements may help find the issues.

It's great if you can abstract things out and make a general framework for distributing a task, but that may not always be possible to do efficiently.

In some cases, if you really do have decoupled, independent tasks that take a long time to run, it may be worth the effort to just run them separately.

fwiw.

There is nothing to abstract; the finite element method is inherently parallel, so you should not have to come up with any sort of "general framework". How would you even deploy a "general framework" on a distributed-memory machine?