Slow MUMPS on 3D thermal parallel problem

Hello,

I am writing because I am running into problems using MUMPS for 3D simulations. The calculations themselves are fine, but increasing the number of processors does not speed up the solution of the system. It actually gets slower when going from 5 processors to 10 or 20 (with a 5-fold increase in computation time between 5 and 20 processors).

I have seen on this forum that using PETSc could perhaps solve my problem; however, I run my calculations on a supercomputer where I have no control over the FreeFEM installation and am stuck with MUMPS only. Below is one of my test codes, which solves a simple thermal problem with a volumetric heat source.

load "msh3"
load "iovtk"
load "medit"
load "MUMPS"
load "mmg"
int master = 0; 

/* Domain creation and meshing */
/* Cylinder construction */

mesh3 Th ;
int Dirichlet=3; int Neumann=2; int fixe=1; int interface=0;


int N=200;//20
real radius = 1. ;
real Longueur = 3.3 ;

border cc(t=0,2*pi){x=radius*cos(t);y=radius*sin(t);label=Dirichlet;}
mesh Th2= buildmesh(cc(N));
real z0=0. ;
real z1 = Longueur;
int[int] labelmidvec = [0,Dirichlet,1,Dirichlet,2,Dirichlet,3,Dirichlet];
int[int] labelupvec = [0,fixe,1,fixe,2,fixe,3,fixe];
int[int] labeldownvec = [0,fixe,1,fixe,2,fixe,3,fixe];

Th = buildlayers(Th2, N, zbound=[z0,z1],
  labelmid=labelmidvec, labelup = labelupvec, labeldown = labeldownvec);

// one region per MPI rank, so that each process assembles only its own elements
Th = change(Th, fregion = nuTriangle%mpisize);

/* Finite element spaces */
fespace ScalarField0(Th,P0);
fespace ScalarField1(Th,P1);
fespace ScalarField2(Th, P1);

/* Characteristic function and measure of the observation area */
real area = int3d(Th)(1.);

/* Unknown and test functions */
ScalarField1 T; // temperature
ScalarField1 Trenorm; // renormalized temperature
ScalarField1 v; // test function


/* Volume function */
macro volume(hr) (int3d(Th)(hr)) // EOM

// Operators
real sqrt2 = sqrt(2.);

// gradient 3D of scalar function
macro grad3D(u)[dx(u), dy(u),dz(u)] // EOM


// Weak form: each rank integrates only over its own region (region number = mpirank)
varf heatvarf(Trenorm, v) =
      int3d(Th, mpirank)((15. * grad3D(Trenorm))' * grad3D(v))
    + int3d(Th, mpirank)(1. * v)
    + on(Dirichlet, Trenorm = 0.);

matrix A = heatvarf(ScalarField1, ScalarField1); // local contribution of this rank

set(A, solver = sparsesolver, master = -1); // MUMPS with distributed input (master = -1)

real[int] bheat = heatvarf(0, ScalarField1); // local right-hand side

Trenorm[] = A^-1 * bheat; // parallel factorization and solve


The command to launch the program is:

mpirun -np N FreeFem++-mpi Thermic.edp

The code is based on the example MUMPS.edp on GitHub: https://github.com/FreeFem/FreeFem-sources/blob/master/examples/mpi/MUMPS.edp. Are there any improvements or changes I could make to this code to improve the speed-up or to reduce the time spent building and solving the system?

Thank you !

PS: the number of DoF of my problem during my tests on this program was 4,198,800.

If you look online, it's not hard to find performance-versus-number-of-processors curves
that are non-monotonic, including for MUMPS. This is why I hate parallel processing :slight_smile: However, it may help to look at the nature of your particular problem
and see whether the domain can be decomposed in a way that minimizes "overhead" and
lets each processor do its own thing. There may be better, more optimized FreeFEM
tactics you can use, I don't know them, but ultimately you may need to dig into
it a little. fwiw.

Hi Timothee!

The overall answer is: yes, there is a lot you can do to accelerate your code.

Firstly, you have not parallelized your code yet. When you are calling mpirun, you are simultaneously executing N copies of the same program. This is why your code is getting slower with more procs. You will need to partition the problem and break up the complexity in order to benefit from more processors.
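
A quick way to check whether the processes are actually cooperating is to print the MPI globals that FreeFem++-mpi provides; a minimal diagnostic sketch:

// each process should print a distinct rank out of the same total size (equal to the -np value);
// if every process prints "rank 0 of 1", mpirun is launching N independent copies of the script
// instead of one MPI job
cout << "rank " << mpirank << " of " << mpisize << endl;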

Secondly, LU factorization is certainly not the best solution approach for this problem. You will achieve much faster performance with iterative methods.

Thirdly, even though you don’t have root access, you can build FreeFEM+PETSc in a local user directory on your super machine without sudo privileges. I would certainly recommend you take this route. Then you can modify examples from Pierre’s tutorial and implement a much faster code leveraging PETSc.
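
As a rough illustration, here is a minimal sketch of the same thermal problem written with the PETSc interface, in the style of the tutorial examples. This assumes a FreeFEM build with PETSc (so that macro_ddm.idp, buildDmesh and createMat are available) and a PETSc configured with hypre for the BoomerAMG preconditioner; if hypre is not available, something like -pc_type gamg could be used instead:

load "PETSc"
load "msh3"
macro dimension()3// EOM
include "macro_ddm.idp"

// same cylinder as in the original script, rebuilt here so the sketch is self-contained
int Dirichlet = 3, fixe = 1;
int N = 200;
real radius = 1., Longueur = 3.3;
border cc(t=0, 2*pi){x=radius*cos(t); y=radius*sin(t); label=Dirichlet;}
mesh Th2 = buildmesh(cc(N));
int[int] labelmidvec = [0,Dirichlet, 1,Dirichlet, 2,Dirichlet, 3,Dirichlet];
int[int] labelupvec = [0,fixe, 1,fixe, 2,fixe, 3,fixe];
int[int] labeldownvec = [0,fixe, 1,fixe, 2,fixe, 3,fixe];
mesh3 Th = buildlayers(Th2, N, zbound=[0., Longueur],
  labelmid=labelmidvec, labelup=labelupvec, labeldown=labeldownvec);

buildDmesh(Th)            // distribute the mesh: each rank keeps only its own subdomain

fespace Vh(Th, P1);
macro grad3D(u)[dx(u), dy(u), dz(u)] // EOM

varf heatvarf(Trenorm, v) =
      int3d(Th)((15. * grad3D(Trenorm))' * grad3D(v))
    + int3d(Th)(1. * v)
    + on(Dirichlet, Trenorm = 0.);

Mat A;
createMat(Th, A, P1)      // distributed PETSc matrix consistent with the mesh partition
A = heatvarf(Vh, Vh);
real[int] bheat = heatvarf(0, Vh);

set(A, sparams = "-ksp_type cg -pc_type hypre");   // CG + BoomerAMG instead of a direct LU
Vh Trenorm;
Trenorm[] = A^-1 * bheat;

With this layout each rank keeps only its own subdomain, so both the assembly and the Krylov solve are split across the processes instead of duplicating the global problem.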

Hope that helps! Best of luck with your parallel endeavors.


This is why I hate parallel processing :slight_smile:

This is a very strange remark… You will certainly get better performance by breaking up the complexity and parallelizing the solution in a problem like his, even with MUMPS.

It’s easy to hate on things that you don’t understand.

If you want decent performance, you should not use MUMPS for such an easy problem. If you are running things at TGCC, then you can install PETSc and FreeFEM without any problem; I could also let you use my installation. Just FYI, on my machine with 8 processes, I solve this problem with BoomerAMG in 24 seconds, and in 40 seconds with MUMPS…
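
For reference, the two configurations being compared would look roughly like this with the PETSc interface (a hedged sketch: A is assumed to be a distributed PETSc Mat, and the MUMPS option name assumes a reasonably recent PETSc):

// algebraic multigrid (BoomerAMG from hypre) as a preconditioner for CG
set(A, sparams = "-ksp_type cg -pc_type hypre");
// versus an exact LU factorization delegated to MUMPS
set(A, sparams = "-ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps");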


Thank you for all your answers! I finally found the origin of my problem.

It was due to using a different version of mpirun (there is a specific version for my machine), which leads to a problem similar to https://community.freefem.org/t/windows-cannot-run-ff-mpirun/104/4: copies of my program were being executed instead of a true parallel run across the different processors (as mentioned by @cmd).

I now obtain the same execution time as your machine for the MUMPS solver, @prj. For now I do not need a more sophisticated parallel solution strategy (I am already glad to see my program running correctly), but if I ever need to explore parallel computing in more depth, I will be happy to draw on your expertise, @prj.

Thank you again !

I guess it may help to back up a little and see if there are useful diagnostics
to turn on, such as the amount of time spent in overhead versus actual calculations.
From your original post, it sounded like you may have seen improvements when
adding a few cores or processors, suggesting something was going right.
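
For example, a minimal timing sketch, reusing the names from the original script and FreeFem++'s mpiWtime(), that separates assembly from the actual solve:

// time the assembly of the local matrix and right-hand side
real tAssembly = mpiWtime();
matrix A = heatvarf(ScalarField1, ScalarField1);
real[int] bheat = heatvarf(0, ScalarField1);
tAssembly = mpiWtime() - tAssembly;

// time the (MUMPS) factorization and solve
real tSolve = mpiWtime();
set(A, solver = sparsesolver, master = -1);
Trenorm[] = A^-1 * bheat;
tSolve = mpiWtime() - tSolve;

if (mpirank == 0)
    cout << "assembly: " << tAssembly << " s, solve: " << tSolve << " s" << endl;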

Indeed this is complicated, and I don't think anyone sees it as an ideal
solution, because it is easy to hit pathological conditions where
adding more processors makes things slower; actual measurements
may help find the issues.

It's great if you can abstract things out and make a general framework for
distributing a task, but that may not always be possible to do efficiently.

In some cases, if you really do have decoupled, independent tasks that take a long time to
run, it may be worth the effort to just run them separately :slight_smile:

fwiw.

There is nothing to abstract: the finite element method is inherently parallel, so you should not have to come up with any sort of "general framework". How would you even deploy a "general framework" on a distributed-memory machine?