I have a multiphysics problem which results in a 9x9 block matrix system. Each of these blocks is itself composed of several small sub-blocks which correspond to different meshes and physics. Currently, I am using PETSc to solve this system based on a block preconditioner using the fieldsplit framework. In this approach, each of the 9 blocks along the diagonal is factorized one after the other using MUMPS. Note that each block is quite reasonably sized for direct methods and the overall system has ~5 million DoF.

I have confirmed that this approach works, but I would like to accelerate the preconditioner factorization step. It seems to me that a significant improvement could be attained by attacking the factorizations for the blocks along the diagonal in parallel. However, I am not sure how to tell PETSc to do this.

The relevant commands I am currently using are below. Is there a command I can add to my ``ssparams’’ variable to make PETSc push the factorization steps to different (groups of) processors? Thanks in advance!

real[int] ffields(N*NDOF); // create matrix numbering for fieldsplit
string[int] nnames(N); // N is number of blocks, NDOF is number of DOF per block
for (int n = 0; n < N; n++){
ffields(n*NDOF:(n+1)*NDOF - 1) = real(n+1);
nnames[n] = "block" + n;
}
string ssparams = "-ksp_type gmres -pc_type fieldsplit -pc_fieldsplit_type multiplicative ";
for (int n = 0; n <= N; n++){
ssparams = ssparams + "-fieldsplit_" + nnames[n] + "_pc_type lu ";
ssparams = ssparams + "-fieldsplit_" + nnames[n] + "_ksp_type preonly ";
ssparams = ssparams + "-fieldsplit_" + nnames[n] + "_pc_factor_mat_solver_type mumps ";
}
matrix A;
funcA(A); // build A matrix (consisting of all 9x9 blocks)
Mat dA = A; // convert to PETSc Mat
set(dA, sparams = ssparams, fields = ffields, names = nnames);
real[int] b(N*NDOF);
funcb(b); // build RHS vector
real[int] x = dA^-1*b // solve Ax=b

Hi Chris,
Good to see you are making steady progress with FreeFEM + PETSc
I’ll give a very general answer, because the link between the title of your post and the content of the post is not very clear to me.

FieldSplit is definitely the way to go, this is exactly what it was designed for in PETSc ;

instead of doing Mat dA = A; it would probably better to use a MatNest, it is again exactly what it was designed for: Mat dA = [[A11, A12, A13...], [A21, A22, A23...]...]; then the FieldSplit will automatically follow the distribution of your blocks, see for example this Stokes example with a 2x2 MatNest;

PETSc does not do shared-memory parallelism, you can try to use a multithread BLAS with MUMPS, or use MKL PARDISO instead of MUMPS (_pc_factor_mat_solver_type mkl_cpardiso);

if my math is correct, your diagonal blocks are of size 5M/9~555k. This is the perfect problem size for distributed-memory direct solvers, MUMPS should scale very well with a moderate number of processes, say 8 to 16;

you are using a multiplicative FieldSplit, which means that when you apply the preconditioner, you have to wait before applying the inverse of the next diagonal block that your residual is updated. That means that if you do spread the factorization of the diagonal blocks among processes, when applying, they will alternatively idle and then work, so it’s kind of wasteful. Does -pc_fieldsplit_type additive work on your problem?

So, to sum up: kudos for using FieldSplit, may want to switch to MatNest, and definitely need to aim for distributed-memory parallelism (via the buildDmesh(), createMat() macros and stuff).
Let me know what you think.

Thanks for the quick and helpful response as usual @prj!

I apologize for the misleading title. What I thought I was going to post turned out to be a little bit different from what I ended up writing, and I forgot to go back to edit the title. Also, I think the terminology I’ve been using in my own head hasn’t been quite right. Oops!

Anyways, in regards to your comments:

I like your suggestion to use MatNest. I didn’t quite understand how that was used before, but I think it makes sense to me now. I’ll give that a shot.

Your math is correct. All but one block are the same size, and I have been using MUMPS for factoring the blocks with 8 processes to good effect. I have more processors available, but I have found that it does not scale well beyond 8-16 as you suggested. My original hope was that it might be possible to somehow tell PETSc to attack all of the blocks simultaneously in parallel with a group of 2-4 processors each in hopes of even better scaling.

Good point on the block-G-S preconditioner. I have tested this, and block-Jacobi performs almost as well, only requiring a few more Arnoldi iterations. When/if I manage to parallelize this effectively, I’m sure that will be a small cost compared to the savings.

My concern with the distributed memory macros you suggested is that each of my main blocks is itself a block matrix composed of multiple variational forms on different meshes. For that reason, I am not sure I could use those macros easily. Am I mistaken? My assumption (and I don’t know if this is correct) is that those macros are probably just feeding PETSc some numbering which tells it which columns/rows belong to which processors. In my case, I think it is quite natural to split the problem up via the main blocks rather than to partition the meshes, so I had hoped to create my own numbering based on those blocks and feed that to PETSc. This idea is what inspired the original (confusing) title for my post.

And by the way, thanks for the kudos! I really appreciate all you do to help this community. Your examples and comments elsewhere on the forum have also been very instructive!

MatNest is just like a traditional MatAIJ (in PETSc lingo), it’s merely a different numbering. MatAIJ => numbering by processes, MatNest => numbering by blocks, e.g., physics. The strength of FreeFEM + MatNest is that the notation is rather simple, much like you would do block matrices in MATLAB

Oh, if you can use an additive FieldSplit, then that’s a completely different story. What I would suggest is write a func PetscScalar[int] prec(PetscScalar[int]& in) { return \sum_{i=1..9} A_ii^-1 * in_{field_i}; } that you use as a PCSHELL. In this functor, you can just apply all inverses simultaneously. The catch here is to use different communicators for each diagonal block. There is an optional variable communicator when building a Mat, so all you have to do is split you mpiCommWorld, and then assign each block to different groups of processes. To pass this to PETSc/SLEPc, do set(A, sparams = "-pc_type shell", precon = prec); Also, since you’ll be using different communicators, the factorizations will also be performed concurrently, so that’s what you wanted to do in the beginning I guess?

You may be mistaken. Not very sure. If you can share your code in private, I’d be able to give a more thorough answer

These examples use PCSHELL [1,2]. There are examples somewhere in the examples/mpi folder that do subcommunicator creation using MPI_Comm_split, see essai.edp.