# Supply PETSc numbering for shared-memory block matrix

Hello FreeFEM developers,

I have a multiphysics problem which results in a 9x9 block matrix system. Each of these blocks is itself composed of several small sub-blocks which correspond to different meshes and physics. Currently, I am using PETSc to solve this system based on a block preconditioner using the fieldsplit framework. In this approach, each of the 9 blocks along the diagonal is factorized one after the other using MUMPS. Note that each block is quite reasonably sized for direct methods and the overall system has ~5 million DoF.

I have confirmed that this approach works, but I would like to accelerate the preconditioner factorization step. It seems to me that a significant improvement could be attained by attacking the factorizations for the blocks along the diagonal in parallel. However, I am not sure how to tell PETSc to do this.

The relevant commands I am currently using are below. Is there a command I can add to my ssparams’’ variable to make PETSc push the factorization steps to different (groups of) processors? Thanks in advance!

real[int] ffields(N*NDOF); // create matrix numbering for fieldsplit
string[int] nnames(N);  // N is number of blocks, NDOF is number of DOF per block
for (int n = 0; n < N; n++){
ffields(n*NDOF:(n+1)*NDOF - 1) = real(n+1);
nnames[n] = "block" + n;
}
string ssparams = "-ksp_type gmres -pc_type fieldsplit -pc_fieldsplit_type multiplicative ";
for (int n = 0; n <= N; n++){
ssparams = ssparams + "-fieldsplit_" + nnames[n] + "_pc_type lu ";
ssparams = ssparams + "-fieldsplit_" + nnames[n] + "_ksp_type preonly ";
ssparams = ssparams + "-fieldsplit_" + nnames[n] + "_pc_factor_mat_solver_type mumps ";
}
matrix A;
funcA(A); // build A matrix (consisting of all 9x9 blocks)
Mat dA = A; // convert to PETSc Mat
set(dA, sparams = ssparams, fields = ffields, names = nnames);
real[int] b(N*NDOF);
funcb(b); // build RHS vector
real[int] x = dA^-1*b // solve Ax=b

Hi Chris,
Good to see you are making steady progress with FreeFEM + PETSc
I’ll give a very general answer, because the link between the title of your post and the content of the post is not very clear to me.

1. FieldSplit is definitely the way to go, this is exactly what it was designed for in PETSc ;
2. instead of doing Mat dA = A; it would probably better to use a MatNest, it is again exactly what it was designed for: Mat dA = [[A11, A12, A13...], [A21, A22, A23...]...]; then the FieldSplit will automatically follow the distribution of your blocks, see for example this Stokes example with a 2x2 MatNest;
3. PETSc does not do shared-memory parallelism, you can try to use a multithread BLAS with MUMPS, or use MKL PARDISO instead of MUMPS (_pc_factor_mat_solver_type mkl_cpardiso);
4. if my math is correct, your diagonal blocks are of size 5M/9~555k. This is the perfect problem size for distributed-memory direct solvers, MUMPS should scale very well with a moderate number of processes, say 8 to 16;
5. you are using a multiplicative FieldSplit, which means that when you apply the preconditioner, you have to wait before applying the inverse of the next diagonal block that your residual is updated. That means that if you do spread the factorization of the diagonal blocks among processes, when applying, they will alternatively idle and then work, so it’s kind of wasteful. Does -pc_fieldsplit_type additive work on your problem?

So, to sum up: kudos for using FieldSplit, may want to switch to MatNest, and definitely need to aim for distributed-memory parallelism (via the buildDmesh(), createMat() macros and stuff).
Let me know what you think.

1 Like

Thanks for the quick and helpful response as usual @prj!

I apologize for the misleading title. What I thought I was going to post turned out to be a little bit different from what I ended up writing, and I forgot to go back to edit the title. Also, I think the terminology I’ve been using in my own head hasn’t been quite right. Oops!

• I like your suggestion to use MatNest. I didn’t quite understand how that was used before, but I think it makes sense to me now. I’ll give that a shot.
• Your math is correct. All but one block are the same size, and I have been using MUMPS for factoring the blocks with 8 processes to good effect. I have more processors available, but I have found that it does not scale well beyond 8-16 as you suggested. My original hope was that it might be possible to somehow tell PETSc to attack all of the blocks simultaneously in parallel with a group of 2-4 processors each in hopes of even better scaling.
• Good point on the block-G-S preconditioner. I have tested this, and block-Jacobi performs almost as well, only requiring a few more Arnoldi iterations. When/if I manage to parallelize this effectively, I’m sure that will be a small cost compared to the savings.
• My concern with the distributed memory macros you suggested is that each of my main blocks is itself a block matrix composed of multiple variational forms on different meshes. For that reason, I am not sure I could use those macros easily. Am I mistaken? My assumption (and I don’t know if this is correct) is that those macros are probably just feeding PETSc some numbering which tells it which columns/rows belong to which processors. In my case, I think it is quite natural to split the problem up via the main blocks rather than to partition the meshes, so I had hoped to create my own numbering based on those blocks and feed that to PETSc. This idea is what inspired the original (confusing) title for my post.

And by the way, thanks for the kudos! I really appreciate all you do to help this community. Your examples and comments elsewhere on the forum have also been very instructive!

1. MatNest is just like a traditional MatAIJ (in PETSc lingo), it’s merely a different numbering. MatAIJ => numbering by processes, MatNest => numbering by blocks, e.g., physics. The strength of FreeFEM + MatNest is that the notation is rather simple, much like you would do block matrices in MATLAB
2. Oh, if you can use an additive FieldSplit, then that’s a completely different story. What I would suggest is write a func PetscScalar[int] prec(PetscScalar[int]& in) { return \sum_{i=1..9} A_ii^-1 * in_{field_i}; } that you use as a PCSHELL. In this functor, you can just apply all inverses simultaneously. The catch here is to use different communicators for each diagonal block. There is an optional variable communicator when building a Mat, so all you have to do is split you mpiCommWorld, and then assign each block to different groups of processes. To pass this to PETSc/SLEPc, do set(A, sparams = "-pc_type shell", precon = prec); Also, since you’ll be using different communicators, the factorizations will also be performed concurrently, so that’s what you wanted to do in the beginning I guess?
3. You may be mistaken. Not very sure. If you can share your code in private, I’d be able to give a more thorough answer

Thanks!

1 Like
1. Cool. I’ll definitely go the MatNest route then.
2. Are you aware of any complete examples related to your point 2.? That sounds like exactly what I had in mind.
3. Let me see if point 2 fixes my issue. If it doesn’t, I’d be glad to follow up via email.

Cheers!

These examples use PCSHELL [1,2]. There are examples somewhere in the examples/mpi folder that do subcommunicator creation using MPI_Comm_split, see essai.edp.

1 Like

Thank you! I’ll take a look.

Dear @prj ,

I am experimenting also with PCFIELDSPLIT in PETSc. In my first test case, the system is completely block-diagonal, therefore only a single application of the block-Jacobi preconditioner should solve the system. This works if I use PCSHELL; however, not when I use PCFIELDSPLIT. I also created an MWE that demonstrates this behavior. Could you please take a look on the code, what am I doing incorrectly?

PETSc is a C library, so integer indexing starts at 0. You define options for -fieldsplit_1_ and -fieldsplit_2_, which does not make sense. It should be -fieldsplit_0_ and -fieldsplit_1_. When debugging, always remember to use -ksp_view and -options_left. You would have gotten:

Option left: name:-fieldsplit_2_ksp_reuse_preconditioner value: false
Option left: name:-fieldsplit_2_ksp_type value: preonly
Option left: name:-fieldsplit_2_pc_type value: lu


which would have raised a red flag for you, I believe.
Furthermore, if you want to define the same KSP/PC for all blocks, you can simply set -fieldsplit_pc_type [...] (without repeating the string twice for both 0_ and 1_).

1 Like

Thanks you so much for the detailed reply!

The reason I am trying to use PCFIELDSPLIT is that I want to compute the preconditioner only once, and reuse the preconditiner to accelerate the solution. Unfortunately, I have another question: it seems that changing the -ksp_reuse_preconditioner true is detected, but the factorization is computed anyway, at least according to the -log_view option. The attached script demonstrates this behaviour. What am I missing?

load "PETSc"                        // PETSc plugin
macro dimension()2// EOM            // 2D or 3D
include "macro_ddm.idp"             // additional DDM functions

func Pk = P1;                       // finite element space

mesh Th = square(getARGV("-global", 40), getARGV("-global", 40)); // global mesh
Mat A;
createMat(Th, A, Pk)

fespace Wh(Th, Pk);                 // local finite element space
varf vPb(u, v) = int2d(Th)(grad(u)' * grad(v)) + int2d(Th)(v) + on(1, u = 0.0);
real[int] rhs = vPb(0, Wh);

A = vPb(Wh, Wh);
set(A, sparams = "-ksp_type preonly -pc_type lu ");

Mat Sblock = [[A, 0], [0, A]];
Mat S1, S2;

real[int] rhsPETScTotal, usolPETScTotal;
changeNumbering([A,A],[rhs,rhs],rhsPETScTotal);
Wh u1=1;
Wh u2=2;
real[int] fieldIndices;
changeNumbering([A,A],[u1[], u2[]],fieldIndices);
string PETScPar = " -ksp_type preonly -ksp_monitor_true_residual -ksp_initial_guess_nonzero false  -pc_type fieldsplit -pc_fieldsplit_type additive ";
PETScPar = PETScPar + "-fieldsplit_0_ksp_type preonly -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_reuse_preconditioner false  -fieldsplit_1_ksp_type preonly -fieldsplit_1_pc_type lu -fieldsplit_1_ksp_reuse_preconditioner false -options_left ";
/* solving with MatAIJ */
MatConvert(Sblock,S1);
set(S1, sparams=PETScPar, fields=fieldIndices);
KSPSolve(S1, rhsPETScTotal, usolPETScTotal);

/* Solving with MatNest */
set(Sblock, sparams=PETScPar, fields=fieldIndices);
KSPSolve(Sblock, rhsPETScTotal, usolPETScTotal);

/* ReUse LU with updated MatAIJ: factorization is recomputed */
PETScPar = " -ksp_type preonly -ksp_monitor_true_residual -ksp_initial_guess_nonzero false  -pc_type fieldsplit -pc_fieldsplit_type additive ";
PETScPar = PETScPar + "-fieldsplit_0_ksp_type preonly -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_reuse_preconditioner true  -fieldsplit_1_ksp_type preonly -fieldsplit_1_pc_type lu -fieldsplit_1_ksp_reuse_preconditioner true -ksp_view  -options_left";
MatConvert(Sblock,S1);
set(S1, sparams=PETScPar, fields=fieldIndices);
KSPSolve(S1, rhsPETScTotal, usolPETScTotal);

/* ReUse LU with updated MatNest: factorization is recomputed */
Sblock = [[A, 0], [0, A]];
set(Sblock, sparams=PETScPar, fields=fieldIndices);
KSPSolve(Sblock, rhsPETScTotal, usolPETScTotal);


First, since you use PCFIELDSPLIT, I don’t understand why you are using all those MatConvert. Second, When you do Sblock = [[A, 0], [0, A]], it deletes the previous Sblock and recreate a new one (there is no mechanism to detect that you are using the same structure, and users are free to do, e.g., Sblock = [[A], [A], [A]]. What you need instead is to update A using either ChangeOperator or the operator =. Third, when you MatConvert(A, B), nothing can be shared between the KSP used for A and B.

1. I only used MatConvert so that I can use PCLU for debugging, but I see now it is not needed for PCFIELDSPLIT.
2. I had a feeling that the complete object was overwritten, thanks for the clarification. Here my other mistake was that I tried to set(Sblock , ..., field=....) twice.
3. According to point 1, I will stick to MatNest.

However, I must admit I am still confused. In the following script, I think I am doing things appropriately; however, -log_view tells the following:

MatLUFactorSym 2 1.0 3.4547e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 6 0 0 0 4 6 0 0 0 5 0
MatLUFactorNum 4 1.0 9.3005e-03 1.0 2.06e+05 1.0 0.0e+00 0.0e+00 0.0e+00 16 3 0 0 0 16 3 0 0 0 44

If I understand things correctly, MatLUFactorSymbolic is not called once more, i.e., the information regarding the sparsity pattern of the matrix is reused; however, the actual factorization is recomputed with MatLUFactorNumeric. The thing is, I want to reuse the complete factorization, which seems to work for a single matrix, not a block-matrix.

load "PETSc"                        // PETSc plugin
macro dimension()2// EOM            // 2D or 3D
include "macro_ddm.idp"             // additional DDM functions

func Pk = P1;                       // finite element space

mesh Th = square(getARGV("-global", 40), getARGV("-global", 40)); // global mesh
Mat A;
createMat(Th, A, Pk)

fespace Wh(Th, Pk);                 // local finite element space
varf vPb(u, v) = int2d(Th)(grad(u)' * grad(v)) + int2d(Th)(v) + on(1, u = 0.0);
varf vPb2(u, v) = int2d(Th)(grad(u)' * grad(v) + 1e-7*dx(u)*dx(v)) + on(1, u = 0.0);
real[int] rhs = vPb(0, Wh);

A = vPb(Wh, Wh);
set(A, sparams = "-ksp_type preonly -pc_type lu ");

Mat Sblock = [[A, 0], [0, A]];

real[int] rhsPETScTotal, usolPETScTotal;
changeNumbering([A,A],[rhs,rhs],rhsPETScTotal);
Wh u1=1;
Wh u2=2;
real[int] fieldIndices;
changeNumbering([A,A],[u1[], u2[]],fieldIndices);
string PETScPar = " -ksp_type preonly -ksp_monitor_true_residual -ksp_initial_guess_nonzero false  -pc_type fieldsplit -pc_fieldsplit_type additive ";
PETScPar = PETScPar + "-fieldsplit_0_ksp_type preonly -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_reuse_preconditioner false  -fieldsplit_1_ksp_type preonly -fieldsplit_1_pc_type lu -fieldsplit_1_ksp_reuse_preconditioner false -options_left ";
set(Sblock, sparams=PETScPar, fields=fieldIndices);
KSPSolve(Sblock, rhsPETScTotal, usolPETScTotal);

PETScPar = " -ksp_type preonly -ksp_monitor_true_residual -ksp_initial_guess_nonzero false  -pc_type fieldsplit -pc_fieldsplit_type additive ";
PETScPar = PETScPar + "-fieldsplit_0_ksp_type preonly -fieldsplit_0_pc_type lu -fieldsplit_0_ksp_reuse_preconditioner true  -fieldsplit_1_ksp_type preonly -fieldsplit_1_pc_type lu -fieldsplit_1_ksp_reuse_preconditioner true -options_left";
A = vPb2(Wh, Wh);
Mat Sblock2 = [[A, 0], [0, A]];
ChangeOperator(Sblock2,Sblock);
set(Sblock, sparams=PETScPar);
KSPSolve(Sblock, rhsPETScTotal, usolPETScTotal);



Edit: I just realized I can do -ksp_reuse_preconditioner true for the outerKSP which completely reuses the factorization . However, I find it strange that I cannot completely change the reuse of each block in the system.

Mat Sblock2 = [[A, 0], [0, A]];
ChangeOperator(Sblock2,Sblock);


This is not correct. You should just be able to do:

matrix A2 = vPb2(Wh, Wh);
ChangeOperator(A2,A);


And it will automatically update Sblock. If this does not work, let me know.

Thanks for the help, I start to understand how things supposed to work. Unfortunately, ChangeOperator returns the following PETSc error:

[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Object is in wrong state
[0]PETSC ERROR: Object C has refct 5 > 1, would leave hanging reference
[0]PETSC ERROR: See Frequently Asked Questions (FAQ) — PETSc 3.17.3 documentation for trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.17.3-735-gfc4dabdb27 GIT Date: 2022-07-11 19:58:07 +0000
[0]PETSC ERROR: /root/…/home/aszabo/FFinstall/FreeFem-sources/src/mpi/FreeFem+±mpi on a arch-FreeFem named aszabo-VirtualBox by root Wed Jul 13 11:09:31 2022
[0]PETSC ERROR: #1 MatHeaderReplace() at /home/aszabo/FFinstall/petsc/src/mat/utils/gcreate.c:432

I am also attach an MWE. I experience this error message using both real and complex variables.
ChangeOperator-MWE.edp (1.6 KB)

This is not the code I wrote above (but there was a mistake as well there, my bad). The proper code should have had A2 and A switched around, i.e.:

matrix A2 = vPb2(Wh, Wh);
ChangeOperator(A,A2);


Note that contrary to your code, you must assemble a matrix, not a Mat!

Thanks for the clarification, now the code indeed works correctly!

I am sorry, I have another question (then, I am hoping not to bother you again for a while ). What I would like to do is change a block of the operator a third time, and recompute the block-LU factorization for that block. According to  -options_left, it seems that the change in option is detected, the options at the first set(Sblock, sparams="..."); persist. I also tried to change the options of the independently (set(A11,...);. in Sblock = [[A00, 0], [0, A11]]). Could you please take a look, what am I missing?
ChangeOperator-MWE2.edp (2.3 KB)

I’ve reproduced this in a PETSc-only MWE. I’ve asked other PETSc developers for help because I believe something is off. You can follow the discussion here: [petsc-dev] PCFIELDSPLIT with inner -ksp_reuse_preconditioner.

Thanks you so much for the investigation. I will keep an eye on the discussion.