How to solve multiple problems in parallel (not domain decomposition)

I need to solve multiple instances of the Helmholtz equation with a fixed matrix but different right-hand sides, so I found this example. I have some experience with FEM in FreeFem and MATLAB, but zero experience with parallel computing. I have two questions regarding this example.

  1. Is this code ready for parallel computing? Running the code with time ff-mpirun -np 4 takes even longer than time ff-mpirun -np 1 (almost 4 times longer). Using clock() I see the system is solved 4 times, and each solve takes the same amount of time as with -np 1. Should I run the code with other commands?

  2. I suppose the code uses some domain decomposition and a parallel solver to speed up the computation. But in my case the bottleneck is the evaluation of the right-hand-side function and the assembly of the FEM vector (lines 31–35 in the example code). How do we speed up that part using multiple cores? I went through this post and the two YouTube videos but still have no clue (it seems the videos are about domain decomposition and parallel solvers).

I also searched other posts but didn’t find complete code. I hope somebody can provide a minimal but complete example for speeding up problems with multiple right-hand sides (or multiple matrices).

  1. What solver are you using?
  2. Domain decomposition is the most efficient way to speed up matrix and right-hand side assemblies.

Let’s look at this code snippet in the example.

int nRhs = 1000; // in my case I have a lot more iterations
complex[int, int] rhs(Vh.ndof, nRhs), sol(Vh.ndof, nRhs);
for(int i = 0; i < nRhs; ++i) {
    func f = 100 * exp(-20 * ((x - i/3.0)^2 + (y - i/3.0)^2)); // my function takes long to evaluate
    varf vRhs(u, v) = int2d(Th)(f * v);
    rhs(:, i) = vRhs(0, Vh); // assemble the i-th right-hand side
}

When I run ff-mpirun -np 8 example14.edp, will the program distribute the for loop across 8 cores or decompose the mesh Th into 8 parts?

It will decompose Th into 8 parts.

Thanks a lot!

Still, the example runs slower with -np 8 than with -np 1. I have 8 cores on my computer.

This problem is tiny and for educational purposes only. Increase the mesh size, do not use clock() but instead run the program with the command-line option -log_view, and at some point you’ll see that it gets faster.

In my case, each problem is tiny (a 40x40 2D mesh), but it has to be solved with millions of different right-hand sides. In this case I suppose domain decomposition will not help much, but parallelizing the for loop is necessary? What is the API to do so in FreeFem?

See, e.g., Solving a lot of independent ODEs in parallel.

I have seen your answer before.

int ntask = 200;
for(int i = (mpirank * ntask) / mpisize; i < ((mpirank + 1) * ntask) / mpisize; ++i) {
  cout << "process #" << mpirank << " dealing with task #" << i << endl;
}
mpiAllreduce(...); // for synchronizing/combining the per-process results

Sorry, I am really new to this MPI stuff. What should we put inside mpiAllreduce? I looked over the FreeFem documentation but can’t find the relevant parts (most of it is about domain decomposition and parallel solvers). Should I just read the MPI documentation itself?

It will depend on what you want to do with each solution of the “millions” of linear systems.

I need to output to a file all solution values interpolated at some discrete points on a larger circle enclosing the support of the source.

OK, then I guess you need to save to disk all your solutions? I feel obligated to warn you that what you want to do seems extremely inefficient.
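If each solve only produces a handful of interpolated values, one pattern is to let each process fill its own slice of a shared result array, combine the slices with mpiAllReduce, and have a single rank write to disk. A rough sketch, with ntask, nOut and the per-task solve all placeholders of mine, not from your code:

```freefem
// sketch: split tasks over ranks, reduce, write from rank 0 only
int ntask = 1000, nOut = 64;
real[int] localVals(ntask * nOut), vals(ntask * nOut);
localVals = 0;
for(int i = (mpirank * ntask) / mpisize; i < ((mpirank + 1) * ntask) / mpisize; ++i) {
    // ... solve problem #i and interpolate the solution on the circle ...
    for(int n = 0; n < nOut; ++n)
        localVals[i * nOut + n] = 0.0; // placeholder for the interpolated value
}
// entries owned by other ranks are zero, so a sum acts as a gather
mpiAllReduce(localVals, vals, mpiCommWorld, mpiSUM);
if(mpirank == 0) {
    ofstream urfile("sol.txt");
    for(int k = 0; k < ntask * nOut; ++k)
        urfile << vals[k] << endl;
}
```

The mpiSUM reduction is what you would put where the earlier snippet had mpiAllreduce(...); it doubles as the synchronization point.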

Now I can get my code to run, but the output data has more points than expected. Here is the last part of my code:

// solve the system
complex[int, int] B(A.n, rhs.m), X(A.n, rhs.m);
ChangeNumbering(A, rhs, B);
set(A, sparams = "-ksp_type preonly -pc_type lu");
KSPSolve(A, B, X);
ChangeNumbering(A, sol, X, inverse = true, exchange = true);

// write to disk the solution values on a circle
Vh ur;
real xt, yt;
for(int i = 0; i < rhs.m; ++i) {
  ur[] = sol(:, i).re;
  for(int n = 0; n < nOut; ++n) {
    xt = rOut * cos(2*pi*n/nOut);
    yt = rOut * sin(2*pi*n/nOut);
    urfile << ur(xt, yt) << endl;
  }
}

My rhs.m = 1000 and nOut = 64, so the total number of values in urfile should be 64000. However, I get 64015 values when I use multiple processes with ff-mpirun!
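My guess (not verified) is that every MPI process executes the output loop, so several ranks write into urfile at once and the lines interleave. Restricting the writing to a single rank would look like:

```freefem
// guess: only one rank writes, so the streams cannot interleave
if(mpirank == 0) {
    for(int i = 0; i < rhs.m; ++i) {
        ur[] = sol(:, i).re;
        for(int n = 0; n < nOut; ++n)
            urfile << ur(rOut * cos(2*pi*n/nOut), rOut * sin(2*pi*n/nOut)) << endl;
    }
}
```

Though with domain decomposition each rank may only hold its local part of sol, so this guard alone may not reproduce the full circle; it is just where I would start debugging.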

I feel like domain decomposition will not do much good for my small problems, but I do want MPI for the loops in my code (which I now know how to do). Is there a way to disable domain decomposition but still use MPI in my code when I run the command ff-mpirun?

Yes, do not do domain decomposition…

Before using PETSc, I tried the simple way: sol = A^-1 * rhs, where sol and rhs are 2D arrays. Then I extracted each solution as sol(:, i). However, the solutions obtained this way are completely wrong unless rhs has only one column. For example, one solution looks like this:

A natural idea for small problems is to store the LU decomposition and reuse it, but I don’t know how that is done in FreeFem.
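One guess (unverified): in sequential FreeFem, a matrix built with solver = sparsesolver seems to be factorized at the first A^-1 solve, and the factors reused afterwards. Something like this, where vA is a placeholder varf for the Helmholtz form:

```freefem
// guess: with a direct solver, A is factorized at the first A^-1 solve
// and the stored factors are reused for every later right-hand side
matrix<complex> A = vA(Vh, Vh, solver = sparsesolver);
Vh<complex> u;
for(int i = 0; i < nRhs; ++i) {
    complex[int] b = vRhs(0, Vh); // assemble the i-th right-hand side
    u[] = A^-1 * b;               // reuses the stored factorization
}
```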

Use PETSc but no domain decomposition.

This is exactly where I don’t understand the code of the original example. I can’t tell which lines of code do the domain decomposition explicitly, so I don’t know how to avoid the decomposition other than using -np 1 on the command line. But then everything is done sequentially.

Functions from the file macro_ddm.idp are used to do the domain decomposition, so remove those.

Mat is not a function but a type; it comes from load "PETSc", not include "macro_ddm.idp". Here is what you want to do: FreeFem-sources/diffusion-cartesian-2d-PETSc.edp at master · FreeFem/FreeFem-sources · GitHub.
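An untested sketch along the lines of that example, with a Poisson stand-in for the Helmholtz form: the Mat wraps the whole sequential matrix and nothing from macro_ddm.idp is used.

```freefem
load "PETSc"

mesh Th = square(40, 40);
fespace Vh(Th, P1);
varf vA(u, v) = int2d(Th)(dx(u)*dx(v) + dy(u)*dy(v)) + on(1, 2, 3, 4, u = 0);
varf vRhs(u, v) = int2d(Th)(v) + on(1, 2, 3, 4, u = 0);

matrix Loc = vA(Vh, Vh);
Mat A(Loc);                                        // global matrix, no decomposition
set(A, sparams = "-ksp_type preonly -pc_type lu"); // factorize once

real[int] b = vRhs(0, Vh);
Vh u;
u[] = A^-1 * b; // further solves with new right-hand sides reuse the LU factors
```

Combined with the mpirank/mpisize loop splitting shown earlier, each process can then solve its own share of the right-hand sides independently.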