I need to solve multiple instances of the Helmholtz equation with a fixed matrix but different right-hand sides, so I found this example. I have some experience with FEM in FreeFem and MATLAB, but zero experience with parallel computing. I have two questions regarding this example.
Is this code ready for parallel computing? Running the code with time ff-mpirun -np 4 takes even longer than time ff-mpirun -np 1 (almost 4 times longer). Using clock() I see that the system is solved 4 times, and each solve takes about the same time as with -np 1. Should I run the code with a different command?
I suppose the code uses some domain decomposition and a parallel solver to speed up the computation. But in my case the bottleneck is the evaluation of the right-hand-side function and the assembly of the FEM vector (lines 31–35 in the example code). How do we speed up that part using multiple cores? I went through this post and the two YouTube videos but still have no clue (it seems the videos are about domain decomposition and parallel solvers).
I also searched other posts but didn't find complete code. I hope somebody can provide a minimal but complete example for speeding up problems with multiple rhs (or multiple matrices).
int nRhs = 1000; // in my case I have a lot more iterations
complex[int, int] rhs(Vh.ndof, nRhs), sol(Vh.ndof, nRhs);
for(int i = 0; i < nRhs; ++i) {
    func f = 100 * exp(-20 * ((x-i/3.0)^2 + (y-i/3.0)^2)); // my function takes long to evaluate
    varf vRhs(u, v) = int2d(Th)(f * v);
    rhs(:, i) = vRhs(0, Vh);
}
When I run ff-mpirun -np 8 example14.edp, will the program distribute the for loop over 8 cores or decompose the mesh Th into 8 parts?
This problem is tiny and for educational purposes only. Increase the mesh size, do not time it with clock() but instead run the program with the command-line option -log_view, and at some point you'll see that it gets faster.
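For example (assuming the script is example14.edp, as above, and that -log_view is simply appended to the run command so PETSc prints a performance summary at the end), one could compare

ff-mpirun -np 1 example14.edp -log_view
ff-mpirun -np 4 example14.edp -log_view

on the refined mesh and look at the times reported for the solve stage.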
In my case each problem is tiny (a 40x40 2D mesh), but it has to be solved with millions of different rhs. In this case I suppose domain decomposition will not help a lot, so parallelizing the for loop is necessary? What is the API to do so in FreeFem?
int ntask = 200;
for(int i = (mpirank * ntask) / mpisize; i < ((mpirank + 1) * ntask) / mpisize; ++i) {
cout << "process #" << mpirank << " dealing with task #" << i << endl;
}
mpiAllReduce(...); // for synchronization
Sorry, I am really new to this MPI stuff. What should we put inside mpiAllReduce? I looked through the FreeFem documentation but can't find the relevant parts (most of it is about domain decomposition and parallel solvers). Should I just read the MPI documentation itself?
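My current guess, based on the MPI documentation, is a pattern like the sketch below: each process computes only its own tasks into a full-size array, and the per-process arrays are then summed so that every rank ends up with all the results. The names resLocal and resGlobal are made up here, and I am not sure this is the correct FreeFem signature for mpiAllReduce:

int ntask = 200;
real[int] resLocal(ntask), resGlobal(ntask);
resLocal = 0; // each process fills only its own entries, the rest stay zero
for(int i = (mpirank * ntask) / mpisize; i < ((mpirank + 1) * ntask) / mpisize; ++i) {
    resLocal[i] = 2.0 * i; // placeholder for the actual per-task computation
}
// sum the partial arrays over all processes (assuming mpiAllReduce accepts real[int] like this);
// afterwards every rank holds the results of all tasks
mpiAllReduce(resLocal, resGlobal, mpiCommWorld, mpiSUM);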
Now I can get my code to run, but the output data has more points than expected. Here is the last part of my code:
// solve the system
complex[int, int] B(A.n, rhs.m), X(A.n, rhs.m);
ChangeNumbering(A, rhs, B);
set(A, sparams = "-ksp_type preonly -pc_type lu");
KSPSolve(A, B, X);
ChangeNumbering(A, sol, X, inverse = true, exchange = true);
// write to disk the solution values on a circle
Vh ur;
real xt;
real yt;
for(int i = 0; i < rhs.m; ++i) {
    ur[] = sol(:, i).re;
    for(int n = 0; n < nOut; ++n) {
        xt = rOut * cos(2*pi*n/nOut);
        yt = rOut * sin(2*pi*n/nOut);
        urfile << ur(xt, yt) << endl;
    }
}
My rhs.m = 1000 and nOut = 64, so the total number of values in urfile should be 64000. However, I get 64015 values when I use multiple processes with ff-mpirun!
I feel like domain decomposition will not do much good for my small problems, but I do want to use MPI for the loops in my code (which I now know how to do). Is there a way to disable the domain decomposition but still use MPI when I run the code with ff-mpirun?
Before using PETSc, I tried the simple way: sol = A^-1 * rhs, where sol and rhs are 2D arrays, and then extracted each solution as sol(:, i). However, the solutions obtained this way are completely wrong unless rhs has only one column (see the attached plot of one such solution). Roughly, what I did looks like the sketch below.
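This is only a minimal sketch of what I mean; vA stands for the varf of the Helmholtz matrix from the example, ui is a placeholder name, and Vh/nRhs are as in my first snippet:

matrix<complex> A = vA(Vh, Vh, solver = sparsesolver); // fixed Helmholtz matrix
complex[int, int] rhs(Vh.ndof, nRhs), sol(Vh.ndof, nRhs);
// ... fill the columns of rhs as in my first snippet ...
sol = A^-1 * rhs; // this is the step that gives wrong solutions when nRhs > 1
Vh<complex> ui;
ui[] = sol(:, 0); // extract the first solution for plotting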
This is exactly the part of the original example code that I don't understand: I don't see which line performs the domain decomposition explicitly, so I don't know how to avoid the decomposition other than using -np 1 on the command line. But then everything runs sequentially.