Memory issue with parallel computing

Dear FreeFem team, I have been using the FreeFem PETSc framework for my research. I have noticed a memory accumulation problem when I increase the number of MPI processes. I think the problem comes from the code below:

    real[int] Rhorhs = HelmholtzRho(0, Gamma11);
    Gamma11<real> def2(rhoi);
    rhoi[] = Afilt1^-1 * Rhorhs; // LOCAL P1 OVERLAP SOLUTION

    desV1 = -rhoi;
    desV2 = -rhoiy;
    desV1[] .*= A1.D;
    desV2[] .*= A1.D;

    rhoFilt1G[](loc2globP1) = desV1[]; // TRANSFER LOCAL P1 SOLUTION TO GLOBAL P1
    rhoFilt2G[](loc2globP1) = desV2[];
    rhoFilt1Gt = 0.;
    rhoFilt2Gt = 0.;
    mpiAllReduce(rhoFilt1G[], rhoFilt1Gt[], mpiCommWorld, mpiSUM); // REDUCE GLOBAL P1 TO SINGLE UNIQUE GLOBAL P1
    mpiAllReduce(rhoFilt2G[], rhoFilt2Gt[], mpiCommWorld, mpiSUM);
    rho1G = rhoFilt1Gt; // INTERPOLATE FE SPACE FROM P1 TO P0
    rho2G = rhoFilt2Gt;

The global variable “rhoFilt2G” is replicated on every process. Is there any way I can avoid this, or do I have to send the local data to the master process and re-assemble it there with MPI functions?
Thank you.

If you don’t want memory issues, don’t centralize any variable. This will scale badly for large meshes or a large number of processes. Why do you need the variables rhoFilt1Gt and rhoFilt2Gt?

There are two places where I need the global data: 1. when I do the P1-to-P0 interpolation over the entire domain; 2. when I sort my stress, for example rhoFilt1Gt; I need it in descending order.

  1. Why not do the interpolation locally?
  2. Why not sort locally? Why do you need to sort the stress?

The sorted stress is used for my stress groups in the optimization. If I sort locally, the order is not the global one. For example, if the global array is [1,2,4,5,6], the local (overlapping) pieces might be [1,4,6] and [2,5,4]; sorting each piece locally gives [1,4,6] and [2,4,5], which is not the same as sorting the global array and then splitting it. The interpolation I can probably do locally.
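To restate that as a tiny FreeFem sketch (made-up numbers, illustrative names, and assuming the array .sort method is available in your version):

    real[int] glob = [1, 2, 4, 5, 6]; // the global array
    real[int] loc1 = [1, 4, 6];       // local piece on process 0 (the value 4 lies in the overlap)
    real[int] loc2 = [2, 5, 4];       // local piece on process 1
    loc1.sort;                        // -> [1, 4, 6]
    loc2.sort;                        // -> [2, 4, 5]
    glob.sort;                        // -> [1, 2, 4, 5, 6]; splitting this is not the same as the two local sorts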

I don’t know about your stress group for optimization, so unless you provide more details, I can’t help you.

Sure, should I give you the code?

No: why do you need to sort the stress for your optimizer?

Hi prj, since this code was not written by me, I will let the author talk to you directly; she will probably give more details. Thank you.

Hi prj, I work with zcwang6 and I am the author of the optimization script.
The stresses need to be sorted to use a p-norm approximation of the local stresses in each group. The stress groups represent stress constraints in the optimization problem, instead of having many individual stress constraints. Thus the local stress data is transferred and reduced to a global array for sorting.
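Roughly, the aggregation in question is the standard p-norm measure (the exact grouped form is Eq. 13 of the paper cited further down): for a group $g$ of elements with local stresses $\sigma_i$,

$$\sigma^{\mathrm{PN}}_{g} = \Big( \sum_{i \in g} \sigma_i^{\,p} \Big)^{1/p},$$

which approaches $\max_{i \in g} \sigma_i$ as $p \to \infty$. The groups themselves are built from the globally sorted stress values, which is why a global ordering is needed.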

As for the interpolation from P1 to P0, I am not aware of a way to accurately interpolate using only local data. That approach (transferring and reducing to a global array, interpolating, then transferring back to the local P0 space) is what I’ve found to be accurate.

I appreciate any advice you can give on how we can reduce the memory consumption from global variables across all processors.

“As for the interpolation from P1 to P0, I am not aware of a way to accurately interpolate using only local data”

What is wrong with interpolating local data?

Do you have a pseudo-code available somewhere of your optimization algorithm?

What solver do you use for computing Afilt1^-1 * Rhorhs ?

GMRES KSP solver with GAMG preconditioner
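(For reference, that is selected through the FreeFem PETSc interface roughly as follows; just a sketch, assuming Afilt1 is a distributed Mat and using the standard PETSc option names:)

    set(Afilt1, sparams = "-ksp_type gmres -pc_type gamg"); // GMRES KSP with a GAMG preconditioner
    rhoi[] = Afilt1^-1 * Rhorhs;                            // the actual KSPSolve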

In my parallelization trials, I noticed that the interpolated data at the subdomain boundaries is not as accurate when I use a direct interpolation of the local data, e.g. rho1 = rhoFilt1 (P0 function = P1 function). Was this a correct way of doing it?

As for the stress, I’ve attached a snippet from a research paper that may help explain. The last paragraph and Equation 13 describe the sorting used for stress grouping to apply the p-norm approximation.

(ref: Le, C., Norato, J., Bruns, T. et al. Stress-based topology optimization for continua. Struct Multidisc Optim 41, 605–620 (2010). https://doi.org/10.1007/s00158-009-0440-y)

“Was this a correct way of doing it?”

It is to be expected that the interpolated data at the boundary is not accurate in parallel if you do not synchronize the values after interpolation, see a very similar question asked today. If it is still not working, please provide a minimal working example.

I see your optimization algorithm. There is no simple answer: if you want a global order, you have to pay the price… Still, I doubt this implies a large cost unless you are working with extremely large meshes; it’s just a vector after all. I guess there is something else in your implementation that is not scalable, but for that I’d indeed need to see the code.

I am familiar with the exchange function, as I also use it to synchronize dx(u) after solving my elasticity equation. In the case of interpolation, should I use the exchange function to synchronize the P0 data after interpolation, or to synchronize the P1 data before interpolation?

Regarding the other global data required, we do work with a large mesh (13 to 17 million elements) for optimization, so the global variables become very costly.

It depends on what you do before and after the interpolation. There is no harm in calling the exchange routine more often than needed, so if there is still a problem when calling it both before and after, please send an MWE.
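For instance, a minimal sketch of a purely local interpolation followed by synchronization (assuming a Mat A0 has been built with createMat on a local P0 space Wh0, that A1 is the Mat on the P1 space, and that rhoi is your local P1 field; the names are only illustrative):

    Wh0 rho0;                            // local P0 function
    exchange(A1, rhoi[], scaled = true); // optional: make the P1 field consistent in the overlap first
    rho0 = rhoi;                         // purely local interpolation P1 -> P0
    exchange(A0, rho0[], scaled = true); // synchronize the P0 values in the overlap after interpolation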

OK, well, that’s a good argument against any method that uses a global metric, such as the one you are referring to just above :slight_smile: Can’t you average the stress per subdomain, and then sort that globally? The vector to reduce would of course be much smaller (number of MPI processes instead of number of elements).
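A rough sketch of what I mean (assuming sigma0 is your local P0 stress field and A0 a Mat built on the P0 space, so that A0.D is the corresponding partition of unity; names are illustrative):

    real[int] d = A0.D;                           // partition of unity on the P0 space
    real[int] w = sigma0[];
    w .*= d;                                      // so overlapping elements are not counted twice
    real[int] tmp(mpisize);
    real[int] avg(mpisize);
    tmp = 0.;
    tmp[mpirank] = w.sum / d.sum;                 // (weighted) average stress of this subdomain
    mpiAllReduce(tmp, avg, mpiCommWorld, mpiSUM); // one value per process instead of one per element
    avg.sort;                                     // global ordering of the subdomain averages

If you need a descending order, sort the negated vector instead.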

I will change my method of interpolation accordingly and validate it. As for the stress grouping and approximation, I had also considered the approach you propose, but it doesn’t work very well.
However, we may just use a single stress group to avoid the need for sorting: this reduces the accuracy of the approximation of the stress concentration in the design, but the benefit of lower memory usage outweighs that.

Thank you very much for your guidance.