Dear FreeFem team, I have been using the FreeFem PETSc framework for my research. I have noticed a memory accumulation problem when I increase the number of MPI processes. I think the problem comes from the code below:
real[int] Rhorhs = HelmholtzRho(0, Gamma11);
Gamma11<real> def2(rhoi);
rhoi[] = Afilt1^-1 * Rhorhs; // LOCAL P1 OVERLAP SOLUTION
desV1 = -rhoi; desV2 = -rhoiy;
desV1[] .*= A1.D;
desV2[] .*= A1.D;
rhoFilt1G[](loc2globP1) = desV1[]; // TRANSFER LOCAL P1 SOLUTION TO GLOBAL P1
rhoFilt2G[](loc2globP1) = desV2[];
rhoFilt1Gt = 0.; rhoFilt2Gt = 0.;
mpiAllReduce(rhoFilt1G[], rhoFilt1Gt[], mpiCommWorld, mpiSUM); // REDUCE GLOBAL P1 TO SINGLE UNIQUE GLOBAL P1
mpiAllReduce(rhoFilt2G[], rhoFilt2Gt[], mpiCommWorld, mpiSUM);
rho1G = rhoFilt1Gt; // INTERPOLATE FE SPACE FROM P1 TO P0
rho2G = rhoFilt2Gt;
The global variable “rhoFilt2G” is duplicated on every process; is there any way I can avoid this, or do I have to send the local data to the master rank and reassemble it with MPI functions?
Thank you.
If you don’t want memory issues, don’t centralize any variable; this will scale badly for large meshes or a large number of processes. Why do you need the variables rhoFilt1Gt and rhoFilt2Gt?
There are two places where I need the global data: 1. when I do the P1-to-P0 interpolation for the entire domain; 2. when I sort my stress, for example rhoFilt1Gt, which I need in descending order.
The sorted stress is used for my stress groups in the optimization. If I sort locally, the local order is not the global one: for example, if the global array is [1,2,4,5,6] but the local pieces are [1,4,6] and [2,5,4], then sorting each piece gives [1,4,6] and [2,4,5], which is not the same as sorting globally and then splitting the result. The interpolation I can probably do locally.
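Here is a tiny standalone illustration of what I mean (made-up numbers, not from the actual script):
// illustration only: sorting the local pieces does not reproduce the global sorted order
real[int] glob = [1, 2, 4, 5, 6]; // the full "global" stress values
real[int] loc1 = [1, 4, 6];       // values held by rank 0
real[int] loc2 = [2, 5, 4];       // values held by rank 1
loc1.sort;                        // [1, 4, 6]
loc2.sort;                        // [2, 4, 5]
glob.sort;                        // [1, 2, 4, 5, 6]: the order, and hence the grouping, is different
cout << loc1 << loc2 << glob << endl;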
Hi prj, I work with zcwang6 and I am the author of the optimization script.
The stresses need to be sorted to use a p-norm approximation of the local stresses in each group. The stress groups represent stress constraints in the optimization problem, instead of having many individual stress constraints. Thus the local stress data is transferred and reduced to a global array for sorting.
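For reference, the aggregate we use for each group is the usual p-norm of the element stresses assigned to that group, roughly
$$\sigma^{PN}_g = \Big(\sum_{e \in g} \sigma_e^{\,p}\Big)^{1/p},$$
possibly with element volume weights as in the paper cited below; the assignment of elements to the groups is what requires the globally sorted stress vector.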
As for the interpolation from P1 to P0, I am not aware of a way to accurately interpolate using only local data. That way (transferring and reducing to a global array and interpolating, then transferring back to local P0) is what I’ve found to be accurate.
I appreciate any advice you can give on how we can reduce the memory consumption from global variables across all processors.
In my parallelization trials, I had noticed that the interpolated data at the boundary is not as accurate when I use direct interpolation of the local data, e.g. rho1 = rhofilt1 (P0 function = P1 function). Was this a correct way of doing it?
As for the stress, I’ve attached a snippet from a research paper that may help explain. The last paragraph and equation 13 discuss the sorting used for stress grouping to apply the p-norm approximation.
(ref: Le, C., Norato, J., Bruns, T., et al. Stress-based topology optimization for continua. Struct Multidisc Optim 41, 605–620 (2010). https://doi.org/10.1007/s00158-009-0440-y)
It is to be expected that the interpolated data at the boundary is not accurate in parallel if you do not synchronize the values after interpolation, see a very similar question asked today. If it is still not working, please provide a minimal working example.
I see your optimization algorithm. There is no simple answer: if you want a global order, you have to pay the price… Still, I doubt this implies a large cost unless you are working with extremely large meshes; it’s just a vector, after all. I guess there is something else in your implementation that is not scalable, but for that I’d indeed need to see the code.
I am familiar with the exchange function as I also use it to synchronize dx(u) after solving my elasticity equation. In the case of interpolation, do I use the exchange function to synchronize the P0 data after interpolation, or synchronize the P1 data before interpolation?
Regarding the other global data required, we do work with a large mesh (13 to 17 million elements) for optimization, so the global variables become very costly.
It depends on what you do before and after the interpolation. There is no harm in calling the exchange routine more often than needed, so if there is still a problem when calling it both before and after, please send a MWE.
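Something along these lines should work (untested sketch, all names and the mesh are placeholders):
// untested sketch: distributed P1 -> P0 interpolation followed by ghost synchronization
load "PETSc"
macro dimension()2// EOM
include "macro_ddm.idp"
macro def(i)i// EOM
macro init(i)i// EOM

mesh Th = square(40, 40);
buildDmesh(Th) // distribute the mesh with overlap
fespace Vh1(Th, P1);
fespace Vh0(Th, P0);
Mat A1, A0;
createMat(Th, A1, P1) // partition of unity and exchange data for the P1 space
createMat(Th, A0, P0) // same for the P0 space

Vh1 rhoFilt = x + y;  // placeholder for your filtered P1 field
Vh0 rho = rhoFilt;    // plain local interpolation P1 -> P0
exchange(A0, rho[], scaled = true); // synchronize the ghost values after interpolation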
OK, well, that’s a good argument against any method that uses a global metric, such as the one you are referring to just above. Can’t you average the stress per subdomain, and then sort that globally? The vector to reduce would be much smaller, of course (number of MPI processes instead of number of elements).
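Roughly like this (untested; assuming the element stresses are in a P0 function stress and A0 is a Mat built on that P0 space):
// untested sketch: one averaged stress per subdomain, then a tiny global sort
real[int] s = stress[]; // local element stresses (placeholder name)
real[int] D = A0.D;     // partition of unity of the P0 Mat
s .*= D;                // scale so overlapping elements are not counted twice
real[int] loc(mpisize), glob(mpisize);
loc = 0;
loc[mpirank] = s.sum / D.sum; // average stress of this subdomain
mpiAllReduce(loc, glob, mpiCommWorld, mpiSUM);
glob.sort; // mpisize values to sort instead of millions of elements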
I will change my method of interpolation accordingly and validate. As for the stress grouping and approximation, I had also thought of the way you proposed, but it doesn’t work very well.
However, we may just use a single stress group to avoid the need for sorting - this reduces the accuracy of the approximation of the stress concentration in the design, but the benefit of better memory usage outweighs that.
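For the single-group case, I believe the p-norm can then be accumulated without ever forming a global array, something like (untested; sig, A0, and p are placeholders):
// untested sketch: single-group p-norm computed without any centralized array
real p = 8.;            // aggregation exponent (placeholder value)
real[int] D = A0.D;     // partition of unity of the P0 Mat
real locSum = 0;
for (int e = 0; e < Vh0.ndof; e++)
    locSum += D[e] * abs(sig[][e])^p; // sig: local P0 stress function
real globSum;
mpiAllReduce(locSum, globSum, mpiCommWorld, mpiSUM);
real sigmaPN = globSum^(1. / p); // global p-norm of the single stress group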