Memory Leak Issue with ff-mpirun

As I was looking for a memory leak in my code using Valgrind, I ran into an issue that I unfortunately was not able to understand. Here it is.

I take the simplest code one can write; it just prints "Hello World":

cout << "Hello World" << endl;

and launch it in 3 different ways:

  1. FreeFem++
  2. ff-mpirun -np 1
  3. mpirun -np 1 FreeFem++-mpi

Using Valgrind, I get the following reports (attached): (1.2 KB) (1022 Bytes) (4.0 KB)

What strikes me is that I get memory leaks when I run the program with 2. and 3., but not with 1.

I would like to understand this in order to run my much larger program with method 2. or 3. Can someone explain to me what is happening?

Thanks a lot,

FreeFEM is not Valgrind-clean; what is the exact issue you are facing?

My exact issue is that I want to define a P0 function on a fine 2D mesh (h = 0.001 for a domain of size 1 by 1). I assign a value to each element through a specific algorithm which is computationally expensive. Therefore, I use parallelization and want to apply ff-mpirun (or mpirun FreeFem++-mpi, but I understand that it is more or less the same) on 4 cores. I estimate the computation time at 6 hours with this parallelization.
I managed to run this algorithm and get my function once, but most of the time the program quits without any explanation. Since I spotted a kernel oops after my last attempt, I was looking for a memory leak.
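For context, a common way to split such an element-wise loop under ff-mpirun is to give each rank a strided subset of the triangles and reduce the partial results at the end. The following is only a sketch: the mesh size and the per-element "computation" are placeholders, not the actual algorithm.

```freefem
// Sketch: each MPI rank handles a strided subset of the elements.
mesh Th = square(100, 100);          // h = 0.01 here, just for illustration
fespace Vh(Th, P0);
Vh u = 0;
for (int k = mpirank; k < Th.nt; k += mpisize)
    u[][k] = k;                      // placeholder for the expensive per-element work
// sum the partial vectors so that rank 0 holds the full P0 function
real[int] tmp(Vh.ndof);
mpiReduce(u[], tmp, processor(0), mpiSUM);
if (mpirank == 0) u[] = tmp;
```

Since each element is handled by exactly one rank, the zero-initialized partial vectors sum to the complete function on rank 0.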

That’s highly unlikely to be the problem. Do you use a plugin? If the computation is expensive per element, you should write a plugin to do this in C++, that’s the most efficient way to achieve decent performance.

Also, to check for such a leak, since FreeFEM is not Valgrind-clean, it’s best to use Massif. If you share your code with a small mesh, I could probably give you a better insight.


Well, I'm not sure that this would solve the issue. It is expensive mostly because it relies on the trunc function of FreeFem++, which is computationally expensive (when the truncated object contains lots of elements). In short, for each element, I need to truncate the mesh and then perform integrals over the new truncated mesh. To be more specific, here is a piece of code that contains a simplified version of what I want to do.
program.edp (2.7 KB)

You can play with the parameter h, which defines the mesh refinement.

Unless I rewrite the trunc function (something that I would be highly unqualified to do), writing a plugin would presumably not be more efficient.

  1. I don’t see any code.
  2. Truncating a mesh over and over is counterproductive; you would be better off using regions.
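For illustration, a region-based integral avoids truncation entirely. This is a minimal sketch; the region label is whatever the mesh carries (square() produces a single region), and the integrand is a placeholder:

```freefem
// Sketch: integrating over a subdomain through its region label,
// with no mesh truncation involved.
mesh Th = square(50, 50);
int lab = Th(0.25, 0.25).region;     // region label at a given point
real area = int2d(Th, lab)(1.0);     // integral restricted to that region
```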

OK, thanks, I'll try Massif!
That would be of great help. I shared a simplified version of my code below:
program.edp (2.7 KB)
(same as above)

Indeed, I thought about using regions, but after looking at the documentation, it seemed to me that defining a region per element (as I would need to do) was not a correct use of the concept of regions. It could be that I'm wrong though; correct me if so.

The fact that the simplified program already takes a non-negligible amount of time should be a good indicator that this is not the way to proceed. What is it that you compute in the loop?
Furthermore, what's the point of the chi variable? Why not simply do THK = trunc(Th, nuTriangle == T /* chiT>0.1 */, split=1);? Anyway, unless you tell me precisely what you are doing in that loop, I cannot really help you further.
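The per-element truncation pattern under discussion looks roughly like this (a sketch; the integrand is a placeholder for the actual computation):

```freefem
// Sketch of the per-element trunc loop being discussed.
mesh Th = square(20, 20);
fespace Vh(Th, P0);
Vh u;
for (int T = 0; T < Th.nt; T++) {
    mesh THK = trunc(Th, nuTriangle == T, split = 1); // one-triangle mesh
    u[][T] = int2d(THK)(1.0);        // placeholder integrand: gives the area
}
```

Each iteration builds a fresh one-triangle mesh, which is what makes the loop expensive on a fine mesh.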

It is simplified in the sense that it avoids the mathematical technicalities (related to homogenization theory, inverse problems, and a wide range of other analysis fields) that led me to consider this algorithm in order to define my solution. Nevertheless, this implementation contains every step that I need to perform, and I would really like to make it work!

Yes, I was not aware of the existence of this parameter. I'll correct it, thanks for the suggestion!

If you want some insight into what I am doing, you can check this article, which proposes an idea similar to mine: (see Section 3.3)

I do not have the time to go through this paper and figure out how you may have implemented the algorithm. Again, if you are willing to show what is costly inside the for loop, and what you compute element by element, I will be able to tell you whether this could be greatly simplified or not.

In my case, the for loop is exactly the same, except that I use other functions for ueps and ubar. I did not include the code defining these two functions since it is irrelevant to my issue. Now, I don't really need my code to be simplified; my issue is that the program somehow quits without giving me any explanation. Note also that, from time to time, it manages to finish and write the solution to a file.

Therefore, I would like to understand where my implementation is wrong and how I could correct it so that my program finishes every time.

I hope I'm being clear, but if something is unclear, tell me and I will gladly reformulate.

I run the following command in my terminal:
nohup ff-mpirun -np 4 program.edp > Results.txt &

I guess you could attach the debugger to each process, or see if you can catch any exceptions; it looks like you are using catch for division by zero?
If you just want to integrate over different regions, it may be faster to define a gating factor (1 or 0) and multiply by it in your integration. You could probably write simple C++ code to do this, especially if it is a fixed uniform mesh (and even more so if you are happy with Simpson's rule).
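The gating-factor idea can be sketched as a P0 indicator that is 1 on the target element and 0 elsewhere; the element index and integrand below are illustrative:

```freefem
// Sketch: a 0/1 "gate" multiplied into the integrand, instead of trunc.
mesh Th = square(20, 20);
fespace Vh(Th, P0);
Vh chi = 0;
chi[][42] = 1;                       // gate open on element 42 only
real v = int2d(Th)(chi*(x + y));     // whole-mesh integral, gated to one element
```

This trades the cost of truncation for one pass over the full mesh per integral.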

I tried that approach and it seems to be slower than trunc, although if you want to use it as a replacement it may be an alternative approach. If you are just picking out small regions, you may be able to do good optimizations in dedicated C++.

You are trying to parallelize a slow and inefficient script. As I've said, this is not the proper way to move forward. Anyway, just to confirm my initial answer to your original question, there is no significant leak; here is the Massif output.

4.576^                                ::::@@::::@::::@::::@:::                
     |##::::::::::::::::::::::::::::::: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
     |# :::: :::: :::::::::: ::::::: :: ::@ ::::@::::@::::@:: :::@:::::::::@::
   0 +----------------------------------------------------------------------->GB
     0                                                                   6.874

Number of snapshots: 65
 Detailed snapshots: [1 (peak), 31, 36, 41, 46, 52, 62]

Why is it failing on your machine? Probably a transient fault.

Indeed, the try-catch structure is there for the division-by-zero error.
Using a gating factor was my first guess, but as you said, it is slower than using a truncation!
Simpson's rule would indeed satisfy me; I'll try it! Thanks for the tip!

Thanks for your answer, you're right, a debugger would give me more insight into what's happening.

Thanks for having checked that.
I hope it's not... and if it is, I don't understand how the program was able to finish (about 10 times out of my >50 launches)? As marchywka said, using a debugger will give me some insight.

Just looking at the code, you in fact only seem to be integrating over one element at a time, and I guess you are building some kernel or something. In any case, you can get the element parameters, like the volume IIRC, and just about do the integration by writing it out. Regions, as suggested before, may work as well. With a uniform grid and P0 elements it's easy. I'm not even sure why you need the larger grid right now.

And I'm not sure why you catch division by zero instead of just checking. If you only have a few, then I guess they are exceptional cases, but generally confusing the FPU, even with denormals, can really slow things down. Although maybe things have changed in the last decade since I looked :)

After seeing some segfaults, and not understanding the FF memory usage, I became a bit concerned and have more examples to post. But in at least one case, what I thought was memory corruption turned out to be a sign mistake turning an imaginary eigenvalue into a real one, causing the solution to grow exponentially and generate garbage output.

Yes, I am only integrating over one element at a time. I'm not sure what you mean by "making some kernel".
Yes, exactly, when I use P0 elements it's much easier, since the integral over one element is just the value on the triangle times the area of the triangle (so I know how to avoid the truncation and so on).
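That direct P0 integral can be written without any truncation, for instance as follows (a sketch with an illustrative function and element index):

```freefem
// Sketch: integral of a P0 function over a single triangle,
// computed as value times area.
mesh Th = square(20, 20);
fespace Vh(Th, P0);
Vh f = x + y;
int k = 42;
real Ik = f[][k]*Th[k].area;         // no trunc, no int2d needed
```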

I'm not sure either; I guess catching the error and checking beforehand amount to the same thing.

Finally, I used the Simpson approximation as you suggested and everything went well, and the computation time was largely reduced. Thanks for the suggestion!
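For reference, the 1D composite Simpson rule mentioned above can be sketched as follows (on triangles one would use an analogous quadrature); the test function is illustrative:

```freefem
// Sketch: composite Simpson's rule on [a, b] with n (even) subintervals.
func real g(real t) { return t*t; }
real a = 0., b = 1.;
int n = 100;                         // must be even
real h = (b - a)/n;
real s = g(a) + g(b);
for (int i = 1; i < n; i++) {
    if (i % 2 == 1) s += 4.*g(a + i*h);
    else            s += 2.*g(a + i*h);
}
real I = s*h/3.;                     // close to the exact value 1/3
```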