Was this a correct way of doing it?
It is to be expected that the interpolated data at the boundary is not accurate in parallel if you do not synchronize the values after interpolation, see a very similar question asked today. If it is still not working, please provide a minimal working example.
I see your optimization algorithm. There is no simple answer: if you want a global order, you have to pay the price… Still, I doubt this implies a large cost unless you are working with extremely large meshes, it’s just a vector after all. I guess there is something else in your implementation that is not scalable, but for that I’d need to see the code indeed.