FreeFem++ Not Using GPU with PETSc

Description:

I am encountering issues with FreeFem++ not utilizing the GPU when solving problems with PETSc. Despite following all the necessary steps to enable GPU support, it seems that FreeFem++ does not leverage CUDA during execution. Below, I outline everything I have done so far.

1. PETSc Configuration

I configured PETSc with CUDA support using the following command:

./configure --with-mpi=1 \
  --with-cuda=1 \
  --with-cudac=nvcc \
  --download-fblaslapack \
  --with-debugging=0 \
  --prefix=/usr/local/petsc \
  --with-shared-libraries=1 \
  --download-hpddm \
  --download-metis \
  --download-ptscotch \
  --download-parmetis \
  --download-superlu \
  --download-mmg \
  --download-parmmg \
  --download-scalapack \
  --download-mumps

I also set the necessary environment variables:

export PETSC_DIR=/usr/local/petsc
export PETSC_ARCH=""
export PATH=$PETSC_DIR/lib/petsc/bin:$PATH
export LD_LIBRARY_PATH=$PETSC_DIR/lib:$LD_LIBRARY_PATH

PETSc appears to work correctly with CUDA, as verified by running examples such as ex19 and ex2. CUDA was successfully detected and utilized by PETSc during these tests.
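
For reference, a check along the following lines (the source-tree path is only a placeholder) shows non-zero entries in the GPU columns at the end of -log_view when CUDA is actually used:

# Run SNES tutorial ex19 from the PETSc source tree with CUDA vector/matrix types.
# With a working CUDA build, the "GPU Mflop/s" and "GPU %F" columns of -log_view are non-zero.
cd /path/to/petsc-source/src/snes/tutorials   # placeholder: adjust to your PETSc source checkout
make ex19
mpiexec -n 1 ./ex19 -dm_vec_type cuda -dm_mat_type aijcusparse -pc_type jacobi -ksp_monitor -log_view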

2. FreeFem++ Installation

I installed FreeFem++ from the develop branch with support for PETSc and CUDA using the following commands:

./configure \
  CPPFLAGS="-I$PETSC_DIR/include" \
  LDFLAGS="-L$PETSC_DIR/lib" \
  --enable-download \
  --enable-optim \
  --with-mpi=/usr/local/openmpi \
  --with-petsc=$PETSC_DIR \
  --enable-summary \
  --without-mumps

Additional steps taken to address potential issues:

Used ./3rdparty/getall -a to download missing dependencies.
Manually edited the mpic.c file in the MUMPS directory to resolve compilation errors.
Built FreeFem++ with make -j$(nproc) and passed all tests using make check.

What preconditioner are you using? What MatType are you using?

Thank you for your response. Here are the details about the preconditioner and matrix type I am using in the example:

Preconditioner: Jacobi (-pc_type jacobi)
Matrix type: aijcusparse (GPU-optimized format)

The script I am using is as follows:

load "PETSc"

macro defKSP_OPTIONS() "-ksp_type cg -pc_type jacobi -vec_type cuda -mat_type aijcusparse -ksp_monitor -log_view" //

int nx = 5000, ny = 5000;
real lx = 1.0, ly = 1.0;
mesh Th = square(nx, ny, [x * lx, y * ly]);

fespace Vh(Th, P1);

Vh u, v;
func f = 1.0;

solve Poisson(u, v, solver = "petsc") =
int2d(Th)(dx(u) * dx(v) + dy(u) * dy(v)) -
int2d(Th)(f * v);

cout << "System solved using GPU with PETSc." << endl;

I am running this script with the following command:

mpiexec -n 1 FreeFem++ poisson_gpu.edp -log_view

The problem is solved without errors, but it seems that the GPU is not being utilized during the execution, even though PETSc was configured with CUDA support and examples such as ex19 and ex2 work correctly using the GPU.

Is there something I might be missing to ensure FreeFem++ leverages the GPU? Any additional guidance would be greatly appreciated.

Could you please send the output of -ksp_view?

jp@jp:~/Descargas$ mpiexec -n 1 FreeFem++ poisson_gpu.edp -log_view
-- FreeFem++ v4.14 (mié 20 nov 2024 17:13:44 -05 - git v4.14-95-g6fcdacf31)
file : poisson_gpu.edp
Load: lg_fem lg_mesh lg_mesh3 init_mesh3_array
eigenvalue
1 : load "PETSc"
2 :
3 : macro defKSP_OPTIONS() "-ksp_type cg -pc_type jacobi -vec_type cuda -mat_type aijcusparse -ksp_monitor -log_view -ksp_view" ////
4 :
5 : int nx = 1000, ny = 500;
6 : real lx = 1.0, ly = 1.0;
7 : mesh Th = square(nx, ny, [x * lx, y * ly]);
8 :
9 : fespace Vh(Th, P1);
10 :
11 : Vh u, v;
12 : func f = 1.0;
13 :
14 : solve Poisson(u, v, solver = "petsc") =
15 : int2d(Th)(dx(u) * dx(v) + dy(u) * dy(v)) -
16 : int2d(Th)(f * v);
17 :
18 : cout << "System solved using GPU with PETSc." << endl;
19 :
20 : sizestack + 1024 =1768 ( 744 )

-- Square mesh : nb vertices =501501 , nb triangles = 1000000 , nb boundary edges 3000 rmdup= 0
-- Solve :
min -1.55651e+10 max -1.55651e+10
System solved using GPU with PETSc.
times: compile 0.074077s, execution 7.02759s, mpirank:0
######## unfreed pointers 23 Nb pointer, 0Bytes , mpirank 0, memory leak =17228288
CodeAlloc : nb ptr 4165, size :560176 mpirank: 0
Ok: Normal End


*** WIDEN YOUR WINDOW TO 160 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***


------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------

FreeFem++ on a named jp with 1 process, by jp on Fri Nov 22 11:43:32 2024
Using Petsc Development GIT revision: v3.22.1-208-g2ad7182b109 GIT Date: 2024-11-20 12:46:14 -0600

                     Max       Max/Min     Avg       Total

Time (sec): 7.083e+00 1.000 7.083e+00
Objects: 0.000e+00 0.000 0.000e+00
Flops: 0.000e+00 0.000 0.000e+00 0.000e+00
Flops/sec: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Msg Count: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Msg Len (bytes): 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Reductions: 0.000e+00 0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 7.0831e+00 100.0%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%


See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
AvgLen: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
CpuToGpu Count: total number of CPU to GPU copies per processor
CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
GpuToCpu Count: total number of GPU to CPU copies per processor
GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
GPU %F: percent flops on GPU in this event

Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F

--- Event Stage 0: Main Stage


Object Type Creations Destructions. Reports information only for process 0.

--- Event Stage 0: Main Stage

========================================================================================================================
Average time to get PetscTime(): 2.37e-08
#PETSc Option Table entries:
-log_view # (source: command line)
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-mpi=1 --with-cuda=1 --with-cudac=nvcc --download-fblaslapack --with-debugging=0 --prefix=/usr/local/petsc --with-shared-libraries=1 --download-hpddm --download-metis --download-ptscotch --download-parmetis --download-superlu --download-mmg --download-parmmg --download-scalapack --download-mumps

Libraries compiled on 2024-11-20 21:56:11 on jp
Machine characteristics: Linux-6.8.0-49-generic-x86_64-with-glibc2.35
Using PETSc directory: /usr/local/petsc
Using PETSc arch:

Using C compiler: mpicc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -g -O
Using Fortran compiler: mpif90 -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -g -O

Using include paths: -I/usr/local/petsc/include -I/usr/local/cuda-12.2/include

Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/usr/local/petsc/lib -L/usr/local/petsc/lib -lpetsc -Wl,-rpath,/usr/local/petsc/lib -L/usr/local/petsc/lib -Wl,-rpath,/usr/local/cuda-12.2/lib64 -L/usr/local/cuda-12.2/lib64 -L/usr/local/cuda-12.2/lib64/stubs -Wl,-rpath,/usr/local/openmpi/lib -L/usr/local/openmpi/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11 -ldmumps -lmumps_common -lpord -lpthread -lscalapack -lsuperlu -lflapack -lfblas -lparmmg -lmmg -lmmg3d -lptesmumps -lptscotchparmetisv3 -lptscotch -lptscotcherr -lesmumps -lscotch -lscotcherr -lparmetis -lmetis -lm -lcudart -lnvToolsExt -lcufft -lcublas -lcusparse -lcusolver -lcurand -lcuda -lX11 -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lrt -lquadmath

You are not using PETSc for the linear solve, which is why the GPU is not being utilized. You need to use a varf, switch to a Mat, and so on. BTW, do you have a good preconditioner in mind for the GPU? Plain Jacobi is terrible, unless you are working on a simple problem.
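
To give a concrete idea, here is a minimal, untested sketch of that structure, adapted from the diffusion-2d-PETSc.edp tutorial shipped with FreeFEM; the -mat_type/-vec_type options passed through set() are the GPU-specific part and are an assumption on my side, not something benchmarked here:

// Untested sketch: assemble a varf into a PETSc Mat so that the Krylov
// solve is handled by PETSc (and can then be offloaded through the options below).
load "PETSc"
macro dimension()2// EOM
include "macro_ddm.idp"

macro def(u)u// EOM                 // scalar field definition (needed by createMat)
macro init(u)u// EOM                // scalar field initialization (needed by createMat)
macro grad(u)[dx(u), dy(u)]// EOM
func Pk = P1;

mesh Th = square(1000, 1000);
Mat A;
buildDmesh(Th)                      // distribute the mesh among the MPI processes
createMat(Th, A, Pk)                // create the distributed PETSc operator

fespace Vh(Th, Pk);
varf vPoisson(u, v) = int2d(Th)(grad(u)' * grad(v))
                    + int2d(Th)(1.0 * v)
                    + on(1, 2, 3, 4, u = 0.0);
A = vPoisson(Vh, Vh, tgv = -1);     // assemble the operator into the Mat
real[int] b = vPoisson(0, Vh, tgv = -1);

// PETSc options: CG + Jacobi, with (assumed) CUDA types for the Mat and Vec
set(A, sparams = "-ksp_type cg -pc_type jacobi -ksp_monitor -ksp_view "
               + "-mat_type aijcusparse -vec_type cuda");
Vh u;
u[] = A^-1 * b;                     // this solve now goes through PETSc KSP

Run it with something like ff-mpirun -np 1 script.edp -log_view and check the GPU columns at the end of the log. Note that this sketch also adds a Dirichlet condition on the whole boundary, which your original formulation did not have.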

Thanks for your observation. It is clear to me that my current implementation is not set up for PETSc to handle the linear solve directly, and I understand that this is necessary to take advantage of the GPU. I am also considering your preconditioner suggestion and am willing to try something more efficient than Jacobi for more complex problems.

Could you provide me with a working example where the GPU is properly utilized with PETSc in FreeFem++? This would be very useful for me to compare and make sure my setup is working correctly.

I really appreciate your help. :blush:

I don’t think there is any example using GPU, yet. Why do you want to use GPU? Are you sure this is appropriate for your application?

Thanks for your response. I am working on a finite element simulation of the electrical behavior of the atrium using the Courtemanche model. My main goal is to speed up the solution of the finite element system on the GPU and, additionally, to move the voltage and current calculations to the GPU as well.

Do you think this would be feasible using FreeFem++ with GPU, or would you recommend another tool or approach to achieve this acceleration? I really appreciate your advice on how to approach this kind of problem, especially if FreeFem++ is not the best option to take advantage of GPU in this context.

My main goal is to speed up the solution of the finite element system on the GPU and, additionally, to move the voltage and current calculations to the GPU as well

Again, what is your motivation for using GPU? Why does plain distributed-memory parallelism not fit your needs?

My motivation for using GPU is to leverage its massive parallelism for node-specific computations like diffusion, voltages, and currents, which are highly parallelizable.

highly parallelizable

How? What is the algorithm? Have you implemented it in CUDA?

No, I have not implemented the algorithm in CUDA yet. My initial intention was to explore whether it is feasible to use FreeFEM with PETSc to leverage GPU acceleration for my simulations. I was curious to see if FreeFEM could natively utilize PETSc’s GPU capabilities. While I have not developed a CUDA implementation myself, I have observed that other tools leveraging GPU acceleration for similar finite element problems have successfully implemented such features. This prompted my interest in exploring the possibilities with FreeFEM.

I have observed that other tools leveraging GPU acceleration for similar finite element problems have successfully implemented such features

Using PETSc? Please share a reference, I will let you know how easy it would be to try in FreeFEM.

PETSc (Portable, Extensible Toolkit for Scientific Computation) offers support for finite element solvers that leverage GPU acceleration. The PetscFE class in PETSc encapsulates finite element discretizations, facilitating the implementation of finite element methods within the PETSc framework.

To utilize GPU capabilities in PETSc, you can configure the library with GPU support and select appropriate backends such as CUDA, HIP, or Kokkos. This enables the execution of algebraic solvers on GPU systems from NVIDIA, AMD, and Intel.

For practical examples, PETSc provides several tutorials and examples that demonstrate the use of finite element solvers with GPU acceleration. Notably, examples like ex12, ex17, and ex62 showcase the application of PetscFE in solving finite element problems.

Additionally, the paper “Toward Performance-Portable PETSc for GPU-based Exascale Systems” discusses the integration of GPU support in PETSc, highlighting its performance and portability across different GPU architectures.

These resources provide comprehensive guidance on implementing finite element solvers with GPU acceleration using PETSc.

I am not looking for a ChatGPT answer. You said: "I have observed that other tools leveraging GPU acceleration for similar finite element problems have successfully implemented such features." What other tools? And how do they relate to node-specific computations like diffusion, voltages, and currents?

GPUs are difficult to use. If you don’t know what you are looking for or what you need, chances are that you will get abysmal performance in the end. I would suggest that you first try plain distributed-memory parallelism. GPUs can work seamlessly in PETSc (and thus in FreeFEM), but again, if you don’t know what you are doing, there is no real point in using them (because you’ll get bad performance).