Compilation with GPU accelerated nodes on the HPC Niflheim #750

JohanKoelsenDeWit · 2024-10-10T08:46:01Z

JohanKoelsenDeWit
Oct 10, 2024

This discussion is a continuation of a Q&A mistakenly opened under 'issues'. I will post my question here underneath:

Dear SMILEI team,

I'm having trouble compiling SMILEI with GPU accelerated nodes on the HPC, Niflheim’s, NVIDIA A100 nodes.
I'm not experienced with GPU accelerated nodes, and it is very likely me that is making the mistakes. However, if it is not too inconvenient, I was hoping you could help me compile SMILEI with GPU acceleration on Niflheim successfully?

I successfully compile with intel/2023a for CPU but encounter an issue when using CUDA for the A100 GPUs. After exporting the GPU compiler to nvcc and running make config=gpu_nvidia, I get the following error:
src/Params/Params.h:421:5: error: body of ‘constexpr’ function ‘static constexpr int Params::getGPUClusterWidth(int)’ not a return-statement 421 | } | ^ src/Params/Params.h: In static member function ‘static constexpr int Params::getGPUClusterGhostCellBorderWidth(int)’: src/Params/Params.h:469:5: error: body of ‘constexpr’ function ‘static constexpr int Params::getGPUClusterGhostCellBorderWidth(int)’ not a return-statement 469 | } | ^ src/Params/Params.h: In static member function ‘static constexpr int Params::getGPUInterpolationClusterCellVolume(int, int)’: src/Params/Params.h:494:5: error: body of ‘constexpr’ function ‘static constexpr int Params::getGPUInterpolationClusterCellVolume(int, int)’ not a return-statement 494 | } | ^

I see in your installation guide that several flags must be supplied in $CXXFLAGS and $GPU_COMPILER_FLAGS, and that your environment variables for jean_zay_gpu_A100 are set to:

export CXXFLAGS="-O3 -std=c++14 -fopenmp -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1"
export GPU_COMPILER_FLAGS="-O3 --std=c++14 -arch=sm_80 --expt-relaxed-constexpr"
export LDFLAGS="-lcudart -lcurand -lgomp"

Neither I or the admins for Niflheim are however sure what these flags must be set to in my case.

Could you please advice me to the correct CXXFLAGS and GPU_COMPILER_FLAGS for Niflheim’s A100 partition?. :-)

Thank you for your help, and kind regards, Johan

With the kind answer:

To your question:
Could you specify what environment you are using / what modules have you loaded.
You mention compiling for CPU with intel oneapi.
For GPU you should only use the compilers provided in an nvhpc package ( currently i recomment 24.5)
Our dependencies are mostly hdf5 and openmpi, these should be compiled with nvc++ after installing nvhpc
Hdf5 is simple enough and you could install it locally following our guide, for openmpi this should be handled by your support.

Regarding flags, there is lot to be said, you should inspire yourself from the jeanzay A100 example in scripts/compile_tools/machine/ but also https://smileipic.github.io/Smilei/Use/install_linux_GPU.html
FOr starters:
-fopenmp should NOT be here , neither -lgomp
you are missing the gpu specific flags for CXXFLAGS such as -acc=gpu -gpu=cc80

so it should look like:

export LDFLAGS="-acc=gpu -gpu=cc80 -cudalib=curand "
export CXXFLAGS="-acc=gpu -gpu=cc80,fastmath -std=c++14 -lcurand -Minfo=accel -w -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 -I$NVDIR/Linux_x86_64/23.11/math_libs/include/"
export GPU_COMPILER_FLAGS="-O3 --std c++14 -arch=sm_80 --expt-relaxed-constexpr -I$NVDIR/hdfsrc/install/include/"

while specifying NVDIR (or you can remove those if you have a module that exports the include and lib folders)

JohanKoelsenDeWit · 2024-10-10T08:57:28Z

JohanKoelsenDeWit
Oct 10, 2024
Author

Thank you kindly.
Here is the list of loaded modules:

intel-compilers/2023.1.0
numactl/2.0.16-GCCcore-12.3.0
UCX/1.14.1-GCCcore-12.3.0
impi/2021.9.0-intel-compilers-2023.1.0
iimpi/2023a
Szip/2.1.1-GCCcore-12.3.0
HDF5/1.14.0-iimpi-2023a
GCCcore/11.3.0
zlib/1.2.12-GCCcore-11.3.0
binutils/2.38-GCCcore-11.3.0
bzip2/1.0.8-GCCcore-11.3.0
ncurses/6.3-GCCcore-11.3.0
libreadline/8.1.2-GCCcore-11.3.0
Tcl/8.6.12-GCCcore-11.3.0
SQLite/3.38.3-GCCcore-11.3.0
XZ/5.2.5-GCCcore-11.3.0
GMP/6.2.1-GCCcore-11.3.0
libffi/3.4.2-GCCcore-11.3.0
OpenSSL/1.1
Python/3.10.4-GCCcore-11.3.0
CUDA/12.1.1

And yes, I believe I need to install nvhpc. I will try to set the flags correctly as well.

3 replies

charlesprouveur Oct 10, 2024
Maintainer

So, in theory you can use nvcc (cuda) to compile CUDA code and use intel compilers for the c++ code that is run on a host (CPU).
The problem with smilei is we use openacc directives AND CUDA (for the projector) . This means we need nvc++ contained in the nvhpc package to compile our c++ code, which is not possible with intel compiler.

You should ask support to do in order:

install nvhpc with your gcc module (11.3 is fine) (edit: in recent versions i recommend 24.5, do NOT install 24.7, i have not tested 24.9 yet)
compile openmpi with nvc++
compile hdf5 with nvc++

with that you will use your python module and we can start worrying about compilation options :)

JohanKoelsenDeWit Oct 11, 2024
Author

The modules are built through the EasyBuild software on Niflheim. It has, however, proven quite difficult to compile openmpi with nvc++, as many other prerequisites have to be built as well.
Do you by any chance have experience with EasyBuild? :)
Thank you for your patience as well!

charlesprouveur Oct 11, 2024
Maintainer

I am afraid, i don't have any experience with EasyBuild.
What we can do as a first step, is try with the openmpi builtin nvhpc (although i expect issues at runtime) to complete the compilation process of smilei .
Once we have completed that and we see runtime issues, we can go into the details of a local openmpi install.
Were you able to create a module of hdf5 compiled with nvc++ ?
If not you can look at https://smileipic.github.io/Smilei/Use/install_linux_GPU.html for a local install of hdf5 with nvc++

JohanKoelsenDeWit · 2024-10-23T09:29:09Z

JohanKoelsenDeWit
Oct 23, 2024
Author

Hi again, Sorry for this late reply. I have been trying to locally compile OpenMP and HDF5 on Niflheim, but it has proven too difficult due to the setup on the HPC. However, we managed to locally compile SMILEI with GPU support on a Linux computer. The compilation was successful, but the simulation seems unable to complete a timestep when running with GPU acceleration. PACE_2D_1.0.txt is the namelist in txt format where the gpu_computing = True, is set for the GPU run. We also ran the simulation with CPUs only. Attached is the output from the CPU and the GPU run. They were run as: CPU version is run with mpirun -n 4 build_cpu/smilei PACE_2D_1.0_cpu.py &> out_cpu.txt GPU version is run with mpirun -n 1 build_nvidia/smilei PACE_2D_1.0.py &> out_gpu.txt We set the number of patches from [16,16] to [1, 1] when running with GPU acceleration as well. It can be seen that the simulation completes a few timesteps on the CPU run but doesn't complete any with the GPU acceleration. 7 GB is allocated to the GPU, so it seems to be initiating just fine. Finally, nvidia_envis the one from the guide used, but we had to replace one $NVDIR with gcc to recover from a bug. I was wondering if you by chance could see what our mistake is? 🙂 Many thanks again for your patience, and best wishes, Johan

…

________________________________ From: charlesprouveur ***@***.***> Sent: Friday, October 11, 2024 12:01 PM To: SmileiPIC/Smilei ***@***.***> Cc: Johan Kølsen de Wit ***@***.***>; Author ***@***.***> Subject: Re: [SmileiPIC/Smilei] Compilation with GPU accelerated nodes on the HPC Niflheim (Discussion #750) I am afraid, i don't have any experience with EasyBuild. What we can do as a first step, is try with the openmpi builtin nvhpc (although i expect issues at runtime) to complete the compilation process of smilei . Once we have completed that and we see runtime issues, we can go into the details of a local openmpi install. Were you able to create a module of hdf5 compiled with nvc++ ? If not you can look at https://smileipic.github.io/Smilei/Use/install_linux_GPU.html for a local install of hdf5 with nvc++ — Reply to this email directly, view it on GitHub<#750 (reply in thread)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BB3MI25R5TIHSGHMCAOA2G3Z26OYVAVCNFSM6AAAAABPWIGLJWVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTAOJRGM3DEOA>. You are receiving this because you authored the thread.Message ID: ***@***.***> _ _ ___ _ | | _ \ \ Version : 5.1-34-g60be16288-master / __| _ __ (_) | | ___ (_) | | \__ \ | ' \ _ | | / -_) _ | | |___/ |_|_|_| |_| |_| \___| |_| | | /_/ Reading the simulation parameters ------------------------------------------------------------------------------- HDF5 version 1.14.2 Python version 3.10.12 Parsing pyinit.py Parsing 5.1-34-g60be16288-master Parsing pyprofiles.py Parsing PACE_2D_1.0.py Parsing pycontrol.py Check for function preprocess() python preprocess function does not exist Calling python _smilei_check Calling python _prepare_checkpoint_dir Calling python _keep_python_running() : [1;36mCAREFUL: Patches distribution: hilbertian [0m Smilei will run on GPU devices [;33m[WARNING] src/Params/Params.cpp:1171 (compute) simulation_time has been redefined from 18.221237 to 18.219817 to match timestep.[0m Geometry: 2Dcartesian ------------------------------------------------------------------------------- Interpolation order : 2 Maxwell solver : Yee simulation duration = 18.219817, total number of iterations = 2856 timestep = 0.006379 = 0.950000 x CFL, time resolution = 156.752402 Grid length: 1.21559, 1.21559 Cell length: 0.0094968, 0.0094968, 0 Number of cells: 128, 128 Spatial resolution: 105.299, 105.299 Cell sorting: activated Electromagnetic boundary conditions ------------------------------------------------------------------------------- xmin silver-muller, absorbing vector [1, 0] xmax silver-muller, absorbing vector [-1, -0] ymin silver-muller, absorbing vector [0, 1] ymax silver-muller, absorbing vector [-0, -1] Vectorization: ------------------------------------------------------------------------------- Mode: adaptive Default mode: off Time selection: never Calling python writeInfo Initializing MPI ------------------------------------------------------------------------------- MPI_THREAD_MULTIPLE enabled Number of MPI processes: 1 OpenMP disabled OpenMP task parallelization not activated Number of patches: 1 x 1 Number of cells in one patch: 128 x 128 Dynamic load balancing: never Initializing the restart environment ------------------------------------------------------------------------------- Initializing species ------------------------------------------------------------------------------- Creating Species #0: electron Pusher: boris Boundary conditions: thermalize thermalize thermalize thermalize [1;36mCAREFUL: For species 'electron' Using thermal_boundary_temperature[0] in all directions [0m Density profile: 2D user-defined function (uses numpy) Creating Species #1: deuteron Pusher: boris Boundary conditions: thermalize thermalize thermalize thermalize [1;36mCAREFUL: For species 'deuteron' Using thermal_boundary_temperature[0] in all directions [0m Density profile: 2D user-defined function (uses numpy) Initializing External fields ------------------------------------------------------------------------------- External field Bz: 2D built-in profile `constant` (value: 0.275099) Binary processes #0 within species (0 1) 1. Collisions with Coulomb logarithm: auto Initializing Patches ------------------------------------------------------------------------------- First patch created All patches created Creating Diagnostics, antennas, and external fields ------------------------------------------------------------------------------- Diagnostic Fields #0 : Ex Ey Ez Rho_electron Rho_deuteron Created performances diagnostic Finalize MPI environment ------------------------------------------------------------------------------- Done creating diagnostics, antennas, and external fields Minimum memory consumption (does not include all temporary buffers) ------------------------------------------------------------------------------- Particles: Master 6400 MB; Max 6400 MB; Global 6.25 GB Fields: Master 2 MB; Max 2 MB; Global 0.00198 GB scalars.txt: Master 0 MB; Max 0 MB; Global 0 GB Fields0.h5: Master 0 MB; Max 0 MB; Global 0 GB Performances.h5: Master 0 MB; Max 0 MB; Global 0 GB Initial fields setup ------------------------------------------------------------------------------- Applying external fields at time t = 0 Applying prescribed fields at time t = 0 Applying antennas at time t = 0 GPU allocation and copy of the fields and particles ------------------------------------------------------------------------------- Open files & initialize diagnostics ------------------------------------------------------------------------------- Running diags at time t = 0 ------------------------------------------------------------------------------- Species creation summary ------------------------------------------------------------------------------- Species 0 (electron) created with 67108864 particles Species 1 (deuteron) created with 67108864 particles Expected disk usage (approximate) ------------------------------------------------------------------------------- WARNING: disk usage by non-uniform particles maybe strongly underestimated, especially when particles are created at runtime (ionization, pair generation, etc.) Expected disk usage for diagnostics: File Fields0.h5: 1.79 G File Performances.h5: 5.91 M File scalars.txt: 390.67 K Total disk usage for diagnostics: 1.80 G Keeping or closing the python runtime environment ------------------------------------------------------------------------------- Checking for cleanup() function: python cleanup function does not exist Closing Python Time-Loop started: number of time-steps n_time = 2856 ------------------------------------------------------------------------------- [1;36mCAREFUL: The following `push time` assumes a global number of 1 cores (hyperthreading is unknown) [0m timestep sim time cpu time [s] ( diff [s] ) push time [ns] _ _ ___ _ | | _ \ \ Version : 5.1-34-g60be16288-master / __| _ __ (_) | | ___ (_) | | \__ \ | ' \ _ | | / -_) _ | | |___/ |_|_|_| |_| |_| \___| |_| | | /_/ Reading the simulation parameters ------------------------------------------------------------------------------- HDF5 version 1.10.7 Python version 3.10.12 Parsing pyinit.py Parsing 5.1-34-g60be16288-master Parsing pyprofiles.py Parsing PACE_2D_1.0_cpu.py Parsing pycontrol.py Check for function preprocess() python preprocess function does not exist Calling python _smilei_check Calling python _prepare_checkpoint_dir Calling python _keep_python_running() : [;33m [WARNING](0) src/Params/Params.cpp:696 (Params) Resources allocated 48 underloaded regarding the total number of patches 4[0m [1;36mCAREFUL: Patches distribution: hilbertian [0m Smilei will run on CPU devices [;33m [WARNING](0) src/Params/Params.cpp:1170 (compute) simulation_time has been redefined from 18.221237 to 18.219817 to match timestep.[0m [;33m [WARNING](0) src/Params/Params.cpp:1262 (compute) Particles cluster width `cluster_width` set to : 32[0m [;33m [WARNING](0) src/Params/Params.cpp:1276 (compute) Particles cluster width set to: 64 for the adaptive vectorization mode[0m Geometry: 2Dcartesian ------------------------------------------------------------------------------- Interpolation order : 2 Maxwell solver : Yee simulation duration = 18.219817, total number of iterations = 2856 timestep = 0.006379 = 0.950000 x CFL, time resolution = 156.752402 Grid length: 1.21559, 1.21559 Cell length: 0.0094968, 0.0094968, 0 Number of cells: 128, 128 Spatial resolution: 105.299, 105.299 Cell sorting: activated Electromagnetic boundary conditions ------------------------------------------------------------------------------- xmin silver-muller, absorbing vector [1, 0] xmax silver-muller, absorbing vector [-1, -0] ymin silver-muller, absorbing vector [0, 1] ymax silver-muller, absorbing vector [-0, -1] Vectorization: ------------------------------------------------------------------------------- Mode: adaptive Default mode: off Time selection: never Calling python writeInfo Initializing MPI ------------------------------------------------------------------------------- MPI_THREAD_MULTIPLE enabled Number of MPI processes: 4 Number of threads per MPI process : 12 OpenMP task parallelization not activated Number of patches: 2 x 2 Number of cells in one patch: 64 x 64 Dynamic load balancing: never Initializing the restart environment ------------------------------------------------------------------------------- Initializing species ------------------------------------------------------------------------------- Creating Species #0: electron Pusher: boris Boundary conditions: thermalize thermalize thermalize thermalize [1;36mCAREFUL: For species 'electron' Using thermal_boundary_temperature[0] in all directions [0m Density profile: 2D user-defined function (uses numpy) Creating Species #1: deuteron Pusher: boris Boundary conditions: thermalize thermalize thermalize thermalize [1;36mCAREFUL: For species 'deuteron' Using thermal_boundary_temperature[0] in all directions [0m Density profile: 2D user-defined function (uses numpy) Initializing External fields ------------------------------------------------------------------------------- External field Bz: 2D built-in profile `constant` (value: 0.275099) Binary processes #0 within species (0 1) 1. Collisions with Coulomb logarithm: auto Initializing Patches ------------------------------------------------------------------------------- First patch created All patches created Creating Diagnostics, antennas, and external fields ------------------------------------------------------------------------------- Diagnostic Fields #0 : Ex Ey Ez Rho_electron Rho_deuteron Created performances diagnostic Finalize MPI environment ------------------------------------------------------------------------------- Done creating diagnostics, antennas, and external fields Minimum memory consumption (does not include all temporary buffers) ------------------------------------------------------------------------------- Particles: Master 3200 MB; Max 3200 MB; Global 12.5 GB Fields: Master 0 MB; Max 0 MB; Global 0.00213 GB scalars.txt: Master 0 MB; Max 0 MB; Global 0 GB Fields0.h5: Master 0 MB; Max 0 MB; Global 0 GB Performances.h5: Master 0 MB; Max 0 MB; Global 0 GB Initial fields setup ------------------------------------------------------------------------------- Solving Poisson at time t = 0 Initializing E field through Poisson solver ------------------------------------------------------------------------------- Poisson solver converged at iteration: 0, relative err is ctrl = 0.000000 x 1e-14 Poisson equation solved. Maximum err = 0.000000 at i= -1 Time in Poisson : 0.000249 Applying external fields at time t = 0 Applying prescribed fields at time t = 0 Applying antennas at time t = 0 Open files & initialize diagnostics ------------------------------------------------------------------------------- Running diags at time t = 0 ------------------------------------------------------------------------------- Species creation summary ------------------------------------------------------------------------------- Species 0 (electron) created with 67108864 particles Species 1 (deuteron) created with 67108864 particles Expected disk usage (approximate) ------------------------------------------------------------------------------- WARNING: disk usage by non-uniform particles maybe strongly underestimated, especially when particles are created at runtime (ionization, pair generation, etc.) Expected disk usage for diagnostics: File Fields0.h5: 1.79 G File Performances.h5: 7.28 M File scalars.txt: 390.67 K Total disk usage for diagnostics: 1.80 G Keeping or closing the python runtime environment ------------------------------------------------------------------------------- Checking for cleanup() function: python cleanup function does not exist Closing Python Time-Loop started: number of time-steps n_time = 2856 ------------------------------------------------------------------------------- [1;36mCAREFUL: The following `push time` assumes a global number of 48 cores (hyperthreading is unknown) [0m timestep sim time cpu time [s] ( diff [s] ) push time [ns] 1/2856 9.5692e-03 1.7488e+01 ( 1.7488e+01 ) 6254 2/2856 1.5949e-02 3.4516e+01 ( 1.7028e+01 ) 6089 3/2856 2.2328e-02 5.2292e+01 ( 1.7776e+01 ) 6357 import math import scipy.constants import numpy as np # Constants c = scipy.constants.speed_of_light q = scipy.constants.electron_volt m = scipy.constants.electron_mass eps0 = scipy.constants.epsilon_0 # EPOCH input values B_EPOCH = 0.057 n_EPOCH = 7.4971e16 T_EPOCH = 5 l_EPOCH = 0.01 # MW t_end_EPOCH = 5e-10# 3e-9 # N_cells = 128 # MW CR_EPOCH = l_EPOCH/(N_cells) * 1/c I0_EPOCH = 1e3 #W/m² omega_r = 5.8e9*2*math.pi B_r = m*omega_r/q n_r = eps0*m*omega_r**2/q**2 L_r = c/omega_r t_r = 1/omega_r # SMILEI parameters T_SMILEI = T_EPOCH/511e3 n_SMILEI = n_EPOCH/n_r B_SMILEI = B_EPOCH/B_r l_SMILEI = l_EPOCH / L_r x0_SMILEI = l_SMILEI/2 t_end_SMILEI = t_end_EPOCH/t_r l_cav_SMILEI = l_EPOCH/L_r a0_SMILEI = 0.86*c/(omega_r/(2*math.pi)) * 10**6 * math.sqrt(I0_EPOCH/1e18) dx_sim = l_EPOCH/N_cells dt_CR = 0.95*dx_sim/c/np.sqrt(2) dt_SIM = 2*np.pi/(20.3*omega_r) # originally 10.3 field_step = 1 # MW int(dt_SIM/dt_CR) # save fields every field_step def super_gaussian(x, y): return n_SMILEI * np.exp(-((np.sqrt((x-x0_SMILEI)**2 + (y-x0_SMILEI)**2))/(l_cav_SMILEI/3))**6) Main( geometry = "2Dcartesian", interpolation_order = 2, number_of_cells = [N_cells, N_cells], grid_length = [l_SMILEI, l_SMILEI], #number_of_patches = [ 16, 16 ], # MW number_of_patches = [ 1, 1 ], # MW gpu_computing = True, #MW timestep = CR_EPOCH * omega_r * 0.95/np.sqrt(2), simulation_time = t_end_SMILEI, EM_boundary_conditions = [ ['silver-muller'], ['silver-muller' ]], reference_angular_frequency_SI = omega_r, print_every = int(1) # ) Species( name = "electron", position_initialization = "regular", momentum_initialization = "maxwell-juettner", charge = -1.0, mass = 1.0, particles_per_cell = 4096, number_density = super_gaussian, temperature=[T_SMILEI], boundary_conditions = [["thermalize", "thermalize"], ["thermalize", "thermalize"]], thermal_boundary_temperature = [T_SMILEI], ) Species( name = "deuteron", position_initialization = "regular", momentum_initialization = "maxwell-juettner", charge = 1.0, mass = 1.0*1836.2, particles_per_cell = 4096, number_density = super_gaussian, temperature=[T_SMILEI], boundary_conditions = [["thermalize", "thermalize"], ["thermalize", "thermalize"]], thermal_boundary_temperature = [T_SMILEI], ) Collisions( species1 = ["electron", "deuteron"], species2 = ["electron", "deuteron"], ) ExternalField( field = "Bz", profile = constant(B_SMILEI) ) DiagPerformances( every = field_step, #flush_every = field_step, ) DiagFields( every = field_step, fields = ['Ex','Ey','Ez','Rho_electron', 'Rho_deuteron'] ) DiagScalar( every = field_step, vars = ["Utot", "Ukin", "Uelm", "Uelm_Ex", "Ukin_bnd", "Uelm_bnd"], precision = 10 ) """ Checkpoints( #restart_dir = '../PACE_2D_1.0/', dump_minutes = 1410, exit_after_dump = True, keep_n_dumps = 2, ) """ export BUILD_DIR=build_nvidia export NVDIR="/home/matthias/Projekte/Smilei/NVDIR" export PATH=$NVDIR/Linux_x86_64/23.11/compilers/bin:$PATH export PATH=$NVDIR/Linux_x86_64/23.11/comm_libs/mpi/bin:$PATH export HDF5_ROOT_DIR=$NVDIR/hdfsrc/install/ export LD_LIBRARY_PATH=$HDF5_ROOT_DIR/lib export LDFLAGS="-acc=gpu -gpu=ccnative -cudalib=curand " export CXXFLAGS="-acc=gpu -gpu=ccnative,fastmath -std=c++14 -lcurand -Minfo=accel -w -D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 -I$NVDIR/Linux_x86_64/23.11/math_libs/include/" export GPU_COMPILER_FLAGS="-O3 --std c++14 -arch=sm_86 --expt-relaxed-constexpr --compiler-bindir gcc -I$NVDIR/Linux_x86_64/23.11/comm_libs/12.3/openmpi4/openmpi-4.1.5/include/ -I$NVDIR/hdfsrc/install/include/" export SMILEICXX_DEPS=g++ export SLURM_LOCALID=0

0 replies

charlesprouveur · 2024-10-23T09:59:08Z

charlesprouveur
Oct 23, 2024
Maintainer

Hi Johan,

Sorry to hear about the compilation issues on your Niflheim

As for your execution, i see a couple issues. You are using currently not supported features:
momentum_initialization = "maxwell-juettner",
...
boundary_conditions = [["thermalize", "thermalize"], ["thermalize", "thermalize"]],

I have not tested maxwell-juettner so far so this could be an issue (or not, i will have a look if necessary EDIT: as long as you don't use injection or moving window it should not be a problem)
The BC thermalize is currently not supported on GPU (simply a question of time, i can add it if you only need that)

To check your install you can run this small test case that runs on my laptop GPU:
input.txt (just in case: you should rename this input.txt into input.py, this is because github does not like uploading python script)

Best,

Charles

PS: can you be more specific on this: "Finally, nvidia_envis the one from the guide used, but we had to replace one $NVDIR with gcc to recover from a bug."

0 replies

JohanKoelsenDeWit · 2024-10-23T11:41:02Z

JohanKoelsenDeWit
Oct 23, 2024
Author

Hi Charles, Thank you kindly for your reply. Apologies for the vagueness in the encountered bug with the compilation - I'm away from the local Linux computer, so it is a colleague that is helping me compiling SMILEI and running the comparison tests. I have asked him for a clarification on the bug. I only need the thermalized BC. If you could add that, it would be of great help! Do you have any recommendations for other momentum-initializations apart from maxwell-juettner that are GPU supported? Best wishes, Johan

…

________________________________ From: charlesprouveur ***@***.***> Sent: Wednesday, October 23, 2024 11:59 AM To: SmileiPIC/Smilei ***@***.***> Cc: Johan Kølsen de Wit ***@***.***>; Author ***@***.***> Subject: Re: [SmileiPIC/Smilei] Compilation with GPU accelerated nodes on the HPC Niflheim (Discussion #750) Hi Johan, Sorry to hear about the compilation issues on your Niflheim As for your execution, i see a couple issues. You are using currently not supported features: momentum_initialization = "maxwell-juettner", ... boundary_conditions = [["thermalize", "thermalize"], ["thermalize", "thermalize"]], I have not tested maxwell-juettner so far so this could be an issue (or not, i will have a look if necessary) The BC thermalize is currently not supported on GPU (simply a question of time, i can add it if you only need that) To check your install you can run this small test case that runs on my laptop GPU: input.txt<https://github.com/user-attachments/files/17488844/input.txt> Best, Charles PS: can you be more specific on this: "Finally, nvidia_envis the one from the guide used, but we had to replace one $NVDIR with gcc to recover from a bug." — Reply to this email directly, view it on GitHub<#750 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BB3MI264DNJKGNBA2BU7PVTZ45XQDAVCNFSM6AAAAABPWIGLJWVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTCMBSG42DGMA>. You are receiving this because you authored the thread.Message ID: ***@***.***>

2 replies

charlesprouveur Oct 23, 2024
Maintainer

After talking with colleagues, you should be able to use maxwell-juettner, otherwise "cold" is another possible value.
I'll try to add this quickly. I'll update you once it's done and pushed to the github repo.

Charles

JohanKoelsenDeWit Oct 24, 2024
Author

Hi Charles,

Thank you very much.
The error encountered with the setup of nvidia_env.sh is the following:

The way NVDIR is setup and from the documentation of nvcc:

Specify the directory in which the default host compiler executable resides.
The host compiler executable name can be also specified to ensure that the correct host compiler is selected. In addition, driver prefix options (--input-drive-prefix, --dependency-drive-prefix, or --drive-prefix) may need to be specified, if nvcc is executed in a Cygwin shell or a MinGW shell on Windows

However, in NVDIR there is no host compiler. So the compiler return the error:

Compiling src/Particles/nvidiaParticles.cu
Compiling src/Projector/Projector1D2OrderGPUKernelCUDAHIP.cu

Smilei/NVDIR/gcc: No such file or directory
nvcc fatal : Failed to preprocess host compiler properties.
make: *** [makefile:374: build_nvidia/src/Particles/nvidiaParticles.o] Error 1
make: *** Waiting for unfinished jobs....
Smilei/NVDIR/gcc: No such file or directory
nvcc fatal : Failed to preprocess host compiler properties.
make: *** [makefile:374: build_nvidia/src/Projector/Projector1D2OrderGPUKernelCUDAHIP.o] Error 1

It was fixed by changing
GPU_COMPILER_FLAGS=”--compiler_bindir gcc"

Please let me know if we have misunderstood anything. 🙂

My colleague was able to run the test input, and the results from the two runs are attached here.

gpu-test.txt
cpu-test.txt

The GPU is a NVIDIA GeForce RTX 3090
The CPU is Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz

We were however unable to see the GPU acceleration. Perhaps this is tied to the changes in the compiler flags?

charlesprouveur · 2024-10-24T10:44:01Z

charlesprouveur
Oct 24, 2024
Maintainer

Hi Johan,

NVDIR is an environment variable that we set ourselves at the root of the nvhpc folder we setup, ie:
export NVDIR="/local/home/myaccount/tools/nvhpc_24.5
in which there should be two folders: "Linux_x86_64" and "modulefiles"
g++ is only used for the SMILEICXX_DEPS environment variable set in the machine file
I can imagine that gcc is not visible on your machine but that is highly unusual.

In your case, since you are using an RTX 3090, the exact options in our example work (ie -arch=sm_86)

Looking at your output there is nothing unusual, we do see "smilei will run on GPU" so on that side i think you are good.

Now onto the "acceleration" aspect, there are two things here:

the 1D test case is tiny, you need millions if not tens of million particle for a GPU to be "worth it". Typically i use a 3D test case with > 30e6 particles to get an idea of our performance on different hardware. It does not mean using GPU is bad per se for smaller test case, just that it is inefficient
The RTX 3090 is a consummer card, meaning it is not made for double precision computing. As you can see in the following link
https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622

Theoretical Performance:

FP32 (float)
    35.58 TFLOPS 
FP64 (double)
    556.0 GFLOPS (1:64)

Your theroretical performance is slashed by a factor 70 between single and double precision.
At this point your CPU peak performance in double precision is comparable to your GPU.

I will add another point: at high performance writing outputs (the .h5 files) will be a huge bottleneck which is why their frequency should be reduced to the minimum. In your cases you can see that diagnostics took respectively 82% and 87% of the computing time. At this point any performance comparison between the two chips is pointless, you are only seeing the performance of your SSD.

If you want to look at pure performance you can run a much bigger test case with an output at the end for instance.

Best regards,

Charles

2 replies

JohanKoelsenDeWit Oct 25, 2024
Author

Dear Charles,
Thank you very much for your response - I apologize for the inconvenience of my novice questions regarding the GPU computation times, your points really clarifies the results.

Moving on, I will follow your advice and try to set over a larger simulation with only outputting at the end as well.

Best wishes, Johan

charlesprouveur Oct 25, 2024
Maintainer

Hi Johan,

No worries, that is perfectly normal. I am happy my explanations made it clearer for you.
On my side I am working on porting the thermal boundary conditions as there are other people interested internally :)
It's a bit tricky but it should not take long.

Best regards,
Charles

charlesprouveur · 2024-11-19T18:04:18Z

charlesprouveur
Nov 19, 2024
Maintainer

Hi Johan,
Thermal BC have been ported on GPU. A new version of smilei has been pushed on the github repo.
Best regards,
Charles

0 replies

JohanKoelsenDeWit · 2024-11-27T08:48:12Z

JohanKoelsenDeWit
Nov 27, 2024
Author

Dear Charles, Thank you very much for making this update. I am looking forward to using it! Best wishes, Johan

…

________________________________ From: charlesprouveur ***@***.***> Sent: Tuesday, November 19, 2024 7:04 PM To: SmileiPIC/Smilei ***@***.***> Cc: Johan Kølsen de Wit ***@***.***>; Author ***@***.***> Subject: Re: [SmileiPIC/Smilei] Compilation with GPU accelerated nodes on the HPC Niflheim (Discussion #750) Hi Johan, Thermal BC have been ported on GPU. A new version of smilei has been pushed on the github repo. Best regards, Charles — Reply to this email directly, view it on GitHub<#750 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BB3MI256S7HNCO5GY6EZBLT2BN4TPAVCNFSM6AAAAABPWIGLJWVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTCMZQHE2TOMA>. You are receiving this because you authored the thread.Message ID: ***@***.***>

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compilation with GPU accelerated nodes on the HPC Niflheim #750

{{title}}

Replies: 7 comments 7 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Compilation with GPU accelerated nodes on the HPC Niflheim #750

JohanKoelsenDeWit Oct 10, 2024

Replies: 7 comments · 7 replies

JohanKoelsenDeWit Oct 10, 2024 Author

charlesprouveur Oct 10, 2024 Maintainer

JohanKoelsenDeWit Oct 11, 2024 Author

charlesprouveur Oct 11, 2024 Maintainer

JohanKoelsenDeWit Oct 23, 2024 Author

charlesprouveur Oct 23, 2024 Maintainer

JohanKoelsenDeWit Oct 23, 2024 Author

charlesprouveur Oct 23, 2024 Maintainer

JohanKoelsenDeWit Oct 24, 2024 Author

charlesprouveur Oct 24, 2024 Maintainer

JohanKoelsenDeWit Oct 25, 2024 Author

charlesprouveur Oct 25, 2024 Maintainer

charlesprouveur Nov 19, 2024 Maintainer

JohanKoelsenDeWit Nov 27, 2024 Author

JohanKoelsenDeWit
Oct 10, 2024

Replies: 7 comments 7 replies

JohanKoelsenDeWit
Oct 10, 2024
Author

charlesprouveur Oct 10, 2024
Maintainer

JohanKoelsenDeWit Oct 11, 2024
Author

charlesprouveur Oct 11, 2024
Maintainer

JohanKoelsenDeWit
Oct 23, 2024
Author

charlesprouveur
Oct 23, 2024
Maintainer

JohanKoelsenDeWit
Oct 23, 2024
Author

charlesprouveur Oct 23, 2024
Maintainer

JohanKoelsenDeWit Oct 24, 2024
Author

charlesprouveur
Oct 24, 2024
Maintainer

JohanKoelsenDeWit Oct 25, 2024
Author

charlesprouveur Oct 25, 2024
Maintainer

charlesprouveur
Nov 19, 2024
Maintainer

JohanKoelsenDeWit
Nov 27, 2024
Author