NVHPC Support #693

ptheywood · 2021-09-22T13:06:01Z

Closes #977.

ptheywood · 2021-09-22T16:32:48Z

Not actually blocked by std::experimental::filesystem - it just uses gcc's stdlib, so we can pass the appropriate linker args.

Mostly seems to work, other than release RTC test suite failures. Debug is fine which makes tracing the fault more interesting. Release RTC examples also work, so it's test suite specific in some way?

Vis works, but the vis repo needs CMake changes to address warnings (the same as the main repo + some extras).

Segfault notes

The segfault occurs when newRTCFunction is called, but duplicating that content to an example instead runs OK.
A new test file with just one of the offending tests in it runs OK.
test_cuda_simulation.cu in tests_dev is NOT ok...
Building it all, and filtering only the single test is NOT ok...
Commenting out most of test_cuda_simulation is OK
TestCUDASimulation.SetGetPopulationData being compiled in appears to cause the issue, regardless of whether or not it is filtered out?
- GetAgent, Step, AgentDeath and AgentID_MultipleStatesUniqueIDs all also cause issues if they are compiled, even if they are not executed?
- commenting out a.newFunction("DeathFunc", DeathTestFunc) in AgentDeath is enough to remove the segfault...
The segfault appears to occur inside newRTCFunction if newFunction and newRTCFunction are used in the same compilation unit with nvhpc in release mode, regardless of which is first in the file? Possibly a compiler issue?

#include "flamegpu/flamegpu.h"
#include "gtest/gtest.h"

namespace flamegpu {
namespace tests {
namespace test_nvhpc {

FLAMEGPU_AGENT_FUNCTION(cudacxx_test_func, flamegpu::MessageNone, flamegpu::MessageNone) {
    return flamegpu::ALIVE;
}

const char* rtc_test_func = R"###(
FLAMEGPU_AGENT_FUNCTION(rtc_test_func, flamegpu::MessageNone, flamegpu::MessageNone) {
    return flamegpu::ALIVE;
}
)###";

TEST(testNVHPC, RTCElapsedTime) {
    ModelDescription m("m");
    AgentDescription &agent = m.newAgent("agent");

    // Using newRTCFunction and newFunction in the same compilation unit appears to cause the segfault within newRTCFunction.
    // Comment out either call to remove the segfault.
    agent.newFunction("cudacxx_test_func", cudacxx_test_func);
    AgentFunctionDescription &func = agent.newRTCFunction("rtc_test_func", rtc_test_func);
}

}  // namespace test_nvhpc
}  // namespace tests
}  // namespace flamegpu

After chucking a bunch of printf __LINE__ into AgentFunctionDescription::newRTCFunction, the offending line appears to be agent->functions.emplace(function_name, rtn); However, function_name, rtn and agent->functions all appear to be valid...
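For reference, a minimal sketch of the two insertion styles on a map of shared pointers. The types and function names here are hypothetical stand-ins (assuming the container behaves like a std::unordered_map keyed by function name), not the real FLAME GPU classes. On a conforming toolchain both calls should have the same observable result, which is why a crash in only one of them hints at a compiler or stdlib interaction rather than a logic error.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the per-function data object; illustrative only.
struct FunctionData {
    std::string name;
};
using FunctionMap = std::unordered_map<std::string, std::shared_ptr<FunctionData>>;

// emplace forwards its arguments so the key/value pair is constructed in place.
inline bool add_with_emplace(FunctionMap &functions, const std::string &name) {
    auto rtn = std::make_shared<FunctionData>(FunctionData{name});
    assert(rtn != nullptr);  // the pointer is valid before insertion
    return functions.emplace(name, rtn).second;
}

// insert takes an already constructed value_type (a pair of key and value).
inline bool add_with_insert(FunctionMap &functions, const std::string &name) {
    auto rtn = std::make_shared<FunctionData>(FunctionData{name});
    assert(rtn != nullptr);
    return functions.insert(FunctionMap::value_type(name, rtn)).second;
}
```

Both helpers return true when a new entry was added and false if the key already existed.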

ptheywood · 2021-10-01T13:47:57Z

If using GCC 8's stdlib rather than GCC 9's, this builds OK in Release mode (nvc++ 21.7-0, ubuntu 21.04).

ptheywood · 2021-10-01T16:27:34Z

Currently working on reproducing this with a simpler use case. Currently leaning towards one or more of the following:

- Compiler error (as this only appears to be affected by nvhpc + gcc 9)
- UB related to the use of enable_shared_from_this, the docs for which list several opportunities for UB.
- Uninitialised member variables, which are potentially present in many of the *Data classes.
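As a note on the enable_shared_from_this hypothesis, a minimal sketch of the best-known pitfall: calling shared_from_this() on an object that no shared_ptr owns. The class below is hypothetical and not taken from the FLAME GPU sources. Before C++17 this case was undefined behaviour; since C++17 it is specified to throw std::bad_weak_ptr.

```cpp
#include <cassert>
#include <memory>

// Hypothetical class used only to illustrate the ownership requirement.
struct Node : std::enable_shared_from_this<Node> {
    std::shared_ptr<Node> self() { return shared_from_this(); }
};

// Safe: the object is already owned by a shared_ptr when self() is called.
inline bool owned_case_shares_ownership() {
    auto n = std::make_shared<Node>();
    std::shared_ptr<Node> alias = n->self();
    return alias == n && n.use_count() == 2;
}

// Unsafe: the object lives on the stack and no shared_ptr owns it.
// Since C++17 this throws std::bad_weak_ptr instead of being UB.
inline bool unowned_case_throws() {
    Node n;
    try {
        n.self();
    } catch (const std::bad_weak_ptr &) {
        return true;
    }
    return false;
}
```

If any of the description classes hand out shared_from_this() from a constructor or from a stack-allocated instance, this is the pattern to look for.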

Will continue to work on the MWE a little, but if it doesn't reproduce soon it'll just get dumped into a gist for future reference. Running gcc and nvhpc builds through valgrind (with an appropriate cuda suppressions list) would be good and generally worthwhile on the whole.

I tried enabling -Wuninitialized on gcc build but this appeared to produce no additional warnings on my office box when building the flamegpu target.
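Compiler warnings may not flag members that a constructor simply never sets, so one option is to make the defaults explicit. A minimal sketch using default member initialisers follows; the struct and its members are hypothetical and do not mirror any real *Data class.

```cpp
#include <cassert>

// Hypothetical *Data-style struct; member names are illustrative only.
struct ExampleData {
    unsigned int count = 0;      // default member initialiser
    bool has_condition = false;  // never left indeterminate
    const void *func = nullptr;  // explicit null rather than garbage
    ExampleData() = default;
};

// Every default-constructed instance starts from the same known state.
inline bool defaults_are_deterministic() {
    ExampleData d;
    return d.count == 0 && !d.has_condition && d.func == nullptr;
}
```

With initialisers on the members themselves, every constructor starts from the same state, and a constructor only needs to set the members whose value differs from the default.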

ptheywood · 2024-01-09T11:42:45Z

NVHPC repackages the location of curand compared to standalone nvcc. Prior to nvhpc 22.3 this is not correctly reflected by the include path during compilation via cmake when using nvhpc installed nvcc, but gcc as the host compiler.

We may be able to resolve this by requiring curand as a dependency in cmake, otherwise we might need to explicitly add an edge case to cmake to ensure this include path is set.

ptheywood · 2024-01-09T13:14:53Z

After some horrible cmake additions to explicitly add the non-symlink math_libs include directory to the include path(s) if required, curand is now found when using nvhpc installed nvcc, and nvhpc as the host compiler.

However, this then exposes an issue with include path ordering and the finding / use of cub and thrust.

The cub/thrust version mismatch check is identifying that they do not agree. Locally, using CUDA 11.8 with nvhpc 22.11, which ships with cub/thrust 1.15, this conflicts with the explicitly added cub/thrust 1.17 we fetch.

This will be due to include directory precedence. It might not be the case for all cmake/nvhpc combos, so I will force CI to investigate for me (once the outage ends?)

Some commits are WIP, as cmake 3.18 needs to use a different method for symlink resolution compared to 3.19, which is not tested.
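To illustrate what a mismatch of this kind looks like, here is a minimal sketch that decodes and compares two version integers. It assumes the encoding used by the THRUST_VERSION and CUB_VERSION macros (major * 100000 + minor * 100 + subminor) and uses plain integers so it compiles without CUDA headers. The struct and function names are hypothetical, and this is not the check the project actually runs.

```cpp
#include <cassert>

// Decoded form of an integer version such as 101500 (1.15.0).
struct Version {
    int major_version;
    int minor_version;
    int patch;
};

// Assumed encoding: major * 100000 + minor * 100 + subminor.
constexpr Version decode(int v) {
    return Version{v / 100000, (v / 100) % 1000, v % 100};
}

// Treat two libraries as compatible only if major and minor both agree.
constexpr bool same_major_minor(int thrust_version, int cub_version) {
    return decode(thrust_version).major_version == decode(cub_version).major_version &&
           decode(thrust_version).minor_version == decode(cub_version).minor_version;
}
```

With these assumptions, 101500 (1.15.0) and 101700 (1.17.0) would be reported as a mismatch, which is the situation that arises if one header comes from the nvhpc install and the other from the fetched copy.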

ptheywood · 2024-01-11T18:13:52Z

Some NVHPC builds via containers which fail to configure CMake are erroring due to:

2024-01-11T18:00:25.4234501Z     #error -- unsupported pgc++ configuration! Only pgc++ 18, 19, 20 and 21 are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

This is with a version of nvcc distributed with the version of nvhpc which is apparently incompatible.

ptheywood added blocked and removed blocked labels Sep 22, 2021

ptheywood force-pushed the nvhpc branch from 38489a8 to 93e414d Compare October 7, 2021 10:09

ptheywood force-pushed the nvhpc branch from 93e414d to 4117492 Compare April 4, 2022 14:57

ptheywood mentioned this pull request Nov 16, 2022

Support nvc++ as the host compiler #977

Open

ptheywood force-pushed the nvhpc branch from 0ed940a to 74debf7 Compare January 8, 2024 13:07

ptheywood force-pushed the nvhpc branch from 311a5c5 to 3ad813d Compare January 11, 2024 14:12

ptheywood added this to the 2.0.0-rc2 milestone Jan 12, 2024

ptheywood added 16 commits January 15, 2024 11:33

Patch RapidJson if using nvhpc

9bead12

Adjust warning levels / suppressions when NVHPC is the host C++ compiler

865ac99

Enable linking against stdc++fs when using nvhpc

9f55bee

It uses GCC's stdlib, so requires the same linker arguments to access std::experimental::filesystem

Possible bugfix: Use insert not emplace fixes nvhpc + gcc9

182cdb5

Only build Swig from source on linux if GNU compiler

16ffca3

Swig 4.0.2 does not appear to build from source with NVHPC/Clang by default

fixup stdfilesystem nhvpc

6ba94d0

Rough first pass at nvhpc ci

7acb020

warning nvhpc cmake fix

0dc74c1

temp disable python stuff in nvhpc ci

b412e7e

nvhpc ci expansionn

7c8ad99

nvhpc swig instrall tweaks

c80a411

fixup

9090a75

Install pcre2-dev for swig build from source in nvhpc container

8695236

bigger nvhpc matrix

d99b184

nvhpc ci fixup - don't use 2004 for now

3e49547

use deadsnakes to select python

0131c02

ptheywood added 28 commits January 15, 2024 11:33

More deadsnakes venv fix attempts

a249956

another deadsnakes venv fix attempt

325f00f

nvhpc fix

925dab8

ci debugging

2b55585

Try another thing

26adf7c

Remove ci debugging code

a821711

Adjust nvhpc ci matrix. more nvhpc, single python

6d2e823

widen nvhpc ci matrix to detect when curand started working

a117bfa

try nvhpc 22.1 and 22.2 to find when curand starts working

acb31bb

try to ensure cmake uses nvc++

850e09e

fixup

834a25a

WIP: Add curand's repackaged location when using nvhpc, but this then exposes an issue with thrust.

fa5ba59

Adjust cmake_minimim_required upper limit to set all new policies to NEW

cc92e9c

Includes CMP0152 which in CMake >= 3.28 changes symlink resolution behaviour, relevant to nvhpc workarounds.

nvhpc suppress unused parameter warnings in JSONAdjacencyGraphSizeReader

6cc9bcd

NVHPC fix implicit conversion int sign change warnings

5c5d373

cudaMemset takes an int not a uint64, so 0xfffffff was triggering an implicit cast sign change.

try only building all remaingin targets with 1 process, potential ci mem issue with 23.11

8097d47

Only set -Wno-unused-but-set-parameter where nvhpc supports it (maybe >= 22.9

3fe66d3

fixup

d3061e8

add nvcc from nvhpc but use gcc to ci. This might not work at all yet

604928c

remove superfluous cmake message

02a0a46

attemtpt to add diagnostic error numbers to nvc++ warnings so they can be added to suppressions

334afe2

Don't enable -Wsigncompare with older nvhpc

cb0b9d3

Supress parameter declared but never referenced when using older nvhpc

d6f17f8

nvhpc CI: if configure fails, cat the logs.

8c29297

cmake warnings fixup

05150cf

tweaks

a27d9ec

Attempt to fix some nvhpc ci via allow-unsupported-compilers

24226fd

nvcc believes it is incompatible with the versions of nvhpc it was distributed with...

try much newer CMake, to see if that resolves anything

e0b0f69

ptheywood force-pushed the nvhpc branch from a740004 to e0b0f69 Compare January 15, 2024 11:33

ptheywood mentioned this pull request Apr 30, 2024

Enhance GPGPU functionality using FLAME GPU Chaste/Chaste#265

Open
