Bot-specific SitePackage.lua that solves libfabric issues #531

Open
bedroge opened this issue Apr 4, 2024 · 4 comments

@bedroge
Collaborator

bedroge commented Apr 4, 2024

With help from @casparvl, I've added the following to /project/def-users/bot/shared/host-injections/2023.06/.lmod/SitePackage.lua on our AWS build cluster, which will be picked up by the bot for builds relying on libfabric:

require("strict")
local hook = require("Hook")

-- LmodMessage("Load bot-specific SitePackage.lua")

local function eessi_bot_libfabric_set_psm3_devices_hook(t)
    local simpleName = string.match(t.modFullName, "(.-)/")
    -- we may want to be more specific in the future, and only do this for specific versions of libfabric
    if simpleName == 'libfabric' then
        -- set environment variable PSM3_DEVICES as a workaround for MPI applications hanging in libfabric's PSM3 provider
        -- cf. https://github.com/easybuilders/easybuild-easyconfigs/issues/18925
        setenv('PSM3_DEVICES', 'self,shm')
    end
end

-- combine all load hook functions into a single one
function site_specific_load_hook(t)
    eessi_bot_libfabric_set_psm3_devices_hook(t)
end

local function combined_load_hook(t)
    -- Assuming this was called from EESSI's SitePackage.lua, this should be defined and thus run
    if eessi_load_hook ~= nil then
        eessi_load_hook(t)
    end
    site_specific_load_hook(t)
end

hook.register("load", combined_load_hook)

This solves the Haswell OpenMPI issues that we observed in several PRs. I was going to make a PR for it, but I have some doubts about how this should be done:

  • does it have to be restricted to Haswell (we also saw some hangs with other architectures, but it's not entirely clear if they were caused by the same issue)?
  • does it have to be restricted to certain versions of libfabric?
  • do we also need this for the tests? Answer from @casparvl: yes, might be needed.
  • which script should make sure that this SitePackage.lua is picked up / copied to the right location? bot/build.sh, EESSI-install-software.sh, eessi_container.sh, ...?
  • what if a PR wants to update SitePackage.lua: should that PR's build already pick up the new version? If so, we should probably prevent it from being copied to the shared directory, otherwise other builds will also pick it up before it's merged.
@boegel
Contributor

boegel commented Apr 5, 2024

  • I wouldn't restrict it to only Haswell on our build cluster, since libfabric is essentially irrelevant there (at runtime).
  • We could restrict it to specific versions of libfabric (since it seems to be a bug there?)
  • We may also need it for the test suite, yes, but then I would deal with that in the test suite repo?
  • I would only put the hook in place during the build phase, so bot/build.sh
  • If the SitePackage.lua is put in place via bot/build.sh, then changes to it should only get picked up by the PR, and should be isolated to that PR?
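
Restricting the workaround to specific libfabric versions could be sketched roughly as follows. This is only an illustration, not what is deployed: the version numbers in `affected_libfabric_versions` are placeholders (which versions are actually affected is not established in this thread), and the helper names `split_name_and_version`, `needs_psm3_workaround` and `eessi_bot_libfabric_versioned_hook` are hypothetical. It assumes module names of the form `name/version-toolchain`.

```lua
-- Hypothetical set of affected libfabric versions (placeholder values only)
local affected_libfabric_versions = {
    ["1.18.0"] = true,
    ["1.19.0"] = true,
}

-- Split e.g. "libfabric/1.18.0-GCCcore-12.3.0" into "libfabric" and "1.18.0"
local function split_name_and_version(modFullName)
    local name, rest = string.match(modFullName, "^(.-)/(.+)$")
    if name == nil then
        -- no "/" in the module name, so there is no version part
        return modFullName, nil
    end
    -- assume the version is everything before the first "-" (toolchain suffix)
    local version = string.match(rest, "^([^-]+)")
    return name, version
end

-- True only for libfabric modules whose version is listed above
local function needs_psm3_workaround(modFullName)
    local name, version = split_name_and_version(modFullName)
    return name == "libfabric"
        and version ~= nil
        and affected_libfabric_versions[version] == true
end

-- Load hook body; setenv is provided by Lmod when this runs from SitePackage.lua
local function eessi_bot_libfabric_versioned_hook(t)
    if needs_psm3_workaround(t.modFullName) then
        setenv('PSM3_DEVICES', 'self,shm')
    end
end
```

A function like `eessi_bot_libfabric_versioned_hook` would take the place of the unconditional libfabric check inside `site_specific_load_hook`.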

@boegel
Contributor

boegel commented Apr 5, 2024

Same approach could be used for other problems that are triggered via libfabric, see easybuilders/easybuild-easyconfigs#20233
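
One possible way to reuse the approach for several problems is a small lookup table that maps a module name to the environment variables to set when it is loaded. The sketch below is an assumption about how that might look, not an existing part of EESSI: `bot_workarounds`, `workarounds_for` and `eessi_bot_workarounds_hook` are hypothetical names, and the only entry is the `PSM3_DEVICES` setting already discussed in this issue.

```lua
-- Hypothetical table: module name -> environment variables to set on load.
-- Further workarounds would be added as extra entries.
local bot_workarounds = {
    libfabric = { PSM3_DEVICES = "self,shm" },
}

-- Return the variables to set for a module (an empty table if there are none).
-- Unlike the hook above, a module name without "/" is matched as-is.
local function workarounds_for(modFullName)
    local simpleName = string.match(modFullName, "(.-)/") or modFullName
    return bot_workarounds[simpleName] or {}
end

-- Load hook body; setenv is provided by Lmod when this runs from SitePackage.lua
local function eessi_bot_workarounds_hook(t)
    for name, value in pairs(workarounds_for(t.modFullName)) do
        setenv(name, value)
    end
end
```

With this layout, `site_specific_load_hook` would only need to call `eessi_bot_workarounds_hook(t)`, and a new workaround becomes a one-line change to the table.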

@ocaisa
Member

ocaisa commented May 10, 2024

@TopRichard also found an issue with our CUDA hook when trying to use it on NESSI: it currently forbids loading dependency modules that have GPU support, even for building purposes. Disabling that hook as part of the bot-specific SitePackage.lua seems like a good idea.

@bedroge
Collaborator Author

bedroge commented Nov 26, 2024

To fix a similar kind of MPI issue on our zen4 cluster (see #815), I added the following file to the bot account:

$ cat /project/def-users/bot/shared/host-injections/2023.06/.lmod/SitePackage.lua
require("strict")
local hook = require("Hook")

-- LmodMessage("Load bot-specific SitePackage.lua")

local function eessi_bot_libfabric_set_psm3_devices_hook(t)
    local simpleName = string.match(t.modFullName, "(.-)/")
    -- we may want to be more specific in the future, and only do this for specific versions of libfabric
    if simpleName == 'libfabric' then
        -- set environment variable FI_PROVIDER to exclude the psm3 provider, as a workaround for MPI applications hanging in libfabric's PSM3 provider
        -- cf. https://github.com/easybuilders/easybuild-easyconfigs/issues/18925
        setenv('FI_PROVIDER', '^psm3')
    end
end

-- combine all load hook functions into a single one
function site_specific_load_hook(t)
    eessi_bot_libfabric_set_psm3_devices_hook(t)
end

local function combined_load_hook(t)
    -- Assuming this was called from EESSI's SitePackage.lua, this should be defined and thus run
    if eessi_load_hook ~= nil then
        eessi_load_hook(t)
    end
    site_specific_load_hook(t)
end

hook.register("load", combined_load_hook)
