Do I need to disable the AMD plugin when using the cuda plugin? #2531

Closed
GYDmedwin opened this issue Nov 25, 2024 · 1 comment

GYDmedwin commented Nov 25, 2024

Description
Hi,
I want to checkpoint a ResNet training job running on an A100 using CRIU and the CUDA plugin, but the dump always fails with errors from the AMD GPU plugin. Do I need to comment out the AMD-related code when compiling CRIU?

CRIU version: 4.0
CUDA driver: 560
No container is used, and cuda-checkpoint works normally on its own.

THANKS!

CRIU logs and information:

(00.600145) plugin: `amdgpu_plugin' hook 2 -> 0x730131770335
(00.600161) Error (amdgpu_plugin.c:1203): amdgpu_plugin: fstat error for /dev/kfd: No such file or directory
(00.600170) ----------------------------------------
(00.600175) Error (criu/cr-dump.c:1681): Dump files (pid: 3515192) failed with -1
(00.600179) Waiting for 3515192 to trap
(00.600188) Daemon 3515192 exited trapping
(00.600192) Sent msg to daemon 3 0 0
pie: 3515192: __fetched msg: 3 0 0
pie: 3515192: 3515192: new_sp=0x7a4028df37c8 ip 0x7a42faa90fd7
(00.620620) 3515192 was trapped
(00.620643) 3515192 was trapped
(00.620645) 3515192 (native) is going to execute the syscall 15, required is 15
(00.620658) 3515192 was stopped
(00.620818) net: Unlock network
(00.620824) amdgpu_plugin: finished  amdgpu_plugin (AMDGPU/KFD)
(00.620894) cuda_plugin: finished cuda_plugin stage 0 err -1
(00.784858) cuda_plugin: resuming devices on pid 3515192
(00.784864) cuda_plugin: Restore thread pid 3515332 found for real pid 3515192
(00.784948) Unfreezing tasks into 1
(00.784950)     Unseizing 3515192 into 1
(00.785380) Error (criu/cr-dump.c:2111): Dumping FAILED.

rst0git (Member) commented Nov 25, 2024

> Do I need to comment out AMD-related content during compilation?

By default, the AMD GPU plugin for CRIU is installed in /usr/lib/criu/amdgpu_plugin.so. You can safely remove this file to avoid loading the plugin during checkpoint/restore.
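
As a minimal sketch of the suggestion above, the plugin can be moved out of the plugin directory instead of being deleted, so it is easy to put back later. The helper name `disable_criu_plugin` and the backup directory are illustrative and not part of CRIU. The path /usr/lib/criu is the default mentioned above and may differ depending on how CRIU was installed.

```shell
# Hypothetical helper (not part of CRIU): move a plugin out of the
# plugin directory so that it is no longer loaded on dump/restore.
disable_criu_plugin() {
    plugin_dir="$1"    # e.g. /usr/lib/criu (default location; may differ per install)
    plugin_name="$2"   # e.g. amdgpu_plugin.so
    backup_dir="$3"    # assumption: any directory outside plugin_dir

    # Nothing to do if the plugin is not installed at this path.
    if [ ! -f "$plugin_dir/$plugin_name" ]; then
        echo "not found: $plugin_dir/$plugin_name" >&2
        return 1
    fi

    mkdir -p "$backup_dir" || return 1
    mv "$plugin_dir/$plugin_name" "$backup_dir/$plugin_name"
}

# Example (the system-wide path normally requires root):
#   disable_criu_plugin /usr/lib/criu amdgpu_plugin.so /root/criu-plugins-disabled
```

Moving the file back into the plugin directory should re-enable the plugin. On a machine without AMD GPUs, deleting the file outright, as suggested above, has the same effect.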
