Description
Hi,
I want to checkpoint a ResNet training job running on an A100 using CRIU and the CUDA plugin, but AMD errors are always reported. Do I need to comment out AMD-related content during compilation?
CRIU version: 4.0
CUDA driver: 560
No container is used, and cuda-checkpoint works normally.
THANKS!
CRIU logs and information:
(00.600145) plugin: `amdgpu_plugin' hook 2 -> 0x730131770335
(00.600161) Error (amdgpu_plugin.c:1203): amdgpu_plugin: fstat error for /dev/kfd: No such file or directory
(00.600170) ----------------------------------------
(00.600175) Error (criu/cr-dump.c:1681): Dump files (pid: 3515192) failed with -1
(00.600179) Waiting for 3515192 to trap
(00.600188) Daemon 3515192 exited trapping
(00.600192) Sent msg to daemon 3 0 0
pie: 3515192: __fetched msg: 3 0 0
pie: 3515192: 3515192: new_sp=0x7a4028df37c8 ip 0x7a42faa90fd7
(00.620620) 3515192 was trapped
(00.620643) 3515192 was trapped
(00.620645) 3515192 (native) is going to execute the syscall 15, required is 15
(00.620658) 3515192 was stopped
(00.620818) net: Unlock network
(00.620824) amdgpu_plugin: finished amdgpu_plugin (AMDGPU/KFD)
(00.620894) cuda_plugin: finished cuda_plugin stage 0 err -1
(00.784858) cuda_plugin: resuming devices on pid 3515192
(00.784864) cuda_plugin: Restore thread pid 3515332 found for real pid 3515192
(00.784948) Unfreezing tasks into 1
(00.784950) Unseizing 3515192 into 1
(00.785380) Error (criu/cr-dump.c:2111): Dumping FAILED.
> Do I need to comment out AMD-related content during compilation?
By default, the AMD GPU plugin for CRIU is installed in /usr/lib/criu/amdgpu_plugin.so. You can safely remove this file to avoid loading the plugin during checkpoint/restore.
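If you would rather not delete the file outright, moving it out of the plugin directory should have the same effect and is easy to undo. Here is a minimal sketch, assuming the default path mentioned above; `disable_criu_plugin` and the backup directory are hypothetical names chosen for illustration, not part of CRIU:

```shell
#!/bin/sh
# Sketch: move a CRIU plugin out of the plugin directory instead of deleting it.
# disable_criu_plugin is a hypothetical helper, not something CRIU ships.
disable_criu_plugin() {
    plugin_dir="$1"    # directory CRIU loads plugins from, e.g. /usr/lib/criu
    plugin_name="$2"   # plugin file to disable, e.g. amdgpu_plugin.so
    backup_dir="$3"    # any directory outside plugin_dir (assumed writable)

    # Nothing to do if the plugin is not installed at this location.
    if [ ! -f "$plugin_dir/$plugin_name" ]; then
        echo "not found: $plugin_dir/$plugin_name"
        return 1
    fi

    # Keep a copy so the plugin can be moved back later.
    mkdir -p "$backup_dir" || return 1
    mv "$plugin_dir/$plugin_name" "$backup_dir/$plugin_name" || return 1
    echo "moved $plugin_name to $backup_dir"
}
```

Run as root, a call such as `disable_criu_plugin /usr/lib/criu amdgpu_plugin.so /var/tmp/criu-plugins-disabled` would move the AMD plugin aside; moving the file back restores the original setup. If your distribution installs plugins elsewhere, adjust the first argument accordingly.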