AMDGPU: add parallel restore of BO content to accelerate restore #2527
base: criu-dev
Currently, in the target process, device-related restore operations and other restore operations run almost sequentially. While the target process executes the corresponding CRIU hook functions, it cannot perform other restore operations. However, for GPU applications, some device restore operations have no logical dependencies on other common restore operations and can be offloaded to the main CRIU process, allowing the target process to perform other restore operations in parallel. RESTORE_ASYNCHRONOUS: Hook to enable the main CRIU process to perform some of the plugins' restore operations. This hook triggers immediately after the fork operation, allowing the relevant plugin restore to start promptly. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Similar to the previous commit, in the target process, some device-related restore operations have no logical dependencies on others and can be offloaded to the main CRIU process. This patch introduces a new option, `parallel`, which allows some device-related restore operations to be offloaded to the main CRIU process. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not initialized. However, during the plugin restore procedure, there may be some common file operations used in multiple hooks. This patch moves `cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init` to use `fdstore` to place these file operations. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
When `RESTORE_ASYNCHRONOUS` is enabled, the target process and the main CRIU process need an IPC interface to communicate and transfer file descriptors. This patch adds a Unix domain datagram socket and stores this socket in `fdstore`. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
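The file descriptor transfer this commit message refers to is commonly done with `SCM_RIGHTS` ancillary data over a Unix domain socket. The following is a minimal sketch of that general technique, not the code from this patch: `send_fd`, `recv_fd`, and `fd_passing_roundtrip` are hypothetical names, and a `socketpair()` stands in for the socket that the patch stores in `fdstore`.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send `fd` with a one-byte payload as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
	char payload = 'F';
	struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align; /* forces correct alignment of buf */
	} ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor sent by send_fd(); returns the new fd or -1. */
static int recv_fd(int sock)
{
	char payload;
	struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Pass the read end of a pipe through a datagram socket pair and
 * read one byte through the received copy. Returns 0 on success. */
int fd_passing_roundtrip(void)
{
	int sv[2], pfd[2], got;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0 || pipe(pfd) < 0)
		return -1;
	if (write(pfd[1], "x", 1) != 1 || send_fd(sv[0], pfd[0]) < 0)
		return -1;
	got = recv_fd(sv[1]);
	if (got < 0 || read(got, &c, 1) != 1)
		return -1;
	close(got);
	close(pfd[0]);
	close(pfd[1]);
	close(sv[0]);
	close(sv[1]);
	return c == 'x' ? 0 : -1;
}
```

Calling `fd_passing_roundtrip()` exercises both helpers in a single process; in the restore flow the sender and receiver would be the target process and the main CRIU process respectively.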
Currently, the restore of buffer objects consumes a significant amount of time. However, this part has no logical dependencies on other restore operations. This patch introduces some structures and helper functions that let the target process offload this task to the main CRIU process. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
This patch implements the entire logic to enable the offloading of buffer object content restoration. It has two parts: the first replaces the restoration of buffer objects in the target process by sending a parallel restore command to the main CRIU process; the second implements the `RESTORE_ASYNCHRONOUS` hook in the amdgpu plugin to enable buffer object content restoration in the main CRIU process. Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
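The two-part flow described in these commits (the target process sends a restore command, the main CRIU process receives it and does the work) can be illustrated with a small fork-based sketch. Everything here is an assumption made for illustration: `demo_cmd`, its fields, and `offload_demo` are hypothetical, a `socketpair()` replaces the socket kept in `fdstore`, and the "restore" is reduced to copying a byte count.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical command; the real patch defines its own structures. */
struct demo_cmd {
	uint32_t entry_num;   /* number of buffer objects to restore */
	uint64_t total_bytes; /* total size of their contents */
};

/* Parent plays the main CRIU process, child plays the target process.
 * Returns 0 on success and stores the "restored" size in the output. */
int offload_demo(uint64_t *restored_bytes)
{
	struct demo_cmd cmd;
	int sv[2], status;
	pid_t pid;

	assert(restored_bytes != NULL);
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Target process: hand the work over, then move on. */
		struct demo_cmd out = { .entry_num = 3, .total_bytes = 4096 };

		close(sv[0]);
		if (send(sv[1], &out, sizeof(out), 0) != (ssize_t)sizeof(out))
			_exit(1);
		/* Other restore steps would run here, unblocked. */
		_exit(0);
	}

	/* Main process: receive the command and do the offloaded work.
	 * A real implementation needs a failure path or timeout here,
	 * since a datagram recv() does not see EOF if the peer dies. */
	close(sv[1]);
	if (recv(sv[0], &cmd, sizeof(cmd), 0) != (ssize_t)sizeof(cmd)) {
		close(sv[0]);
		return -1;
	}
	*restored_bytes = cmd.total_bytes; /* stand-in for the BO copy */
	close(sv[0]);

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
}
```

The point of the sketch is the ordering: the child returns to its own work immediately after `send()`, while the parent performs the expensive step concurrently.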
@@ -422,6 +422,8 @@ int main(int argc, char *argv[], char *envp[])
" can be 'nftables' or 'iptables' (default).\n"
" --unprivileged accept limitations when running as non-root\n"
" consult documentation for further details\n"
" --parallel enable parallel restore of AMDGPU buffer object content\n"
" with other restore operations\n"
@wweewrwer Is there any reason why parallel restore of buffer objects should not be enabled by default?
To the best of my knowledge, I can't think of any scenarios where parallel restore wouldn't be applicable. I think the default method is simpler to implement, while the parallel method is more efficient. The state of processes restored through parallel restore does not differ from that of processes restored using the default method. If most of the CRIU team prefers to enable the optimization by default, we can make the change :)
I tested this functionality with Podman, but it causes the Podman container restore to hang:
(00.000986) Forking task with 1 pid (flags 0x6c028000)
(00.000987) Creating process using clone3()
(00.001105) PID: real 389586 virt 1
(00.001120) Run restore asynchronous hook from criu master for external devices
(00.001121) plugin: `amdgpu_plugin' hook 12 -> 0x7fcb95a16f83
(00.001122) amdgpu_plugin: Begin to recv parallel restore cmd
(00.001132) Parallel restore: begin to recv cmd_head
(00.001140) 1: Found id extRootNetNS (fd 14) in inherit fd list
(00.001219) 1: timens: monotonic -602 193779164
(00.001223) 1: timens: boottime -602 193774845
Is it intended to work with containers?
It can be enabled by default.
 * Most of the times, it'll be -ENOTSUP and in few cases, it
 * might actually be a true error code but that won't be
 * -ENOTSUP.
 */
pls don't add comments that don't add any new information.
@@ -2051,6 +2051,17 @@ static int restore_root_task(struct pstree_item *init)
	if (ret < 0)
		goto out;

	pr_info("Run restore asynchronous hook from criu master for external devices\n");
	ret = run_plugins(RESTORE_ASYNCHRONOUS);
I don't think this is a good name for the hook. It could be more descriptive about the exact moment when the hook is called.
	if (ret)
		return ret;

	int *vis = (int *)malloc(restore_cmd.cmd_head.entry_num * sizeof(int));
pls use xmalloc and check its return code.
		return ret;

	int *vis = (int *)malloc(restore_cmd.cmd_head.entry_num * sizeof(int));
	memset(vis, 0, restore_cmd.cmd_head.entry_num * sizeof(int));
xzalloc returns a zeroed memory region.
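The two review comments above ask for a checked allocation that already returns zeroed memory, which removes the separate `memset`. As a rough illustration only, the sketch below uses standard `calloc` as a stand-in for CRIU's `xzalloc`; the helper name `alloc_visited` and its error codes are assumptions, not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Allocate a zero-initialized "visited" array of entry_num ints.
 * calloc() is used here as a stand-in for a zeroing allocator. */
int alloc_visited(size_t entry_num, int **out)
{
	int *vis;

	assert(out != NULL);
	if (entry_num == 0)
		return -EINVAL; /* nothing to track */

	vis = calloc(entry_num, sizeof(*vis));
	if (!vis)
		return -ENOMEM; /* allocation failure is reported, not ignored */

	*out = vis;
	return 0;
}
```

The caller owns the returned array and releases it with `free()` once the restore loop has finished.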
	if (restore_cmd.entries[i].gpu_id == restore_cmd.entries[j].gpu_id) {
		vis[j] = 1;
		fseek(bo_contents_fp, restore_cmd.entries[j].read_offset, SEEK_SET);
		ret = sdma_copy_bo_helper(restore_cmd.entries[j].size, restore_cmd.fds_write[restore_cmd.entries[j].write_id], bo_contents_fp, buffer, buffer_size, h_dev, max_copy_size, SDMA_OP_VRAM_WRITE);
This line exceeds the recommended 80-character limit. While it's not a strict rule, keeping lines shorter generally improves readability.
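The visible fragment suggests that entries are processed in groups sharing a `gpu_id`, with a `vis` array marking entries already handled. Below is a hedged sketch of that grouping pattern with shorter lines. The field names mirror those in the diff, but their types, the `copy_fn` callback (standing in for the SDMA copy helper), and the logging helper are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Field names follow the diff; the types are assumed. */
struct demo_entry {
	int gpu_id;
	int write_id;
	long read_offset;
	long size;
};

/* Hypothetical stand-in for the per-entry copy step. */
typedef int (*copy_fn)(const struct demo_entry *e, void *arg);

/* Records the order in which entries are processed. */
struct order_log {
	int write_ids[8];
	size_t count;
};

int log_copy(const struct demo_entry *e, void *arg)
{
	struct order_log *log = arg;

	if (log->count >= 8)
		return -1;
	log->write_ids[log->count++] = e->write_id;
	return 0;
}

/* Process entries grouped by gpu_id: the first unvisited entry opens
 * a group, and all later entries on the same GPU are handled with it. */
int restore_grouped(const struct demo_entry *entries, size_t n,
		    copy_fn copy, void *arg)
{
	size_t i, j;
	int *vis, ret = 0;

	assert(copy != NULL);
	if (n == 0)
		return 0;
	vis = calloc(n, sizeof(*vis));
	if (!vis)
		return -1;

	for (i = 0; i < n && !ret; i++) {
		if (vis[i])
			continue;
		for (j = i; j < n; j++) {
			const struct demo_entry *e = &entries[j];

			if (vis[j] || e->gpu_id != entries[i].gpu_id)
				continue;
			vis[j] = 1;
			ret = copy(e, arg);
			if (ret)
				break;
		}
	}

	free(vis);
	return ret;
}
```

Binding the current entry to a local pointer (`e`) is one way to keep the copy call within the line limit the reviewer mentions.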
TL;DR:
This pull request extends CRIU to support parallel restore of AMDGPU buffer object content alongside other restore operations to accelerate the restoration.
The target issue:
In the current restore procedure of AMDGPU applications, the content of the AMDGPU buffer object (BO) is restored synchronously in `CR_PLUGIN_HOOK__RESTORE_EXT_FILE`. This procedure usually takes a significant amount of time, and during this time the target process cannot perform any other restore operations. However, this restoration has no logical dependencies on other restore operations. Parallelizing this part with other restore operations can speed up the restoration.
The parallel restore approach in this PR:
The core idea of this patch series is to offload the restore of the BO content from the target process to the main CRIU process (the main CRIU process refers to the parent process, and the target process refers to the child process created during the fork). To achieve this, we introduce a new hook, `CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS`, in the main CRIU process. For the AMDGPU plugin, the target process no longer restores BO contents in `CR_PLUGIN_HOOK__RESTORE_EXT_FILE` and just sends the relevant BOs to the main CRIU process. The main CRIU process receives the corresponding BOs in `CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS` and begins the restoration. Meanwhile, the target process can continue with other parts of the restoration without being blocked by the BO content restoration. The full design is described in the ACM SoCC'24 paper: On-demand and Parallel Checkpoint/Restore for GPU Applications.
Tests:
We evaluated the performance with the following settings. The results show that parallel restore speeds up restoration by 34.3% when the images are cached in the page cache, and by 7.6% when restoring from disk.
Results:
Settings:
CPU: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
Memory: DDR4, 2x8GB
GPU: AMD MI50
Disk: 512GB, Samsung SSD 860
Docker image: rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_1.12.1
Example program: `example.py`, a ResNet18 application. Enter 'y' to exit, or press any other key to perform inference.
Steps:
Install CRIU
Follow the standard CRIU installation process. Ensure you set the environment variable `CRIU_LIBS_DIR` to the `plugins/amdgpu` path.
Dump checkpoint image
Restore from disk
Test for sequential restore:
Test for parallel restore:
Restore from page cache
Install vmtouch for caching images:
Test: