AMDGPU: add parallel restore of BO content to accelerate restore #2527

Status: Open. 6 commits to merge into base: criu-dev

wweewrwer

TL;DR:

This pull request extends CRIU to support parallel restore of AMDGPU buffer object content alongside other restore operations to accelerate the restoration.

The target issue:

In the current restore procedure for AMDGPU applications, the content of each AMDGPU buffer object (BO) is restored synchronously in CR_PLUGIN_HOOK__RESTORE_EXT_FILE. This step usually takes a significant amount of time, and while it runs the target process cannot perform any other restore operations. However, it has no logical dependencies on the other restore operations, so running it in parallel with them can speed up the restore.

The parallel restore approach in this PR:

The core idea of this patch series is to offload the restore of BO content from the target process to the main CRIU process (the main CRIU process is the parent process, and the target process is the child process created by the fork). To achieve this, we introduce a new hook, CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS, which runs in the main CRIU process. For the AMDGPU plugin, the target process no longer restores BO contents in CR_PLUGIN_HOOK__RESTORE_EXT_FILE; it only sends the relevant BOs to the main CRIU process. The main CRIU process receives these BOs in CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS and begins restoring them. Meanwhile, the target process continues with the other parts of the restore without being blocked by the BO content restoration. The full design is described in the ACM SoCC'24 paper: On-demand and Parallel Checkpoint/Restore for GPU Applications.
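
The hand-off described above can be illustrated with a small, self-contained Python sketch. It is not the plugin's code: `restore_bo_content`, `run_restore`, the JSON message format and the `cost_s` field are all hypothetical stand-ins, and the "BO copy" is simulated with a sleep. The point is only the shape of the flow: the forked child sends a work description and moves on, while the parent does the slow part.

```python
import json
import os
import socket
import time


def restore_bo_content(entry):
    """Stand-in for the slow BO content copy (hypothetical helper)."""
    time.sleep(entry["cost_s"])
    return entry["id"]


def run_restore(entries, timeout_s=5.0):
    """Fork a 'target' child that hands BO work to the 'main' parent."""
    main_end, target_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    pid = os.fork()
    if pid == 0:
        # Target process: send the command, then carry on with other work.
        exit_code = 1
        try:
            main_end.close()
            target_end.send(json.dumps(entries).encode())
            time.sleep(0.05)  # stand-in for the remaining restore steps
            exit_code = 0
        finally:
            os._exit(exit_code)

    # Main process: receive the command and restore BO content meanwhile.
    target_end.close()
    main_end.settimeout(timeout_s)  # do not block forever if nothing arrives
    try:
        cmd = json.loads(main_end.recv(65536).decode())
        restored = [restore_bo_content(e) for e in cmd]
    finally:
        main_end.close()
        _, status = os.waitpid(pid, 0)
    return restored, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(run_restore([{"id": 0, "cost_s": 0.01}, {"id": 1, "cost_s": 0.01}]))
```

The receive timeout is a choice made for this sketch only: a datagram socket does not signal end-of-file when the peer exits, so a receiver without a timeout can wait indefinitely if the sender never sends.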

Tests:

We evaluated performance with the settings listed below. Parallel restore reduces restore time by 34.3% when the images are cached in the page cache, and by 7.6% when restoring from disk.

Results:

| | From disk | From page cache |
|---|---|---|
| Sequential restore | 1728 ms | 254 ms |
| Parallel restore | 1596 ms | 167 ms |
| Speed-up | 7.6% | 34.3% |
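
The speed-up row is consistent with a relative reduction in restore time, (sequential - parallel) / sequential. The short check below reproduces both percentages from the measured times; the function name `speedup_percent` is ours, not something from the PR.

```python
def speedup_percent(sequential_ms: float, parallel_ms: float) -> float:
    """Relative reduction in restore time, in percent."""
    return (sequential_ms - parallel_ms) / sequential_ms * 100.0


from_disk = speedup_percent(1728, 1596)
from_page_cache = speedup_percent(254, 167)
print(f"from disk: {from_disk:.1f}%")              # from disk: 7.6%
print(f"from page cache: {from_page_cache:.1f}%")  # from page cache: 34.3%
```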

Settings:

CPU: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz

Memory: DDR4, 2x8GB

GPU: AMD MI50

Disk: 512GB, Samsung SSD 860

Docker image: rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_1.12.1

Example program:

example.py: a ResNet18 application. Enter 'y' to exit; any other input line (for example, just pressing Enter) runs one inference.

import time
import os
import sys
import torch
import torchvision.models as models
import torchvision.transforms as transforms
torch.set_grad_enabled(False)

device = "cuda:0"

model = models.resnet18(weights='DEFAULT')
model = model.to(device)
model.eval()

batch_size = 1
channels = 3
height = 224
width = 224
input_tensor = torch.randn(batch_size, channels, height, width)
preprocess = transforms.Compose([
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
input_tensor = preprocess(input_tensor)

while input()!="y":
    st = time.time()
    input_tensor = input_tensor.to(device)
    output = model(input_tensor)
    output = output.to("cpu")
    _, predicted_idx = torch.max(output, 1)
    torch.cuda.synchronize()
    ed = time.time()
    print("test time:",ed-st)
    sys.stdout.flush()

Steps:

  1. Install CRIU

    Follow the standard CRIU installation process. Ensure you set the environment variable CRIU_LIBS_DIR to the plugins/amdgpu path.

  2. Dump checkpoint image

    #In one shell
    python3 example.py
    #In another shell
    mkdir -p /tmp/criu-dump
    criu dump -t $(pgrep python3) -D /tmp/criu-dump -j --file-locks
    
  3. Restore from disk

    Test for sequential restore:

    #Clear page cache
    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
    criu restore -D /tmp/criu-dump -j --file-locks
    cat stats-restore | crit decode --pretty | grep restore_time
    

    Test for parallel restore:

    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
    criu restore -D /tmp/criu-dump -j --file-locks --parallel
    cat stats-restore | crit decode --pretty | grep restore_time
    
  4. Restore from page cache

    Install vmtouch for caching images:

    sudo apt install vmtouch
    

    Test:

    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
    #Cache image in memory
    vmtouch -l /tmp/criu-dump
    #Warm up environment 
    criu restore -D /tmp/criu-dump -j --file-locks
    #Begin to Test
    criu restore -D /tmp/criu-dump -j --file-locks
    cat stats-restore | crit decode --pretty | grep restore_time
    criu restore -D /tmp/criu-dump -j --file-locks --parallel
    cat stats-restore | crit decode --pretty | grep restore_time
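
As an alternative to `grep restore_time`, the decoded statistics can be searched as JSON. The sketch below makes no assumption about where `restore_time` sits in the structure: it searches the whole document for the key. The sample string is hypothetical and much smaller than real `crit decode` output, and its value is arbitrary.

```python
import json


def find_key(node, key):
    """Depth-first search for the first value stored under `key`."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Hypothetical sample; the layout and value are placeholders, not real output.
sample = '{"entries": [{"restore": {"restore_time": 12345}}]}'
print(find_key(json.loads(sample), "restore_time"))
```

To apply it to a real run, read the decoded text from `sys.stdin` instead of `sample` and pipe the `crit decode` output into the script.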
    

Currently, in the target process, device-related restore operations and
other restore operations run almost sequentially. When the target
process executes the corresponding CRIU hook functions, it can't perform
other restore operations. However, for GPU applications, some device
restore operations have no logical dependencies on other common restore
operations and can be offloaded to the main CRIU process, allowing the
target process to perform other restore operations in parallel.

- RESTORE_ASYNCHRONOUS: Hook to enable the main CRIU process to perform
  some restore operations of plugins. This hook triggers immediately after
  the fork operation, allowing the relevant plugin restore to start
  promptly.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

Similar to the previous commit, in the target process, some
device-related restore operations have no logical dependencies on others
and can be offloaded to the main CRIU process. This patch introduces a
new option, `parallel`, which allows some device-related restore
operations to be offloaded to the main CRIU process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not yet
initialized. However, during the plugin restore procedure, some common
file operations may be used in multiple hooks. This patch moves
`cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init` to
place the corresponding file descriptors in `fdstore`.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

When `RESTORE_ASYNCHRONOUS` is enabled, the target process and the main
CRIU process need an IPC interface to communicate and transfer file
descriptors. This patch adds a Unix domain datagram socket and stores
this socket in `fdstore`.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

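
For readers unfamiliar with passing file descriptors over a Unix domain socket, here is a minimal Python sketch of the mechanism (SCM_RIGHTS ancillary data over an `AF_UNIX`/`SOCK_DGRAM` socket pair). It is not the patch's implementation: the helper names and the `b"bo-0"` tag are hypothetical, both ends live in one process for brevity, and `socket.send_fds`/`socket.recv_fds` require Python 3.9 or later on a Unix system.

```python
import os
import socket


def send_fd(sock, fd, tag):
    """Send one file descriptor plus a small payload as ancillary data."""
    socket.send_fds(sock, [tag], [fd])


def recv_fd(sock):
    """Receive a payload and one file descriptor."""
    msg, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return msg, fds[0]


def demo():
    main_end, target_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    read_fd, write_fd = os.pipe()
    try:
        # "Target" side: hand the write end of the pipe to the other side.
        send_fd(target_end, write_fd, b"bo-0")
        # "Main" side: the received descriptor refers to the same pipe.
        tag, received_fd = recv_fd(main_end)
        os.write(received_fd, b"hello")
        data = os.read(read_fd, 5)
        os.close(received_fd)
    finally:
        os.close(read_fd)
        os.close(write_fd)
        main_end.close()
        target_end.close()
    return tag, data


if __name__ == "__main__":
    print(demo())  # (b'bo-0', b'hello')
```

The received descriptor is a new number in the receiver but refers to the same open file, which is why data written through it can be read back from the original pipe.
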
Currently, restoring buffer object content consumes a significant amount
of time. However, this part has no logical dependencies on other restore
operations. This patch introduces some structures and helper functions
that let the target process offload this task to the main CRIU
process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

This patch implements the entire logic to enable the offloading of
buffer object content restoration. It has two parts: the first replaces
the restoration of buffer objects in the target process by sending a
parallel restore command to the main CRIU process; the second implements
the `RESTORE_ASYNCHRONOUS` hook in the amdgpu plugin to enable buffer
object content restoration in the main CRIU process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

@@ -422,6 +422,8 @@ int main(int argc, char *argv[], char *envp[])
" can be 'nftables' or 'iptables' (default).\n"
" --unprivileged accept limitations when running as non-root\n"
" consult documentation for further details\n"
" --parallel enable parallel restore of AMDGPU buffer object content\n"
" with other restore operations\n"

Member

@wweewrwer Is there any reason why parallel restore of buffer objects should not be enabled by default?

Author

To the best of my knowledge, there are no scenarios where parallel restore wouldn't be applicable. I think the default method is simpler to implement, while the parallel method is more efficient. The state of processes restored through parallel restore does not differ from the state of those restored with the default method. If most of the CRIU team prefers to enable the optimization by default, we can make the change :)

Member (rst0git, Nov 22, 2024)

I tested this functionality with Podman, but it causes podman container restore to hang:

(00.000986) Forking task with 1 pid (flags 0x6c028000)
(00.000987) Creating process using clone3()
(00.001105) PID: real 389586 virt 1
(00.001120) Run restore asynchronous hook from criu master for external devices
(00.001121) plugin: `amdgpu_plugin' hook 12 -> 0x7fcb95a16f83
(00.001122) amdgpu_plugin: Begin to recv parallel restore cmd
(00.001132) Parallel restore: begin to recv cmd_head
(00.001140)      1: Found id extRootNetNS (fd 14) in inherit fd list
(00.001219)      1: timens: monotonic -602 193779164
(00.001223)      1: timens: boottime -602 193774845

Is it intended to work with containers?

Member

It can be enabled by default.

* Most of the times, it'll be -ENOTSUP and in few cases, it
* might actually be a true error code but that won't be
* -ENOTSUP.
*/

Member

pls don't add comments that don't add any new information.

@@ -2051,6 +2051,17 @@ static int restore_root_task(struct pstree_item *init)
if (ret < 0)
goto out;

pr_info("Run restore asynchronous hook from criu master for external devices\n");
ret = run_plugins(RESTORE_ASYNCHRONOUS);

Member

I think it isn't a good name for this hook. It can be a bit more descriptive about the exact moment when it is called.

if (ret)
return ret;

int *vis = (int *)malloc(restore_cmd.cmd_head.entry_num * sizeof(int));

Member

pls use xmalloc and check its return code.

return ret;

int *vis = (int *)malloc(restore_cmd.cmd_head.entry_num * sizeof(int));
memset(vis, 0, restore_cmd.cmd_head.entry_num * sizeof(int));

Member

xzalloc returns a zeroed memory region.

if (restore_cmd.entries[i].gpu_id == restore_cmd.entries[j].gpu_id) {
vis[j] = 1;
fseek(bo_contents_fp, restore_cmd.entries[j].read_offset, SEEK_SET);
ret = sdma_copy_bo_helper(restore_cmd.entries[j].size, restore_cmd.fds_write[restore_cmd.entries[j].write_id], bo_contents_fp, buffer, buffer_size, h_dev, max_copy_size, SDMA_OP_VRAM_WRITE);

Member

This line exceeds the recommended 80-character limit. While it's not a strict rule, keeping lines shorter generally improves readability.
