AMDGPU: add parallel restore of BO content to accelerate restore #2527

wweewrwer · 2024-11-22T10:30:08Z

TL;DR:

This pull request extends CRIU to support parallel restore of AMDGPU buffer object content alongside other restore operations to accelerate the restoration.

The target issue:

In the current restore procedure for AMDGPU applications, the content of each AMDGPU buffer object (BO) is restored synchronously in CR_PLUGIN_HOOK__RESTORE_EXT_FILE. This step usually takes a significant amount of time, during which the target process cannot perform any other restore operations. However, it has no logical dependencies on the other restore operations, so running it in parallel with them can speed up the restore.

The parallel restore approach in this PR:

The core idea of this patch series is to offload the restore of the BO content from the target process to the main CRIU process (the main CRIU process is the parent process, and the target process is the child process created by the fork). To achieve this, we introduce a new hook, CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS, which runs in the main CRIU process. For the AMDGPU plugin, the target process no longer restores BO contents in CR_PLUGIN_HOOK__RESTORE_EXT_FILE; it only sends the relevant BOs to the main CRIU process. The main CRIU process receives the corresponding BOs in CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS and begins restoring them. Meanwhile, the target process continues with the other parts of the restore without being blocked by the BO content restoration. The full design is also described in the ACM SoCC'24 paper "On-demand and Parallel Checkpoint/Restore for GPU Applications".
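The division of labour described above can be illustrated with a toy model. This is a minimal sketch, not CRIU code: the function name `restore_with_offload`, the message layout, and the summing "restore" step are all assumptions made for illustration. A forked child stands in for the target process and a parent stands in for the main CRIU process.

```python
import os
import socket
import struct


def restore_with_offload(bo_sizes):
    """Toy model of the offload; every name here is hypothetical."""
    # A connected pair of Unix domain datagram sockets shared across fork().
    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    pid = os.fork()
    if pid == 0:
        # Child: plays the target process.
        status = 1
        try:
            parent_sock.close()
            # Describe the BOs (count, then sizes) and hand them off.
            msg = struct.pack(f"!I{len(bo_sizes)}Q", len(bo_sizes), *bo_sizes)
            child_sock.send(msg)
            # ... the rest of the restore would continue here, unblocked ...
            status = 0
        finally:
            os._exit(status)  # never fall back into the caller's code

    # Parent: plays the main CRIU process.
    child_sock.close()
    data = parent_sock.recv(4096)
    (count,) = struct.unpack_from("!I", data)
    sizes = struct.unpack_from(f"!{count}Q", data, 4)
    restored_bytes = sum(sizes)  # stand-in for copying BO content to VRAM
    _, wait_status = os.waitpid(pid, 0)
    parent_sock.close()
    return restored_bytes, os.WEXITSTATUS(wait_status)
```

The point of the shape is that the child returns to its own work right after the `send`, while the slow part runs in the parent.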

Tests:

We measured restore time with the settings listed below. Parallel restore reduces restore time by 34.3% when the images are cached in the page cache, and by 7.6% when restoring from disk.

Results:

| | From disk | From page cache |
| --- | --- | --- |
| Sequential restore | 1728 ms | 254 ms |
| Parallel restore | 1596 ms | 167 ms |
| Speed up | 7.6% | 34.3% |
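The "Speed up" figures are relative reductions in restore time, which can be checked directly from the measurements. A minimal sketch; the helper name `time_reduction_pct` is only illustrative:

```python
def time_reduction_pct(sequential_ms, parallel_ms):
    """Relative reduction in restore time, as a percentage."""
    return (sequential_ms - parallel_ms) / sequential_ms * 100


# Measurements from the results table above.
from_disk = time_reduction_pct(1728, 1596)
from_page_cache = time_reduction_pct(254, 167)

print(f"from disk:       {from_disk:.1f}%")        # 7.6%
print(f"from page cache: {from_page_cache:.1f}%")  # 34.3%
```

Expressed as a ratio of run times instead, the page-cache case is 254 / 167, or roughly 1.52x.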

Settings:

CPU: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz

Memory: DDR4, 2x8GB

GPU: AMD MI50

Disk: 512GB, Samsung SSD 860

Docker image: rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_1.12.1

Example program:

example.py: a ResNet18 inference application. Enter 'y' to exit, or enter anything else to run one inference.

import sys
import time

import torch
import torchvision.models as models
import torchvision.transforms as transforms

# Inference only, so autograd is disabled globally.
torch.set_grad_enabled(False)

# ROCm builds of PyTorch expose AMD GPUs through the "cuda" device type.
device = "cuda:0"

model = models.resnet18(weights='DEFAULT')
model = model.to(device)
model.eval()

# A random image-shaped tensor is enough to exercise the model.
batch_size = 1
channels = 3
height = 224
width = 224
input_tensor = torch.randn(batch_size, channels, height, width)
preprocess = transforms.Compose([
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
input_tensor = preprocess(input_tensor)

# Each line read from stdin triggers one timed inference; 'y' exits.
while input() != "y":
    st = time.time()
    input_tensor = input_tensor.to(device)
    output = model(input_tensor)
    output = output.to("cpu")
    _, predicted_idx = torch.max(output, 1)
    torch.cuda.synchronize()
    ed = time.time()
    print("test time:", ed - st)
    sys.stdout.flush()

Steps:

Install CRIU

Follow the standard CRIU installation process. Ensure you set the environment variable CRIU_LIBS_DIR to the plugins/amdgpu path.

Dump checkpoint image

#In one shell
python3 example.py
#In another shell
mkdir -p /tmp/criu-dump
criu dump -t $(pgrep python3) -D /tmp/criu-dump -j --file-locks

Restore from disk

Test for sequential restore:

#Clear page cache
sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
criu restore -D /tmp/criu-dump -j --file-locks
cat stats-restore | crit decode --pretty | grep restore_time

Test for parallel restore:

sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
criu restore -D /tmp/criu-dump -j --file-locks --parallel
cat stats-restore | crit decode --pretty | grep restore_time

Restore from page cache

Install vmtouch for caching images:

sudo apt install vmtouch

Test:

sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" 
#Cache image in memory
vmtouch -l /tmp/criu-dump
#Warm up environment 
criu restore -D /tmp/criu-dump -j --file-locks
#Begin to Test
criu restore -D /tmp/criu-dump -j --file-locks
cat stats-restore | crit decode --pretty | grep restore_time
criu restore -D /tmp/criu-dump -j --file-locks --parallel
cat stats-restore | crit decode --pretty | grep restore_time
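When collecting several runs, the `grep restore_time` step could be replaced by a small script that reads the JSON printed by `crit decode`. The following is a sketch under assumptions: the helper name is hypothetical, and the sample document below is made up for illustration (its nesting and value are not taken from a real run). The helper searches recursively, so it does not rely on any particular layout.

```python
import json


def find_restore_time(node):
    """Return the first "restore_time" value found in decoded JSON, else None."""
    if isinstance(node, dict):
        if "restore_time" in node:
            return node["restore_time"]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_restore_time(child)
        if found is not None:
            return found
    return None


# Hypothetical sample; the nesting and the value are illustrative only.
sample = '{"entries": [{"restore": {"restore_time": 123456}}]}'
print(find_restore_time(json.loads(sample)))
```

In practice the script would call `json.load(sys.stdin)` and be fed from the same `cat stats-restore | crit decode` pipeline used in the steps above.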

Currently, in the target process, device-related restore operations and other restore operations run almost sequentially. When the target process executes the corresponding CRIU hook functions, it cannot perform other restore operations. However, for GPU applications, some device restore operations have no logical dependencies on other common restore operations and can be offloaded to the main CRIU process, allowing the target process to perform other restore operations in parallel.

- RESTORE_ASYNCHRONOUS

*RESTORE_ASYNCHRONOUS: Hook to enable the main CRIU process to perform some restore operations of plugins. This hook triggers immediately after the fork operation, allowing the relevant plugin restore to start promptly.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

Similar to the previous commit, in the target process, some device-related restore operations have no logical dependencies on others and can be offloaded to the main CRIU process. This patch introduces a new option, `parallel`, which allows some device-related restore operations to be offloaded to the main CRIU process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not yet initialized. However, during the plugin restore procedure, some common file operations may be used in multiple hooks. This patch moves `cr_plugin_init` after `fdstore_init`, so that `cr_plugin_init` can use `fdstore` to hold the files these operations need.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

When `RESTORE_ASYNCHRONOUS` is enabled, the target process and the main CRIU process need an IPC interface to communicate and to transfer file descriptors. This patch adds a Unix domain datagram socket and stores it in `fdstore`.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
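Transferring file descriptors over a Unix domain socket relies on `SCM_RIGHTS` ancillary data. The sketch below shows the mechanism only; it is not the plugin's code, and the helper names `send_fd`, `recv_fd`, and `demo` are hypothetical. For brevity both ends live in one process, where the received descriptor is simply a new descriptor referring to the same open file.

```python
import array
import os
import socket


def send_fd(sock, fd, payload=b"bo"):
    """Send one file descriptor as SCM_RIGHTS ancillary data."""
    fds = array.array("i", [fd])
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock, bufsize=64):
    """Receive a payload plus one file descriptor."""
    fds = array.array("i")
    payload, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_LEN(fds.itemsize)
    )
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int.
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return payload, fds[0]


def demo():
    """Pass the read end of a pipe across a datagram socket pair."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    read_end, write_end = os.pipe()
    os.write(write_end, b"hello")
    send_fd(left, read_end)
    payload, received = recv_fd(right)
    data = os.read(received, 5)  # read through the received descriptor
    for fd in (read_end, write_end, received):
        os.close(fd)
    left.close()
    right.close()
    return payload, data
```

Calling `demo()` returns the payload together with the bytes read through the received descriptor, which shows that the descriptor itself, not just data, crossed the socket.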

Currently, restoring buffer object content consumes a significant amount of time. However, this step has no logical dependencies on other restore operations. This patch introduces some structures and helper functions that let the target process offload this task to the main CRIU process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

This patch implements the entire logic to enable the offloading of buffer object content restoration. It has two parts: the first replaces the restoration of buffer objects in the target process by sending a parallel restore command to the main CRIU process; the second implements the `RESTORE_ASYNCHRONOUS` hook in the amdgpu plugin to enable buffer object content restoration in the main CRIU process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

rst0git · 2024-11-22T12:43:25Z

criu/crtools.c

@@ -422,6 +422,8 @@ int main(int argc, char *argv[], char *envp[])
 	       "                        can be 'nftables' or 'iptables' (default).\n"
 	       "  --unprivileged        accept limitations when running as non-root\n"
 	       "                        consult documentation for further details\n"
+	       "  --parallel            enable parallel restore of AMDGPU buffer object content\n"
+	       "                        with other restore operations\n"


@wweewrwer Is there any reason why parallel restore of buffer objects should not be enabled by default?

To the best of my knowledge, there are no scenarios where parallel restore would not be applicable. The default method is simpler to implement, while the parallel method is more efficient. The state of processes restored through parallel restore does not differ from that of processes restored using the default method. If most of the CRIU team prefers to enable the optimization by default, we can make the change :)

I tested this functionality with Podman, but it causes podman container restore to hang:

(00.000986) Forking task with 1 pid (flags 0x6c028000)
(00.000987) Creating process using clone3()
(00.001105) PID: real 389586 virt 1
(00.001120) Run restore asynchronous hook from criu master for external devices
(00.001121) plugin: `amdgpu_plugin' hook 12 -> 0x7fcb95a16f83
(00.001122) amdgpu_plugin: Begin to recv parallel restore cmd
(00.001132) Parallel restore: begin to recv cmd_head
(00.001140) 1: Found id extRootNetNS (fd 14) in inherit fd list
(00.001219) 1: timens: monotonic -602 193779164
(00.001223) 1: timens: boottime -602 193774845

Is it intended to work with containers?

It can be enabled by default.

avagin · 2024-11-22T23:56:04Z

criu/cr-restore.c

+	 * Most of the times, it'll be -ENOTSUP and in few cases, it
+	 * might actually be a true error code but that won't be 
+	 * -ENOTSUP.
+	 */


pls don't add comments that don't add any new information.

avagin · 2024-11-22T23:59:30Z

criu/cr-restore.c

@@ -2051,6 +2051,17 @@ static int restore_root_task(struct pstree_item *init)
 	if (ret < 0)
 		goto out;

+	pr_info("Run restore asynchronous hook from criu master for external devices\n");
+	ret = run_plugins(RESTORE_ASYNCHRONOUS);


I think it isn't a good name for this hook. It can be a bit more descriptive about the exact moment when it is called.

avagin · 2024-11-23T00:06:37Z

plugins/amdgpu/amdgpu_plugin.c

+	if (ret)
+		return ret;
+
+	int *vis = (int *)malloc(restore_cmd.cmd_head.entry_num * sizeof(int));


pls use xmalloc and check its return code.

avagin · 2024-11-23T00:07:54Z

plugins/amdgpu/amdgpu_plugin.c

+		return ret;
+
+	int *vis = (int *)malloc(restore_cmd.cmd_head.entry_num * sizeof(int));
+	memset(vis, 0, restore_cmd.cmd_head.entry_num * sizeof(int));


xzalloc returns a zeroed memory region.

avagin · 2024-11-23T00:10:44Z

plugins/amdgpu/amdgpu_plugin.c

+			if (restore_cmd.entries[i].gpu_id == restore_cmd.entries[j].gpu_id) {
+				vis[j] = 1;
+				fseek(bo_contents_fp, restore_cmd.entries[j].read_offset, SEEK_SET);
+				ret = sdma_copy_bo_helper(restore_cmd.entries[j].size, restore_cmd.fds_write[restore_cmd.entries[j].write_id], bo_contents_fp, buffer, buffer_size, h_dev, max_copy_size, SDMA_OP_VRAM_WRITE);


This line exceeds the recommended 80-character limit. While it's not a strict rule, keeping lines shorter generally improves readability.

wweewrwer added 6 commits November 22, 2024 15:48

rst0git requested review from rst0git and dayatsin-amd November 22, 2024 12:07
