The second (final) part of v4.0 #2481

avagin · 2024-09-16T12:22:25Z

No description provided.

To enable cross-compilation, we need to use the CC definition from criu/scripts/nmk/scripts/tools.mk:

    CC := $(CROSS_COMPILE)$(HOSTCC)

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

Skip cross-compilation on armv7 because, among many other errors, it fails with the following:

    In file included from ../../include/common/lock.h:9,
                     from ../../criu/include/files.h:9,
                     from amdgpu_plugin.c:30:
    ../../include/common/asm/atomic.h:60:2: error: #error ARM architecture version (CONFIG_ARMV*) not set or unsupported.
       60 | #error ARM architecture version (CONFIG_ARMV*) not set or unsupported.
          |  ^~~~~
    ../../include/common/asm/atomic.h: In function 'atomic_add_return':
    ../../include/common/asm/atomic.h:81:9: error: implicit declaration of function 'smp_mb' [-Werror=implicit-function-declaration]
       81 |         smp_mb();
          |         ^~~~~~

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

Co-developed-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

Errors on aarch64:

    In file included from amdgpu_plugin_drm.h:10,
                     from amdgpu_plugin.c:33:
    amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
    amdgpu_plugin_util.h:24:20: error: format '%lld' expects argument of type 'long long int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
       24 | #define LOG_PREFIX "amdgpu_plugin: "
          |                    ^~~~~~~~~~~~~~~~~
    ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
       47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
          |                                                    ^~~~~~~~~~
    amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
     1236 |         pr_info("devices:%d bos:%d objects:%d priv_data:%lld\n", args.num_devices, args.num_bos, args.num_objects,
          |         ^~~~~~~
    cc1: all warnings being treated as errors

Errors on ppc64:

    In file included from amdgpu_plugin_drm.h:10,
                     from amdgpu_plugin.c:33:
    amdgpu_plugin.c: In function 'amdgpu_plugin_dump_file':
    amdgpu_plugin_util.h:24:20: error: format '%llu' expects argument of type 'long long unsigned int', but argument 6 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
       24 | #define LOG_PREFIX "amdgpu_plugin: "
          |                    ^~~~~~~~~~~~~~~~~
    ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
       47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
          |                                                    ^~~~~~~~~~
    amdgpu_plugin.c:1236:9: note: in expansion of macro 'pr_info'
     1236 |         pr_info("devices:%u bos:%u objects:%u priv_data:%llu\n",
          |         ^~~~~~~
    cc1: all warnings being treated as errors
    In file included from amdgpu_plugin_util.c:38:
    amdgpu_plugin_util.c: In function 'print_kfd_bo_stat':
    amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
       24 | #define LOG_PREFIX "amdgpu_plugin: "
          |                    ^~~~~~~~~~~~~~~~~
    ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
       47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
          |                                                    ^~~~~~~~~~
    amdgpu_plugin_util.c:196:17: note: in expansion of macro 'pr_info'
      196 |                 pr_info("%s(), %d. KFD BO Addr: %llx \n", __func__, idx, bo->addr);
          |                 ^~~~~~~
    amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
       24 | #define LOG_PREFIX "amdgpu_plugin: "
          |                    ^~~~~~~~~~~~~~~~~
    ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
       47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
          |                                                    ^~~~~~~~~~
    amdgpu_plugin_util.c:197:17: note: in expansion of macro 'pr_info'
      197 |                 pr_info("%s(), %d. KFD BO Size: %llx \n", __func__, idx, bo->size);
          |                 ^~~~~~~
    amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
       24 | #define LOG_PREFIX "amdgpu_plugin: "
          |                    ^~~~~~~~~~~~~~~~~
    ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
       47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
          |                                                    ^~~~~~~~~~
    amdgpu_plugin_util.c:198:17: note: in expansion of macro 'pr_info'
      198 |                 pr_info("%s(), %d. KFD BO Offset: %llx \n", __func__, idx, bo->offset);
          |                 ^~~~~~~
    amdgpu_plugin_util.h:24:20: error: format '%llx' expects argument of type 'long long unsigned int', but argument 5 has type '__u64' {aka 'long unsigned int'} [-Werror=format=]
       24 | #define LOG_PREFIX "amdgpu_plugin: "
          |                    ^~~~~~~~~~~~~~~~~
    ../../criu/include/log.h:47:52: note: in expansion of macro 'LOG_PREFIX'
       47 | #define pr_info(fmt, ...) print_on_level(LOG_INFO, LOG_PREFIX fmt, ##__VA_ARGS__)
          |                                                    ^~~~~~~~~~
    amdgpu_plugin_util.c:199:17: note: in expansion of macro 'pr_info'
      199 |                 pr_info("%s(), %d. KFD BO Restored Offset: %llx \n", __func__, idx, bo->restored_offset);
          |                 ^~~~~~~
    cc1: all warnings being treated as errors

Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

Running 'crit x ./ rss' on aarch64 crashes with:

    File "/home/criu/crit/crit/__main__.py", line 331, in explore_rss
      while vmas[vmi]['start'] < pme:
            ~~~~^^^^^
    IndexError: list index out of range

This adds an additional check to the while loop to avoid accessing indexes out of range.

Signed-off-by: Adrian Reber <areber@redhat.com>
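A minimal sketch of the kind of bounds check described in the crit fix, not the actual patch. The helper name `skip_vmas_before` and the loop body (advancing `vmi`) are assumptions made for illustration; only the loop condition `vmas[vmi]['start'] < pme` comes from the traceback.

```python
def skip_vmas_before(vmas, vmi, pme):
    """Advance vmi past VMAs that start below pme (hypothetical helper).

    The `vmi < len(vmas)` test is evaluated first, so the list is never
    indexed past its end and the IndexError cannot occur.
    """
    while vmi < len(vmas) and vmas[vmi]['start'] < pme:
        vmi += 1  # assumed loop body; the real loop may do more work
    return vmi
```

With two VMAs starting at 0x1000 and 0x2000, calling the helper with a `pme` above both returns `len(vmas)` instead of raising.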

Previously, the check only verified that /sys/fs/selinux is mounted. This extends the check to also verify that all necessary tools are installed. Signed-off-by: Adrian Reber <areber@redhat.com>
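One way such a combined check could look, as a sketch only. The function names and the idea of passing the tool list as a parameter are assumptions; the commit message does not say which tools the test suite looks for.

```python
import os
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def selinux_usable(tools, mount_point="/sys/fs/selinux"):
    """Hypothetical check: selinuxfs is mounted AND every tool is installed."""
    if not os.path.ismount(mount_point):
        return False
    return not missing_tools(tools)
```

A caller would pass whatever tool names the tests need, for example `selinux_usable(["getenforce"])`; that name is used here purely as an illustration.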

Some test environments (Actuated runners, for example) do not support macvlan devices. Automatically skip tests that depend on them. Signed-off-by: Adrian Reber <areber@redhat.com>

Currently coredump only works on x86_64. Fail early on any other architecture. Signed-off-by: Adrian Reber <areber@redhat.com>
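An early architecture check of this kind might be sketched as follows. The names `SUPPORTED_MACHINES` and `check_architecture`, and the choice of `RuntimeError`, are assumptions for illustration and not taken from the coredump code.

```python
import platform

# Assumption: x86_64 is the only supported machine type, per the commit message.
SUPPORTED_MACHINES = ("x86_64",)


def check_architecture(machine=None):
    """Raise before doing any work if the machine type is unsupported."""
    machine = machine or platform.machine()
    if machine not in SUPPORTED_MACHINES:
        raise RuntimeError(f"coredump: unsupported architecture: {machine}")
    return machine
```

Running the check first means the user gets one clear message instead of a failure somewhere deep in the conversion.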

Signed-off-by: Adrian Reber <areber@redhat.com>

When attempting to checkpoint a container with CUDA processes, CRIU could fail with the following error:

    Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 1
    Error (cuda_plugin.c:143): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
    Error (cuda_plugin.c:384): cuda_plugin: PAUSE_DEVICES failed with

In this situation, the target process is locked, but CRIU fails due to a timeout and exits with an error. We need to make sure that the target PID is unlocked in such a case.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
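The unlock-on-failure idea can be sketched in a language-neutral way. This is not the plugin's code (the plugin is written in C); `lock`, `pause` and `unlock` are hypothetical callables standing in for whatever the plugin actually invokes.

```python
def pause_with_unlock_on_failure(pid, lock, pause, unlock):
    """Lock the target, try to pause it, and undo the lock if pausing fails.

    All three callables are placeholders supplied by the caller.
    """
    lock(pid)
    try:
        pause(pid)
    except Exception:
        # The target must not stay locked when we bail out with an error.
        unlock(pid)
        raise
```

On success the target is left locked, on the assumption that a later step releases it; only the error path unlocks.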

The `uninstall_module.py` script is a wrapper for the `pip uninstall` command that adds support for specifying an installation prefix (i.e., `--prefix`). When this functionality is used, we intentionally set `sys.path` to include only search paths for the specified prefix to avoid unintentional uninstallation of packages in system paths.

Since `importlib_metadata` version 8.1.0, the `Distribution.from_name()` method has been modified [1] to perform additional pre-processing of Distribution objects [2] that requires loading distribution metadata and results in the following error:

    File "/usr/local/lib/python3.12/site-packages/importlib_metadata/__init__.py", line 422, in <lambda>
      buckets = bucket(dists, lambda dist: bool(dist.metadata))
                                                ^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/importlib_metadata/__init__.py", line 454, in metadata
      from . import _adapters
    File "/usr/local/lib/python3.12/site-packages/importlib_metadata/_adapters.py", line 3, in <module>
      import email.message
    File "/usr/lib64/python3.12/email/message.py", line 11, in <module>
      import quopri
    ModuleNotFoundError: No module named 'quopri'

This error occurs because we have excluded system paths from the list of search paths (`sys.path`). However, this pre-processing is not required for our use case, as we only use the discovery mechanism of importlib_metadata to resolve the metadata directory path of the module being uninstalled. To fix this problem, this patch updates `uninstall_module` to avoid the `from_name()` method and use `discover(name=package_name)` directly.

[1] python/importlib_metadata@a65c29a
[2] https://github.com/python/importlib_metadata/blob/a65c29ad/importlib_metadata/__init__.py#L391

Fixes: checkpoint-restore#2468

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
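A minimal sketch of calling `discover()` directly, assuming the standard-library `importlib.metadata` module in place of the `importlib_metadata` backport that the script uses (both provide `Distribution.discover`). The helper name `find_distribution` and the explicit `path` argument are illustrative choices, not the patch itself.

```python
import importlib.metadata


def find_distribution(package_name, search_paths):
    """Return the first distribution named `package_name` found under
    `search_paths`, or None. Hypothetical helper for illustration.

    discover() only locates the metadata directory; it does not parse
    the metadata, so the `email` package is never imported here.
    """
    found = importlib.metadata.Distribution.discover(
        name=package_name, path=list(search_paths)
    )
    return next(iter(found), None)
```

Passing `path` explicitly restricts the search to the given prefix, which mirrors the intent of limiting `sys.path` described above.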

This patch fixes the following typos reported by codespell:

    ./test/others/bers/bers.c:394: dependin ==> depending, depend in
    ./criu/kerndat.c:837: hitted ==> hit

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

Some plugins (e.g., CUDA) may not function correctly when processes are frozen using cgroups. This change introduces a mechanism to disable the use of freeze cgroups during process seizing, even if explicitly requested via the --freeze-cgroup option. The CUDA plugin is updated to utilize this new mechanism to ensure compatibility. Signed-off-by: Andrei Vagin <avagin@google.com>

Adds a new "fault" to call dont_use_freeze_cgroup. Signed-off-by: Andrei Vagin <avagin@google.com>

The presence of /dev/nvidiactl indicates that the system has a compatible NVIDIA GPU driver installed and that the GPU is accessible to the operating system. Signed-off-by: Andrei Vagin <avagin@google.com>

This struct was being used un-initialized, meaning it was filled with random garbage. Mea culpa. Signed-off-by: David Francis <David.Francis@amd.com>

The topology parsing assumed that all parameter names were 30 characters or fewer, but recommended_sdma_engine_id_mask is 31 characters. Make the maximum length a macro, and set it to 64. Signed-off-by: David Francis <David.Francis@amd.com>

This should help to investigate errors from fsconfig, fsmount, etc. Signed-off-by: Andrei Vagin <avagin@google.com>

rst0git and others added 13 commits September 16, 2024 05:10

plugins/amdgpu: fix cross-compilation

1b82287

To enable cross-compilation, we need to use the CC definition from criu/scripts/nmk/scripts/tools.mk:

    CC := $(CROSS_COMPILE)$(HOSTCC)

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

plugins/amdgpu: use C99-standard types

d75d468

Co-developed-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

test: better test for SELinux tools

6122a74

Previously, the check only verified that /sys/fs/selinux is mounted. This extends the check to also verify that all necessary tools are installed. Signed-off-by: Adrian Reber <areber@redhat.com>

test: only run macvlan tests if macvlan devices can be created

54951b2

Some test environments (Actuated runners, for example) do not support macvlan devices. Automatically skip tests that depend on them. Signed-off-by: Adrian Reber <areber@redhat.com>

coredump: fail on unsupported architectures early

ab99a95

Currently coredump only works on x86_64. Fail early on any other architecture. Signed-off-by: Adrian Reber <areber@redhat.com>

ci: run aarch64 tests native via actuated

d7ba619

Signed-off-by: Adrian Reber <areber@redhat.com>

codespell: fix typos

3fb9d8f

This patch fixes the following typos reported by codespell:

    ./test/others/bers/bers.c:394: dependin ==> depending, depend in
    ./criu/kerndat.c:837: hitted ==> hit

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

rst0git self-requested a review September 16, 2024 17:23

avagin and others added 5 commits September 19, 2024 03:47

fault: allow to check dont_use_freeze_cgroup

1d78a58

Adds a new "fault" to call dont_use_freeze_cgroup. Signed-off-by: Andrei Vagin <avagin@google.com>

plugin/cuda: disable CUDA plugin if /dev/nvidiactl isn't present

54ebb7f

The presence of /dev/nvidiactl indicates that the system has a compatible NVIDIA GPU driver installed and that the GPU is accessible to the operating system. Signed-off-by: Andrei Vagin <avagin@google.com>

plugins/amdgpu: Zero ib_info on initialization

b31fec5

This struct was being used un-initialized, meaning it was filled with random garbage. Mea culpa. Signed-off-by: David Francis <David.Francis@amd.com>

util: dump fsfd log messages

6aafc9d

This should help to investigate errors from fsconfig, fsmount, etc. Signed-off-by: Andrei Vagin <avagin@google.com>

avagin force-pushed the second-part-of-4.0 branch from be1ad56 to 6aafc9d Compare September 19, 2024 10:47

avagin merged commit a8cbe76 into checkpoint-restore:master Sep 19, 2024
28 of 37 checks passed
