Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compel: when pre-dump, handle ERESTART_RESTARTBLOCK as ERESTARTNOINTR #2512

Closed

Conversation

anatasluo
Copy link
Contributor

When pre-dump, if thread is in syscall like sleep or poll, it will return -ERESTART_RESTARTBLOCK. In that case, if we set ax register to -EINTR, it will break the original syscall. Instead, we can restore original syscall.

@anatasluo
Copy link
Contributor Author

anatasluo commented Nov 4, 2024

Test cases

I tested two cases for this patch:

  1. sleep syscall
#include <unistd.h>
#include <stdio.h>
#include <time.h>

const unsigned int interval = 60;

int main()
{
	time_t begin, end;
	printf("main thread works - %d for %u seconds \n", getpid(), interval);
	time(&begin);
	sleep(interval);
	time(&end);
	printf("it is ok - %f seconds \n", difftime(end, begin));
	return 0;
}
  1. poll syscall
#include <stdio.h>
#include <unistd.h>
#include <sys/poll.h>

#define TIMEOUT 25

int main (void)
{
	struct pollfd fds[2];
	int ret;

    printf("main thread works - %d \n", getpid());

	/* watch stdin for input */
	fds[0].fd = STDIN_FILENO;
	fds[0].events = POLLIN;

	/* watch stdout for ability to write */
	fds[1].fd = STDOUT_FILENO;
	fds[1].events = POLLOUT;

	ret = poll(fds, 1, TIMEOUT * 1000);

	if (ret == -1) {
		perror ("poll");
		return 1;
	}

	if (!ret) {
		printf ("%d seconds elapsed.\n", TIMEOUT);
		return 0;
	}

	if (fds[0].revents & POLLIN)
		printf ("stdin is readable\n");

	// if (fds[1].revents & POLLOUT)
	// 	printf ("stdout is writable\n");

	return 0;

}

Test command and result

sudo ./criu/criu pre-dump -vvv --shell-job -t

Without this patch, pre-dump of criu will break the sleep or poll syscall

When pre-dump, if thread is in syscall like sleep or poll, it will
return -ERESTART_RESTARTBLOCK. In that case, if we set ax register to -EINTR,
it will break the original syscall. Instead, we can restore original
syscall.

Signed-off-by: Longjun Luo <luolongjuna@gmail.com>
@@ -60,7 +60,7 @@ int sigreturn_prep_fpu_frame_plain(struct rt_sigframe *sigframe, struct rt_sigfr
}

int compel_get_task_regs(pid_t pid, user_regs_struct_t *regs, user_fpregs_struct_t *fpsimd, save_regs_t save,
void *arg, __maybe_unused unsigned long flags)
void *arg, __maybe_unused unsigned long flags, __maybe_unused bool pre_dump)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this argument is now used with both criu pre-dump and criu dump --leave-running, it might be better to use more generic name than pre_dump.

Copy link
Member

@rst0git rst0git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me.

@Snorch This change seems related to the discussion in #1589 (comment). Would you be able to review this pull request?

@Snorch
Copy link
Member

Snorch commented Nov 5, 2024

This does not sound correct.

We specifically set aside the -ERESTART_RESTARTBLOCK case, e.g. see commit message of dd71cca, it says:

Just change this code to enforce -EINTR after restore, this is what
we actually want until we teach criu to handle ERESTART_RESTARTBLOCK.

So we explicitly want this behavior of not restarting syscall in this case.

One of the reasons why we might want not to simply restart sleep syscall is that this approach does not consider the passed time before restart and may lead to wrong sleep times (possibly even endless sleep due to repeated dump).

0                                                    1                                                    2
-----------------------------------------------------
                                         -----------------------------------------------------
^                                        ^                                                   ^          
sleep(1)                                 interrupted + restarted sleep(1)                    sleeped more than was needed

In other words: Application which uses sleep (nanosleep) is in charge of restarting sleep with proper time adjustment, as sleep can be interrupted by signal even without CRIU involved.

@anatasluo
Copy link
Contributor Author

anatasluo commented Nov 5, 2024

@Snorch Yes, I agree with you. As you said, interrupted time will not be considered as elcasped time. So it is good to make ERESTART_RESTARTBLOCK case return -EINTR explicitly.
But for the case of pre-dump or --leave-running, there are some different things:

  1. interrupted time is really short, sleep itself is not that accurate. Since program could be affeched by things like schedule.
  2. for the poll syscall, interrupt directly may change the program behavior

Anyway, I think two strategies are all acceptable, just one advice.

@avagin
Copy link
Member

avagin commented Nov 5, 2024

interrupted time is really short, sleep itself is not that accurate. Since program could be affeched by things like schedule.

@anatasluo syscalls like poll and sleep have restart blocks in the kernel:
https://elixir.bootlin.com/linux/v6.11.6/source/include/linux/restart_block.h#L25

so they can be proper restarted only if the kernel doesn't drop their restart blocks. Otherwise, it can be the case, that a process called sleep(1h) and has been sleeping for 59 minutes, then we restart the syscall and it starts sleeping for another hour.

@anatasluo
Copy link
Contributor Author

anatasluo commented Nov 5, 2024

@avagin This patch is used for pre-dump or --leave-running. In this case, kernel should not drop anything. And also, check the kernel code, the remanning time is saved or updated in userspace:

int nanosleep_copyout(struct restart_block *restart, struct timespec64 *ts)
{
	switch(restart->nanosleep.type) {
#ifdef CONFIG_COMPAT_32BIT_TIME
	case TT_COMPAT:
		if (put_old_timespec32(ts, restart->nanosleep.compat_rmtp))
			return -EFAULT;
		break;
#endif
	case TT_NATIVE:
		if (put_timespec64(ts, restart->nanosleep.rmtp))
			return -EFAULT;
		break;
	default:
		BUG();
	}
	return -ERESTART_RESTARTBLOCK;
}

restart->nanosleep.rmtp is coming from userspace:

SYSCALL_DEFINE4(clock_nanosleep_time32, clockid_t, which_clock, int, flags,
		struct old_timespec32 __user *, rqtp,
		struct old_timespec32 __user *, rmtp)
{
	const struct k_clock *kc = clockid_to_kclock(which_clock);
	struct timespec64 t;

	if (!kc)
		return -EINVAL;
	if (!kc->nsleep)
		return -EOPNOTSUPP;

	if (get_old_timespec32(&t, rqtp))
		return -EFAULT;

	if (!timespec64_valid(&t))
		return -EINVAL;
	if (flags & TIMER_ABSTIME)
		rmtp = NULL;
	current->restart_block.fn = do_no_restart_syscall;
	current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
	current->restart_block.nanosleep.compat_rmtp = rmtp;

	return kc->nsleep(which_clock, flags, &t);
}

@anatasluo
Copy link
Contributor Author

Test cases

I tested two cases for this patch:

  1. sleep syscall
#include <unistd.h>
#include <stdio.h>
#include <time.h>

const unsigned int interval = 60;

int main()
{
	time_t begin, end;
	printf("main thread works - %d for %u seconds \n", getpid(), interval);
	time(&begin);
	sleep(interval);
	time(&end);
	printf("it is ok - %f seconds \n", difftime(end, begin));
	return 0;
}
  1. poll syscall
#include <stdio.h>
#include <unistd.h>
#include <sys/poll.h>

#define TIMEOUT 25

int main (void)
{
	struct pollfd fds[2];
	int ret;

    printf("main thread works - %d \n", getpid());

	/* watch stdin for input */
	fds[0].fd = STDIN_FILENO;
	fds[0].events = POLLIN;

	/* watch stdout for ability to write */
	fds[1].fd = STDOUT_FILENO;
	fds[1].events = POLLOUT;

	ret = poll(fds, 1, TIMEOUT * 1000);

	if (ret == -1) {
		perror ("poll");
		return 1;
	}

	if (!ret) {
		printf ("%d seconds elapsed.\n", TIMEOUT);
		return 0;
	}

	if (fds[0].revents & POLLIN)
		printf ("stdin is readable\n");

	// if (fds[1].revents & POLLOUT)
	// 	printf ("stdout is writable\n");

	return 0;

}

Test command and result

sudo ./criu/criu pre-dump -vvv --shell-job -t

Without this patch, pre-dump of criu will break the sleep or poll syscall

@avagin Test cases have considered this situation.

@avagin
Copy link
Member

avagin commented Nov 5, 2024

compel uses rt_sigreturn to resume a process, but in the kernel, it cleans up the restart_block in the kernel:
https://elixir.bootlin.com/linux/v6.11.6/source/arch/x86/kernel/signal_64.c#L57

I use your test with minor modifications and I see that the poll syscall is restarted without the restart_block from the previous call:

avagin@avagin3:~/git/criu$ gcc test2.c 
avagin@avagin3:~/git/criu$ ./a.out & pid=$!
[1] 394390
avagin@avagin3:~/git/criu$ main thread works - 394390 

avagin@avagin3:~/git/criu$ sleep 5
avagin@avagin3:~/git/criu$ sudo ./criu/criu pre-dump -vvv -o pre-dump.log -t $pid -D img/
avagin@avagin3:~/git/criu$ echo $?
0
avagin@avagin3:~/git/criu$ wait
it is ok - 38.000000 seconds 
[1]+  Done                    ./a.out
avagin@avagin3:~/git/criu$ cat test2.c 
#include <stdio.h>
#include <unistd.h>
#include <sys/poll.h>
#include <time.h>

#define TIMEOUT 25

int main (void)
{
	struct pollfd fds[2];
	int ret;
	time_t begin, end;

    printf("main thread works - %d \n", getpid());

	time(&begin);
	ret = poll(fds, 0, TIMEOUT * 1000);
	time(&end);
	printf("it is ok - %f seconds \n", difftime(end, begin));

	if (ret == -1) {
		perror ("poll");
		return 1;
	}

	return 0;

}

@anatasluo
Copy link
Contributor Author

@avagin I found the problem here, restore_sigcontext would not affect this patch. Because this patch doesn't use restart_syscall. Instead, this patch restarts the original syscall. The problem here is that timeout for sleep syscall is read from memory while timeout for poll syscall is a value. So this patch works for sleep syscall, but not works for poll syscall.

SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
		int, timeout_msecs)
{
	struct timespec64 end_time, *to = NULL;
	int ret;

	if (timeout_msecs >= 0) {
		to = &end_time;
		poll_select_set_timeout(to, timeout_msecs / MSEC_PER_SEC,
			NSEC_PER_MSEC * (timeout_msecs % MSEC_PER_SEC));
	}

	ret = do_sys_poll(ufds, nfds, to);

	if (ret == -ERESTARTNOHAND) {
		struct restart_block *restart_block;

		restart_block = &current->restart_block;
		restart_block->poll.ufds = ufds;
		restart_block->poll.nfds = nfds;

		if (timeout_msecs >= 0) {
			restart_block->poll.tv_sec = end_time.tv_sec;
			restart_block->poll.tv_nsec = end_time.tv_nsec;
			restart_block->poll.has_timeout = 1;
		} else
			restart_block->poll.has_timeout = 0;

		ret = set_restart_fn(restart_block, do_restart_poll);
	}
	return ret;
}

@anatasluo
Copy link
Contributor Author

Since this patch still has problems to fix, i will close it now.

@anatasluo anatasluo closed this Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants