Reduce load on drain integration test to reduce race-condition failures #3640

Merged: 5 commits merged into master from benc-drain-races on Oct 19, 2024

Conversation

@benclifford (Collaborator) commented Oct 18, 2024

This PR reduces the load that repeatedly querying the interchange for connected managers places on the interchange and on the whole test environment. It does this by increasing the period between such requests from the default of every 20ms to every 100ms.
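A minimal sketch of the kind of change involved, reusing the CONNECTED_MANAGERS_POLL_MS constant and try_assert call that appear in the test excerpt further down (the snippet is illustrative rather than the literal diff):

# Illustrative sketch, not the literal diff: poll the interchange every 100ms
# instead of try_assert's 20ms default, so the connected-managers RPC is
# issued five times less often. htex and try_assert come from the test's
# surrounding fixtures and helpers.
CONNECTED_MANAGERS_POLL_MS = 100

try_assert(lambda: len(htex.connected_managers()) == 1,
           check_period_ms=CONNECTED_MANAGERS_POLL_MS)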

In the last few days, test_drain.py began failing often. I have seen it occasionally fail before. This initially showed up in PR #3639, which is unrelated, but I recreated the problem in CI against master as of #3627.

I investigated and found this behaviour causing the failure:

  • test_drain configures the drain period to be 1 second
  • startup of a worker pool was taking more than 1 second
  • the worker pool enters drain state, drains and terminates immediately on being fully started up.
  • test_drain.py fails its assumption that there is a worker pool to inspect, even though it has just waited for a worker pool to appear. This is the race condition: the assertion at line 57 is true, but line 59 returns an empty managers list.
57    try_assert(lambda: len(htex.connected_managers()) == 1, check_period_ms=CONNECTED_MANAGERS_POLL_MS)
58
59    managers = htex.connected_managers()
60    assert managers[0]['active'], "The manager should be active"

Looking at the CI logs for a failing case, I saw direct evidence in manager.log that the worker pool takes more than 1 second to start up:

2024-10-18 10:31:16.007 parsl:914 29414 MainThread [INFO]  Python version: 3.12.7 (main, Oct  1 2024, 15:17:50) [GCC 9.4.0]
[...]
2024-10-18 10:31:16.008 parsl:151 29414 MainThread [INFO]  Manager initializing
[this is where the worker start time for drain purposes is measured]
[...]
2024-10-18 10:31:16.011 parsl:183 29414 MainThread [INFO]  Manager connected to interchange
2024-10-18 10:31:17.058 parsl:243 29414 MainThread [INFO]  Will request drain at 1729247477.0087547
[...]
2024-10-18 10:31:17.073 parsl:336 29414 Task-Puller [DEBUG]  ready workers: 0, pending tasks: 0

There's more than a second of delay between "... connected to interchange" and the subsequent "Will request drain" message. Not a huge amount happens between these lines, but it includes things like multiprocessing initialization, which starts a new process.
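As a rough, hypothetical way to check locally whether process startup accounts for a chunk of that delay (this micro-benchmark is not part of the PR):

import multiprocessing
import time

def _noop():
    pass

if __name__ == "__main__":
    # Time starting and joining a single do-nothing process; on a loaded CI
    # machine this can take a noticeable fraction of a second.
    start = time.monotonic()
    p = multiprocessing.Process(target=_noop)
    p.start()
    p.join()
    print(f"process start+join took {time.monotonic() - start:.3f}s")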

It looks like this bit of code is slow even in the successful case: rerunning until success, I see this timing in CI:

2024-10-18 12:11:17.475 parsl:183 23062 MainThread [INFO]  Manager connected to interchange
2024-10-18 12:11:18.181 parsl:243 23062 MainThread [INFO]  Will request drain at 1729253478.4731379

which is still a large fraction of a second (but sufficiently less than a second for the test to pass).

I haven't investigated what is taking that time, or whether I also see it on my laptop.

I hypothesised that a lot of these test failures come from the test environment being quite heavily loaded. I'm especially suspicious of using try_assert with its default timing, which is very tight (20ms); the connected-managers RPC here would be expected to run much less frequently in regular Parsl use, more like every 5 seconds.
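For context, the polling pattern behind try_assert looks roughly like the sketch below - a simplified stand-in, not Parsl's actual helper, and the timeout default is an assumption - which makes clear that check_period_ms directly controls how often the asserted callable, and therefore the connected-managers RPC, runs:

import time

def try_assert(criterion, check_period_ms: int = 20, timeout_ms: int = 5000):
    # Simplified stand-in for a polling assert: repeatedly evaluate the
    # criterion until it is true or the (assumed) timeout expires.
    deadline = time.monotonic() + timeout_ms / 1000.0
    while time.monotonic() < deadline:
        if criterion():
            return
        time.sleep(check_period_ms / 1000.0)
    raise AssertionError("criterion not met within timeout")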

So I lengthened the period of the try_asserts in this test, to try to reduce the load caused there.

That makes the test pass repeatedly again.

Things not investigated/debugged:

  • why this is taking >0.5 seconds even in the successful case - it's possible that this is a reasonable startup time, and so the test might be lengthened by a few seconds
  • how to do a test without being timing-reliant - draining is, by its very nature, reliant on the passage of "real time". For example, you might mock system time (at the libc level if not at the Python level); see the sketch after this list.
  • what other loads are present on the system - one of the points of the slowly-ongoing shutdown tidyup PR #3397 ([not for merge] tidying threads and file descriptors at shutdown) is to reduce thread load left behind by earlier tests
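A hypothetical illustration of the Python-level variant of that time-mocking idea (not sufficient on its own, since the manager runs in separately spawned processes whose view of time would also need patching):

from unittest import mock
import time

class FakeClock:
    # Hypothetical sketch: a controllable clock that can be advanced
    # instantly, so a 1-second drain period appears to elapse without waiting.
    def __init__(self):
        self._now = time.monotonic()

    def monotonic(self):
        return self._now

    def advance(self, seconds):
        self._now += seconds

clock = FakeClock()
with mock.patch("time.monotonic", clock.monotonic):
    clock.advance(2.0)  # jump past a 1-second drain period
    assert time.monotonic() == clock.monotonic()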

@benclifford changed the title from "Investigation of drain test race conditions vs try-assert interchange load" to "Reduce load on drain integration test to reduce race-condition failures" on Oct 19, 2024
@benclifford marked this pull request as ready for review October 19, 2024 12:19
@rjmello (Member) left a comment

This looks good.

Re: the race condition, consider caching the managers:

managers = None
def get_connected_managers():
    nonlocal managers
    managers = htex.connected_managers()
    return managers

try_assert(lambda: len(get_connected_managers()) == 1)

assert managers[0]['active'], "The manager should be active"
assert not managers[0]['draining'], "The manager should not be draining"

@benclifford (Collaborator, Author) replied

> This looks good.
>
> Re: the race condition, consider caching the managers:
>
> managers = None
> def get_connected_managers():
>     nonlocal managers
>     managers = htex.connected_managers()
>     return managers
>
> try_assert(lambda: len(get_connected_managers()) == 1)
>
> assert managers[0]['active'], "The manager should be active"
> assert not managers[0]['draining'], "The manager should not be draining"

That will preserve an outdated view of the universe, sure. But then the next bit, where the test checks that running a task stops draining from happening, won't work, I think - if the manager has already entered the draining state here, the precondition for that part of the test is broken.

@benclifford merged commit 973ee48 into master Oct 19, 2024
7 checks passed
@benclifford deleted the benc-drain-races branch October 19, 2024 16:29