Reduce load on drain integration test to reduce race-condition failures #3640

Merged: 5 commits merged into master from benc-drain-races on Oct 19, 2024

Conversation

@benclifford (Collaborator) commented Oct 18, 2024

This PR reduces the load that repeatedly querying the interchange for connected managers places on the interchange and on the whole test environment. It does this by increasing the period between such requests from the default of every 20ms to every 100ms.
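A minimal sketch of the kind of change involved, reusing the CONNECTED_MANAGERS_POLL_MS constant and try_assert call that appear in the test excerpt further down (the snippet is illustrative rather than the literal diff):

# Illustrative sketch, not the literal diff: poll the interchange every 100ms
# instead of try_assert's 20ms default, so the connected-managers RPC is
# issued five times less often. htex and try_assert come from the test's
# surrounding fixtures and helpers.
CONNECTED_MANAGERS_POLL_MS = 100

try_assert(lambda: len(htex.connected_managers()) == 1,
           check_period_ms=CONNECTED_MANAGERS_POLL_MS)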

In the last few days, test_drain.py began failing often. I have seen it occasionally fail before. This initially showed up in PR #3639, which is unrelated, but I recreated the problem in CI against master as of #3627.

I investigated and found this behaviour causing the failure:

  • test_drain configures the drain period to be 1 second
  • startup of a worker pool was taking more than 1 second
  • the worker pool enters drain state, drains and terminates immediately on being fully started up.
  • test_drain.py fails its assumption that there is a worker pool to inspect, even though it has just waited for a worker pool to appear. This is the race condition: the assertion at line 57 is true, but line 59 returns an empty managers list.
57    try_assert(lambda: len(htex.connected_managers()) == 1, check_period_ms=CONNECTED_MANAGERS_POLL_MS)
58
59    managers = htex.connected_managers()
60    assert managers[0]['active'], "The manager should be active"

Looking at the CI logs for a failing case, I saw direct evidence in manager.log that the worker pool takes more than 1 second to start up:

2024-10-18 10:31:16.007 parsl:914 29414 MainThread [INFO]  Python version: 3.12.7 (main, Oct  1 2024, 15:17:50) [GCC 9.4.0]
[...]
2024-10-18 10:31:16.008 parsl:151 29414 MainThread [INFO]  Manager initializing
[this is where the worker start time for drain purposes is measured]
[...]
2024-10-18 10:31:16.011 parsl:183 29414 MainThread [INFO]  Manager connected to interchange
2024-10-18 10:31:17.058 parsl:243 29414 MainThread [INFO]  Will request drain at 1729247477.0087547
[...]
2024-10-18 10:31:17.073 parsl:336 29414 Task-Puller [DEBUG]  ready workers: 0, pending tasks: 0

There's more than a second of delay between "... connected to interchange" and the subsequent "Will request drain" message. Not a huge amount happens between these lines, but it includes things like multiprocessing initialization, which starts a new process.
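As a rough, hypothetical way to check locally whether process startup accounts for a chunk of that delay (this micro-benchmark is not part of the PR):

import multiprocessing
import time

def _noop():
    pass

if __name__ == "__main__":
    # Time starting and joining a single do-nothing process; on a loaded CI
    # machine this can take a noticeable fraction of a second.
    start = time.monotonic()
    p = multiprocessing.Process(target=_noop)
    p.start()
    p.join()
    print(f"process start+join took {time.monotonic() - start:.3f}s")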

It looks like this bit of code is slow even in the successful case: rerunning until success, I see this timing in CI:

2024-10-18 12:11:17.475 parsl:183 23062 MainThread [INFO]  Manager connected to interchange
2024-10-18 12:11:18.181 parsl:243 23062 MainThread [INFO]  Will request drain at 1729253478.4731379

which is still a large fraction of a second (but sufficiently less than a second for the test to pass).

I haven't investigated what is taking that time, or whether I also see it on my laptop.

I hypothesised that a lot of these test failures come from the test environment being quite heavily loaded. I'm especially suspicious of using try_assert with its default timing, which is very tight (20ms); the connected-managers RPC here would be expected to run much less frequently in regular Parsl use, more like every 5 seconds.
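For context, the polling pattern behind try_assert looks roughly like the sketch below - a simplified stand-in, not Parsl's actual helper, and the timeout default is an assumption - which makes clear that check_period_ms directly controls how often the asserted callable, and therefore the connected-managers RPC, runs:

import time

def try_assert(criterion, check_period_ms: int = 20, timeout_ms: int = 5000):
    # Simplified stand-in for a polling assert: repeatedly evaluate the
    # criterion until it is true or the (assumed) timeout expires.
    deadline = time.monotonic() + timeout_ms / 1000.0
    while time.monotonic() < deadline:
        if criterion():
            return
        time.sleep(check_period_ms / 1000.0)
    raise AssertionError("criterion not met within timeout")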

So I lengthened the period of the try_asserts in this test, to try to reduce the load caused there.

That makes the test pass repeatedly again.

Things not investigated/debugged:

  • why this is taking >0.5 seconds even in the successful case - it's possible that this is a reasonable startup time, and so the test might be lengthened by a few seconds
  • how to do a test without being timing-reliant - draining is, by its very nature, reliant on the passage of "real time". For example, you might mock system time (at the libc level if not at the Python level); see the sketch after this list.
  • what other loads are present on the system - one of the points of the slowly-ongoing shutdown tidyup PR #3397 ([not for merge] tidying threads and file descriptors at shutdown) is to reduce thread load left behind by earlier tests
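A hypothetical illustration of the Python-level variant of that time-mocking idea (not sufficient on its own, since the manager runs in separately spawned processes whose view of time would also need patching):

from unittest import mock
import time

class FakeClock:
    # Hypothetical sketch: a controllable clock that can be advanced
    # instantly, so a 1-second drain period appears to elapse without waiting.
    def __init__(self):
        self._now = time.monotonic()

    def monotonic(self):
        return self._now

    def advance(self, seconds):
        self._now += seconds

clock = FakeClock()
with mock.patch("time.monotonic", clock.monotonic):
    clock.advance(2.0)  # jump past a 1-second drain period
    assert time.monotonic() == clock.monotonic()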

@benclifford changed the title from "Investigation of drain test race conditions vs try-assert interchange load" to "Reduce load on drain integration test to reduce race-condition failures" on Oct 19, 2024
@benclifford marked this pull request as ready for review October 19, 2024 12:19
@rjmello (Member) left a comment

This looks good.

Re: the race condition, consider caching the managers:

managers = None
def get_connected_managers():
    nonlocal managers
    managers = htex.connected_managers()
    return managers

try_assert(lambda: len(get_connected_managers()) == 1)

assert managers[0]['active'], "The manager should be active"
assert not managers[0]['draining'], "The manager should not be draining"

@benclifford (Collaborator, Author) replied

> This looks good.
>
> Re: the race condition, consider caching the managers:
>
> managers = None
> def get_connected_managers():
>     nonlocal managers
>     managers = htex.connected_managers()
>     return managers
>
> try_assert(lambda: len(get_connected_managers()) == 1)
>
> assert managers[0]['active'], "The manager should be active"
> assert not managers[0]['draining'], "The manager should not be draining"

That will preserve an outdated view of the universe, sure. But then the next bit, where the test checks that running a task stops draining from happening, won't work, I think - if the manager has already entered the draining state here, the precondition for that part of the test is broken.

@benclifford merged commit 973ee48 into master Oct 19, 2024
7 checks passed
@benclifford deleted the benc-drain-races branch October 19, 2024 16:29