Reduce load on drain integration test to reduce race-condition failures #3640
Conversation
This looks good.
Re: the race condition, consider caching the managers:
```python
# (inside the test function, hence `nonlocal`)
managers = None

def get_connected_managers():
    nonlocal managers
    managers = htex.connected_managers()  # cache the most recent RPC result
    return managers

try_assert(lambda: len(get_connected_managers()) == 1)

# reuse the cached result rather than issuing another RPC
assert managers[0]['active'], "The manager should be active"
assert not managers[0]['draining'], "The manager should not be draining"
```
That will preserve an outdated view of the universe, sure. But then the next bit, where the test checks that running a task causes draining to not happen, won't work as intended, I think - if the manager has already entered the draining stage here, the precondition for that check is broken.
This PR reduces the load placed on the interchange, and on the whole test environment, by the test's repeated queries for connected managers. It does this by increasing the period between such requests from the default of every 20ms to every 100ms.
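To make the mechanism concrete, here is a minimal sketch of a try_assert-style polling helper. This is not Parsl's actual implementation, and the parameter names are assumptions for illustration only; lengthening the check period keeps the same assertion semantics while issuing far fewer connected-managers RPCs per second:

```python
import time


def try_assert(condition, timeout_s=5.0, check_period_s=0.02):
    """Poll `condition` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(check_period_s)
    raise AssertionError("condition did not become true before the timeout")


# Tight polling (20ms): up to ~50 connected-managers queries per second.
# try_assert(lambda: len(htex.connected_managers()) == 1)

# Lengthened polling (100ms): at most ~10 queries per second, same behaviour.
# try_assert(lambda: len(htex.connected_managers()) == 1, check_period_s=0.1)
```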
In the last few days, test_drain.py began failing often; I had seen it fail occasionally before. The failures first showed up in PR #3639, which is unrelated, but I recreated the problem in CI against master as of #3627.
I investigated and found this behaviour causing the failure:
Looking at the CI logs for a failing case, I saw direct evidence in manager.log that the worker pool takes more than 1 second to start up: there is more than a second's delay between the "... connected to interchange" message and the subsequent "Will request drain" message. Not a huge amount happens between those two lines, but it does include things like multiprocessing initialization, which starts a new process.
It looks like this bit of code is slow even in the successful case: rerunning until success, I see this timing in CI, which is still a large fraction of a second (but sufficiently less than a second for the test to pass).
I haven't investigated what is taking that time, or whether I also see it on my laptop.
I hypothesised that a lot of these test failures come from the test environment being quite loaded. I'm especially suspicious of using try_assert with its default timings, which are very tight (20ms): the connected-managers RPC used here would be expected to run much less frequently in regular Parsl use, more like every 5 seconds. So I lengthened the period of the try_asserts in this test, to try to reduce the load caused there.
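A rough back-of-envelope comparison of request rates, using only the figures above:

```python
# 20ms polling allows up to 50 connected-managers requests per second,
# 100ms polling at most 10 per second, and regular Parsl use (roughly
# every 5 seconds) about 0.2 per second.
for period_s in (0.02, 0.1, 5.0):
    print(f"period {period_s}s -> up to {1 / period_s:g} requests/s")
```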
That makes the test pass repeatedly again.
Things not investigated/debugged: