Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add TERM handling #375

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

fatmcgav
Copy link
Contributor

This is useful to gracefully stop the Jenkins swarm process, for example
when running in Kubernetes in conjunction with a preStop lifecycle hook.

Signed-off-by: Gavin Williams gavin.williams@elastic.co

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

Example log output:

Oct 13, 2021 10:38:21 PM hudson.plugins.swarm.Client run
WARNING: Remaining retries: 5
Oct 13, 2021 10:38:21 PM hudson.plugins.swarm.Client run
INFO: Retrying in 10 seconds
Oct 13, 2021 10:38:30 PM hudson.plugins.swarm.Client$1 handle
INFO: Exiting...

Copy link
Member

@basil basil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for yet another contribution, Gavin! At a high level, this makes sense to me. At an implementation level, I wonder if this could be done without resorting to the private sun.misc.Signal API, which should generally be avoided for forward compatibility with future Java versions. I did a quick Google search and found that the official Runtime.getRuntime().addShutdownHook API should be able to handle SIGTERM. Can you look into using that API rather than the private sun.misc.Signal API to implement this RFE?

@basil basil added the rfe label Oct 14, 2021
@fatmcgav
Copy link
Contributor Author

@basil thanks for the feedback...

I did start out with the addShutdownHook api but couldn't find a good implementation.. I'll try again though.. 👍

@fatmcgav
Copy link
Contributor Author

@basil I've replaced the use of sun.misc.Signal with Runtime.getRuntime().addShutdownHook().

PTAL :)

This is useful to gracefully stop the Jenkins swarm process, for example
when running in Kubernetes in conjunction with a preStop lifecycle hook.

Signed-off-by: Gavin Williams <gavin.williams@elastic.co>
Comment on lines +201 to +202
logger.info("Interrupting '" + th.getClass() + "' termination");
th.interrupt();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you verified, when sending a TERM signal to a connected Swarm client, that this if statement is true and that the code inside the if statement is executed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh, so I've confirmed that doing a kill <PID> results in the log line being output:

java -jar client/target/swarm-client.jar -description "fatmcgav swarm test" -disableClientsUniqueId -executors 1 -fsroot jenkins -retry 6 -maxRetryInterval 160 -retryBackOffStrategy exponential -mode exclusive -username local-swarm -password **** -deleteExistingClients -name fatmcgav-swarm-test -master https://*** -username local-swarm -webSocket -labels swarm -labels x86_64
Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client logArguments
INFO: Client invoked with: -deleteExistingClients true -description fatmcgav swarm test -disableClientsUniqueId true -executors 1 -fsroot jenkins -labels [swarm, x86_64] -maxRetryInterval 160 -mode exclusive -name fatmcgav-swarm-test -password ***** -retry 6 -retryBackOffStrategy EXPONENTIAL -url https://*** -username ***** -webSocket true
Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client run
INFO: Connecting to Jenkins controller
Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client run
INFO: Attempting to connect to https://***
...
Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client run
WARNING: Remaining retries: 4
Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client run
INFO: Retrying in 20 seconds
Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client$1 run
INFO: Interrupting threads

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh, so I've confirmed that doing a kill <PID> results in the log line being output:

java -jar client/target/swarm-client.jar -description "fatmcgav swarm test" -disableClientsUniqueId -executors 1 -fsroot jenkins -retry 6 -maxRetryInterval 160 -retryBackOffStrategy exponential -mode exclusive -username local-swarm -password **** -deleteExistingClients -name fatmcgav-swarm-test -master https://*** -username local-swarm -webSocket -labels swarm -labels x86_64
Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client logArguments
INFO: Client invoked with: -deleteExistingClients true -description fatmcgav swarm test -disableClientsUniqueId true -executors 1 -fsroot jenkins -labels [swarm, x86_64] -maxRetryInterval 160 -mode exclusive -name fatmcgav-swarm-test -password ***** -retry 6 -retryBackOffStrategy EXPONENTIAL -url https://*** -username ***** -webSocket true
Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client run
INFO: Connecting to Jenkins controller
Oct 19, 2021 9:51:54 PM hudson.plugins.swarm.Client run
INFO: Attempting to connect to https://***
...
Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client run
WARNING: Remaining retries: 4
Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client run
INFO: Retrying in 20 seconds
Oct 19, 2021 9:52:05 PM hudson.plugins.swarm.Client$1 run
INFO: Interrupting threads

Do you see a line above that says "Interrupting '[…]' termination"? I can't see one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point... I'll do some debugging...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I ran some tests locally, and even with the swarm-client successfully connecting to a Jenkins controller, I don't see any threads being interrupted.

I've put a couple of thread dump here: https://gist.github.com/fatmcgav/f848c071bbf5f650408c3e7f0b45238a

The first one is just with the agent idle and connected to Jenkins, the 2nd with a simple sleep script running...

It looks like the runningThreads set never gets populated with anything based on adding some additional logging...

So is it safe just to remove the logic? Or should the threads be getting interrupted?

Copy link
Member

@basil basil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on https://github.com/jenkinsci/remoting/blob/f4b81bbb7b2df53bc14e1a14a36d4fd3ce570a31/src/main/java/hudson/remoting/jnlp/Main.java#L284-L287 it is sufficient to send an interrupt to the main thread, which will in turn interrupt the Engine thread. In Swarm's case, the main thread is the one running the while (true) loop in Client.java. (One way of reliably getting at this thread is to call Thread.currentThread() at the start of main() and hold on to the reference.) That thread should be sitting at line 235 of Client.java at idle, and when interrupted it would first hit the finally block of the snippet I linked to above (which calls engine.interrupt() to interrupt the Engine thread) then fall into the catch block at line 240 of Client.java, where we log a stack trace and eventually loop back around to reconnect. In your case you want to effect a clean shutdown, so you will want to change this to not log a stack trace and reconnect but rather to wait for the Engine thread to finish shutting down and then exit from the main thread cleanly without looping back around to reconnect. To distinguish this new case from the existing use cases you will likely need to add some sort of AtomicBoolean that is set by your new shutdown hook.

@fatmcgav
Copy link
Contributor Author

@basil thanks for the great reply and the pointer...

I'll rework...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants