On a very high level, the functionality can be summarized with the following bullet points:
rebootmgr
is a command-line tool, that can be used as a safer replacement for thereboot
command- It will only reboot when safe and/or necessary
- You can run it as a systemd timer, or manually
- It relies on Consul service discovery, for a cluster overview
- It can execute tasks before and after rebooting
You can configure rebootmgr in your cluster using the consul key/value (kv) store.
If this exists in the consul key/value store, rebootmgr won't do anything unless you specify the option --ignore-global-stop-flag
.
Content of the key does not matter. You can use the content to explain the reason for stopping rebootmgr.
Example for enabling the stop flag:
$ consul kv put service/rebootmgr/stop "Stop rebootmgr for some reason"
If there are failed checks on certain hosts, that you want to be ignored by reboot manager, you can configure a list of hostnames, whose failed checks should be ignored.
Example for ignoring a host:
$ consul kv put service/rebootmgr/ignore_failed_checks '["some_hostname"]'
You can enable or disable Reboot Manager on individual hosts using this key.
Reboot Manager will allow reboots only if this configuration is present and well-formed, so by default, Reboot Manager will be disabled1.
To ensure that the configuration is present:
some_hostname$ rebootmgr --ensure-config
If the configuration is absent or invalid, it will create it, allowing reboots.
Example for disabling reboot manager on some_hostname
:
$ consul kv put service/rebootmgr/nodes/some_hostname/config '{"disabled": true}'
For an overview of how to register services and checks in consul, please refer to the consul documentation.
- If service X is registered to the agent that is being rebooted, health of all instances of service X in the whole cluster (also on other nodes) are taken into consideration.
- Only services with the tags
rebootmgr
,rebootmgr_postboot
andrebootmgr_preboot
will be taken into consideration. - Services tagged with
rebootmgr
are considered before and after the reboot,rebootmgr_preboot
only before andrebootmgr_postboot
only after a reboot. - Rebootmgr will consider services with consul maintenance mode enabled as broken, unless the service is tagged with
ignore_maintenance
- Rebootmgr assumes that
check_interval + check_timeout < 2 minutes
Example service definition:
{
"ID": "nova-compute",
"Name": "nova-compute",
"Tags": [
"openstack",
"rebootmgr" # leaving the service untagged, it would be ignored by rebootmgr, but still could be queried by a MAT
],
"Check": {
"Script": "check_nova_compute.sh",
"Interval": "30s"
}
}
Before and after rebooting, rebootmgr can run tasks.
Tasks are simple executable files (usually shell or python scripts).
Rebootmgr is looking for them in /etc/rebootmgr/pre_boot_tasks/
(for tasks that should be executed before rebooting) and /etc/rebootmgr/post_boot_tasks/
(for tasks that should run after rebooting).
Tasks will run in alphabetical order by filename.
If a task runtime exceeds two hours, reboot manager will fail and disable itself on that node.
If a task exits with any other code than 0
, reboot manager will fail and not reboot.
Reboot manager will reboot when one of the following is true:
- It has been invoked without the option
--check-triggers
- The consul key
service/rebootmgr/nodes/{hostname}/reboot_required
is set - The file
/var/run/reboot-required
exists
If the option --check-holidays
is specified, reboot manager will refuse to reboot on german holidays.
Footnotes
-
This is the reverse of earlier versions. We decided for safety reasons to allow reboots only when the configuration is properly present. ↩