-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handler-slack-multichannel.rb: several improvements #82
base: master
Are you sure you want to change the base?
handler-slack-multichannel.rb: several improvements #82
Conversation
- Add support for single channel defined as a string rather than an array (removes the need of having 2 different handlers) - Implement a retry strategy for handling slack api transitient errors (like timeouts). Retries hardcoded to 5 with 10 second timeout, but we should make this a setting - Exits with non zero status on api call errors, to allow sensu-server to catch the output and log it - Deals with non-nagios standard exit codes for checks (0-3), also known as "out of bounds" errors - Sets the color the same as UNKNOWN - Displays the exit code as it was returned by the check, for detailed information
Thanks for your contribution to Sensu plugins! Without people like you submitting PRs we couldn't run the project. I will review it shortly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think this looks great, let me know if you have any questions about my suggestions.
req.body = payload(text, channel).to_json | ||
# Implement a retry strategy, with 5 tries max and a timeout of 10 seconds each | ||
tries ||= 5 | ||
Timeout.timeout(10) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are gonna use timeout we should expose options to specify http level timeouts otherwise they will be obscured by a generic timeout message. See sensu-plugins/sensu-plugins-http#123 for additional context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for configurable options, we probably would like to have not only the timeout but also the number of retries, and the sleeping time in between retries, as configurable settings. Something like I proposed on https://github.com/sensu-plugins/sensu-plugins-slack/pull/83/files
On the other hand, I am not sure we do need to go as deep as specifying dns
, open
and read
timeouts. Could we not just rescue these errors from Net::
?
bin/handler-slack-multichannel.rb
Outdated
@@ -321,7 +341,8 @@ def color | |||
2 => '#FF0000', | |||
3 => '#6600CC' | |||
} | |||
color.fetch(check_status.to_i) | |||
# When the sensu check status is out of bounds, use color for UNK | |||
color.fetch(check_status.to_i, '#6600CC') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think given the overlap we should break this out into a library method that can be shared to avoid duplication: https://github.com/sensu-plugins/sensu-plugins-slack/blob/4.0.0/bin/handler-slack.rb#L206-L228
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See following comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be resolved by fac988a, please review
bin/handler-slack-multichannel.rb
Outdated
@@ -335,6 +356,7 @@ def translate_status | |||
2 => :CRITICAL, | |||
3 => :UNKNOWN | |||
} | |||
status[check_status.to_i] | |||
# When the sensu check status is out of bounds, don't hide the status but show it for detailed info | |||
status.fetch(check_status.to_i, "OUT OF BOUNDS: #{check_status.to_i}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that this is the right approach, I think we should just fall back to an unknown
. See previous comment about merging the functionality via a library method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not entirely agree with having OUT OF BOUNDS
messages declared as UNKNOWN
. An UNK in nagios-style means the check code has pruposely set this exit status and reflects the purpose of the programmer to explicitly tell the operator there is no way to know the state of the service checked.
An OOB from nagios standards means that the check script has exited somewhere in a nasty way. Typical cases are uncaught bad curl exits, missing files, missing permissions (on check script or files to read) and a myriad more.
The very reason to have OOB is to let the operator know that the check itself is failing, rather than working in an expected manner that is not able to monitor a service or resource.
I can see how it would make sense to avoid duplicity here, although you could argue likewise about the whole having a def translate_status
and a def color
which are based on the same input. We could return both in a hash and access the key we want specifically in the code.
In any case I strongly believe having OOB explicitly defined and the exit code returned to the operator for visibility is crucial to understand what is going on with the service monitored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your point about translate_status
and color
duplication, I did not initially write that bit. I added in the error handling to return a status (and corresponding color) which is understood in the sensu ecosystem in #62. More often than not changes to existing plugins are made within the confines of how it already works. I did not use that slack handler at the time so I did not really have anywhere to test it. I made the smallest and safest change that solved the problem someone was facing.
For the moment I think we should stick to the status[1] of unknown
for non [0-2]
codes because there is lack of precedence in the ecosystem. In ~5 years of consuming, contributing, and maintaining sensu I have yet to see anyone follow this convention you are proposing. While we mirror certain aspects of nagios I don't want to hold ourselves to all of their conventions as it ties our hands. I am not opposed to it in general but I think we should put out an RFC and get community feedback before declaring a new convention to be followed. I am fine with adding some additional output to tell the operator more in the meantime.
[1]: leaving the status code intact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right that it probably make sense to open this discussion to the whole community. I still advocate for my proposed convention but I will move the conversation to there for openness and greater feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, added here: sensu-plugins/community#127
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the de-duplication of translate_status
and color
, I will check if this would mean lots of further changes. If it doesn't I will go ahead with it, otherwise I will try to keep it simple for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be resolved bu fac988a, please review
Ok I will add a commit with the suggested changes here, and mark conversations as resolved accordingly. |
Keep client and check names in the title Co-Authored-By: kali-brandwatch <kali@brandwatch.com>
Change to string interpolation Co-Authored-By: kali-brandwatch <kali@brandwatch.com>
Change to string interpolation Co-Authored-By: kali-brandwatch <kali@brandwatch.com>
Change to string interpolation Co-Authored-By: kali-brandwatch <kali@brandwatch.com>
- Remove unnecessary puts for the whole request. - Fix syntax error on string interpolation. - Shorten error messages to keep clearer error messages and make rubocop happy about line length.
…dard exit codes As discussed on sensu-plugins#82 (comment) and sensu-plugins#82 (comment) Add a configurable setting for which exit status code will be used when the check returns a non standard status (non [0..3]), defaulting to 3 (UNK) Set rescues on `color` and `translate_status` to use this setting. Should resolve both comments above
@kali-brandwatch this PR should be updated to reflect that #83 was merged. A rebase is probably in order. I think it would be best to split out the OOB to another PR so we can get the rest of this merged as I don't think we have come to a consensus on how best to proceed on that front. |
Pull Request Checklist
Is this in reference to an existing issue?
General
Update Changelog following the conventions laid out here
Update README with any necessary configuration snippets
Binstubs are created if needed
RuboCop passes
Existing tests pass
New Plugins
Tests
Add the plugin to the README
Does it have a complete header as outlined here
Purpose
Known Compatibility Issues