Alerting Setup for Operate First Monitoring

Status: accepted
Deciders: hemajv, 4n4nd, anishasthana, HumairAK, tumido, mhild

Technical Story: issue-1, issue-2, issue-3

Context and Problem Statement

As we have multiple services/applications deployed and monitored in the Operate First environment (ex. Jupyterhub, Argo, Superset, Observatorium, Project Thoth, AICoE CI pipelines etc), we need to implement an incident reporting setup for handling outages/incidents related to these services.

All the services are being monitored by Prometheus. Prometheus scrapes and stores time series data identified by metric key/value pairs for each of the available services. These metrics can be used for measuring the service performance and alerting on any possible service degradation such as basic availability, latency, durability and any other applicable SLI/SLOs. These SLI/SLOs for the various services are defined and documented in the SRE repository.

Alerting with Prometheus is separated into two parts. Alerting rules in Prometheus servers send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition, aggregation and sending out notifications via methods such as email, on-call notification systems, and chat platforms.

Whether its a major bug, capacity issues, or an outage, users depending on the services expect an immediate response. Having an efficient incident management process is critical in ensuring that the incidents are always communicated to the users (via on-call notification systems) and handled by the team immediately. An on-call notification system is a system/software that provides an automated means of contacting users and communicating pertinent information during an incident. It also has additional on-call scheduling features that can be used to ensure that the right people on the team are available to address a problem during an incident. There are multiple on-call notification systems such as PagerDuty, JIRA etc that can be used for incident reporting, but which of these tools are best suitable for reporting the outages/incidents to our users?

Decision Drivers

Visibility of incident reporting
- How can the incidents be tracked and reported in an open and transparent manner for our users?
Compatibility with Prometheus
- Is the incident reporting tool compatible with Prometheus i.e. can it handle/receive Prometheus alerts?
Complexity and cost of incident reporting tool
- How easy/hard is it to manage and operate the incident reporting tool?
- Is it a free or paid tool?
- Is it open source?

Considered Options

Option 1:
- Use PagerDuty which is a popular paid on-call management and incident response platform
Option 2:
- Use open source tools like:
  - Cabot - Python/Djano based monitoring platform
  - OpenDuty - Incident escalation tool similar to PagerDuty
  - Dispatch - Incident management tool by Netflix
  - Response - Django based incident management tool
which are free, self-hosted infrastructure that provides some of the best features of PagerDuty, Pingdom etc without their cost and complexity
Option 3:
- Use GitHub Alertmanager receiver which is a Prometheus Alertmanager webhook receiver that creates GitHub issues from alerts

Decision Outcome

Chosen option: Option 3, because:

The GitHub alertmanager receiver can easily be configured and operated to function with Prometheus alerts. It automatically creates issues in GitHub repositories for any active alerts being fired, making it visible for any user to track
All communication/updates/concerns related to the incident can be easily handled by adding comments in the issues created by the GitHub receiver
Unlike Option 1, there is no additional cost involved
There is no requirement for using JIRA/Slack for incident tracking, which are the only supported options in some of the tools listed in Option 2 (such as Dispatch and Response) In any case that such a requirement surfaces, we can use GitHub bots for different platforms such as GitHub for Slack and Google Chat to notify us of the issues immediately
It is actively being maintained and supported compared to some of the tools in Option 1 (such as Cabot and OpenDuty) which lack community support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

0007-alerting-setup.md

0007-alerting-setup.md

Alerting Setup for Operate First Monitoring

Context and Problem Statement

Decision Drivers

Considered Options

Decision Outcome

Files

0007-alerting-setup.md

Latest commit

History

0007-alerting-setup.md

File metadata and controls

Alerting Setup for Operate First Monitoring

Context and Problem Statement

Decision Drivers

Considered Options

Decision Outcome