Service Level AOI
http://eapd.cms.gov/ is available 99.9% of the year (all but ~8 hours):
- 2 hours for major infrastructure upgrades that are cost-prohibitive to deploy blue/green
- 6 hours per year for AWS outages and other management reserve
- Site fully loads in < 5 seconds
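The downtime budget implied by the availability target can be checked with a little arithmetic. The sketch below is illustrative only: the function name `downtime_budget_hours` is hypothetical, and it assumes a 365-day year. It shows that 99.9% availability allows roughly 8.76 hours of downtime per year, which covers the 2 + 6 hours allocated above.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_budget_hours(availability_pct: float) -> float:
    """Hours per year the site may be unavailable at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


budget = downtime_budget_hours(99.9)
allocated = 2 + 6  # infrastructure upgrades + AWS outages / management reserve
print(f"budget: {budget:.2f} h, allocated: {allocated} h, "
      f"unallocated: {budget - allocated:.2f} h")
```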
The primary goal here is to define indicators (metrics) that are useful and actionable. MACPro eAPD’s mission is to support the states' creation, submission, and response process for APDs and contracts.
In order to support the states, the system needs to be available and responding in a reasonable amount of time. As such, the most important primary indicators will be the ratio of successful responses and the latency in those responses. All other indicators can help indicate or predict other deeper issues, but do not directly indicate the ability to serve information.
Monitoring efforts should therefore focus on:
- Availability Monitoring
- Rate of HTTP 4XX/5XX responses from the ELB
- Duration of HTTP 200 responses from ELB targets
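As a rough illustration of the two primary indicators, the sketch below computes an error ratio and a p95 latency from a list of request records. It is not how the ELB or CloudWatch computes these values: the `Request` type, the function names, and the nearest-rank percentile method are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    duration_s: float  # time taken to respond, in seconds


def error_rate(requests):
    """Fraction of requests that returned a 4XX or 5XX status."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if 400 <= r.status <= 599)
    return errors / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p95_success_latency(requests):
    """p95 duration of HTTP 200 responses only."""
    return percentile([r.duration_s for r in requests if r.status == 200], 95)


# Hypothetical sample: 18 successful requests plus two failures.
sample = [Request(200, i / 10) for i in range(1, 19)]
sample += [Request(500, 0.05), Request(404, 0.02)]
print(error_rate(sample), p95_success_latency(sample))
```

Restricting the latency figure to successful responses keeps fast failures from making the service look healthier than it is.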
Indicators that are generally NOT useful for availability monitoring, but may help with troubleshooting, proper provisioning and configuration changes:
- CPU usage
- Memory Usage
(High levels of CPU and memory usage generally indicate proper provisioning of resources and should therefore be considered carefully when investigating a reported issue.)
Endpoint | Service | Status | Alert |
---|---|---|---|
https://eapd.cms.gov/ | Primary | Not Available | Alert |
Service | Metric | Stat | Period | Threshold | # of Datapoints | Actions | Notes |
---|---|---|---|---|---|---|---|
Application ELB | HTTPCode_Target_5XX_Count | Average | 5 minutes | > 200 | 2 | Alert | |
Application ELB | HTTPCode_Target_5XX_Count | Average | 5 minutes | > 50 | 5 | Warn | |
Application ELB | TargetResponseTime | p95 | 5 minutes | > 2.5 seconds | 2 | Alert | |
Application ELB | TargetResponseTime | p95 | 5 minutes | > 0.5 seconds | 5 | Warn | |
Instances | HealthyHostCount | Average | 5 | 1 | 1 | Alert | |
Instances | CPUUtilization | n/a | n/a | < 20% | n/a | Alert | |
RDS | CPUUtilization | Average | 5 | > 50% | 2 | Warn | |
Lambdas | Errors | n/a | n/a | > 0 | n/a | Alert | |
- Alerts would trigger an alerting mechanism for investigation by on-call
- Warnings would generate a report that should be investigated
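The threshold and datapoint columns in the table above can be read as a simple rule: an action fires when the most recent N datapoints all breach the threshold. The sketch below models that reading for the two `TargetResponseTime` rows. The `Rule` type and the `breached` and `evaluate` helpers are hypothetical, and treating "# of Datapoints" as consecutive breaching datapoints is an assumption, not a statement of how the alarms are actually configured.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Rule:
    metric: str
    compare: Callable[[float, float], bool]  # e.g. operator.gt for "> threshold"
    threshold: float
    datapoints: int                          # consecutive breaches required (assumed)
    action: str                              # "Alert" or "Warn"


SEVERITY = {"Warn": 1, "Alert": 2}

# The two TargetResponseTime rows from the table (p95, 5-minute periods).
RESPONSE_TIME_RULES = [
    Rule("TargetResponseTime", operator.gt, 2.5, 2, "Alert"),
    Rule("TargetResponseTime", operator.gt, 0.5, 5, "Warn"),
]


def breached(rule: Rule, series: Sequence[float]) -> bool:
    """True if the last `rule.datapoints` values all breach the threshold."""
    recent = series[-rule.datapoints:]
    return (len(recent) == rule.datapoints
            and all(rule.compare(v, rule.threshold) for v in recent))


def evaluate(rules: Sequence[Rule], series: Sequence[float]) -> Optional[str]:
    """Return the most severe triggered action, or None if nothing fired."""
    triggered = [r.action for r in rules if breached(r, series)]
    return max(triggered, key=SEVERITY.__getitem__, default=None)


# Hypothetical p95 values in seconds, one per 5-minute period, oldest first.
print(evaluate(RESPONSE_TIME_RULES, [0.6, 0.7, 0.8, 0.9, 3.0]))
print(evaluate(RESPONSE_TIME_RULES, [0.1, 0.2, 2.6, 3.0]))
```

Under this reading, a short sharp spike pages on-call quickly (2 datapoints), while a sustained mild slowdown only produces a warning after 25 minutes (5 datapoints).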