Overview
This monitoring tool continuously checks the health of multiple services currently under development. It measures availability (is the service reachable and responding correctly?) and error rate (are requests failing, and how often?), and then notifies developers when behavior indicates an outage or serious degradation.
The goal is to detect issues quickly, reduce time-to-awareness, and provide a consistent, shared standard for what “healthy” means across services.
What the tool does
The tool provides three core capabilities:
- Service health monitoring
  - Runs recurring health checks against each registered service (e.g., HTTP endpoint checks).
  - Confirms both reachability and correctness (not just “port open,” but “expected response”).
  - Tracks availability over time.
- Error-rate monitoring
  - Collects request/response outcomes (success vs failure).
  - Calculates error rate over rolling windows (for example, 1-minute, 5-minute, and 15-minute windows).
  - Separates transient blips from sustained degradation using thresholds and minimum-sample rules.
- Alerting and developer notification
  - Sends alerts when a service is likely down (availability failure) or when error rate crosses defined thresholds.
  - Routes alerts to the owning developers/team (e.g., messaging channel, email, or pager/on-call integration, depending on how we wire it up).
  - Includes actionable context in each alert: affected service, time detected, current symptoms, recent trend, and links to logs/dashboards (where available).
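As a rough illustration of the error-rate capability, the sketch below shows one way a rolling window, a threshold, and a minimum-sample rule could fit together. It is not the tool's actual implementation: the class name RollingErrorRate, its fields, and the default values (5-minute window, 5% threshold, 20 samples) are all assumptions chosen for the example, and it assumes outcomes are recorded in non-decreasing timestamp order.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RollingErrorRate:
    """Hypothetical helper: tracks request outcomes in a rolling time window."""
    window_seconds: float = 300.0  # assumed 5-minute window
    threshold: float = 0.05        # assumed: alert above 5% errors
    min_samples: int = 20          # assumed: too few requests means no verdict
    _events: deque = field(default_factory=deque)  # (timestamp, is_error) pairs

    def record(self, timestamp: float, is_error: bool) -> None:
        """Add one request outcome and drop outcomes older than the window."""
        self._events.append((timestamp, is_error))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Timestamps are assumed to arrive in non-decreasing order,
        # so the oldest event is always at the left end of the deque.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def error_rate(self, now: float) -> Optional[float]:
        """Error fraction in the window, or None when below min_samples."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_samples:
            return None
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total

    def should_alert(self, now: float) -> bool:
        """True only when there are enough samples and the threshold is exceeded."""
        rate = self.error_rate(now)
        return rate is not None and rate > self.threshold


monitor = RollingErrorRate(window_seconds=60.0, threshold=0.05, min_samples=10)
for t in range(3):
    monitor.record(float(t), is_error=True)
print(monitor.should_alert(now=3.0))  # False: only 3 samples, below min_samples
```

The minimum-sample rule is what keeps a handful of failed requests on a low-traffic service from triggering an alert. Running several instances with different window lengths (for example 1, 5, and 15 minutes) would be one way to tell a transient blip from sustained degradation.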
Why we need it
Services under active development tend to change frequently, which increases the likelihood of regressions, dependency issues, and configuration drift. This tool exists to:
- Reduce mean time to detect (MTTD): developers learn about failures within minutes, not hours.
- Improve uptime and confidence: availability is measured consistently and reported transparently.
- Create a shared operational baseline: teams agree on thresholds, severity, and ownership.
- Support better release habits: monitoring and alerting make it safer to ship more often, as failures surface quickly and are easier to correlate with recent changes.
Key concepts (definitions)
- Availability: The percentage of time a service is responding successfully to health checks (and/or serving successful requests) within a defined window.
- Error rate: The percentage of requests resulting in errors (e.g., 5xx responses, timeouts, failed dependency calls), calculated over a rolling window.
- Incident: A sustained condition where a service is down or severely degraded and requires developer action.
- Alert: A notification that an incident may be occurring (triggered by availability failures or error-rate thresholds).
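To make the availability and error-rate definitions above concrete, here is a minimal sketch of how each could be computed from raw observations. The function names and the classification rule are assumptions for illustration only: 5xx responses and timeouts count as errors, 4xx responses do not, and failed dependency calls are not modeled separately.

```python
from typing import Optional, Sequence, Tuple


def availability_percent(check_results: Sequence[bool]) -> Optional[float]:
    """Percentage of health checks in the window that succeeded.

    Returns None when no checks ran, so "no data" is not reported as 0%.
    """
    if not check_results:
        return None
    return 100.0 * sum(check_results) / len(check_results)


def is_error(status_code: Optional[int], timed_out: bool = False) -> bool:
    """Classify one request outcome (assumed rule: 5xx or timeout is an error)."""
    if timed_out or status_code is None:
        return True
    return 500 <= status_code <= 599


def error_rate_percent(
    outcomes: Sequence[Tuple[Optional[int], bool]]
) -> Optional[float]:
    """Percentage of (status_code, timed_out) outcomes classified as errors."""
    if not outcomes:
        return None
    errors = sum(1 for status, timed_out in outcomes if is_error(status, timed_out))
    return 100.0 * errors / len(outcomes)


# Three of four health checks passed in this window.
print(availability_percent([True, True, True, False]))  # 75.0
```

Returning None for an empty window is a deliberate choice in this sketch: a service with no traffic or no completed checks is "unknown", which alerting logic may want to treat differently from "down".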
Alert conditions (high-level)