Health Monitoring & On-Call Management
| Field | Value |
|---|---|
| Status | Draft |
| Owner | TBD |
| Contributors | SRE, Platform, Compliance |
| Proposed under | PRD-0002 May 2026 Unification, Simplification, Stabilization |
| Date | 2026-04-23 |
1. Executive summary
Problem. We have no canonical on-call stack. PagerDuty was used in the past, then Jira Service Management (JSM), and we have drifted between both without a single source of truth for who is on call, what the runbook is, how an incident is declared, and how it is reviewed afterwards. For a company operating under Part 135 and providing UTM services under ASTM F3548-21, that gap is a regulatory exposure as well as an engineering one.
Proposal. Adopt a lean, low-cost, preferably self-hosted stack for health monitoring and on-call management, with the following non-negotiables:
- One source of truth for health signals (service up/down, SLO burn, job failures, safety-relevant alerts from UTM and onboard).
- One source of truth for who is on call, with schedules, escalations, and heartbeats.
- AI-extensible — webhook-in and webhook-out so we can plug Gemini / Claude / local models into summarization, triage, and first-response drafting without vendor lock-in.
- Baked-in procedures for incident declaration, response, and post-incident review, so that Part 135 and UTM reporting obligations are satisfied by the normal flow rather than by a separate human-chased checklist.
Out of scope. Rebuilding the underlying observability backend (covered by the Argus OTel proposal). Replacing Jira as an engineering work tracker.
2. Why this belongs under the May 2026 umbrella
- Unification (U). Collapses PagerDuty-era runbooks + JSM-era scheduling + ad-hoc Slack incident channels into one declared incident-response surface.
- Simplification (S). Retires at least one paid tool (PagerDuty or JSM, depending on the option chosen) and one overlapping incident-tracking flow.
- Stabilization (R). Without a declared on-call rotation and a working alerting pipeline, “stabilization” is a claim we cannot evidence. This is the operational half of the stabilization outcome.
- Cost (C). PagerDuty + JSM licensing is recurring spend that a lean self-hosted option can largely eliminate. Exact line items quantified during discovery.
- Modernization (M). AI-assisted incident summarization, triage, and first-response drafting is one of the highest-value applications of AI we can deploy this year. The stack must be built to allow it, not retrofitted.
- Demonstrability (D). The golden-path demo fails cleanly when a shared service breaks. “Fails cleanly” means the alert arrives at a named on-call person through a real pipeline, not in a Slack DM to whoever happens to be online.
Primary contribution: Simplification + Modernization, with direct Cost and Stabilization gains.
3. Drivers and requirements
3.1 Regulatory drivers
- Part 135. As a Part-135 operator, we have incident-reporting, maintenance, and operational-control obligations that require timestamped, auditable records of who was notified, when, and what was done. The on-call system is the system of record for the “who was notified, when, and by whom” half of that.
- UTM / USS (ASTM F3548-21). USS operational obligations include responding to service degradation in bounded time. We need a pipeline where UTM health alerts reach an on-call person with a defined response window, and where the response is recorded in a form compatible with post-incident reporting to regulators or peer USSs.
- Audit trail. Every alert, acknowledgement, escalation, and resolution is durably logged. The log is export-capable for regulator requests.
3.2 Functional requirements
- Schedules and escalations — named rotations, timezone-aware, holiday-aware, overridable; escalate on non-acknowledgement within N minutes; multi-level fallback.
- Channels — at minimum phone + SMS + push + email + Slack. Regulatory carve-out: phone and SMS are non-optional for primary escalation.
- Heartbeats — dead-man switches for critical services; if a heartbeat stops, an alert fires automatically.
- Incident declaration — one-click (or one-command) from an alert to a declared incident, with an incident channel, a named incident commander, a running timeline, and a draft post-mortem.
- Webhooks in and out — alerts can be ingested from anywhere that can POST JSON and fanned out to anywhere that can receive JSON. This is the AI-integration surface (see the payload sketch after this list).
- Runbook links — every alert carries a link to the service runbook; runbooks are version-controlled and rendered alongside the alert.
- Post-incident review — the stack produces a structured post-mortem artifact at incident close, linkable to the regulatory record.
- SLO / SLI — we can declare SLOs per service and alert on burn rate, not just on a metric crossing a threshold.
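To make the burn-rate requirement concrete, here is a minimal sketch of multi-window burn-rate alerting. The SLO target, window pair, and the 14.4 threshold are illustrative defaults (the common fast-burn setting for a 30-day window), not decisions; each in-scope service sets its own numbers when its SLO is declared.

```python
# Minimal sketch of multi-window burn-rate alerting (assumed numbers, not decisions).
SLO_TARGET = 0.999                 # e.g. 99.9% availability over a 30-day window
ERROR_BUDGET = 1.0 - SLO_TARGET    # fraction of requests allowed to fail in the window

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is burning."""
    return error_ratio / ERROR_BUDGET

def should_page(error_ratio_1h: float, error_ratio_6h: float, threshold: float = 14.4) -> bool:
    """Fast-burn page: both a short and a long window must exceed the threshold,
    which catches sustained burn while suppressing pages on brief blips."""
    return burn_rate(error_ratio_1h) > threshold and burn_rate(error_ratio_6h) > threshold

# Example: 2% of requests failing over the last hour and the last six hours
# burns budget ~20x faster than sustainable, so the on-call gets paged.
assert should_page(error_ratio_1h=0.02, error_ratio_6h=0.02)
```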
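And a minimal sketch of the webhook-in convention referenced above: anything that can POST JSON can raise an alert or report a heartbeat, and every alert carries its runbook link. The endpoint URLs and field names are placeholders we would define at selection time, not any candidate tool's actual API.

```python
# Minimal sketch of the webhook-in surface. URLs and field names are placeholders,
# not the chosen tool's real API; the point is that plain JSON over HTTP is enough.
import json
import urllib.request

ONCALL_BASE = "https://oncall.staging.internal"  # placeholder endpoint

def post_json(url: str, payload: dict) -> None:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

# An alert: severity, the owning service, and the runbook link required by 3.2.
post_json(f"{ONCALL_BASE}/integrations/webhook", {
    "source": "utm-conformance-monitor",
    "severity": "critical",
    "summary": "Conformance monitoring latency above SLO threshold",
    "runbook_url": "https://docs.staging.internal/runbooks/utm-conformance",
    "timestamp": "2026-05-04T03:12:00Z",
})

# A heartbeat: pinged on a schedule by the monitored job; if pings stop,
# the on-call tool raises the alert itself (dead-man switch).
post_json(f"{ONCALL_BASE}/heartbeats/golden-path-demo-runner", {"status": "ok"})
```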
3.3 Non-functional requirements
- Lean operational footprint. Must run on our existing GKE or Cloud Run without demanding a new team to operate it.
- Low recurring cost. Preferred: self-hosted with near-zero per-seat cost. Acceptable: a paid tool under ~$500/month total for the foreseeable team size.
- AI-extensible. Webhook-in + webhook-out at the alert level and at the incident level. No vendor-specific AI lock-in; we will route to Vertex AI (Gemini) today and possibly elsewhere tomorrow.
- Data residency on GCP. Incident data, alert history, schedules — stored in GCP-region-bound storage where practical.
- Mobile usability. On-call staff will primarily interact from a phone at 3 AM; the mobile experience cannot be an afterthought.
4. Candidate options
Option A — Grafana OnCall + Grafana LGTM (self-hosted on GKE)
The straightforward open-source answer. Grafana OnCall handles scheduling/escalation/heartbeats; the existing Grafana Loki/Mimir/Tempo stack handles signals. Alerts flow through Alertmanager or directly into OnCall.
- Pros. Best-known OSS on-call; largest community; integrates cleanly with Prometheus/Alertmanager and with the Grafana UI we already use. Webhook integrations are well-documented. Native SLO support via Grafana SLO.
- Cons. Two things to operate (OnCall + the LGTM backend); Grafana’s OnCall product direction has been debated in the community — we should confirm active OSS stewardship at selection time. Mobile app is good but not PagerDuty-class.
- AI extension. Alert webhook → Cloud Function → Vertex AI (Gemini) → post summary back to the incident channel. Straightforward.
- Cost. Compute + storage on GKE; no per-seat license.
Option B — SigNoz (unified OTel-native, self-hosted)
OTel-native single-UI observability + incident management on ClickHouse. A natural fit if we also move toward OTel onboard (see Argus proposal).
- Pros. One UI and one backend instead of a fragmented stack; ClickHouse is cost-efficient on GCS-backed persistent disks; OTel-native, which pairs well with the Argus onboard direction.
- Cons. Incident-management features are growing but less mature than Grafana OnCall; on-call scheduling/escalation is thinner and may need pairing with GoAlert or similar for real rotation management. Newer, smaller community.
- AI extension. OTel-native; collector sidecars make it easy to route telemetry to an LLM for anomaly detection or root-cause-analysis (RCA) drafting.
- Cost. Compute + ClickHouse storage on GCS; no per-seat license.
Option C — GoAlert + Existing Backend
GoAlert is a focused, lightweight on-call scheduler — small blast radius, easy to operate. Pair with whatever we already have for signals.
- Pros. Tiny operational footprint (runs on Cloud Run); Target pedigree; focused on scheduling and escalation, not trying to be everything.
- Cons. Not a signals backend — needs a separate system for metrics/logs/alerting rules. No incident-management UI; we’d pair with Dispatch or similar.
- AI extension. Webhook in and out; easy to hook.
- Cost. Cloud-Run-level cost; no per-seat license.
Option D — Netflix Dispatch (incident framework) + Scheduler of Choice
Dispatch is opinionated about the process of incident response: Slack channels, ticket creation, timeline, post-mortem. Pair with any scheduler (GoAlert, OnCall, or a paid tool for the alerting side).
- Pros. Encodes the process — exactly what we need to bake in for Part 135 and UTM. Scriptable and Python-native.
- Cons. Not a scheduler; must be paired with one. Smaller user base than Grafana OnCall.
- AI extension. Excellent — designed to be scripted. An “AI incident commander” that drafts the timeline and first responder actions is plausible.
- Cost. Compute only.
Option E — Better Stack (paid, hosted)
Hosted, low-ops, fair pricing, good AI features (noise reduction, auto-grouping).
- Pros. Zero ops; good UX; modern AI features built in.
- Cons. Per-seat cost recurs forever; data lives off-GCP; lock-in risk on the AI features.
- AI extension. Built-in summarization/noise-reduction; webhook-out for custom extension possible but less open than the self-hosted options.
- Cost. Starts free / ~$29 per user per month at the tier we’d use.
Option F — What we currently have (baseline)
Keep PagerDuty + JSM as-is. Costs the most, unifies the least, delivers regulatory coverage only because we know the product. Included for comparison.
Selection matrix (scored at spike end, not now)
| Criterion | A (OnCall) | B (SigNoz) | C (GoAlert) | D (Dispatch) | E (Better Stack) | F (PD+JSM) |
|---|---|---|---|---|---|---|
| Lean ops | | | | | | |
| Low recurring cost | | | | | | |
| AI-extensible | | | | | | |
| Schedules / escalations quality | | | | | | |
| Incident-process support | | | | | | |
| Regulatory audit trail | | | | | | |
| Mobile UX for on-call | | | | | | |
| Time to stand up on staging | | | | | | |
The May-2026 spike (see §6) fills this matrix. The recommendation then lands on the PRD.
5. Regulatory & compliance
- Part 135. All alerts, acknowledgements, escalations, and resolutions are durably logged with timestamp, actor, and content. Logs are retained to the Part-135 record-keeping requirement and exportable on request (a minimal record sketch follows this list).
- ASTM F3548-21 / UTM. USS health alerts route to on-call within the bounded response time required by the applicable service-level obligations. The exact threshold is captured in the SLO definition for each UTM service.
- Data residency. Incident records and alert history stored in GCP-region-bound storage where the chosen option permits; where it does not (Option E), a residency evaluation is documented and accepted or rejected explicitly.
- DCM change control. Any change to production alert routing for safety-relevant services routes through the DCM Change Request process.
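A minimal sketch of the audit record named in the Part-135 bullet above, assuming one append-only event per alert, acknowledgement, escalation, and resolution, exported as JSON lines for a regulator request. The field names and example values are ours and illustrative; they are not any candidate tool's schema.

```python
# Minimal sketch of an append-only audit-trail record and its export format.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass(frozen=True)
class AuditEvent:
    incident_id: str
    event: str            # "alert" | "ack" | "escalation" | "resolution"
    actor: str            # person or system that took the action
    content: str          # what was done or said
    timestamp: str        # RFC 3339, UTC

def export_jsonl(events: list[AuditEvent]) -> str:
    """Export for regulator requests: one JSON object per line, in time order."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    return "\n".join(json.dumps(asdict(e)) for e in ordered)

# Illustrative events only, not a real incident.
events = [
    AuditEvent("INC-0042", "alert", "heartbeat/utm-dss",
               "Heartbeat missed three consecutive intervals",
               datetime(2026, 5, 4, 3, 12, tzinfo=timezone.utc).isoformat()),
    AuditEvent("INC-0042", "ack", "oncall:sre-primary",
               "Acknowledged within 4 minutes",
               datetime(2026, 5, 4, 3, 16, tzinfo=timezone.utc).isoformat()),
]
print(export_jsonl(events))
```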
6. May 2026 scope
6.1 In scope
- Spike — a 1-week bounded evaluation of options A, B, C, and optionally D, scored against the §4 matrix. Output: a written recommendation.
- Pick one — CEO sign-off on the recommendation at the end of the spike.
- Stand up the chosen option on staging — deployed to GKE or Cloud Run; admins named; schedules for the SRE and Platform rotations imported; at least three services wired to it end-to-end (one UTM, one Uncrew cloud, the golden-path demo runner).
- Baked-in incident procedure — incident declaration one-click from an alert; Slack channel auto-created; incident commander named; timeline captured; post-mortem artifact produced at close.
- One AI integration — alert → webhook → Vertex AI (Gemini) → post a 3-sentence summary and a ranked list of the top 3 likely causes back to the incident channel (see the sketch after this list). Human-gated; no automated action.
- Documentation — runbook-link convention, alert-author guide, on-call-rotation doc, and a “how to declare an incident” page. All in this docs repo.
- Retire the PagerDuty-or-JSM overlap on staging. Old surface kept readable but not accepting new alerts.
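A minimal sketch of the in-scope AI integration, assuming a Cloud Function HTTP entry point and the Vertex AI Python SDK's generative-models client. The project, region, model name, and Slack webhook URL are placeholders, and the prompt wording is illustrative; the only firm requirements are the advisory labeling and the absence of automated action.

```python
# Minimal sketch of the advisory summarization hook. Project, region, model name,
# and the Slack webhook URL are placeholders; the prompt wording is illustrative.
import json
import urllib.request

import functions_framework
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="our-gcp-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.5-pro")                          # placeholder model name
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

@functions_framework.http
def summarize_alert(request):
    """HTTP entry point: alert JSON in, advisory summary posted to the incident channel."""
    alert = request.get_json(silent=True) or {}
    prompt = (
        "You are assisting an on-call engineer. From the alert below, write exactly three "
        "sentences summarizing what is failing, then list the top 3 most likely causes, "
        "ranked. Do not recommend or take any action.\n\n" + json.dumps(alert, indent=2)
    )
    summary = model.generate_content(prompt).text

    # Advisory only: clearly labeled, never blocks acknowledgement, triggers no action.
    message = {"text": f"*AI-generated summary (advisory, not authoritative)*\n{summary}"}
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
    return "ok", 200
```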
6.2 Out of scope
- Production cutover — staging-only in May; production routes through DCM Change Request.
- Replacing Loki/Mimir as the signals backend — covered by Argus; independent.
- Replacing Jira as an engineering tracker — not the same thing as JSM-as-on-call.
- Automated incident response actions — AI summarizes; AI does not act. A later proposal can widen the aperture once we have evidence.
- Full SLO coverage for every service — SLOs declared for the three in-scope services; the rest is follow-up.
7. Risks
- “Run another thing” drag. Self-hosted means we operate another stack. Mitigation: preference for options with the smallest operational footprint (GoAlert, Dispatch, or OnCall on GKE Autopilot).
- Regulatory gap during cutover. If staging is the new source of truth but production still runs on PagerDuty/JSM, a split-brain period is unavoidable. Mitigation: staging-only in May; production cutover is its own DCM CR with a written cutover plan.
- Vendor abandonment of OSS on-call. If the chosen project stops being maintained, we inherit a fork. Mitigation: pick at selection time based on active stewardship, not hype; design the alert-ingest layer (webhook-based) so the scheduler is swappable.
- AI extension noise. LLM summaries that hallucinate causes are worse than no summary. Mitigation: summaries are advisory, clearly labeled AI-generated, never block acknowledgement, and are measured against on-call feedback for the first 30 days.
8. Phased rollout
- Week 1 — spike. Stand up the candidate options in a throwaway staging namespace. Score the §4 matrix. Written recommendation + CEO sign-off.
- Week 2 — deploy chosen option. Import the SRE and Platform rotations. Wire up the first three services.
- Week 3 — bake in the incident procedure. One-click declaration; Slack + timeline + post-mortem. Dry-run incident on the golden-path demo.
- Week 4 — AI integration + docs. Gemini summarization webhook live; runbook convention and on-call docs published. PagerDuty/JSM overlap retired on staging.
9. Estimation input
| Story | Optimistic (d) | Likely (d) | Pessimistic (d) | Domain |
|---|---|---|---|---|
| Option-selection spike (A, B, C, optionally D) | 3 | 5 | 7 | [sre][platform] |
| Deploy chosen option to staging (GKE or Cloud Run) | 2 | 4 | 7 | [infra][sre] |
| Schedule import + first three service integrations | 2 | 4 | 6 | [sre] |
| Baked-in incident declaration + post-mortem artifact | 3 | 5 | 8 | [sre][platform] |
| Vertex-AI summary webhook integration | 1 | 2 | 4 | [platform][ai] |
| Runbook convention, on-call docs, declare-an-incident doc | 2 | 3 | 5 | [docs][sre] |
| Retire PagerDuty/JSM overlap on staging | 1 | 2 | 4 | [sre] |
10. Open questions
- Do we have a hard regulatory retention period for alert/incident history under Part 135, and does the chosen option meet it without a separate archive?
- Phone + SMS escalation — do we use the chosen option’s built-in provider, or route through Twilio so we can keep the carrier layer separate?
- Does UTM have a defined response-time SLO today that we can encode as an alert threshold, or do we need to define one in this PRD?
- Does the CEO have a strong preference between “self-hosted lean” and “paid low-ops”? That choice changes the spike shape.
- Who owns on-call rotation governance — SRE, Platform, or a rotating duty?
11. References
- Gemini research note (internal, April 2026) — candidate-option background
- PRD-0002 May 2026 Unification, Simplification, Stabilization
- Argus — OTel-Native Onboard Observability (paired signals stack)
- Grafana OnCall
- SigNoz
- GoAlert
- Netflix Dispatch
- Better Stack