Argus — OTel-Native Onboard Observability

Andi Lamprecht · 11 min read · Draft
Field | Value
Status | Draft
Owner | TBD
Contributors | Onboard, Observability, Infra
Proposed under | PRD-0002 May 2026 Unification, Simplification, Stabilization
Date | 2026-04-23

1. Executive summary

Problem. Argus — our onboard observability — is a loose bundle of a Grafana agent container, a Python script that scrapes the HALO cellular modem, and a second Python process that reads a narrow subset of MAVLink messages. Everything is shipped as log lines to stdout, picked up by the Grafana agent, and forwarded to the cloud. The stack is not integrated with uncrew-mavlink-shim, does not use any of the OTel signal types beyond logs, covers only a fraction of the MAVLink surface, and has no story for host-system telemetry or for running consistently across different onboard compute platforms.

Proposal. Replace it with an OTel-native onboard observability agent, tightly integrated with the onboard software lifecycle (installed and managed by the Inventory Service, discoverable as a first-class onboard component). The agent emits logs, metrics, and traces using OTel semantics; captures the full MAVLink message stream from the shim; captures host-system telemetry (journal, CPU temp, memory, IO, dmesg); scrapes the cellular modem as a first-class integrated source; and runs across multiple onboard compute targets behind a hardware abstraction layer.

Out of scope. The cloud-side ingest stack. Mimir and Loki stay as-is for this PRD. A later migration to GCP-native (Cloud Logging, Google-Managed Prometheus) or the previously-planned ClickHouse / ClickStack is explicitly deferred.

2. Why this belongs under the May 2026 umbrella

  • Unification (U). Collapses three independently-managed onboard processes (Grafana agent + HALO-scraper + MAVLink-reader) into one integrated agent. Unifies signal types (logs, metrics, traces) under OTel instead of “everything is a log line.”
  • Simplification (S). Retires two Python sidecars and one bespoke Docker configuration. Replaces ad-hoc stdout log-line parsing with typed OTel signals.
  • Stabilization (R). A full MAVLink capture is the single most valuable source for post-incident investigation we currently don’t have. It moves us from “reproduce the failure in the lab” to “replay the actual message stream from the failed flight.”
  • Modernization (M). Onboard telemetry becomes structured, queryable, and AI-legible — a precondition for the AI-assisted incident-analysis workflows tracked under Primary Outcome 6.
  • Cost (C). Metrics compress much better than log lines over cellular. Moving host telemetry and MAVLink to metrics where appropriate reduces cellular data volume and downstream ingest cost. Exact savings TBD during discovery.

Primary contribution: Unification and Simplification, with a structural gain in Stabilization via full MAVLink capture.

3. Problem detail

3.1 What Argus is today

  • Grafana agent runs in a separate Docker container on the aircraft. Forwards stdout log lines from other containers to the cloud Grafana stack (Loki for logs, Mimir for metrics).
  • HALO modem Python script scrapes the HALO cellular modem for signal strength, connection state, and data-usage counters. Emits stdout log lines.
  • MAVLink reader Python script subscribes to a small, hand-picked subset of MAVLink messages. Emits stdout log lines.
  • uncrew-mavlink-shim emits its own stdout log lines as a side effect of normal operation. The Grafana agent picks those up too.

3.2 What is wrong with it

  1. Everything is a log line. OTel’s metric and trace signal types are unused. Metrics compress much better than logs, which matters a great deal on a metered cellular link. Traces would give us causality across the shim, the MAVLink bus, and onboard services — we currently have to correlate timestamps across unrelated log streams.
  2. MAVLink coverage is a narrow hand-picked subset. When an incident happens, the messages we actually need are usually not captured. We need a full MAVLink log per flight.
  3. The agents are not integrated. The Grafana agent, the HALO scraper, and the MAVLink reader are three independent processes with independent lifecycles, independent failure modes, and no awareness of the onboard software install/update flow. They require manual wiring for every new onboard deployment.
  4. No host-system telemetry. We do not capture the journal, CPU temperature, memory pressure, IO pressure, or dmesg. When a flight fails for a host-level reason (thermal throttle, OOM, disk full, kernel driver error), we have no signal.
  5. No hardware abstraction. The current stack assumes a single target. We already have to run on Ubuntu hosts, Raspberry Pi variants, and vendor-specific compute units (Auterion Skynode today; Blueflite and other PX4/ArduPilot-class platforms coming). The same agent has to run across all of them.

4. Target state

4.1 Integrated OTel-native agent

One onboard observability agent, installed and lifecycle-managed by the Inventory Service (see Onboard Software Deployment via Inventory Service). The agent:

  • Emits OTel logs, metrics, and traces over OTLP.
  • Is discoverable as a first-class onboard component through the same registration/approval path as the rest of the onboard software.
  • Runs as the appropriate shape for the target platform — sidecar container, systemd-managed process, or journal integration — chosen by the HAL, not by the ops person deploying it.
  • Has no manual wiring step between itself and the MAVLink shim, the host, or the modem. Discovery is automatic.

4.2 Full MAVLink capture, through the shim

uncrew-mavlink-shim is the one process that already sees every MAVLink message on the vehicle. The observability agent consumes the shim’s message stream through a defined contract — not by re-subscribing to the MAVLink bus independently — and emits:

  • A per-flight MAVLink trace — every MAVLink message captured, timestamped, available for cloud-side replay.
  • Derived metrics for high-cardinality MAVLink signals where a metric is more efficient than a message dump (attitude rates, battery voltage, GPS fix quality, link-layer error counters). These reduce cellular volume without losing the ability to reconstruct the detailed record when needed.

The contract between shim and agent is versioned and stable. The shim is the single source of MAVLink truth; the agent does not bypass it.
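To make the read side concrete, here is a minimal Go sketch of the agent consuming the shim stream, assuming the unix-socket, length-prefixed framing named as the starting assumption in §7. The socket path and the 4-byte big-endian length prefix are illustrative placeholders, not the decided contract.

```go
// Sketch only: the real framing and socket path are settled in design.
package main

import (
	"encoding/binary"
	"io"
	"log"
	"net"
)

// consumeShimStream reads length-prefixed MAVLink frames from the shim's
// unix socket and hands each raw frame to the capture/metrics pipeline.
func consumeShimStream(socketPath string, handle func(frame []byte)) error {
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return err
	}
	defer conn.Close()

	var lenBuf [4]byte
	for {
		// Each frame: 4-byte big-endian length, then one raw MAVLink message.
		if _, err := io.ReadFull(conn, lenBuf[:]); err != nil {
			return err
		}
		n := binary.BigEndian.Uint32(lenBuf[:])
		frame := make([]byte, n)
		if _, err := io.ReadFull(conn, frame); err != nil {
			return err
		}
		handle(frame)
	}
}

func main() {
	// Placeholder path; the HAL supplies the real one per target.
	err := consumeShimStream("/run/uncrew/mavlink-shim.sock", func(frame []byte) {
		// Append to the per-flight capture and update derived metrics here.
	})
	log.Fatal(err)
}
```

The handler is the single point where both the per-flight capture append and the derived-metric updates hang off the same frame, which is what keeps the shim the single source of MAVLink truth.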

4.3 Host-system telemetry

The agent captures, as OTel metrics where possible and logs where necessary (see the collection sketch after this list):

  • systemd-journald output from relevant services.
  • CPU temperature, per-core utilization, throttle events.
  • Memory pressure (PSI where available, free/available breakdown elsewhere).
  • IO pressure, disk usage, IOPS.
  • dmesg / kernel log (driver errors, USB resets, thermal events).
  • Filesystem fill and inode pressure, specifically on the partitions used for MAVLink logs and flight artifacts.
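A minimal Go sketch of two of the reads above, assuming standard Linux paths (sysfs thermal zones, PSI pressure files); on other targets the HAL supplies the real paths (§4.5).

```go
// Sketch only: paths are the common Linux defaults, supplied by the HAL in practice.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuTempCelsius reads a sysfs thermal zone, which reports millidegrees Celsius.
func cpuTempCelsius(zone string) (float64, error) {
	raw, err := os.ReadFile("/sys/class/thermal/" + zone + "/temp")
	if err != nil {
		return 0, err
	}
	milli, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		return 0, err
	}
	return float64(milli) / 1000.0, nil
}

// memoryPSI returns the raw PSI memory-pressure lines,
// e.g. "some avg10=0.00 avg60=0.00 avg300=0.00 total=0".
func memoryPSI() (string, error) {
	raw, err := os.ReadFile("/proc/pressure/memory")
	return strings.TrimSpace(string(raw)), err
}

func main() {
	if t, err := cpuTempCelsius("thermal_zone0"); err == nil {
		fmt.Printf("cpu temp: %.1f °C\n", t)
	}
	if psi, err := memoryPSI(); err == nil {
		fmt.Println(psi)
	}
}
```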

4.4 Cellular-modem telemetry, first-class

The HALO modem scraper is folded into the agent. Cellular signal strength, connection state, handover events, data-usage counters, and modem health become OTel metrics — not log lines parsed from a Python script’s stdout.

The abstraction must accommodate modems beyond HALO; the scraper becomes a pluggable source behind the HAL.
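As a sketch of what "pluggable source behind the HAL" could look like, the hypothetical Go interface below separates the modem-specific scrape from the OTel emission done by the core; the names and fields are illustrative, not a settled design.

```go
// Sketch only: a hypothetical pluggable modem source behind the HAL.
package hal

import "context"

// ModemSample is one scrape of modem state; the agent core turns it into OTel metrics.
type ModemSample struct {
	RSSIdBm      int
	Connected    bool
	HandoverEvts uint64
	TxBytes      uint64
	RxBytes      uint64
}

// ModemSource is implemented once per modem (HALO today, others later).
type ModemSource interface {
	Name() string
	Sample(ctx context.Context) (ModemSample, error)
}
```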

4.5 Hardware abstraction layer

A HAL inside the agent isolates platform-specific concerns:

  • Host-telemetry collection (journal path, sysfs thermal zones, pressure files).
  • Modem integration (HALO today; others tomorrow).
  • Process shape (systemd unit vs. Docker sidecar vs. journal service).
  • Artifact paths (where MAVLink captures and local buffers live).

Supported targets at v1:

  • Ubuntu (x86_64 and aarch64) — the most common dev and field platform.
  • Raspberry Pi (aarch64).
  • Auterion Skynode.

Adding a target is a HAL plugin + a conformance test run; it is not a fork.
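As a sketch of the plugin seam, here is a hypothetical Go interface covering the four concerns above; the names are illustrative, and the conformance suite, not this interface, is the real contract.

```go
// Sketch only: one Platform implementation per onboard target.
package hal

// Platform isolates everything that differs between Ubuntu, Raspberry Pi, and Skynode.
type Platform interface {
	// Host-telemetry collection: journal path, sysfs thermal zones, pressure files.
	JournalPath() string
	ThermalZones() []string
	PressurePaths() []string

	// Modem integration for this target (ModemSource as sketched in §4.4).
	Modems() []ModemSource

	// Process shape: "systemd", "docker-sidecar", or "journal-service".
	ProcessShape() string

	// Artifact paths: where MAVLink captures and local buffers live.
	CaptureDir() string
	BufferDir() string
}
```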

4.6 Tight integration with onboard deployment

The agent is not a thing the ops team remembers to also install. It is installed, updated, and removed by the same Inventory-Service-driven flow that manages the onboard software itself (see the Onboard Software Deployment PRD). Agent version is part of the approved UAV configuration.

5. Regulatory & compliance

  • Cellular PII is not captured; IMSI/ICCID values, if emitted by the modem, are hashed or redacted before leaving the vehicle (see the redaction sketch after this list).
  • No change to Part 135 certificate configuration control. Production onboard deployment of the new agent routes through DCM Change Request after this PRD is accepted.
  • Flight-data handling continues to respect DroneUp data-retention policy; raw MAVLink captures beyond the cloud retention window are purged onboard and in the cloud.
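A minimal Go sketch of the IMSI/ICCID rule, assuming a salted hash is the chosen redaction; the salt source and exact policy are design decisions, not settled here.

```go
// Sketch only: salted hashing keeps subscriber IDs correlatable without
// the raw value ever leaving the vehicle.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// redactSubscriberID replaces a subscriber identifier with a short,
// stable, non-reversible label.
func redactSubscriberID(id, salt string) string {
	sum := sha256.Sum256([]byte(salt + id))
	return "imsi-" + hex.EncodeToString(sum[:8])
}

func main() {
	fmt.Println(redactSubscriberID("001010123456789", "per-fleet-salt")) // placeholder values
}
```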

6. May 2026 scope

6.1 In scope

  1. Agent MVP on Ubuntu — OTel logs + metrics + traces over OTLP, emitting to the existing Grafana cloud stack (Loki/Mimir) through their OTLP-compatible endpoints.
  2. Shim contract — uncrew-mavlink-shim exposes a stable interface (unix socket or named pipe, TBD in design) for the agent to consume the full MAVLink message stream. Shim documentation and compatibility tests included.
  3. Full MAVLink capture — every MAVLink message captured per flight, uploaded or buffered per cellular-link availability.
  4. Host-telemetry module — journal, CPU temp, memory, IO, dmesg via OTel metrics/logs on Ubuntu.
  5. Modem-telemetry module — HALO integration folded into the agent; emitted as OTel metrics.
  6. HAL v0 — enough abstraction to support Ubuntu and Raspberry Pi at v1; Skynode support stubbed with a conformance test list but not delivered this month.
  7. Inventory-Service integration — agent installed and version-pinned via the same flow as the onboard runtime. “Unmanaged sidecar” deployments are removed from the dev fleet by end of May.
  8. Backward compatibility — Grafana cloud stack (Loki + Mimir) untouched. Agent emits to their OTLP endpoints; no cloud migration.

6.2 Out of scope

  • Cloud-side ingest changes. Mimir and Loki stay. No move to Cloud Logging, Google-Managed Prometheus, or ClickHouse/ClickStack. Explicitly deferred.
  • Skynode HAL plugin delivery. Conformance tests are written; the plugin itself is a follow-up.
  • Additional modems beyond HALO. Pluggable interface designed; additional sources are follow-ups.
  • Production fleet cutover. Dev fleet only in May; production cutover routes through DCM Change Request afterwards.
  • Replay tooling for cloud-side MAVLink timelines (UI/CLI). Tracked separately.

7. Technical specifications

  • Language. Go (aligns with the platform-wide target state; lightweight, statically-linked binaries suit the onboard constraint).
  • Protocol. OTLP (HTTP/protobuf) to the cloud. Internal shim contract TBD in design — unix socket with length-prefixed framing is the starting assumption.
  • Signal types.
    • Logs: shim output, host journal, dmesg, agent-internal events.
    • Metrics: CPU/mem/IO/thermal, modem signal/state/usage, MAVLink-derived rates and counters.
    • Traces: spanning shim → agent → OTLP export, and spanning MAVLink-command → vehicle-response where the shim can infer causality.
  • Buffering. Onboard disk-backed buffer for all signals; cellular link is treated as intermittent. Buffer-size caps and retention policies per signal type.
  • Compression. OTLP protobuf + gzip at the transport layer. Prefer metrics over logs where the same information is available, especially for high-frequency MAVLink signals (see the export sketch after this list).
  • Lifecycle. Started, stopped, updated, and version-pinned by the Inventory Service’s systemd-unit / container lifecycle. No manual docker run invocations.
  • HAL. Target-agnostic core with a target-specific plugin selected at install time based on Inventory Service metadata for the UAV.
  • Security. OTLP endpoint reached with a per-UAV mTLS identity issued by the Inventory Service. No long-lived shared bearer tokens.
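As a sketch of the SDK wiring implied by the Protocol, Signal types, and Compression items, the Go snippet below builds an OTLP/HTTP metric exporter with gzip and records one modem gauge; the endpoint and instrument name are illustrative, not final.

```go
// Sketch only: OTLP (HTTP/protobuf) export with gzip, per the spec above.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// OTLP over HTTP/protobuf with gzip at the transport layer.
	exp, err := otlpmetrichttp.New(ctx,
		otlpmetrichttp.WithEndpoint("otlp.example.internal:4318"), // placeholder endpoint
		otlpmetrichttp.WithCompression(otlpmetrichttp.GzipCompression),
	)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
	)
	defer provider.Shutdown(ctx)

	meter := provider.Meter("argus")
	rssi, err := meter.Int64Gauge("modem.signal.rssi", metric.WithUnit("dBm"))
	if err != nil {
		log.Fatal(err)
	}
	rssi.Record(ctx, -67) // one sample; real values come from the HAL's modem source
}
```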

8. Risks & phased rollout

Risks

  • Shim contract stability. The agent reads every MAVLink message from the shim through a defined interface. If the contract breaks, observability breaks. Mitigation: versioned interface, CI contract tests between shim and agent repos.
  • Cellular data volume. “Every MAVLink message” is more data than “a hand-picked subset.” Mitigation: derived metrics replace message dumps for high-frequency signals; raw capture is buffered onboard and uploaded opportunistically; compression + sampling policies per signal class.
  • HAL surface growth. Every new onboard target is a new HAL plugin. Mitigation: conformance test suite is the contract; a plugin that does not pass conformance does not ship. Plugins are additive; they do not branch the core.
  • Grafana cloud stack receiver compatibility. Loki and Mimir speak OTLP with some caveats. Mitigation: spike at start of May to confirm OTLP → Loki/Mimir works for our signal volumes; fall back to intermediate collector if it does not.
  • Regression risk for the dev fleet. Replacing the entire onboard observability stack in one go is high blast radius. Mitigation: dual-run (old stack + new agent) on dev UAVs for the first two weeks of deployment; old stack retired only after signal parity is confirmed.

Phased rollout

  1. Week 1 — spike. OTLP → Loki/Mimir receiver path confirmed. Shim-agent interface prototype on a desk rig.
  2. Week 2 — agent core on Ubuntu. Logs + metrics + traces over OTLP. Host telemetry module. HALO module folded in. Installed by Inventory Service on one dev UAV.
  3. Week 3 — full MAVLink capture. Shim contract finalized; agent consumes full stream; derived metrics for high-rate signals. Dual-run with the old stack on the dev fleet.
  4. Week 4 — HAL v0 + Raspberry Pi target. Second target brought up through the HAL. Skynode conformance tests written.
  5. End of May. Old stack retired from dev fleet; production cutover DCM Change Request drafted.

9. Estimation input

Story | Optimistic (d) | Likely (d) | Pessimistic (d) | Domain
OTLP receiver compatibility spike (Loki/Mimir) | 1 | 2 | 4 | [observability]
Go agent core (OTel SDK wiring, OTLP export, disk buffer) | 3 | 5 | 9 | [onboard][observability]
Shim–agent interface contract + shim implementation | 3 | 5 | 8 | [onboard]
Full MAVLink capture + derived metrics | 3 | 6 | 10 | [onboard][observability]
Host telemetry module (journal, CPU, mem, IO, dmesg) | 2 | 4 | 7 | [onboard]
HALO modem module (folded into agent) | 2 | 3 | 5 | [onboard]
HAL v0 + Ubuntu + Raspberry Pi plugins | 3 | 5 | 8 | [onboard]
Inventory-Service lifecycle integration | 2 | 4 | 7 | [onboard][platform]
Dual-run deployment + dev-fleet cutover | 2 | 4 | 7 | [infra][onboard]
Skynode conformance test suite (no plugin delivery) | 1 | 2 | 4 | [onboard]

10. Open questions

  • Does Loki accept OTLP logs at our signal volume without an intermediate collector, or do we need an OTel Collector in front?
  • Where does the per-flight MAVLink capture live on disk, and what is the retention policy before upload succeeds?
  • Does the shim expose a message-stream interface today, or is that new work in the shim repo?
  • What is the agent’s behavior when the Inventory Service is unreachable at startup — fail, run with cached config, or degrade to local-only capture?
  • Do we want per-UAV mTLS identity issued by the Inventory Service on day one, or is a shared dev credential acceptable for May?
