Onboard Observability

Andi Lamprecht ·2026-05-11· 11 min read· Accepted

ADR-0094 · Author: Sybil Melton · Date: 2025-02-07 · Products: uncrew
Originally ADR-0099-OnBoard-Observability (v15) · Source on Confluence ↗

Observability of the (UAV) On-Board Services

Problem

In distributed systems, observability is the ability to collect data about programs’ execution, modules’ internal states, and the communication among components.[1][2] To improve observability, software engineers use a wide range of logging and tracing techniques to gather telemetry information, and tools to analyze and use it. Observability is foundational to site reliability engineering, as it is the first step in triaging a service outage.
(Wikipedia)

At DroneUp we have invested in OpenTelemetry (aka OTEL) for code instrumentation and in Honeycomb as the SaaS presentation layer to implement observability in our distributed cloud applications. The chief net result is the trace.

Honeycomb also offers more traditional metrics and dashboards (via its integration with Kubernetes) that aren’t trace/span-centric:

[Image: honeycomb-monitoring.png]

This ADR debates how to extend observability to our on-board / UAV-embedded services. We can’t debate it forever, but we do have reasons to debate it, chiefly the cloud bias of both OTEL and Honeycomb.

This is best illustrated by the screenshot of the “leading observability platform for robotics” Foxglove:

Unlike cloud services, robots find themselves in an unbounded environment that they perceive, and their behavior is fed by this perception. To observe a robot is to see through its eyes. ROS is to robotics what Kubernetes is to cloud computing. ROS nodes publish their observations to pubsub topics and those get written to ROS bags. The observations can be anything between a single-number measurement (e.g.: battery temperature) and raw, multi-megabyte camera frames.

Debugging issues in a robotics application requires a different type of observability. Text logs and metrics are a good start. But context also comes from data types with a visual or geometric interpretation: maps, point clouds, geolocation, poses, video, odometry. (after)

Bias-2 Infinite Bandwidth

Compounded by the visual datapoints, a robot simply cannot indiscriminately stream its observables to a sink somewhere in the cloud (like Honeycomb). It can write them to a file that’s uploaded after the flight is completed and/or produce a stream curated for the available bandwidth and enqueued behind higher-QoS streams like C2 - this itself scales to Uncrew and has been further debated here. If Uncrew uses OTEL to produce its observables, then, since OTEL doesn’t accept binary observables, the result will be a partial stream curated not for the available bandwidth but for what OTEL can express. Uncrew can then hold this stream back until the drone lands or until bandwidth permits.
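
A minimal sketch (C++17) of a stream curated for the available bandwidth, assuming a per-tick byte budget and strict priority between QoS classes. The names `Qos`, `Packet` and `DownlinkQueue` are hypothetical and only illustrate the idea of queuing observables behind C2; they are not part of Uncrew or ROS.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical QoS classes; a lower value drains first.
enum class Qos : std::uint8_t { C2 = 0, Alert = 1, Telemetry = 2 };

struct Packet {
    Qos qos;
    std::string payload;
};

class DownlinkQueue {
public:
    void enqueue(Packet p) { queues_[p.qos].push_back(std::move(p)); }

    // Sends as many packets as fit into budget_bytes, highest QoS first.
    // Packets that do not fit stay queued (detained) for a later tick.
    std::vector<Packet> drain(std::size_t budget_bytes) {
        std::vector<Packet> sent;
        for (auto& [qos, q] : queues_) {  // std::map iterates in key order
            while (!q.empty() && q.front().payload.size() <= budget_bytes) {
                budget_bytes -= q.front().payload.size();
                sent.push_back(std::move(q.front()));
                q.pop_front();
            }
            if (!q.empty()) break;  // a blocked class is never overtaken by a lower one
        }
        return sent;
    }

    // Number of packets still waiting for bandwidth.
    std::size_t pending() const {
        std::size_t n = 0;
        for (const auto& [qos, q] : queues_) n += q.size();
        return n;
    }

private:
    std::map<Qos, std::deque<Packet>> queues_;
};
```

Calling `drain` once per link tick with the currently measured budget keeps C2 ahead of everything else, while telemetry simply accumulates until the drone lands or bandwidth permits.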

Bias-3 Trace-Centrism

OTEL and Honeycomb are span/trace-centric. Microservices lend themselves to this model well - each service, by definition, exposes a request/response interface and talks to downstream services to fulfil the requests, enabling the tree-like waterfall representation of the trace/request.

A robot can be seen as a service and its mission as a request. It will fly the mission and then tell you when it’s done flying it. However, the things really worth observing happen before this mission root span is finished. Meanwhile, OTEL will only publish finished spans, so it seems a mission isn’t a good candidate for a span.
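
The concern can be illustrated with a toy model. The types below are stand-ins written for this illustration, NOT the opentelemetry-cpp API; they only encode the assumption stated above, that a span reaches the exporter at the moment it ends.

```cpp
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an exporter: records the names of published spans.
struct ToyExporter {
    std::vector<std::string> exported;
};

// Toy stand-in for a span that is published only when it ends.
class ToySpan {
public:
    ToySpan(std::string name, ToyExporter& exporter)
        : name_(std::move(name)), exporter_(exporter) {}

    // Ending the span is the only point at which it becomes visible off-board.
    void end() {
        if (!ended_) {
            ended_ = true;
            exporter_.exported.push_back(name_);
        }
    }

private:
    std::string name_;
    ToyExporter& exporter_;
    bool ended_ = false;
};
```

Under this assumption, a `ToySpan` named "mission" that wraps the whole flight shows up in the exporter only after landing, after every shorter span that ended during the flight.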

Let’s take a look at how robots do it, using the example of obstacle detection:

What comes out of that is:

Although no actor here cares about the fate of the events it emits, steps 1..4 do form a tree, but it’s not exactly the kind of tree OTEL and Honeycomb expect: the parent span (if any) will end before its child spans. Honeycomb seems to be able to deal with that though. Additionally, both OTEL and Honeycomb offer another inter-span relationship on top of parent-child: spans can be associated through links, which, in an event-based architecture, requires events to carry the span and trace ids. Yet it seems obvious that everyone, Uncrew engineers and users alike, would want to see that the mission has been interrupted by that specific alert, point cloud and camera frames.
Each of the steps 1..4 can be thought of as a span with attributes, relationships and duration worth capturing.
When the Uncrew Flight Controller responds to the obstacle detection alert, it is not done flying the mission or its more complex items (like a survey). It may not even be done responding to the previous obstacle detection alert when another one comes. It seems then that the traces themselves nest, forming parent-child relationships. OTEL spans allow one-to-many relationships, but they are not of the parent-child variety.

Bias-4 Engineering-Audience

Honeycomb (competing with Datadog, New Relic or Splunk in the same segment) is written with cloud engineers in mind. It is not written for pilots who need to observe a flight in real time. It is not written for computer vision specialists who need to see what the robot saw. It is not written for robotics developers who may want to find every occurrence of a 30..40m turn radius and need to write software that derives second-order observations from raw log data, e.g.: Is it twilight? Has it been raining? What other flights does the radar say were in the proximity? Only generic data analytics giants like AWS Athena or Google BigQuery can measure up to this task.
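
As an illustration of such a derived observation, here is a minimal sketch that finds turns of a given radius in raw log data. It assumes the log holds ground speed and yaw rate samples and approximates the turn radius as r = v / |ω|; the `Sample` layout and the function name are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical raw log sample.
struct Sample {
    double ground_speed_mps;  // horizontal speed in m/s
    double yaw_rate_radps;    // turn rate in rad/s, sign gives direction
};

// Returns the indices of samples whose turn radius r = v / |omega|
// lies within [min_m, max_m].
std::vector<std::size_t> turns_with_radius(const std::vector<Sample>& log,
                                           double min_m, double max_m) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < log.size(); ++i) {
        const double omega = std::fabs(log[i].yaw_rate_radps);
        if (omega < 1e-6) continue;  // flying straight: radius is effectively infinite
        const double radius = log[i].ground_speed_mps / omega;
        if (radius >= min_m && radius <= max_m) hits.push_back(i);
    }
    return hits;
}
```

Questions like "is it twilight?" would need further joins against data that is not in the log at all, which is why a general-purpose analytics engine is the more natural home for this kind of query.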

This audience problem seems to be the greatest point of contention for Uncrew, which:

As a product, has the requirement to offer observability to its users.
As an engineering team, and as a matter of good engineering practice, has to offer observability to itself.

The difference between these two is very blurry. There will clearly be instrumentations/observations useful to both audiences, and ones the Uncrew engineers know they are adding just for themselves.

But there is more! An on-board component emitting an event that indicates battery overheating is producing an observable of interest to the Uncrew engineer, the pilot and... the drone itself! The autonomous system is the third member of the observability audience.

Flight Log

The Flight Log is an established facility, in some ways defined in contrast to the application log. The main difference (again one of audience) is that the Flight Log has to be kept on premises for audit and compliance purposes.

ROS

Uncrew’s on-board architecture is based on ROS2, which has its own observability practice:

logs have the standard INFO/DEBUG/WARN/ERROR levels and they go to the console (if available), the file system (if available), the /rosout topic and a bunch of other sinks the developer can configure.
metrics, as typical observations, go to topics, which are drained and directed somewhere like Grafana for visualization.

It’s evident that a ROS topic is the central instrument for depositing all kinds of observations that happen on a robot: telemetry, position, HW component status, alerts, precision landing progress and impediments. A topic is also the central instrument in the Uncrew on-board architecture, which postulates a Monitoring Service that taps telemetry from all topics it cares for and curates it for off-board consumption, turning some of it into alerts. The C2 ingress component, trusting correct curation, puts this stream on the wire towards the Avatar and any audiences downstream from it. It is clear that most of this stream is of interest to the Uncrew engineering team and thus constitutes observability.
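
The role of the Monitoring Service can be sketched as follows. This is a toy, in-process stand-in for ROS 2 pub/sub, written under the assumption that curation means "forward what I subscribed to and promote threshold violations to alerts". `TopicBus`, `Observation`, the topic names and the thresholds are all hypothetical.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-number observation published on a topic.
struct Observation {
    std::string topic;
    double value;
};

// Toy in-process topic bus standing in for ROS 2 pub/sub.
class TopicBus {
public:
    using Handler = std::function<void(const Observation&)>;

    void subscribe(const std::string& topic, Handler h) {
        handlers_[topic].push_back(std::move(h));
    }

    void publish(const Observation& o) {
        auto it = handlers_.find(o.topic);
        if (it == handlers_.end()) return;  // nobody cares about this topic
        for (auto& h : it->second) h(o);
    }

private:
    std::map<std::string, std::vector<Handler>> handlers_;
};

// Subscribes to the topics it cares for, keeps a curated stream and
// promotes values above their (assumed) limit to alerts.
// Handlers capture `this`, so the service must outlive the bus and not be moved.
class MonitoringService {
public:
    MonitoringService(TopicBus& bus, const std::map<std::string, double>& limits) {
        for (const auto& entry : limits) {
            const double limit = entry.second;
            bus.subscribe(entry.first, [this, limit](const Observation& o) {
                curated.push_back(o);
                if (o.value > limit) alerts.push_back(o.topic + " above limit");
            });
        }
    }

    std::vector<Observation> curated;  // stream handed to the C2 ingress
    std::vector<std::string> alerts;   // subset promoted to alerts
};
```

Observations on topics the service never subscribed to stay on board, which is the curation step the C2 ingress component relies on.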

Auterion

Not long ago Auterion Suite helped diagnose a problem with the mavlink-router crashing. The crash put the FMU into a C2-loss condition, which resulted in an 18-minute loiter and a belated decision to Return Home. The drone’s battery ran out, leading to an emergency “land now” over a piece of shrubbery, a crash and a damaged drone.

In order to correlate events happening in PX4 and the Uncrew App, Uncrew has requested access to the `ulog` file so it can write its logs there as well.

Since the Uncrew realtime Flight Log is partially redundant with the ulog’s content, and because we expect ulog to grow, possibly morphing into ROS2 bags that contain raster sensor output, we are debating with Auterion whether the Auterion OS should stop streaming the ulogs.

Alternatives Considered

The Uncrew Team (certainly its cloud part) is well invested in Honeycomb as the presentation layer. Although tracing doesn’t seem useful for on-board observability, Uncrew, as an engineering team, needs a generic tool to browse the logs, metrics and alerts and to host dashboards. Honeycomb is good at that and we will strive to direct at least some of the on-board observations to it.

We will also assert that the institution of the Flight Log as the drone’s “black box” remains firmly in the Uncrew architecture.

Separate the Audiences

One obvious approach is to address each audience separately. Use on-board OTEL whenever Uncrew needs to make something observable for itself. Use the Monitoring Service whenever Uncrew needs to make something observable for anybody else.

Pros:

Uniformity of the engineering practice at DroneUp.
Correlated with the Uncrew backend observables.

Cons:

Blurriness of the division. What tools should an engineer use to discern which audience a given observable is useful to? A rule of thumb could be postulated that if the observable exposes the software design (names a function or variable) then it’s likely not informative to an external stakeholder. Is it not though? Imagine the NTSB investigating an incident and trawling through the Flight Log. Any piece of the puzzle (relatable to the software design or not) is - by definition - useful to the investigators. Conversely, every observable useful to an external stakeholder is - by definition - useful to an Uncrew engineer.
Duplicated instrumentation and data - at this point we have determined that the overlap (between the engineering and non-engineering audience) may be close to 100%.
Detached from the PX4/FMU observables. Should Uncrew choose to write its observables to ulog, they will be present in the same medium as the PX4 observables and correlated to/interleaved with them. If Uncrew writes them via OTEL to Honeycomb, they will be present in the same medium as Uncrew’s backend components, but missing the ulog part. Surely Uncrew will wish to see all of it together, and this is what the Flight Log attempts to solve by reconciling both at their resting point in GCS.

Example observability instrumentation in the Obstacle Detection node:

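// Variant 1: OTEL span plus in-band trace context on the ROS event.
void on_new_point_cloud(const point_cloud_frame& frame) {
    // Continue the trace that the incoming frame belongs to.
    auto span = _tracer->StartSpan(
        "on_new_point_cloud",
        otel_options{}.with_parent(frame.trace_id, frame.span_id));
    auto obstacles = detect_obstacles(span, frame);
    if (!obstacles.empty()) {
        // The alert names this span as its parent so downstream handlers can attach to it.
        auto alert = obstacle_alert{obstacles}.with_parent(frame.trace_id, span.id);
        topic("alerts").put(alert);
    }
}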

Since we agree that tracing is valuable for both audiences, it’s difficult not to conflate one with the other. To enable tracing, each ROS event has to carry the trace id. If this is the OTEL variety of tracing, the trace id is compound: trace id + span id. Normally OTEL transmits these as HTTP headers - as a kind of metadata. ROS events have no metadata, so the trace ids need to be explicit and in-band. It is also quite evident that the suspected redundancy isn’t all that pronounced.
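
A minimal sketch of what "explicit and in-band" could look like: the compound id travels as ordinary fields of the event and can be rendered as a W3C `traceparent` value, the header format used by OTEL’s default propagator, when the event crosses into HTTP land. The `TraceContext` and `ObstacleAlert` types are hypothetical; only the `traceparent` layout (version, 16-byte trace id, 8-byte span id, flags) comes from the W3C Trace Context specification.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical in-band trace context carried as ordinary fields of a ROS event.
struct TraceContext {
    std::array<std::uint8_t, 16> trace_id{};  // 128-bit trace id
    std::array<std::uint8_t, 8> span_id{};    // 64-bit span id
};

// Hypothetical event type: the context is part of the payload, not metadata.
struct ObstacleAlert {
    TraceContext parent;
    std::uint32_t obstacle_count = 0;
};

// Lowercase hex encoding of a fixed-size byte array.
template <std::size_t N>
std::string to_hex(const std::array<std::uint8_t, N>& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(2 * N);
    for (std::uint8_t b : bytes) {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}

// Renders the context as "00-<trace id>-<span id>-01" (version 00, sampled flag set).
std::string to_traceparent(const TraceContext& ctx) {
    return "00-" + to_hex(ctx.trace_id) + "-" + to_hex(ctx.span_id) + "-01";
}
```

Carrying the raw 24 bytes in the event and formatting them only at the edge keeps the on-board messages small.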

Pick the Most Demanding Audience

An approach that solves the redundancy problem would remove OTEL from the onboard code entirely. If Flight Log and OTEL are truly redundant then OTEL can be reconstructed/manufactured from Flight Log somewhere in the cloud. Short of the Avatar doing this directly, it could be fulfilled by a Honeycomb client that’s also a Flight Log subscriber.

[Diagram: obervability.drawio.png]

Example observability instrumentation in the Obstacle Detection node:

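// Variant 2: no OTEL on board; the event only carries the trace id.
void on_new_point_cloud(const point_cloud_frame& frame) {
    auto obstacles = detect_obstacles(frame);
    if (!obstacles.empty()) {
        // Pass the trace id of the originating frame along the event chain.
        auto alert = obstacle_alert{obstacles}.with_parent(frame.trace_id);
        topic("alerts").put(alert);
    }
}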

In this case the trace id is formed on the issuance of each raw camera frame (or each root event) by any means assuring uniqueness (e.g.: UUID) and then just passed along the event chain.
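
A minimal sketch of this scheme, assuming a random 128-bit id rendered as hex (a UUID library would serve equally well). The event types and the frame → point cloud → alert chain are hypothetical placeholders for the real ROS messages.

```cpp
#include <cstdint>
#include <random>
#include <string>

// Hypothetical event types; only the trace_id field matters here.
struct CameraFrame { std::string trace_id; };
struct PointCloud  { std::string trace_id; };
struct AlertEvent  { std::string trace_id; };

// 128 random bits rendered as 32 hex characters.
// Any scheme that assures uniqueness would do.
std::string new_trace_id() {
    static std::mt19937_64 rng{std::random_device{}()};
    static const char digits[] = "0123456789abcdef";
    std::string id;
    for (int word = 0; word < 2; ++word) {
        std::uint64_t bits = rng();
        for (int i = 0; i < 16; ++i) {
            id.push_back(digits[bits & 0xF]);
            bits >>= 4;
        }
    }
    return id;
}

// The root event mints the id; every derived event just copies it.
CameraFrame capture_frame() { return CameraFrame{new_trace_id()}; }
PointCloud to_point_cloud(const CameraFrame& f) { return PointCloud{f.trace_id}; }
AlertEvent to_alert(const PointCloud& pc) { return AlertEvent{pc.trace_id}; }
```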

A subscriber to the Flight Log can consume the same events and, since they carry trace ids, it can create OTEL spans and traces for them at the point of translation towards Honeycomb. The spans won’t contain durations or attributes that illustrate/complement the event processing, unless they belong to the event definition. If an engineer wishes to deposit an ad-hoc observable, they will have to define a ROS event dedicated to it and then worry about maintaining it.
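
The translation step could look roughly like the sketch below. `FlightLogEvent` and `ReconstructedSpan` are hypothetical types, not OTEL SDK types; the point is that each event maps to a zero-duration span because the event itself carries no processing time.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical Flight Log record: what happened, when, and in which trace.
struct FlightLogEvent {
    std::string trace_id;
    std::string name;
    std::uint64_t timestamp_ns;
};

// Minimal span shape to hand to a Honeycomb client.
struct ReconstructedSpan {
    std::string trace_id;
    std::string name;
    std::uint64_t start_ns;
    std::uint64_t end_ns;  // equal to start_ns: the event carries no duration
};

// Groups events by trace id and emits one zero-duration span per event,
// preserving the order in which the events were logged.
std::map<std::string, std::vector<ReconstructedSpan>>
to_traces(const std::vector<FlightLogEvent>& events) {
    std::map<std::string, std::vector<ReconstructedSpan>> traces;
    for (const auto& e : events) {
        traces[e.trace_id].push_back(
            {e.trace_id, e.name, e.timestamp_ns, e.timestamp_ns});
    }
    return traces;
}
```

Anything richer, such as how long obstacle detection took or which code path ran, would have to be added to the event definition before it could appear in the reconstructed span.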

Decision

The code example presented in the Separate the Audiences section doesn’t actually demonstrate the redundancy it warns about. The events deposited to the ROS topics describe “what” has happened. The events deposited to the OTEL instrumentation decorate the “what” with the “how”. The two approaches appear to complement each other.

Uncrew shall stay the course and build the in-band event infrastructure to offer observability to its users. On top of that, it will instrument its code with OTEL to offer observability to its engineers.

What exactly happens to an event and the additional infrastructure it requires is subject to this accompanying “sub-ADR”.

There are a number of assumptions and unanswered questions within this ADR, to name a few:

Is it true that (long-running) spans don’t make it to Honeycomb before they are finished?
Do we stream OTEL or write it out to a file and upload it when the flight’s finished?
- If the latter, how does Honeycomb deal with spans captured a long while ago?
- How does it deal with child spans arriving way later than their parents (should we choose to use the same trace id in the cloud and on the UAV)?

And so a PoC is the next natural step to validate this decision.

Last updated on May 11, 2026
