Simulator Failure Injection

Andi Lamprecht · 9 min read · Accepted
ADR-0271 · Author: Remek Zajac · Date: 2026-02-26 · Products: uncrew
Originally ADR--0141-(Simulator) Failure Injection (v15) · Source on Confluence ↗

Traceability Links
Jama Requirements: UERQ-FRQ-486
Jira Tasks: CORE-2639

Context

DroneUp’s simulator needs to support injecting realistic failures for RPIC training and system validation (UERQ-FRQ-486). The architectural question is: where should failure injection logic live (onboard, cloud, or hybrid), how should it be exposed to users, and how do we maintain strict production/simulation isolation?

We need to allow users to inject the following failures into simulated UAs:

  • Propulsion System Failure (Motor Failure)

  • Accelerometer Failure

  • Lost Link (LTE) - Aircraft losing LTE connection

    • When the Onboard Uncrew loses connection to its Avatar, the Onboard Uncrew should engage RTL as required (UERQ-HLR-825, UERQ-FRQ-49). The pilot should see the connection dot in the Mission Console turn red.
  • Lost Link (Uncrew) - Aircraft keeps LTE connection, but the Pilot loses internet connection

    • The response to the pilot disappearing should be identical to the response to a lost link, since both are technically a C2 loss. Although the detection and response are not implemented, the injection is trivial: it is sufficient to close the assigned RPIC browser window.
  • GPS / Compass Failure

    • Onboard Uncrew monitors for the presence and/or integrity of the global position telemetry and raises the respective alerts, PositionNoData or PositionInvalidData, which are presented to the RPIC. While PositionInvalidData can be simulated by emitting a corrupt GLOBAL_POSITION_INT, the absence of position data (PositionNoData) cannot be simulated without hijacking the telemetry stream.
    • Compass failure is assumed to manifest as magnetometer_calibrated reporting false. MAVSDK sets the magnetometer flag to false if CAL_MAG0_ID is zero, and it registers for PX4 parameter changes so that it can recalculate magnetometer health in real time. As with accelerometer health, nothing reacts to the magnetometer going bad.
  • Geocage Violation

    • Geocage violation is a well-tested and easily provoked failure. It suffices to use QGC to send the UA outside of its geocage.
  • In-Flight Remote ID Failure

    • This too is a well-tested and easily provoked failure. It suffices to stop the virtual-remote-id container in the Virtual Skynode’s web frontend. Granted, reaching the frontend (that is, learning its IP address) requires access to GCP.
  • Winch Failure - Package Not Releasing

    • We currently deploy virtual-reel-winch along with each Virtual Skynode. Although virtual-reel-winch lacks failure injection, it can easily be added.
  • Low Energy Failsafe

Isolating Production from Simulation

It is imperative that the production software treats simulated flights exactly the same way as it treats real ones and so we have been careful not to communicate the nature of the UA (simulated or otherwise) to the production software, except in one isolated case.

It is further undesirable to mix interfaces and functionality related solely to simulation into what is otherwise production software. Keeping them apart helps to uphold the discipline of hiding the simulated nature of a UA from production software. Additionally:

  • More software in components implies more bugs and risk - risk that can affect real flights.
  • Simulation related code deployed to oversee real flights would constitute dead code, which goes against the principles of DO-178C.

Interestingly, PX4 does not seem to bother with this separation: it accepts purely simulation-related parameters (e.g. SIH_LOC_LAT0) in the same bucket as every other PX4 parameter.

Meanwhile, the aforementioned exception is born of the realization that, even though Uncrew as a product exists chiefly to fly real drones, it must also support testing and training and let its users knowingly work with simulators. This creates an inherent tension: we offer simulators to users while simultaneously hiding their simulated nature from the production software stack. We should probably divide the software into the flight-safety-sensitive part and the rest.

Simulating Link Loss

A link loss simulation has been successfully demonstrated with the c2-killer auterionos application installable on a Virtual Skynode. It works by manipulating iptables and installing a rule that DROPs traffic towards the avatar on port 443. It does this on the local network.

docker exec 4175888e86be iptables-legacy -A OUTPUT -d 34.74.238.230 -p tcp --dport 443 -j DROP

The avatar’s host is extracted from the private certificate stored on the Virtual Skynode for mavlink-shim to work. Its IP address is resolved normally. Since mavlink-shim runs on the host network (probably because virtual-px4 does too), so must c2-killer. However, containers running in their own networks (not on the host) have their own iptables and keep the ability to talk to the avatar.
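A minimal sketch of how the rule above could be assembled and later removed. The function name `c2_drop_rule` is hypothetical, and the sketch assumes the avatar's IP address has already been resolved; it only builds the argument list and does not run iptables.

```python
import ipaddress


def c2_drop_rule(avatar_ip: str, port: int = 443, remove: bool = False) -> list[str]:
    """Build the iptables-legacy arguments that block (or unblock) traffic to the avatar.

    Hypothetical helper: mirrors the rule shown in the text, nothing more.
    """
    ipaddress.ip_address(avatar_ip)  # raises ValueError on a malformed address
    action = "-D" if remove else "-A"  # -A appends the rule, -D deletes the same rule
    return [
        "iptables-legacy", action, "OUTPUT",
        "-d", avatar_ip,
        "-p", "tcp", "--dport", str(port),
        "-j", "DROP",
    ]


inject = c2_drop_rule("34.74.238.230")
restore = c2_drop_rule("34.74.238.230", remove=True)
print(" ".join(inject))
print(" ".join(restore))
```

Building the removal rule from the same inputs as the injection rule keeps the two symmetric, so a simulated link loss can be ended as easily as it was started.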

Hijacking MAVLink Traffic

Winch and remote_id failures, as well as PositionInvalidData, can be simulated out of band by the various Uncrew containers deployed alongside PX4. The accelerometer failure cannot, and neither can the magnetometer failure or the absence of position. These have to be injected by tampering with the telemetry flowing from PX4 to mavlink-shim. This has been successfully demonstrated with a gomavlink-based autopilot-proxy.
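The tampering idea can be sketched as a filter sitting on the PX4 to mavlink-shim path. This is an illustration under stated assumptions, not the gomavlink-based proxy itself: messages are assumed to be already decoded into plain dictionaries, the class name `TelemetryTamper` is hypothetical, and only two injections are shown (suppressing GLOBAL_POSITION_INT for absent position, and forcing CAL_MAG0_ID to zero in PARAM_VALUE messages for the compass). How the accelerometer failure is injected is not described in the text, so it is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TelemetryTamper:
    """Hypothetical filter; with no injections active it forwards messages unchanged."""

    dropped: set = field(default_factory=set)            # message names to suppress
    param_overrides: dict = field(default_factory=dict)  # param_id -> forced value

    def process(self, msg: dict) -> Optional[dict]:
        """Return the message to forward to mavlink-shim, or None to drop it."""
        if msg["name"] in self.dropped:
            return None
        if msg["name"] == "PARAM_VALUE" and msg.get("param_id") in self.param_overrides:
            # Forward a copy so the original message from PX4 is left untouched.
            return {**msg, "param_value": self.param_overrides[msg["param_id"]]}
        return msg


tamper = TelemetryTamper()
position = {"name": "GLOBAL_POSITION_INT", "lat": 368529000, "lon": -760586000}
mag_id = {"name": "PARAM_VALUE", "param_id": "CAL_MAG0_ID", "param_value": 396809}

print(tamper.process(position) is position)  # no injection active: passes through

tamper.dropped.add("GLOBAL_POSITION_INT")    # simulate absence of position
tamper.param_overrides["CAL_MAG0_ID"] = 0    # simulate an uncalibrated magnetometer
print(tamper.process(position))
print(tamper.process(mag_id)["param_value"])
```

The pass-through default matters: a proxy on this path should be transparent whenever no failure is being injected.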

Exposing Failure Injection to Web Frontends

Onboard Uncrew components, including mavlink-shim and whatever failure injection actors we decide to put onboard, are behind NAT and must dial out in order to be reached. Mavlink-shim does this towards its Avatar, acting as a gRPC client of the service the Avatar offers to it. This is somewhat upside down: the UA is conceptually a server, but must act as a client. The Geodata service does the same as mavlink-shim so that it can fetch the tiles it needs to cache. Avatar exposes a proxy for the Geodata service to reach and necessarily does so on the same port (443), as facilitated by cmux. Cmux allows multiplexing multiple kinds of service on the same port, as long, of course, as they are hosted by the same service.

If the failure injection components were to use this existing facility, Avatar would have to start serving requests related purely to simulation handling, which would arguably set an undesirable precedent.
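To illustrate what multiplexing several kinds of service on one port means, the sketch below classifies a connection by its first bytes, which is the general idea behind connection multiplexers such as cmux. It is a conceptual sketch only: the labels `grpc` and `http-proxy` and the function `classify` are hypothetical, and Avatar's actual cmux configuration is not described in this document.

```python
# The HTTP/2 client connection preface, sent first on every HTTP/2 connection
# (gRPC runs over HTTP/2).
HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"

# A few HTTP/1.x request-line prefixes; an HTTP proxy would typically see CONNECT.
HTTP1_METHODS = (b"GET ", b"POST ", b"PUT ", b"HEAD ", b"DELETE ", b"OPTIONS ", b"CONNECT ")


def classify(first_bytes: bytes) -> str:
    """Pick a handler label from the first bytes read on an accepted connection."""
    if first_bytes.startswith(HTTP2_PREFACE):
        return "grpc"
    if first_bytes.startswith(HTTP1_METHODS):
        return "http-proxy"
    return "unknown"


print(classify(HTTP2_PREFACE + b"\x00\x00\x00\x04"))
print(classify(b"CONNECT geodata.example:443 HTTP/1.1\r\n"))
```

Because the decision is made inside one listening process, every service multiplexed this way has to be hosted by that process, which is why adding simulation-only handlers would land in Avatar itself.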

Decision Drivers

  1. Production/simulation isolation: no simulation code in production flight paths,
  2. DO-178C compliance: avoid dead code in certified components,
  3. Minimal Avatar changes: keep safety-critical Avatar lean,
  4. User accessibility: failure injection must be reachable from web frontends without VPN,
  5. Implementation feasibility: leverage existing infrastructure (AuterionOS apps, port 443 tunnel)

Alternatives Considered

Proxy-Based Link Loss

It is concluded that since mavlink-shim has to authenticate its avatar with mTLS, we cannot position a proxy between it and the avatar.

Plain gRPC Server on a VM

Instead of going through Avatar to expose failure injection to users, we could take advantage of the fact that the VMs where Virtual Skynodes run are reachable if the user is on VPN. That last condition was deemed disqualifying.

Decision

Chosen Option: Deploy autopilot-proxy as a separate AuterionOS application on simulated UAs only, positioned between mavlink-shim and PX4. For user exposure, Avatar will act as a gRPC proxy to a separate cloud-based FailureInjection service (conservative approach). The tunnel approach is deferred to a separate ADR.

Onboard deployment

We will package the entire failure injection interface and the majority of its functionality into the autopilot-proxy, which we will deploy between mavlink-shim and PX4 solely on simulated UAs. We will ensure this by packaging the proxy as a separate auterionos application and wiring it into the simulator bootstrap. This separation poses a risk: mavlink-shim certificates will no longer be reachable once Auterion implements data-caging between their onboard apps, something they have expressed the intention to do. When this happens, we can deploy the same factory certificates alongside the proxy and have it exchange them for the personal ones, just as mavlink-shim does.

The autopilot-proxy will directly implement the injection of C2 loss (through iptables manipulation) and of failures that require tampering with PX4 telemetry. Failures that do not need this, namely remote_id and winch failures, the proxy will implement by relaying to the remote_id and winch simulators respectively.
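The split between what the proxy does itself and what it relays can be sketched as a small routing table. The enum and member names below are hypothetical and chosen for illustration; only the grouping (C2 loss through iptables, telemetry tampering for accelerometer, magnetometer and absent position, relaying for remote_id and winch) comes from this ADR.

```python
from enum import Enum


class Failure(Enum):
    C2_LOSS = "c2_loss"
    ACCELEROMETER = "accelerometer"
    MAGNETOMETER = "magnetometer"
    POSITION_NO_DATA = "position_no_data"
    REMOTE_ID = "remote_id"
    WINCH = "winch"


class Handling(Enum):
    IPTABLES = "handled in the proxy via iptables"
    TELEMETRY = "handled in the proxy by tampering with PX4 telemetry"
    RELAY_REMOTE_ID = "relayed to the remote_id simulator"
    RELAY_WINCH = "relayed to the winch simulator"


# Hypothetical routing table reflecting the split described in the text.
ROUTES = {
    Failure.C2_LOSS: Handling.IPTABLES,
    Failure.ACCELEROMETER: Handling.TELEMETRY,
    Failure.MAGNETOMETER: Handling.TELEMETRY,
    Failure.POSITION_NO_DATA: Handling.TELEMETRY,
    Failure.REMOTE_ID: Handling.RELAY_REMOTE_ID,
    Failure.WINCH: Handling.RELAY_WINCH,
}


def handling_for(failure: Failure) -> Handling:
    """Look up how the proxy would service a given injection request."""
    return ROUTES[failure]


for failure in Failure:
    print(f"{failure.value}: {handling_for(failure).value}")
```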

Exposure to Users

Because autopilot-proxy, like mavlink-shim, sits behind NAT, it cannot be a plain gRPC server. Conservatively then we will have to mimic what Avatar does in order to map PilotCommandService to PilotCommandService.

We should not put failure injection directly into the Avatar code, but instead have it act as a plain grpc proxy to an external cloud-based FailureInjection service that users can talk to. The service would be structurally similar to Avatar owing to the aforementioned mapping.

There is a promising grpc-tunnel-proxy PoC that suggests we can do considerably better than that:

  • Have autopilot-proxy establish a hashicorp/yamux mTLS tunnel with Avatar’s port 443 and use its end of the tunnel as the listener of the FailureInjection gRPC service (instead of acting as a client).
  • Have Avatar forward to various tunnels (and various gRPC services at their ends) using service-name-based routing.
  • Have Avatar continue to offer the Avatar gRPC service towards mavlink-shim, but allow the roles to be swapped so that Avatar becomes a client, and a simple proxy, instead.
  • Have Avatar continue to act as an HTTP proxy for the onboard gRPC Geodata Service.
  • All of this through the same mTLS-authenticated keyhole of port 443.

However, the PoC must be covered by its own ADR before we proceed with the idea, as it is an architecturally significant change.
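Service-name-based routing can be illustrated with the standard gRPC method path, which has the form `/package.Service/Method`. The sketch below is not taken from the PoC: the service name `uncrew.sim.FailureInjection`, the tunnel labels and both function names are hypothetical.

```python
def service_of(method_path: str) -> str:
    """Extract 'package.Service' from a gRPC method path '/package.Service/Method'."""
    parts = method_path.split("/")
    if len(parts) != 3 or parts[0] != "" or not parts[1] or not parts[2]:
        raise ValueError(f"not a gRPC method path: {method_path!r}")
    return parts[1]


def pick_tunnel(method_path: str, tunnels: dict, default: str = "avatar-local") -> str:
    """Choose the tunnel registered for the called service, else handle the call locally."""
    return tunnels.get(service_of(method_path), default)


# Hypothetical registry: one entry per tunnel a simulated UA has dialed in with.
tunnels = {"uncrew.sim.FailureInjection": "tunnel:autopilot-proxy"}

print(pick_tunnel("/uncrew.sim.FailureInjection/InjectFailure", tunnels))
print(pick_tunnel("/uncrew.Avatar/StreamTelemetry", tunnels))
```

Routing on the service name alone would let Avatar forward calls without knowing anything about the messages they carry, which is what keeps simulation logic out of Avatar in this approach.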

Consequences

What becomes easier or more difficult to do because of this change?

  • easier to conduct RPIC training exercises,
  • easier to validate system responses to failures,
  • introduces a new onboard component to maintain per simulator version,
  • has dependency on Auterion’s app packaging model,
  • risk of data-caging breaking certificate access,
  • need for a follow-on ADR for the tunnel approach.

Formal Impact

(1) Does this change safety-critical data flows? Yes — the proxy intercepts PX4↔mavlink-shim telemetry.

(2) Does this affect failure detection? Yes — it deliberately injects failures.

(3) Attack surface changes? Possibly — new component on the data path.

(4) Derived requirements? Likely — the proxy must be transparent in non-injection mode.

(5) DAL implications? No, only tool qualification. Proxy is simulation-only and will not be deployed to production UAs.
