Apollo mTLS

Andi Lamprecht · 5 min read · Accepted
ADR-0112 · Author: Sybil Melton · Date: 2025-02-07 · Products: uncrew
Originally ADR-0090-Apollo-mTLS (v3) · Source on Confluence ↗

UAV->Cloud with mTLS

Context

Drones fly in noisy, unreliable cellular/radio networks, which manifests as:

  • high (IP) packet loss rates - this is where the IP router drops packets because:

    • the packet lost its integrity on the radio link and its checksum doesn’t hold;
    • the network experiences congestion;
  • frequent link loss events;

Link loss simply means there is no network, and that is very difficult to correct. Packet loss at the TCP transport causes retransmissions. If the TCP sender doesn’t receive an ACK for a previously emitted TCP segment within the current Retransmission Timeout (RTO), it retransmits that segment (now assumed lost) and doubles the RTO. The RTO is derived from the observed Round Trip Time (RTT), but RFC 6298 sets a floor:

Whenever RTO is computed, if it is less than 1 second, then the RTO SHOULD be rounded up to 1 second.

This simply means that packet loss causes delays. In a 50% packet loss network a byte put on the TCP socket will arrive with delays reaching:

  • RTT/2 at 50% chance
  • 1s at 25% chance
  • 2s at 12.5% chance
  • 4s at 6.25% chance
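The arithmetic behind this list can be sketched as follows. This is a minimal model, assuming independent per-packet loss, an initial RTO of 1 second that doubles on every expiry, and no lost ACKs; the function name `retransmission_delays` is made up for illustration. Note that the figures in the list correspond to the timer that expired before each attempt; under the same assumptions the cumulative wait is the sum of those timers (1s, 3s, 7s).

```python
def retransmission_delays(loss, initial_rto=1.0, attempts=4):
    """Return (probability, last_rto, cumulative_wait) per delivery attempt.

    Attempt 0 is the original transmission, attempt k the k-th retransmission.
    Assumptions: independent loss, RTO doubles on each expiry, ACKs never lost,
    and the one-way propagation delay (RTT/2) is left out of the waits.
    """
    rows = []
    rto = initial_rto
    waited = 0.0    # total time spent waiting on expired timers so far
    last_rto = 0.0  # the timer that expired just before this attempt
    for k in range(attempts):
        # Delivered on attempt k: the first k sends were lost, this one arrived.
        probability = (loss ** k) * (1.0 - loss)
        rows.append((probability, last_rto, waited))
        # The next attempt only happens after the current RTO expires.
        waited += rto
        last_rto = rto
        rto *= 2.0
    return rows


if __name__ == "__main__":
    for probability, last_rto, waited in retransmission_delays(0.5):
        print(f"{probability:7.2%}  timer {last_rto:.0f}s  cumulative {waited:.0f}s")
```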

Multi-second delays on the Command & Control (C2) channel between the operator and the drone may be catastrophic, so the natural temptation is to move C2 away from TCP. For instance, one could propose a protocol that does speculative retransmissions over UDP. C2 is sporadic, so it needs neither bandwidth nor flow control. One could then send the same packet multiple times within a single RTO and increase its chances of delivery.
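To make the appeal of speculative retransmission concrete, here is a small sketch, assuming independent per-packet loss. Both `delivery_probability` and `Deduplicator` are hypothetical names, not part of any existing protocol; the receiver has to discard the duplicate copies, which is shown here with a sequence number.

```python
def delivery_probability(loss, copies):
    """Chance that at least one of `copies` identical datagrams arrives.

    Assumes each copy is lost independently with probability `loss`.
    """
    return 1.0 - loss ** copies


class Deduplicator:
    """Receiver-side filter: accept each sequence number only once."""

    def __init__(self):
        # Unbounded for simplicity; a real receiver would use a sliding window.
        self._seen = set()

    def accept(self, seq):
        if seq in self._seen:
            return False  # duplicate copy of an already delivered command
        self._seen.add(seq)
        return True


if __name__ == "__main__":
    for copies in (1, 2, 4):
        print(copies, delivery_probability(0.5, copies))
```

With 50% independent loss, four copies sent back to back would get through 93.75% of the time without waiting for any timer. That figure only holds while losses are independent, which is exactly what the rest of this section questions.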

However, replacing TCP, a protocol that took decades to evolve, is already a red flag, and that is just the tip of the iceberg. The world is web-oriented. If a network exchange isn’t HTTPS then:

  • Middleboxes might drop it;
  • The standard HTTP path- or domain-based routing that cloud ingresses/load balancers offer becomes useless, and one needs to resort to exotic UDP port-forwarding load balancers.
  • There is no TLS to offer trust or encryption, and one has to resort to exotic DTLS or a VPN.

Meanwhile:

  • We don’t even know how much time our drones spend in the packet-loss window where speculative retransmissions might make a difference. What if it’s just 1%?
  • When one hears “a 50% packet loss network”, one needs to ask: what’s the observation time window? Given one minute, of which the first 30s drops all the packets and the other 30s drops none, is it a 50% packet loss network, or two networks with 100% and 0% loss respectively? What matters here is that no amount of retransmissions is going to help in the first 30s.
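The observation-window question can be illustrated with a short sketch. It assumes a trace of one probe packet per second recorded as delivered or lost; the function name `windowed_loss` and the traces are invented for illustration and are not measurements.

```python
def windowed_loss(trace, window):
    """Loss rate per consecutive window of `window` packets.

    `trace` is a list of booleans, True meaning the packet was delivered.
    """
    rates = []
    for start in range(0, len(trace), window):
        chunk = trace[start:start + window]
        lost = sum(1 for delivered in chunk if not delivered)
        rates.append(lost / len(chunk))
    return rates


if __name__ == "__main__":
    # Hypothetical traces: 30 lost then 30 delivered, versus alternating loss.
    bursty = [False] * 30 + [True] * 30
    alternating = [False, True] * 30
    print(windowed_loss(bursty, 60), windowed_loss(alternating, 60))
    print(windowed_loss(bursty, 30), windowed_loss(alternating, 30))
```

Both traces report 50% loss over the full minute. Only the 30-packet windows reveal that the bursty trace is a dead link followed by a healthy one, where retransmissions cannot help, while the alternating trace is the case where they can.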

Decision

We shall endeavour to measure the actual network conditions (packet loss and round-trip delays) our drones experience, across the whole fleet and across operations, to increase the sample size.

Meanwhile, as a true proof of concept, we will ask our drones to speak to our standard HTTPS backend infrastructure using mTLS.
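On the drone side, this could look roughly like the sketch below, which uses Python's standard `ssl` module. The helper name `make_drone_tls_context` and its file path parameters are assumptions for illustration; the ADR does not specify where the drone keeps its certificate, key, or CA bundle.

```python
import ssl


def make_drone_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side TLS context that can also present a client certificate.

    ca_file:   CA bundle used to verify the backend (system defaults if None).
    cert_file: the drone's own certificate chain (hypothetical path).
    key_file:  the private key matching cert_file (hypothetical path).
    """
    # SERVER_AUTH means "we are a client authenticating a server":
    # certificate verification and hostname checking are enabled by default.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file is not None:
        # Presenting this certificate is what makes the TLS session mutual.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

The resulting context can be passed to `ctx.wrap_socket(sock, server_hostname=...)` or to an HTTPS client that accepts an `ssl.SSLContext`. Requiring and verifying the client certificate is configured on the server side and is not shown here.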

Consequences

We understand that using mTLS (and thus TCP) may expose our C2 to unacceptable delays, but we have to observe them being both unacceptable and correctable before we try shrinking them.

We understand that in presence of our aspirations for:

We may need to terminate TLS ourselves at each Avatar.

Alternatives Considered

mTLS is quite standard, but not all that common (way less common than OAuth2/JWT), especially if we wish to establish trust between a UAV and its specific Avatar.

Can drones be considered OAuth2 applications?
It could be argued that drones reach out to the backend seeking access to pilots, seeking to be controlled. They are autonomous agents, but not quite so autonomous as to fly themselves. However:

  • There are no fine-grained permissions that the drone needs to extend to its Avatar. By design, the Avatar (aka Digital Twin) represents its drone in the drone’s full agency. By design, between the drone and its Avatar, there is no authorization, there’s only (mutual) authentication.
  • If the drones weren’t behind NAT, like MAVLink, we would be compelled to treat them as servers and (in the OAuth2 vocabulary) resources. So which model is correct? Are drones resources or clients? Both models can be made to work, but only one of them should be correct. It could be that drones resemble resources until their increasing autonomy makes them resemble clients, in which case the correctness is transient. But once again, this is why the drone’s full authority extends to its Avatar. If drones ever need pilots, Avatars will do the seeking.

A self-contained JWT access token has no standard provision to be revoked. When the resource server validates tokens locally, OAuth2 gives it no means to determine that a previously issued token should no longer be trusted. Instead, JWTs expire and the OAuth2 authorization server can be told not to issue a new one. If tokens expire, they could expire in flight, when the drone finds itself on a packet radio network connected only to its Avatar and the pilot but cut off from the public internet. We could have the tokens expire well beyond the flight time, the drone’s battery life, or the end of the shift, but that exposes the very vulnerability we want to prevent: the drone falls from the sky, an attacker finds its SD card and uses the access token on it to impersonate the drone.
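The tension in that trade-off can be put into numbers with a small sketch. Both function names and all timings are hypothetical; times are in seconds since takeoff.

```python
def token_covers_flight(expires_at, max_flight_seconds):
    """True if the token cannot expire while the drone is still airborne."""
    return expires_at >= max_flight_seconds


def exposure_after_crash(expires_at, crash_time):
    """Seconds during which a recovered token would still be accepted."""
    return max(0.0, expires_at - crash_time)


if __name__ == "__main__":
    flight = 40 * 60  # assumed 40-minute maximum flight
    crash = 10 * 60   # assumed crash 10 minutes after takeoff
    for lifetime in (15 * 60, 8 * 3600):
        print(lifetime,
              token_covers_flight(lifetime, flight),
              exposure_after_crash(lifetime, crash))
```

With these assumed numbers, a 15-minute token expires mid-flight, while an 8-hour token covers the flight but remains usable for almost the entire shift after a crash.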
