Frontend Performance Observability
ADR-0139 · Author: Sybil Melton · Date: 2025-02-07 · Products: shared
Originally ADR-0110-Frontend-Performance-Observability (v3) · Source on Confluence ↗


Jira ticket - UNCREW-2984

Context

The performance of frontend web applications is critical for ensuring an optimal user experience and system reliability. Monitoring and testing the performance of Apollo Frontend can highlight issues like slow load times, large payload sizes, or JavaScript execution bottlenecks, which directly impact our users and the reliability of the system.
In the past we have received user reports about degraded Frontend performance after releases. The goal of this ADR is to present a way to detect degradation after changes reach the trunk branch, and to know when any of those problems occur on the client side.
This is a challenge, as each report may have a different cause: some may be related to a client’s network problems, others to the response time of Uncrew Backend services, and others to a change on the Frontend application side.
This is why it is important to collect as much data as we can and compare it with data we have already collected from different pieces of our system (like the argus dashboard for telemetry, or logs in Honeycomb).
Another important point is that the proposed solution should cover testing in the ‘Lab’ and in the ‘Field’, where the ‘Lab’ is an isolated machine (CircleCI, with no other processes running in the background) used to benchmark application performance between releases, and the ‘Field’ is the actual deployed application run by clients and developers in dev, staging, or production.

Decision

To observe the performance of the Uncrew Apollo Frontend, we have to implement new tools and extend the ones currently in use.
Let’s divide the tools into those for testing in the ‘Lab’ and those for testing in the ‘Field’.

For ‘Lab’ testing, i.e. performance testing on an isolated machine (CircleCI), we will use the Puppeteer test runner.

  • Puppeteer will run at intervals, every hour during developers’ working hours. That will give us constant feedback on the application’s performance.

  • Each test will generate a Lighthouse report, and each report will cover a specific piece of functionality (for example signing in, or displaying a Mission in Mission Manager).

  • Since the tests will take quite a long time to finish (and this time will grow with each added test), they will not block merges to trunk (additionally because the test code will be implemented in a different repo).

  • Running the tests constantly will allow us to observe any potential performance degradation between merges to trunk.

  • Generated reports will be published to a Slack channel.

    • Reports will flag when the agreed norms for web vitals are exceeded (as specified in Web Vitals); an important fact is that vitals may differ between consecutive runs without any code changes (explained in the Web Vitals background section).
  • Puppeteer tests can be added alongside each new piece of functionality, and they can be written by the developers (a minimal sketch of such a run follows this list).
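
As a rough illustration of what one such run could look like, here is a minimal sketch driving Lighthouse through Puppeteer. The URL, the audited vitals, and the thresholds are illustrative assumptions, not the agreed norms, and the real tests would live in the separate test repo mentioned above.

```ts
// perf-check.ts — minimal sketch of one hourly ‘Lab’ run.
// APP_URL and THRESHOLDS_MS are illustrative assumptions.
import puppeteer from 'puppeteer';
import lighthouse from 'lighthouse';

const APP_URL = 'https://apollo.example.dev/missions'; // hypothetical URL
const THRESHOLDS_MS: Record<string, number> = {
  'largest-contentful-paint': 2500,
  'total-blocking-time': 300,
};

async function run(): Promise<void> {
  const browser = await puppeteer.launch();
  // Point Lighthouse at the browser Puppeteer already started.
  const port = Number(new URL(browser.wsEndpoint()).port);
  const result = await lighthouse(APP_URL, {
    port,
    output: 'json',
    onlyCategories: ['performance'],
  });
  await browser.close();

  const audits = result!.lhr.audits;
  const failures = Object.entries(THRESHOLDS_MS).filter(
    ([audit, limitMs]) => (audits[audit]?.numericValue ?? 0) > limitMs,
  );

  if (failures.length > 0) {
    // In CI this summary would be posted to the Slack channel.
    console.error('Web vitals outside the agreed range:', failures);
    process.exitCode = 1;
  }
}

run();
```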

For ‘Field’ testing, i.e. performance testing of the application as used by actual users, we will use two tools.

  • The first is Grafana Faro, which will display a dashboard understandable to Product with all the information about the app’s performance that we can gather from the browser itself.

    • Web vitals are currently being sent to Honeycomb by our automated WebVitalsInstrumentation, but the lack of easily readable dashboards is a problem that Grafana Faro can solve.
    • Additionally, Faro allows us to trace a user’s session and each action the user has taken. By default it includes much useful information, such as each page the user has visited, but displaying information like the user ID requires building custom dashboards.
    • Implementing Grafana Faro beyond the PoC (for which a free Grafana Cloud account was used) would require installing Alloy to collect and transport the traces published by the Grafana Faro Web SDK (see the sketch after this list).
  • The second tool is Argus Agent, for collecting resource consumption on client machines, to better understand any potential latency, CPU, or memory issues and how they correlate with the web vitals published in Honeycomb.

    • argus-agent has to be installed on the laptops and tablets used in hubs to run Uncrew Apollo Frontend.
    • argus-dashboard will be used to collect and display metrics for each machine.
    • The agent has to be installed by IT on client machines.
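
A minimal sketch of what wiring the Grafana Faro Web SDK into the Frontend could look like; the collector URL, app metadata, and user ID below are assumptions, and in a real setup they would come from configuration and the auth layer.

```ts
// faro-init.ts — sketch of Faro Web SDK initialization.
// The URL and metadata are illustrative assumptions.
import { getWebInstrumentations, initializeFaro } from '@grafana/faro-web-sdk';
import { TracingInstrumentation } from '@grafana/faro-web-tracing';

const faro = initializeFaro({
  // Alloy endpoint that receives Faro payloads (hypothetical URL).
  url: 'https://alloy.example.dev/collect',
  app: {
    name: 'uncrew-apollo-frontend', // assumed app name
    version: '1.0.0',               // assumed; would come from the build
    environment: 'staging',         // lets dashboards differentiate environments
  },
  instrumentations: [
    ...getWebInstrumentations(),    // web vitals, errors, console, session tracking
    new TracingInstrumentation(),   // traces of user sessions, as described above
  ],
});

// Identify the user so custom dashboards can display the user ID.
faro.api.setUser({ id: 'user-123' }); // hypothetical; the real ID comes from auth
```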

Background

There are multiple tools that can help us understand how the application behaves; the ones below were considered in the decision.
Some of these tools serve a similar purpose, some a totally different one. All of them are meant to provide a better understanding of whether there are any problems with the application’s performance.

A summary of the capabilities a tool should have to be considered for implementation is presented below. Note: a tool is not required to have all of the capabilities.

  • Generates some form of report about performance when actions are taken in the app
  • Gives a clear understanding of web vitals during specific moments of the application lifecycle
  • Provides information about the client machine’s resource consumption
  • Can differentiate users
  • Can differentiate environments

Lighthouse

An automated tool used to generate reports about load times, SEO friendliness, industry best practices, and accessibility.
Reports generated by Lighthouse can be parsed and used as metrics. It cannot show resource consumption on the client machine.
Lighthouse can be run manually in the browser or integrated into a test runner. It cannot identify a specific user unless that information is provided directly, nor can it detect the environment.

Grafana Faro (K6)

A tool that generates web vitals reports similar to Lighthouse’s, with the difference that it generates them for each client. It cannot show resource consumption on the client machine.
It can, however, distinguish separate users and environments. In terms of observability and capabilities it is very similar to a tool already implemented in Uncrew Apollo Frontend: Honeycomb. The main difference is the Grafana dashboard, which can be prepared to display precisely what developers want to be informed about.

Honeycomb Web Vitals Instrumentation

A tool that is already implemented in Uncrew Apollo Frontend; its capabilities are still presented here for completeness. Honeycomb Web Vitals Instrumentation provides web vitals to the Honeycomb dashboard.
It provides information about the specific user and environment, but cannot provide information about the client machine’s resource consumption.
It does not generate any form of report beyond the queries a developer can prepare (example: Staging logs).
While this tool is implemented, how the data is logged and displayed has to be improved.
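
For reference, a setup along the lines of Honeycomb’s public web SDK might look like the sketch below; the service name and key handling are assumptions, not the actual Uncrew wiring.

```ts
// Sketch of a web vitals setup with Honeycomb's public web SDK.
// The service name and API-key placeholder are assumptions.
import { HoneycombWebSDK, WebVitalsInstrumentation } from '@honeycombio/opentelemetry-web';

const sdk = new HoneycombWebSDK({
  apiKey: 'HONEYCOMB_INGEST_KEY_PLACEHOLDER', // real key would come from config
  serviceName: 'uncrew-apollo-frontend',      // assumed dataset/service name
  instrumentations: [new WebVitalsInstrumentation()], // publishes the web vitals
});
sdk.start();
```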

Argus Agent

A tool installed on client machines that can collect data about actual resource consumption (CPU, RAM, disk usage, bandwidth, latency).
One of the downsides is that it cannot differentiate tabs in the browser, and because of that it cannot tell which environment the app is running on. It can provide information about the user.
Collected data can be presented in a Grafana dashboard in a readable form.
One important consideration for installing Argus Agent on client machines is that it requires a change in the company’s standard operating procedures for configuring and maintaining those machines.

Puppeteer

An industry-standard tool for running E2E tests and automated tasks. It can run in CI with specific predefined tests, during which it can generate reports about various web vitals, produce Lighthouse reports, and even take and save screenshots of the currently visible state of the application (see the sketch below).
The tool itself cannot provide information about the current user, and can be run against one environment at a time, or more if required.
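
As an illustration of the screenshot capability (the URL and file name are assumptions):

```ts
// Sketch: saving a screenshot of the currently visible application state.
import puppeteer from 'puppeteer';

async function capture(): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://apollo.example.dev', { waitUntil: 'networkidle0' }); // hypothetical URL
  await page.screenshot({ path: 'app-state.png' }); // saved next to the generated report
  await browser.close();
}

capture();
```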

Web vitals

Because of how the browser ecosystem is built, it is not possible to get the actual CPU and RAM usage of an individual application from within the codebase.
Tools like Lighthouse, however, can give us some understanding of web vitals, such as:

  • Largest Contentful Paint (LCP) - Measures the time it takes for the largest content element on the page (such as images, videos, or large blocks of text) to become visible on the screen.
  • First Input Delay (FID) - Measures the time it takes for a page to respond to the first user interaction (such as clicking a button or tapping on a link).
  • Cumulative Layout Shift (CLS) - Measures the visual stability of a page by quantifying unexpected layout shifts during the loading phase.
  • First Contentful Paint (FCP) - Measures the time it takes for the first piece of content to be rendered (text, image, or canvas).
  • Time to Interactive (TTI) - Measures the time it takes for the page to become fully interactive, meaning that users can interact with it without any delays.
  • Total Blocking Time (TBT) - Measures the amount of time that the browser is blocked from responding to user inputs during the page load process. It correlates with FID.
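
In the ‘Field’, the user-centric metrics above can also be captured directly in the browser (TTI and TBT are lab-only metrics). A minimal sketch using the open-source web-vitals package; the report sink below is a hypothetical helper, and in our stack the values would be forwarded to Honeycomb or Faro:

```ts
// Sketch: capturing vitals in the browser with the `web-vitals` package.
import { onCLS, onFCP, onLCP } from 'web-vitals';

// Hypothetical sink; in our stack this would forward to Honeycomb or Faro.
function report(metric: { name: string; value: number }): void {
  console.log(metric.name, Math.round(metric.value));
}

onCLS(report);
onFCP(report);
onLCP(report);
```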

An important fact about web vitals is that they can differ between consecutive runs of the same test. That is why they should not be compared directly; instead, their values should be compared against ranges considered an acceptable norm.
While there are guidelines on what the ‘norm’ looks like, each application differs, and the system architecture can justify amendments to the established norm (for example, if the application is ‘heavy’ because of its calculations, some values will naturally be elevated in specific scenarios).
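
A sketch of what comparing against a range rather than a raw previous value could look like; the bands below mirror commonly published ‘good’ thresholds and are assumptions, not our agreed norms:

```ts
// Sketch: flag a vital only when it leaves an agreed band, since single
// values are noisy between runs. The thresholds are assumed examples.
const normsMs: Record<string, number> = {
  'largest-contentful-paint': 2500, // commonly published "good" LCP bound
  'total-blocking-time': 300,       // commonly published "good" TBT bound
};

function outsideNorm(measured: Record<string, number>): string[] {
  return Object.keys(normsMs).filter(
    (metric) => (measured[metric] ?? 0) > normsMs[metric],
  );
}

// Two consecutive runs of the same build: 1890 ms and 2140 ms LCP.
// Both fall inside the band, so neither run is flagged as a regression.
console.log(outsideNorm({ 'largest-contentful-paint': 1890 })); // []
console.log(outsideNorm({ 'largest-contentful-paint': 2140 })); // []
```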

A web vital’s value may be elevated for numerous reasons, for example:

  • Unoptimized images and other media (increase load times, especially LCP)
  • Unoptimized CSS or JS code (increases most of the web vitals)
  • Excessive third-party scripts (if loaded synchronously, they can increase vitals)
  • Blocking the main thread while waiting for an HTTP response
  • Lack of caching, which may increase wait times for responses
  • Increased traffic, which may increase response times and, because of that, LCP, CLS, or TBT
  • User-specific conditions: latency issues, memory problems, overall client device performance issues

While web vitals are a very useful piece of data we can get from the browser, and are used as an industry standard to determine application performance, they are not enough to observe how the application impacts the client’s machine.
For this purpose we must have a way to collect and display the CPU and RAM usage of client machines.

Grafana Alloy

Alloy is an OpenTelemetry collector that publishes traces to Grafana. Since DroneUp is using hosted Grafana Enterprise in GCP, Alloy needs to be installed to publish traces.

Status

Accepted