Categories: Data Engineering

The Dashboard Problem and Data Shapes

The data teams at Mozilla have put a great deal of effort into building a robust data ingestion pipeline and a reliable data warehouse that can serve a wide variety of needs. Yet, we keep coming back to conversations about the dashboard problem, or about how we’re missing last-mile tooling that makes data accessible for use in data products that we can release to different customers within Mozilla.

We’re in the process of evaluating several general-purpose business intelligence (BI) tools to allow rapid self-service that we expect will meet a large class of needs, but we also know that there are certain use cases important enough to justify purpose-built dashboards. While bespoke dashboards require a higher level of maintenance and should be built sparingly, they can deliver significant value when closely matched to the needs of a well-understood target audience or when they solve a class of problem that would otherwise require a whole series of simpler dashboards. Where custom data products are justified, we want to have repeatable patterns available for deployment.

The question is: how alike are these different custom dashboards? What are the factors that motivate different dashboard designs on the frontend and backend?

I posit that the first and most important step in determining how a dashboard works is to understand the size of the dataset needed to power the dashboard and to determine what shape the data should take to serve the target need.

Context

At Mozilla, Google BigQuery is our data warehouse technology, and I’m assuming that all dashboard products are fed from data available in our BigQuery projects. Usually, we will be working from derived tables with some processing (ETL) applied by means of periodic scheduled queries. In practice, most of our dashboards are powered from tables updated once per day.

In some of the cases we describe below, we are serving results based on querying BigQuery directly, but we also have PostgreSQL available in our infrastructure as a traditional transactional database.

From user interviews about data products and analysis needs, we keep hearing the theme that responsiveness matters. Users are used to interacting with web applications that are highly optimized for providing low-latency results; page loads that take longer than 1 second hurt users’ confidence in a product and limit their productivity. For a data product to make a significant organizational impact, it needs to be something that users enjoy interacting with, so we see latency and page load times as a high-priority user experience concern. This article explores options for providing dashboards with good performance on top of different data shapes and the trade-offs necessary as the complexity and size of the data model grows.

Data shapes

Static data

For dashboard cases where we have a finite set of visualizations to display (perhaps up to a few dozen plots), it is possible to bake the data directly into the rendered product such that no backend is needed and the dashboard can be a simple static site.

We can schedule builds of the dashboard to happen immediately after the relevant nightly queries finish (in Mozilla’s case as part of an Airflow DAG) and serve the results as a set of static files in an object store like GCS. This leads to a simple architecture and snappy responses for clients.
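As a rough illustration of what that scheduling can look like, here is a minimal Airflow sketch (assuming Airflow 2.x). The DAG id, the build script, the SQL file, and the GCS bucket are all hypothetical placeholders; in a real deployment the build task would simply be set downstream of the existing nightly query tasks.

```python
# Minimal sketch: rebuild and publish a static dashboard right after the nightly
# derived table is refreshed. All names here are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="static_dashboard_build",      # hypothetical DAG id
    start_date=datetime(2020, 2, 1),
    schedule_interval="0 6 * * *",        # run once the nightly queries have landed
    catchup=False,
) as dag:
    # Placeholder for the scheduled query that refreshes the derived table.
    refresh_derived_table = BashOperator(
        task_id="refresh_derived_table",
        bash_command="bq query --use_legacy_sql=false < refresh_metrics.sql",
    )

    # Bake the query results into static JSON/HTML and push the files to an
    # object store, where they are served directly to clients.
    build_and_publish = BashOperator(
        task_id="build_and_publish_dashboard",
        bash_command=(
            "python build_dashboard.py --output-dir ./site && "
            "gsutil -m rsync -r ./site gs://example-dashboard-bucket"
        ),
    )

    refresh_derived_table >> build_and_publish
```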

Examples of this technique in use at Mozilla are KPI dashboards and “Moment in time” dashboards accompanying major product launches. These generally consist of a curated set of visualizations aimed at giving a wide audience a shared understanding of which metrics we care about and how those metrics are performing. These use cases are more about informing audiences than enabling exploration of the data.

Tidy data

The next level of complexity is a dashboard that has one or two dimensions of configurability. Usually, there is a defined set of metrics or visualizations, but the user can select between different target cohorts for the visualization. In most cases, the amount of data involved is not particularly large, but it exceeds the amount (perhaps 10 MB) that we would want to serve to a browser all at once as we would with a static page.

The data in this case should probably live in a relational database and be accessible via an HTTP endpoint. A model we have found works well is to follow a tidy data methodology where each row in the table represents a single day for a single cohort, and then we add a column for each metric that we care about. This ends up looking something like the following:

date         cohort    active_users   returning_users
2020-02-11   Overall   32711          12421
2020-02-12   Overall   31622          10629

If the user chooses the “Overall” cohort in the dashboard, we need to fetch perhaps a few hundred rows from the database to fill in whatever time range we care about. The dashboard in this case would likely have plots for “Active Users” and “Returning Users”, or perhaps we show those two time series on a single plot.

From a performance perspective, the important characteristic here is that rows in the table correspond directly to time series on the dashboard. We aren’t doing any aggregation, but simply fetching precalculated results from a database. This can be well served by PostgreSQL if we index the table by cohort, and we should expect to be able to serve pages in well under one second even if the number of cohorts or number of concurrent users scales into the thousands.
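A minimal sketch of that pattern follows, assuming a hypothetical daily_metrics table matching the example above and standard psycopg2 usage; this is illustrative rather than our production code.

```python
# Tidy-data pattern: one row per (date, cohort), one column per precalculated
# metric, indexed so that one cohort's time series is a cheap range scan.
import psycopg2

conn = psycopg2.connect("dbname=dashboards")  # hypothetical connection string
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS daily_metrics (
        date            DATE NOT NULL,
        cohort          TEXT NOT NULL,
        active_users    BIGINT,
        returning_users BIGINT,
        PRIMARY KEY (cohort, date)
    )
""")
conn.commit()

# The primary key doubles as an index on cohort, so serving a dashboard view is
# just fetching a few hundred precalculated rows for the requested time range.
cur.execute(
    """
    SELECT date, active_users, returning_users
    FROM daily_metrics
    WHERE cohort = %s AND date BETWEEN %s AND %s
    ORDER BY date
    """,
    ("Overall", "2020-01-01", "2020-02-12"),
)
rows = cur.fetchall()
```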

We could also serve results from a tidy table in BigQuery directly, but the mean latency for queries even on small tables will likely be in the 1 to 2 second range, with a long tail meaning that 99th percentile response times could be closer to 10 seconds. PostgreSQL is able to provide much more predictable performance.

Large-scale projects at Mozilla that roughly follow this model are Mission Control, the Firefox Measurement Dashboard, and its upcoming replacement, GLAM. Those dashboards provide a few different views for looking at aggregates of telemetry probes, but generally they serve only a single plot at a time. This seems simple, but the user can choose from thousands of probes, dozens of product versions, and dozens of OS versions. The backing table in PostgreSQL ends up having millions of rows, each with precalculated aggregates for a particular combination of dimension values, but response times are low because a single page view generally requires pulling only a single row out of PostgreSQL.
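For illustration only (this is not GLAM’s actual schema), the pattern reduces each page view to a primary-key lookup: the dimensions the user selects form a composite key, and the precalculated aggregate is stored as a payload next to it.

```python
# Illustrative composite-key lookup: millions of precalculated rows, but each
# page view touches exactly one of them.
import psycopg2

conn = psycopg2.connect("dbname=dashboards")  # hypothetical connection string
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS probe_aggregates (
        probe      TEXT NOT NULL,
        version    TEXT NOT NULL,
        os         TEXT NOT NULL,
        aggregate  JSONB NOT NULL,   -- precalculated histogram / percentiles
        PRIMARY KEY (probe, version, os)
    )
""")
conn.commit()

# One page view == one primary-key lookup, regardless of how many dimension
# combinations exist in the table.
cur.execute(
    "SELECT aggregate FROM probe_aggregates "
    "WHERE probe = %s AND version = %s AND os = %s",
    ("gc_ms", "75", "Windows"),  # hypothetical dimension values
)
row = cur.fetchone()
aggregate = row[0] if row else None
```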

Data cubes

If we want a dashboard to be more exploratory, where a user can drill down into the data based on a variety of dimensions, the tidy data model can break down. In particular, the space of potential dimension combinations can be so large that it’s not feasible to fully precalculate all the possible values.

In this case, we need to resort to a data model where we precalculate subsets of data, but expect that we’ll need to perform final aggregations as part of the user request. The technique of precalculating aggregates for each unique set of dimensions is sometimes called a data cube or OLAP cube, but be aware that those terms have a rich history of their own, so the terminology used here may be missing nuance compared to what’s available in tools specifically built to support OLAP cubes.
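Below is a hedged sketch of the two halves of this approach; the table and column names are made up and our real ETL may differ. The offline step groups by every dimension and stores a mergeable HLL++ sketch per cell, and the request-time step merges those sketches over whichever dimensions the user did not select. BigQuery’s HLL_COUNT.INIT and HLL_COUNT.MERGE exist precisely so that distinct counts like active users can be re-aggregated this way without double counting.

```python
# Data cube sketch: precalculate per-combination sketches offline, then do the
# final aggregation at request time. Names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Offline step: one row per unique combination of dimension values, carrying a
# mergeable sketch of the distinct clients seen in that cell.
build_cube = """
CREATE OR REPLACE TABLE analysis.usage_cube AS
SELECT
  submission_date, product, os, country, channel,
  HLL_COUNT.INIT(client_id) AS client_hll
FROM telemetry.daily_usage
GROUP BY submission_date, product, os, country, channel
"""
client.query(build_cube).result()

# Request time: the user selected two dimensions, so the remaining dimensions
# are aggregated away by merging their sketches.
per_request = """
SELECT
  submission_date,
  HLL_COUNT.MERGE(client_hll) AS active_users
FROM analysis.usage_cube
WHERE product = @product AND country = @country
GROUP BY submission_date
ORDER BY submission_date
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("product", "STRING", "Firefox Desktop"),
        bigquery.ScalarQueryParameter("country", "STRING", "DE"),
    ]
)
rows = list(client.query(per_request, job_config=job_config).result())
```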

An example of where we use this technique is the Growth and Usage Dashboard (GUD), which allows users to view high-level usage metrics like monthly active users, dynamically drilling down into a set of supported dimensions (product, OS, language, country, and release channel). In its initial implementation, this led to a very large table (~2 TB) which we left in BigQuery, since BigQuery is very good at quickly scanning and aggregating large amounts of data where PostgreSQL likely would not be able to serve this need. This setup technically functioned, but an average page load for a new combination of dimensions was on the order of 30 seconds. We relied heavily on caching results for pages that had already been requested, but the response for new pages was simply too slow for this to be a reasonable tool for ad hoc data exploration.

Our first major step toward improving performance was a server-side in-memory cache of the data for each page viewed. That simple measure brought the load time for the front page and other commonly viewed dimension combinations into the ideal sub-second range, but it quickly breaks down for any exploratory use case where users want to drill down to more specific slices of the data, each of which requires another 30-second wait.
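In sketch form, the cache is just the selected dimension combination used as a key in front of the slow aggregation; query_cube below is a hypothetical stand-in for the real query layer.

```python
# Server-side in-memory cache sketch: cache results per dimension combination.
from functools import lru_cache

def query_cube(product, os, country, channel):
    # Hypothetical stand-in for the ~30 second BigQuery aggregation described above.
    raise NotImplementedError

@lru_cache(maxsize=1024)
def cached_slice(product, os, country, channel):
    # The dimension values form the cache key: the first request for a new
    # combination pays the full aggregation cost, while repeat requests are
    # served from process memory in well under a second.
    return query_cube(product, os, country, channel)
```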

Our second improvement focused on bringing down load times for novel sets of dimensions outside of the cache. We achieved an order-of-magnitude reduction by aggressively limiting the space of potential dimension values. For example, there are more than 200 possible values for the Country dimension, but we had already been limiting the available country selections on the frontend to a dozen of the largest countries in order to keep the UI manageable. By exactly matching the contents of the data cube to the available options in the UI, we were able to create a “compressed” cube that is ~25 GB in size. This leads to average page load times of 5 seconds, certainly a huge improvement, but far from a victory in terms of making this a data product that users truly love. The size of the cube may be small enough now that it would be worth investigating the performance of this workload if we exported the data to PostgreSQL.
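One way to express that compression (an assumption about the mechanics; the real build might simply drop unsupported values rather than bucketing them) is to rebuild the cube so that its dimension domains exactly match what the UI offers, using BigQuery’s HLL_COUNT.MERGE_PARTIAL to collapse the sketches for values that get folded together.

```python
# "Compressed" cube sketch: keep only the dimension values the UI exposes,
# folding everything else into a single bucket. Names and the country list are
# hypothetical.
from google.cloud import bigquery

UI_COUNTRIES = ["US", "DE", "FR", "GB", "CA", "BR", "IN", "CN", "RU", "PL", "IT", "ES"]

client = bigquery.Client()
compress_cube = """
CREATE OR REPLACE TABLE analysis.usage_cube_compressed AS
SELECT
  submission_date, product, os,
  -- values the UI will never let the user pick are folded into one bucket,
  -- so the compressed cube contains no combination that cannot be requested
  IF(country IN UNNEST(@ui_countries), country, 'Other') AS country,
  channel,
  HLL_COUNT.MERGE_PARTIAL(client_hll) AS client_hll
FROM analysis.usage_cube
GROUP BY 1, 2, 3, 4, 5
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter("ui_countries", "STRING", UI_COUNTRIES)
    ]
)
client.query(compress_cube, job_config=job_config).result()
```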