Abstract. Respecting our users’ privacy choices is one of our top priorities, and it includes deleting their data from our Data Warehouse (DWH) when they ask us to. For Analytics Engineering, this deletion makes it challenging to keep business metrics reliable and stable as business analyses evolve. This blog post describes our approach to overcoming this challenge. Reading time: ~5 minutes.
Mozilla has a strong commitment to protecting user privacy and giving each user control over the information that they share with us. When a user chooses to opt out of sending telemetry data, the browser sends a request that results in the deletion of the user’s records from our Data Warehouse. We call this process Shredder.
Shredder becomes a problem when reported key performance indicators (KPIs) and forecasts change after a reprocess, or “backfill”, of data. This limits our analytics capabilities and the evolution of our products. Yet running a backfill is a common process that remains essential to expanding our business understanding, so the question becomes: how do we rise to this challenge? Shredder Mitigation is a strategy that solves this problem and removes the impact on business metrics.
Let’s see how it works with a simplified example. A table “installs” in the DWH contains telemetry data, including the install ID, browser and channel used on given dates.
installs
date | install_id | browser | channel |
2021-01-01 | install-1 | Firefox | Release |
2021-01-01 | install-2 | Fenix | Release |
2021-01-01 | install-3 | Focus | Release |
2021-01-01 | install-4 | Firefox | Beta |
2021-01-01 | install-5 | Fenix | Release |
Derived from this installs table is an aggregate that stores the metric “kpi_installs”. It lets us understand usage per browser over time and improve accordingly, and it contains no ID or channel information.
installs_aggregates_v1
date | browser | kpi_installs |
2021-01-01 | Firefox | 2 |
2021-01-01 | Fenix | 2 |
2021-01-01 | Focus | 1 |
Total | 5 |
What happens when install-3 and install-5 opt out of sending telemetry data and we need to backfill? Each opt-out results in the browser sending a deletion request, which Mozilla’s Shredder process handles by deleting the existing records of these installs across the DWH. After this deletion, the business asks us whether it’s possible to calculate kpi_installs split by channel, to evaluate beta, nightly and release separately. This means that the channel needs to be added to the aggregate and the data backfilled to recalculate the KPI. With install-3 and install-5 deleted, the backfill will report a reduced (and thus unstable) value for kpi_installs due to Shredder’s impact.
installs_aggregates (without shredder mitigation)
date | browser | channel | kpi_installs |
2021-01-01 | Firefox | Release | 1 |
2021-01-01 | Firefox | Beta | 1 |
2021-01-01 | Fenix | Release | 1 |
Total | 3 |
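The drop from 5 to 3 can be reproduced with a small Python sketch. It only illustrates the arithmetic of the example and is not the warehouse query: the `aggregate` helper and the in-memory rows are assumptions made for this illustration.

```python
from collections import Counter

# Rows of the example installs table: (date, install_id, browser, channel).
INSTALLS = [
    ("2021-01-01", "install-1", "Firefox", "Release"),
    ("2021-01-01", "install-2", "Fenix", "Release"),
    ("2021-01-01", "install-3", "Focus", "Release"),
    ("2021-01-01", "install-4", "Firefox", "Beta"),
    ("2021-01-01", "install-5", "Fenix", "Release"),
]


def aggregate(rows, with_channel):
    """Count installs per (date, browser), optionally also split by channel."""
    counts = Counter()
    for date, _install_id, browser, channel in rows:
        key = (date, browser, channel) if with_channel else (date, browser)
        counts[key] += 1
    return counts


# Aggregate as originally reported, before any deletion request.
reported = aggregate(INSTALLS, with_channel=False)

# Shredder removes the records of the installs that opted out.
deleted = {"install-3", "install-5"}
remaining = [row for row in INSTALLS if row[1] not in deleted]

# Naive backfill: recompute the KPI from whatever is left in the source table.
naive = aggregate(remaining, with_channel=True)

print(sum(reported.values()))  # 5
print(sum(naive.values()))     # 3
```

The two totals differ because the naive backfill can only see the three installs that remain in the source table.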
How do we solve this problem? The Shredder Mitigation process safely executes the backfill of the aggregate by recalculating the KPI using only the combination of previous and new aggregates data and queries. It identifies the difference in metrics caused by Shredder’s deletions and stores this difference under a NULL value of the newly added dimension. The process runs efficiently for terabytes of data, ensuring 100% stability in reported metrics and avoiding unnecessary costs by running automated data checks for each subset backfilled. Every version of our aggregates that uses Shredder Mitigation is reviewed to ensure it does not contain any dimensions that could be used to identify previously deleted records.
The result of a backfill with shredder mitigation in our example is a new version of the aggregate that incorporates the requested dimension “channel” and matches the reported version of the KPI:
installs_aggregates_v2
browser | channel | kpi_installs |
Firefox | Release | 1 |
Firefox | Beta | 1 |
Fenix | Release | 1 |
Fenix | NULL | 1 |
Focus | NULL | 1 |
Total | 5 |
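The way the NULL rows arise can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual implementation: the `mitigate` function and the dictionaries `previous` and `current` are hypothetical names, and `None` stands in for NULL.

```python
from collections import Counter

# Previously reported aggregate (v1): (date, browser) -> kpi_installs.
previous = {
    ("2021-01-01", "Firefox"): 2,
    ("2021-01-01", "Fenix"): 2,
    ("2021-01-01", "Focus"): 1,
}

# Aggregate recomputed after the deletions, now split by channel:
# (date, browser, channel) -> kpi_installs.
current = {
    ("2021-01-01", "Firefox", "Release"): 1,
    ("2021-01-01", "Firefox", "Beta"): 1,
    ("2021-01-01", "Fenix", "Release"): 1,
}


def mitigate(previous, current):
    """Add a NULL-channel row for every count missing from the new aggregate."""
    # Roll the new aggregate back up to the dimensions of the previous one.
    recomputed = Counter()
    for (date, browser, _channel), count in current.items():
        recomputed[(date, browser)] += count

    result = dict(current)
    for key, reported_count in previous.items():
        gap = reported_count - recomputed[key]
        if gap < 0:
            # Assumption of this sketch: deletions can only lower a count.
            raise ValueError(f"recomputed count exceeds reported count for {key}")
        if gap > 0:
            # The channel of the deleted records is unknown, so it stays NULL.
            result[(*key, None)] = gap
    return result


mitigated = mitigate(previous, current)
print(sum(mitigated.values()))  # 5
```

Note that the sketch never reads the source table for the missing counts. It only compares two aggregates, so nothing about the deleted installs beyond the already reported totals is carried into the new version.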
With the reported metrics stable and consistent, the shredder mitigation process enables the business to evolve safely, generating knowledge in alignment with our data protection policies and safeguarding our users’ privacy choices. Want to learn more? Head over to the shredder process technical documentation for a detailed implementation guide and hands-on insights.