(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla uses to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as Glean inspires it. You can find an index of all TWiG posts online.)
Data ingestion is a process that involves decompressing, validating, and transforming millions of documents every hour. The schemas of the data coming into our systems are ever-evolving, and when the conditions are ripe, a schema change can cause a partial outage of data availability. Once the outage has been resolved, we run a backfill to fill in the gaps left by the missing data. In this post, I’ll walk through the error-discovery and recovery processes by way of a recent bug.
Catching and fixing the error
Every Monday, a group of data engineers pores over a set of dashboards and plots indicating data ingestion health. On 2020-08-04, we filed a bug after observing an elevated rate of schema validation errors coming from environment/system/gfx/adapters/N/GPUActive. Partial outages like this one, affecting only a small fraction of our overall volume, are typically not urgent (as in not “we need to drop everything right now and resolve this, stat!” critical). We reached out to the subject-matter experts and found out that the code responsible for reporting multiple GPUs in the environment had changed.
A few weeks after we filed the GPUActive bug, an intern reached out to me about a DNS study he was running. I helped him figure out that his external-monitor setup with his MacBook was causing rejections like the ones we had seen weeks before. One PR and one deploy later, I watched the error rates for the GPUActive field abruptly drop to zero.
Figure: Error counts for environment/system/gfx/adapters/N/GPUActive
The schema’s misspecification resulted in 4.1 million documents between 2020-07-04 and 2020-08-20 being sent to our error stream to await reprocessing.
Running a backfill
In January of 2021, we ran the backfill of the GPUActive rejects. First, we determined the backfill range by querying the relevant error table:
SELECT
  DATE(submission_timestamp) AS dt,
  COUNT(*)
FROM
  `moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
  submission_timestamp < '2020-08-21'
  AND submission_timestamp > '2020-07-03'
  AND exception_class = 'org.everit.json.schema.ValidationException'
  AND error_message LIKE '%GPUActive%'
GROUP BY 1
ORDER BY 1
The query helped verify the date range of the errors and their counts: 2020-07-04 through 2020-08-20. The following tables were affected:
crash
dnssec-study-v1
event
first-shutdown
heartbeat
main
modules
new-profile
update
voice
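As an aside, this list can be recovered from the error stream itself. Here is a sketch (assuming the error table’s uri column holds the submission path, where the document type is the segment after the document ID):

-- Sketch: break the GPUActive rejects down by document type,
-- parsing the type out of the submission URI
SELECT
  REGEXP_EXTRACT(uri, r'^/submit/telemetry/[^/]+/([^/]+)/') AS document_type,
  COUNT(*) AS error_count
FROM
  `moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
  DATE(submission_timestamp) BETWEEN "2020-07-04" AND "2020-08-20"
  AND exception_class = 'org.everit.json.schema.ValidationException'
  AND error_message LIKE '%GPUActive%'
GROUP BY 1
ORDER BY error_count DESC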
We isolated the error documents into a backfill project named moz-fx-data-backfill-7 and mirrored our production BigQuery datasets and tables into it. The following query selects the documents to be reprocessed:
SELECT
  *
FROM
  `moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
  DATE(submission_timestamp) BETWEEN "2020-07-04" AND "2020-08-20"
  AND exception_class = 'org.everit.json.schema.ValidationException'
  AND error_message LIKE '%GPUActive%'
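One way to materialize that selection into the backfill project is a CREATE TABLE ... AS SELECT; this is a sketch rather than the exact statement we ran, and it assumes the destination dataset already exists:

-- Sketch: snapshot the matching rejects into the backfill project
CREATE TABLE
  `moz-fx-data-backfill-7.payload_bytes_error.telemetry` AS
SELECT
  *
FROM
  `moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
  DATE(submission_timestamp) BETWEEN "2020-07-04" AND "2020-08-20"
  AND exception_class = 'org.everit.json.schema.ValidationException'
  AND error_message LIKE '%GPUActive%'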
Then we ran a suitable Dataflow job to populate our tables, using the same ingestion code as the production jobs; the job took about 31 minutes to run to completion. Next, we copied and deduplicated the data into a dataset mirroring our production environment:
# Run subsequent commands against the backfill project
gcloud config set project moz-fx-data-backfill-7

# Build the space-separated list of dates in the affected range
# (2020-07-04 through 2020-08-20; the end date is exclusive)
dates=$(python3 -c 'from datetime import datetime as dt, timedelta; start=dt.fromisoformat("2020-07-04"); end=dt.fromisoformat("2020-08-21"); days=(end-start).days; print(" ".join([(start + timedelta(i)).isoformat()[:10] for i in range(days)]))')

# Copy each day of data out of the live tables and deduplicate it into the stable tables
./script/copy_deduplicate --project-id moz-fx-data-backfill-7 --dates $(echo $dates)
This step took hours to complete because it iterated over all tables for each of the ~50 days, regardless of whether they contained any data. Future backfills should probably remove empty tables before kicking off this script.
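One hedged way to find those candidates up front is BigQuery’s __TABLES__ metadata (the dataset name here is illustrative):

-- Sketch: list live tables with no rows so they can be dropped
-- (or skipped) before running copy_deduplicate
SELECT
  table_id
FROM
  `moz-fx-data-backfill-7.telemetry_live.__TABLES__`
WHERE
  row_count = 0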
Now that the tables were populated, we handled the data deletion requests that had arrived since the time of the initial error. A module named Shredder serves self-service deletion requests in BigQuery ETL; we ran it from the bigquery-etl root:
script/shredder_delete \
  --billing-projects moz-fx-data-backfill-7 \
  --source-project moz-fx-data-shared-prod \
  --target-project moz-fx-data-backfill-7 \
  --start_date 2020-06-01 \
  --only 'telemetry_stable.*' \
  --dry_run
After reviewing the dry run, we ran the same command without --dry_run, which removed the relevant rows from our final tables:
INFO:root:Scanned 515495784 bytes and deleted 1280 rows from moz-fx-data-backfill-7.telemetry_stable.crash_v4
INFO:root:Scanned 35301644397 bytes and deleted 45159 rows from moz-fx-data-backfill-7.telemetry_stable.event_v4
INFO:root:Scanned 1059770786 bytes and deleted 169 rows from moz-fx-data-backfill-7.telemetry_stable.first_shutdown_v4
INFO:root:Scanned 286322673 bytes and deleted 2 rows from moz-fx-data-backfill-7.telemetry_stable.heartbeat_v4
INFO:root:Scanned 134028021311 bytes and deleted 13872 rows from moz-fx-data-backfill-7.telemetry_stable.main_v4
INFO:root:Scanned 2795691020 bytes and deleted 1071 rows from moz-fx-data-backfill-7.telemetry_stable.modules_v4
INFO:root:Scanned 302643221 bytes and deleted 163 rows from moz-fx-data-backfill-7.telemetry_stable.new_profile_v4
INFO:root:Scanned 1245911143 bytes and deleted 6477 rows from moz-fx-data-backfill-7.telemetry_stable.update_v4
INFO:root:Scanned 286924248 bytes and deleted 10 rows from moz-fx-data-backfill-7.telemetry_stable.voice_v4
INFO:root:Scanned 175822424583 and deleted 68203 rows in total
After this was all done, we appended each of these tables to the corresponding production tables. Appending requires superuser permissions, so the task was handed off to another engineer to finalize.
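The exact append commands aren’t captured here, but they presumably mirror the bq cp invocation used for the error table below. As an illustrative sketch (the main_v4 table pair is my assumption), appending one stable table might look like this:

# Hypothetical example: append one backfilled stable table to its production counterpart
bq cp --append_table \
  moz-fx-data-backfill-7:telemetry_stable.main_v4 \
  moz-fx-data-shared-prod:telemetry_stable.main_v4

Afterward, we deleted the rows in the production error table corresponding to the pings we had just backfilled from the backfill-7 project: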
DELETE FROM
  `moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
  DATE(submission_timestamp) BETWEEN "2020-07-04" AND "2020-08-20"
  AND exception_class = 'org.everit.json.schema.ValidationException'
  AND error_message LIKE '%GPUActive%'
Finally, we updated the production errors with the new errors generated during the backfill process:
bq cp --append_table \
  moz-fx-data-backfill-7:payload_bytes_error.telemetry \
  moz-fx-data-shared-prod:payload_bytes_error.telemetry
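As a final sanity check (my own sketch, not part of the recorded runbook), re-running the original error query over the affected range should now return only documents that failed again during reprocessing:

-- Sketch: after the delete-and-append, the original GPUActive rejects
-- should be gone; any remaining rows failed again during reprocessing
SELECT
  COUNT(*) AS remaining_errors
FROM
  `moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
  DATE(submission_timestamp) BETWEEN "2020-07-04" AND "2020-08-20"
  AND exception_class = 'org.everit.json.schema.ValidationException'
  AND error_message LIKE '%GPUActive%'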
Now those rejected pings are available for analysis down the line. For the unadulterated backfill logs, see this PR to bigquery-backfill.
Conclusions
No system is perfect, but the processes we have in place allow us to systematically understand the surface area of issues and address failures. Our weekly health-check meeting improves our situational awareness of changes upstream in applications like Firefox, while our backfill logs in bigquery-backfill give us practice in dealing with the complexities of recovering from partial outages. These underlying processes and systems are the same ones that facilitate the broader Glean ecosystem at Mozilla, and they will continue to exist as long as the data flows.