The Glean logo

This Week in Glean: What Flips Your Bit?

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

The idea of “soft-errors”, particularly “single-event upsets” often comes up when we have strange errors in telemetry. Single-event upsets are defined as: “a change of state caused by one single ionizing particle (ions, electrons, photons…) striking a sensitive node in a micro-electronic device, such as in a microprocessor, semiconductor memory, or power transistors. The state change is a result of the free charge created by ionization in or close to an important node of a logic element (e.g. memory “bit”)”. And what exactly causes these single-event upsets? Well, from the same Wikipedia article: “Terrestrial SEU arise due to cosmic particles colliding with atoms in the atmosphere, creating cascades or showers of neutrons and protons, which in turn may interact with electronic circuits”. In other words, energy from space can affect your computer and turn a 1 into a 0 or vice versa.

There are examples in our data collected by Glean from Mozilla projects like Firefox, that appear to be malformed by a single bit from the value we would expect. In almost every case we cannot find any plausible explanation or bug in any of the infrastructure from client to analysis, so we often shrug and say “oh well, it must be cosmic rays”. A totally fantastical explanation for an empirical measurement of some anomaly that we cannot explain.

What if it wasn’t just some fantastical explanation? What if there was some grain of truth in there and somehow we could detect cosmic rays with browser telemetry data? I was personally struck with these questions, recently, as I became aware of a recent bug that was filed that described just these sorts of errors in their data. These errors were showing up as strings with a single character different in the data (well, a single bit actually). At about the same time, I read an article about a geomagnetic storm that hit at the end of March. Something clicked and I started to really wonder if we could possibly have detected a cosmic event through these single-event upsets in our telemetry data.

I did a little research to see if there was any data on the frequency of these events and found a handful of articles (for instance) that kept referring to a study done by IBM in the 1990’s that referenced 1 cosmic ray bit flip per 256MB of memory per month. After a little digging, I was able to come up with two papers by J.F. Ziegler, an IBM researcher. The first paper, from 1979, on “The Effects of Cosmic Rays on Computer Memories”, goes into the mechanisms by which cosmic rays can affect bits in computer memory, and makes some rough estimates on the frequency of such events, as well as the effect of elevation on the frequency. The later article from the 1990’s, “Accelerated Testing For Cosmic Soft-Error Rate”, went more in detail in measuring the soft-error rates of different chips by different manufacturers. While I never found the exact source of the “1 bit-flip per 256MB per month” quote in either of these papers, the figure could possibly be generalized from the soft-error rate data in the papers. So, while I’m not entirely sure that that number for the rate is accurate, it’s probably close enough for us to do some simple calculations.

So, now that I had checked out the facts behind cosmic ray induced errors, it was time to see if there was any evidence of this in our data. First of all, where could I find these errors, and where would I most likely find these sorts of errors? I thought about the types of data that we collect and decided that a numeric field would be nearly impossible to detect a bit-flip within, unless it was a field with a very limited expected range. String fields seemed to be a little easier candidate to search for, since single bit flips tend to make strings a little weird due to a single unexpected character. There are also some good places to go looking for bit flips in our error streams too, such as when a column or table name is affected. Secondly, I had to make a few hand-wavy assumptions in order to crunch some numbers. The main assumption is that every bit in our data has the same chance of being flipped as any other bit in any other memory. The secondary assumption is that the bits are getting flipped at the client side of the connection, and not while on our servers.

We have a lot of users, and the little bit of data we collect from each client really adds up. Let’s convert that error rate to some more compatible units. Using the 1/256MB/month figure from the article, that’s 4096 cosmic soft-errors per terabyte per month. According to my colleague, chutten, we receive about 100 terabytes of data per day, or 2800 TB in a 4 week period. If we multiply that out, it looks like we have the potential to find 11,468,800 bit flips in a given 4 week period of our data. WHAT! That seemed like an awful lot of possibilities, even if I suspect a good portion of them to be undetectable just due to not being an “obvious” bit flip.

Looking at the Bugzilla issue that had originally sparked my interest in this, it contained some evidence of labels embedded in the data being affected by bit-flips. This was pretty easy to spot because we knew what labels we were expecting and the handful of anomalies stood out. Not only that, the effect seemed to be somewhat localized to a geographical area. Maybe this wasn’t such a bad place to try and correlate this information with space-weather forecasts. Back to the internet and I find an interesting space-weather article that seems to line up with the dates from the bug. I finally hit a bit of a wall in this fantastical investigation when I found it difficult to get data on solar radiation by day and geographical location. There is a rather nifty site, which has quite a bit of interesting data on solar radiation, but I was starting to hit the limits of my current knowledge and the limits on time that I had set out for myself to write this blog post.

So, rather reluctantly, I had to set aside any deeper investigations into this for another day. I do leave the search here feeling that not only is it possible that our data contains signals for cosmic activity, but that it is very likely that it could be used to correlate or even measure the impact of cosmic ray induced single-event upsets. I hope that sometime in the future I can come back to this and dig a little deeper. Perhaps someone reading this will also be inspired to poke around at this possibility and would be interested in collaborating on it, and if you are, you can reach me via the Glean Channel on Matrix as @travis. For now, I’ve turned something that seemed like a crazy possibility in my mind into something that seems a lot more likely than I ever expected. Not a bad investigation at all.