Understanding DNT Adoption within Firefox

UPDATED 2011-09-08 11:55am PST: changed the description of how we store and retain IP address to be more accurate

On March 23rd, Mozilla launched its newest and most awesome browser: Firefox 4. Along with a plethora of features, including faster performance, better security and the whole nine yards, Firefox 4 included a cutting-edge privacy feature called Do Not Track (DNT). For the uninitiated, DNT simply tells sites “I don’t want to be tracked” via an HTTP header visible to all advertisers and publishers.
Mozilla’s new Privacy Blog has several posts on the feature, including a new one today releasing a Do Not Track Field Guide for developers. For several weeks now, our numbers have shown just under 5% of Firefox users with DNT turned on.
The Mozilla team is all about experimenting. We love building new technologies that do good and benefit the community as a whole. As Firefox 4 kept breaking records, peaking at 5,500 downloads per minute (source: http://blog.mozilla.org/blog/2011/03/25/the-first-48-hours-of-mozilla-firefox-4/), we felt it was important to understand whether people were enabling DNT. Every Metrics guy lives and dies by the data. In late 2010, the metrics team gave a short talk on how we collect log data (video and slides available). While that project has gone through multiple iterations over time, the basic premise is still the same:

  • Grab logs from multiple data-centers.
  • Split out anonymized and non-anonymized data into two separate files.
  • Store both sets of files in HDFS.
  • Create relevant partitions inside Hive.
  • Query the data.
  • Drool over the stats.

(Non-anonymized data such as IP address has a 6-month retention policy and is deleted on expiration)
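
The split step in the list above can be sketched in Python. This is only an illustration under stated assumptions: the log layout (IP address as the first space-separated field), the function name split_log_lines and the file name are all hypothetical and are not taken from Mozilla's actual pipeline. Both outputs are keyed by source file and row number, so the IP side can be joined back when needed and deleted on its own once the retention period expires.

```python
def split_log_lines(lines, source_name):
    """Split raw log lines into anonymized rows and a separate IP side table.

    Assumes (hypothetically) that the IP address is the first
    space-separated field of every line.
    """
    anonymized = []
    ip_rows = []
    for row_number, line in enumerate(lines):
        ip_address, _, rest = line.partition(" ")
        # Both records share (source_name, row_number) as a join key.
        anonymized.append((source_name, row_number, rest))
        ip_rows.append((source_name, row_number, ip_address))
    return anonymized, ip_rows


# Made-up sample lines using documentation-reserved IP ranges.
raw = [
    "203.0.113.7 GET /update/Firefox/4.0 DNT:1",
    "198.51.100.2 GET /update/Firefox/5.0 DNT:-",
]
anonymized, ip_rows = split_log_lines(raw, "dc1-2011-09-08.log")
print(anonymized[0])
print(ip_rows[0])
```

Deleting ip_rows later leaves the anonymized rows intact, which is the property the retention policy relies on.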

We decided to follow the same approach for calculating DNT stats. Once a day, each Firefox instance pings the AUS servers, and the request includes its DNT status. The DNT part of the ping request looks something like this:
“DNT:-” User has NOT set DNT
“DNT:1” User HAS set DNT and does *not* wish to be tracked.

Armed with these data points, a simple Hive query gives us DNT stats for a given day:

   SELECT ds, dnt_type, count(DISTINCT ip_address)
   FROM web_logs
   WHERE (request_url LIKE '%Firefox/4.0%'
          OR request_url LIKE '%Firefox/5.0%'
          OR request_url LIKE '%Firefox/6.0%')
     AND dnt_type != 'DNT:1, 1'
     AND ds = '$dateTime'
   GROUP BY ds, dnt_type
   ORDER BY ds DESC;

The above script is run nightly, and the result is plotted as a time series in the graph included with this post.
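
For readers without a Hive cluster at hand, the same aggregation can be sketched in plain Python. The row layout (ds, request_url, dnt_type, ip_address), the sample data and the helper names dnt_counts and dnt_share are assumptions made for illustration. Computing the adoption share as "DNT:1" over the sum of "DNT:1" and "DNT:-" is also an assumption about how a percentage could be derived from the counts, not a description of the nightly job.

```python
from collections import defaultdict

# Same version filter as the Hive query's LIKE clauses.
FIREFOX_MARKERS = ("Firefox/4.0", "Firefox/5.0", "Firefox/6.0")


def dnt_counts(rows, day):
    """Count distinct IP addresses per dnt_type for one day."""
    ips_by_type = defaultdict(set)
    for ds, request_url, dnt_type, ip_address in rows:
        if ds != day:
            continue
        if not any(marker in request_url for marker in FIREFOX_MARKERS):
            continue
        if dnt_type == "DNT:1, 1":  # excluded by the query as well
            continue
        ips_by_type[dnt_type].add(ip_address)
    return {dnt_type: len(ips) for dnt_type, ips in ips_by_type.items()}


def dnt_share(counts):
    """Fraction of counted IPs that sent DNT:1 (assumed definition)."""
    on = counts.get("DNT:1", 0)
    off = counts.get("DNT:-", 0)
    total = on + off
    return on / total if total else 0.0


# Made-up rows: (ds, request_url, dnt_type, ip_address).
rows = [
    ("2011-09-07", "/update/Firefox/4.0/", "DNT:1", "203.0.113.7"),
    ("2011-09-07", "/update/Firefox/4.0/", "DNT:1", "203.0.113.7"),
    ("2011-09-07", "/update/Firefox/5.0/", "DNT:-", "198.51.100.2"),
    ("2011-09-07", "/update/Firefox/6.0/", "DNT:-", "192.0.2.44"),
    ("2011-09-07", "/update/Firefox/3.6/", "DNT:-", "192.0.2.45"),
    ("2011-09-07", "/update/Firefox/6.0/", "DNT:1, 1", "192.0.2.46"),
    ("2011-09-06", "/update/Firefox/6.0/", "DNT:1", "192.0.2.47"),
]
counts = dnt_counts(rows, "2011-09-07")
print(counts, dnt_share(counts))
```

The repeated ping from 203.0.113.7 is counted once, mirroring count(DISTINCT ip_address).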

One BIG caveat:

The DNT numbers are being undercounted, primarily because we use the hashed IP address as a proxy for a unique user. This means that when multiple users behind a given NAT have DNT set, the counter is incremented only once. This may account for why our numbers are a bit lower than those being reported by other groups, including the recent study of 100 million Firefox users conducted by Krux Digital.
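
A toy simulation makes the undercount concrete. The "user" field below is hypothetical and exists only inside this example; the real logs carry no per-user identifier, which is exactly why the gap cannot be measured directly.

```python
def count_dnt(pings):
    """Return (actual DNT users, DNT users as counted by distinct IP)."""
    actual_users = {user for ip, user, dnt in pings if dnt == "DNT:1"}
    counted_ips = {ip for ip, user, dnt in pings if dnt == "DNT:1"}
    return len(actual_users), len(counted_ips)


# Made-up pings: (ip_address, user, dnt_type).
pings = [
    ("203.0.113.7", "alice", "DNT:1"),  # three users share one NAT address
    ("203.0.113.7", "bob", "DNT:1"),
    ("203.0.113.7", "carol", "DNT:1"),
    ("198.51.100.2", "dave", "DNT:-"),
]
actual, counted = count_dnt(pings)
print(actual, counted)  # 3 1
```

Three users opted in, but counting by IP address records only one.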

Possible Fix, NOT:

While it is possible to uniquely identify each browser instance, doing so would require that we start tracking users, thereby defeating the very purpose for which DNT was created.

Feel free to leave us a comment or email: (aphadke at_the_rate mozilla dot com – Anurag Phadke) for more information.

13 responses

  1. Jim wrote:

    What’s the point of using hashed IPs to track users? It adds nothing to privacy; with only 2^32 possible inputs, it would be trivial to figure out which IP corresponds to a given hash. Might as well just use IP addresses directly.

  2. deinspanjer wrote:

    The original idea was to hash all the IPs seen on a given day with a strong randomly generated salt and then discard that salt at the end of the day. That way, for any given day we’d be able to look at logs from multiple log sources and say whether we saw records that came from the same IP address multiple times, but we would not be able to go back and retrieve specific records for a given IP address we were interested in.

    While I still think there is some merit to that idea, it turns out that the complexity of making sure that we managed the salts properly and the risk of someone finding a creative way to crack it made us decide it would be easier to skip that approach for now.

    What we do instead is keep the IP address for the log records in a separate file which we can join based on the file name and row number. Then, we simply delete the non-anonymized data from our warehouse according to our data retention policy.

    Here is a public presentation we recently did on that topic:

    http://bit.ly/mozilla-metrics-mango-brownbag

  3. Ferdinand wrote:

    @Jim maybe you should read up on how hashes work. [kjfihsfihsfiuhsf&&%&^%SDFSDFI*8] Now which IP is that?

  4. Mook wrote:

    I find it ironic that you’re tracking users with DNT set 😉 With that flag set, I would expect compliance to mean you do not track _anything_ about me – not even the fact that I made an update ping. Not hashed IP addresses, not non-identifying information, _nothing_. I would be fine with “somebody made an update ping”, of course; however, I am against anything that can distinguish between me making a second request and somebody at the other end of the Earth making an independent request.

    Sad; I had thought Mozilla would be better than this. If even Mozilla is tracking despite the flag, I have no trust in anybody else actually obeying the flag either.

    Yes, actually obeying the spirit of the flag would cripple metrics a lot. I had thought user privacy was one of the core values of Mozilla; I guess not, or at least we have vastly different definitions.

  5. Dis wrote:

    @Ferdinand I think Jim is referring to the fact that with only 2^32 possible inputs (less actually), it’s fairly easy to calculate the hashes for all of them in a few seconds. See: https://en.bitcoin.it/wiki/Mining_hardware_comparison#Single_Card_Setups

  6. Wladimir Palant wrote:

    @Ferdinand: Maybe you should better read what Jim said because he is right. Creating a rainbow table for all possible IP addresses to revert the hashing isn’t unthinkable. Also, if somebody wants to check the activity of a particular IP address then hashing won’t help: hash the IP address, then look up all the log entries for this hash. So while hashing definitely helps it isn’t the ideal solution.

  7. Danny Moules wrote:

    Indeed… if the IP were salted with some secret data that would be more helpful, since you would need to know the salt first. Still, ‘security by obscurity’ and all that.

    Is ‘anonymised’ data also deleted after 6 months?

    Anyway it’s interesting to see the ~linear increase in adoption after the pref was exposed in the menus. I feel like we should do that more often. There’s so many handy features of Fx that are relegated to the ‘about:config’ tab which I’m sure people would use if they knew the features existed.

  8. Jason wrote:

    There’s something ironic about tracking the adoption of Do Not Track but I can’t quite put my finger on it…

  9. deinspanjer wrote:

    The rainbow table is a valid concern in this day of clusters large enough to theoretically pull it off. That said, if the salt is large and randomly generated every day, it would not be a very feasible thing to do.

    There are two possible threats that we were looking to solve and one non-threat:
    1. Data theft — If someone managed a break in and they could download the TBs of data, we want to make it as difficult as is reasonably possible to prevent them from seeing IPs.
    2. Data acquisition — If Mozilla were subpoenaed or otherwise required to hand over any available data about a particular IP address, we would like to have as little capability and liability to do that as is reasonable.
    x. Data misuse — Any reasonably implemented strategy should prevent misuse by Mozilla itself, including the potential for a change of policy that would open up old data to be used in ways that were not originally planned.

    The hashing strategy is likely to have handled all three of these concerns, but as I mentioned above, it turned out to be safer and simpler to avoid using hashing at all.

    @Wladimir — Setting aside rainbow tables, the hashing strategy that was originally described was specifically designed to prevent the ability to look up a particular IP address. If you salt the hash with a good random salt that is generated every day and is *not* kept around after that day is passed, then you would not be able to do a lookup the way you describe.

  10. deinspanjer wrote:

    Oh, I see that my original comment was stuck in the moderation queue. That sucks. :/

  11. Ferdinand wrote:

    Glad to be corrected and I hope my stupid comment prevents other people from looking stupid 😉

  12. deinspanjer wrote:

    We are *not* tracking anything about the users themselves. We are only monitoring how many requests are coming in to our site with the DNT feature enabled.
    DNT is a complex thing to define, and it is still a source of considerable discussion. I’d suggest downloading the DNT field guide and reading through it. The background and definitions are very useful.

    As far as the data in the logs that were used to generate this analysis, it is important to consider that, other than the IP address, there is no potentially identifying information (PII), and we are not building a user profile or compiling usage on a user basis to track the activity or choices of a user over time. I believe that, since we are doing none of those user tracking things, the analysis that we are doing with this data falls well outside what most people expect to be protected from when they turn DNT on.

  13. Mook wrote:

    I’m sorry, I didn’t realize the header was missing half of the phrase “do not track me in a personally identifiable way”, instead of what it says on the tin, “do not track”.

    I don’t care if that information is not potentially identifying information; all I care about is if I’m being tracked in any way.

    Again, my feelings on this are particular to Mozilla – with Google or Omniture, I can expect to be tracked and respond accordingly (by blocking access or other local means). My problem was merely that Mozilla is doing things that are against its expressed value system.