- Grab logs from multiple data centers.
- Split out anonymized and non-anonymized data into two separate files
- Store both sets of files in HDFS
- Create relevant partitions inside HIVE
- Query the data.
- Drool over the stats.
(Non-anonymized data, such as IP addresses, has a 6-month retention policy and is deleted on expiration.)
“DNT:-” The user has NOT set DNT.
“DNT:1” The user HAS set DNT and does *not* wish to be tracked.
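As a rough illustration, the two logged values above could be mapped to a simple flag as follows. The function name `dnt_is_set` is hypothetical, and leaving any other value unclassified is an assumption rather than a description of what our pipeline does.

```python
from typing import Optional


def dnt_is_set(dnt_type: str) -> Optional[bool]:
    """Map a logged DNT field to True/False, or None if it is not recognized."""
    value = dnt_type.strip()
    if value == "DNT:1":
        return True   # user has set DNT and does not wish to be tracked
    if value == "DNT:-":
        return False  # user has not set DNT
    return None       # assumption: anything else is left unclassified
```

For example, `dnt_is_set("DNT:1")` returns `True`, while a malformed value such as `"DNT:1, 1"` returns `None`.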
Armed with these data points, we can get DNT stats for a given day with a simple HIVE query:
SELECT ds, dnt_type, count(distinct ip_address)
FROM web_logs
WHERE (request_url LIKE '%Firefox/4.0%'
       OR request_url LIKE '%Firefox/5.0%'
       OR request_url LIKE '%Firefox/6.0%')
  AND dnt_type != 'DNT:1, 1'
  AND ds = '$dateTime'
GROUP BY ds, dnt_type
ORDER BY ds desc;
The above script runs nightly, and the results are plotted over time in the graph included with this post.
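For readers who do not use Hive, the query's filtering and counting logic can be sketched in plain Python. This is only an approximation over in-memory rows: the field names are taken from the query, while the `dnt_stats` function, the sample rows, and the date are made up for illustration.

```python
from collections import defaultdict

FIREFOX_VERSIONS = ("Firefox/4.0", "Firefox/5.0", "Firefox/6.0")


def dnt_stats(rows, date):
    """Count distinct hashed IPs per dnt_type for one day, mirroring the query's filters."""
    unique_ips = defaultdict(set)
    for row in rows:
        if row["ds"] != date:
            continue  # ds = '$dateTime'
        if row["dnt_type"] == "DNT:1, 1":
            continue  # dnt_type != 'DNT:1, 1'
        if not any(v in row["request_url"] for v in FIREFOX_VERSIONS):
            continue  # the three LIKE '%Firefox/N.0%' conditions
        unique_ips[row["dnt_type"]].add(row["ip_address"])
    # count(distinct ip_address) ... GROUP BY dnt_type
    return {dnt_type: len(ips) for dnt_type, ips in unique_ips.items()}


# Hypothetical sample rows; ip_address stands in for an already hashed value.
rows = [
    {"ds": "2011-09-01", "dnt_type": "DNT:1", "ip_address": "a1", "request_url": "Mozilla/5.0 Gecko Firefox/5.0"},
    {"ds": "2011-09-01", "dnt_type": "DNT:1", "ip_address": "a1", "request_url": "Mozilla/5.0 Gecko Firefox/5.0"},
    {"ds": "2011-09-01", "dnt_type": "DNT:-", "ip_address": "b2", "request_url": "Mozilla/5.0 Gecko Firefox/4.0"},
    {"ds": "2011-09-01", "dnt_type": "DNT:1", "ip_address": "c3", "request_url": "Mozilla/5.0 Gecko Firefox/3.6"},
    {"ds": "2011-09-01", "dnt_type": "DNT:1, 1", "ip_address": "d4", "request_url": "Mozilla/5.0 Gecko Firefox/6.0"},
    {"ds": "2011-08-31", "dnt_type": "DNT:1", "ip_address": "e5", "request_url": "Mozilla/5.0 Gecko Firefox/6.0"},
]

stats = dnt_stats(rows, "2011-09-01")
```

Here the repeated address `a1` is counted once, and the Firefox 3.6 row, the `DNT:1, 1` row, and the row from another day are all dropped.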
One BIG caveat:
The DNT numbers are undercounted, primarily because we use the hashed IP address as a proxy for a unique user. This means that when multiple users behind a given NAT have DNT set, the counter is incremented only once. This may explain why our numbers are a bit lower than those reported by other groups, including the recent study of 100 million Firefox users conducted by Krux Digital.
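A toy example makes the undercount concrete. Everything here is invented for illustration: the users, the IP addresses (taken from documentation ranges), and the choice of SHA-256, since this post does not say which hash we use.

```python
import hashlib


def hashed_ip(ip: str) -> str:
    # Assumption: SHA-256 stands in for whatever hash the pipeline really uses.
    return hashlib.sha256(ip.encode("utf-8")).hexdigest()


# Hypothetical requests: (user, public IP, DNT set?)
requests = [
    ("alice", "203.0.113.7", True),
    ("bob", "203.0.113.7", True),    # behind the same NAT as alice
    ("carol", "198.51.100.4", True),
    ("dave", "192.0.2.9", False),
]

# Ground truth, which the logs cannot see: distinct users with DNT set.
actual_dnt_users = len({user for user, _, dnt in requests if dnt})

# What the nightly count sees: distinct hashed IPs with DNT set.
counted_dnt_users = len({hashed_ip(ip) for _, ip, dnt in requests if dnt})
```

Three users have DNT set, but because alice and bob share one public address, only two distinct hashed IPs are counted.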
Possible Fix, NOT:
While it is possible to uniquely identify each browser instance, doing so would require that we start tracking users, thereby defeating the very purpose for which DNT was created.