Tracking down the number of Firefox Addon users with hadoop

I was presented with the challenge of answering the question: how many Firefox users have at least one add-on installed? Currently, addons.mozilla.org (AMO) has statistics on add-on download counts, but the question of actual add-on usage has gone unanswered.

The Add-ons manager inside of Firefox checks each add-on for updates at AMO. This happens once every 24 hour period when Firefox is run by the user. Updates are handled over HTTP at either addons.mozilla.org for Firefox 1.0/1.5/2.0 or versioncheck.addons.mozilla.org for Firefox 3.0/3.5. The add-ons manager pings the servers with information about each add-on, and if an update exists the server responds with one. Since the update ping is handled over HTTP, it is recorded in a log file. If you have never seen a web server's HTTP log file, they are simply flat text files where each line contains information about one request made to the server. Below is an example line of the AMO log file and an explanation of the fields.

 
IP                  HOSTNAME                             TIMESTAMP          REQUEST                  
 255.255.255.255 versioncheck.addons.mozilla.org - [22/Jun/2009:02:00:00 -0700] "GET 
/update/VersionCheck.php?reqVersion=1&id={B13721C7-F507-4982-B2E5-502A71474FED}&
version=2.2.0.102&maxAppVersion=3.*&status=userEnabled&
appID={ec8030f7-c20a-464f-9b0e-13a3a9e97384}&appVersion=3.0.11&appOS=WINNT&
appABI=x86-msvc&locale=en-US HTTP/1.1" 200 520 "-" 
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.11) 
Gecko/2009060215 Firefox/3.0.11(.NET CLR 3.5.30729)"
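
The exact regexp used in the job is not shown in this post, so the following is an illustrative sketch only (the class, pattern, and field names are mine): pulling the interesting fields out of one such log line could look roughly like this.

 import java.text.SimpleDateFormat;
 import java.util.Locale;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 
 public class LogLineSketch {
     // IP, hostname, ident, [timestamp], "GET /update/VersionCheck.php?query HTTP/x.y" ...
     private static final Pattern LOG_PATTERN = Pattern.compile(
         "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"GET /update/VersionCheck\\.php\\?(\\S+) HTTP/[0-9.]+\".*");
     private static final Pattern ID_PARAM = Pattern.compile("(?:^|&)id=([^&]+)");
     private static final Pattern APPID_PARAM = Pattern.compile("(?:^|&)appID=([^&]+)");
     // Firefox's application id, as seen in the example line above
     private static final String FIREFOX_APPID = "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}";
 
     public static void parse(String line) throws Exception {
         Matcher m = LOG_PATTERN.matcher(line);
         if (!m.matches()) return;
         Matcher appId = APPID_PARAM.matcher(m.group(3));
         if (!appId.find() || !FIREFOX_APPID.equals(appId.group(1))) return; // not a Firefox ping
         Matcher guid = ID_PARAM.matcher(m.group(3));
         String addonGuid = guid.find() ? guid.group(1) : null; // the add-on's guid
         // convert "22/Jun/2009:02:00:00 -0700" into epoch millis for easy comparison
         SimpleDateFormat fmt = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);
         long epoch = fmt.parse(m.group(2)).getTime();
         System.out.println(m.group(1) + " " + epoch + " " + addonGuid);
     }
 }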

We chose the log files from 2009/06/22 because Firefox will ping AMO multiple times after a Firefox version update, and this date was 11 days after the Firefox 3.0.11 release and 3 days after the Firefox 3.5 RC2 release, so most users should already have been up-to-date by this time. The whole day's worth of log data for both hostnames totaled around 28GB compressed. The log files were large because they also contained requests for AMO's website.

There is no unique identifier to determine which update pings came from which user, so we had to rely on identifying pings from a single user by the IP address and timestamp in the update ping. The IP address is the strongest identifier available, but because of routers (NAT) and proxies, many computers can sit behind one IP address. To add another degree of separation we decided to group the update pings by the timestamp of the ping. Update pings happen within a few seconds of each other, so pings within a certain time window are considered one user, and other pings from the same IP address outside of this time window are considered a different user. For example, say that there are two Firefox users behind a router. User 1 might open his browser in the morning at 10AM and ping AMO, and User 2 might open his browser in the afternoon and ping AMO at 2PM. Even though the pings are from the same IP address, the pings at 10AM would all be grouped together and counted separately from the pings that happened at 2PM, which would also be grouped together.
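
As a minimal, self-contained illustration of this grouping rule (separate from the MapReduce code itself; the names here are my own), sorted ping timestamps for one IP start a new "user" whenever the gap to the previous ping exceeds the window:

 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.List;
 
 public class SessionWindows {
     // Splits one IP's ping timestamps (epoch millis) into per-user groups.
     static List<List<Long>> group(List<Long> epochs, long windowMillis) {
         Collections.sort(epochs);
         List<List<Long>> users = new ArrayList<List<Long>>();
         List<Long> current = new ArrayList<Long>();
         long last = Long.MIN_VALUE;
         for (long ts : epochs) {
             if (!current.isEmpty() && ts - last > windowMillis) {
                 users.add(current);              // close out the previous user
                 current = new ArrayList<Long>(); // start a new one
             }
             current.add(ts);
             last = ts;
         }
         if (!current.isEmpty()) {
             users.add(current);
         }
         return users;
     }
 }

For the two users behind a router above, the 10AM pings and the 2PM pings come back as two separate groups, i.e. two users.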

Setup/Config

Upon hearing the description of the problem I thought that this would be an ideal candidate for a MapReduce job: I would be able to map the IP address as the key and all the other data in the log file entry as a HashMap for the value. I talked with my manager, found out that this was a technology the metrics team was interested in exploring, and was given 4 mac minis to test my implementation on. I quickly began setting up my Hadoop cluster running on ubuntu 9.04 desktop, which soon became ubuntu 9.04 server to conserve memory and drop the unnecessary gui (Ubuntu 9.04 Desktop takes up about 256MB of RAM on a clean install while Server takes up about 90MB). I have been a user of hadoop in the past but I had never set up my own hadoop cluster before, so I turned to the hadoop website and this blog post to help me out. These tutorials at the time did not exactly cover the hadoop version I was using, 0.20 (I believe they are up-to-date now), but I was able to accommodate for this. The two main differences between hadoop 0.19 and 0.20 are the configuration files and parts of the hadoop java API. In version 0.20, /conf/hadoop-site.xml was split into three files, /conf/core-site.xml, /conf/hdfs-site.xml, and /conf/mapred-site.xml, and the API was re-factored slightly with some classes deprecated.
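
For illustration, a minimal split configuration for a cluster like this one might look like the following; the property names are the standard 0.20 ones, but the values here are examples rather than the exact settings used.

 # /conf/core-site.xml
 <configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://hadoop-node1.mv.mozilla.com:9000</value>
   </property>
 </configuration>
 
 # /conf/hdfs-site.xml
 <configuration>
   <property>
     <name>dfs.replication</name>
     <value>3</value>
   </property>
 </configuration>
 
 # /conf/mapred-site.xml
 <configuration>
   <property>
     <name>mapred.job.tracker</name>
     <value>hadoop-node1.mv.mozilla.com:9001</value>
   </property>
 </configuration>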

My Mac minis were given hostnames of hadoop-node1-4. Hadoop-node1 was my master node, running the NameNode, SecondaryNameNode, and JobTracker, while hadoop-node2-4 were my slaves, each running a TaskTracker and a DataNode.

[photo: the mac mini hadoop cluster]
 
 # /conf/masters
 hadoop-node1.mv.mozilla.com
 
 # /conf/slaves
 hadoop-node2.mv.mozilla.com
 hadoop-node3.mv.mozilla.com
 hadoop-node4.mv.mozilla.com
 
 # The java processes
 hadoop@hadoop-node1:/usr/local/hadoop/conf$ jps
 15778 Jps
 30059 NameNode
 30187 SecondaryNameNode
 30291 JobTracker
 
 hadoop@hadoop-node2:/usr/local/hadoop$ jps
 16950 TaskTracker
 16838 DataNode
 20186 Jps

After getting the cluster set up I tested it out with the hadoop wordcount example and validated the results.
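
For reference, that smoke test amounts to roughly the following commands (the local input path and the jar's version number are illustrative):

 # load some text into HDFS, run the bundled wordcount example, peek at the output
 bin/hadoop fs -put /tmp/sample-text input
 bin/hadoop jar hadoop-0.20.0-examples.jar wordcount input output
 bin/hadoop fs -cat output/part-r-00000 | head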

MapReduce

I then began writing a MapReduce job with the Hadoop Java API. My first thought was to write my own RecordReader, which is responsible for reading from an input split and handing key, value pairs to the mapper. I decided instead to go with the default LineRecordReader, which uses the file offset as the key and the line as the value, because it seemed easier and more natural to dissect the log file line inside of the Mapper's map function.
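
For completeness, a minimal 0.20-style driver to wire such a job together could look like this; the class name and paths are illustrative, while AddonsWritable and the mapper/reducer classes are the ones described below.

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 
 public class AddonPingJob {
     public static void main(String[] args) throws Exception {
         Job job = new Job(new Configuration(), "amo addon ping count");
         job.setJarByClass(AddonPingJob.class);
         job.setMapperClass(IPAddressMapper.class);
         job.setReducerClass(IPAddressEpochTimeReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(AddonsWritable.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));   // the day's log files in HDFS
         FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }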

Map

In the map function each line went through a regexp that broke the log file line into its pieces. If the line contained Firefox's appid and VersionCheck.php, I would map the IP as the key and construct an AddonsWritable (a child of MapWritable with an overridden toString() for output purposes) that contained the epoch time (converted from the date timestamp because it is much easier to compare), a MapWritable of add-on guids, and a count of the number of add-ons.

 
public static class IPAddressMapper extends Mapper<LongWritable, Text, Text, AddonsWritable> {
     /* member vars for mapper which include vars for the regexp and for storing data */
     private AddonsWritable logInfo = new AddonsWritable();
 
     public void map(LongWritable key, Text logLine, Context context) throws IOException, InterruptedException {
         // matchesRegexp()/isFirefox()/hasVersionCheckphp() stand in for the real regexp checks
         if(logLine.matchesRegexp() && isFirefox() && hasVersionCheckphp()) {
             logInfo.put(EPOCH, epoch); // store the epoch_ts
             logInfo.put(GUID, guid); // store the add-on guid
             logInfo.put(TOTAL, ONE); // store the count
             context.write(ipAddress, logInfo); // emit the ipAddress as key and logInfo as value
         }
     }
 }

Reduce

After the map phase the Hadoop framework hash-partitions the keys and hands them to the reduce function. Inside the Reducer's reduce function you are given the key, which is the IP address, and an Iterable of the AddonsWritables that came from that same key/IP. I needed to group together the values whose update pings fell within a certain time window, and unfortunately the Iterable does not guarantee order. So I put the MapWritables in a PriorityQueue with a custom comparator that ordered values by the timestamp field in the AddonsWritable. Some IPs had thousands of pings, so if I counted an IP with more than 2,000 pings I threw it out. Once all the values were placed in the PriorityQueue, I iterated over it, popping off each value and comparing it to the previously seen timestamp. If abs(current timestamp - prev timestamp) <= 10 secs I considered them to be from the same user, and added them to a MapWritable of guids inside of the MapWritable that contained all the other information. Once I saw a current timestamp where abs(current timestamp - prev timestamp) > 10 secs, I wrote out the previous values and started a new MapWritable for the next window, continuing until there were no more values in the PriorityQueue, at which point I wrote out the final values.

public static class IPAddressEpochTimeReducer extends Reducer<Text, AddonsWritable, Text, AddonsWritable> {
     // orders pings by their EPOCH timestamp (comparator sketched below)
     private PriorityQueue<AddonsWritable> pq = new PriorityQueue<AddonsWritable>(64, EPOCH_COMPARATOR);
 
     public void reduce(Text key, Iterable<AddonsWritable> values, Context context) throws IOException, InterruptedException {
         pq.clear(); // the reducer object is reused across keys
         for(AddonsWritable val: values) {
             pq.add(new AddonsWritable(val)); // copy -- hadoop reuses the value object
             if(pq.size() > 2000) {
                 pq.clear(); // too many pings for one IP; throw the whole IP out
                 return;
             }
         }
 
         while(!pq.isEmpty()) {
             AddonsWritable val = pq.remove();
             long currentEpoch = ((LongWritable) val.get(EPOCH)).get();
             // a gap larger than the window (10 secs in the first run, 60 in the second) starts a new user
             if(lastEpoch != -1 && Math.abs(currentEpoch - lastEpoch) > SIXTY_SECONDS) {
                 writeOut();  // write out all the information for the current collection of versioncheck pings
                 resetVars(); // reset all the currently used vars for the next collection of versioncheck pings
             }
             addGuid(output, val.get(GUID)); // merge this ping's add-on guid
             sum += val.get(TOTAL);          // running add-on count (pseudocode; TOTAL is a counter entry)
             lastEpoch = currentEpoch;
         }
         /* There is one more window remaining. Write it out */
         writeOut();
     }
 }
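
The custom comparator itself is not shown in this post; assuming the EPOCH entry is stored as a LongWritable (my assumption), a sketch of it could look like:

 // sketch: orders AddonsWritables by their EPOCH entry (assumed LongWritable)
 private static final Comparator<AddonsWritable> EPOCH_COMPARATOR =
     new Comparator<AddonsWritable>() {
         public int compare(AddonsWritable a, AddonsWritable b) {
             long ea = ((LongWritable) a.get(EPOCH)).get();
             long eb = ((LongWritable) b.get(EPOCH)).get();
             return ea < eb ? -1 : (ea == eb ? 0 : 1);
         }
     };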

Runtime Stats

Hadoop provides a web interface that outputs runtime statistics for a job; below are the stats for the job described above.

  • Submitted At: 17-Jul-2009 09:55:12
  • Launched At: 17-Jul-2009 09:55:12 (0sec)
  • Finished At: 17-Jul-2009 14:05:52 (4hrs, 10mins, 39sec)
  • Average time taken by Map tasks: 2mins, 45sec
  • Average time taken by Shuffle: 2hrs, 33mins, 9sec
  • Average time taken by Reduce tasks: 1hrs, 4mins, 10sec
 Kind     Total  Successful  Failed  Killed  Start Time            Finish Time
 Setup        1           1       0       0  17-Jul-2009 09:55:22  17-Jul-2009 09:55:24 (1sec)
 Map        364         362       0       2  17-Jul-2009 09:55:25  17-Jul-2009 12:45:56 (2hrs, 50mins, 31sec)
 Reduce       5           5       0       0  17-Jul-2009 10:13:40  17-Jul-2009 14:05:55 (3hrs, 52mins, 15sec)
 Cleanup      1           1       0       0  17-Jul-2009 14:05:57  17-Jul-2009 14:06:03 (5sec)
 
 (Total tasks = successful + failed + killed)

Output/Results

The output files ended up with lines like the one below.

 IP              EPOCH_TS   ADDON_COUNT                           LIST_OF_GUIDS
 255.255.255.255 1245665519000	2 {CAFEEFAC-0016-0000-0013-ABCDEFFEDCBA} {CAFEEFAC-0016-0000-0000-ABCDEFFEDCBA}

I created a python script to gather statistics from the output (the core of the aggregation is sketched after the results below). With the 10 second window I found that there were a total of

  • 244,727,644 add-on update pings
  • 117,557,228 users
  • average of 2.14 add-ons per user
  • variance of 5.68
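
The python script itself is not included in the post; the aggregation is essentially a streaming count/mean/variance over the ADDON_COUNT column, sketched here in Java to match the rest of the code (the class name and field position are my assumptions based on the output format above):

 import java.io.BufferedReader;
 import java.io.FileReader;
 
 public class OutputStats {
     public static void main(String[] args) throws Exception {
         long users = 0;
         double sum = 0.0, sumSq = 0.0;
         BufferedReader in = new BufferedReader(new FileReader(args[0]));
         String line;
         while ((line = in.readLine()) != null) {
             String[] fields = line.split("\\s+");
             int addons = Integer.parseInt(fields[2]); // the ADDON_COUNT column
             users++;
             sum += addons;
             sumSq += (double) addons * addons;
         }
         in.close();
         double mean = sum / users;
         double variance = sumSq / users - mean * mean; // population variance
         System.out.printf("users=%d mean=%.2f variance=%.2f%n", users, mean, variance);
     }
 }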

Since our Active Daily User (ADU) count for that day was 98,000,000 users, this data didn't make much sense, so we decided to repeat the process with a 60 second window instead of the previous 10 second window.

With the 60 second window I found that there were

  • 94,656,833 users
  • average of 2.63 add-ons per user
  • variance of 12.34

These numbers still seemed fairly large, so I decided to reduce on IP address alone, which would give us a baseline where there is at least 1 user behind each IP. To reduce by IP I changed the reducer to not account for the timestamp and simply reduce all values sharing the same IP, as in the sketch below.
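
In the same pseudocode style as the reducer above (reusing the post's helper names, so this is a sketch rather than runnable code), the timestamp-free version collapses to:

 public static class IPAddressOnlyReducer extends Reducer<Text, AddonsWritable, Text, AddonsWritable> {
     public void reduce(Text key, Iterable<AddonsWritable> values, Context context) throws IOException, InterruptedException {
         for(AddonsWritable val: values) {
             addGuid(output, val.get(GUID)); // merge every guid seen for this IP
             sum += val.get(TOTAL);          // running add-on count (pseudocode)
         }
         writeOut();  // exactly one record per IP for the whole day
         resetVars();
     }
 }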

The output of this IP-only job was

  • 32,848,771 IP/USERS
  • average of 5.04 add-ons per user
  • variance of 779.91

To compare this data with Firefox's ADU, I ran a similar mapreduce job on our Firefox ADU data, which counted

  • 61,460,501 IPs
  • average of 1.60 Users per IP
  • variance of 86.57

8 Responses to “Tracking down the number of Firefox Addon users with hadoop”

  1. on 10 Aug 2009 at 4:02 pm Wladimir Palant

    Interesting results. This seems to suggest that roughly 50% of all Firefox users have add-ons installed. If we then take 36 million users pinging AMO on a day (at least that’s how I interpret your numbers) and 8730182 pings for Adblock Plus on that day – we get that 12% of all Firefox users are using Adblock Plus. Finally a way to put ADU numbers in relation, the result is significantly higher than my previous estimates however. One possible explanation is that users who have add-ons installed typically use their browser more frequently and send out more pings than those who don’t. This would make it difficult to compare ADU numbers between Firefox and add-ons, add-on users are probably significantly fewer than 50%.

  2. on 12 Aug 2009 at 2:57 am Ingo Lütkebohle

    Thanks for writing up your modeling approach, that is very good example material for using hadoop!

    I have a question about your ip-to-user estimate: At least in Germany, many users do not keep an IP address for more than 24 hours. This is because almost all consumer broadband/DSL providers disconnect them about once a day. Routers immediately reconnect, but get assigned a different IP then. Do you think this is of relevance for the number of users per IP and if yes, have you accounted for it?

  3. on 12 Aug 2009 at 6:43 am deinspanjer

    It is important to note that our analysis was not focused on tracking users over time through their IP address. It was only focused on bucketing the multiple requests that occurred in one 24 hour period into the number of requests per “user session”. Since the user session takes a short period of time to complete, there is little chance that the user’s IP address could change in the middle of that transaction.

  4. on 12 Aug 2009 at 11:21 am skrueger

    Ingo: I see how frequently changing IPs could be an issue, and I did not think of this problem while developing. But I don't think this is a large problem, because I think this could only make our distinct IP count smaller than the actual number of users, since the Firefox Add-on Manager will ping once in a 24 hour time window. The problem I am thinking of could happen when user A from IP x.x.x.x pings for updates and then loses/changes his IP. Then user B picks up IP x.x.x.x and pings for updates. In this case 2 distinct users (not behind the same router) pinged for updates but only one IP was counted. However, I am not certain how often this case happens.

  5. on 16 Aug 2009 at 10:09 pm city_zen

    Thanks for publishing this.
    I’m surprised that the number of users with add-ons installed is not HIGHER. I mean, about 50% of Firefox users don’t have EVEN ONE add-on installed? Hard to believe. I think I would use another browser if Firefox didn’t have add-ons to customize it. Don’t take me wrong, Firefox is a great browser, but for me it’s the add-ons that push it over any other browser out there.
    Would it be possible to publish some more information about the usage patterns that you found? Starting with 28 GB of data, there must be more than 3 or 4 bullet points worth of output info, right? ;)
    A few examples that maybe are available to you after all that processing (great job, btw):
    – Top ten most installed add-ons
    – Percentage of users with more than 10 add-ons installed
    – Maximum number of add-ons installed (don’t look at me, I ONLY have a few dozens installed :D )
    – Distribution by country
    – OS used
    etc.

    Thank you

  6. on 19 Aug 2009 at 6:37 am SteveL

    We in the Hadoop project are always pleased to see other OSS projects doing interesting stuff with our code, thank you for the writeup. And no, that hadoop on ubuntu wiki page is still lagging -we should really do our own .deb files and so have much simpler setup instructions: “select Apache Hadoop on Synaptic”.

    A big issue for the stats is “how often are updates checked for”; without data on that, you can’t be sure what the numbers really mean -interesting as they are.

  7. on 19 Aug 2009 at 9:03 am deinspanjer

    These “versioncheck” pings normally happen on a 24 hour timer. There is a slight complication though in that when a user is updating Firefox itself, there are a couple of additional add-on versioncheck pings to determine compatibility with the Firefox version being installed, and if add-ons are updated, then the new versions of them ping again.

  8. on 28 Aug 2009 at 11:54 am peter

    A prime example for the usage of hadoop. Great!
    But I don't think that the conclusions you gained are reliable. Maybe FF should introduce an anonymous ID for each program, so the issues with multiple users behind an IP, or changing ones, are solved.
    But even then, we can't be sure about the spread of particular addons, since update pings may be skipped for a day (people should take a walk on weekends and not browse the internet… ;o) So there wouldn't be an update check, right?)
