Feed on

I recently picked up a copy of Tableau Desktop Professional.  The primary reason for wanting it was to be able to do some advanced visualizations like the ones described here.  Mozilla has a few different datacenters in different parts of the world, and we just brought a new one online in Phoenix.  We wanted to take a look at the geographic traffic breakdown between the Phoenix and the San Jose datacenters.

The visualization on the right was not painful to create.  It took *way* too long to render, but I don’t think Tableau developers ever really optimized for having over a million data points on a map.  It isn’t really that appealing though.  The visualization that I really wanted was an animation of each hour.  You can do this inside Tableau by putting the timestamp on the page shelf and then hitting the play button.  With a million data points, it doesn’t render very quickly though.  I ended up going page by page and exporting an image for each one. I then used a Firefox extension, APNG Editor to splice all the images together into a PNG animation.  It weighs in at 5 MB, but it looks really nice.  Maybe if it impresses some Tableau developers, they’ll consider making it possible to export this type of animation directly from Tableau.  I would sure like saving about three hundred mouse clicks next time I want to do this. :)

Here is the animation version: phx_to_sjc_map_2010-03-03
Here is the dataset: phx_to_sjc_2010-03-03,04

Update: Ken Kovash studied the Firefox 3.6 downloads and found a wonderful reason for them! See here for more details.

I was asked this evening if the nightly report was correct in showing that we had a 128% increase in Firefox downloads today.  The answer is a resounding yes that figure is correct, but I figured it wouldn’t hurt to put a bit more detail on it.

Histograms demonstrating the sharp flood of upgrade patch downloads on release day

Histograms demonstrating the sharp flood of upgrade patch downloads on release day

Wednesday, February 17th, 2010 was a release day for both Firefox 3.0.18 and 3.5.8.  Starting around noon Pacific time yesterday, hundreds of millions of Firefox users would eventually see a prompt notifying them that there was an upgrade available to install.

If the user was running the latest security release for their version of Firefox, the upgrade would consist of a small patch file that would quickly bring them up to the new security release.

If the user was not on the latest security release, they would instead download a special file that was the full size of the Firefox application.  The upgrade process would then automatically install the upgrade.

The question at hand is, “why do we see an increase in people visiting the Mozilla website and downloading a new installer?”  In the past, we’ve had concerns about whether people were having trouble with the automatic update system and were being forced to download the application and install it manually.  From the data I reviewed today, I think it is safe to say that is not a common scenario.  Instead, I’m happy to report that what I see is a lot of people who get a reminder that they should upgrade Firefox and they decide that it is about time for them to go ahead and download the latest and greatest Firefox 3.6 instead.

The chart above (generated courtesy of Tableau Software) shows that manual downloads of the 3.0 and 3.5 versions remained relatively flat, but manual downloads of Firefox 3.6 climbed by almost 3x over the previous day’s peak traffic time.
I would also like to point out that an infrastructure that can handle in increase of over 5 million requests per hour in a three hour window isn’t too shabby.

While my Pentaho Data Integration ETL processes do slow down considerably when processing this huge influx of data, they keep up, managing to process these 10 or so million requests per hour in 30 to 40 minutes.  It could actually be much quicker if I moved this processing to a separate server, but the primary reason it is slow is because once it falls outside its normal 5 minute run time, it has to compete with other ETL processes that are scheduled to run later in the hour.

I received a notification from the Linux Counter project that it was time to review my membership. The project is a very simple little thing. You register the fact that you are a Linux user and you register simple information regarding any number of Linux machines that you use or control. You can also run an update script on those machines that pings the project with kernel version and uptime. The project then compiles some simple statistics about Linux usage worldwide.

After updating my profile, I poked around on the site for a few minutes and discovered this page listing the nine countries in the world for which they have no members or machines: Countries that do not use Linux.

I figured there might be something I could do about that. I dug through our data warehouse and pulled out the table below showing that we have seen requests from five of these countries on a regular basis this year:

Average requests from Firefox on Linux in countries not represented in the Linux Counter project

Average requests from Firefox on Linux in countries not represented in the Linux Counter project

I think it would be fun if some of these Firefox on Linux users saw this post and registered with the Linux Counter to get their non-member country list down to four (or even zero)!

So, if you or someone you know uses Linux in one of the following countries and are proud of it, consider visiting the Linux Counter project and letting them know.

French Southern Territories, Solomon Islands, Pitcairn Islands, Norfolk Island, Saint Kitts and Nevis, Equatorial Guinea, Gambia, Christmas Island, Central African Republic

Update: Bugzilla SQR

I have had the chance to improve the bugzilla SQR in many ways. I have improved the overall run time inside of the ETL (both in kettle and in a python script), fixed a few bugs (A major one that was causing problem with the Open Bug Count), added new dimensions, and constructed a few dashboards. All my changes will be able to be found at sourceforge.

I added a bug severity dimension, added a component level onto the product dimension, added a team level onto the person dimension, and a days dimension to track bugs over a distribution.

I have made some mock up dashboards here and posted bellow are a few snap shots of the charts in them. Tell me what you think and what you would find useful!

hadoop logofirefox logo

I was presented with the challenge of answering the question – how many Firefox users have one add-on or more installed on their Firefox. Currently, addons.mozilla.org (AMO) has statistics on the download counts of add-ons but the actual usage of add-ons has been unanswered.

The Add-ons manager inside of Firefox will check each add-on for update at AMO. This happens once every 24 hour period when Firefox is ran by the user. Updates are handled over HTTP at either addons.mozilla.org for Firefox1.0/1.5/2.0 and versioncheck.addons.mozilla.org for Firefox3.0/3.5. The add-ons manager will ping the servers with information about each add-on and if an update exists the server will respond with one. Since the update ping is handled in HTTP the update ping is recorded in a log file. If you have never seen a web server’s HTTP log file they simply are flat text files where each line contains information about the requests made to the server. Below is example line of the AMO log file and an explanation of the fields.

IP                  HOSTNAME                             TIMESTAMP          REQUEST           versioncheck.addons.mozilla.org - [22/Jun/2009:02:00:00 -0700] "GET 
appABI=x86-msvc&locale=en-US HTTP/1.1" 200 520 "-" 
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: 
Gecko/2009060215 Firefox/3.0.11(.NET CLR 3.5.30729)"

We choose the log files on 2009/06/22 because Firefox will ping AMO multiple times after a Firefox version update and this date was 11 days after the Firefox 3.0.11 release and 3 days after Firefox 3.5 RC2 release so most users should have already of been up-to-date by this time. The whole day’s worth of log data for both hostnames total out to be around 28GB compressed. The log files were large because they also contained requests for AMO’s website.

There is no unique identifier to determine which update pings came from which user and we had to rely on identifying pings from a single user by the IP address, and Timestamp in the update ping. The IP address is the single most unique identifier due to the nature of IP addresses but because of routers (NAT) and proxies many computers can sit behind one IP address. To add another degree of separation we decided to group the update pings by the timestamp of the ping. Update pings will happen within a few seconds from each other so pings in a certain time window would be considered as one user, and other pings from an IP address out side of this time window would be considered as a different user. For example, say that there are two Firefox users behind a router. User 1 might open up his browser in the morning at 10AM and ping AMO and User 2 might open up his browser in the afternoon and ping AMO at 2PM. Even though the pings are from the same IP address the pings at 10AM would all be grouped together and counted separately from the pings that happened at 2PM which would also be grouped together.


Upon hearing about the description of the problem I thought that this would be an ideal candidate for a MapReduce job. I would be able to Map the IP address as the Key and all the other data in the log file entry as a HashMap for the value. I talked with my manager and found out that this was a technology that the metrics team was interested in exploring and was given 4 mac minis to test my implementation out on. I quickly began setting up my Hadoop Cluster running on ubuntu 9.04 desktop which soon became ubuntu 9.04 server for memory conservation and unnecessary gui (Ubuntu 9.04 Desktop takes up about 256MB of RAM on a clean install while Server takes up about 90MB of RAM on a clean install). I have been a user of hadoop in the past but I had never setup my own hadoop cluster before so I turned to the hadoop website and this blog post to help me out. These tutorials at the time did not exactly cover the hadoop version I was using, 0.20. (I believe they are up-to-date now). But was able to accommodate for this fact. The main two things different in hadoop version 0.19 than in version 0.20 is the configuration file and parts of the hadoop java API. In version 0.20, /conf/hadoop-site.xml was split in to three parts in /conf/core-site.xml, /conf/hdfs-site.xml, and /conf/mapred-site.xml and the API was re-factored slightly and some classes were depreciated.

My Mac minis were given hostnames of hadoop-node1-4. Hadoop-node1 was my master node that ran the NameNode and the JobTracker while hadoop-node2-4 were my slaves that ran the TaskTracker and the DataNode.

mac mini hadoop cluster
 # /conf/masters
 # /conf/slaves
 # The java processes
 hadoop@hadoop-node1:/usr/local/hadoop/conf$ jps
 15778 Jps
 30059 NameNode
 30187 SecondaryNameNode
 30291 JobTracker
 hadoop@hadoop-node2:/usr/local/hadoop$ jps
 16950 TaskTracker
 16838 DataNode
 20186 Jps

After getting the cluster setup I tested it out with the hadoop wordcount example and validated the results.


I then began writing a MapReduce Job with the Hadoop Java API. My first thought was to write my own RecordReader which is responsible for reading from an input split that would give key, value pairs to the mapper but decided to go with the default LineRecordReader which puts the file offset as the key and the line as the value because it seemed easier and more natural to have the log file line dissected inside of the Mapper’s map function.


In the map function each line went through a regexp that broke each piece out of the log file line. If the line contained Firefox’s appid, and VersionCheck.php I would map the IP as the Key and construct an AddonsWritable (which is a child of MapWritable with an overridden toString() for output purposes) that contained the epoch time (converted from the date timestamp because it would be much easier to compare with), a MapWritable of add-on guids, and a count of the number of add-ons.

public static class IPAddressMapper extends Mapper{
     /* member vars for mapper which include vars for regexp and storing data */
     private AddonsWritable logInfo = new AddonsWritable();
     public void map(LongWritable key, Text logLine, Context context) throws IOException, InterruptedException {
         if(logLine.matchesRegexp() && isFirefox() && hasVersionCheckphp()) {
             logInfo.put(EPOCH, epoch); // store the epoch_ts
             logInfo.put(GUID, guid); // store the guid 
             logInfo.put(TOTAL, ONE); // store the count
             context.write(ipAddress, logInfo); // map out the ipAddress as Key and logInfo as value


After the map the Hadoop Framework hashes keys and gives them to the reduce function. Inside the Reducer’s reduce function you are given the Key which is the IP address and an Iterable of the AddonsWritables that were from the same key/IP. I needed to group the values with update pings in a certain time window together and unfortunately the Iterable does not guarantee order. So I put the MapWritables in a PriorityQueue with a custom comparator that ordered values by the timestamp field in the AddonsWritable. Some IP’s had thousands of pings, so if I counted an IP with more than 2,000 pings I threw it out. Once all the values were placed in the PriorityQueue, I iterated over the priorityQueue popping off each value and comparing to the previous seen timestamp. If the abs(current timestamp – prev timestamp) <= 10 secs I considered them to be from the same user, and added them to a MapWritable of guids inside of the MapWritable that contained all the other information. Once I saw a current timestamp where the abs(current timestamp – prev timestamp) > 10secs I wrote/collected the previous values and started a new MapWritable for the next new window. Until there were no more values in the priorityQueue and I would write/collect the final current values.

public static class IPAddressEpochTimeReducer extends Reducer {
     private PriorityQueue pq = new PriorityQueue();
     public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException {
         for(AddonsWritable val: values) {
             pq.add(new AddonsWritable(val));
             if(pq.size() > 2000) {
                 return ;
         while(!pq.isEmpty()) {
             AddonsWritable val = pq.remove();
             if(lastEpoch != -1 && Math.abs(lastEpoch - currentEpoch) > SIXTY_SECONDS) {
                 writeOut();  // Write out all the information for the current collection of versioncheck pings
                 resetVars();  // Reset all the currently used vars for the next collection of versioncheck pings
             addGuid(output, val.get(GUID));
             sum += val.get(TOTAL);
             lastEpoch = currentEpoch;
         /* There is one more remaining.  Write it out */

Runtime Stats

Hadoop provides a web interface that will output runtime statistics for a job, below are the stats for the job described above.

  • Submitted At: 17-Jul-2009 09:55:12
  • Launched At: 17-Jul-2009 09:55:12 (0sec)
  • Finished At: 17-Jul-2009 14:05:52 (4hrs, 10mins, 39sec)
  • Average time taken by Map tasks: 2mins, 45sec
  • Average time taken by Shuffle: 2hrs, 33mins, 9sec
  • Average time taken by Reduce tasks: 1hrs, 4mins, 10sec
Kind Total Tasks(successful+failed+killed) Successful tasks Failed tasks Killed tasks Start Time Finish Time
Setup 1 1 0 0 17-Jul-2009 09:55:22 17-Jul-2009 09:55:24 (1sec)
Map 364 362 0 2 17-Jul-2009 09:55:25 17-Jul-2009 12:45:56 (2hrs, 50mins, 31sec)
Reduce 5 5 0 0 17-Jul-2009 10:13:40 17-Jul-2009 14:05:55 (3hrs, 52mins, 15sec)
Cleanup 1 1 0 0 17-Jul-2009 14:05:57 17-Jul-2009 14:06:03 (5sec)


The outputted files ended up having lines looking like the one below.

 IP              EPOCH_TS   ADDON_COUNT                           LIST_OF_GUIDS 1245665519000	2 {CAFEEFAC-0016-0000-0013-ABCDEFFEDCBA} {CAFEEFAC-0016-0000-0000-ABCDEFFEDCBA}

I created a python script to gather statistics out of the output. With the 10 second window I ended up finding out that there was a total of

  • 244,727,644 add-on update pings
  • 117,557,228 users
  • average of 2.14 add-ons per user
  • variance of 5.68

Since our Active Daily User (ADU) count for that day was 98,000,000 users this data didn’t make much sense. And we decided to repeat the processes with a 60 second window instead of the previous 10 second window.

With the 60 second window I ended up finding that there was

  • 94,656,833 users
  • average of 2.63 add-ons per user
  • variance of 12.34

These numbers still seemed fairly large so I decided to reduce on IP address which would allow us a base where there is at least 1 user behind an ip. To reduce by IP I changed the reducer to not account for the timestamp and simply reduce all values sharing the same IP.

The output of this was

  • 32,848,771 IP/USERS
  • average of 5.04 add-ons per user
  • variance of 779.91

To compare this data with Firefox’s ADU I ran a similar mapreduce job on our Firefox ADU data that counted

  • 61,460,501 IPs
  • average of 1.60 Users per IP
  • variance of 86.57

Shell script analytics

Recently, I was asked if I could provide a breakdown of Firefox users on the Macintosh platform by whether they were using the Intel or PPC chipset.  For anyone who only cares about seeing that data and not the “how” behind it, look no further than this link:
Firefox on Macintosh processor breakdown trends in Many Eyes

For anyone else, follow along.. Continue Reading »

There is a lot to be said on the above topic, but for the moment, I just wanted to drop a quick note about some ad-hoc work I did today:

I ran an analysis on a year and a half of FTP log files, filtering for some specific requests, and filtering out but summarizing uninteresting traffic from our heartbeat monitors and such.

I put together this simple Kettle transformation, and ran it with a cluster consisting of 32 fairly low powered slaves. The results were pleasing, especially considering I didn’t even go through a tuning process to determine the optimal number of step copies or row set sizes.

8782 files containing 432 million records. The processing was completed in 47 minutes giving a throughput of about 156 thousand rows per second.

There are a couple of screenshots after the cut.

Continue Reading »

The metrics team wants to reach out to the community of Bugzilla users to learn what would benefit them by asking: In Bugzilla, what kind of analytic questions should be answered? We have some analytic questions that we can answer, and we have gathered some questions that people would like to have answered. This blog post will outline these questions so you can contribute new questions to help us analyze the correct places. What would you like to see analyzed in Bugzilla?

What we currently have

The current Bugzilla SQR(Software Quality Reports) is placed in a star schema. A star schema is based on measures and dimensions. Measures are the main component being analyzed and can be aggregated. A good example of a measure would be sales dollar from a order. Dimensions are the “by” conditions. A good example of a dimension would be by time, or by product. Putting measures and dimensions together we can create sales dollars by day, sales dollars by week, and sales dollars by month or sales dollars by product. We can further group measures and dimensions in to cubes. Cubes are the collection of specific measures and dimensions. Right now the SQR has two cubes, Bugs, and Bugs Changes. With the SQR cubes we can ask things like: Bug Burndown By Product, Current Issues by Status and Product, Average Days to Resolution by product and priority, Overall open vs. closed trend for bugs by product. Below are all the details of each cube.

Bugs Cube

  • Bugs
  • Days Since Creation


  • Time Created
  • Bug Type
  • Person Assignee
  • Person Reporter
  • Priority
  • Product
  • Status
  • Time Release
  • Version

Bug Changes Cube

  • Changes
  • Actual Issues
  • Opened
  • Resolved
  • Net Open
  • Starting Net Open
  • Ending Net Open
  • Days Since Creation
  • Elapsed Days


  • Time Created
  • Issue Type
  • Person Assignee
  • Person Reporter
  • Priority
  • Product
  • Status
  • Time Released
  • Version

Current Analytic Questions Raised

We have talked to a few members of MoCo already and have also seen a few projects as well. Justin Scott has came up with an interesting project to analyze Bugzilla. Damon Sicore has a project that looks at Blockers By Priority, Blockers Closed By Day, Non-Blockers Closed by Day, Blockers added by Day, and a Break down of bugs being closed by person in each team. Murali Nandigama is a member of the QA team at Mozilla corporation and has raised a few Bugzilla metric questions of his own already. Murali is interested in: What are the top components where there are a lot of incoming critical and blocker bugs and how fast are we able to triage them to confirm or not?, After the bug is confirmed, how long are we taking to fix it?, After the bug is fixed how long are we taking to verify?, What is the mean time between confirmed to verify of bugs by components?, What is the re-fix rate and which component have significant re-fix rates?, and What is the duplicate bugs rate and which components have significant duplicate rates?

Now the metrics team wants your input. What would you like to see?

Mozilla’s bugzilla database contains approx. 480,000 bugs and approx. 5,000,000 entries in bugs_activity table and is too large for the initial development that I am doing. I want to construct a smaller sample Bugzilla data base that I can use to develop and run tests with for my project in a more efficient manner. To construct this new sample database I first want to prune the database by only housing the tables that are required. To figure out which tables are necessary I went inside of the SQR ETL job in spoon and wrote down the tables that were being used. The 8 necessary tables I found were:

  • verions
  • products
  • resolution
  • bug_status
  • priority
  • bugs
  • bugs_activity
  • profiles

These tables below could just be transferred in whole because of there small size and the information contained in them was just meta/attribute data of the bugs and besides that they didn’t have any related information to the bugs.bug_id:

  • verions
  • products
  • resolution
  • bug_status
  • priority

For the bugs I had to construct a statistical sampling and based on the bugs statistical sampling gather corresponding entries in the bugs_activity and profiles tables. I selected data from the bugs table and passed it through a reservoir sampling step that would pull a random amount of rows from the returned result set. I decided to pull a random sample of 1,000 bugs. After that I had a new job that matches the bug_id’s pulled from the reservoir sample with the bugs_activity table, and I then had to match the assigned_to and reporter fields from the random sample with the userid in the profiles table.

The creating of the Bugzilla Sample Database inside of spoon consists of 3 main jobs:

  1. Resetting the Sample Bugzilla Database
  2. Transferring the Non-bugs tables
  3. Transferring the Bugs tables

The first step will reset the sample Bugzilla database where we reset or clear any information stored in the database. The step follows the pseudo code below.

For Table t in the set of Tables T {
   If t exists {
      Drop t
   Create t

The second step will transfer the trivial tables mentioned above that do not involve the bugs table.

The third step will collect a sampling from the bugs table, output the entries into the new bugs table, select the bugs_activity entries based on bug_id, and profiles entries based on assigned_to and reporters, and output these into the new bugs_activity and profiles entries.

To begin you will want to have is access to a MySQL database where you can create, drop, and insert. The name of the database doesn’t matter but mine was called bugs_sample. After you have access to a MySQL database open up spoon, create a job, and setup the database connections as described below.

  1. Create a new job by going to File > New > Job .
  2. Right click in the Staging area and select Job settings from the menu that appears. This will open up the Job properties dialog window.
  3. Change the Job name: field to something like “Create Bugzilla Sample Database”.
  4. In the top left of spoon click on the Magnifying Glass with the label “View” under it.
  5. Right-Click on the label Database connections. From the drop down menu select New. This will open up the Database connection dialog box.
  6. The first database connection to create will be for the real Bugzilla database. Fill in the credentials to log into the Bugzilla database. An example is below.
  7. Click Test and you should get a message like the one below but with your credentials.
    If you didn’t receive this message check your MySQL connections and credentials.
  8. Repeat previous two steps above to create a database connection for the new Bugzilla Sample database.
  9. After the two database connections are set up it is now time to create a new job for the first step in the process. This will be done by creating a new job, and inside the job checking to see if each table exists. If it does drop it. Create the table. As described earlier. Follow this video to create the first part of the job.
    The schema for each table can be found here.
  10. Repeat the last part of the previous step for each table mentioned above. The end result should look something like the image below.
  11. Now we will create a new job for the second part of the process where we transfer all tables that don’t directly relate to the bugs table over to the new Sample Database. Create a new job entitled Transfer Non-bugs. This job will be composed of multiple transformations that will transfer tables from the Bugzilla database to our new Bugzilla Sample database. To construct the transformation to transfer a table from the Bugzilla database to the new Bugzilla Sample database follow this video. Create a new transformation for each non related to bugs table that are listed above.
  12. Once a transformation for each non related bug table has been completed and saved. Create a job to connect them together like the image below.
  13. Now it is time to create the third and last part of the transformation where we collect a sampling from the bugs table and based of this sampling select corresponding bugs_activity, and profiles entries. This is composed of two parts. The first part we will gather bugs and then do a reservoir sampling on this gathering. Then we will store the data to the new table and finally we will save some values to variables for use in a later query. This video describes how to construct the first part.

    NOTE: It was necessary to assign these values into a variable. I tried feeding the output fields from the group by CSV directly into a table input step query but the JDBC driver will only accept the first entry in the CSV when using the a WHERE IN (?) statement. By using the variable in the WHERE IN (${VARIABLE}) query eliminated this problem.

  14. The second part is two take the variables that were just assigned and used them in a WHERE of a query to select data that we need from the Bugzilla database and then dump the data into our new Bugzilla Sample database. This video describes this second part of the transformation.
  15. Finally we want to construct jobs that tie everything all together. This video shows the final job with everything connected.

If you would like to change your sample size change it in the reservoir sampling step. I would even recommend having a variable in this location to easily change the sample size when running the job. Also change the seed value inside of the reservoir sampling step each time to guarantee a new sample.

I have attached and included the source files in a compressed file for anyone interested. Read the README after unzipping.

I am working on a project this summer that analyzes Bugzilla. The basis of this project has been started by Nick Goodman and he entitled it Software Quality Reports (SQR). Software Quality Reports gives product managers, project managers, development managers, and software engineers more information on things like bug burn down rate by product, issues by status and product, average days to resolution by priority and product, open vs close trend by product, etc. I am going to take Nick Goodman’s SQR and improve it by making it more scalable and adding addition features that don’t currently exist.

A large part of this project involves doing an ETL(Extract Transform Load) on the bugzilla database into a star schema in side of a data warehouse. To design and run the ETL process I am using a program from the open source community project Pentaho BI (Business Intelligence) Suite called Pentaho Data Integration (PDI, and formerly known as kettle) and spoon — the graphical tool which is used to design and test every PDI process. Once the data is loaded in to the star schema, I will then use the Pentaho BI server to create graphs and charts to visualize and and drill down the data.

I will be reporting my work through out this blog and I hope that it will allow any one interested to participate, learn, or contribute. To get started right now you can read up about Pentaho, Pentaho Data Integration (formerly known as kettle), and Nick Goodman’s Software Quality Reports can be found on sourceforge.

« Prev