Feed on

Shell script analytics

Recently, I was asked if I could provide a breakdown of Firefox users on the Macintosh platform by whether they were using the Intel or PPC chipset.  For anyone who only cares about seeing that data and not the “how” behind it, look no further than this link:
Firefox on Macintosh processor breakdown trends in Many Eyes

For anyone else, follow along..
While my standard processing libraries parse out a lot of information from our daily traffic logs, the user agent string is not something that I parse out of the main log data because the current processing engine I have for user agent strings is a little slow and it takes a lot of CPU.

Sometimes, especially when dealing with log files as your source data, it just takes too much time to build up a big system to do one-off processing.  Way back when, Unix was founded on the idea of being able to chain together small specific tools to be able to do larger jobs efficiently.  That is something that I find very relevant in my day to day work.

So, in order to provide the numbers requested above, I turned to my tool chain of command line utilities.  One of my favorite tools in that tool chain is AWK.  It may not be quite as powerful as Perl or Python, but it is quick and simple and it is quite good at getting the job done.

Now, our automatic update web log traffic is a medium sized source of data I have to churn through.  Here are a few data points:

  • We generate about four to six gigabytes of log data per day for download.mozilla.org.  It spikes to as high as thirty gigabytes on release days and on release weeks, the median is much higher, somewhere between ten to fifteen gigabytes.
  • We generate about thirty gigabytes of log data per day for aus.mozilla.org during the week.  It dips as low as twenty-two gigabytes on Saturdays.
  • We generate more than two hundred and sixty gigabytes of log data per day for addons.mozilla.org and versioncheck.addons.mozilla.org during the week.  It dips to just over two hundred on the weekends.

Update:  I don’t know if anyone noticed, but I had a typo in the file glob for the versioncheck histogram. That is why there was a spike in the middle instead of the larger volume of traffic for the whole time period.

How did I get those numbers you ask? Why with a shell script:

[me@etl01 download.mozilla.org]$ gzip -l access_2009-07*.gz | awk -f ~/bin/histogram_sum.awk scale=500000000 column=8 width=10
2009-07-08 -   6.44GB: **************
2009-07-09 -   6.19GB: **************
2009-07-10 -   5.68GB: *************
2009-07-11 -   4.59GB: **********
2009-07-12 -   4.90GB: ***********
2009-07-13 -   6.02GB: *************
2009-07-14 -   5.85GB: *************
2009-07-15 -   5.81GB: *************
2009-07-16 -   6.13GB: **************
2009-07-17 -   9.37GB: *********************
2009-07-18 -   6.45GB: **************
2009-07-19 -   6.00GB: *************
2009-07-20 -   7.12GB: ****************
2009-07-21 -  12.72GB: ****************************
2009-07-22 -  33.78GB: *************************************************************************
2009-07-23 -  23.09GB: **************************************************
2009-07-24 -  15.41GB: **********************************
2009-07-25 -  10.95GB: ************************
2009-07-26 -  10.38GB: ***********************
2009-07-27 -  11.75GB: **************************
2009-07-28 -  10.39GB: ***********************
2009-07-29 -   7.32GB: ****************
Total counts = 216.93GB

[me@etl01 aus.mozilla.org]$ gzip -l access_2009-{06-30,07}*.gz | awk -f ~/bin/histogram_sum.awk scale=500000000 column=8 width=10
2009-06-30 -  30.55GB: ******************************************************************
2009-07-01 -  29.88GB: *****************************************************************
2009-07-02 -  29.21GB: ***************************************************************
2009-07-03 -  26.12GB: *********************************************************
2009-07-04 -  21.97GB: ************************************************
2009-07-05 -  25.63GB: ********************************************************
2009-07-06 -  30.20GB: *****************************************************************
2009-07-07 -  29.88GB: *****************************************************************
2009-07-08 -  29.78GB: ****************************************************************
2009-07-09 -  29.22GB: ***************************************************************
2009-07-10 -  27.06GB: ***********************************************************
2009-07-11 -  22.74GB: *************************************************
2009-07-12 -  25.69GB: ********************************************************
2009-07-13 -  29.81GB: *****************************************************************
2009-07-14 -  29.49GB: ****************************************************************
2009-07-15 -  29.34GB: ****************************************************************
2009-07-16 -  28.79GB: **************************************************************
2009-07-17 -  26.98GB: **********************************************************
2009-07-18 -  22.44GB: *************************************************
2009-07-19 -  25.11GB: ******************************************************
2009-07-20 -  29.50GB: ****************************************************************
2009-07-21 -  29.32GB: ***************************************************************
2009-07-22 -  28.15GB: *************************************************************
2009-07-23 -  25.53GB: *******************************************************
2009-07-24 -  24.34GB: *****************************************************
2009-07-25 -  20.89GB: *********************************************
2009-07-26 -  23.84GB: ****************************************************
2009-07-27 -  28.23GB: *************************************************************
2009-07-28 -  28.34GB: *************************************************************
2009-07-29 -  22.12GB: ************************************************
Total counts = 810.15GB

[me@etl01 logs]$ cat
 <(cd addons.mozilla.org; gzip -l access_2009-{06-30,07}*.gz | tail -n +2)
 <(cd versioncheck.addons.mozilla.org; gzip -l access_2009-{06-30,07}*.gz | tail -n +2)
 | sort -k 4 | awk -f ~/bin/histogram_sum.awk scale=5000000000 column=8 width=10
2009-06-30 - 284.10GB: **************************************************************
2009-07-01 - 285.41GB: **************************************************************
2009-07-02 - 259.99GB: ********************************************************
2009-07-03 - 227.03GB: *************************************************
2009-07-04 - 192.92GB: ******************************************
2009-07-05 - 216.33GB: ***********************************************
2009-07-06 - 247.53GB: ******************************************************
2009-07-07 - 244.36GB: *****************************************************
2009-07-08 - 243.49GB: *****************************************************
2009-07-09 - 235.46GB: ***************************************************
2009-07-10 - 218.33GB: ***********************************************
2009-07-11 - 190.16GB: *****************************************
2009-07-12 - 207.19GB: *********************************************
2009-07-13 - 231.92GB: **************************************************
2009-07-14 - 232.91GB: ***************************************************
2009-07-15 - 230.37GB: **************************************************
2009-07-16 - 228.75GB: **************************************************
2009-07-17 - 236.00GB: ***************************************************
2009-07-18 - 203.75GB: ********************************************
2009-07-19 - 214.33GB: ***********************************************
2009-07-20 - 238.31GB: ****************************************************
2009-07-21 - 242.12GB: ****************************************************
2009-07-22 - 277.91GB: ************************************************************
2009-07-23 - 272.68GB: ***********************************************************
2009-07-24 - 235.63GB: ***************************************************
2009-07-25 - 202.74GB: ********************************************
2009-07-26 - 216.85GB: ***********************************************
2009-07-27 - 241.81GB: ****************************************************
2009-07-28 - 238.79GB: ****************************************************
2009-07-29 - 232.97GB: ***************************************************
Total counts = 7099.05GB

The command I ran to generate the stats driving the Many Eyes site above was this:

pv access_2009-{06-30,07}*.gz |
 gzip -cd | grep -F "Macintosh" |
 sed -rne '/"GET \/update\/[1-3]\/Firefox\/.*?Macintosh/s/^.*?\[([0-9]{2}\/[a-zA-Z]{3}\/[0-9]{4}).*?(Intel|PPC) Mac OS X.*/\1\t\2/p' |
 awk -f ~/bin/tabulate_fields.awk kf=1 cf=2

The interesting parts of this command are:

pv access_2009-{06-30,07}*.gzPipe Viewer a utility for monitoring the progress of data through a pipeline.  When the argument to pv is a file, it stats file file to determine how large it is so that it can calculate an ETA to finish processing it.  This is why I do the otherwise foolish trick of cat’ing to gzip -cd instead of passing the file list direclty to gzip or zgrep.  Here, I’m passing in a fileglob of all the download access logs from June 30 to July 28th (the day I ran it).

grep -F “Macintosh” — I go ahead and pay the penalty for an extra process here because grep -F is much more efficient at throwing away the 98% of the log data that I’m not interested in than the much more complicated sed regex that follows.  The best part of pv is that it lets you actually test optimizations like this.  When I ran without the grep, I was processing about 1.5 MB/s.  When I put in the simple grep filter, that number jumped up to 5.5 MB/s.

sed -rne … — I won’t go into the gritty details about the regex for now.  Suffice to say that this regex looks for Firefox update checks that have the string Macintosh later in the request (hopefully in the user agent data).  For those matches, it then pulls out the request date and a single word, “Intel” or “PPC” if it is immediately followed by “Mac OS X”.

awk -f ~/bin/tabulate_fields.awk kf=1 cf=2 — This is my pièce de résistance.  Given many different types of tabular data, it can count occurrences of values for a given key value, and it can also sum a numeric column for the counted occurrences. This is a fairly simple use case.  It just keys off of the date in the first column of the output of sed and counts the occurrences of “Intel” or “PPC” for each day.

Hope you enjoyed this geek session.  If you would like to take the time to show me how I can do this things much more elegantly or quickly using Perl or Python, I’d actually love to hear it!  I really should learn to stop avoiding these wonderful languages just because I’m more familiar with an older tool. ;)


4 Responses to “Shell script analytics”

  1. on 29 Jul 2009 at 9:43 pm Jesse Ruderman

    Manyeyes requires Java :( What’s the answer?

  2. on 29 Jul 2009 at 9:51 pm deinspanjer

    Yeah, it requires Java. Apparently, according to Shaver, it won’t work with Minefield at the moment because we are waiting on Apple to deliver a new browser plug-in. I hope that you’ll agree that it is still a little better than requiring some other plug-in though because at least there are reasonable open source alternatives for this one.

  3. on 29 Jul 2009 at 11:16 pm Julian Hyde

    Yay, let’s hear it for awk. I’m a big fan too.

    It’s my language of choice when building moderately complex scripts. Perl and python give you the illusion of power, and tempt you to write whole applications in those languages, whereas awk tends to run out of road. This is good, because it forces you to use a ‘real’ language like Java for the big apps.

    I also use awk on the command line for really simple stuff, because I never can remember the syntax for ‘cut’.

    Finally, I always type ‘gawk’ these days. In some distros awk is aliased to mawk, and mawk has let me down one time too many with some subtle bugs.

    (Wow, that turned into a confession.)


  4. on 30 Jul 2009 at 12:49 am Arthur

    @Jesse: ~75% Intel vs. ~25% PPC

Trackback URI | Comments RSS

Leave a Reply

You must be logged in to post a comment.