
Mozilla’s Bugzilla database contains approximately 480,000 bugs and approximately 5,000,000 entries in the bugs_activity table, which is too large for the initial development that I am doing. I want to construct a smaller sample Bugzilla database that I can use to develop and run tests for my project more efficiently. To construct this new sample database I first wanted to prune it down to only the tables that are required. To figure out which tables are necessary, I went through the SQR ETL job in Spoon and wrote down the tables that were being used. The 8 necessary tables I found were:

  • versions
  • products
  • resolution
  • bug_status
  • priority
  • bugs
  • bugs_activity
  • profiles

The tables below could be transferred in whole because of their small size; the information in them is just meta/attribute data about bugs, and beyond that they have no relationship to bugs.bug_id:

  • versions
  • products
  • resolution
  • bug_status
  • priority

For the bugs I had to construct a statistical sample and, based on it, gather the corresponding entries in the bugs_activity and profiles tables. I selected data from the bugs table and passed it through a reservoir sampling step that pulls a fixed number of random rows from the returned result set. I decided to pull a random sample of 1,000 bugs. After that I had a new job that matched the bug_ids pulled from the reservoir sample against the bugs_activity table, and I then matched the assigned_to and reporter fields from the random sample against the userid column in the profiles table.
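Kettle’s Reservoir Sampling step implements the classic streaming algorithm (Algorithm R); below is a minimal Python sketch of the same idea. The 480,000-row range is a stand-in for rows streamed out of the real bugs table.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of rows.

    Algorithm R: keep the first k rows, then replace a kept row with
    decreasing probability as later rows stream past. Only k rows are
    held in memory, so the full bugs table never has to fit at once.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                sample[j] = row
    return sample

# e.g. sample 1,000 bug ids from a 480,000-row stream
picked = reservoir_sample(range(480_000), 1_000, seed=42)
```

Changing the seed, as recommended later, yields a different sample on each run.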

The creation of the Bugzilla Sample Database inside of Spoon consists of 3 main jobs:

  1. Resetting the Sample Bugzilla Database
  2. Transferring the Non-bugs tables
  3. Transferring the Bugs tables

The first step resets the sample Bugzilla database, dropping and recreating its tables so that any information already stored in them is cleared. The step follows the pseudocode below.

For Table t in the set of Tables T {
   If t exists {
      Drop t
   }
   Create t
}
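The reset loop can be sketched concretely; the sketch below uses SQLite as a stand-in for MySQL, and the CREATE statements are placeholders rather than the real Bugzilla schema.

```python
import sqlite3  # stand-in for a MySQL connection; the loop is the same

# Placeholder DDL -- the real schema for each table comes from Bugzilla.
TABLES = {
    "versions":   "CREATE TABLE versions (value TEXT, product_id INTEGER)",
    "resolution": "CREATE TABLE resolution (value TEXT)",
    # ... one CREATE statement per table being reset
}

def reset_sample_db(conn):
    """Drop each sample table if it exists, then recreate it empty."""
    cur = conn.cursor()
    for table, create_sql in TABLES.items():
        cur.execute(f"DROP TABLE IF EXISTS {table}")  # If t exists: Drop t
        cur.execute(create_sql)                       # Create t
    conn.commit()

conn = sqlite3.connect(":memory:")
reset_sample_db(conn)
reset_sample_db(conn)  # safe to rerun against an already-populated database
```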

The second step will transfer the trivial tables mentioned above that do not involve the bugs table.

The third step will collect a sample from the bugs table, write those entries into the new bugs table, select the bugs_activity entries by bug_id and the profiles entries by assigned_to and reporter, and write these into the new bugs_activity and profiles tables.
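Deriving the keys for those two dependent selections looks roughly like this in code. The dict-of-columns rows are illustrative, not real Mozilla data; the column names bug_id, assigned_to, reporter, and userid are the ones used above.

```python
def dependent_keys(sampled_bugs):
    """From the sampled bugs rows, derive the key sets used to pull the
    matching bugs_activity and profiles rows."""
    bug_ids = {b["bug_id"] for b in sampled_bugs}
    user_ids = ({b["assigned_to"] for b in sampled_bugs}
                | {b["reporter"] for b in sampled_bugs})
    return bug_ids, user_ids

# Illustrative sample rows.
sample = [
    {"bug_id": 101, "assigned_to": 7, "reporter": 3},
    {"bug_id": 205, "assigned_to": 7, "reporter": 9},
]
bug_ids, user_ids = dependent_keys(sample)
# bugs_activity is then selected WHERE bug_id IN bug_ids,
# and profiles WHERE userid IN user_ids
```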

To begin you will want access to a MySQL database where you can create, drop, and insert. The name of the database doesn’t matter; mine was called bugs_sample. Once you have access to a MySQL database, open up Spoon, create a job, and set up the database connections as described below.

  1. Create a new job by going to File > New > Job.
  2. Right-click in the staging area and select Job settings from the menu that appears. This will open the Job properties dialog window.
  3. Change the Job name: field to something like “Create Bugzilla Sample Database”.
  4. In the top left of spoon click on the Magnifying Glass with the label “View” under it.
  5. Right-Click on the label Database connections. From the drop down menu select New. This will open up the Database connection dialog box.
  6. The first database connection to create will be for the real Bugzilla database. Fill in the credentials to log into the Bugzilla database. An example is below.
  7. Click Test and you should get a message like the one below but with your credentials.
    If you didn’t receive this message check your MySQL connections and credentials.
  8. Repeat the previous two steps to create a database connection for the new Bugzilla Sample database.
  9. After the two database connections are set up, it is time to create a new job for the first step in the process: a job that checks whether each table exists, drops it if it does, and then creates the table, as described earlier. Follow this video to create the first part of the job.
    The schema for each table can be found here.
  10. Repeat the last part of the previous step for each table mentioned above. The end result should look something like the image below.
  11. Now we will create a new job for the second part of the process, where we transfer all tables that don’t directly relate to the bugs table over to the new Sample Database. Create a new job entitled Transfer Non-bugs. This job will be composed of multiple transformations, each transferring one table from the Bugzilla database to our new Bugzilla Sample database. To construct the transformation that transfers a table, follow this video. Create a new transformation for each of the non-bugs tables listed above.
  12. Once a transformation for each non-bugs table has been completed and saved, create a job to connect them together, like the image below.
  13. Now it is time to create the third and last part of the transformation, where we collect a sample from the bugs table and, based on this sample, select the corresponding bugs_activity and profiles entries. This is composed of two parts. In the first part we gather bugs and run a reservoir sampling on them; we then store the sampled data in the new table and finally save some values to variables for use in a later query. This video describes how to construct the first part.

    NOTE: It was necessary to assign these values to a variable. I tried feeding the output fields from the Group By CSV directly into a Table Input step query, but the JDBC driver will only accept the first entry in the CSV when using a WHERE IN (?) statement. Using the variable in a WHERE IN (${VARIABLE}) query eliminated this problem.
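    A sketch of the workaround: build the id list as a single string and splice it into the SQL text, which is effectively what the ${VARIABLE} substitution does in Kettle. The in_list helper here is mine, not a Kettle step.

```python
def in_list(ids):
    """Render integer ids as the comma-separated list stored in the
    variable and substituted into WHERE bug_id IN (${VARIABLE}).

    A JDBC prepared statement binds WHERE bug_id IN (?) to exactly one
    value, so passing the whole CSV as a single parameter matches only
    the first id; interpolating the list into the SQL text avoids that.
    """
    return ", ".join(str(int(i)) for i in sorted(ids))  # int() rejects non-numeric input

query = f"SELECT * FROM bugs_activity WHERE bug_id IN ({in_list({205, 101})})"
# query: SELECT * FROM bugs_activity WHERE bug_id IN (101, 205)
```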

  14. The second part is to take the variables that were just assigned and use them in the WHERE clause of a query to select the data we need from the Bugzilla database, then dump that data into our new Bugzilla Sample database. This video describes this second part of the transformation.
  15. Finally, we want to construct jobs that tie everything together. This video shows the final job with everything connected.

If you would like to change your sample size, change it in the reservoir sampling step. I would even recommend putting a variable in this location to easily change the sample size when running the job. Also change the seed value inside the reservoir sampling step each time to guarantee a new sample.

I have attached and included the source files in a compressed file for anyone interested. Read the README after unzipping.

3 Responses to “Creating a sample bugzilla database using kettle”

  1. on 07 Jun 2009 at 9:42 pm Max Kanat-Alexander

    This is really incredible and involved, but for the copying part, why didn’t you just copy everything except for longdescs and attach_data? Those are the only two tables that are of any size. All of the other tables are pretty tiny, relatively.


  2. on 08 Jun 2009 at 5:28 am deinspanjer

    The bmo bugs_activity table has several million records in it. The size of this table was a limitation due to the way the ETL is processing history currently. The only tables we did special processing on were bugs, bugs_activity, and users.

    The other tables that we didn’t copy were left out simply to make it easier to develop, test, and debug our ETL processes.

  3. on 11 Jun 2009 at 8:36 am Nicholas Goodman

    Daniel is spot on…. The current SQR does horribly inefficient processing of the bug changes. Making the database smaller to do the development (to simply change the ETL to use the Analytic Query step) makes perfect sense. Adding back in the 5 million records should certainly be feasible, once the ETL is refactored.
