An optimization story with Django – one thousand times faster!

I’ve been working on a Django project that is going to be squeezed in as an external application under Pootle, our tech of choice for the Mozilla Verbatim site, which is where all the awesome translations for Mozilla’s web projects happen. (Translations for Firefox, Thunderbird, etc. are done by another project, which I’ll blog about in the next couple of weeks.)

In Pootle, there are “Profiles”, which is where a user’s name and quick stats on contributions are stored. However, what I needed to do was actually go through every “Suggestion”, “Submission” and “Review” and figure out who has contributed what, on which project and in which language. Every such model has an overridden object manager that always joins on the “Users” model, causing an INNER JOIN on even the simplest data query. The idea was to build up, on the fly, an accurate page of who has contributed what and where.
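
To illustrate, an overridden manager that forces a join on every query looks roughly like this (a minimal sketch in present-day Django with simplified model and field names; the actual Pootle code is structured differently):

    from django.db import models

    class RelatedUserManager(models.Manager):
        def get_queryset(self):
            # select_related() on a non-null foreign key adds an INNER JOIN
            # to the users table on every query, even queries that never
            # actually touch the user.
            return super().get_queryset().select_related("user")

    class Suggestion(models.Model):
        user = models.ForeignKey("auth.User", on_delete=models.CASCADE)
        # ... other fields elided ...
        objects = RelatedUserManager()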

There are 14 projects, 99 languages, 847 users, 37,681 suggestions (and reviews) and lastly 127,226 submissions. The task is to group these first by language, then by project and lastly, for each project, list the contributors’ names in alphabetical order. After some time getting used to the data model and where things are located, this wasn’t difficult. I built a prototype that visually represents this and showed it to the stakeholder who wanted it built.
The only problem is that, due to the way the Django ORM and Pootle do this, it takes on average 11 seconds and involves 2,544 SQL queries. Yikes!

After some optimization (helped by my colleague Rob Hudson’s wonderful django-debug-toolbar) I was able to turn 11 seconds into 10 milliseconds (i.e. one thousand times faster) and reduce the number of queries from 2,544 to 7! All it takes is giving up some of the convenience of Django’s otherwise super practical ORM and getting closer to the “SQL of things”. The code doesn’t express anything in raw SQL; everything is in Python, with some very specific loops.
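
If you want to reproduce numbers like these outside of django-debug-toolbar, Django itself can report the queries it ran when DEBUG is on. Something like this in a shell does the trick (build_contributions_page is a hypothetical stand-in for whatever assembles the page data):

    import time
    from django.db import connection, reset_queries

    reset_queries()  # connection.queries is only recorded when settings.DEBUG = True
    start = time.time()
    data = build_contributions_page()  # hypothetical: builds the page data
    elapsed = time.time() - start
    print("%d queries in %.3f seconds" % (len(connection.queries), elapsed))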

Why 7 queries and not 1, you might wonder? The reason is that the data I need is spread across 7 different tables. Because some of them are large and joins get expensive, what I do instead is pick exactly what I want and need and nothing more. The logic goes roughly like this (sketched in code after the list):

    • Fetch all IDs and their names (e.g. 52 and “Arabic”)
    • For the relevant tables with foreign keys, fetch only the row ID and the foreign key IDs, and keep these in lists or dictionaries whose structure implies the information
    • Sort and re-organize these lists and dictionaries in Python
    • Summarize and replace the IDs with the names for the final template rendering
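
Put together, the approach looks roughly like this (a sketch with made-up field names; the real Pootle models hang together differently):

    from collections import defaultdict

    # Assumed models: Language, Project, User, Suggestion, each with the
    # field names used below (the real Pootle schema differs).

    # 1. Small tables: fetch all IDs and their names, e.g. (52, "Arabic")
    languages = dict(Language.objects.values_list("id", "fullname"))
    projects = dict(Project.objects.values_list("id", "fullname"))
    users = dict(User.objects.values_list("id", "username"))

    # 2. Big tables: fetch only the foreign key IDs -- no joins, no model
    #    instances, just tuples straight off the cursor.
    contributors = defaultdict(set)  # (language_id, project_id) -> user IDs
    query = Suggestion.objects.values_list("language_id", "project_id", "user_id")
    for language_id, project_id, user_id in query:
        contributors[(language_id, project_id)].add(user_id)

    # 3. and 4. Re-organize in Python and swap the IDs for names, ready for
    #    the template: {language name: {project name: [sorted user names]}}
    page = {}
    for (language_id, project_id), user_ids in contributors.items():
        names = sorted(users[uid] for uid in user_ids)
        page.setdefault(languages[language_id], {})[projects[project_id]] = names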

I think the lesson here is that if your data structures are big, but not so big that you need a map/reduce job, you can get closer to the raw SQL queries and do the structure juggling in memory; you don’t need to replace your relational database with NoSQL or hand-typed raw SQL. Django’s ORM is fantastic. It’s slow because it chooses to be convenient, but when it needs to be fast it’s just a few slight API calls away. If you’re curious, check out the code on GitHub.

2 responses

  1. Simeon wrote on :

    Can’t say for sure w/o specific code samples, but possibly select_related() would have helped to solve this particular performance problem. The huge number of SQL queries makes it sound as though you are forcing new SQL queries in the template layer by following relationships on ORM objects. Frequently you can use select_related() to force Django to go ahead and load the related objects that you will be evaluating in the template, so that you get one query with a join instead of one query per object to be loaded… But I’ve totally done the sort of thing you’re describing and sometimes it’s the necessary solution – it always feels good to get back an order-of-magnitude speedup!
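
    To illustrate what I mean (assuming a Suggestion model with a “user” foreign key):

        # Lazy loading: one query for the suggestions, then one extra
        # query per row the moment suggestion.user is touched (N+1).
        for suggestion in Suggestion.objects.all():
            print(suggestion.user.username)

        # Eager loading: a single query with a join; suggestion.user is
        # already populated when the rows come back.
        for suggestion in Suggestion.objects.select_related("user"):
            print(suggestion.user.username)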

  2. Peter Bengtsson wrote on :

    @Simeon, thanks for the response! Yeah, the select_related() method sure is nifty. However, like the layers of an onion, I think basic ORM usage is the first layer, select_related() the second layer, and my trick mentioned above the third layer.

    Watch out though, as SQL joins can sometimes make queries slower, and if the conditions are right it’s sometimes easier to avoid them and walk around the problem altogether, which I think is what I have done here.