I’ve been working on a Django project that is going to be squeezed in as an external application under Pootle, which is our tech of choice for the Mozilla Verbatim site, where all the awesome translations for Mozilla web projects happen. (Translations for Firefox, Thunderbird, etc. are done by another project, which I’ll blog about in the next couple of weeks.)
In Pootle, there are “Profiles”, and that’s where a user’s name and quick stats on their contributions are stored. What I needed to do, however, was to actually go through every “Suggestion”, “Submission” and “Review” and figure out who has contributed what, on which project and in which language, so that an accurate page of contributions could be built up on the fly. Each of these models has an overridden object manager that always joins on the “Users” model, causing an INNER JOIN on every simple data query.
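To make that join behaviour concrete, here’s a minimal sketch of what such an overridden manager looks like. This is not Pootle’s actual code; the class and field names are my own stand-ins:

```python
from django.contrib.auth.models import User
from django.db import models


class UserJoiningManager(models.Manager):
    # select_related('user') means every query on this model, however
    # simple, drags an INNER JOIN on the users table into the SQL.
    def get_query_set(self):
        return super(UserJoiningManager, self).get_query_set().select_related('user')


class Suggestion(models.Model):
    user = models.ForeignKey(User)
    objects = UserJoiningManager()
```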
There are 14 projects, 99 languages, 847 users, 37,681 suggestions (and reviews) and lastly 127,226 submissions. The task was to group these first by language, then by project, and lastly, for each project, list the contributors’ names in alphabetical order. After getting used to the data model and where things are located, this wasn’t difficult. I built a prototype that visually represents this and showed it to the stakeholder who wanted it built.
The only problem was that, due to the way the Django ORM and Pootle go about this, building the page took on average 11 seconds and involved 2,544 SQL queries. Yikes!
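For illustration, the slow version follows roughly the pattern below. This is a simplified, hypothetical reconstruction, not the actual Pootle code, and the model and field names are assumptions; the point is that querying inside nested loops multiplies the query count by languages × projects:

```python
# Hypothetical reconstruction of the naive approach (model and
# field names are assumed, not Pootle's real schema).
for language in Language.objects.all():      # 1 query
    for project in Project.objects.all():    # 1 query per language
        # One more query per (language, project) pair, each of which
        # also INNER JOINs the users table via the overridden manager.
        suggestions = Suggestion.objects.filter(
            translation_project__language=language,
            translation_project__project=project,
        )
        names = sorted(set(s.user.username for s in suggestions))
```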
After some optimization (helped by my colleague Rob Hudson’s wonderful django-debug-toolbar) I was able to turn 11 seconds into 10 milliseconds (i.e. one thousand times faster) and reduce the number of queries from 2,544 to 7! All it takes is giving up some of the convenience of Django’s otherwise super practical ORM and getting closer to the “SQL of things”. The code doesn’t express anything in raw SQL; everything is in Python, with some very specific loops.
Why 7 queries and not 1, you might wonder. The reason is that the data I need to get to is spread across 7 different tables. Because some of them are large and the joins are expensive, what I do instead is pick exactly what I want and need and nothing more. The logic kinda goes like this (sketched in code after the list):
- Fetch all IDs and their names (e.g. 52 and “Arabic”)
- For each relevant table with foreign keys, fetch only its ID and the foreign key IDs, and keep these in lists or dictionaries whose structure implies the information
- Sort and re-organize these lists and dictionaries in Python
- Summarize and replace the IDs with the names for the final template rendering
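Here’s a sketch of what that looks like for one of the contribution tables. Again, the model and field names are assumptions standing in for Pootle’s real schema; the technique itself, values_list() for flat ID tuples plus plain dicts, is the actual point:

```python
from collections import defaultdict

# One cheap query per lookup table: just (id, name) tuples,
# no model instances constructed.
languages = dict(Language.objects.values_list('id', 'fullname'))
projects = dict(Project.objects.values_list('id', 'fullname'))
users = dict(User.objects.values_list('id', 'username'))

# Map each translation project ID to its (language ID, project ID) pair.
tp_map = dict(
    (tp_id, (lang_id, proj_id))
    for tp_id, lang_id, proj_id in TranslationProject.objects.values_list(
        'id', 'language_id', 'project_id')
)

# One flat query over the big table; group user IDs in memory.
contributors = defaultdict(set)
for tp_id, user_id in Suggestion.objects.values_list(
        'translation_project_id', 'user_id'):
    contributors[tp_map[tp_id]].add(user_id)

# Only now swap IDs for names, sorted alphabetically for the template.
grouped = {}
for (lang_id, proj_id), user_ids in contributors.items():
    names = sorted(users[uid] for uid in user_ids)
    grouped.setdefault(languages[lang_id], {})[projects[proj_id]] = names
```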
I think the lesson here is that if your data structures are big, but not so big that you need a map/reduce job, you can get closer to the raw SQL queries and do the structure juggling in memory; you don’t need to replace your relational database with NoSQL or hand-typed raw SQL. Django’s ORM is fantastic. It’s slow because it chooses to be convenient, but when it needs to be fast it’s just a few slight API calls away. If you’re curious, check out the code on Github.