Knowing what you want can make a huge difference (…with pymongo)

So I’ve got a piece of MongoKit code that simply wraps a pymongo query, and it looks like this:

        times = defaultdict(list)
        for pq in (db.PlayedQuestion.collection
                   .find({'play.$id': {'$in': play_ids}})):
            if pq['time'] is not None:
                if pq['right']:
                    times['right'].append(pq['time'])
                elif pq['answer']:
                    times['wrong'].append(pq['time'])

What I noticed was that it was quite slow. For example, looping over 12,200 documents to get this data out took a big fat 600 milliseconds on average! Note that it’s not dressing things up as Python model instances; .collection.find() just returns plain dicts straight from pymongo.
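For the record, here’s roughly how one could take such a measurement. This is just a sketch with a plain wall-clock timer and an arbitrary number of repetitions; it’s not the exact benchmarking code behind the numbers in this post:

import time

def avg_ms(fn, repeats=10):
    """Run fn() `repeats` times and return the average duration in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.time()
        fn()
        durations.append(time.time() - start)
    return 1000.0 * sum(durations) / len(durations)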

After realising that I actually only need time, right and answer, I just set the fields parameter like this:

        times = defaultdict(list)
        for pq in (db.PlayedQuestion.collection
                   .find({'play.$id': {'$in': play_ids}},
                         fields=('time','right','answer')  # the difference
                         )):
            if pq['time'] is not None:
                if pq['right']:
                    times['right'].append(pq['time'])
                elif pq['answer']:
                    times['wrong'].append(pq['time'])

Looping over the same 12,200 documents now only takes 200 milliseconds on average. Now we’re talking!

But wait a minute! If I only care about those with a time that isn’t None (null), can’t I include that in the query? Well, the time field is not indexed, but it’s usually a good idea to push work onto the database anyway. Let’s try:

        times = defaultdict(list)
        for pq in (db.PlayedQuestion.collection
                   .find({'play.$id': {'$in': play_ids},
                          'time': {'$ne': None}},
                         fields=('time','right','answer')
                         )):
            if pq['right']:
                times['right'].append(pq['time'])
            elif pq['answer']:
                times['wrong'].append(pq['time'])

As it turns out, only 1,575 of my 12,200 documents have a time attribute that isn’t null. So doing it this way takes only about 39 milliseconds on average.
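Those counts are easy to double-check; a quick sketch using the cursor’s count() method (the usual idiom in pymongo versions of this era):

        total = db.PlayedQuestion.collection.find(
            {'play.$id': {'$in': play_ids}}).count()
        with_time = db.PlayedQuestion.collection.find(
            {'play.$id': {'$in': play_ids},
             'time': {'$ne': None}}).count()
        # total == 12200 and with_time == 1575 in my case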

Let’s repeat the same logic and push even more work into the database instead. This time it’s two queries instead of one, but with hardly any Python control flow left:

        times = defaultdict(list)
        for pq in (db.PlayedQuestion.collection
                   .find({'play.$id': {'$in': play_ids},
                          'time': {'$ne': None},
                          'right': True},
                         fields=('time',)
                         )):
            times['right'].append(pq['time'])

        for pq in (db.PlayedQuestion.collection
                   .find({'play.$id': {'$in': play_ids},
                          'time': {'$ne': None},
                          'right': False,
                          'answer': {'$ne': u''}
                          },
                         fields=('time',)
                         )):
            times['wrong'].append(pq['time'])

You’d think this would be a lot faster now, since the only field it returns is time, which is a floating point number and takes up significantly fewer bytes than answer, which is a unicode string. However, this approach takes 63 milliseconds on average, which is worse than before. Having said that, things are getting a bit weird now because there are only 1,575 documents to loop over, so a large part of those 63 milliseconds is probably spent just setting up and tearing down the two queries. Next: map/reduce!

As you might have noticed, this query is now ripe for an aggregation instead of this silly for-loop. So, let’s try a map/reduce job on it:

from bson.code import Code
map_ = Code("""
 function() {
   if (this.right)
     emit('right', {total:this.time, count:1});
   if (!this.right && this.answer)
     emit('wrong', {total:this.time, count:1});
 }
""")

reduce_ = Code("""
 function r( who , values ){
   var n = { total : 0 , count : 0 };
   for ( var i=0; i<values.length; i++ ){
     n.total += values[i].total;
     n.count += values[i].count;
   }
   return n;
 }
""")

finalize = Code("""
 function f( who , res ){
     res.avg = res.total / res.count;
     return res;
  }
""")
 ...
        result = db.PlayedQuestion.collection.map_reduce(
          map_, reduce_, 'myoutput',
          finalize=finalize,
          query={'play.$id': {'$in': play_ids},
                 'time': {'$ne': None},
                 }
        )
        times = {}
        for reduction in result.find():
            if reduction['_id'] == 'right':
                times['right'] = reduction['value']['avg']
            elif reduction['_id'] == 'wrong':
                times['wrong'] = reduction['value']['avg']

Unfortunately, all that juggling and JavaScript malarkey isn’t actually that performant. This takes 130 milliseconds on average. Bear in mind that the query already reduces the collection down to a small number of documents (1,575), which might mean that a map/reduce job simply isn’t the right tool here.

In conclusion

I think doing this optimization dance was a bit premature. The collection is simply too small. Looping over 1.5k lightweight dicts in Python is a breeze. Also, knowing what data you need from a query makes a huge difference (from 600 down to 200 milliseconds in my case).

Also, by the time my database is big enough that silly loops in the driver no longer cut it, MongoDB might have the promised new aggregation functions (estimated for version 2.2, two releases away) which will do basic operations such as averaging natively, without the JavaScript parsing.
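For what it’s worth, here’s my guess at what that could look like with the new aggregation framework. This is a sketch only: it assumes pymongo grows an aggregate() method that takes a pipeline, and I haven’t run it against any released version:

        result = db.PlayedQuestion.collection.aggregate([
            {'$match': {'play.$id': {'$in': play_ids},
                        'time': {'$ne': None}}},
            {'$group': {'_id': '$right',           # True -> right, False -> wrong
                        'avg': {'$avg': '$time'},
                        'count': {'$sum': 1}}},
        ])
        # depending on the driver version, the output may be a dict with a 'result' list

That would replace both the JavaScript functions and the Python loop with a single pipeline, at the cost of grouping only on right (the answer != '' filter would need an extra stage).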

2 responses

  1. Anderson Cardoso wrote:

    Nice post, thanks =]

  2. Dave Brondsema wrote:

    Map/reduce (or any other javascript) in mongo queries will probably grab a lock. See http://www.mongodb.org/display/DOCS/How+does+concurrency+work#Howdoesconcurrencywork-OnJavascript. So performance testing should be on a loaded active server, not with single standalone queries. And I’d recommend against it, it often is problematic due to the locks.