Should You Use GROUP BY….ORDER BY NULL By Default?

Sheeri

1

Edited to add: Thanks to Roland Bouman for pointing out Bug 30477 created Aug 2007 that addresses this issue. I am glad I am not the only one who thinks implicit overhead is bad!

At Northeast PHP a few weeks ago, an audience member came up to me after my talk about indexing and asked about ORDER BY NULL for optimal queries. I have to say, I was surprised, as I had not heard about using ORDER BY NULL. In a nutshell, apparently when MySQL does a GROUP BY, there is an implicit ORDER BY the same fields, which adds extra overhead for the mere purpose of returning the values in order of the GROUP BY fields.

I knew about the implicit ORDER BY, but I thought that was required for the GROUP BY, and made the GROUP BY faster. After all, it’s easier to group like items together if they are already sorted, right?

However, every single source I have researched seems to imply that, no, it is just overhead and completely unnecessary unless you really do want the results returned in the same order as the GROUP BY. For example, if you query with GROUP BY last_name and do not care about having the rows returned in lexical order of last names, you would use GROUP BY last_name ORDER BY NULL.

This leads me to believe that by default, whenever doing a GROUP BY, it is a good idea to use ORDER BY NULL. I have not seen that as a piece of advice that is generally given out in talks, either. So maybe there is something I am not understanding properly? I would love to know what everyone thinks.

Here is the result of my research:

The manual at http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html says:

By default, MySQL sorts all GROUP BY col1, col2, … queries as if you specified ORDER BY col1, col2, … in the query as well. If you include an ORDER BY clause explicitly that contains the same column list, MySQL optimizes it away without any speed penalty, although the sorting still occurs. If a query includes GROUP BY but you want to avoid the overhead of sorting the result, you can suppress sorting by specifying ORDER BY NULL.

High Performance MySQL, 3rd Edition says the following on page 246, in the section on “Optimizing GROUP BY and DISTINCT”:

MySQL automatically orders grouped queries by the columns in the GROUP BY clause, unless you specify an ORDER BY clause explicitly. If you don’t care about the order and you see this causing a filesort, you can use ORDER BY NULL to skip the automatic sort. You can also add an optional DESC or ASC keyword right after the GROUP BY clause to order the results in the desired direction by the clause’s columns.

A random blog post extols the joy of “no more filesorts” in his EXPLAINs at http://www.subelsky.com/2008/05/order-by-null-kills-mysql-filesorts.html.

My question is – why wait until you see a “Using filesort” in your EXPLAIN plan? If the overhead is only used to order the results, and that is not desired, why not just use ORDER BY NULL by default, whenever using a GROUP BY query?

Or is it possible that yes, whenever developers write GROUP BY, they probably want the results returned in the order of the ORDER BY?

One response

  1. And wrote on :

    Sound like a mysql quirk. I would guess that mysql initially implemented “group by” using a sort, so results came back sorted, then people started assuming that this would be the case and build applications that depended on this, and then mysql couldn’t change this behavior without breaking the applications, and now they have given a explicit way of hinting that ordering is not needed. Something similar seems to have been the case with the ordering of for…in in javascript (i.e. that ordering is undefined, but the implementation sorts and therefor users start depend on it: http://stackoverflow.com/q/280713).

    I guess the problem is that you could sometimes get better grouping performance by putting the elements in a hash table of lists. The runtime of that would be O(n), whereas the runtime of a sort would be O(N*log(N)). But if the results need to be ordered by the key, you could still use the hashing-method and just order the keys of the hash-table, so if there are few groups, there will be few elements to sort and the overhead will be negligible. But in practice either method might be faster and it is up to the optimizer to guess what to do. (in some other database the “order by null”, might be interpreted as “order by null, ” so you get a unneeded ordering if you move your app)