You are here

Using GROUP BY WITH ROLLUP for Reporting Performance Optimization

MySQL Performance Blog - Mon, 17/09/2007 - 4:35pm

Quite typical query for reporting applications is to find top X values. If you analyze Web Site logs you would look at most popular web pages or search engine keywords which bring you most of the traffic. If you're looking at ecommerce reporting you may be interested in best selling product or top sales people. This information may often need simple select query, however what if you would like to show percents not just absolute value ?

For illustration purposes I've created a syntetic table filled with some 30mil rows evenly spread in 10.000 groups.

PLAIN TEXT SQL:
  1. CREATE TABLE `dt` (
  2.   `grp` int(10) UNSIGNED NOT NULL,
  3.   `slack` varchar(50) DEFAULT NULL
  4. ) ENGINE=MyISAM DEFAULT CHARSET=latin1

And I'm using some silly like query to illustrate some search is required, so count(*) can't be optimized away for MyISAM Tables.

To show percents for the values we need to know total number of rows which matches our where clause.
Most Simple Way Number One is to simply run two queries:

PLAIN TEXT SQL:
  1. mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp ORDER BY cnt DESC LIMIT 10;
  2. +------+-----+
  3. | grp  | cnt |
  4. +------+-----+
  5. | 9879 | 300 |
  6. | 3888 | 298 |
  7. | 1793 | 297 |
  8. | 2082 | 294 |
  9. | 9729 | 293 |
  10. | 3760 | 292 |
  11. | 2588 | 290 |
  12. | 7918 | 290 |
  13. | 4055 | 290 |
  14. | 6950 | 290 |
  15. +------+-----+
  16. 10 rows IN SET (21.12 sec)
  17.  
  18. mysql> SELECT count(*) cnt FROM dt WHERE slack LIKE "a%";
  19. +---------+
  20. | cnt     |
  21. +---------+
  22. | 2352996 |
  23. +---------+
  24. 1 row IN SET (18.71 sec)

This allows us to get results in about 40 seconds and as we can see is pretty inefficient - almost half of the time is spent counting the rows which are already traversed and counted for group by operation.

The obvious optimization is to get rid of LIMIT 10 and just fetch all groups and sum values on the application side. Which is the second way.

PLAIN TEXT SQL:
  1. mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp ORDER BY cnt DESC;
  2. +-------+-----+
  3. | grp   | cnt |
  4. +-------+-----+
  5. |  9879 | 300 |
  6. |  3888 | 298 |
  7. |  1793 | 297 |
  8. ...
  9. |  7195 | 180 |
  10. |  6703 | 178 |
  11. | 10000 | 107 |
  12. |     0 | 105 |
  13. +-------+-----+
  14. 10001 rows IN SET (21.30 sec)

We even can got read of sorting and add ORDER BY NULL so MySQL does not bother to sort results, though this has a little difference in this case as number of groups is relatively small:

PLAIN TEXT SQL:
  1. mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp ORDER BY NULL;   
  2. +-------+-----+
  3. | grp   | cnt |
  4. +-------+-----+
  5. |  2257 | 251 |
  6. |  2418 | 266 |
  7. |  5842 | 258 |
  8. |    90 | 276 |
  9. ...
  10. |  2595 | 243 |
  11. |  4446 | 239 |
  12. |  9802 | 233 |
  13. |  1861 | 227 |
  14. +-------+-----+
  15. 10001 rows IN SET (21.39 sec)

This method indeed works great if you have relatively small number of groups which you can fetch on the application side (you can store result in temporary table and run sum() and sort query on that table instead if amount of groups is much larger).

The other idea came to my mind is using GROUP BY WITH ROLLUP so I can get grouped result set together with total value for
the groups:

PLAIN TEXT SQL:
  1. mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup ORDER BY cnt DESC LIMIT 11;
  2. ERROR 1221 (HY000): Incorrect usage of CUBE/ROLLUP AND ORDER BY

Oops. Bad luck - for some reason you can't use order by together with ROLLUP. This does not make much sense to me and I find it quite inconvenient whenever it is MySQL limitation or SQL Standard. This would be extremely useful feature and I do not see good technical reaso not allowing it either.

OK Lets see how fast it is without order by (and limit):

PLAIN TEXT SQL:
  1. mysql> SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup;                 
  2. +-------+---------+
  3. | grp   | cnt     |
  4. +-------+---------+
  5. |     0 |     105 |
  6. |     1 |     259 |
  7. |     2 |     200 |
  8. ...
  9. |  9999 |     235 |
  10. | 10000 |     107 |
  11. |       | 2352996 |
  12. +-------+---------+
  13. 10002 rows IN SET (28.79 sec)

As you can see there is considerable penalty associalted with GROUP BY WITH ROLLUP in MySQL - it is significantly slower than standard group by even though the only thing it needs to do is maintain an extra count.

So if MySQL does not allow us to use ORDER BY together with GROUP BY WITH ROLLUP can we still do something to get the result set we want from single query ? Sure. Here is Way Number Three:

PLAIN TEXT SQL:
  1. mysql> SELECT * FROM (SELECT grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup) t ORDER BY cnt DESC LIMIT 11;
  2. +------+---------+
  3. | grp  | cnt     |
  4. +------+---------+
  5. | NULL | 2352996 |
  6. | 9879 |     300 |
  7. | 3888 |     298 |
  8. | 1793 |     297 |
  9. | 2082 |     294 |
  10. | 9729 |     293 |
  11. | 3760 |     292 |
  12. | 7918 |     290 |
  13. | 4055 |     290 |
  14. | 3749 |     290 |
  15. | 6950 |     290 |
  16. +------+---------+
  17. 11 rows IN SET (29.68 sec)

Use of extra temporary table for buffering helps us to get result set we're looking for and workaround this limitation, though it of course comes with performance penalty.

This approach is however still faster than running 2 queries completing in 30 seconds rather than 40. Though fetching all result set in this case is still significantly faster. So GROUP BY WITH Rollup is not the killer solution for this problem, while it well could be if well implemented inside MySQL.

Why am I looking on reporting performance optimization ? It is for ClickAider project which is growing rapidly and use of MySQL for reporting quite expectedly starts to become the bottleneck. We use a lot of other black magic to keep things as optimal as possible though performance is still far from our goal of real time reporting on 10.000.000+ recorded events.

UPDATE:
Looking a bit further into it I found why GROUP BY WITH ROLLUP is slower and why ORDER BY does not work with it. The thing is - it is using filesort as group by execution method, not temporary table as ordinary GROUP BY:

PLAIN TEXT SQL:
  1. mysql> EXPLAIN SELECT  grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp  \G
  2. *************************** 1. row ***************************
  3.            id: 1
  4.   select_type: SIMPLE
  5.         TABLE: dt
  6.          type: ALL
  7. possible_keys: NULL
  8.           KEY: NULL
  9.       key_len: NULL
  10.           ref: NULL
  11.          rows: 37748736
  12.         Extra: USING WHERE; USING TEMPORARY; USING filesort
  13. 1 row IN SET (0.01 sec)
  14.  
  15. mysql> EXPLAIN SELECT  grp, count(*) cnt FROM dt WHERE slack LIKE "a%" GROUP BY grp WITH rollup  \G
  16. *************************** 1. row ***************************
  17.            id: 1
  18.   select_type: SIMPLE
  19.         TABLE: dt
  20.          type: ALL
  21. possible_keys: NULL
  22.           KEY: NULL
  23.       key_len: NULL
  24.           ref: NULL
  25.          rows: 37748736
  26.         Extra: USING WHERE; USING filesort
  27. 1 row IN SET (0.00 sec)

As I forced FileSort execution method for GROUP BY by using SQL_BIG_RESULT hint I can see GROUP BY executing about same time as GROUP BY WITH ROLLUP.