You are here

Evaluating IO subsystem performance for MySQL Needs

MySQL Performance Blog - Wed, 05/03/2008 - 7:55am

I'm often asked how one can evaluate IO subsystem (Hard drive RAID or SAN) performance for MySQL needs so I've decided to write some simple steps you can take to get a good feeling about it, it is not perfect but usually can tell you quite a lot of what you should expect from the system.

What I usually look for MySQL is performance in random reads and random writes. Sequential reads and writes are rarely the problem for OLTP workloads, so we will not look at them.

I also prefer to look at performance with O_DIRECT flag set to bypass OS cache. This may execute separate code path in kernel and so has a bit different performance pattern compared to buffered IO (even followed by fsync regularly) , but it allows to easily bypass OS cache both for reads and for writes and so does not require creating large working sets for boxes with significant amounts of memory (or reducing amount of usable memory).

The system I'm testing this on has 256MB BBU (Battery Backed up Cache) on RAID controller so we will test two workloads size - first one is small which fits in the cache and the second one will be large enough so it does not. This allows us to see both by RAID cache and uncached IO performance.

We're interested in cached reads because they may show latency to the RAID cache if RAID read cache is enabled. If it is disabled you may be reading from Drives cache or even drives themselves which will affect performance significantly. In any case especially playing with data size a bit you will well learn how your cache behaves.

Were even more so interested about writes which show us how many cached writes per second we can do - this is important for log writes, which are synchronous IO operation database performance which typically have rather close data locality.

We're also interested in uncached reads and writes because this correspond to general database workload.

It is worth to run the test with 1 thread and with some higher number (say 64) to see how things scale.

The tool I'm using for this is SysBench which was designed by my team when I still worked for MySQL and we specially implemented bunch of tests to ease hardware evaluation for things which are important for MySQL.

To prepare small 128MB single file working set we can use the following command:

PLAIN TEXT CODE:
  1. [root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000  --file-num=1 --file-extra-flags=direct --file-fsync-freq=0  --file-total-size=128M prepare

And when we can run the test:

PLAIN TEXT CODE:
  1. ./sysbench --test=fileio --max-time=60 --max-requests=1000000  --file-num=1 --file-extra-flags=direct --file-fsync-freq=0  --file-total-size=128M --file-test-mode=rndrd run
  2. sysbench v0.4.8:  multi-threaded system evaluation benchmark
  3.  
  4. Running the test with following options:
  5. Number of threads: 1
  6.  
  7. Extra file open flags: 16384
  8. 1 files, 128Mb each
  9. 128Mb total file size
  10. Block size 16Kb
  11. Number of random requests for random IO: 1000000
  12. Read/Write ratio for combined random IO test: 1.50
  13. Calling fsync() at the end of test, Enabled.
  14. Using synchronous I/O mode
  15. Doing random read test
  16. Threads started!
  17. Time limit exceeded, exiting...
  18. Done.
  19.  
  20. Operations performed:  773835 Read, 0 Write, 0 Other = 773835 Total
  21. Read 11.808Gb  Written 0b  Total transferred 11.808Gb  (201.52Mb/sec)
  22. 12897.22 Requests/sec executed
  23.  
  24. Test execution summary:
  25.     total time:                          60.0001s
  26.     total number of events:              773835
  27.     total time taken by event execution: 59.0285
  28.     per-request statistics:
  29.          min:                            0.0001s
  30.          avg:                            0.0001s
  31.          max:                            0.0086s
  32.          approx.  95 percentile:         0.0001s
  33.  
  34. Threads fairness:
  35.     events (avg/stddev):           773835.0000/0.00
  36.     execution time (avg/stddev):   59.0285/0.00

The output is pretty verbose so it is quite scary for many people, however there is basically one number important here 12897.22 req/sec - so the RAID cache on this drive can do about 13.000 16K reads/sec from the cache, not bad at all.

With 64 threads I get 23727.46 req/sec which shows even in such cached mode you can get better performance by having many outstanding requests.

Running with more than one threads you may also take a look at another table SysBench displays:

PLAIN TEXT CODE:
  1. per-request statistics:
  2.          min:                            0.0008s
  3.          avg:                            0.0027s
  4.          max:                            0.0112s
  5.          approx.  95 percentile:         0.0030s

So we get about 3ms 95 percentile request time - not bad - all request were executed with pretty uniform performance.

Let us now do the test with single thread and write-through RAID cache (as if there would not be any BBU)

PLAIN TEXT CODE:
  1. [root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000  --file-num=1 --file-extra-flags=direct --file-fsync-freq=0  --file-total-size=128M --num-threads=1 --file-test-mode=rndwr run
  2. sysbench v0.4.8:  multi-threaded system evaluation benchmark

Auch... We get 142.62 Requests/sec which is extremely poor considering this is 8 drives array. You would likely get a bit better with sequential log writes but it is not going to be much higher.

Lets see what we get with 64 threads... hm again 142.83 Requests/sec and looking at VMSTAT I can see only 1 blocked process all the time which does not sounds right.

Honestly I did not expect this result while running benchmarks for this articles (and this is always very exciting to find something unexpected while doing Benchmarks). Fortunately I've already seen this some time ago though I hoped this issue is long fixed since that time... Though this box is running CentOS4 which is not the newest OS out where.

The problem is basically in case of O_DIRECT writes there is only one IO can be happening per file at any given time. To show this is the problem let us create working set consisting out of number of files and repeat the run:

PLAIN TEXT CODE:
  1. [root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000  --file-num=128 --file-extra-flags=direct --file-fsync-freq=0  --file-total-size=128M --num-threads=64 --file-test-mode=rndwr cleanup
  2. sysbench v0.4.8:  multi-threaded system evaluation benchmark
  3.  
  4. Removing test files...
  5. [root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000  --file-num=128 --file-extra-flags=direct --file-fsync-freq=0  --file-total-size=128M --num-threads=64 --file-test-mode=rndwr prepare
  6. sysbench v0.4.8:  multi-threaded system evaluation benchmark
  7.  
  8. 128 files, 1024Kb each, 128Mb total
  9. Creating files for the test...
  10. [root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000  --file-num=128 --file-extra-flags=direct --file-fsync-freq=0  --file-total-size=128M --num-threads=64 --file-test-mode=rndwr run

Now we get 827.94 Requests/sec requests per second, which is not bad considering this is RAID5 volume.

Let us now repeat the tests for WriteBack cache configuration:

We get 4735.27 Requests/sec from 1 thread and single file; 4740.19 Requests/sec from 64 threads and 1 file and 5228.68 Requests/sec with 64 threads and 128 files.

As you can see numbers are a lot better especially when there is single thread doing IO (just as with database log). We can see serialization with O_DIRECT however the impact is much less in this case compared to Write Through mode because we're mostly limited by RAID controller capacity.

Let us now do test with larger test set. We'll use 2GB in 128 files (because we already discovered synchronization issue with O_DIRECT). This sounds small but as controller cache is just 256M it should be good enough.

PLAIN TEXT CODE:
  1. [root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000  --file-num=1 --file-extra-flags=direct --file-fsync-freq=0  --file-total-size=128M --num-threads=64 --file-test-mode=rndwr cleanup
  2. sysbench v0.4.8:  multi-threaded system evaluation benchmark
  3.  
  4. Removing test files...
  5. [root@DB10 sysbench]# ./sysbench --test=fileio --max-time=60 --max-requests=1000000  --file-num=128 --file-extra-flags=direct --file-fsync-freq=0  --file-total-size=2G --num-threads=1 --file-test-mode=rndwr prepare
  6. sysbench v0.4.8:  multi-threaded system evaluation benchmark

There is also one trick running such larger scale benchmarks. SysBench by default has repeatable sequence of requests, because it does not initialize random generator. If you're doing multiple runs make sure to add --init-rng=1 option otherwise results can be skewed.

So we do run as:

PLAIN TEXT CODE:
  1. [root@DB10 sysbench]# ./sysbench --init-rng=1 --test=fileio --max-time=60 --max-requests=1000000  --file-num=128 --file-extra-flags=direct --file-fsync-freq=0  --file-total-size=2G --num-threads=1 --file-test-mode=rndrd run

For single thread we get 259.28 Requests/sec which approximately matches what single drive can do. This is important lesson - even if you have many hard drives you can't really get advantage of them unless you can get multiple outstanding requests.

Performance with 64 threads was 2334.63 Requests/sec which is perfect scalability to 6 hard drive this array has. We have even a bit more than 8x because 64 outstanding requests allow RAID controller to optimize seeks.

Lets us see how writes are doing:

For writes we're getting 583.43 Requests/sec for 1 thread and 605.88 Requests/sec for 64 threads.

As you can see we got more writes for single thread than we had reads and also using multiple threads for writes did not improve performance as dramatically as for reads. This is because as we have BBU writes are just stored in controller cache and flushed to the disk in background - in this case there is little difference if we use one thread or multiple threads.

As you also can see BBU is helpful even for random writes if they are performed from single thread - so it does not only help database log writes but has much larger positive impact on workload.

I should also note it is very good idea to watch VMSTAT while doing the test. Here is the snipped from good VMSTAT from the read test:

PLAIN TEXT CODE:
  1. 0 64  90560 3196492  86832 12631072    0    0 37760    23 3402  6932  0 12 64 24
  2.  0 64  90560 3196500  86836 12631068    0    0 37222    26 3328  7694  0  1 75 24
  3.  0 64  90560 3196436  86836 12631068    0    0 36998     6 3340  7716  0  1 74 24
  4.  0 64  90560 3196428  86840 12631064    0    0 37395    12 3320  7774  0  1 74 24
  5.  0 64  90560 3196436  86840 12631064    0    0 37498     6 3365  7758  0  1 74 24
  6.  0 64  90560 3196436  86840 12631064    0    0 37702     6 3370  7680  0  2 74 24

As you can see all 64 threads are waiting for IO. Now lest see our write test which seems to have some serialization issues

PLAIN TEXT CODE:
  1. 2 39  90560 5305556  86920 10534204    0    0     0  9345 1625  4587  0  0 70 30
  2.  1  1  90560 5305548  86920 10534204    0    0     0  9017 1661  4934  0  0 74 26
  3.  1  1  90560 5305556  86924 10534200    0    0     0  9223 1638  4881  0  0 73 27
  4.  1  3  90560 5305556  86932 10534192    0    0     0  9634 1715  5033  0  0 74 26
  5.  1  1  90560 5305556  86932 10534192    0    0     0  9235 1643  5035  0  0 73 26

You can get similar information by analyzing iostat numbers and that well may be more accurate. Though looking at VMSTAT is often good enough to see the problem.

P.S If you're wondering which hardware I used for benchmarks - it is Dell PowerEdge 2950 with PERC5 and 6 10K RPM hard drives in RAID5.

Entry posted by peter | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks