Today we announce release 1.0.2-2 of our XtraDB storage engine.
Here is a list of enhancements:
The patch splits global InnoDB buffer_pool mutex into several and eliminates waitings on flush IO and mutex when there is no enough free buffers. It helps if you have performance drops when data does not fit in memory.
InnoDB has a concurrent transaction limit of 1024 because in the standard InnoDB the number of undo slots is fixed value. This patch expands the maximum number of undo slots to 4072 and allows better utilizing modern hardware. (Thank SmugMug for this feature!)
ATTENTION: If the option was enabled and the expanded slots are used, you cannot use the datafile with system tablespace for the software which assume 1024 slots (e.g. the other mysqld or innodb hotbackup). See documentation for details.
Percona XtraDB 1.0.2-2 available :
XtraDB is compatible with existing InnoDB tables (Warning: unless you used innodb_extra_undoslots ) and we are going to keep compatibility in further releases. We are open for features requests for new engine and ready to accept community patches. You can monitor Percona’s current tasks and further plans on the Percona XtraDB Launchpad project. You can also request features and report bugs there. Also we have setup two maillists for General discussions and for Development related questions.
Entry posted by EvgenyS | One comment
One of our customers gave me a chance to run some benchmarks on 24-core (intel cpu based) server, and I could not miss it and ran few CPU-bound tasks there.
The goal of benchmarks was investigation of InnoDB-plugin and XtraDB scalability in CPU-bound load.
CPU specification:
PLAIN TEXT CODE:I tested MySQL-5.1.30 with InnoDB plugin, Xtradb-1.0.2-1, and XtraDB-1.0.2-2 (rel2). XtraDB-rel2 has not been released yet, we still are doing internal QA for, but it will be ready soon. Main difference XtraDB-rel2 it contains additional scalability fixes for buffer_pool (split_buffer_pool_mutex patch).
my.cnf is
PLAIN TEXT CODE:At first I tried sysbench oltp read-only with 10mil rows (the datasize is about 2.5GB), uniform distribution.
The results you can see there:

As you can see all engines scales pretty well with scaling factor 1.5 - 1.9 (by scale factor I mean relation TPS_at_N_threads / TPS_at(N-1)threads) , and we may stop and be happy regarding InnoDB scalability, but in the next run I tested sysbench oltp read-write load, and the results are:
Here starting 16 threads the result is dropping down with the same speed as it grew, and with 128 connections we have the same TPS as with 1 connection (and it is on 24-cores box!). XtraDB is slightly better there than InnoDB, but nothing special to be proud of. We definitely we will look how to fix it as next step, read detailed investigation what is the reason of performance drop in next post.
And last one workload I tried is TPCC-like benchmark (you can get it on https://launchpad.net/perconatools), with 100 Warehouses (about 9.5GB datasize).
The result is:
Here the result grows up to 16 connections, but after that InnoDB-plugin is dropping down. XtraDB and XtraDB-rel2 seem quite better, I guess this is mostly because fixes to rw_locks and to buffer_pool mutex (in rel2).
Conclusion: As read-only workload seems fine, read-write cases is something to worry about in 16+ cores boxes.
Intel Based 24 Core Servers are not mainstream these days but as number of cores is increasing now at the same pace as CPU frequency before we believe they are very soon to come. Also in real production there may be not a lot database fits "in-memory" cases - but on other hand 64-128GB RAM per box is not something rare already and recommendation to fit at least active dataset in memory is one we use for our customers.
Entry posted by Vadim | No comment
As you may know Sun / MySQL made release of community 5.0.75 only as source code release. We made binaries with our patchset.
Patches are mostly equal to build 10
http://www.mysqlperformanceblog.com/2008/12/11/mysql-binaries-percona-build10/, only change:
it includes bugfix for https://bugs.launchpad.net/percona-patches/+bug/308849
You can download binaries (RPMS x86_64) and sources with patches here
http://www.percona.com/mysql/5.0.75-b11/
Entry posted by Vadim | No comment
We have a lot of customers who do click analysis, site analytics, search engine marketing, online advertising, user behavior analysis, and many similar types of work. The first thing these have in common is that they're generally some kind of loggable event.
The next characteristic of a lot of these systems (real or planned) is the desire for "real-time" analysis. Our customers often want their systems to provide the freshest data to their own clients, with no delays.
Finally, the analysis is usually multi-dimensional. The typical user wants to be able to generate summaries and reports in many different ways on demand, often to support the functionality of the application as well as to provide reports to their clients. Clicks by day, by customer, top ads by clicks, top ads by click-through ratio, and so on for dozens of different types of slicing and dicing.
And as a result, one of the most common questions we hear is how to build high-performance systems to do this work. Let's see some ways you can build the functionality you need and get the performance you need. Because I've built two such systems to manage online ads through Google Adwords, Yahoo, MSN and others, it's easy and familiar for me to use the example of search engine marketing. I'll do that throughout this article.
Requirements
The words "need" and "want" are different. Do you really need atomic-level data? Do you really need real-time reporting? If you do, the problem is much more expensive to solve.
Start with the granularity of your data. What data do you need to make your business run? If you can't get access to the time of day of every click on every ad, will it hamper your ability to measure the ad's value? Is it enough to know how many times the ad was clicked each day? If so, you can roll all those events up into a per-day table.
Next, let's look at "real-time." None of the big three (Google, Yahoo, MSN) provides real-time reporting last time I was involved with them (and I suspect this is still true). It's too expensive. Consider your user expectations. For most applications I've been involved with, having day-old data is adequate, and users don't expect realtime. The trick here is that when you start out, realtime is possible because your data is small. "Hey, we do realtime reporting. Google doesn't even do that! We're better!" Then you get popular :) And if you've promoted your better-ness in the meantime, you might have to do some awkward backpedaling with customers, who now expect realtime data. The database giveth, and the database taketh away.
Finally, you should think a lot about how you need to query the data. It is a hard question to answer, and sometimes I've seen it evolve over time, especially as the growing data size forces it to. This goes back to what data you really need to make your business run. Anything else is gravy. If there are nice-to-haves, consider not building them in. Listen to some talks by 37Signals if you need inspiration to toss things out. Define the types of queries you absolutely have to have, if possible, and note the ways and types of aggregation (by-ad by-day, for example).
Sometimes I ask a customer "what kinds of queries do you have to run?" and they say "we can't decide, so we want to just store everything." If you can't decide yet, then don't store everything in the database. Instead, store the source data in some fashion that you can reload later, such as flat files, and build support in the database for one or two capabilities you absolutely need now; then add the rest later, reloading the data if needed.
Aggregate
Aggregation is absolutely key for most people. There are special cases, and there are ways to do general-purpose work without aggregating (see the section below on technologies), but if you're doing this with vanilla MySQL, you will need to aggregate your data.
What you want to do is aggregate in ways that optimize the most expensive things you'll do. And then, you might super-aggregate too. For example, if you aggregate by day and then you do a lot of queries over 365-day ranges for year-over-year analysis, aggregate again by month. Then write your queries to use the most aggregated data possible to save work.
Avoid operations that update huge chunks of aggregated data at once. Among other things, you'll make replication lag badly. More about this later.
Another way to say "aggregate" is to say "pre-compute." If you have time-critical queries for your app to do its work, can you do the work ahead of time so it's ready to get when needed? This might or might not be aggregation.
Denormalize
Pre-computing and careful denormalization need to go together. Figure out what other types of data you'll need in those aggregate tables, and include columns to support these queries. But beware of denormalizing with character data; try to make your rows fixed-length.
One reason denormalization is important is that nested-loop joins on large data sets are very expensive. If MySQL supported sort-merge or hash joins, you'd have other possibilities, but it doesn't, so you want to build your aggregate tables to avoid joins.
Watch Data Types
Does your ad ID look like "8a4dabde-1c82-102c-ab13-0019b984eacd" and is it stored in a VARCHAR(36)? When tables get big, every byte matters a lot. Use the smallest data types you can, the simplest character sets you can, and watch out for NULLable columns. Use smallint unsigned or tinyint unsigned if you can. You can save very large amounts of space. Choose primary keys very carefully, especially with InnoDB tables -- don't use GUIDs. Which brings me to my next point:
Use InnoDB
Assuming that you will use the stock MySQL server, InnoDB is usually your best bet. (Actually, XtraDB might be very interesting for you, but I digress). Due to the cost of repairing huge MyISAM tables and taking downtime, I would not use MyISAM for anything but read-only tables when things get big. And even if it's read-only, there's still another reason to use InnoDB/XtraDB tables...
Optimize For I/O
It is pretty much inevitable: if you do this kind of data processing in MySQL, you're going to end up heavily I/O bound. Listen to any of the talks at past MySQL conferences from people who have built systems like yours, and there's a fair chance they will talk about how hard they have to work on I/O capacity.
What does this have to do with InnoDB? Data clustering. InnoDB's primary keys define the physical order rows are stored in. That lets you choose which rows are stored close to each other, which is very beneficial in many cases. Especially on huge tables, it lets you scan portions of a table instead of the whole table if you a) choose your aggregation to match the order of your common queries and b) choose your primary key correctly.
Let's go back to the ad-by-day table. If you query date ranges most of the time, you should define the primary key as (day, ad). Don't use an auto-increment primary key, and don't put ad first. If you put ad first, then you're going to scan the whole table to query for information about yesterday. If you put day first, then yesterday will all be stored physically together (within the page -- the pages themselves may be widely separated, but that's another matter).
Don't Store Non-Aggregated Data
I've been talking a lot about aggregated data. What do you do with the non-aggregated data? My answer is usually simple: just don't store it in the database. Instead, pre-aggregate. Suppose your data is coming from some Apache log or similar source. Write a script to rip through the file and parse it 10k lines at a time, aggregating as it goes. When each chunk is done, make it write out a CSV file and import that with LOAD DATA INFILE. Keep those big fat log files out of the database. The database is usually the most expensive and hardest-to-scale component in your system -- don't waste resources.
Another benefit of this is the chance to parallelize. As you know, MySQL doesn't do intra-query parallelization, so ETL jobs written to rely on SQL tend to get really bogged down. In contrast, moving the processing outside the database lets you parallelize trivially.
If you need to analyze the non-aggregated data, you can store it on the filesystem and write custom scripts to do special-purpose tasks on it. Storing a little meta-data about each file can help a lot. Store the ranges of values for various attributes, for example; or the presence or absence of values. You can put these into the database in a little meta-table. Then your script can figure out which files it can ignore. What we're doing here starts to look like a hillbilly version of Infobright, which I'll talk about later.
Alternately, you can store the atomic data as CSV files and use the CSV engine so you have an SQL interface to it (the meta-tables are still a valid approach here!). This is an easy way to bypass the hard-to-scale database server for the initial insertion, because you can write CSV files with any programming language. Naturally, CSV files don't store as compactly on disk as [Compressed] MyISAM or Archive.
These are just some ideas I'm throwing around -- the point is to think outside the box, even to think of things that seem "less advanced" than using a database.
Sharding and Partitioning
Sharding is inevitable if your write workload exceeds the capacity of a single server (or if you're using replication, the capacity of a single slave). Sharding can also help you avoid massive tables that are too big to maintain. If you know you'll get there, it can change the lifecycle of your application in advance.
What about partitioning in MySQL 5.1? I know there are some cases when it can help a lot, and we've proven that with our customers. But you still have to think about how to avoid enormous tables that are hard to maintain, back up, and restore. And the partitioning functionality is not done yet and not fully integrated into the server, so I expect to find a lot more bugs and annoyances. There are already inconvenient limitations on some key parts of partitioning, such as maintenance and repair commands, that essentially negate the benefits of partitioning for those operations. An finally, it doesn't save you from the downtime caused by ALTER TABLE -- a typical reason to think about master-master with failover and failback for maintenance. As with anything, it's a cost-benefit equation. What are your priorities? Choose the solution that meets them.
Be Careful With Data Integrity
When you're storing several levels of aggregation, and there's denormalization, you need to be scrupulous about data cleanliness, because it's really hard to fix things up later. If your data is coming from a partner site, and you upload bad data there, you'll be getting bad data back for a long time. And every time you have some incremental job to update the aggregates, you're exposed to that bad data again.
Any inconsistencies in the atomic data tend to get magnified as it gets aggregated, because you suddenly have a single row created from many rows, and if the many rows don't match completely, the single one doesn't know what data should live in it. And this only gets harder to resolve as you get more levels of aggregations.
Watch Out For The Long Tail
People talk about the long tail and how you can focus on optimizing the short head. It's the classic 80-20 rule. Maybe 80% of your ad impressions are on 20% of your ads! Hooray! But don't forget that if you're aggregating per-day, an ad that gets a million impressions takes one row, and an ad that gets one impression takes exactly the same: one row. An impression per day becomes a fixed overhead of storage size. So, you actually have as many rows as you have unique ads per day. Viewed this way, suddenly you start to hate the ads that occasionally get an impression. They're so wasteful!
It's easy to flip back and forth between viewpoints on this and get distracted into making a mistake. Watch out when you do your capacity planning. Don't get fooled into calculating the wrong thing.
Be Creative With Table Structures
Suppose you have some yes/no fact about an ad impression, such as whether it was a blue ad (whatever that means.) You start out with this:
PLAIN TEXT SQL:What can we improve here? Especially assuming that there are indexes other than the primary key, we can shrink the primary key's width:
PLAIN TEXT SQL:There are a couple of ways to handle this now. You can have the clicks column record the total, and the blue_clicks column record only blue clicks; to find out non-blue clicks you subtract one from the other. Or you can have the blue clicks and non-blue clicks stored, and to get the totals you add them.
Did this gain us anything? We dropped one column, and we just moved those other values around to store them "next, to in the same row" instead of "below, in the next row." So we're storing all the same data, right?
Logically, yes; physically, no. Those values that we pivoted up beside their neighbors will share a set of primary key columns. And not only will every index be a little narrower, the table will now contain only half as many rows. That will make the indexes less than half the size. In real life this technique often makes the table+index much less than half the size. You have to write a little more complex queries, but that's often justified by a large reduction in table size.
I sort of stumbled upon this idea one day. I have no idea what this technique might be called, so I call it dog-earing the table (somehow the image of putting columns next to each other makes me think of putting cards next to each other and shoving).
Archive
If you don't need data anymore, move it away or get rid of it. I wrote a three-part article on data archiving on my own blog a while back. The benefits of purging and archiving data can be dramatic.
Take It Easy On Replication
Building aggregated tables is hard work for the database server. If you do it on the master with INSERT..SELECT queries, it will propagate to the slaves and it'll be hard work there too, assuming you use statement-based replication.
You can save that work by either using MySQL 5.1's row-based replication, or in MySQL 5.0 and earlier, doing the work on a slave, then piping the results back up to the master with LOAD DATA INFILE, which kind of emulates row-based replication in a way.
When you're updating big aggregate tables, don't work with giant chunks of them at once. If there's any possible way, do it in manageable bits. A day at a time, for example.
There are a lot of other ways you can make replication faster. I wrote a lot about this in our book, which is linked from the sidebar above.
Don't Assume Traditional Methods Will Save You
What you're really doing here is building a data warehouse. So you may think you should use traditional DW methods, like star schemas. The problem is that MySQL doesn't tend to perform well on a data warehousing workload. The nested-loop joins are not all that fast on big joins; the query optimizer can sometimes pick bad plans when you have a lot of joins between fact and dimension tables, and so on. With careful tweaking, many of these things can be overcome, but how much time do you have? And the gains are simply limited by some of MySQL's weaknesses in some cases.
Not only that, but star schemas are not intended to be fast. The star schema is essentially "I admit defeat and accept table scans as a fact of life." Table scans can be better than the alternative, if the alternatives are limited, but they're still not what you need unless you're okay with long queries that read a lot of rows -- MySQL can't handle too many of those at once.
Aside from star schemas, another tactic I see people try a lot is to build "flexible schemas" with tables that contain name-value pairs or something similar. The thought is that you can make the application believe it has a custom table, which is really constructed behind the scenes from the name-value tables in a complex query with many joins. I have never seen this approach scale well.
Use The Best Technologies You Can
MySQL is not the end-all and be-all. If you're familiar with it and it can serve you reasonably well, it's fine to use it for things that it's not 100% optimal for. But if the costs of doing that are going to outweigh the costs of using another solution, then look at other solutions.
One that holds promise is Infobright. While I have not evaluated their technology in depth, I think it merits a good look. I had the chance at OpenSQL Camp to talk to Alex Esterkin and see him present on it, and based on that exposure, I think they are doing a lot of things right. When I know enough to have a real opinion (or when other Percona people get to it before I do!) you'll see results on this blog.
Another is Kickfire -- also something I have not had a chance to properly evaluate. And there are others, and there will continue to be more. Finally, PostgreSQL is clearly better for some workloads out-of-the-box than MySQL is, especially for more complex queries. Percona is not tied to MySQL, although we're most famous for our knowledge about it. When another tool is the right one, we use it.
Have you thought about using something besides a database? You have your choice of buzzwords these days. Hadoop is a big one. But beware of falling into the trap of brute-forcing a solution that really needs to be solved with intelligent engineering, instead of massive resources.
Conclusion
This article has been an overview of some of the tactics I've used to successfully scale large click-processing and other types of event-analysis databases. In some cases I've been able to avoid sharding for a long time and run on many fewer disk drives with much less memory, or even with 10-15x fewer servers. Clever application design, and a holistic approach, are absolutely necessary. You can't look to the database to solve everything -- you have to give it all the help you can. Hopefully it's useful to you, too!
Entry posted by Baron Schwartz | No comment
When your goal is to optimize application performance it is very important to understand what goal do you really have. If you do not have a good understanding of the goal your performance optimization effort may well still bring its results but you may waste a lot of time before you reach same results as you would reach much sooner with focused approach.
The time is critical for many performance optimization tasks not only because of labor associated expenses but also because of the suffering - slow web site means your marketing budget is wasted, customer not completing purchases, users are leaving to competitors, all of this making the time truly critical matter.
So what can be the goal ? Generally I see there are 2 types of goals seen in practice. One is capacity goal this is when the system is generally overloaded so everything is slow, when you're just looking to see how you can get most out of your existing system, looking for consolidation or saving on infrastructure cost. If this is the goal you can perform general system performance evaluation and just fix the stuff which causes the most load on the system. MySQL Log analyzes with Mk-Log-Parser is a very good start for a ways to generally optimize MySQL load on the system.
Latency Goal is another breed. The system may not look loaded but some pages still may want to be loading much slower than you like. These goals are not system wise but they are much more specific to the different user interactions or even types of users. For example you may define goal also "Search pages have to have response time below 1 second in 95% cases and below 3 seconds in 99% cases". Note We're specific to the user interaction - people are used to Search taking longer time than other interactions for many applications, and also we speak about percentile response time rather than "all queries". It is surely good all search queries complete in one seconds but it is too not practical. The goal description may be more specific too - for example you may have different response time guidelines for pages which are requested for real humans vs search engine bots (which are often quite different in their access pattern) or you may define "large users" as users having more than 100.000 images uploaded and measure the response time for them specifically because this group has its own performance challenges.
Looking at Latency it is also much more practical to look from the top of the stack. If you look at MySQL log you may find some queries which are slow but it is hard to go back from them to what is really important for the user and so the business - the page response times. Furthermore. It is not enough in many cases to focus only on Server Side optimization - the Client Side Optimization is also quite important in particular for aggressive performance goals and fast back-end. This is why we added this service to Percona offerings.
If Server side or Client Side performance optimization is going to be more important for your application depends on the application performance a lot. The better your application is the more Client Side optimization you will need. For example if it takes you 30 seconds to generate the search results and 3 more seconds to load all style sheets images and render the page server side optimization is more important. If you have optimized things and now HTML takes 0.5 seconds to generates an extra 3 seconds become the main response time contributer which has the highest performance optimization potential.
But let us get back to the Server Side Optimization. Lets assume our performance goal applies to the HTML generation rather than full page load on the client. So meet our goal we should look at the pages which do not meet our goal, which is pages which take more than 1 second to generate in given example.
For goal driven performance optimization it is important there is enough instrumentation and production performance logging in place so you really can focus on hard data in your work. For small and medium size applications you can log all requests to MySQL table for larger ones you can log only small portion of them. I usually keep one table per day so it is easy to copy the data to a different box for data crunching and remove the old ones.
The log table should contain URL, IP and all the data you need to be able to repeat request if you need to. It may include cookie data, post data, logged in user information etc. But the real thing is number of times which are stored for request. wall clock time - is the real time it took to generate the page by server backend. CPU Time This is the CPU time needed to generate request (you can split it to user and system if you want) and when there come various wait times - mysql, memcache, sphinx, web services etc.
For web applications doing processing in a single thread the following simple formula applies wall_time=cpu_time+sum(wait_time)+lost_time The lost time is the time which was lost for some reason - some waits we did not profile or waits we do not have control of, for example when processing had to wait for CPU available to do processing. For multi-thread application it is a bit more complicated but you still can analyze critical path.
If you have such profiling in place all you have to do is to run the query to see what are contributing factors to the response time of the problematic pages:
PLAIN TEXT SQL:Why looking only at such pages is important ? This is because if you look at all pages rather than problematic subset it may lead you away from your goal. For example it is very possible among all pages we would see CPU usage as the main factor because sphinx and MySQL respond from cache.
We however see for pages which have the problem it is Sphinx which accounts for most of the time.
Looking at the data such way we have two great benefits. First we really understand what is the bottleneck. Second we know what performance gain potential is. For example in this case we could spend a lot of time optimizing PHP code but because it takes only 10% of response time in average even speeding it up 10 times we would not get more than 10% response time reduction. At the same time if we find a way to speed up Sphinx we can reduce response time to its half.
Note in this case there is some 16% of response time which is not accounted for. Large portion probably comes from memcache accesses which are not instrumented for this application. In this case this portion is not the biggest part yet but if we'd speed up Sphinx and MySQL dramatically we would have to go and look into better instrumentation so we can look inside this black box.
Once we know it is Sphinx which causes the problem we have to go and find what queries exactly are causing it - this can be done by adding request ID as comment to Sphinx log so you can profile it carefully or you can add tracing functionality to the application. All the same. Once you found the queries causing the problem you see the ones which cause the most impact and focus on optimizing them.
There are multiple ways to optimize something, my checklist is usually get rid of it, cache it, tune it, get more hardware in this order. It is often it is possible to get rid of some queries, cache them, tune them so they are faster (often at the same time changing semantics a bit) and if nothing helps or can be done quickly we can buy more hardware, assuming application can use it.
Once you've performed optimizations you can repeat analyzes again to see if performance goals are met and where is the bottleneck this time.
As a side note I should mention looking at performance statistics for the day overall is often not enough. Application performs as good as it performs during its worst times so it is very good to plot some graph over time. Sometimes an hour base may be enough but for large scale application I'd recommend to looking down to 5 minutes or even 1 minute intervals and making sure there are no hiccups.
Check the stats from the application above for example:
PLAIN TEXT SQL:As you can see in this case during certain hours the average type of bad queries skyrockets and it becomes 90% or so driven by Sphinx. This tells us there is some irregular activity (cron jobs?) is happening and it affects Sphinx layer significantly.
Such goal based from top to bottom approach is especially helpful for complex applications using mutliple components (like sphinx and MySQL) or multiple MySQL Servers because in these cases you often can't easily guess the component which needs attention. Though even for less complicated single MySQL server application there is often the question if it is MySQL server causing the problem or if application code needs to be optimized.
Entry posted by peter | No comment
Peter said me that previous results http://www.mysqlperformanceblog.com/2008/12/18/xtradb-benchmarks-15x-gain/ are too marketing, and we should show other results also.
That’s right, as I said - do not expect performance gain in CPU bound benchmarks (at least on 8 and less CPU cores boxes). There are results for the same workload but with innodb_buffer_pool_size=8G, that is all data fits into memory.
As you see difference is just nothing, about 2.5%, which can be an error in measurement. It actually would be interesting to test on 16 cores servers - let me know if you have free one
Entry posted by Vadim | No comment
I guess it is first reaction on new storage engine - show me benefits. So there is benchmark I made on one our servers. It is Dell 2950 with 8CPU cores and RAID10 on 6 disks with BBU, and 32GB RAM on board with CentOS 5.2 as OS. This is quite typical server we recommend to run MySQL on. What is important I used Noop IO scheduler, instead of default CFQ. Disclaimer: Please note you may not get similar benefits on less powerful servers, as most important fixes in XtraDB are related to multi-core and multi-disks utilization. Also results may be different if load is CPU bound.
I compared MySQL 5.1.30 trees - MySQL 5.1.30 with standard InnoDB, MySQL 5.1.30 with InnoDB-plugin-1.0.2 and MySQL 5.1.30 with XtraDB (all plugins statically compiled in MySQL)
For benchmarks I used scripts that emulate TPCC load and datasize 40W (about 4GB in size), 20 client connections. Please note I used innodb_buffer_pool_size = 2G and innodb_flush_method=O_DIRECT to emulate IO bound load.
InnoDB parameters:
PLAIN TEXT CODE:And for XtraDB I additionally used:
PLAIN TEXT CODE:So what is in result:
Result is in NOTPM (New Order Transactions Per Minute), more is better. As you see XtraDB is somewhat 1.5x better than InnoDB in standard 5.1.30 and even more than InnoDB-plugin-1.0.2
And there is CPU utilization for all tested engines:

As you see XtraDB also utilizes CPUs better.
Finally let me show you why I took NOOP IO scheduler instead of CFQ, there are result for XtraDB with both:

4X difference is just giant one. And it is important to remember as Linux kernels 2.6.18+ (which are used on CentOS / RedHat 5.2) are coming with CFQ scheduler as default.
So echo 'noop' > /sys/block/sda/queue/scheduler should be one of first things to do on new server (sure you also need to change kernel startup parameter to make it automatic after reboot).
Entry posted by Vadim | 3 comments
So lets say you have .frm file for the table and you need to recover CREATE TABLE statement for this table. In particular when we do Innodb Recovery we often get .frm files and some mess in the Innodb tablespace from which we have to get data from. Of course we could relay on old backups (and we do ask for them for a different reason anyway) but there is never guaranty there were no schema changes in between.
So how to recover CREATE TABLE from .frm file ?
Recovering from .frm for Innodb Table
If we simply copy .frm file back to the database we will see the following MySQL creative error message:
PLAIN TEXT SQL:With more elaborate details in error log:
081217 15:59:11 [ERROR] Cannot find or open table test/test_innodb from
the internal data dictionary of InnoDB though the .frm file for the
table exists. Maybe you have deleted and recreated InnoDB data
files but have forgotten to delete the corresponding .frm files
of InnoDB tables, or you have moved .frm files to another database?
or, the table contains indexes that this version of the engine
doesn't support.
See http://dev.mysql.com/doc/refman/5.1/en/innodb-troubleshooting.html
how you can resolve the problem.
I would much rather see MySQL to report some more reasonable error message, something like Storage Engine could not open table or something like it.
So what we can do is to make sure Innodb has something in its data dictionary so it allows MySQL to succeed displaying .frm file contents:
mysql> create table test_innodb(i int) engine=innodb;
Query OK, 0 rows affected (0.06 sec)
mysql> Aborted
[root@test3 test]# cp /tmp/test_innodb.frm .
cp: overwrite `./test_innodb.frm'? y
[root@test3 test]# mysql test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 17
Server version: 5.1.30-community-log MySQL Community Server (GPL)
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> show create table test_innodb;
+-------------+----------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------------+----------------------------------------------------------------------------------------------------------------------------+
| test_innodb | CREATE TABLE `test_innodb` (
`a` int(11) DEFAULT NULL,
`b` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------------+----------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Note: I have created the table which have a different definition from original table. It is good enough to get SHOW CREATE TABLE but do not try to use it any other way, as nasty things may happen:
PLAIN TEXT SQL:Recovering CREATE TABLE from .frm file for MyISAM Table
We do not really need this that frequently but I decided to cover this for completeness anyway.
With MyISAM table MySQL comes with another creative error message if .frm is the only file which exists:
PLAIN TEXT SQL:This is closer to the truth though file name is wrong - there should be test_myisam.MYI or test_myisam.MYD in the error message. The file with name "test_myisam" does not need to be exist.
The intuitive way to rebuild MyISAM table would be REPAIR TABLE test_myisam USE_FRM, however it does not work... just yet.
You need to create the .MYI and .MYD files for the table to make it work:
[root@test3 test]# touch test_myisam.MYI
[root@test3 test]# touch test_myisam.MYD
[root@test3 test]# chown mysql:mysql *
So it is not at all that complicated.
Entry posted by peter | No comment
How much space would empty MyISAM table take ? Probably 8K for .frm file, 1KB for .MYI file and 0 for MYD file. .MYI file can be larger if you have many indexes.
How much space will Innodb take:
PLAIN TEXT SQL:Check out files (using Innodb File Per Table)
-rw-rw---- 1 mysql mysql 8578 Dec 16 20:33 test_innodb.frm
-rw-rw---- 1 mysql mysql 98304 Dec 16 20:33 test_innodb.ibd
So we get about 100K and so about 10 times more for MyISAM. This is ignored space which needs to be allocated in main tablespace for Innodb data dictionary. But that one is pretty small.
This is the good reason to avoid having very small Innodb tables - they will take much more space than MyISAM.
So .ibd file we get in case of table having no indexes (besides clustered key) - takes 6*16K pages. I wonder why as much as 6 pages are required for start ?
If we add more indexes to this tables - each further index will take additional 16K page even if it contains no data. This is understandable - each index has to have at least one page allocated to it.
Now it is very interesting - SHOW TABLE STATUS does not seems to show everything:
PLAIN TEXT SQL:Such table's .idb file takes 128K from the start while we only see 16K of data+32K of index, so another 5 pages are invisible. This tells me you can't use this information to reliably identify space tables take on disk, especially for large number of very small Innodb tables.
Also note amount of free space - even though pages contain no data they are not considered free.
Avg_row_length is another field which may need an explanation. This value is computed by dividing Data_Length (exact number) by number of rows (estimated number) which means this value is going to be changing back and forth and it would be very inaccurate for small tables. For example it will show 16K as average row size for table with one row:
PLAIN TEXT SQL:Free Space for tables created in innodb_file_per_table mode is interesting question on its own.
As we populate table we will see Free space will remain at zero as Data_length is small:
PLAIN TEXT SQL:When at certain point you will see Innodb Free space to become non zero:
PLAIN TEXT SQL:And the file size also jumps significantly (to 9MB):
-rw-rw---- 1 mysql mysql 8578 Dec 16 20:58 test_innodb.frm
-rw-rw---- 1 mysql mysql 9437184 Dec 16 21:06 test_innodb.ibd
If you do the math here you can see there is only about 1MB out of 9MB seen as Index_Length+Data_Length while another 4MB are visible in the "Innodb Free Space" and almost 5MB more is not visible at all.
This tells you it is not about tables which contain couple of rows which can take a lot of space in Innodb. Tables showing as using 1MB of Innodb Data can also really be taking almost 10 times more on the disk.
It is not quite clear to me what is happening here. According to documentation each index should get 2 segments one for non-leaf an one for leaf pages. However the space allocation should happen page by page until whole 32 pages allocated. In the case above no single segment should require more than 32 pages so It is surprising why all of them take 5MB (because 4MB are free)
What is clear however is what if some pages from segment are allocated it goes in the interesting space regarding space reporting - it will be gone from the free space, while only pages actually allocated will be shown in Data_Length and Index_Length fields.
Doing more tests with Inserts I can see Innodb seems to always try to keep at least 4MB free in the tablespace - populating table with more and more data I see free space never falls below 4MB while data file on disk continues to grow.
Finally it is worth to note if you're using innodb_file_per_table the per table tablespaces are not going to grow by innodb_autoextend_increment - instead file will grow by 1MB to 4MB increments. There is a bug reported about it.
As a Summary I should not the following:
Entry posted by peter | No comment
Today we officially announce our new storage engine, "Percona XtraDB", which is based on the InnoDB storage engine. It's 100% backwards-compatible with standard InnoDB, so you can use it as a drop-in replacement in your current environment. It is designed to scale better on modern hardware, and includes a variety of other features useful in high performance environments.
Percona XtraDB includes all of InnoDB's ACID-compliant design and advanced MVCC architecture, and adds features, more tunability, more metrics, more scalability on many cores, and better memory usage. We choose features and fixes based on customer requests and on our best judgment of real-world needs. We have not included all the InnoDB patches available. For example Google's well-known InnoDB patch set is omitted (at least for now).
The first version of our new storage engine is 1.0.2-1, which is forked from InnoDB-plugin-1.0.2. Percona XtraDB is released under GPL v2, as is InnoDB-plugin base source code. Percona XtraDB is released only under the GPL v2 with no dual-licensing, and commercial support is available from Percona.
So what's new in the engine? Here is a list of new features and enhancements:
Documentation is on http://www.percona.com/docs/wiki/percona-xtradb:start.
Percona XtraDB available :
* in source code from Launchpad https://launchpad.net/percona-xtradb, the version 1.0.2-1 you can get as bzr branch lp:percona-xtradb -r tag:percona-xtradb-1.0.2-1 percona-xtradb-1.0.2-1
* as source code in tar.gz http://www.percona.com/mysql/5.1.30/source/percona-xtradb-1.0.2-1.tar.gz
* as binaries, percona xtradb is compiled in into MySQL-5.1.30 RPMS
* as separated shared library, to use as drop-in plugin for existing MySQL-5.1.30 installation http://www.percona.com/mysql/5.1.30/binary/percona-xtradb-1.0.2-1-5.1.30.x86_64.tar.gz
* OurDelta will also be using Percona XtraDB for its upcoming 5.1 builds.
XtraDB is fully compatible with existing InnoDB tables and we are going to keep compatibility in further releases.
We are open for features requests for new engine and ready to accept community patches. You can monitor Percona's current tasks and further plans on https://bugs.launchpad.net/percona-xtradb. You can also report bugs there. Also have setup two maillists http://groups.google.com/group/percona-discussion for General discussions and http://groups.google.com/group/percona-dev for development related questions.
We are looking for 6-month release cycle of XtraDB. First several releases may come faster, as many features are planned.
Why do we say it's a new storage engine, and not just a patchset? Because we are taking this as a serious project to evolve a new storage engine that will eventually become much more than just small variations from standard InnoDB. Percona will improve this storage engine using our own ideas, as well as incorporate improvements specifically sponsored by customers. Please contact us if you would like to sponsor any specific features. Contact form: http://www.percona.com/contacts.html
Entry posted by Vadim | No comment
One command, which few people realize exists is SHOW OPEN TABLES - it allows you to examine what tables do you have open right now:
PLAIN TEXT SQL:This command lists all non-temporary tables in the table-cache, showing each of them only once (even if table is opened more than ones)
In_use show how many threads are currently using this table, meaning have it locked or waiting on the table lock for this table to lock it.
Name_locked shows whenever the name for this table is locked. It is used for DROP or RENAME TABLE, so you would very rarely see this field to contain anything else than 0.
Besides just figuring out what tables are in the table_cache this command is rather helpful to understand if there is activity on the given table. Just run "FLUSH TABLES mytable" and examine open tables later - if you see this table in table cache again chances are it is being used.
Note however if you're starting MySQL Command line client without "-A" option it opens all tables in the active database to allow tab completion which can screw results.
Another use for this command is pre-flush implementation (as part of your backup routine) - instead of running FLUSH TABLES on ALL tables one by one you can run SHOW OPEN TABLES and flush only open tables, when run it again and see how many tables are open and in use and if FLUSH TABLES WITH READ LOCK can be run or not.
Unfortunately this command does not really help to answer the very common question you may have during table lock troubleshooting - who is holding lock for this table ?
I would much rather see all entries in the table_cache used, not grouped by the table, with thread_id using the table set (0 if table is not in use), lock_type READ/WRITE/READ_LOCAL etc as well as whenever the thread is looking for lock right now.
It also deserves to be converted to INFORMATION_SCHEMA table - so it would be easily to operate it using SQL commands.
Another thing which would be handly is LRU position for the given table (so you can see what tables are candidates for replacement) and the timestamp when this table was locked (or lock wait started) - MySQL anyway initializes the timer so it would not be much overhead to store that time together in the table cache. This could allow to understand table locks much better.
Entry posted by peter | No comment
As Giuseppe just reminded MySQL 4.0 is reaching its end of life in about 2 weeks from now. When it becomes unsupported by MySQL together with 3.23 version.
At Percona we do not have such restriction and we will continue to support your environment even if you’re still on MySQL 4 or 3.23 (we indeed do have customers which are still using 3.23 now)
What does it mean for you in practice ? This means we will continue helping you with issues you’re having with MySQL 3.23 or 4.0 and will be happy to backport bug fixes to MySQL 3.23 and 4.0
Supporting MySQL 3.23 and MySQL 4.0 environment is indeed more expensive than current MySQL versions because we can’t relay on Sun/MySQL doing any work with bug fixes plus more work is often needed because these versions have lower transparency than later versions. It also often takes more time to do things because we have to have an extra care to remember all limitations of these versions correctly. For example remember to use set-variable in MySQL 3.23 or remembering all optimizer issues which were fixed more recently.
It costs more to support environments based on 3.23 and 4.0 these days so we would typically recommend to upgrade, however if it is not instantly possible we continue to do support your envinronment. With our simple pricing model of “you pay for consultants true time” we can naturally afford to do it.
Entry posted by peter | 2 comments
As you might have seen MySQL QA Team has published their benchmarks for MySQL 5.0.72 and 5.1.30.
It is interesting to compare with results I posted previously
The quote from the results mentioned above:
“Maybe you’ve seen some claims by others in the MySQL community that MySQL 5.1 runs slower than MySQL 5.0. Maybe you’ve also seen some claims by others in the MySQL community that MySQL 5.1 runs faster than MySQL 5.0.
Guess what? They’re both right. “
But is it really what results are telling us ?
I do not think so. When you’re doing benchmarks you should be comparing best performance settings for given application and conditions. For example it is unfair to compare results with different innodb_buffer_pool_size or innodb_flush_log_at_trx_commit but for innodb_thread_concurrency - you should be picking the value which makes your workload to run fastest.
Lets look at the graphs provided in these benchmarks and see what value is best for MySQL 5.1 and for MySQL 5.0 respectively ?
Both versions do their best with innodb_thread_concurrency=0 and 5.0 is slightly but consistently faster. Same as in my results.
So I would interpret these results differently.
MySQL 5.0 is faster if you configure it right. If you configure it wrong the regression is going to be worse than for MySQL 5.1.
You can’t really use this results to tell MySQL 5.1 will be winner in cases when small values of innodb_thread_concurrency get best performance. The things can be completely different in this case. They may or may not, there is simply no data at all.
It is also very interesting to see benchmark run with innodb_thread_concurrency=1000 - This is exactly the value which you should never use. The limit of 1000 threads inside the kernel is by far too large (so it is same as 0 - unlimited) but it will add another mutex to deal with for the queue implementation.
What would be really interesting to learn though is why MySQL 5.1 gets so better when threads get queued up, what kind of changes in MySQL 5.0 result in this behavior. It also would be good to run profiling for these results to see where these few percent are lost for MySQL 5.0 to see if they are possible to be reclaimed.
Finally - if I would be doing QA, and benchmarks as part of them I would try to use options which are close to what people would use in production. Or at least explain why they are set so. Is it because 5.1 does not show too good results with standard settings or is it just omission ? Again quite possibly nothing will change but may be not.
In particular: innodb_support_xa=0, innodb_doublewrite=0 are not normally used in production and they do add some overhead.
When looking for results more relevant for production I also would keep binary log enabled - most big installations of MySQL use replication or at least binary log to get point in time recovery from backup. Also Innodb log files of 2*650M are larger than practical for most applications because of too large recovery time.
Anyways. It is great to see MySQL QA Team has published some benchmarks now and I can’t wait to see more. If we have data and the good disclosure (settings, versions, hardware) as we have in this case we can make our own mind of results and draw our own conclusions.
Also indeed it is a good time to try MySQL 5.1 for your environment. If you spot any regressions from MySQL 5.0 it will likely take a time to get them fixed.
Entry posted by peter | No comment
We made new binaries for MySQL 5.0.67 build 10 which include next fixes:
Next variables were added:
PLAIN TEXT CODE:With both slow_query_log = OFF and userstat_running = OFF we have no performance penalties of using of patches.
The -percona release includes:
PLAIN TEXT CODE: