Wednesday, May 20, 2009

MySQL 5.4 Webinar

The quality of MySQL 5.4.0 is very high for a beta product.
Four weeks after we released it as beta we have not had
any really serious bugs reported yet. There are some issues
due to deprecated features and version numbers, a bug in
the SHOW INNODB STATUS printout, and some concerns with
the new defaults when running on low-end machines.
As usual it's also important to read the documentation
before upgrading, since it contains some instructions needed
to make an upgrade successful. The upgrade issue comes from
changing the defaults of the InnoDB log file sizes.

For those of you who want to know more about MySQL 5.4.0,
its characteristics and why you should use it, please
join this webinar where Allan Packer will explain what
has been done in MySQL 5.4.0.

Tuesday, May 19, 2009

Patches ready for buf page hash split shootout

Today I created a patch that builds on the Google v3
patch, to which I added some ideas of my own and some ideas
from the Percona patches. The patch is here.

Here is a reference to the patch derived from the Google
v3 patch.

Here is a reference to my original patch (this is likely to
contain a bug somewhere, so using it for anything other than
benchmarking isn't recommended).

So it will be interesting to see a comparison of all those
variants directly against each other on a number of benchmarks.

Analysis of split flush list from buffer pool

In the Google v3 patch the buffer pool mutex has been
split into an array of buffer page hash mutexes and a
buffer flush list mutex, while the buffer pool mutex
itself also remains.

I derived the patch that splits out the buffer flush list
mutex from the Google v3 patch, applied against the MySQL
5.4.0 tree. The patch is here.

I derived a lot of prototype patches based on MySQL 5.4.0
and Dimitri tried them out. This particular patch seems
to be the most successful in the pack of patches we
tested. It had a consistent positive impact.

The main contribution of this patch is twofold. First, it
decreases the pressure on the buffer pool mutex by splitting
out a critical part where the oldest dirty pages are flushed
out to disk. Second, it decreases the pressure on the log_sys
mutex by releasing the log_sys mutex earlier for the
mini-transactions, and it removes the interaction between
the buffer pool mutex and the log_sys mutex. Previously both
mutexes had to be held for a while; this is no longer
necessary since only the flush list mutex is needed, not
the buffer pool mutex.
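
The core of the change can be illustrated with a minimal
sketch (illustrative types and names, not the actual InnoDB
code): with a dedicated flush list mutex, adding a dirty page
at mini-transaction commit no longer touches the buffer pool
mutex, and the log_sys mutex can be released before this step
instead of being held across it.

```cpp
// Minimal sketch, not the actual InnoDB code.
#include <cstdint>
#include <list>
#include <mutex>

struct buf_page_t {
    uint64_t oldest_modification = 0;   // LSN of the first modification
    // ... the rest of the page descriptor
};

struct buf_pool_sketch {
    std::mutex pool_mutex;          // still protects LRU and free lists
    std::mutex flush_list_mutex;    // split out: protects only the flush list
    std::list<buf_page_t*> flush_list;   // dirty pages, newest first

    // Called at mtr commit, after the redo log has been written and the
    // log_sys mutex has already been released.
    void add_dirty_page(buf_page_t* page, uint64_t lsn) {
        std::lock_guard<std::mutex> guard(flush_list_mutex);
        if (page->oldest_modification == 0) {
            page->oldest_modification = lsn;
            flush_list.push_front(page);
        }
    }

    // The flusher picks the oldest dirty page; again no buffer pool mutex.
    buf_page_t* oldest_dirty_page() {
        std::lock_guard<std::mutex> guard(flush_list_mutex);
        return flush_list.empty() ? nullptr : flush_list.back();
    }
};
```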

The new patch is the b11 variant, shown in red in
the comparison graphs.

As we can see, the read-write tests get a pretty significant
boost from this patch: it improves top performance by 5% and
by 10-20% at higher numbers of threads. It also moves the
maximum from 16 to 32 threads.

Even on read-only there are some improvements, although it
is quite possible those are more random in nature.

Finally, the above picture shows that this patch also moves
the optimal InnoDB thread concurrency up from 16 to 24, since
it allows for more concurrency inside InnoDB. This is also
visible in the numbers for InnoDB Thread Concurrency set to 0,
as seen below.

Friday, May 15, 2009

Shootout of split page hash from InnoDB buffer pool mutex

One of the hot mutexes in InnoDB is the buffer pool mutex.
Among other things this mutex protects the page hash, which
is used to look up pages that are currently in the buffer pool.

There are already a number of variants of how to split out
this mutex. Here follows a short description of the various
approaches.

1) Google v3 approach
Ben Hardy at Google took the approach of using an array of
64 mutexes, where the mutex only protects the actual read,
insert and delete operations on the page hash table. This
makes for a very simple patch, but it also means that once
the block has been locked one has to check that the owner of
the block hasn't changed: the block isn't protected between
the read of the hash and the locking of the block, so someone
can come in between and grab the block for another page before
we get to lock it. In addition this patch focuses mainly on
optimising the path in buf_page_get_gen, which is the routine
used to get a page from the page cache and thus the hot-spot.
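
To make that consequence concrete, here is a minimal sketch of
the lookup-and-recheck pattern (hypothetical names and types,
not the actual patch code): the hash mutex only covers the hash
lookup itself, so after locking the block we verify that it
still holds the page we asked for.

```cpp
// Sketch only: array of small mutexes protecting the page hash parts.
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct block_t {
    std::mutex block_mutex;
    uint32_t space   = 0;
    uint32_t page_no = 0;
};

constexpr int N_HASH_PARTS = 64;   // the Google v3 patch uses 64 mutexes

struct hash_part {
    std::mutex mutex;                               // protects only this part
    std::unordered_map<uint64_t, block_t*> pages;   // (space, page_no) -> block
};
hash_part page_hash[N_HASH_PARTS];

static uint64_t page_id(uint32_t space, uint32_t page_no) {
    return (uint64_t(space) << 32) | page_no;
}

block_t* buf_page_get_sketch(uint32_t space, uint32_t page_no) {
    for (;;) {
        hash_part& part = page_hash[page_id(space, page_no) % N_HASH_PARTS];
        block_t* block = nullptr;
        {
            std::lock_guard<std::mutex> guard(part.mutex);
            auto it = part.pages.find(page_id(space, page_no));
            if (it == part.pages.end()) return nullptr;   // not in the pool
            block = it->second;
        }   // hash mutex released here; the block is not yet pinned

        block->block_mutex.lock();
        // The block may have been evicted and reused for another page
        // between the hash lookup and this lock, so recheck its identity.
        if (block->space == space && block->page_no == page_no) {
            return block;   // caller now holds block->block_mutex
        }
        block->block_mutex.unlock();   // lost the race, retry the lookup
    }
}
```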

2) Percona approaches
Percona has tried a series of approaches. The first split out
the page hash under a single mutex, still protecting the blocks
from being changed while holding this mutex. The next step was
to change that mutex into a read-write lock.

3) My approach
My approach was inspired by Percona but adds two main things.
First, it splits the page hash into a number of page hashes
with one RW-lock per page hash (I tested 4, 8 and 16 hashes,
and 4 was optimal, at least on Linux). Second, to avoid having
to lock and unlock multiple locks while going through the
read-ahead code, the hash function that decides which page
hash to use maps all pages within 1 MByte (the unit of read
ahead in InnoDB) to the same page hash.
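
A minimal sketch of that hash choice (illustrative names;
assuming the default 16 KByte page size, so a 1 MByte
read-ahead area covers 64 consecutive pages):

```cpp
// Sketch only: read-ahead-aligned choice of page hash.
#include <cstddef>
#include <cstdint>

constexpr unsigned N_PAGE_HASHES  = 4;    // 4 was optimal in my tests on Linux
constexpr unsigned PAGES_PER_AREA = 64;   // 1 MByte / 16 KByte page size

// All pages of one read-ahead area map to the same page hash, so the
// read-ahead code only has to take a single RW-lock for the whole area.
inline size_t page_hash_index(uint32_t space, uint32_t page_no) {
    uint32_t area = page_no / PAGES_PER_AREA;
    return (space ^ area) % N_PAGE_HASHES;
}
```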

Pros and Cons

The Google patch is the simplest, and by focusing only on
buf_page_get_gen it avoids a lot of the extra traps that are
likely if one tries to solve too much of the problem at once.

Using an RW-lock instead of a mutex is at least one way of
improving concurrency, but it could of course impose a higher
overhead as well, so benchmarking should show which is best.

When using an array of locks it makes sense to optimise for
the read-ahead functionality, since this is a hot-spot in the
code, as has been shown in some blogs lately.

4) Mixed approach
A natural solution is then to also try a mix of the Google
variant and my approach: still use an array of locks (either
mutexes or RW-locks, whichever has the best performance), but
ensure that the pages within a read-ahead area are protected
by the same lock.

This approach combines the simplicity and total lack of
deadlock problems of the Google approach with the optimised
layout from my approach and the idea of RW-locks from Percona.
A sketch of this combination is shown below.

We don't have any results from this shootout yet. The
shootout should also discover the optimum number of parts to
split the page hash into; Google has used 64, but my results
so far indicate that 4 is more appropriate.
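
Since there are no numbers yet, the following is only a sketch
of what the mixed variant could look like (hypothetical names,
modern C++ for brevity, not an actual patch): an array of
RW-locks over the page hash, indexed so that a 1 MByte
read-ahead area always maps to the same lock.

```cpp
// Sketch only: array of RW-locks with read-ahead-aligned indexing.
#include <cstdint>
#include <shared_mutex>
#include <unordered_map>

struct block_t;   // buffer pool block descriptor (details omitted)

constexpr unsigned N_PARTS        = 4;    // to be settled by the shootout
constexpr unsigned PAGES_PER_AREA = 64;   // 1 MByte with 16 KByte pages

struct hash_part {
    std::shared_mutex lock;                         // RW-lock per part
    std::unordered_map<uint64_t, block_t*> pages;
};
hash_part page_hash[N_PARTS];

inline hash_part& part_for(uint32_t space, uint32_t page_no) {
    return page_hash[(space ^ (page_no / PAGES_PER_AREA)) % N_PARTS];
}

// A reader (e.g. buf_page_get) takes the lock shared; read-ahead over one
// area can hold a single shared lock while probing all pages of the area.
block_t* lookup(uint32_t space, uint32_t page_no) {
    hash_part& part = part_for(space, page_no);
    std::shared_lock<std::shared_mutex> guard(part.lock);
    auto it = part.pages.find((uint64_t(space) << 32) | page_no);
    return it == part.pages.end() ? nullptr : it->second;
}
```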

Thursday, May 14, 2009

More analysis of InnoDB Thread Concurrency

When I worked with Dimitri on the analysis of the split
Rollback Segment Mutex, he came up with numbers with InnoDB
Thread Concurrency set to 16 and 32, and I was curious to see
whether 24 was the optimal setting. So he made some new runs
and some new graphs that I found interesting.

The first graph analyses the behaviour of MySQL 5.4.0 on a
SPARC server using InnoDB Thread Concurrency set to 0, 16, 24
and 32. Interestingly, for both readonly and readwrite
benchmarks the optimal setting for concurrency is 16, whereas
the top numbers (at 32 threads) are achieved with concurrency
set to 24 or 32.

So on the current MySQL 5.4.0, on this particular benchmark
and platform, it seems that 16 is the optimal setting. However,
Dimitri also analysed the same thing using the new patch for
splitting the Rollback Segment Mutex, and now the story changes.

With this patch, setting InnoDB Thread Concurrency to 24 is
now the optimum. It's clear that as we get more and more
improvements to the scalability of the MySQL Server and
InnoDB, more and more parallel threads inside InnoDB become
optimal as well. This means that this setting is quite likely
to change as development proceeds, but for MySQL 5.4.0 a
setting of around 16-24 is often a good one. Actually changing
the default setting requires much more testing of various
workloads on many different computer architectures.

Similar testing I have performed on Linux using sysbench
implies that the optimal setting there is around 24-28. Also,
the difference between setting it to 0 and 24 is much smaller
on Linux (15% on 256 threads, as shown in yesterday's blog).
We haven't analysed why the difference is so much bigger on
these SPARC servers.

Analysis of Split of Rollback Segment Mutex

When I read the blog about Split Rollback Segment Mutex,
I was interested to verify those results in the context of MySQL 5.4.0.

The patch can be found here.

We've analysed this patch both on a large SPARC system and on
my x86/Linux benchmark machine. Our results tend to be positive
for readwrite benchmarks but sometimes negative for readonly
benchmarks. Also, the gain is much smaller than reported in
that blog.

This patch also has two negative effects. The first is that
it introduces an upgrade problem; this can probably be handled
in the InnoDB code, but requires quite some digging. The other
is that instead of writing UNDO records to one UNDO log, we
write them to several UNDO logs, thus decreasing the buffering
effect towards the file system.

On Linux I found up to 7-8% improvement of the top results on
readwrite benchmarks. On readonly it sometimes dropped by about
1-3%. I also tried varying the number of rollback segments and
found 4 and 8 to be better than 16. So from that point of view
setting the number of rollback segments to 4 is probably best.
The patch uses 8 (it's actually set to 9 since the system
rollback segment is a bit special).
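
For context, here is a minimal sketch of the idea behind the
patch (illustrative names; the round-robin assignment shown is
an assumption of mine, not necessarily what the patch does):
with several rollback segments, each with its own mutex,
concurrent transactions writing UNDO no longer all serialise
on a single mutex.

```cpp
// Sketch only: spread transactions over several rollback segments.
#include <atomic>
#include <cstddef>
#include <mutex>

constexpr size_t N_ROLLBACK_SEGMENTS = 4;   // 4 or 8 looked best in my runs

struct rollback_segment {
    std::mutex rseg_mutex;   // previously one segment meant one global
                             // point of contention for all undo allocation
    // ... undo log pages, history list, etc.
};

rollback_segment rsegs[N_ROLLBACK_SEGMENTS];
std::atomic<size_t> next_rseg{0};

// Each new transaction is assigned a rollback segment round-robin; two
// concurrent transactions only contend if they land on the same segment.
rollback_segment* assign_rollback_segment() {
    size_t i = next_rseg.fetch_add(1, std::memory_order_relaxed);
    return &rsegs[i % N_ROLLBACK_SEGMENTS];
}
```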

Here are some graphs from Dimitri running it on a fat SPARC
server (MySQL-5.Perfb9-gcc43 is 5.4.0 plus the above patch).

The first graph shows the behaviour when InnoDB Thread
Concurrency is 0; here we see a speedup in the range of 3-5%.

The readonly benchmark shows positive results as well.

When one sets InnoDB Thread Concurrency to 16, 24 or 32 the
behaviour is different. It turns out that we get worse
performance using 16, but a more positive impact using 24 and
even more using 32. So it seems that this patch requires less
limiting of parallelism to get the best behaviour.

So one impact of this patch is that it can sustain a higher
number of concurrent threads, and there is a small positive
impact on performance as well.

Wednesday, May 13, 2009

More data on InnoDB Thread Concurrency

Here is the performance graph comparing InnoDB Thread
Concurrency set to 0 with InnoDB Thread Concurrency set to 24,
using sysbench readwrite and the new InnoDB Thread Concurrency
algorithm introduced in MySQL 5.4.0.

Analysis of Google patches on 4,8 and 12 cores

One of the goals we had originally with the MySQL 5.4
development was to improve scaling from 4 cores to 8 cores.
So in my early testing I ran comparisons of the Google
SMP + IO + tcmalloc patches on 4, 8 and 12 cores to see how
they behaved compared with a stock MySQL 5.1.28 version (note
that the comparison here was done on a very early version of
5.4; 5.4.0 has a set of additional patches applied to it).

What we can see here is that the Google SMP patch and the use
of tcmalloc already make a difference on a 4-core server using
4 threads. On 1 and 2 threads the difference is only on the
order of 1-2%, so not really significant.

An interesting note in the graph is that the 8-core numbers
using the Google improvements outperform 12-core stock MySQL
5.1.28.

What we concluded from those graphs is that the scaling from
4 cores to 8 cores had improved greatly and that there was
also good scaling from 8 cores to 12 cores. This improvement
increased even more with the 5.4 release. The main purpose of
showing these numbers is to show the difference between 4, 8
and 12 cores.

All benchmarks were executed on a 16-core x86 box with 4 cores
dedicated to running sysbench.

Analysis of Google patches in MySQL 5.4

Early on in the MySQL 5.4 development we tried out the
impact of the Google SMP patch and the Google IO patch.
At first we wanted to see which of the patches made the
most impact. The Google patches in MySQL 5.4 have at least
3 components that impact performance:
1) Replace InnoDB memory manager by a malloc variant
2) Replace InnoDB RW-lock implementation
3) Make InnoDB use more IO threads

When disabling the InnoDB memory manager one opens up a whole
array of potential malloc candidates. Our work concluded that
tcmalloc behaved best on Linux and mtmalloc was best on
Solaris; see the blog posts on Solaris below.

Malloc on Solaris investigation


Battle of the Mallocators on Solaris


I also did some testing on Linux where I compared 4 different
cases (all variants were based on MySQL 5.1.28):
1) Using the Google SMP patch, Google IO patch (with 4 read and
4 write threads) and using tcmalloc
2) Using tcmalloc and no other Google patches
3) Using plain malloc from libc
4) Using plain MySQL 5.1.28 using InnoDB memory manager

Here are the results:


As we can see here, replacing the InnoDB memory manager with
standard malloc had no benefit, whereas replacing it with
tcmalloc gave 10% extra performance. The Google SMP patch added
another 10% performance in sysbench readwrite. We have also
tested other OLTP benchmarks where the Google SMP patch added
about 5-10% performance improvement. As shown by Mark Callaghan,
there are however other benchmarks where the Google SMP patch
provides much greater improvements.

Tuesday, May 12, 2009

MySQL 5.4 Patches: InnoDB Thread Concurrency

When benchmarking MySQL with InnoDB we quickly discovered
that setting InnoDB Thread Concurrency to 0 improved
performance, since the implementation of InnoDB Thread
Concurrency used a mutex which was in itself a scalability
bottleneck.

Given that InnoDB Thread Concurrency is a nice feature that
ensures good performance also on an overloaded server, I was
hoping to find a way to make its implementation more scalable.

I tried out many different techniques using a combination of
mutexes and atomic variables. However, every technique fell
flat and performed worse than setting it to 0 and not using
the InnoDB Thread Concurrency implementation at all. So I was
ready to give up the effort and move on to other ideas.

However, after sleeping on it an inspirational idea came up:
why use a mutex at all? Let's see how it works to use the OS
scheduler to queue the threads that need to be blocked. This
should be more scalable than a mutex-based approach. There is
obviously one bad thing about this approach, which is that new
arrivals can enter before old waiters. To ensure we don't
suffer too much from this, a limit on the wait was necessary.

So I quickly put together a solution that calls yield once
and sleeps for 10 milliseconds at most twice, and every time
it wakes up it checks an atomic variable to see whether it is
OK to enter. After those three attempts it enters without
checking.
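
Here is a minimal sketch of that admission logic (illustrative
names and modern C++ atomics; not the actual MySQL 5.4 code,
which is more involved):

```cpp
// Sketch only: mutex-free admission with a bounded wait.
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int> threads_inside{0};
int innodb_thread_concurrency = 24;   // 0 means "no limit"

void innodb_enter() {
    if (innodb_thread_concurrency == 0) {
        threads_inside.fetch_add(1);
        return;
    }
    if (threads_inside.load() >= innodb_thread_concurrency) {
        std::this_thread::yield();   // attempt 1: just give up the CPU slice
        for (int s = 0;
             s < 2 && threads_inside.load() >= innodb_thread_concurrency;
             ++s) {
            // attempts 2 and 3: sleep 10 ms, re-checking in between
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    }
    threads_inside.fetch_add(1);   // after three attempts, enter regardless
}

void innodb_exit() {
    threads_inside.fetch_sub(1);
}
```

The point of the bounded wait is that an old waiter can be
overtaken by new arrivals for at most one yield plus roughly
20 milliseconds before it forces its way in.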

I tried it and saw a 1% decrease at low concurrency, a 5%
improvement on 32 threads, 10% on 64 threads and 15% on 128
threads. Voila, it worked. Then I decided to search for the
optimal solution, to see how many yields and sleeps would be
best. It turned out I had found the optimal numbers on the
first attempt.

The implementation still has corner cases where it provides
less benefit, so I kept the possibility of using the old
implementation by adding a new variable here.

So currently the default for InnoDB Thread Concurrency in
MySQL 5.4 is still 0. However, we generally see optimal
behaviour with InnoDB Thread Concurrency set to around 24;
setting it higher does not bring any real value in MySQL 5.4.0,
and setting it lower decreases the performance one can achieve.
This seems to be a fairly generic set-up that should work well
in most cases. We might change the default for this later.