Thursday, December 03, 2009

New threadpool design

In MySQL 6.0 a threadpool design was implemented based on
libevent and mutexes.

This design unfortunately had a number of deficiencies:
1) The performance under high load was constrained due to a global
mutex protecting libevent (see BUG#42288).

2) The design had no flexibility to handle cases where threads were
blocked due to either locking or latches. E.g. a thread held up by a
table lock will be kept in the threadpool as an active thread until
the table lock is released. If all threads are blocked in this state,
it's easy to see that even the query that wants to release the table
lock cannot be processed, since all threads in the thread pool are
blocked waiting for the table lock (see BUG#34797).

3) The design was intended to support very many connections but
didn't use the most efficient methods to do so on Windows.
libevent uses poll on Windows, which isn't a scalable API when
there are thousands of connections.

Also, in all of our benchmarking with MySQL it has been clear that
the performance of MySQL often drops significantly when too
many threads hit the MySQL Server. We have seen vast
improvements in this area over the last year and there are some
additional improvements in the pipeline for inclusion into the next
MySQL milestone release. However the basic problem is still there:
too many waiters in the queue can lead to various performance
drop-offs, and one reason for such drop-offs can be mutex waits
starting to time out in InnoDB.

So when we're looking at the threadpool design now, we're
aiming at solving two issues in one. The first is to remove this
scalability drop-off at high thread counts and the second is to
efficiently handle MySQL servers with thousands of connections.
A threadpool also enables us to have more control over which
CPUs threads are scheduled to execute on. With a clever threadpool
design we can even dynamically adapt the CPU usage to optimize
the MySQL Server for lower power consumption.

We're currently in the phase of experimenting with different
models; however, we have opted for a design based around epoll
on Linux, event ports on Solaris and kqueue on FreeBSD and
Mac OS X. We will also make a poll-based variant work, mostly for
portability reasons, although its scalability won't be as good.
For Windows we're experimenting with some Windows-specific
APIs such as the I/O Completion Ports API.
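
To give an idea of what such an event loop could look like, here is a minimal sketch in C (my own simplification with an invented handle_query stub; this is not the actual threadpool code) of how a thread group could wait for activity on its connections using epoll on Linux:

/* Minimal sketch of an epoll-based wait loop for one thread group. */
#include <sys/epoll.h>
#include <unistd.h>

/* Stub standing in for handing the connection to a worker thread. */
static void handle_query(int client_fd)
{
  char buf[256];
  (void) read(client_fd, buf, sizeof(buf));
}

/* Register a new connection with the group's epoll instance. */
void add_connection(int epoll_fd, int client_fd)
{
  struct epoll_event ev;
  ev.events = EPOLLIN;            /* wake up when the client sends a query */
  ev.data.fd = client_fd;
  epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client_fd, &ev);
}

/* Wait loop executed by the group's listener thread. */
void poll_connections(int epoll_fd)
{
  struct epoll_event events[64];
  for (;;)
  {
    int n = epoll_wait(epoll_fd, events, 64, -1);
    for (int i = 0; i < n; i++)
      handle_query(events[i].data.fd);
  }
}

The same loop could be expressed with event ports or kqueue on the other platforms; only the waiting primitive changes.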

The code to support thread pooling in MySQL is actually very
small so it's easy to adapt the code for a new experiment.

Last week we found a model that seems to work very well.
The benchmarks show that the performance on 1-32 threads is
around 97-103% of one-thread-per-connection performance. When
we go beyond 32 threads the thread pool gains more and more:
it gets to about 130% at 256 threads and reaches 250%
better performance at 1024 threads. However this model still
has the problem of deadlocks, so there is still some work on
refining this model. The current approach we have fixes
the deadlock problem but removes about 10-15% of the
performance at lower thread counts. We have however
numerous ideas on how to improve this.

The basic idea of our current approach is to use thread groups,
where each group works independently of the other groups in handling
a set of connections. We're experimenting with the number of
threads per group and also with how to handle the situation when the
last thread in the group is getting ready to execute a query.
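
As a rough illustration of what such a group could own (a sketch under my own assumptions about the bookkeeping, not the actual implementation), the point is that nothing in one group's state is touched by another group:

#include <pthread.h>

/* Sketch of per-group state; field names are illustrative only. */
struct thread_group
{
  int             epoll_fd;         /* poller owned by this group            */
  pthread_mutex_t mutex;            /* protects only this group's state      */
  int             connection_count; /* connections assigned to this group    */
  int             active_threads;   /* threads currently executing queries   */
  int             waiting_threads;  /* threads blocked inside a query        */
};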

Compared to the maximum performance around 32 threads we reach
about 67% of this performance also on 1024 concurrently active
threads. The 33% drop-off is expected since there is some
additional load when we reach an overload situation to ensure
that the proper thread is handling the task. At a low number of
threads it's possible to immediately schedule the current worker
thread to work on the query, but in the overload situation there
is some queueing and context switching needed to handle the
situation. However the overhead at overload is constant, so it
doesn't worsen when the number of threads grows very large.

To handle the problems with blocked threads, we will implement a
new part of the storage engine API, and an API towards the MySQL
Server, through which the MySQL Server and the storage engines can
announce that they're planning to go inactive for some reason.
The threadpool will still handle the situation even if a thread
goes to sleep without announcing it; it will simply perform
better when the announcement is made in those situations.
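
One way such an announcement could look (purely hypothetical names and signatures, sketched only to show the idea rather than the actual API) is a pair of callbacks that the server or a storage engine calls around a potentially long wait:

#include <pthread.h>

/* Hypothetical threadpool callbacks; stub bodies for illustration. */
static void thd_wait_begin(void *thd) { (void) thd; /* pool may wake another thread */ }
static void thd_wait_end(void *thd)   { (void) thd; /* thread counts as active again */ }

/* Example use around a table lock wait inside the server or an engine. */
void wait_for_table_lock(void *thd, pthread_mutex_t *lock_mutex,
                         pthread_cond_t *lock_cond, int *lock_granted)
{
  thd_wait_begin(thd);                 /* announce that we are about to block */
  pthread_mutex_lock(lock_mutex);
  while (!*lock_granted)
    pthread_cond_wait(lock_cond, lock_mutex);
  pthread_mutex_unlock(lock_mutex);
  thd_wait_end(thd);                   /* announce that we are runnable again */
}

With something like this in place the threadpool can start another thread in the group when all its threads are blocked, which addresses the table lock scenario described above.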

The new MySQL development model with milestone releases is a
vital injection into MySQL development, making it possible to
deliver new features to the MySQL community users in an efficient
manner without endangering the quality of the MySQL Server.
There is a very strict quality model that must be passed before
approving any new feature into a milestone release.
The 6.0 thread pool design would not meet this strict quality
model. The new design must meet this strict quality model before
being accepted, but we have good hopes that this will happen.

Friday, November 20, 2009

Partitioning as performance booster

When I developed partitioning for MySQL the main goal was
to make it easier for MySQL users to manage large tables
by enabling them to easily add and drop partitions.

It turns out that partitioning can also be used as a way
to make MySQL more scalable. The reason is that in some
cases the storage engine has internal locks per table or
per index (one such example is the btr_search_latch in InnoDB).

So in this case adding a
PARTITION BY KEY (key_part)
PARTITIONS 4
to the table definition makes a very hot table into 4 tables
from the storage engine point of view.

This would mostly be beneficial in cases where the main
operation is primary key lookups on the table. Dividing the
indexes in cases of scans can be both positive and negative.
So this solution is definitely not a winner for all situations.

I haven't tried this out yet myself in my benchmark suites,
but I plan to make some experiments in this area. It is usable
in sysbench, it's possible to use for DBT2 (I have already used
partitioning for DBT2 in MySQL Cluster benchmarks a lot) and it's
possible to use in Dimitri's dbStress benchmark.

Full automation of DBT2 test runs in benchmark scripts

My benchmark scripts were upgraded once more today.
I fixed all issues pertaining to full automation
of DBT2 runs. So now it is as easy to start a
DBT2 test run as it previously was to start a
sysbench run. Obviously DBT2 was already earlier
supported in the benchmark scripts, so what I did
now was add the final steps to make it fully
automated. This includes also generating the DBT2
load files needed to load the DBT2 database.

See the download section on www.iclaustron.com
for the tarball including some description of
how to configure sysbench and DBT2 test runs.

Tuesday, November 17, 2009

New version of benchmark scripts also supporting Drizzle

I updated my benchmark scripts this week. These scripts can now
run:

- Sysbench benchmarks for MySQL and Drizzle
- DBT2 benchmarks for MySQL and MySQL Cluster
- TPC-W benchmark for MySQL Cluster

There are also a number of scripts to start and stop
MySQL, Drizzle and MySQL Cluster nodes.

In this version I added Drizzle support for sysbench and also
added a README-AUTOMATED file that describes the steps needed
to set-up a completely automated sysbench run for MySQL
and Drizzle.

To run a MySQL sysbench benchmark one needs the DBT2 tarball,
the sysbench tarball and a MySQL tarball (gzipped tarballs).

The tarball is found on www.iclaustron.com in the downloads
section and this version is named dbt2-0.37.47.tar.gz.

Thursday, November 12, 2009

Improvements by LOCK_* patches

I have done a long series of tests to verify that
the impact of the LOCK_alarm removal, of moving two
variables from LOCK_threadcount protection to atomic
increments instead, and of decreasing the hold time
of LOCK_open is positive in most if not all cases.

There are a number of test cases needed:
1) With and Without cpuspeed activated
2) With sysbench local and with sysbench on another
server
3) With MySQL Server limited to 2,4,6,8,12,16 cores.
4) With number of threads going from 1 to 256 threads
in fair sized steps.

I've done most of those tests for Sysbench Readonly
and Sysbench Readwrite. The results are positive in
almost all cases compared to the baseline, which is based
on the MySQL 5.4.3 tree (not exactly 5.4.3 but close
enough).

In the case of the networked benchmark with cpuspeed
activated the gain is biggest: top performance goes up
about 10%, the top performance moves from 64 to 128
threads, and for 256 threads the performance increases
by about 50%.

When cpuspeed is deactivated and we use networked
benchmarks the network handling becomes a bottleneck, so
the numbers here are less interesting since we need to
resolve the network bottleneck first.

With cpuspeed activated and local communication the top
performance increases by about 8% and there is a gain
at all thread counts. The gain is a bit higher
on sysbench readonly than on sysbench readwrite.

With cpuspeed deactivated and local communication we
naturally get the best numbers but also the
smallest gains. Top performance of Sysbench
Readonly increased by 2.5% and for Readwrite
it increased by 4%; the top performance for
sysbench readwrite also moved from 16 threads
to 32 threads. The improvement is slightly
better at higher thread counts.

Wednesday, November 11, 2009

Analysis of InnoDB adaptive hash index parameter for sysbench

As I mentioned in a previous blog post I was suspicious that the
adaptive hash index in InnoDB added to the scalability issues in
the MySQL Server. So I decided to run a test where I disabled the
adaptive hash index.

The results were fairly much as expected. The adaptive hash
index usage improves performance on low thread counts
(up to about 16) by a few percent. However at 32 threads and
beyond the performance is better without the adaptive hash
index and equal in some cases. In particular the top performance
goes up by about 3% when this is disabled.

This confirms the documentation in the InnoDB manual that the
adaptive hash index improves performance as long as the lock
around it doesn't hurt performance. So for the majority of users
it's a good idea to have it turned on, but for users with high-end
servers it's a good idea to test if turning it off will improve
performance.

For sysbench benchmarks it's clearly a good idea to turn it off.

245% improvement of MySQL performance in 1 year

When I did sysbench benchmarks 1 year ago I used a
4-socket server, a Linux kernel based on 2.6.18 and
MySQL 5.1. The sysbench readwrite numbers I got then
were around 2700. When I run the same benchmarks
now the numbers I get are 9300.

These improvements obviously come from a mixture
of HW development, OS development (now using
a 2.6.31 based kernel) and MySQL development.

The machine is still a 4-socket server, the operating
system is still Linux and the database is still MySQL,
but the performance has improved by 245%. Needless
to say this is an extraordinary performance
improvement in just one year and clearly shows that
both the HW industry and the open source SW
industry are quickly picking up on how to improve
performance using multi-core multi-socket servers.

Effect of CPU Powersave mode on Sysbench benchmarks

When I started analysing the various patches that I had made
for improving the MySQL Server performance I by mistake
forgot to turn cpuspeed off in Linux. This is a feature that
makes it possible to run the CPUs at a much lower frequency
when they aren't so heavily used.

So at first I considered simply turning it off and discarding the
data I had produced. Then I realised that having
cpuspeed activated is actually the default behaviour and for many
servers out there in the world it is actually the best mode since
most servers go from high load to low load frequently.

So I decided that it would be worthwhile to analyse behaviour
both with and without this feature turned on.

The cpuspeed feature had a particularly large effect when running
sysbench on a different server, thus using socket-based
communication. In this case the performance drop-off at
64, 128 and 256 threads was fairly significant. However the
improvement when I added fixes for LOCK_open,
LOCK_alarm and LOCK_threadcount was very significant: I
got 50% better performance with these patches when
cpuspeed was activated and I was running sysbench over the
network.

When I ran sysbench and mysqld on the same host the impact
of the patches and cpuspeed was smaller but still significant.
Turning cpuspeed on decreases the performance of sysbench
readwrite by almost 10% for the baseline (MySQL 5.4.3) whereas
with the patches that fix the LOCK_* problems the drop from
using cpuspeed is only 1%.

So it seems like having many extra mutexes to pass through
doesn't hurt performance so much when running at full CPU
speed all the time, but as soon as the CPU power save mode
is activated these many extra mutexes to pass through have a
significant negative effect.

GDB stack trace analysis of Sysbench benchmark

After many unsuccessful attempts to get MySQL to run faster I got
a very simple but effective gdb script courtesy of Intel. The script
is very simple and attaches to the mysqld process and does a
backtrace on all threads in the process and then it gathers
statistics on all stacktraces.

With this script I did some analysis on what goes in sysbench
readonly and sysbench readwrite.

Starting with sysbench readonly I discovered a lot of things I
already knew, such as that LOCK_open is a major bottleneck.
There were also many other things that popped up such as:
LOCK_alarm, LOCK_threadcount, LOCK_grant,
btr_search_latch and the mutex on the table object which is
used to discover TABLE level locks on the MySQL level. This
last lock is contended mostly because sysbench is only
operating on one table. So most normal benchmarks will
not have any major problems with this mutex since applications
rarely put such heavy load on a single table.

Interestingly the kernel_mutex in InnoDB wasn't so prevalent
in sysbench readonly. Also I was a bit surprised to find the
btr_search_latch there since it's a RW-lock, but it seemed
like every now and then someone took an X-lock on the
btr_search_latch even in readonly queries. This probably has
something to do with the InnoDB adaptive hash index.

One surprising lock here is LOCK_grant, which is also
a RW-lock and is never taken in anything other than
Read mode unless one changes the grants, which
doesn't happen in a sysbench run. Some discussions
concluded that pthread_rwlock is actually implemented
using a mutex and that this is the cause of the contention
on LOCK_grant. So to resolve that, a read-optimised
RW-lock is needed for the MySQL Server code.
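
One well-known way to build such a read-optimised lock (a sketch of the general technique, not necessarily what will end up in the server) is to spread readers over an array of ordinary locks so that concurrent readers rarely touch the same cache line, while a writer must take every slot:

#include <pthread.h>

#define N_SLOTS 8

/* Sketch of a "big-reader" lock: cheap shared access, expensive exclusive. */
struct rw_read_opt
{
  pthread_mutex_t slot[N_SLOTS];
};

void rw_init(struct rw_read_opt *l)
{
  for (int i = 0; i < N_SLOTS; i++)
    pthread_mutex_init(&l->slot[i], NULL);
}

/* A reader picks one slot, e.g. based on its thread number, and locks only that. */
void rd_lock(struct rw_read_opt *l, unsigned thread_no)
{
  pthread_mutex_lock(&l->slot[thread_no % N_SLOTS]);
}

void rd_unlock(struct rw_read_opt *l, unsigned thread_no)
{
  pthread_mutex_unlock(&l->slot[thread_no % N_SLOTS]);
}

/* A writer (e.g. a GRANT statement) must take every slot, in order. */
void wr_lock(struct rw_read_opt *l)
{
  for (int i = 0; i < N_SLOTS; i++)
    pthread_mutex_lock(&l->slot[i]);
}

void wr_unlock(struct rw_read_opt *l)
{
  for (int i = N_SLOTS - 1; i >= 0; i--)
    pthread_mutex_unlock(&l->slot[i]);
}

Since GRANT statements are very rare compared to grant checks, paying a high price on the write side is a good trade-off here.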

To remove LOCK_alarm there is already code in the MySQL
Server prepared to do that so the patch to remove
LOCK_alarm is fairly straightforward.

Removing LOCK_threadcount isn't necessary; it's sufficient
to remove the two variables thread_running and global_query_id
from being protected by this mutex and instead make them
atomic variables. To handle this one can use the my_atomic
framework and add 64-bit support to it. Then the fix of
LOCK_threadcount is straightforward.
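
As a rough illustration of the pattern (shown here with GCC's built-in atomics rather than the my_atomic framework, so this is just a sketch of the idea), incrementing these counters then no longer needs any mutex:

#include <stdint.h>

/* Counters formerly protected by LOCK_threadcount, now plain atomics. */
static volatile int32_t thread_running = 0;
static volatile int64_t global_query_id = 0;

/* Assign a new query id without taking any mutex. */
int64_t next_query_id(void)
{
  return __sync_add_and_fetch(&global_query_id, 1);
}

void query_enter(void) { __sync_add_and_fetch(&thread_running, 1); }
void query_exit(void)  { __sync_sub_and_fetch(&thread_running, 1); }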

Resolving LOCK_open is obviously a bigger problem, but a
first step is to simply move the hash calculation out from
under LOCK_open.

The MySQL runtime team is working on a metadata locking
infrastructure that was in MySQL 6.0 but still has some
quality issues. When this code is ready it will also make
it possible to resolve the LOCK_open problem. Actually the
problem then concerns both the new lock LOCK_mdl added
by the metadata locking code and LOCK_open. But the new
structure makes it possible to have more aggressive
solutions for both the LOCK_open and LOCK_mdl code.

In analysing the sysbench readwrite benchmark the main
contender together with LOCK_open was not very
surprisingly the kernel_mutex in InnoDB. There are some
ideas on how to improve this as well but first things first.

New partitioning SQL syntax added in next MySQL milestone release tree

This blog gives some insight into the new SQL syntax added in WL#3352,
WL#4444 and WL#4571, which have all been included in the
mysql-next-mr tree which is the base for the next milestone release
codenamed Betony.

The purpose of these new changes is to enable improved partition pruning, and
also to make it possible to partition on strings. Supporting TRUNCATE on
partitions will improve partition management and the ability to use separate
key caches for different partitions makes MyISAM partitioned tables more
useful. Finally we added a new function TO_SECONDS which makes it
possible to get more fine-grained dates for partitions and subpartitions.

This information will soon find its way into documentation but if you want
to get started right away here it comes.

There are 5 additions effectively:
1) The ability to RANGE partition by column list instead of by function
2) The ability to LIST partition by column list instead of by function
3) A new function TO_SECONDS which can be used in partition functions
and where partition pruning will be used also on ranges.
4) The ability to TRUNCATE a partition
5) The ability to use a keycache in MyISAM per partition

Here are a few examples of how to use these new additions:

1)
CREATE TABLE t1 (a varchar(5), b int)
PARTITION BY RANGE COLUMNS (a,b)
( PARTITION p0 VALUES LESS THAN ("abc", 1),
PARTITION p1 VALUES LESS THAN ("def", 2));

Some things noteworthy here:
The checks for the constants are fairly strict. Thus using "1" as a constant
for b isn't allowed; the constant must be of the same type as the field it
maps to. Also sql_mode will be ignored for these partition constants.
Thus e.g. even if sql_mode specifies that non-existing dates are allowed
they will not be allowed in the partition constants since these constants
will be a part of the table and need to live for longer than the current
session.

Character sets are allowed and the string constants will be interpreted
in the character set their field belongs to. Also character set strings
without mapping are allowed. If one uses SHOW CREATE TABLE
on the partitioned table and the mapping of the partition constants from
the field charset to UTF8 fails, or if mapping to the client charset fails, then
the partition constants will be written in hex string format.

A partition constant can be MAXVALUE; NULL however is not allowed.
There was some consideration of also allowing MINVALUE,
which effectively would make it possible to create partitions where only
the NULL values can go. However this is still possible if one knows
the minimum value of the field.

It's possible to partition on integer fields, string fields and date fields.
It's not possible to partition on BLOBs, SETs, ENUMs, GEOMETRY fields,
BIT fields.

2)
CREATE TABLE t1 (a varchar(1))
PARTITION BY LIST COLUMNS (a)
( PARTITION p0 VALUES IN ("a","b","c"));

CREATE TABLE t1 (a varchar(1), b int)
PARTITION BY LIST COLUMNS (a,b)
( PARTITION p0 VALUES IN (("a",1),("b",2),("c",3)));

Noteworthy here is that parentheses are required when more than one
field is in the list of columns partitioned on, and that parentheses must
not be used when there is only one field.

NULL values are allowed as in MySQL 5.1 but not MAXVALUE.

3)
CREATE TABLE t1 (a datetime)
PARTITION BY RANGE (TO_SECONDS(a))
( PARTITION p0 VALUES LESS THAN (TO_SECONDS("2009-11-11 08:00:00")),
PARTITION p1 VALUES LESS THAN (MAXVALUE));

Same syntax as in MySQL 5.1 but also possible to use TO_SECONDS as a
partition function.

4)
ALTER TABLE t1 TRUNCATE PARTITION p0;
ALTER TABLE t1 TRUNCATE PARTITION p1;
ALTER TABLE t1 TRUNCATE PARTITION ALL;

This deletes all rows in the given partitions and resets the given partitions'
auto_increment values (if any) to 0.

The syntax works in the same manner as for ANALYZE, OPTIMIZE and other
commands that can be applied on partitions already in MySQL 5.1.

5)

CACHE INDEX t1 PARTITION p0 IN keycache_fast;
CACHE INDEX t1 PARTITION p1, p2 IN keycache_slow;
LOAD INDEX INTO CACHE t1 PARTITION p0;
LOAD INDEX INTO CACHE t1 PARTITION p0, p1;

This new syntax makes it possible to have separate key caches for different
partitions in a partitioned table using the MyISAM storage engine.

Tuesday, November 10, 2009

Improvement of MySQL partitioning included in MySQL's next milestone release tree

It has been quite some time since I last blogged. It's not due to
inactivity. Those of you that have followed my blog might
have seen earlier blog posts about a new partitioning feature.

I first blogged about this new partitioning in July 2006 and
that blog is still the 3rd most popular post on my blog, even
when looking at the last month's views. The work on this started
out in 2005 so it's nice to now get it into a state where its
quality is ready for more heavy testing. For those interested
in partitioning I think this feature will enlarge the number of
cases where partitioning is applicable. It's now possible to
partition on many more field types and also on multiple fields
in an efficient manner.

This feature, described by WL#3352, has now been pushed
to the mysql-next-mr tree. For those of you new to our new
milestone release model, this tree is where we push new features
before clone-off. After clone-off this tree is merged with the
mysql-trunk tree and after about 6 months of bug fixing a milestone
release is performed. The current mysql-trunk tree is the tree from
which the current MySQL 5.4 releases are produced. A milestone
release is of beta quality and some milestone releases will be
continued towards a GA release. There will be a new milestone
release at about 3-6 month intervals. For more information on
the release model see here.

The WL#3352 was pushed into the Milestone which has the
codename Betony in the article referred to above. The current
MySQL 5.4 release series is called Azalea (actually 5.4.0 and
5.4.1 belonged to Summit and 5.4.2 was the first Azalea release).

The major advantage of this new feature is that it makes it
possible to partition on string fields and also to partition on
multiple fields and still get good partition pruning. Previously,
partitioning required a partition function that delivered an
integer result and only a couple of functions could deliver good
partition pruning.

Now it is possible to partition on most fields and even a set
of them and always get good partition pruning.

The final result of the new syntax is the following:

CREATE TABLE t1 (a varchar(5) character set ucs2, b int)
PARTITION BY RANGE COLUMNS (a,b)
( PARTITION p0 VALUES LESS THAN (_ucs2 0x2020, 1),
PARTITION p1 VALUES LESS THAN (MAXVALUE, MAXVALUE));

So the keyword COLUMNS indicates that a list of fields is
used to partition on instead of a function. The new
partitioning applies to RANGE and LIST partitioning.

All the management functions for partitioning still apply.
However the major difference comes when you do a query like:

select * from t1 WHERE a > _ucs2 0x2020;

In this case the partition pruning will discover that only
partition p1 can contain matching records and will thus
prune away partition p0. In MySQL 5.1 it's only possible
to perform pruning on intervals for single fields, and
the partition function must also be of a type that is
guaranteed to always increase, such as YEAR or TO_DAYS (actually
this new feature also added the function TO_SECONDS to this
list of functions that can be pruned on efficiently).

So partition pruning on this works very much like an
index. Not surprisingly the partition pruning code
reuses the code for the range optimiser which looks
at what indexes can be used for a certain query.

If you wonder why I am using these _ucs2 constants
as examples, it's because I had to learn a lot about the
character set code in MySQL to get everything right with
this feature. I actually even found and fixed a few MySQL
character set bugs in the process :)

Tuesday, September 15, 2009

New launchpad tree for Column List Partitioning

I have added a new Launchpad tree for an improved
partitioning feature.

This new tree is based off mysql-trunk, which is the base for
the next generation MySQL Server. The tree is now entering QA
and has been extensively tested by development, so it is
very interesting to get feedback on the usability of the feature and
feedback on quality issues. This will speed up the delivery of
this new feature.

You can find more description of the feature in a previous blog:
Description of feature

Tuesday, August 04, 2009

Partition by column_list ready for alpha testers

I got time to spend on a really old worklog I finished coding
back in October 2005. I blogged about it in July 2006 and
interestingly enough it's still the second most read blog
entry on my blog (probably related to search engines in some
way).

I have merged it with the azalea tree (this is an internal code
name for our development tree, name is likely to change). This
tree contains subquery optimisations, Batched join and some more
optimisations.

I have fixed a whole bunch of bugs of the kind that always show up in
early code. The code quality is still alpha but at least you won't find
10 bugs per hour :)


Here you can find the launchpad tree for this code.

There are two important additions made possible by this tree.
1) New function to_seconds that is recognized by range optimiser
to enable partition pruning when partitioning like:
partition by range (to_seconds(time))
2) New partitioning functionality that makes it possible to
perform partition pruning over multiple fields.

Most of the bugs I have fixed had to do with this partition pruning
over multiple fields. The routine that discovers which partitions are
needed is called find_used_partitions (in sql/opt_range.cc) and this
function is called recursively over a key tree. A key tree can be
very complex and more or less has ANDs of key parts using the next_key_part
pointer and OR conditions using the left and right pointers. These left and
right pointers can however show up a little here and there in the tree,
so one has to be very careful about how variables are assigned, saved
and restored. I haven't worked so much with recursive functions, so this
is an interesting adventure.
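
To give a feel for the shape of such a traversal, here is a heavily simplified sketch with my own node type and field layout (not the actual SEL_ARG code in opt_range.cc): the recursion has to save the state for the current field before following next_key_part and restore it before moving on to the left and right siblings.

#include <stddef.h>

#define MAX_FIELDS 16

/* Simplified key-tree node: left/right are OR alternatives on the same
   field, next_key_part is the AND link to the condition on the next field. */
struct key_node
{
  int              field_no;
  int              min_value, max_value;   /* simplified integer bounds */
  struct key_node *left, *right;
  struct key_node *next_key_part;
};

/* Bounds accumulated along one AND path through the tree. */
struct prune_state
{
  int min_value[MAX_FIELDS];
  int max_value[MAX_FIELDS];
};

/* Stand-in for the code that maps a complete range onto partitions. */
static void mark_partitions_in_range(const struct prune_state *s) { (void) s; }

static void find_used_partitions(const struct key_node *node,
                                 struct prune_state *state)
{
  if (node == NULL)
    return;

  /* Left subtree: an alternative range, evaluated with untouched state. */
  find_used_partitions(node->left, state);

  /* Save this field's slot so the siblings see the original values. */
  int saved_min = state->min_value[node->field_no];
  int saved_max = state->max_value[node->field_no];
  state->min_value[node->field_no] = node->min_value;
  state->max_value[node->field_no] = node->max_value;

  if (node->next_key_part != NULL)
    find_used_partitions(node->next_key_part, state);
  else
    mark_partitions_in_range(state);        /* a complete range was found */

  state->min_value[node->field_no] = saved_min;
  state->max_value[node->field_no] = saved_max;

  /* Right subtree: another alternative range on the same field. */
  find_used_partitions(node->right, state);
}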

Here's my latest addition of a test case to give you an idea of how it
works and also what works right now.

create table t1 (a int, b char(10), c varchar(5), d int)
partition by range column_list(a,b,c)
subpartition by key (c,d)
subpartitions 3
( partition p0 values less than (column_list(1,'abc','abc')),
partition p1 values less than (column_list(2,'abc','abc')),
partition p2 values less than (column_list(3,'abc','abc')),
partition p3 values less than (column_list(4,'abc','abc')));

insert into t1 values (1,'a','b',1),(2,'a','b',2),(3,'a','b',3);
insert into t1 values (1,'b','c',1),(2,'b','c',2),(3,'b','c',3);
insert into t1 values (1,'c','d',1),(2,'c','d',2),(3,'c','d',3);
insert into t1 values (1,'d','e',1),(2,'d','e',2),(3,'d','e',3);
select * from t1 where (a = 1 AND b < 'd' AND (c = 'b' OR (c = 'c' AND d = 1)) OR
(a = 1 AND b >= 'a' AND (c = 'c' OR (c = 'd' AND d = 2))));

So in the above select statement we are performing partition
pruning over 3 fields and subpartition pruning over 2 fields
and there are 5 different ranges in the query.

So please go ahead and try this new tree out and see if it
works for you.

Wednesday, July 08, 2009

New update to DBT2 clone with automated sysbench runs

I have made an update to the DBT2 clone where I packed in all my
benchmarking support scripts.

This update adds a new script bench_prepare.sh that should be run
from the benchmark server and takes as input 3 tarballs: the
DBT2 tarball, the sysbench tarball and a MySQL tarball. It will
automatically build all needed binaries on both the benchmark
server and on the MySQL Server machine (they could be the same
machine or different machines).

The script only requires one parameter --default-directory where
one configuration file called autobench.conf should be placed.
This directory will also be used to house all result files,
builds and generated configuration files for all involved scripts.

The aim is to continue development such that we can also benchmark
easily using different Linux versions.

The tarball can be downloaded from here

The script can also handle a MySQL Server which is Windows-based,
but the benchmark server cannot run Windows for the moment.

Friday, June 05, 2009

Follow-up Analysis of Split Rollback Segment Mutex

I performed a new set of tests of the patch to split the
rollback segment mutex on Linux. All these tests gave
positive results with improvements in the order of 2%.

One could also derive from the results some conclusions.
The first conclusion is that this split mainly improves
things when the number of threads is high and thus
contention of mutexes is higher. At 256 threads a number
of results improved up to 15%.

The numbers on lower number of threads were more timid
although in many cases an improvement was still seen.

What was also noticeable was that for the sysbench read-write
variant with fewer reads, which makes the transactions much shorter,
the positive impact was much greater, while the positive
impact on long transactions was much smaller (+0.4%
versus +2.5%). The impact on the short transaction test
with fewer reads was very positive also at lower numbers
of threads; the result on 32 threads improved by 7%.

So the conclusion is that this patch is a useful contribution
to improvements and in particular improves matters on high
number of threads and with short transactions. According to
a comment on the previous blog it is also very positive in
insert benchmarks.

Thursday, June 04, 2009

Results of shootout on split page hash in InnoDB

I have now tried out the buffer split page hash patches on
both a Linux/x86 box and a SPARC/Solaris server (tests done
by Dimitri).

The three variants in short description are:
1) The Google v3 derived patch. This introduces a new array
of mutexes that only protect the buffer page hash. Thus some
extra checking is needed to ensure the page hasn't been
removed from the hash before using it. This is a very simple
and attractive patch from that point of view. The patch uses
an array of 64 mutexes.

2) A variant I developed with some inspiration from the Percona
patches. This patch uses an array of page hashes, each of which has
its own read-write lock. I've tried this with 1, 4 and 16 page
hashes and 4 is the optimum number. The rw-lock protects the
page hash long enough to ensure that the block cannot have been
removed from the hash before the block mutex is acquired.

3) The last variant is a mix of the two first which uses the
simplicity of the Google patch, uses a rw-lock instead and
separate page hashes (to ensure read ahead doesn't have to
go into all mutexes). Used an array of 4 page hashes here.

The conclusion is that the only version that has consistently
improved the MySQL 5.4.0 numbers is the version I originally
developed (2 above).

On sysbench read-write all versions improve the numbers compared to
MySQL 5.4.0. 2) and 3) improve by 2% whereas the original Google
patch improved by 1%.

On sysbench read-only on Linux it was much harder to beat the
MySQL 5.4.0 version. Only 2) did so and only by 0.5%. This is
not so surprising since this mutex is not a blocker for read-only
workloads. 1) gave -1% and 3) gave -0.3%.

On a write intensive workload on Linux 1) and 3) performed 0.5%
better than MySQL 5.4.0 whereas 2) gave 2% improvement.

Finally, on sysbench read-write with fewer reads on Linux, all
variants lost to MySQL 5.4.0: 1) by 2%, 2) by 0.1% and 3) by
1%.

Also the numbers from SPARC/Solaris give similar data. The major
difference is that the positive impact on SPARC servers is much
bigger, all the way up to 30% improvements in some cases. The
most likely reason for this is that SPARC servers
have bigger CPU caches and are thus more held back by lack of
concurrency and not so much by increased working set. The x86
box had 512kB cache per core and a 2MB L3 cache and is likely
to be very sensitive to any increase of the working set.

So the likely explanation for worse numbers in some cases is that
more mutexes or rw-locks give more cache misses.

So given the outcome I will continue to see if I can keep the
simplicity of the Google patch and still maintain the improved
performance of my patch.

Wednesday, June 03, 2009

Some ideas on InnoDB kernel_mutex

I've noted that one reason that InnoDB can get into difficulties
when there are many concurrent transactions in the MySQL Server
is that the hold time of the kernel_mutex often increases
linearly with the number of active transactions. One such
example is in trx_assign_read_view where each transaction
that does a consistent read creates a copy of the transaction
list to be able to deduce the read view of the transaction or
statement.

This means that each transaction is copied to the local transaction
list while holding the critical kernel_mutex.

Another such case is that most operations will set some kind of
intention lock on the table. This lock code will walk through
all locks on the table to check for compatible locks and the
first time it will even do so twice. Thus if all threads use the
same table (as they do in e.g. sysbench) then the number of locks
on the table will be more or less equal to the number of active
transactions.

Thus as an example when running with 256 threads compared to 16
threads the kernel_mutex lock will be held for 16 times longer
and possibly even more since with more contention the mutex is
needed for even longer time to start up waiting transactions.

So this is an obvious problem; what then is the solution?
It's not extremely easy, but one thing one can do is to make the
kernel_mutex into a read-write lock instead of a mutex. Then
many threads can traverse those lists in parallel. It will
still block others needing write access to the kernel_mutex
but it should hopefully improve things.
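
A sketch of the idea (using pthread primitives and invented names purely for illustration, not the actual InnoDB code): the read-only traversals take the lock in shared mode, and only changes to the lists take it in exclusive mode.

#include <pthread.h>
#include <stddef.h>

/* Hypothetical replacement for kernel_mutex. */
static pthread_rwlock_t kernel_lock = PTHREAD_RWLOCK_INITIALIZER;

struct trx { struct trx *next; long id; };
static struct trx *trx_list = NULL;       /* list of active transactions */

/* Building a read view only traverses the list, so shared access is
   enough and many transactions can now do this in parallel. */
long count_active_transactions(void)
{
  long count = 0;
  pthread_rwlock_rdlock(&kernel_lock);
  for (struct trx *t = trx_list; t != NULL; t = t->next)
    count++;
  pthread_rwlock_unlock(&kernel_lock);
  return count;
}

/* Registering a new transaction modifies the list and needs exclusive access. */
void register_transaction(struct trx *t)
{
  pthread_rwlock_wrlock(&kernel_lock);
  t->next = trx_list;
  trx_list = t;
  pthread_rwlock_unlock(&kernel_lock);
}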

Another solution that is also going to improve the situation is
to use thread pools. Thread pools ensure that not as many
threads are active at a time. However we still have the problem
that there can be as many active transactions in parallel as
there are connections (although InnoDB has a limit of 1024
concurrent active transactions). So the thread pool needs
to prioritize connections with active transactions in cases
where there are too many threads active at a time.

This type of load regulation is often used in telecom systems
where it is more important to give priority to those that have
already invested time in running the activity. Newcomers
come in when there are empty slots not taken by
already running activities.

Tuesday, June 02, 2009

Increasing log file size increases performance

I have been trying to analyse a number of new patches we've
developed for MySQL to see their scalability. However I've
gotten very strange results which didn't at all match
my old results, and most of the changes gave a negative impact :(
Not so nice.

As part of debugging the issues with sysbench I decided to go
back to the original version I used previously (sysbench 0.4.8).
Interestingly even then I saw a difference at 16 and 32 threads
whereas on 1-8 threads and 64+ threads the results were the same
as usual.

So I checked my configuration and it turned out that I had changed
log file size to 200M from 1300M and also used 8 read and write
threads instead of 4. I checked quickly and discovered that the
parameter that affected the sysbench results was the log file size.
So increasing the log file size from 200M to 1300M increased the
top result at 32 threads from 3300 to 3750, a nice 15% increase.
The setting of the number of read and write threads had no
significant impact on performance.

This is obviously part of the problem which is currently being
researched both by Mark Callaghan and Dimitri.
Coincidentally Dimitri has just recently blogged about this and
provided a number of more detailed comparisons of the
performance of various settings of the log file size in InnoDB.

Wednesday, May 20, 2009

MySQL 5.4 Webinar

The quality of MySQL 5.4.0 is very high for a beta product.
Four weeks after we released it as beta we have not had
any real serious bugs reported yet. There are some issues
due to deprecation of features, version numbers and a
bug in the SHOW INNODB STATUS printout and some concerns
with the new defaults when running on low-end machines.
It's also important as usual to read the documentation
before upgrading, it contains some instructions needed to
make an upgrade successful. The upgrade issue comes from
changing the defaults of the InnoDB log file sizes.

For those of you who want to know more about MySQL 5.4.0
and its characteristics and why you should use it, please
join this webinar where Allan Packer will explain what
has been done in MySQL 5.4.0.

Tuesday, May 19, 2009

Patches ready for buf page hash split shootout

Today I created a patch that builds on the Google v3
patch where I added some ideas of my own and some ideas
from the Percona patches. The patch is here.

Here is a reference to the patch derived from the Google
v3 patch.

Here is a reference to my original patch (this is likely to
contain a bug somewhere so usage for other than benchmarking
isn't recommended).

So it will be interesting to see a comparison of all those
variants directly against each other on a number of benchmarks.

Analysis of split flush list from buffer pool

In the Google v3 patch the buffer pool mutex has been
split into an array of buffer page hash mutexes and a
buffer flush list mutex, while the buffer pool mutex also
remains.

I derived the patch splitting out the buffer flush list
mutex from the Google v3 patch against the MySQL 5.4.0
tree. The patch is here.

I derived a lot of prototype patches based on MySQL 5.4.0
and Dimitri tried them out. This particular patch seems
to be the most successful in the pack of patches we
tested. It had a consistent positive impact.

The main contribution of this patch is twofold. It
decreases the pressure on the buffer pool mutex by
splitting out a critical part where the oldest dirty
pages are flushed out to disk. In addition this patch
also decreases the pressure on the log_sys mutex by
releasing the log_sys mutex earlier for the mini-
transactions. In addition it removes interaction
between the buffer pool mutex and the log_sys mutex.
So previously both mutexes had to be held for a
while, this is no longer necessary since only the
flush list mutex is needed, not the buffer pool
mutex.

The new patch is the b11 variant which is red in
the comparison graphs.



As we can see the read-write tests have a pretty significant boost
from this patch, it improves top performance by 5% and by 10-20%
on higher number of threads. It also moves the maximum from 16 to
32 threads.



Even on read-only there are some positive improvements although
it is very possible those are more random in nature.



Finally the above picture shows that this patch also moves the
optimal InnoDB thread concurrency up to 24 from 16 since it
allows for more concurrency inside InnoDB. This is also visible
by looking at the numbers for InnoDB Thread Concurrency set to 0
as seen below.

Friday, May 15, 2009

Shootout of split page hash from InnoDB buffer pool mutex

One of the hot mutexes in InnoDB is the buffer pool mutex.
Among other things this mutex protects the page hash where
pages reside when they are in the cache.

There is already a number of variants of how to split out
this mutex. Here follows a short description of the various
approaches.

1) Google v3 approach
Ben Hardy at Google took the approach of using an array of
mutexes (64 mutexes) where these mutexes only protect the
actual read, insert and delete on the page hash table.
This makes for a very simple patch, but it also means
that when the block has been locked one has to check
that the owner of the block hasn't changed, since we didn't
protect the block between the read of the hash and the
locking of the block; someone could come in
between and grab the block for another page before we
get to lock the block. In addition this patch focuses
mainly on optimising the path in buf_page_get_gen,
which is the routine used to get a page from the page
cache and thus the hot-spot.
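
The re-check can be illustrated with a small sketch (simplified types and invented names, not the actual InnoDB code): the hash mutex only covers the lookup, so after taking the block mutex we must verify that the block still holds the page we asked for.

#include <pthread.h>
#include <stddef.h>

#define N_HASH_MUTEXES 64

struct buf_block
{
  pthread_mutex_t mutex;
  unsigned long   space_id;
  unsigned long   page_no;
};

static pthread_mutex_t page_hash_mutex[N_HASH_MUTEXES];

void init_page_hash_mutexes(void)
{
  for (int i = 0; i < N_HASH_MUTEXES; i++)
    pthread_mutex_init(&page_hash_mutex[i], NULL);
}

/* Stub standing in for the real page hash search. */
static struct buf_block *hash_lookup(unsigned long space_id,
                                     unsigned long page_no)
{
  (void) space_id; (void) page_no;
  return NULL;
}

struct buf_block *get_block(unsigned long space_id, unsigned long page_no)
{
  pthread_mutex_t *hm = &page_hash_mutex[page_no % N_HASH_MUTEXES];

  pthread_mutex_lock(hm);
  struct buf_block *block = hash_lookup(space_id, page_no);
  pthread_mutex_unlock(hm);       /* the hash no longer protects the block */

  if (block == NULL)
    return NULL;

  pthread_mutex_lock(&block->mutex);
  /* The block may have been evicted and reused for another page in the
     window between the hash lookup and taking the block mutex. */
  if (block->space_id != space_id || block->page_no != page_no)
  {
    pthread_mutex_unlock(&block->mutex);
    return NULL;                  /* caller retries the lookup */
  }
  return block;                   /* returned with the block mutex held */
}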

2) Percona approaches
Percona has done a series of approaches where the first
only split out the page hash under one mutex, still protecting
the blocks from being changed while holding this mutex.
The next step was to change the mutex into a read-write lock.

3) My approach
My approach was inspired by Percona but added two main
things. First it split the page hash into a number of
page hashes and had one RW-lock per page hash (this
number has been tested with 4, 8 and 16 and 4 was the
optimum, on Linux at least). In addition, to avoid having
to lock and unlock multiple page hashes while going through
the read-ahead code, the hash function that decides which
page hash to use maps all pages within 1 MByte (which is the
unit of read ahead in InnoDB) to the same page hash.
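
A sketch of such a hash function (assuming the default 16 kB InnoDB page size, so 64 pages per 1 MByte read-ahead area; my own simplification):

/* Map a page number to a page hash instance so that all pages in the
   same 1 MByte read-ahead area (64 pages of 16 kB) share a page hash. */
#define PAGES_PER_AREA 64    /* 1 MByte / 16 kB, assuming the default page size */
#define N_PAGE_HASHES   4

unsigned page_hash_index(unsigned long page_no)
{
  return (unsigned) ((page_no / PAGES_PER_AREA) % N_PAGE_HASHES);
}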

Pros and Cons

The Google patch is the simplest, and by focusing only on
buf_page_get_gen it avoids a lot of possible extra traps
that are likely if one tries to solve too much of the
problem at once.

Using a RW-lock instead of a mutex seems like at least
one way of improving the concurrency, but it could of
course impose a higher overhead as well, so
benchmarking should show which is best here.

When using an array of locks it makes sense to optimise
for read ahead functionality since this is a hot-spot
in the code as has been shown in some blogs lately.

4) Mixed approach
So a natural solution is then to also try a mix of the
Google variant with my approach. That means still using an
array of locks (either mutexes or RW-locks, whichever
has the optimal performance) but ensuring that the
pages within a read-ahead area are locked by the same
lock.

This approach reuses the simplicity of the Google
approach and its total lack of deadlock problems,
combined with the optimised layout from
my approach and the idea of RW-locks from Percona.

We don't have any results of this shootout yet.
The shootout should also discover the optimum number
of areas to split the page cache into: Google has
used 64, but my results so far indicate that
4 is more appropriate.

Thursday, May 14, 2009

More analysis of InnoDB Thread Concurrency

When I worked with Dimitri on the analysis of the
Split Rollback Segment Mutex he came up with numbers
on InnoDB Thread Concurrency set to 16 and 32 and I was curious
to see if 24 was the optimal setting. So he made some new runs and
some new graphs that I found interesting.

The first graph analyses the behaviour of MySQL 5.4.0 on a SPARC
server using InnoDB Thread Concurrency set to 0, 16, 24 and 32.
Interestingly for both readonly and readwrite benchmarks the
optimal setting for concurrency is 16, whereas the top numbers
(at 32 threads) are achieved with concurrency set to 24 or 32.





So on the current MySQL 5.4.0 on this particular benchmark and
platform it seems that 16 is the optimal setting. However Dimitri
also analysed the same thing using the new patch for Splitting the
Rollback Segment Mutex and now the story changes.





So with this patch, setting InnoDB Thread Concurrency to 24
is the optimum setting. So it's clear that as we get more
and more improvements to the scalability of the MySQL Server and
InnoDB, it will be optimal to allow more and more parallel threads
inside InnoDB as well. This means that this setting is quite
likely to change as development proceeds, but for MySQL 5.4.0 a
setting of around 16-24 is often a good one. To actually change
the default setting requires much more testing of various
workloads on many different computer architectures.

Similar testing I have performed on Linux using sysbench implies
that the optimal setting there is around 24-28. Also the difference
between setting it to 0 and 24 is much smaller on Linux (15%
on 256 threads as shown in yesterday's blog). We haven't yet analysed
why the difference is so much bigger on these SPARC servers.

Analysis of Split of Rollback Segment Mutex

When I read the blog about Split Rollback Segment Mutex,
I was interested to verify those results in the context of MySQL 5.4.0.

The patch can be found here.

We've analysed this patch both on a large SPARC system and on my
benchmark x86/Linux machine. Our results tend to be positive for
readwrite benchmarks but sometimes negative for readonly
benchmarks. Also the gain is much smaller than found in the
blog.

This patch also has two negative effects. The first is that it
presents an upgrade problem; this can probably be handled in the
InnoDB code, but requires quite some digging. The other is that
instead of writing UNDO records to one UNDO log, we write them to
several UNDO logs, thus decreasing the buffering effect towards the
file system.

On Linux I found up to 7-8% improvements of the top results
on readwrite benchmarks. On readonly it sometimes dropped
about 1-3%. I also tried with varying numbers of rollback
segments and found 4 and 8 to be better than 16. So from the
above point of view the number of rollback segments set to 4 is
probably best. The patch uses 8 (it's actually set to 9 since
the system rollback segment is a bit special).

Here are some graphs from Dimitri running it on some fat SPARC
server (MySQL-5.Perfb9-gcc43 is 5.4.0 plus the above patch).

The first graph shows the behaviour when InnoDB Thread Concurrency
is 0; here we see a speedup in the range of 3-5%.



The same results for the readonly benchmark shows positive results as well.



When one sets InnoDB Thread Concurrency equal to 16, 24 or 32
the behaviour is different. It turns out that we get worse
performance using 16 but a more positive impact using 24 and
even more using 32. So it seems that this patch requires fewer
limits on parallelism to get the best behaviour.

So one impact of this patch is that it can sustain a higher
number of concurrent threads and there is a small positive impact
on the performance as well.

Wednesday, May 13, 2009

More data on InnoDB Thread Concurrency

Here is the performance graph comparing using
InnoDB Thread Concurrency equal to 0 and
InnoDB Thread Concurrency equal to 24 using
sysbench readwrite with the new InnoDB
Thread concurrency algorithm as introduced
in MySQL 5.4.0.

Analysis of Google patches on 4,8 and 12 cores

One of the goals we had originally with the MySQL 5.4
development was to improve scaling from 4 cores to
8 cores. So in my early testing I ran comparisons of
the Google SMP + IO + tcmalloc patches on 4, 8 and 12
cores to see how they behaved compared with a stock
MySQL 5.1.28 version (note that the comparison here was
done on a very early version of 5.4; 5.4.0 has a
set of additional patches applied to it).



What we can see here is that the Google SMP patch and the use
of tcmalloc make a difference already on a 4-core server
using 4 threads. On 1 and 2 threads the difference is only
on the order of 1-2% and thus not really significant.

An interesting note in the graph is that 8-core numbers using
the Google improvements outperform the 12-core stock MySQL
5.1.28.

So what we concluded in those graphs is that the scaling from 4-cores
to 8-cores had improved greatly and that there also was a good scaling
from 8 cores to 12 cores. This improvement increased even more with
the 5.4 release. The main purpose of showing these numbers is to show
the difference between 4, 8 and 12 cores.

All benchmarks were executed on a 16-core x86 box with 4 cores
dedicated to running sysbench.

Analysis of Google patches in MySQL 5.4

Early on in the MySQL 5.4 development we tried out the
impact of the Google SMP patch and the Google IO patch.
At first we wanted to see which of the patches made
the most impact. The Google patches in MySQL 5.4
have at least 3 components that impact the performance.
1) Replace InnoDB memory manager by a malloc variant
2) Replace InnoDB RW-lock implementation
3) Make InnoDB use more IO threads

When disabling the InnoDB memory manager one opens up a whole array
of potential candidates for malloc. Our work concluded
that tcmalloc behaved best on Linux and mtmalloc was
best on Solaris; see the blog posts on Solaris below.

Malloc on Solaris investigation


Battle of the Mallocators on Solaris


I did also do some testing on Linux where I compared 4 different
cases (all variants were based on MySQL 5.1.28):
1) Using the Google SMP patch, Google IO patch (with 4 read and
4 write threads) and using tcmalloc
2) Using tcmalloc and no other Google patches
3) Using plain malloc from libc
4) Using plain MySQL 5.1.28 using InnoDB memory manager

Here are the results:


So as we can see here the replacement of the InnoDB memory manager
by standard malloc had no benefits whereas replacing it with
tcmalloc gave 10% extra performance. The Google SMP patch added
another 10% performance in sysbench readwrite. We have also
tested other OLTP benchmarks where the Google SMP patch added
about 5-10% performance improvement. As shown by Mark Callaghan
there are however other benchmarks where the Google SMP patch
provides much greater improvements.

Tuesday, May 12, 2009

MySQL 5.4 Patches: InnoDB Thread Concurrency

When benchmarking MySQL with InnoDB we quickly discovered
that using InnoDB Thread Concurrency set to 0 was an
improvement to performance since the implementation of
InnoDB Thread Concurrency used a mutex which in itself was
a scalability bottleneck.

Given that InnoDB Thread Concurrency is a nice feature that
ensures that one gets good performance also on an overloaded
server I was hoping to find a way to make the implementation
of this more scalable.

I tried out many different techniques using a combination of
mutexes and atomic variables. However every technique fell
flat and was less performant than setting it to 0 and not
using the InnoDB Thread Concurrency implementation. So I was
ready to give up the effort and move on to other ideas.

However after sleeping on it an inspirational idea came up:
why use a mutex at all? Let's see how it works if we use the
OS scheduler to queue the threads that need to be blocked. This
should be more scalable than a mutex-based approach.
There is obviously one bad thing about this approach, which is
that new arrivals can enter before old waiters. To
ensure we don't suffer too much from this, a limit on the wait
was necessary.

So I quickly put together a solution that called yield once
and slept for 10 milliseconds twice at most, and every time it
woke up it checked an atomic variable to see if it was ok
to enter. After those three attempts it would enter without
checking.
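
A minimal sketch of that admission logic (invented names, GCC built-in atomics and a plain usleep; the real implementation differs in its details):

#include <unistd.h>
#include <sched.h>

static volatile int threads_inside = 0;   /* threads currently inside InnoDB */
static int innodb_thread_concurrency = 24;

void enter_innodb(void)
{
  /* Attempt 1 yields, attempts 2 and 3 sleep 10 ms; after that we
     enter without checking so nobody waits forever. */
  for (int attempt = 0; attempt < 3; attempt++)
  {
    if (threads_inside < innodb_thread_concurrency)
      break;
    if (attempt == 0)
      sched_yield();
    else
      usleep(10000);                      /* 10 milliseconds */
  }
  __sync_add_and_fetch(&threads_inside, 1);
}

void exit_innodb(void)
{
  __sync_sub_and_fetch(&threads_inside, 1);
}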

I tried it and saw a 1% decrease on low concurrency and 5%
improvement on 32 threads and 10% on 64 threads and 15% on 128
threads. Voila, it worked. Now I decided to search for the
optimal solution to see how many yields and sleeps would be best.
It turned out I had found the optimal number at the first attempt.

The implementation still has corner cases where it provides less
benefits so I kept the possibility to use the old implementation by
adding a new variable here.

So currently the default in MySQL 5.4 is still 0 for InnoDB Thread
Concurrency. However we generally see optimal behaviour using
InnoDB Thread Concurrency set to around 24, setting it higher is
not bringing any real value to MySQL 5.4.0 and setting it lower
decreases the possible performance one can achieve. This seems
to be a fairly generic set-up that should work well in most cases.
We might change the defaults for this later.

Thursday, April 23, 2009

Join Executor for MySQL Cluster

Jonas in the Cluster team reported on his work on executing
joins in the NDB kernel for MySQL Cluster here.

This is very interesting work we have in progress at MySQL.
We are working on an extension of the Storage Engine API
where the MySQL Server will present an Abstract Query Tree
to the Storage Engine. The Storage Engine can then decide to
execute the query on its own or decide that the MySQL Server
should execute it in the classic manner. In the first prototype
the optimisation will be done as usual and only after the
optimisation phase will we present the join to the storage
engine. However the specification also covers work on
integrating this with the optimiser and also enabling the
possibility for the storage engine to execute parts of the
query rather than the entire one. The specification of this
work can be found here.

Jonas is working on the backend part for this interface in
MySQL Cluster.

What is interesting with pushing joins to the NDB kernel is that
it becomes very easy to parallelize the join execution. So what
will happen when this feature is ready is that MySQL Cluster
will shine on join performance and enable very good
performance for all sorts of applications using SQL.

The reason that MySQL Cluster can so easily parallelize the query
execution of the join is the software architecture of the
NDB kernel. The NDB kernel is entirely developed as a message
passing architecture. To start a new thread of execution in the
NDB kernel one simply sends two messages when executing one
message, and to stop a thread one simply doesn't send any messages
when executing a message. The problem is then more that one
should not parallelize too much and run out of resources in the
system.

So with this development MySQL Cluster will also be shining at
Data Mining in an OLTP database. MySQL Cluster is designed for
systems where you need massive amounts of read and write
bandwidth (the cost of writing your data is close to the cost
of reading the data). So with the new features it will be
possible to do Data Mining on data updated in real time. Most
Data Mining is performed on a specialised Data Warehousing
solution. But to achieve this you need to transfer the data to
the Data Warehouse. With MySQL Cluster it will be possible to
both use the database for OLTP applications with heavy updates
always occurring while still querying the data with parallel
queries. MySQL Cluster is very efficient at
executing individual queries in the NDB kernel and can also
scale to very many machines and CPU cores.

Data on MySQL Performance

If you like to sift through tons of benchmark data about various
MySQL versions, Dimitri at the Sun Benchmark Labs has published
a serious amount of benchmark data in a report published here.

The report shows that the new MySQL 5.4.0 release
has very good performance. The report also shows what the day
of a developer of performance improvements looks like, and the
massive amount of benchmark data that needs to be analysed and
sifted through to understand the impact of new performance improvements.

I personally met Dimitri the first time in 2002 when I was working
together with him for a couple of weeks on a benchmark on NDB Cluster
(the storage engine of MySQL Cluster). Our goal then was to perform
1 million reads per second on a 72-cpu SPARC box with UltraSparc-III
CPU's @900MHz. We struggled a lot at the time but finally we managed
to achieve the numbers we were hoping for. We actually surpassed the
goal and reached 1.5 million reads per second and we also tried an
update benchmark where we managed to do 340.000 update transactions
per second (generating a disk write bandwidth of 250 MByte per second).

This benchmark was interesting from a scientific point of view. When
I defended my Ph.D thesis I claimed that one could get superlinear
performance increases when adding more CPU's to a problem in the
database world. To achieve this the workload needs to be constant and the
number of CPU's increased. By increasing the number of CPU's and keeping
the workload constant more CPU cache memory is used on the problem.
This means that each CPU will execute more efficiently.

In the above benchmark we managed to verify the claim I made when
defending my Ph.D thesis, which I found very positive. The results we
achieved on a 16-node cluster were 500.000 reads per second and on a
32-node cluster we reached 1.500.000 reads per second.

Dimitri has a background from the past of developing his own
homegrown database, so we have had many interesting discussions both
then and now on how to achieve the best performance of NDB and
the MySQL Server.

Wednesday, April 22, 2009

MySQL 5.4 Patches: Improvements to spin-loop

In InnoDB there is an implementation of both mutexes
and RW-locks. The RW-lock implementation has been
improved by the Google SMP patches. Both of these
implementations rely on spin-loops. The default in
InnoDB is to check the condition and, if it's not ok to
enter, to spin for about 5 microseconds and then come back
to check the condition again.

The Intel manual's recommendation for spin-loops is to
execute a PAUSE instruction and then immediately check
the condition again, i.e. a much more active checking
of the condition. When we tried this out using the
sysbench benchmark we found that the Intel approach
actually worsened performance. So instead we tried
putting the PAUSE instruction inside the existing
InnoDB spin-loop.
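
As an illustration only (a minimal sketch, not the actual InnoDB code;
the class and function names here are invented), the difference is
roughly this: keep the "back off for a while, then recheck" structure,
but let the back-off delay execute PAUSE via the _mm_pause() intrinsic:

#include <atomic>
#include <emmintrin.h>   // _mm_pause(), emits the x86 PAUSE instruction

// Hypothetical spin delay: burn a short, bounded amount of time while
// telling the CPU (via PAUSE) that we are in a spin-wait loop.
static void spin_delay(int rounds) {
  for (int i = 0; i < rounds; ++i)
    _mm_pause();
}

// Hypothetical spin-lock illustrating the structure described above.
class SpinMutex {
public:
  void lock() {
    for (;;) {
      if (!locked_.exchange(true, std::memory_order_acquire))
        return;                         // condition ok: we got the lock
      while (locked_.load(std::memory_order_relaxed))
        spin_delay(50);                 // back off for a while, then recheck
    }
  }
  void unlock() { locked_.store(false, std::memory_order_release); }

private:
  std::atomic<bool> locked_{false};
};

The point is only to show where PAUSE ends up: inside the back-off
delay, rather than as a tight PAUSE-and-recheck loop around the
condition itself.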

This approach turned out to be a success. Even on
machines with only one thread per core we were able
to get a 3-4% increase in throughput. We also tried
various settings for the default spin time in the
spin-loop and found that the original default values
were very close to optimal. The optimum we found was
about 20% away from the old default values, so we
made this slight change to the spin-loop defaults.

My expectation is that as we remove locks, so that the
mutexes and RW-locks become less contended and the
waiting is spread over more locks, this optimum value
will change. The current best setting is very likely
governed by the fact that most waiting today happens
on a few very hot locks. So with improvements to the
mutexes and RW-locks we should expect to see better
performance with a shorter time spent in the spin-loop.

On the new SPARC CPUs that Sun has developed, the
CMT boxes, we used the results from the paper
www.ideal.ece.ufl.edu/workshops/wiosca08/paper2.pdf,
which states that the optimum instruction to use in
a spin-loop is one that causes a cache miss. However,
as I don't know how to deliberately program a cache
miss, we opted for the second best instruction, a
dummy test-and-set instruction. So the PAUSE
instruction is replaced by a test-and-set instruction
on SPARC CPUs.
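
In code this can be expressed as an architecture-dependent delay
primitive, roughly like the sketch below (not the actual InnoDB macro;
the names are mine). On SPARC, ldstub is the atomic test-and-set
instruction, here applied to a dummy byte that nothing ever reads:

// Hypothetical per-architecture "relax" primitive for the spin delay.
#if defined(__x86_64__) || defined(__i386__)
#  define RELAX_CPU() __asm__ __volatile__("pause")
#elif defined(__sparc__)
static unsigned char relax_dummy;
   /* No PAUSE on SPARC: do a dummy atomic test-and-set (ldstub) into %g0,
      purely to keep the spinning strand out of the pipeline for a while. */
#  define RELAX_CPU() \
     __asm__ __volatile__("ldstub [%0], %%g0" : : "r"(&relax_dummy) : "memory")
#else
#  define RELAX_CPU() ((void)0)  /* fall back to a plain empty iteration */
#endif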

We expect the improvement from this small change to
be even bigger when there are multiple threads per
core, since contention on the CPU pipeline is higher
in those cases and it's important that the spin-loop
stays away as much as possible from actively
executing instructions.

Tuesday, April 21, 2009

MySQL 5.4 Scaling to 16-way x86 and 64-way CMT Servers

The MySQL 5.4 release contains patches which increase the scalability
of the MySQL Server. I am planning to blog about those changes in some
detail over the next few days. This blog gives an introduction and
shows the overall results we have achieved.

The changes we have done in MySQL 5.4 to improve scalability and
the ability to monitor the MySQL Server are:

1) Google SMP patch
2) Google IO patches
3) Update of many antiquated defaults in the MySQL Server
4) New InnoDB Thread Concurrency algorithm
5) Improved Spinloop in InnoDB mutexes and RW-locks
6) A couple of performance fixes backported from 6.0
7) Operating system specific optimisations
8) Ported the Google SMP patch to Solaris x86 and SPARC, with work
underway for Windows and the Intel compiler as well
9) Introducing DTrace probes in the MySQL Server
10) A build script to make it easier for the community to build an
efficient MySQL Server from source code

As an example of the improvements made available through this work we
have some benchmarks using sysbench readwrite and readonly.

We have consistently seen improvements on the order of 30-40% in the
sysbench top numbers, and at large numbers of threads 5.4.0 drops
much less in performance than 5.1. The new InnoDB Thread Concurrency
patch makes the results at high thread counts even more impressive:
they have gone up by another 5-15% at the expense of about 1% on the
top results (there are even some DBT2 runs that gave a 200%
improvement with the new algorithm).

There is also a benchmark on EAStress which shows a 59% increase in
performance from 5.1 to 5.4 using the new 16-way x86 Nehalem servers.

MySQL Cluster 7.0 scales linearly in two dimensions

As recently reported on my blog, we have managed to get MySQL Cluster
CGE 6.3 to scale linearly with the addition of more nodes into the
system. In MySQL Cluster CGE 6.3 each node has a single thread handling
most of the work, together with a set of file system threads.

In MySQL Cluster 7.0 the data nodes are now multithreaded. The design
in 7.0 follows the very efficient design of 6.3, where each thread has
absolutely no lock contention with other threads. All communication
between threads happens through messages. This means that the
scalability of the data nodes is excellent. The single thread has been
split into up to four local data threads, one transaction handling
thread and one socket communication thread, plus the already existing
file system threads. With this set-up each data node can process 4.6x
more DBT2 transactions compared to 6.3.
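
As a rough illustration of what this looks like in practice (a sketch
from memory rather than a reference; check the 7.0 documentation for
the exact parameter names), the multithreaded data node is the ndbmtd
binary, and the number of execution threads is set in the cluster
configuration file, e.g.:

# config.ini (excerpt) -- hypothetical example values
[ndbd default]
NoOfReplicas=2
DataMemory=8G
IndexMemory=1G
# With ndbmtd, the highest setting splits the work across several local
# data threads plus the transaction and socket communication threads.
MaxNoOfExecutionThreads=8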

This means that a 2-node cluster in 7.0 has the same performance as a
10-node cluster in 6.3, and a 4-node cluster has similar performance
to a 20-node cluster in 6.3. As earlier blogged, each data node can
handle many hundreds of thousands of operations per second, so a
cluster of such nodes can handle many millions of operations per second.

The efficiency of the data node is such that one data node can handle
the traffic from a set of MySQL Servers residing on a 24-core server.
So an example of a basic set-up for MySQL Cluster 7.0 is to use two
8-core boxes with lots of memory and lots of disk bandwidth for the
data nodes, and two 24-core servers for the MySQL Servers, which
mostly require CPU and networking bandwidth.

An important consideration when setting up MySQL Cluster 7.0 is to
ensure that interrupts from the network stack don't kill performance,
and also to have separate network infrastructure between the data
nodes in the cluster, since it is very easy to overload the network
given the capabilities of the MySQL Cluster software.

MySQL 5.4 Acknowledgements

The work that started when MySQL was acquired by Sun has now begun to
bear fruit. Very soon after the acquisition a Sun team was formed to
assist the MySQL performance team in improving the scalability of the
MySQL Server. At the same time Google has also been very active in
improving the scalability of InnoDB. The MySQL 5.4 scalability
improvements are very much the combined result of the efforts of the
MySQL performance team, the Sun performance team and Google.

It's extremely fruitful to work with such a competent set of people. The
Sun team has experience from scaling Oracle, DB2, Informix and so forth
and knows extremely well how the interaction of software and hardware
affects performance. The Google patches have shown themselves to be of
excellent quality. In our early internal testing we found two bugs, and
both had already been fixed by the Google team, so the turnaround time
was a day or two. For the last months we haven't found any issues. The
MySQL performance team has also been able to add a few small but
effective improvements on top of the Google patches.

MySQL 5.4 also introduces DTrace support in the MySQL Server. This code
is the result of cooperation with the MySQL 6.0 development team; the
original patch was developed for 6.0. We have spent quite some time
getting the DTrace support working on all variants of the Solaris and
Mac OS X platforms. For anyone interested in getting DTrace probes into
their application, I think the MySQL example is probably the most
advanced example currently available of user-level DTrace probes and
of building such probes into a complex build system.

Working with competent and motivated people is always great fun, so this
has been a very rewarding project for me personally. I have always liked
working on performance improvements; in my work on founding
MySQL Cluster we were involved in many such efforts and so far they have
almost always been successful. Actually, we're now also releasing a new
version, MySQL Cluster 7.0, with its own set of extreme performance
improvements which I will cover in a separate blog.

Saturday, January 17, 2009

New DBT2 version uploaded with more documentation of new scripts

I have had a number of requests for help on how to use the DBT2
tree I'm maintaining on www.iclaustron.com. It contains an extensive
set of scripts that make it very easy to perform DBT2 runs and
to start and stop cluster nodes and MySQL Servers. I personally
also use it to start MySQL Servers and clusters even when not
using DBT2.

However, these scripts haven't had an overall description until now,
although each component is very thoroughly documented via
--help on the scripts (I tend to document these things very
heavily since I otherwise forget them myself).

Now I have added a new README file, README-ICLAUSTRON, which
explains which scripts are used and how they relate to each
other, which configuration files to set up, and where to find
example configuration files.

Hopefully this will make it easier to use DBT2,
particularly DBT2 for MySQL Cluster.