Tuesday, September 15, 2009

New launchpad tree for Column List Partitioning

I have added a new Launchpad tree for an improved
partitioning feature.

This new tree is based off mysql-trunk which is the base for
the next generation MySQL Server. The tree is now entering QA
and have been extensively tested by development and thus it is
very interesting to get feedback on usability of feature and
feedback on quality issues. This will speed up the delivery of
this new feature.

You can find more description of the feature in a previous blog:
Description of feature

Tuesday, August 04, 2009

Partition by column_list ready for alpha testers

I got time to spend on a really old worklog I completed coding
already october 2005. I blogged about it in July 2006 and
interestingly enough it's still the second most read blog
entry on my blog (probably related to search engines in some
way).

I have merged it with the azalea tree (this is an internal code
name for our development tree, name is likely to change). This
tree contains subquery optimisations, Batched join and some more
optimisations.

I have fixed a whole bunch of bugs that always shows up in early
code. The code quality is still alpha but at least you won't find
10 bugs per hour :)


Here
you can find the launch pad tree for this code.

There are two important additions made possible by this tree.
1) New function to_seconds that is recognized by range optimiser
to enable partition pruning when partitioning like:
partition by range (to_seconds(time))
2) New partitioning functionality that makes it possible to
perform partition pruning over multiple fields.

Most of the bugs I have fixed had to do with this partition pruning
of multiple fields. The routine to discover which partitions are
needed is called find_used_partitions (in sql/opt_range.cc) and this
function is called recursively over a key tree. A key tree can be
very complex and more or less have AND of key parts using next_key_part
pointer and OR condition using left and right pointers. These left and
right pointers can however show up a little here and there in the tree
so one has to be very careful about how variables are assigned, saved
and restored. I havent' worked so much with recursive functions so this
is an interesting adventure.

Here's my latest addition of a test case to give you an idea of how it
works and also what works right now.

create table t1 (a int, b char(10), c varchar(5), d int)
partition by range column_list(a,b,c)
subpartition by key (c,d)
subpartitions 3
( partition p0 values less than (column_list(1,'abc','abc')),
partition p1 values less than (column_list(2,'abc','abc')),
partition p2 values less than (column_list(3,'abc','abc')),
partition p3 values less than (column_list(4,'abc','abc')));

insert into t1 values (1,'a','b',1),(2,'a','b',2),(3,'a','b',3);
insert into t1 values (1,'b','c',1),(2,'b','c',2),(3,'b','c',3);
insert into t1 values (1,'c','d',1),(2,'c','d',2),(3,'c','d',3);
insert into t1 values (1,'d','e',1),(2,'d','e',2),(3,'d','e',3);
select * from t1 where (a = 1 AND b < 'd' AND (c = 'b' OR (c = 'c' AND d = 1)) OR
(a = 1 AND b >= 'a' AND (c = 'c' OR (c = 'd' AND d = 2))));

So in the above select statement we are performing partition
pruning over 3 fields and subpartition pruning over 2 fields
and there are 5 different ranges in the query.

So please go ahead and try this new tree out and see if it
works for you.

Labels: , ,

Wednesday, July 08, 2009

New update to DBT2 clone with automated sysbench runs

I have made an update to the DBT2 clone where I packed in all my
benchmarking support scripts.

This update adds a new script bench_prepare.sh that should be run
from the benchmark server and uses the input of 3 tarballs, the
DBT2 tarball, the sysbench tarball and a MySQL tarball. It will
automatically build all needed binaries on both the benchmark
server and on the MySQL Server machine (they could be on same
machine or on different machine).

The script only requires one parameter --default-directory where
one configuration file called autobench.conf should be placed.
This directory will also be used to house all result files,
builds and generated configuration files for all involved scripts.

The aim is to continue develop such that we can also benchmark
easily using different Linux versions.

The tarball can be downloaded from here

The script can also handle a MySQL Server which is Windows-based,
but the benchmark server cannot run Windows for the moment.

Friday, June 05, 2009

Follow-up Analysis of Split Rollback Segment Mutex

I performed a new set of tests of the patch to split the
rollback segment mutex on Linux. All these tests gave
positive results with improvements in the order of 2%.

One could also derive from the results some conclusions.
The first conclusion is that this split mainly improves
things when the number of threads is high and thus
contention of mutexes is higher. At 256 threads a number
of results improved up to 15%.

The numbers on lower number of threads were more timid
although in many cases an improvement was still seen.

What was also noticeable was that the sysbench read-write
with less reads which makes the transactions much shorter
the positive impact was much greater and the positive
impact on long transactions was much smaller (+0.4%
versus +2.5%). The impact on the short transaction test
with less reads was very positive also on lower number
of threads, the result on 32 threads improved 7%.

So the conclusion is that this patch is a useful contribution
to improvements and in particular improves matters on high
number of threads and with short transactions. According to
a comment on the previous blog it is also very positive in
insert benchmarks.

Labels: , ,

Thursday, June 04, 2009

Results of shootout on split page hash in InnoDB

I have now tried out the buffer split page hash patches on
both a Linux/x86 box and a SPARC/Solaris server (tests done
by Dimitri).

The three variants in short description are:
1) The Google v3 derived patch. This introduces a new array
of mutexes that only protect the buffer page hash. Thus some
extra checking is needed to ensure the page hasn't been
removed from the hash before using it. This is a very simple
and attractive patch from that point of view. The patch uses
an array of 64 mutexes.

2) A variant I developed with some inspiration from the Percona
patches. This patch uses an array of page hashes which each has
its own read-write lock. I've tried this with 1, 4 and 16 page
hashes and 4 is the optimum number. The rw-lock protects the
page hash long enough to ensure that the block hasn't been
possible to remove from the hash before the mutex is acquired.

3) The last variant is a mix of the two first which uses the
simplicity of the Google patch, uses a rw-lock instead and
separate page hashes (to ensure read ahead doesn't have to
go into all mutexes). Used an array of 4 page hashes here.

The conclusion is that the only version that has consistently
improved the MySQL 5.4.0 numbers is the version I originally
developed (2 above).

On sysbench read-write all versions improve numbers compared to
MySQL 5.4.0. 2 and 3 improve 2% whereas the original Google
patch improved with 1%.

On sysbench read-only on Linux it was much harder to beat the
MySQL 5.4.0 version. Only 2) did so and only by 0.5%. This is
not so surprising since this mutex is not a blocker for read-only
workloads. 1) gave -1% and 3) gave -0.3%.

On a write intensive workload on Linux 1) and 3) performed 0.5%
better than MySQL 5.4.0 whereas 2) gave 2% improvement.

Finally on a sysbench read-write with less reads on Linux, all
variants lost to MySQL 5.4.0. 1) by 2%, 2) by 0.1% and 3) by
1%.

Also the numbers from SPARC/Solaris give similar data. The major
difference is that the positive impact on SPARC servers is much
bigger, all the way up to 30% improvements in some cases. The
most likely reason for this is that SPARC servers
have bigger CPU caches and are thus more held back by lack of
concurrency and not so much by increased working set. The x86
box had 512kB cache per core and a 2MB L3 cache and is likely
to be very sensitive to any increase of the working set.

So the likely rationale for worse numbers in some cases is that
more mutexes or rw-locks gives more cache misses.

So given the outcome I will continue to see if I can keep the
simplicity of the Google patch and still maintain the improved
performance of my patch.

Labels: , ,

Wednesday, June 03, 2009

Some ideas on InnoDB kernel_mutex

I've noted that one reason that InnoDB can get difficulties
when there are many concurrent transactions in the MySQL Server
is that the lock time of the kernel_mutex often increases
linearly with the number of active transactions. One such
example is in trx_assign_read_view where each transaction
that does a consistent read creates a copy of the transaction
list to be able to deduce the read view of the transaction or
statement.

This means that each transaction is copied to the local transaction
list while holding the critical kernel_mutex.

Another such case is that most operations will set some kind of
intention lock on the table. This lock code will walk through
all locks on the table to check for compatible locks and the
first time it will even do so twice. Thus if all threads use the
same table (as they do in e.g. sysbench) then the number of locks
on the table will be more or less equal to the number of active
transactions.

Thus as an example when running with 256 threads compared to 16
threads the kernel_mutex lock will be held for 16 times longer
and possibly even more since with more contention the mutex is
needed for even longer time to start up waiting transactions.

So this is an obvious problem, so what is then the solution?
Not extremely easy but one thing one can do is to make the
kernel_mutex into a read-write lock instead of a mutex. Then
many threads can traverse those lists in parallel. It will
still block others needing write access to the kernel_mutex
but it should hopefully improve things.

Another solution that is also going to improve the problem is
to use thread pools. Thread pools ensure that not as many
threads are active at a time. However we still have a problem
that transactions can still be as many active in parallel as
there are connections (although InnoDB has a limit of 1024
concurrent active transactions). So the thread pool needs
to prioritize connections with active transactions in cases
where there are too many threads active at a time.

This type of load regulation is often used in telecom systems
where it is more important to give priority to those that have
already invested time in running the activity. Those that are
newcomer comes in when there are empty slots not taken by
already running activities.

Labels: , ,

Tuesday, June 02, 2009

Increasing log file size increases performance

I have been trying to analyse a number of new patches we've
developed for MySQL to see their scalability. However I've
have gotten very strange results which didn't at all compare
with my old results and most of changes gave negative impact :(
Not so nice.

As part of debugging the issues with sysbench I decided to go
back to the original version I used previously (sysbench 0.4.8).
Interestingly even then I saw a difference on 16 and 32 threads
whereas on 1-8 threads and 64+ threads the result were the same
as usual.

So I checked my configuration and it turned out that I had changed
log file size to 200M from 1300M and also used 8 read and write
threads instead of 4. I checked quickly and discovered that the
parameter that affected the sysbench results was the log file size.
So increasing the log file size from 200M to 1300M increased the
top result at 32 threads from 3300 to 3750, a nice 15% increase.
The setting of the number of read and write threads had no
significant impact on performance.

This is obviously part of the problem which is currently being
researched both by Mark Callaghan and Dimitri.
Coincidentally Dimitri has just recently blogged about this and
provided a number of more detailed comparisons of the
performance of various settings of the log file size in InnoDB.

Labels: , ,