Tuesday, November 25, 2008

Impressive numbers from the Next Gen MySQL Cluster

I had a very interesting conversation on the phone with Jonas
Oreland today (he also blogged about it at
http://jonasoreland.blogspot.com).

There are a lot of interesting features coming up in MySQL Cluster
version 6.4. Online Add Node is one of those: nodes can be added
without any downtime and with almost no additional memory needed
other than the memory in the new machines added to the cluster.
This is a feature I started thinking about almost 10 years ago,
so it's nice to see the fourth version of the solution actually
get implemented. It's a really neat solution to the problem,
definitely fitting the word innovative.

The next interesting feature is a more efficient protocol for
handling large operations towards the data nodes. It uses fewer
bits on the wire, but more importantly it saves a number of copy
stages internally in the NDB data nodes. This has a dramatic
effect on the performance of reads and writes of large records:
it doubles the throughput for large records.

In addition, the new 6.4 version adds multithreading to the
data nodes. Previously a data node was a single, very efficient
thread which handled all the code blocks as well as send and
receive. In the new 6.4 version the data node is split into at
least 4 threads for database handling, one thread for send and
receive, and the usual assistance threads for file writes and
so forth. This means that a data node fits nicely into an 8-core
server, since 1-2 CPUs are also required for interrupt handling
and other operating system activity.
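
For those who want to play with this when 6.4 arrives, this is roughly
how I would expect the data node part of config.ini to look. The
multithreaded data node binary is called ndbmtd and the number of
threads is governed by the MaxNoOfExecutionThreads parameter in the
current tree; take this as a sketch, since names and defaults could
still change before the release:

    # Sketch of a config.ini for multithreaded data nodes in 6.4
    # (assumes the ndbmtd binary and the MaxNoOfExecutionThreads
    #  parameter keep their current names)
    [ndbd default]
    NoOfReplicas=2
    DataMemory=4G
    IndexMemory=512M
    # Let each data node use up to 8 execution threads, leaving a
    # core or two free for interrupts and other OS activity.
    MaxNoOfExecutionThreads=8

    [ndbd]
    HostName=datanode1
    [ndbd]
    HostName=datanode2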

Jonas benchmarked using a benchmark from our popular flex-series
of benchmarks. The series started when I developed flexBench more
than 10 years ago, and it has been followed by flexAsynch, flexTT
and a lot more variants of the same type. The benchmark can vary
the number of threads, the size of the records, the number of
operations per batch per thread and a number of other things.
flexAsynch is really good at generating extremely high loads on
the database without doing anything useful itself :)
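
To give a feel for the kind of load flexAsynch generates, here is a
rough sketch of a batched asynchronous read loop against the NDB API.
This is not the actual flexAsynch source; the connect string, the
table T1, the columns PK and DATA, the batch size and the key range
are all made up for illustration, and error handling is mostly
trimmed away:

    // flexAsynch-style batched reads through the asynchronous NDB API.
    // Assumed for illustration: a table T1 with an unsigned int primary
    // key PK and a DATA column, batch size 256, 1M keys.
    #include <NdbApi.hpp>
    #include <cstdio>

    static int outstanding = 0;

    // Called by the NDB API when a prepared transaction has completed.
    static void readCallback(int result, NdbTransaction *trans, void *arg)
    {
      Ndb *ndb = (Ndb *) arg;
      if (result != 0)
        printf("read failed: %s\n", trans->getNdbError().message);
      ndb->closeTransaction(trans);
      outstanding--;
    }

    int main()
    {
      ndb_init();
      Ndb_cluster_connection con("mgmhost:1186");   // made-up connect string
      if (con.connect(4, 5, 1) != 0 || con.wait_until_ready(30, 0) < 0)
        return 1;

      Ndb ndb(&con, "TEST_DB");
      ndb.init(1024);                               // allow many open transactions

      const unsigned int BATCH = 256;
      char buf[2000];                               // large enough for the DATA column

      for (unsigned int key = 0; key < 1000000; key += BATCH) {
        // Define a whole batch of primary key reads without sending anything.
        for (unsigned int i = 0; i < BATCH; i++) {
          NdbTransaction *trans = ndb.startTransaction();
          NdbOperation *op = trans->getNdbOperation("T1");
          op->readTuple();
          op->equal("PK", key + i);
          op->getValue("DATA", buf);
          trans->executeAsynchPrepare(NdbTransaction::Commit, readCallback, &ndb);
          outstanding++;
        }
        // Send the whole batch in one go and poll until all replies are back.
        while (outstanding > 0)
          ndb.sendPollNdb(3000, outstanding, 1);
      }

      ndb_end(0);
      return 0;
    }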

So what Jonas demonstrated today was a flexAsynch run where he
managed to do more than 1 million reads per second using only
one data node. MySQL Cluster is a clustered system, so you can
guess what happens when we tie 16, 32 or 48 of those nodes
together: it will do many tens of millions of reads per second.
An interesting take on this is an article in Datateknik 3.0
(a magazine no longer around) where I discussed how we had
reached, or were about to reach, 1 million reads per second.
I think this was sometime in 2001 or 2002. I was asked where we
were going next and I said that 100 million reads per second
was the next goal. We're actually in range of achieving this
now, since I also have a patch lying around which increases
the maximum number of data nodes in the cluster to 128, whereby
with good scalability 100 million reads per second per cluster
is achievable.

When Jonas called he had achieved 950k reads, and I told him to
try out the Dolphin DX cards which were also available on the
machines. With those we managed to push the performance just
over 1 million, up to 1,070,000 reads per second. Quite nice.
Maybe even more impressive is that it was also possible to do
more than 600,000 write operations per second (these are all
transactional).

This run of flexAsynch was focused on seeing how many operations
per second one could get through. I then decided I was also
interested in seeing how much bandwidth we could handle in the
system, so we changed the record size from 8 bytes to 2000 bytes.
When trying it out with Gigabit Ethernet we reached 60,000 reads
and 55,000 inserts/updates per second. A quick calculation shows
that we're doing almost 120 MBytes per second of reads and 110
MBytes per second of writes to the data node. This is obviously
where the limit of Gigabit Ethernet lies, so the bottleneck was
an easy catch.

Then we tried the same thing using the Dolphin DX cards. We got
250,000 reads per second and more than 200,000 writes per
second. This corresponds to almost 500 MBytes per second of
reads from the database and more than 400 MBytes per second of
writes to the data nodes.
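
For the curious, the back-of-the-envelope arithmetic behind those
bandwidth numbers is simply operations per second times the
2000-byte record size, ignoring protocol overhead:

    // Bandwidth figures from the runs above: ops/sec * 2000-byte records.
    #include <cstdio>

    int main()
    {
      const double record = 2000.0;   // bytes per record in this run
      printf("GbE reads:      %.0f MBytes/sec\n",  60000 * record / 1e6);  // ~120
      printf("GbE writes:     %.0f MBytes/sec\n",  55000 * record / 1e6);  // ~110
      printf("Dolphin reads:  %.0f MBytes/sec\n", 250000 * record / 1e6);  // ~500
      printf("Dolphin writes: %.0f MBytes/sec\n", 200000 * record / 1e6);  // ~400
      return 0;
    }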

I had to check whether this was actually the limit of the set-up
I had for the Dolphin cards (they can be set up to use either
x4 or x8 PCI Express). Interestingly enough, after working with
Dolphin cards in various ways for 15 years, this is the first
time I've really cared about the bandwidth they can push through.
The performance of MySQL Cluster has never been close to
saturating the Dolphin links in the past.

Today, however, we managed to saturate the links. The maximum
bandwidth achievable by a microbenchmark with a single process
was 510 MBytes per second, and we achieved almost 95% of that
number. Very impressive indeed, I think. What's even more
interesting is that the Dolphin card was used in the x4
configuration, so it can actually do 2x the bandwidth in the x8
setting, and the CPUs on the system were fairly lightly loaded,
so it's likely that we could come very close to saturating the
links even in an x8 configuration. So that's a milestone to me:
MySQL Cluster has managed to saturate even the bandwidth of a
cluster interconnect with very decent bandwidth.

This actually poses an interesting database recovery problem for
the MySQL Cluster architecture. How does one handle 1 GByte per
second of writes to each data node in the system when it is used
with persistent tables, which have to be checkpointed and logged
to disk? This requires bandwidth to the disk subsystem of
multiple GBytes per second. It's only reasonable to even consider
doing this with the upcoming new high-performance SSD drives. I
heard an old colleague, nowadays working for a disk company,
mention that he had demonstrated 6 GBytes per second to local
disks, so this actually is a very nice fit. It turns out that
this problem can also be solved.
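
To make the problem concrete, here is a rough sketch of the
arithmetic as I think about it. The constants below are
illustrative assumptions, not measurements: every change has to be
written to the redo log, and on top of that the local checkpoint
keeps rewriting the in-memory data to disk at regular intervals:

    // Rough sketch of the disk bandwidth one data node would need.
    // All constants are illustrative assumptions, not measurements.
    #include <cstdio>

    int main()
    {
      const double write_rate  = 1.0;   // GBytes/sec of application writes to this node
      const double data_memory = 64.0;  // GBytes of in-memory data (assumed)
      const double lcp_period  = 60.0;  // seconds between local checkpoints (assumed)

      const double redo_log    = write_rate;                // every change is logged
      const double checkpoint  = data_memory / lcp_period;  // LCP rewrites the data set
      printf("redo log:   %.1f GBytes/sec\n", redo_log);
      printf("checkpoint: %.1f GBytes/sec\n", checkpoint);
      printf("total:      %.1f GBytes/sec\n", redo_log + checkpoint);  // > 2 GBytes/sec
      return 0;
    }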

Actually, SSD drives are also a very nice fit with the disk data
part of MySQL Cluster. Here it makes all the sense in the world
to use SSD drives as the place to put the tablespaces for the
disk part of MySQL Cluster. This way the disk data also becomes
part of the real-time system, and you can fairly easily build a
terabyte database with exceedingly high performance. Maybe this
is to some extent a reply to Mark Callaghan's request for a data
warehouse based on MySQL Cluster (link). Not that we have really
focused so much on it, but the parallelism and performance
available in a large MySQL Cluster based on 6.4 will be
breathtaking even to me, with 15 years of thinking about this
behind me. A final word on this: we are actually also working on
a parallel query capability for MySQL Cluster. It is going to be
based on some new advanced additions to the storage engine
interface we're currently working on (Pushdown Query Fragments,
for those who joined the storage engine summit at Google in
April this year).

A nice thing about being part of Sun is that they're building
the HW required for these very large systems and are very
interested in doing showcases for them. So all the technology
needed to do what has been discussed above is available within
Sun.

Sorry for writing a very long blog post. I know it's better to
write short and to-the-point posts, but I found so many
interesting angles on this subject.

3 comments:

hingo said...

Hi Mikael

Thanks for this post, in your case I very much appreciate the longer style. It gives more context and background for "a stupid sales guy" like me :-) I get from Jonas' posts that the numbers are increasing every day, and that is of course always good, but after reading this I feel much wiser. thanks!

Cyril Scetbon said...

A really great post after Jonas's one about his new 6.4 record of 950k requests thanks to the multithreading feature.

yousuf said...

I am assuming the tests were carried out with a private interconnect using direct TCP/IP connections among data nodes.
With respect to the Gigabit Ethernet limit, is it worth trying aggregation of NICs to increase the bandwidth? If possible, use 3 NICs (if 4 are available) to create an aggregated link.
This way people don't have to go for any additional hardware... and could still see close to 330-350 MBytes/sec throughput.