Tuesday, November 25, 2008

Impressive numbers for Next Gen MySQL Cluster

I had a very interesting conversation on the phone with Jonas
Oreland today (he also blogged about it on his blog at
http://jonasoreland.blogspot.com).

There are a lot of interesting features coming up in MySQL Cluster
version 6.4. Online Add Node is one of those; it can be done
without any downtime and with almost no additional memory
needed other than the memory in the new machines added to the
cluster. This is a feature I started thinking about almost 10 years
ago, so it's nice to see the fourth version of the solution actually
being implemented. It's a really neat solution to the problem,
definitely fitting the word innovative.

The next interesting feature is the use of a more efficient protocol
for handling large operations towards the data nodes. This puts
fewer bits on the wire, and more importantly it saves a number of copy
stages internally in the NDB data nodes. So it has a dramatic
effect on the performance of reads and writes of large records:
it doubles the throughput for large records.

In addition the new 6.4 version also adds multithreading to the
data nodes. Previously each data node was a very efficient single
thread which handled all the code blocks as well as the send and
receive handling. In the new 6.4 version a data node is split
into at least 4 threads for database handling, one thread for send
and receive, and the usual assistance threads for file writes and
so forth. This means that a data node fits nicely into an 8-core
server, since 1-2 CPUs are also required for interrupt handling and
other operating system activity.

Jonas benchmarked using a benchmark from our popular flex series
of benchmarks. It started with flexBench, which I developed more
than 10 years ago, and it has been followed by flexAsynch, flexTT and
many more variants of the same type. These benchmarks can vary the
number of threads, the size of the records, the number of operations
per batch per thread and a number of other things. flexAsynch is
really good at generating extremely high loads on the database
without doing anything useful itself :)

So what Jonas demonstrated today was a flexAsynch run where he
managed to do more than 1 million reads per second using only
one data node. MySQL Cluster is a clustered system, so you can
guess what happens when we have 16, 32 or 48 of those nodes
tied together: it will do many tens of millions of reads per
second. An interesting take on this is an article in
Datateknik 3.0 (a magazine no longer around) where I discussed
how we had reached, or were about to reach, 1 million
reads per second. I think this was sometime in 2001 or 2002.
I was asked where we were going next and I said that 100 million
reads per second was the next goal. We're actually in range of
achieving this now, since I also have a patch lying around which
can increase the number of data nodes in the cluster to 128;
with good scalability, 100 million reads per second per cluster
is achievable.

When Jonas called he had achieved 950,000 reads, and I told
him to try out the Dolphin DX cards which were also
available on the machines. Then we managed to increase the
performance to just over 1 million, up to 1,070,000.
Quite nice. Maybe even more impressive is that it was also
possible to do more than 600,000 write operations per second
(these are all transactional).

This run of flexAsynch was focused on seeing how many operations
per second one could get through. I then decided I was also
interested in seeing how much bandwidth we could handle
in the system. So we changed the record size from 8 bytes to
2000 bytes. When trying it out with Gigabit Ethernet we reached
60,000 reads and 55,000 inserts/updates per second. A quick
calculation shows that we're doing almost 120 MBytes per second
of reads and 110 MBytes per second of writes to the data node.
This is obviously where Gigabit Ethernet hits its limit, so the
bottleneck was an easy catch.

Then we tried the same thing using the Dolphin DX cards. We got
250,000 reads per second and more than 200,000 writes per
second. This corresponds to almost 500 MBytes per second of
reads from the database and more than 400 MBytes per second of
writes to the data nodes.
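
Just to spell out the arithmetic behind these bandwidth figures, here
is a small C snippet. The operation rates and the 2000-byte record
size are the measured numbers quoted above; everything else is purely
illustrative.

#include <stdio.h>

/*
  Back-of-the-envelope arithmetic: operations per second times the
  2000-byte record size gives the approximate payload bandwidth
  into and out of the data node.
*/
int main(void)
{
  const double record_size = 2000.0;       /* bytes per record */
  const double gige_reads = 60000.0;       /* reads/sec on Gigabit Ethernet */
  const double gige_writes = 55000.0;      /* inserts+updates/sec */
  const double dolphin_reads = 250000.0;   /* reads/sec on Dolphin DX */
  const double dolphin_writes = 200000.0;  /* writes/sec on Dolphin DX */

  printf("GigE:    %4.0f MBytes/s read, %4.0f MBytes/s write\n",
         gige_reads * record_size / 1e6,
         gige_writes * record_size / 1e6);
  printf("Dolphin: %4.0f MBytes/s read, %4.0f MBytes/s write\n",
         dolphin_reads * record_size / 1e6,
         dolphin_writes * record_size / 1e6);
  return 0;
}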

I had to check whether this was actually the limit of the set-up
I had for the Dolphin cards (they can be set up to use either
x4 or x8 on PCI Express). Interestingly enough, after working
with Dolphin cards in various ways for 15 years, this is the first
time I really cared about the bandwidth they could push through.
The performance of MySQL Cluster has never come close to
saturating the Dolphin links in the past.

Today, however, we managed to saturate the links. The maximum
bandwidth achievable by a microbenchmark with a single process
was 510 MBytes per second, and we achieved almost 95% of this
number. Very impressive indeed, I think. What's even more
interesting is that the Dolphin card used the x4 configuration,
so it can actually do twice the bandwidth in the x8 setting, and
the CPUs were fairly lightly loaded on the system. So it's
likely that we could come very close to saturating the links
even with an x8 configuration of the Dolphin cards. That's
a milestone to me: MySQL Cluster has managed to saturate
the bandwidth of a cluster interconnect with very decent
bandwidth.

This actually poses an interesting database recovery
problem for the MySQL Cluster architecture. How
does one handle 1 GByte per second of writes to each data node in
the system when used with persistent tables which have
to be checkpointed and logged to disk? This requires
bandwidth to the disk subsystem of multiple GBytes per
second. It's only reasonable to even consider doing this
with the upcoming new high-performance SSD drives. I
heard an old colleague, nowadays working for a disk
company, mention that he had demonstrated 6 GBytes
per second to local disks, so this is actually a
very nice fit. It turns out that this problem can also be
solved.
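
To make the "multiple GBytes per second" concrete, here is a rough
sketch of the arithmetic. The 1 GByte/s incoming write rate is from
the discussion above; the amplification factors (data written once
more to the REDO log and once more during a checkpoint) are assumed
for illustration only, not taken from the actual checkpoint design.

#include <stdio.h>

/* Purely illustrative write amplification estimate per data node. */
int main(void)
{
  const double update_rate = 1.0;       /* GByte/s of incoming writes */
  const double redo_factor = 1.0;       /* assumed: written once to the REDO log */
  const double checkpoint_factor = 1.0; /* assumed: written once per checkpoint */

  printf("Disk bandwidth needed: roughly %.0f GBytes/s per data node\n",
         update_rate * (redo_factor + checkpoint_factor));
  return 0;
}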

SSD drives are also a very nice fit for the disk data
part of MySQL Cluster. Here it makes all
the sense in the world to use SSD drives as the place
to put the tablespaces for the disk-based part of MySQL
Cluster. This way the disk data also becomes part of
the real-time system, and you can fairly easily build a
terabyte database with exceedingly high
performance. Maybe this is to some extent a reply to
Mark Callaghan's request for a data warehouse based
on MySQL Cluster. Not that we have really focused
much on it, but the parallelism and performance
available in a large MySQL Cluster based on 6.4 will
be breathtaking even to me, with 15 years of thinking
about this behind me. A final word on this is that
we are also working on a parallel query
capability for MySQL Cluster. This is going to be
based on some new advanced additions to the storage
engine interface we're currently working on
(Pushdown Query Fragment, for those who joined the
storage engine summit at Google in April this year).

A nice thing about being part of Sun is that they
build the hardware required for these
very large systems and are very interested in doing
showcases of them. So all the technology needed to do
what has been discussed above is available within
Sun.

Sorry for writing a very long blog. I know it's
better to write short and to-the-point blogs,
but I found so many interesting angles on this
subject.

Monday, November 24, 2008

Poll set to handle poll, eventports, epoll, kqueue and Windows IO Completion

This blog describes the background and implementation of a poll
set used to monitor many sockets in one or several receive
threads. The blog is intended as a description, but also to
invite some feedback on the design.

I've been spending some time working out all the gory details
of starting and stopping lots of send threads, connection
threads and receive threads over the last few months.

The receive threads will monitor a number of socket
connections and, as soon as data is available, ensure that
the data is received and forwarded to the proper user
thread for execution.

However, listening on many sockets is a problem which needs
a scalable solution. Almost every operating system on the
planet has some solution to this problem; the trouble is
that they all have different ones. So I decided to
make a simple interface to those socket monitoring mechanisms.

First, my requirements. I will handle a great number of socket
connections from each thread, and I want the ability to
move socket connections from one receive thread to another
to dynamically adapt to the usage scenario. Mostly, however,
the receive thread will wait for events on a set of socket
connections, handle them as they arrive and go back to waiting
for more events. On the operating system side I aim at
supporting Linux, OpenSolaris, Mac OS X, FreeBSD and Windows.

One receive thread might be required to listen to socket
connections from many clusters, so there is no real limit
to the number of sockets a receive thread can handle. I
decided however to put a compile-time limit in there, since
e.g. epoll requires a size at create time. This is currently
set to 1024. So if more sockets are needed, another receive
thread is required even if it isn't needed from a performance
point of view.

The implementation aims to cover 5 different mechanisms.

epoll
-----
The epoll interface is Linux-only. epoll_create is used to create
an epoll file descriptor, then epoll_ctl is used to add and drop
file descriptors to and from the epoll set. Finally epoll_wait is
used to wait for events to arrive. Socket connections remain in
the epoll set as long as they are not explicitly removed or closed.
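
A minimal sketch of this pattern, not the actual iClaustron code
(error handling omitted; the constant name and the 1024 limit just
mirror the compile-time limit mentioned above):

#include <sys/epoll.h>

#define MAX_POLL_SET_SIZE 1024 /* mirrors the compile-time limit above */

/* Create an epoll set, add one socket and wait for read events. */
static void epoll_sketch(int sock_fd, int ms_time)
{
  struct epoll_event ev, events[MAX_POLL_SET_SIZE];
  int epoll_fd, num_events, i;

  epoll_fd = epoll_create(MAX_POLL_SET_SIZE); /* size is only a hint */

  ev.events = EPOLLIN;  /* we only care about data to receive */
  ev.data.fd = sock_fd;
  epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock_fd, &ev);

  /* The socket stays in the set until EPOLL_CTL_DEL or close(). */
  num_events = epoll_wait(epoll_fd, events, MAX_POLL_SET_SIZE, ms_time);
  for (i = 0; i < num_events; i++)
  {
    /* events[i].data.fd is ready to receive */
  }
}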

poll
----
poll is the standard fallback, there simply to make sure the code
also works on older platforms that support none of the other
mechanisms. Here there is only one system call, poll itself,
and all the state of the poll set has to be maintained by this
implementation.
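
The corresponding sketch with plain poll; here the array of pollfd
structs is exactly the poll set state the implementation has to keep
track of itself:

#include <poll.h>

/* Wait for read events on an array of descriptors kept by the caller. */
static void poll_sketch(struct pollfd *fd_array, nfds_t num_fds, int ms_time)
{
  int num_events;
  nfds_t i;

  num_events = poll(fd_array, num_fds, ms_time);
  for (i = 0; i < num_fds && num_events > 0; i++)
  {
    if (fd_array[i].revents & POLLIN)
    {
      /* fd_array[i].fd is ready to receive */
      num_events--;
    }
  }
}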

kqueue
------
kqueue achieves more or less the same thing as epoll; it does
so however with a more complex interface that can support a
lot more things, such as polling for completed processes,
i-nodes and so forth. It has a kqueue method to create the
kqueue file descriptor and then a kevent call which is used
both to add and drop descriptors and to listen for events on
the kqueue. kqueue exists in BSD operating systems such as
FreeBSD and Mac OS X.
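
A minimal kqueue sketch of the same pattern (again not the actual
implementation, and with error handling left out):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

/* Create a kqueue, register one socket for read events and wait. */
static void kqueue_sketch(int sock_fd, int ms_time)
{
  struct kevent change, events[1024];
  struct timespec timeout;
  int kq_fd, num_events, i;

  kq_fd = kqueue();

  /* kevent is used both to register descriptors and to fetch events. */
  EV_SET(&change, sock_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
  kevent(kq_fd, &change, 1, NULL, 0, NULL);

  timeout.tv_sec = ms_time / 1000;
  timeout.tv_nsec = (ms_time % 1000) * 1000000L;
  num_events = kevent(kq_fd, NULL, 0, events, 1024, &timeout);
  for (i = 0; i < num_events; i++)
  {
    /* events[i].ident is the socket ready to receive */
  }
}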

eventports
----------
eventports is again a mechanism very similar to epoll. It has a
port_create call to create the event port file descriptor,
a port_associate call to add a socket to the event port set,
a port_dissociate call to drop a socket from the set and
a port_getn call to wait for events to arrive. There is
however a major difference: after an event is delivered by a
port_getn call, the socket is removed from the set and has to be
added back. From an implementation point of view this mainly
complicated my design of the error handling.
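
A sketch of the Solaris event port pattern, highlighting the
re-association after each delivered event; the actual implementation's
error handling around this is not shown:

#include <port.h>
#include <poll.h>
#include <time.h>
#include <stdint.h>

/* Create an event port, associate one socket and wait for read events. */
static void eventports_sketch(int sock_fd, void *user_obj, int ms_time)
{
  port_event_t events[1024];
  struct timespec timeout;
  uint_t num_events = 1; /* wait for at least one event */
  uint_t i;
  int port_fd;

  port_fd = port_create();
  port_associate(port_fd, PORT_SOURCE_FD, (uintptr_t)sock_fd,
                 POLLIN, user_obj);

  timeout.tv_sec = ms_time / 1000;
  timeout.tv_nsec = (ms_time % 1000) * 1000000L;
  port_getn(port_fd, events, 1024, &num_events, &timeout);
  for (i = 0; i < num_events; i++)
  {
    /*
      The file descriptor is dissociated from the port as soon as the
      event is delivered, so it must be associated again before the
      next port_getn call.
    */
    port_associate(port_fd, PORT_SOURCE_FD, events[i].portev_object,
                   POLLIN, events[i].portev_user);
  }
}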

Windows IO Completion
---------------------
I have only skimmed this so far. It differs mainly in being a tad
more complex and in that events from the set can be distributed to
more than one thread; however, that feature will not be used in this
design. I have not implemented this part yet, as I first need to
get all the bits in place for building on Windows.

Implementation
--------------
The implementation is done in C, but I still wanted a
clear object-oriented interface. To achieve this I
created two header files: ic_poll_set.h, which declares all
the public parts, and ic_poll_set_int.h, which defines the
private and the public data structures used. This means that
the internals of the IC_POLL_SET object are hidden from the
user of this interface.

Here is the public part of the interface (the code is GPL'ed
but it isn't released yet):

/* Copyright (C) 2008 iClaustron AB, All rights reserved */

struct ic_poll_connection
{
  int fd;
  guint32 index;
  void *user_obj;
  int ret_code;
};
typedef struct ic_poll_connection IC_POLL_CONNECTION;

struct ic_poll_set;
typedef struct ic_poll_set IC_POLL_SET;
struct ic_poll_operations
{
  /*
    The poll set implementation isn't multi-thread safe. It's intended to be
    used within one thread, the intention is that one can have several
    poll sets, but only one per thread. Thus no mutexes are needed to
    protect the poll set.

    ic_poll_set_add_connection is used to add a socket connection to the
    poll set, it requires only the file descriptor and a user object of
    any kind. The poll set implementation will ensure that this file
    descriptor is checked together with the other file descriptors in
    the poll set independent of the implementation in the underlying OS.

    ic_poll_set_remove_connection is used to remove the file descriptor
    from the poll set.

    ic_check_poll_set is the method that goes to check which socket
    connections are ready to receive.

    ic_get_next_connection is used in a loop where it is called until it
    returns NULL after a ic_check_poll_set call, the output from
    ic_get_next_connection is prepared already at the time of the
    ic_check_poll_set call. ic_get_next_connection will return a
    IC_POLL_CONNECTION object. It is possible that ic_check_poll_set
    can return without error whereas the IC_POLL_CONNECTION can still
    have an error in the ret_code in the object. So it is important to
    both check this return code as well as the return code from the
    call to ic_check_poll_set (this is due to the implementation using
    eventports on Solaris).

    ic_free_poll_set is used to free the poll set, it will also if the
    implementation so requires close any file descriptor of the poll
    set.

    ic_is_poll_set_full can be used to check if there is room for more
    socket connections in the poll set. The poll set has a limited size
    (currently set to 1024) set by a compile time parameter.
  */
  int (*ic_poll_set_add_connection) (IC_POLL_SET *poll_set,
                                     int fd,
                                     void *user_obj);
  int (*ic_poll_set_remove_connection) (IC_POLL_SET *poll_set,
                                        int fd);
  int (*ic_check_poll_set) (IC_POLL_SET *poll_set,
                            int ms_time);
  const IC_POLL_CONNECTION*
    (*ic_get_next_connection) (IC_POLL_SET *poll_set);
  void (*ic_free_poll_set) (IC_POLL_SET *poll_set);
  gboolean (*ic_is_poll_set_full) (IC_POLL_SET *poll_set);
};
typedef struct ic_poll_operations IC_POLL_OPERATIONS;

/* Creates a new poll set */
IC_POLL_SET* ic_create_poll_set();

struct ic_poll_set
{
  IC_POLL_OPERATIONS poll_ops;
};
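
To give a feel for how the interface is intended to be used, here is a
minimal sketch of a receive thread loop driving it. Only the calls
declared above are used; the loop structure, the 10 ms timeout and the
assumption that a zero return code from ic_check_poll_set means
success are mine, not taken from the actual implementation.

/* Sketch of a receive thread loop using the IC_POLL_SET interface above. */
static int receive_loop_sketch(int *fd_array, guint32 num_fds)
{
  IC_POLL_SET *poll_set;
  const IC_POLL_CONNECTION *conn;
  guint32 i;
  int ret_code = 0;

  if (!(poll_set = ic_create_poll_set()))
    return 1;

  for (i = 0; i < num_fds; i++)
  {
    if (poll_set->poll_ops.ic_is_poll_set_full(poll_set))
      break; /* another receive thread would be needed for the rest */
    poll_set->poll_ops.ic_poll_set_add_connection(poll_set,
                                                  fd_array[i],
                                                  NULL);
  }

  for (;;)
  {
    /* Wait at most 10 ms for events, assuming 0 means success. */
    if ((ret_code = poll_set->poll_ops.ic_check_poll_set(poll_set, 10)))
      break;
    while ((conn = poll_set->poll_ops.ic_get_next_connection(poll_set)))
    {
      if (conn->ret_code)
        continue; /* per-connection error, see the eventports note above */
      /* conn->fd is ready: receive data and forward to the user thread */
    }
  }
  poll_set->poll_ops.ic_free_poll_set(poll_set);
  return ret_code;
}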

Friday, November 21, 2008

DTrace, opensolaris and MySQL Performance

Currently I'm working hard to find and remove scalability
bottlenecks in the MySQL Server. MySQL was acquired by Sun
10 months ago by now, and many people have wondered in blogs
what the impact of this acquisition has been. My personal
experience is that I now have a chance to work with Sun
experts in DBMS performance. As usual it takes time when
working on new challenges before the inspiration starts
flowing. However, I've seen this flow of inspiration
starting to come now, so our joint work is
starting to bear fruit. I now have a much better understanding
of MySQL Server performance than I used to have. I know fairly
well where the bottlenecks are and I've started looking
into how they can be resolved.

Another interesting thing with Sun is the innovations they have
made in a number of areas. One such area is DTrace. This is a
really interesting tool which I have already used, with some
success, to analyse some behaviour of MySQL Cluster internals.
However, analysing other storage engines inside MySQL requires
a bit more work on inserting DTrace probes at appropriate places.

Working with DTrace obviously means that you need to work with
an OS that supports DTrace. Solaris is one such OS; I actually
developed NDB Cluster (the storage engine for MySQL Cluster) on
Solaris for the first 5-6 years. So one would expect Solaris to be
familiar to me, but having worked mainly with Linux for 6-7 years
means that most of the Solaris memory is gone.

So how to go about developing on Solaris? I decided to install a
virtual machine on my desktop. As a well-behaved Sun citizen I
decided to opt for VirtualBox as my choice of VM. This was an
interesting challenge, very similar to my previous experiences of
installing a virtual machine: it's easy to get the VM up and
running, but how do you communicate with it? I found some
instructions on how to set up IP links to a virtual machine, but
to make life harder I have a fixed IP address on my desktop, which
complicated things quite a bit. Finally I learned a lot about how
to set up virtual IP links, which I have already managed to forget
about :)

The next step was to get a development environment for
OpenSolaris going. I soon discovered that there is a package
manager in OpenSolaris which can be used to get all the needed
packages. However, after downloading a number of packages I
stumbled into some serious issues. I learned from this experience
that using developer previews of an OS is even worse than using a
newly released OS, which I already know from experience isn't for
the fainthearted.

So I decided to install a released OpenSolaris version instead
(the OpenSolaris 2008.05 version). After some googling I discovered
a very helpful presentation, the OpenSolaris developer how-to,
which explained a lot about how to install a development
environment for OpenSolaris.

After installing OpenSolaris 2008.05 and following the
instructions on how to install a development environment,
I am now equipped to develop DTrace probes and scripts and
try them out on my desktop.

I definitely like that OpenSolaris is looking more
like yet another Linux distribution, since it makes it a
lot simpler to work with. I would prefer the GNU developer
tools to be there from scratch, but I have the same issue
with Ubuntu.

That the system calls are different doesn't bother me as a
programmer, since different APIs for similar things are
something every programmer encounters when developing
for a multi-platform environment. I even look forward to
trying out a lot of Solaris system calls, since there are
lots of cool features for locking to CPUs, controlling
CPUs for interrupts, resource groups, scheduling
algorithms and so forth. I recently noted that most of
these things are available on Linux as well. However,
I am still missing the programming APIs to these
features.