Thursday, March 05, 2015

Heartbeat Inclusion Protocol Handling in MySQL Cluster

The below description was added to the MySQL Cluster 7.4 source code
and describes how new nodes are added into the heartbeat protocol at
startup of a node.

The protocol to include our node in the heartbeat protocol starts when
we call execCM_INFOCONF. We start by opening communication to all nodes
in the cluster. When we start this protocol we don't know anything about
which nodes are up and running, and we don't know which node is currently
the president of the heartbeat protocol.

For us to be successful with being included in the heartbeat protocol we
need to be connected to all nodes currently in the heartbeat protocol. It
is important to remember that QMGR (the source code module that
controls the heartbeat handling) sees a node as alive if it is included
in the heartbeat protocol. Higher level notions of aliveness are handled
primarily by the DBDIH block (DBDIH is responsible for the database
level of distribution such as which nodes have up-to-date replicas of a
certain database fragment), but also to some extent by NDBCNTR
(NDBCNTR is a source module that controls restart phases and is a
layer on top of QMGR and below the database handling level).

The protocol starts with the new node sending CM_REGREQ to all nodes it
is connected to. Only the president will respond to this message. We
could have a situation where there currently isn't a president chosen.
In this case an election is held whereby a new president is assigned. In
the rest of this comment we assume that a president already exists.

So if we were connected to the president we will get a response to the
CM_REGREQ from the president with CM_REGCONF. The CM_REGCONF contains
the set of nodes currently included in the heartbeat protocol.

In parallel with sending CM_REGCONF, the president sends a
CM_ADD(Prepare) message to all nodes included in the protocol.

When receiving CM_REGCONF the new node will send CM_NODEINFOREQ to the
nodes included in the heartbeat protocol, with information about its
binary version, its number of LDM workers and the MySQL version of the
binary.

Each node already included in the heartbeat protocol will wait until it
has received both the CM_ADD(Prepare) from the president and the
CM_NODEINFOREQ from the starting node. When it has received those two
messages it sends CM_ACKADD(Prepare) to the president and
CM_NODEINFOCONF to the starting node with its own node information.

When the president has received CM_ACKADD(Prepare) from all nodes
included in the heartbeat protocol, it sends CM_ADD(AddCommit) to all
nodes included in the heartbeat protocol.

When the nodes receive CM_ADD(AddCommit) from the president, they
enable communication to the new node and immediately start sending
heartbeats to it. They also include the new node in their view of the
nodes included in the heartbeat protocol. Next they send
CM_ACKADD(AddCommit) back to the president.

When the president has received CM_ACKADD(AddCommit) from all nodes
included in the heartbeat protocol, it sends CM_ADD(CommitNew) to the
starting node.

This is also the point where we report the node as included in the
heartbeat protocol to DBDIH, since from here the rest of the protocol
is only about informing the new node about the outcome of the inclusion
protocol. By the time the president receives the response to this
message, the new node may already have proceeded a bit into its restart.

After receiving CM_REGCONF the starting node waits for all nodes
included in the heartbeat protocol to send CM_NODEINFOCONF and for the
CM_ADD(CommitNew) from the president. When all of this has been
received, the new node adds itself and all nodes it has been informed
about to its view of the nodes included in the heartbeat protocol and
enables communication to all other nodes therein. Finally it sends
CM_ACKADD(CommitNew) to the president.

When the president has received CM_ACKADD(CommitNew) from the starting
node, the inclusion protocol is completed and the president is ready to
include the next new node into the cluster.

It is the responsibility of the starting nodes to retry after a failed
node inclusion; they do so with a 3 second delay. This means that
normally at most one node per 3 seconds is added to the cluster, so
this phase of adding nodes can add a bit more than a minute of delay
when a large cluster starts up.
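A back-of-the-envelope sketch of that delay (the function below is
illustrative only, not part of the source code):

```python
def inclusion_delay_seconds(num_starting_nodes, seconds_per_node=3):
    """Rough upper bound on the heartbeat inclusion phase, assuming
    at most one node is included per 3 seconds."""
    return num_starting_nodes * seconds_per_node

# A fairly large cluster of 24 starting nodes: a bit more than a minute.
print(inclusion_delay_seconds(24))  # 72
```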

We try to depict the above in a graph here as well:

New node           Nodes included in the heartbeat protocol     President
-------------------------------------------------------------------------
----CM_REGREQ-------------------->>
----CM_REGREQ----------------------------------------------------------->

<---------------CM_REGCONF-----------------------------------------------
                                  <<------CM_ADD(Prepare)----------------

----CM_NODEINFOREQ-------------->>

Nodes included in the heartbeat protocol can receive CM_ADD(Prepare) and
CM_NODEINFOREQ in any order.

<<---CM_NODEINFOCONF--------------  -------CM_ACKADD(Prepare)---------->>

                                  <<------CM_ADD(AddCommit)--------------

Here the included nodes enable communication to the new node and start
sending heartbeats to it.

                                  ---------CM_ACKADD(AddCommit)-------->>

Here the master node reports to DBDIH that the new node has been
included in the heartbeat protocol.

<----CM_ADD(CommitNew)---------------------------------------------------

Here the new node enables communication to the other included nodes and
starts sending heartbeat messages.

-----CM_ACKADD(CommitNew)----------------------------------------------->

Here the president can complete the inclusion protocol and is ready to
receive new nodes into the heartbeat protocol.
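The phases above can also be sketched in code. This is a minimal model
of the message flow and its effect on membership views, not the actual
QMGR implementation; node views are represented as plain sets of node
ids:

```python
# Illustrative model of the three CM_ADD phases: Prepare changes no
# views, AddCommit updates the included nodes, CommitNew updates the
# starting node. Not the real QMGR code.

def include_node(included_nodes, new_node):
    """Return each node's view of heartbeat membership after new_node
    has been included via Prepare/AddCommit/CommitNew."""
    views = {n: set(included_nodes) for n in included_nodes}

    # CM_ADD(Prepare): each included node waits for both the
    # president's prepare and the new node's CM_NODEINFOREQ, then acks
    # with CM_ACKADD(Prepare). No views change yet.

    # CM_ADD(AddCommit): the included nodes enable communication to the
    # new node, start heartbeating it and add it to their views.
    for node in included_nodes:
        views[node].add(new_node)

    # CM_ADD(CommitNew): the new node adds itself and every node it was
    # told about in CM_REGCONF, then acks with CM_ACKADD(CommitNew).
    views[new_node] = set(included_nodes) | {new_node}
    return views

views = include_node({1, 2, 3}, 4)
assert all(view == {1, 2, 3, 4} for view in views.values())
```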

Wednesday, March 04, 2015

Restart phases of a node restart in MySQL Cluster

In MySQL Cluster 7.4 we have worked on improving the speed of restarts.
In addition we have made it easier to track how fast the restarts are
proceeding and to see which phases of the restart take a long time.

For this purpose we have added a new Ndbinfo table in MySQL Cluster 7.4.
I will try to explain the restart phases in a series of blogs and how they map
to the columns in this Ndbinfo table.

First a short intro to some concepts in MySQL Cluster. LCP stands for
local checkpoint; this is where the main memory content of tables is
written to disk at regular intervals. There is also the GCP, global
checkpoint, which is a protocol to synchronise the REDO log writes and
thus provides durability of transactions. LDM stands for Local Data
Manager and is the thread type that runs all the code maintaining the
actual data in MySQL Cluster. This includes row storage, REDO logging,
the hash index, the T-tree index (ordered index) and various internal
triggering functions to support high-level functionality such as unique
indexes, foreign keys and so forth.

In a node restart a number of nodes are involved; at a minimum two, and
often more. At least one node is a starting node and one live node is
always the master node. There can also be other live nodes that have
already started and that participate in some phases of the restart but
not in all. There can also be other starting nodes that have to be
synchronised to some extent with the starting node.

A node restart always begins with a node failing. This could be a
controlled failure as used in e.g. a rolling upgrade of the MySQL
Cluster software. In this case there are usually multiple nodes set to
fail at the same time; often half the nodes are restarted together as
part of a software upgrade.

So the first of the node restart phases is to complete the node failure.
This involves handling in-progress transactions involving the failed
node. It involves taking over as master if the failed node was a master
node. It involves handling a number of protocols in the data nodes
where other nodes are waiting for the failed node. In 7.3 and older
versions this can at times be a fairly lengthy phase, in particular
when taking over the master role of the local checkpoint protocol. With
the restart improvements in MySQL Cluster 7.4 it should be a rare event
that this phase takes a long time.

The next node restart phase is that the crashed node starts up again.
The first thing the starting node does is to attempt to allocate a node
id. The allocation of a node id has to wait for the node failure phase
to complete first. This means that this phase can take some time, but
normally it proceeds very quickly.

When the new node has received a node id the next step is to be
included in the heartbeat protocol. Only one node can enter the
heartbeat protocol per 3 seconds, so if many nodes start up at the same
time a node can be held waiting here until the cluster accepts it.

Once the node has been included into the heartbeat protocol, which is
the lowest node handling layer in MySQL Cluster, it asks for permission
to be included in the cluster at the next level of maintenance. The
node could be blocked from inclusion at this level, for example if
other nodes are still removing our node from local checkpoints after we
did an initial node restart; the node could also be blocked by include
node protocols still going on. So the node can be held in this layer at
times, but that is a rare event and should not happen very often.

Also, only one node at a time can be in this phase; the next node is
allowed to proceed after the current node has completed this phase and
the next phase, where it gets a copy of the meta data in the cluster.

After being accepted into the cluster the node has to wait until we
have paused the LCP protocol. This is a new feature of MySQL Cluster
7.4 that has been added to avoid having to wait in this state until a
local checkpoint is completed before we can proceed. In 7.4 we can
pause the checkpoint execution (for the most part this will not block
checkpoint execution, since a major part of the checkpoint is
off-loaded to the LDM threads, which can execute up to 64 local
checkpoint fragments without getting any new requests from the master
node).

As soon as the node has managed to pause the local checkpoint
execution, we can start to receive the meta data from the master node.
This copy phase could take some time if there are very many tables in
the cluster. Normally it is a very quick action.

Once the meta data has been copied to our node, the next node can
proceed with that phase. Our node continues by being included into the
global checkpoint protocol and a number of other node handling
protocols. This is normally a quick phase.

Now that the node has been included in the protocols, the master node
tells us which fragments to restore from disk. This phase is usually
also fast.

Now the node has reached the first phase that normally takes a bit
longer. This phase is where the node restores all requested fragments
from a checkpoint stored on disk. All LDM threads work in parallel on
restoring the requested fragments, so the node should be able to
restore at a speed of at least a few million records per second.

After restoring the fragment checkpoints, the next phase is to execute
the UNDO logs against all disk data fragments. If there are no disk
data tables in the cluster this phase will be very quick; otherwise it
can take some time.

After restoring the fragments and applying the disk data UNDO log, the
node is ready to apply the logical REDO log. All LDM threads execute
the REDO logs in parallel. There are 4 phases of REDO log execution,
but in reality only the first phase is used. The time of this phase
depends on how fast checkpoints were generated before the crash. The
faster we execute the checkpoints, the less REDO log needs to be
executed as part of a node restart.

Also here the node should be able to process some millions of log records per
second since we are executing this in parallel in the LDM threads.

Now that we have restored a global checkpoint for all fragments, the
node is ready to rebuild all ordered indexes. The fragments are
checkpointed with only their data; indexes are not checkpointed, and
are instead rebuilt as part of a node restart. The primary key hash
index is rebuilt as part of the restore of the fragments and the
execution of the REDO log. The ordered indexes are rebuilt in a
separate phase after the REDO log has been applied.

This is also a phase that can take some time. All LDM threads execute in
parallel in this phase. The more records and the more ordered indexes we
have the longer this phase takes to execute.

Now the node has managed to restore an old but consistent version of the
fragments we were asked to restore.

The next phase is to synchronise the old data with the live data in the
nodes that are up and running and contains the latest version of each
record.

As a new feature of MySQL Cluster 7.4 this phase can be parallelised
among the LDM threads. So the node can synchronise with data in the
live nodes in many threads in parallel.

This phase can be delayed in cases with many tables and many concurrent
node restarts, since we need a lock on a fragment before copying its
meta data to new starting nodes, and this lock is also needed before
starting the copy phase that synchronises with the live nodes.

At this point in time the node has an up-to-date version of all
restored fragments. The node does however not have a durable version
yet. For the fragments to be restorable on our starting node, the node
needs to participate in a full local checkpoint.

Waiting for local checkpoints in various manners is the most common
cause of delayed restarts in 7.3 and older versions of MySQL Cluster.
In MySQL Cluster 7.4 we have managed to decrease the amount of waiting
for local checkpoints; in addition we have been able to speed up local
checkpointing in general, and more specifically while node restarts are
executing. Those two things, plus the parallelisation of the
synchronisation phase, are the main ways in which we managed to
decrease restart times by a factor of 5.5x in MySQL Cluster 7.4.

In order to ensure that as many nodes as possible get started together
after waiting for the same local checkpoint execution, we have the
ability to block the start of a local checkpoint if nodes are close to
reaching this wait state. Given that we have statistics on how long
restarts take, we can predict whether it is useful to wait for another
node to reach this state before we start a new local checkpoint.

Finally, when we have waited for a complete LCP, it is time for the
final phase, which is to wait for the subscription handover to occur.
This relates to the MySQL replication support in MySQL Cluster, where
we have to contact all MySQL Servers in the cluster to inform them that
our new node can also generate replication events.

After this final phase the restart is completed.

As an example, we measured that a node restart with around 26 GByte of
data took around 3-4 minutes in MySQL Cluster 7.4.

The speed of node restarts is highly dependent on the local checkpoint
speed. We have introduced a number of new configuration parameters to
control this speed, and also a new Ndbinfo table to track the
checkpoint speed. We will describe this in a separate blog entry.

Each of the phases described above is represented as a column in the
ndbinfo.restart_info table. We report the time that we spend in each
phase measured in seconds. There is a row for each restart that has
been observed. This means that not all nodes are present in this table.
After an initial start or after a cluster restart the table is even empty.
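As a sketch of how such a table can be used, the snippet below takes
one row of per-phase timings and reports the slowest phase. The phase
names here are illustrative only, not the exact column names of
ndbinfo.restart_info:

```python
# Illustrative only: a restart_info-style row as a dict mapping
# phase name -> seconds spent, from which we pick the slowest phase.

def slowest_phase(row):
    """Return the (phase, seconds) pair with the largest time."""
    return max(row.items(), key=lambda item: item[1])

restart_row = {
    "complete_node_failure": 2,
    "allocate_node_id": 0,
    "include_in_heartbeat_protocol": 3,
    "copy_meta_data": 1,
    "restore_fragments": 45,
    "undo_disk_data": 0,
    "execute_redo_log": 30,
    "rebuild_indexes": 25,
    "synchronise_with_live_nodes": 40,
    "wait_for_local_checkpoint": 60,
    "subscription_handover": 2,
}
phase, seconds = slowest_phase(restart_row)
print(phase, seconds)  # wait_for_local_checkpoint 60
```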

We will also describe, in a separate blog entry, the additional log
printouts that we have added to the node logs describing what is going
on in the node during a node restart.

Tuesday, March 03, 2015

MySQL Cluster on my "Windows computer"

It's been some time since I wrote my last blog. As usual this means that I have
been busy developing new things. Most of my blogs are about describing new
developments that happened in the MySQL Server, MySQL Cluster, MySQL
partitioning and other areas I have been busy developing in. For the last year I have
been quite busy in working with MySQL Cluster 7.4, the newest cluster release. As
usual we have been able to add some performance improvements. But for
MySQL Cluster 7.4 the goal has also been to improve quality. There are a number
of ways that one can improve quality. One can improve quality of a cluster by making
it faster to restart as problems appear. One can also improve it by improving code
quality. We have done both of those things.

In order to improve my own possibilities to test my new developments I decided to
invest in a "Windows computer". This means a computer that I house in my window in
my living room :)

There are some very interesting new computers that you can buy from
Intel today that work perfectly as a test platform for a distributed
database like MySQL Cluster. It is called the Intel NUC. It is actually
a computer that you need to assemble on your own. It comes equipped
with a dual core Intel i5-4250U CPU that runs at 1.3GHz and can reach a
turbo frequency of 2.3GHz (the current generation has been upgraded to
Intel i5-5250U processors). To this computer you can add up to 16 GByte
of memory and a 2.5" SSD drive, and it's even possible to have an M.2
SSD card in addition to the 2.5" SSD drive. Personally I thought it was
enough to have one SSD drive of 250GB.

So this computer is amazingly small, but still amazingly capable. I have it placed in
the window in front of my desk behind our TV set. It is so quiet that I can't even
hear it when all the rest of the equipment is shut off.

Still it is capable enough to run our daily test suite with a 4 data node setup. This
test suite runs more than 400 different test cases where we test node restarts, system
restarts, index scans, backups, pk lookups and so forth. One such test run takes
13 hours, so it is nice to be able to have this running on such a small box that can
run in the background without me having to interfere at all and without it making any
noise.

So it's amazing how scalable the MySQL Cluster software is: I can run
test suites with a 4 data node setup on a box that fits in the palm of
my hand. At the same time I can run benchmarks using 100 2-socket
servers, requiring probably 4-5 racks of computers, that achieve 200M
reads per second.

Here is a little description of how you can setup a similar box to be running
daily test runs for MySQL Cluster if ever you decide that you want to try to
develop a new feature for MySQL Cluster.

1) First, the test suite requires an ndb_cpcd process to be running.
This process takes care of starting and stopping the MySQL Cluster
processes.

To do this, create a new directory; I called mine /home/mikael/cpcd.
In this directory create a small script that starts the ndb_cpcd.
It contains the following command:
ndb_cpcd --work-dir=/home/mikael/cpcd --logfile=/home/mikael/cpcd/cpcd.log

For it to run you need to compile MySQL Cluster and copy ndb_cpcd to e.g.
/usr/local/bin. This binary doesn't really change very often, so you can
have one compiled and need not change it. I call this script start_ndb_cpcd.sh.

Then you start the ndb_cpcd in one window using
./start_ndb_cpcd.sh

2) In order to run the test suite I created a directory I called
/home/mikael/mysql_clones/autotest_run

This is where I run the test suite from. For this I need two files.
The first is autotest-boot.sh, which is found in the MySQL Cluster
source at storage/ndb/test/run-test/autotest-boot.sh. In addition I
create here the configuration file used by the autotest-boot.sh script;
it's called autotest.conf.

In my case this file contains:

install_dir="/home/mikael/mysql_clones/autotest_install"
build_dir="/home/mikael/mysql_clones/autotest_build"
git_local_repo="/home/mikael/mysql_clones/test_74"
git_remote_repo="/home/mikael/mysql_git"
base_dir="/home/mikael/mysql_clones/autotest_results"
target="x86_64_Linux"
hosts="mikael1 mikael1 mikael1 mikael1 mikael1 mikael1"
report=
WITH_NDB_JAVA_DEFAULT=0
WITH_NDB_NODEJS=0
export WITH_NDB_JAVA_DEFAULT WITH_NDB_NODEJS
MAKEFLAGS="-j7"
export MAKEFLAGS

install_dir is the place where the build of the MySQL Cluster source is installed.

build_dir is the place where the build of the MySQL Cluster is placed.

git_local_repo is a git branch of MySQL Cluster 7.4.

git_remote_repo is a git repo of the entire MySQL source.

base_dir is the directory where the results of the test run are placed in a
compressed tarball.

target is the computer and OS, in my case an x86_64 running Linux.

hosts is the set of hosts I use. There should be 6 hosts here; in my
case they are all the same host, which is called mikael1.

I have report set to nothing, and in order to avoid having to build the
Java API and NodeJS APIs I set WITH_NDB_JAVA_DEFAULT=0 and
WITH_NDB_NODEJS=0. Finally I set MAKEFLAGS to get good parallelism when
building MySQL Cluster.

In order to run I need to have git installed, CMake as well and possibly some
more things. If one uses an older git version (like I do), then one needs to
change the git command in autotest-boot.sh a little bit.

Finally one needs to add a new file to the MySQL branch from where you run,
/home/mikael/mysql_clones/test_74 in my case. This file is called in my
case /home/mikael/mysql_clones/test_74/storage/ndb/test/run-test/conf-mikael1.cnf.

The autotest-boot.sh creates the config file of the cluster from this file.
In my case this file contains:

# Copyright (c) 2015, Oracle and/or its affiliates. All rights reserved.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; version 2 of the License.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301  USA

[atrt]
basedir = CHOOSE_dir
baseport = 14000
#clusters = .4node
clusters = .2node
fix-nodeid=1
mt = 0

[ndb_mgmd]

[mysqld]
innodb
skip-grant-tables
socket=mysql.sock
default-storage-engine=myisam

[client]
protocol=tcp

[cluster_config.2node]
ndb_mgmd = CHOOSE_host1
#ndbd = CHOOSE_host2,CHOOSE_host3,CHOOSE_host4,CHOOSE_host5
ndbd = CHOOSE_host2,CHOOSE_host3
ndbapi= CHOOSE_host1,CHOOSE_host1,CHOOSE_host1
#mysqld = CHOOSE_host1,CHOOSE_host6
mysqld = CHOOSE_host1,CHOOSE_host4

NoOfReplicas = 2
IndexMemory = 100M
DataMemory = 500M
BackupMemory = 64M
MaxNoOfConcurrentScans = 100
MaxNoOfSavedMessages= 5
NoOfFragmentLogFiles = 4
FragmentLogFileSize = 32M
ODirect=1
MaxNoOfAttributes=2000
Checksum=1

SharedGlobalMemory=256M
DiskPageBufferMemory=256M
#FileSystemPath=/home/mikael/autotest
#FileSystemPathDataFiles=/home/mikael/autotest
#FileSystemPathUndoFiles=/home/mikael/autotest
InitialLogfileGroup=undo_buffer_size=64M;undofile01.dat:256M;undofile02.dat:256M
InitialTablespace=datafile01.dat:256M;datafile02.dat:256M
TimeBetweenWatchDogCheckInitial=60000

[cluster_config.ndbd.1.2node]
TwoPassInitialNodeRestartCopy=1

[cluster_config.ndbd.3.2node]
TwoPassInitialNodeRestartCopy=1

In the current setup it uses 2 data nodes, but it can also easily run
with 4 data nodes by simply changing a few lines above.
mt = 0 means that I always run with the ndbd process, which is the
smallest way to run MySQL Cluster data nodes, where all blocks run
within one thread. It is possible to run the data nodes with more than
50 threads, but on this machine it makes more sense to use ndbd.

The name of the file should be conf-HOST.cnf where you replace HOST with the hostname of
your test computer.

Finally in my case I also changed one line in
storage/ndb/test/run-test/atrt-gather-results.sh

as follows:

#    rsync -a --exclude='BACKUP' --exclude='ndb_*_fs' "$1" .
    rsync -a --exclude='BACKUP' "$1" .

The idea is that I should also get the file system of the cluster reported
as part of the result tarball. This increases the size of the result tarball
significantly, but if one is looking for bugs that happen in writing of REDO
logs, UNDO logs, checkpoints and so forth, then this is required to be able
to find the problem.

Finally we have now come to the point where we need to execute the
actual test run, we place ourselves in the autotest_run directory
and execute the command:

./autotest-boot.sh --clone=mysql-5.6-cluster-7.4 daily-basic

During the test execution one can look into autotest_install. There is
a directory there, starting with run, that contains the currently
running test, and if a test fails for some reason a result.number
directory is created there with the log information from the failure;
successful test cases don't produce any logs. The file log.txt contains
the current test being executed.

Finally, the tests executed for daily-basic are defined in the file:
/home/mikael/mysql_clones/test_74/storage/ndb/test/run-test/daily-basic-tests.txt

So by adding or removing tests in this file you can add your own test
cases. Most of the tests are defined in the
/home/mikael/mysql_clones/test_74/storage/ndb/test/ndbapi directory.

Good luck in trying out this test environment. Remember that any
changes to files in the test_74 directory also need to be committed in
git, since the test_74 directory is cloned off using git commands.

Monday, March 02, 2015

200M reads per second in MySQL Cluster 7.4

By courtesy of Intel we had access to a very large cluster of Intel servers for a few
weeks. We took the opportunity to see the improvements of the Intel
servers in the new Haswell implementation on the Intel Xeon chips. We also took
the opportunity to see how far we can now scale flexAsynch, the NoSQL benchmark
we've developed for testing MySQL Cluster.

Last time we tested we were using MySQL Cluster 7.2, and the main
bottleneck then was that the API nodes could not push through more than
around 300k reads per second, combined with the limit of at most 255
nodes in total. This meant that we were able to reach a bit more than
70M reads per second using MySQL Cluster 7.2.

In MySQL Cluster 7.3 we improved the handling of thread contention in the NDB API
which means that we are now able to process much more traffic per API node.
In MySQL Cluster 7.4 we also improved the execution in the NDB API receive
processing, and we also improved the handling of scans and PK lookups in the data
nodes. This meant that now each API node can process more than
1M reads per second. This is very good throughput given that each read contains
about 150 bytes. So this means that each socket can handle more than 1Gb/second.
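A quick sanity check of that arithmetic, using the numbers from the
text:

```python
# 1M reads per second of ~150 byte records per API node.
reads_per_second = 1_000_000
bytes_per_read = 150

gigabits_per_second = reads_per_second * bytes_per_read * 8 / 1e9
print(gigabits_per_second)  # 1.2, i.e. more than 1 Gb/second per socket
```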

To describe what we achieved we'll first describe the HW involved.
The machines had 2 sockets with Intel E5-2697 v3 processors. These are
Haswell-based Intel Xeon that have 14 cores and 28 CPU threads per CPU socket.
Thus a total of 28 cores and 56 CPU threads in each server operating at 2.6GHz base
frequency and a turbo frequency of 3.6GHz. The machines were equipped with
64 GByte of memory each. They had an Infiniband connection and
a gigabit ethernet port for communication.

The communication to the outside was actually limited by the Infiniband
interrupt handling, which was set up to be latency-optimised, resulting
in higher interrupt rates. We did however manage to push flexAsynch
such that this limitation was very minor; it limited the performance
loss to within 10% of the maximum performance available.

We started testing using just 2 data nodes with 2 replicas. In this test we were able
to reach 13.94M reads per second. Using 4 data nodes we reached
28.53M reads per second. Using 8 data nodes we were able to scale it almost
linearly up to 55.30M reads per second. We managed to continue the
almost linear scaling even up to 24 data nodes where we achieved
156.5M reads per second. We also achieved 104.7M reads per second on a
16-node cluster and 131.7M reads on a 20-node cluster. Finally we took the
benchmark to 32 data nodes where we were able to achieve a new record of
205.6M reads per second.



The configuration we used in most of these tests had:
 12 LDM threads, non-HT
 12 TC threads, HT
 2 send threads, non-HT
 8 receive threads, HT
where HT means that we used both CPU threads in a core and non-HT meant
that we only used one thread per CPU core.

We also tried with 20 LDM threads HT, which gave similar results to 12 LDM
threads non-HT. Finally we had threads for replication, main, io and other activities
that were not used much in those benchmarks.
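A thread setup along these lines is expressed with the ThreadConfig
parameter in the data node configuration. The sketch below is
illustrative only; the cpubind ranges are placeholders that depend on
the actual CPU layout, not the exact configuration we used:

```ini
# Illustrative ThreadConfig sketch: 12 LDM, 12 TC, 2 send and 8 recv
# threads. The cpubind ranges are hypothetical placeholders.
ThreadConfig=ldm={count=12,cpubind=1-12},tc={count=12,cpubind=28-39},send={count=2,cpubind=13-14},recv={count=8,cpubind=40-47}
```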

We compared the improvement of Haswell versus Ivy Bridge (Intel Xeon
v2) servers by running a similar configuration with 24 data nodes. With
Ivy Bridge (which had 12 cores per socket and thus 24 cores and 48 CPU
threads in total) we reached 117.0M reads per second, and with Haswell
we reached 156.5M reads per second. So this is a 33.8% improvement.
Important to note here is that Haswell was slightly limited by the
interrupt handling of Infiniband, whereas the Ivy Bridge servers were
not limited by this. So the real difference is probably more in the
order of 40-45%.

At 24 data nodes we tested scaling of the number of API nodes. We
started with 1 API machine using 4 API node connections. This gave
4.24M reads per second. We then tried with 3 API machines using a total
of 12 API node connections, where we achieved 12.84M reads per second.
We then added 3 machines at a time, with 12 new API connections each
time, and each step added more than 12M reads per second, giving 62.71M
reads per second at 15 API machines and 122.8M reads per second at 30
API machines. The almost linear scaling continued up to 37 API
machines, where we achieved the best result of 156.5M reads per second.
Performance at 40 API machines was about the same, at 156.0M reads per
second. The performance was saturated here since the interrupt handling
could not handle more packets per second. Even without this limit, the
data nodes were close to saturating the CPUs for the LDM threads, the
TC threads and the send threads.

Running with clusters like this is interesting. The bottlenecks can be
trickier to find than in the normal case. One must remember that a
benchmark with 37 API machines and 24 data nodes, where each machine
has 28 CPU cores and thus more than 1000 CPU cores are involved,
requires understanding a complex queueing network.

What is interesting here is that the queueing network behaves best if there is some
well behaved bottleneck in the system. This bottleneck ensures that the flow
through the remainder of the system behaves well. However in some cases where
there is no bottleneck in the system one can enter into a wave of increasing and
decreasing performance. We have all experienced this type of
behaviour of queueing networks while being stuck in car queues.

What we discovered is that MySQL Cluster can enter such waves if the
config doesn't have any natural bottlenecks. What happens here is that
the data nodes send results back to the API nodes in an eager fashion.
This means that the API nodes receive many small packets to process.
Since small packets take longer to process per byte compared to large
packets, this has the consequence that the API nodes slow down, which
in turn means that the benchmark slows down. After a while the data
nodes start sending larger packets again to speed things up, and then
the too eager sending sets in again.

To handle this we introduced a new configuration parameter,
MaxSendDelay, in MySQL Cluster 7.4. This parameter ensures that we are
not so eager in sending responses back to the API nodes. We send
immediately if there is no other competing traffic, but if there is, we
delay sending a bit to ensure that we're sending larger packets. One
can say that we're introducing an artificial bottleneck into the send
part. This artificial bottleneck can in some cases improve throughput
by as much as 100% or more.
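The effect can be illustrated with a toy cost model. All numbers below
are made up for illustration; this is not the actual NDB send logic. If
each packet carries a fixed per-packet receive overhead, batching more
reads per packet leaves the API node far less work per read:

```python
# Toy model: receiver cost = per-packet overhead + per-read work.
def receiver_busy_microseconds(total_reads, reads_per_packet,
                               per_packet_cost=10.0, per_read_cost=0.5):
    packets = total_reads / reads_per_packet
    return packets * per_packet_cost + total_reads * per_read_cost

eager = receiver_busy_microseconds(1_000_000, reads_per_packet=2)
batched = receiver_busy_microseconds(1_000_000, reads_per_packet=50)
# Eager sending of tiny packets costs several times more receiver work.
print(eager, batched)
```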

The conclusion is that MySQL Cluster 7.4 on the new Haswell computers
is capable of stunning performance. It can deliver 205.6M reads per
second of records a bit larger than 100 bytes, thus providing a data
flow of more than 20 GBytes per second of key lookups, or 12.3 billion
reads per minute.