Monday, March 02, 2015

200M reads per second in MySQL Cluster 7.4

By courtesy of Intel we had access to a very large cluster of Intel servers for a few
weeks. We took the opportunity to see the improvements of the Intel
servers in the new Haswell implementation on the Intel Xeon chips. We also took
the opportunity to see how far we can now scale flexAsynch, the NoSQL benchmark
we've developed for testing MySQL Cluster.

Last time we tested we were using MySQL Cluster 7.2 and the main bottleneck
then was that the API nodes could not push through more than around 300k reads
per second and we have a limit of up to 255 nodes in total. This meant that we
were able to reach a bit more than 70M reads per second using MySQL Cluster 7.2.

In MySQL Cluster 7.3 we improved the handling of thread contention in the NDB API
which means that we are now able to process much more traffic per API node.
In MySQL Cluster 7.4 we also improved the execution in the NDB API receive
processing, and we also improved the handling of scans and PK lookups in the data
nodes. This meant that now each API node can process more than
1M reads per second. This is very good throughput given that each read contains
about 150 bytes. So this means that each socket can handle more than 1Gb/second.

To describe what we achieved we'll first describe the HW involved.
The machines had 2 sockets with Intel E5-2697 v3 processors. These are
Haswell-based Intel Xeon that have 14 cores and 28 CPU threads per CPU socket.
Thus a total of 28 cores and 56 CPU threads in each server operating at 2.6GHz base
frequency and a turbo frequency of 3.6GHz. The machines were equipped with
64 GByte of memory each. They had an Infiniband connection and
a gigabit ethernet port for communication.

The communication to the outside was actually limited by the Infiniband interrupt
handling. The Infiniband interrupt handling was set up to be latency-optimised
which results in higher interrupt rates. We did however manage to push the
flexAsynch such that this limitiation was very minor, it limited the performance
loss to within 10% of the maximum performance available.

We started testing using just 2 data nodes with 2 replicas. In this test we were able
to reach 13.94M reads per second. Using 4 data nodes we reached
28.53M reads per second. Using 8 data nodes we were able to scale it almost
linearly up to 55.30M reads per second. We managed to continue the
almost linear scaling even up to 24 data nodes where we achieved
156.5M reads per second. We also achieved 104.7M reads per second on a
16-node cluster and 131.7M reads on a 20-node cluster. Finally we took the
benchmark to 32 data nodes where we were able to achieve a new record of
205.6M reads per second.

The configuration we used in most of these tests had:
 12 LDM threads, non-HT
 12 TC threads, HT
 2 send threads, non-HT
 8 receive threads, HT
where HT means that we used both CPU threads in a core and non-HT meant
that we only used one thread per CPU core.

We also tried with 20 LDM threads HT, which gave similar results to 12 LDM
threads non-HT. Finally we had threads for replication, main, io and other activities
that were not used much in those benchmarks.

We compared the improvement of Haswell versus Ivy Bridge (Intel Xeon v2) servers
by running a similar configuration with 24 data nodes. With Ivy Bridge
(which had 12 cores per socket and thus 24 cores and 48 CPU threads in total) we
reached 117.0M reads per second and with Haswell we reached
156.5M reads per second. So this is a 33.8% improvement. Important to note here
is that Haswell was slightly limited by the interrupt handling of Infiniband
whereas the Ivy Bridge servers were not  imited by this. So the real difference is
probably more in the order of 40-45%.

At 24 nodes we tested scaling on number of API nodes. We started at 1 API machine
using 4 API node connections. This gave 4.24M reads per second. We then tried with
3 API machines using a total of 12 API node connections where we achieved
12.84M reads per second. We then added 3 machines at a time with 12 new API
connections and this added more than 12M reads per second giving 62.71M reads
per second at 15 API machines, 122.8M reads per second at 30 API machines and
linear scaling continued until 37 API machines where we achieved 156.5M reads
per second. The best results was achieved at 37 API machines where we achieved
156.5M reads per second. Performance of 40 API machines was about the same as at
37 API machines at 156.0M reads per second. The performance was saturated here
since the interrupt handling could not handle more packets per second. Even
without this the data node was close to saturating the CPUs for both the LDM and
the TC threads and the send threads.

Running with clusters like this is interesting. The bottlenecks can be more tricky
to find than the normal case. One must remember that running a benchmark with
37 API machines and 24 data nodes where each machine has 28 CPU cores, thus
more than 1000 CPU cores are involved, it requires understanding a complex
queueing network.

What is interesting here is that the queueing network behaves best if there is some
well behaved bottleneck in the system. This bottleneck ensures that the flow
through the remainder of the system behaves well. However in some cases where
there is no bottleneck in the system one can enter into a wave of increasing and
decreasing performance. We have all experienced this type of
behaviour of queueing networks while being stuck in car queues.

What we discovered is that MySQL Cluster can enter such waves if the config doesn't
have any natural bottlenecks. What happens here is that the data nodes are able to
send results back to the API nodes in an eager fashion. This means that the API nodes
receives many small packets to process. Since small packets takes longer to process
per byte compared to large packets this has the consequence that the API node slows
down. This in turn means that the benchmark slows down. After a while the data nodes
starts sending larger packets again to speed things up and again it hits too eager

To handle this we introduced a new configuration parameter MaxSendDelay in
MySQL Cluster 7.4. This parameter ensures that we are not so eager in sending
responses back to the API nodes. We will send immediately if there is no other
competing traffic, but if there is other competing traffic, we will delay sending
a bit to ensure that we're sending larger packets. One can say that we're
introducing an artificial bottleneck into the send part. This artificial bottleneck
can in some cases improve throughput by as much as 100% and  more.

The conclusion is that MySQL Cluster 7.4 using the new Haswell computers is
capable of stunning performance. It can deliver 205.6M reads per second of
records a bit larger than 100 bytes, thus providing a data flow of more than
20 GBytes per second of key lookups or 12.3 billion reads per minute.


Shahryar said...

Good job!

Performance is amazing.

Sean Davidson said...

Super Mikael!

I would love a peek at the NDBD_DEFAULT section of the config.ini file for these tests. Or at least the ThreadConfig setting that supported this configuration:
12 LDM threads, non-HT
12 TC threads, HT
2 send threads, non-HT
8 receive threads, HT

I assume you enabled HT system-wide (in BIOS), prevented the OS from using a group of PUs (via isolcpus, IRQBALANCE_BANNED_CPUS, or both), and configured ThreadConfig in a particular way. Did you simply leave the 2nd PU of each HT group unbound for all "non-HT" threads?

Mikael Ronstrom said...

To execute the benchmark I used the benchmark scripts available at:

In the dbt2 tarball you will find the autobench.conf file which was used to generate the config file.

Here is the ThreadConfig part of it:

So I run with HT enabled, since I don't have root access I couldn't touch isolcpus and IRQBALANCE_BANNED_CPUS. To work around this I noticed which CPUs were used for interrupts and configured the data nodes and API nodes to not use these CPUs.

In the case above I actually use HT also for LDM threads, but to not use HT above I could simply
decrease the number of LDM threads to 10 and
use the CPUs 0-4 and 14-18. This means that no activity is placed on CPUs 28-32 and 42-46, so
effectively the LDMs will run as if they were running non-HT. The OS obviously still have access to those CPUs, but has no work to place there.

Anonymous said...

Good work! Was the read traffic generated on the API nodes, or where there additional client machines? How large was the dataset you were serving?

Mikael Ronstrom said...

The data nodes was on separate machines and there were 32 of those (32 data nodes) when running 200M reads per second. The flexAsynch program was executed on separate machines and they use the NDB API as API nodes. At 32 nodes I tested with 54 API machines and 72 API machines and thel 200M reads were reached using 72 machines running flexAsynch. So a total of 105 machines was part of the benchmark run courtesy of Intel (one machine was running as management server).