Wednesday, April 01, 2015

Benchmarking MySQL Cluster 7.4 on an Intel NUC

I have done a lot of benchmarks of MySQL Cluster on large servers which
is obviously very interesting. As mentioned in a previous blog I have
an Intel NUC machine now easily accessible. So I thought it would be
fun to make some benchmarks on this machine to see how fast MySQL Cluster
runs on small HW.

First a little description of the HW. The CPU is an Intel Core i5-4250
CPU. It runs at 1.3GHz and have a turbo frequency of 2.3 GHz. The CPU
has two cores and each core can run two threads simultaneously
(called hyperhtreading in Intel CPUs). It comes with the box containing
the motherboard and the CPU. Then you buy one or two DRAMs to it and an
SSD drive. I installed two DDR3L DRAMs which gives me a total of 16GByte
memory in the machine. In addition I installed an SSD drive of 256GByte.
The box fits nicely into the palm of your hand.

On top of this HW I installed Oracle Linux 7.

The configuration of the benchmark uses my laptop as benchmark driver
and uses the Intel NUC to run the data node. Since I only have one
Intel NUC available the benchmarks focused on a setup with only one
replica.

To start with I will focus this blog on performance of a table which
is a pure main memory where after a restart the table still exists,
but is empty. This kind of tables are useful to keep track of live
states where it doesn't make sense to recover the table since the
live state changes so often that the restored data is useless.

When recoverability on the table is active it adds lots of writes
of logs and checkpoints that adds an extra overhead of about
30-50% for inserts, deletes and updates. Reads are about as fast
with recoverability as without. I’ll dig into this type of benchmark
in a later blog.

So to the results of the benchmark run. Inserts were processed at a
rate of 400k per second, updates at 402k per second, deletes at
433k per second. Reads I got up to 589k per second. For both writes
and reads we're operating here at full CPU speed and also on full
network speed (Gigabit Ethernet).

This benchmark was executed with 1 primary key of 8 bytes and one
field with 100 bytes in it.

I was curious to see what the top rate of key lookups was with very
small records where the network speed wasn't a factor in the play.
So I changed the field from 100 bytes to 8 bytes and reran the
benchmark.

Now I got inserts at 476k per second, updates at 479k per second,
deletes at 474k per second and finally reads at 925k per second.
So close to a million reads per second on this tiny machine alone.

Given that I ran without logging to disk most of the configuration
was pretty straightforward of the data node. The most important
part is obviously how to make use of the 4 CPU threads in the 2
available cores.

As usual the LDM threads is where the actual database resides is
the thread that requires the most CPU, in this particular benchmark
also the TC threads require a lot of CPU since they are involved
in each and every key lookup. For e.g. applications with scans the
TC thread is a lot less involved. There is also a send thread
and a receive thread which is important to configure correctly.
These four thread types are the most important to configure when
running without recoverability. When adding recoverability the
configuration of the file io threads is also important. There is
also a main thread and replication thread and some connection
threads, but this have no impact on the performance of this
benchmark, but they were configured to be on the CPU with lowest
use.

The best results were achieved with only one thread per thread
type. The best config of those was to allow the LDM thread to
use its own CPU core and thus not use hyperthreading for this
thread to any great extent. Then one CPU thread in the other
core was used for the TC thread and the other CPU thread was
used for the send and the receive thread. This CPU thread also
handled the other threads.

When running with 2 LDM threads the number of threads increased
and the OS had to do more scheduling, most likely this config
could provide better results in a more bursty load, but for
sustained good performance it was better to use only a few
threads. The performance difference was though quite small
between using 1 LDM and 1 TC compared to using 2 LDMs and
2 TCs.

The benchmark application was executed on my laptop connected
to the Intel NUC machine through a gigabit ethernet switch.

8 comments:

Andy said...

It's surprising that MySQL Cluster can do 589k reads per second on a dual core i5 machine

In a previous post (http://mikaelronstrom.blogspot.com/2015/03/200m-reads-per-second-in-mysql-cluster.html) you wrote "now each API node can process more than
1M reads per second" - but that was on a dual-socket server with 28 cores.

If MySQL Cluster can do 589k reads per second on a tiny dual core machine why can it only do 1M reads per second on a much more powerful 28 core server? Where's the bottleneck in the latter case?

Mikael Ronstrom said...

What wasn't obviously 100% clear in the post about 200M reads per second is that each server can have multiple API nodes. One API node is one socket connection to the cluster. One API program can have multiple API nodes connected to the cluster. So one server with 28 cores can execute at least 5-10M reads per second.

Also in this benchmark the Intel NUC only handled the data node, so the benchmark driver was executed on a high-end laptop with 4 cores with 8 CPU threads.

So the API also requires a fair amount of CPU power.

mihasusaec said...

Hello.
Could you provide ThreadConfig that you were using for such processor?
Thanks.

Mikael Ronstrom said...

I tried with a few different ThreadConfig variants. I tried with 1 and 2 LDM threads.
The best results came with 1 LDM thread that took care of 1 of the cores. Then I
placed 1 tc thread on the other core together with send, recv, io and so forth.

Anonymous said...

Hi Mikael,

I've been introduced to MySQL Cluster recently at my company.
I have two questions I was hoping you could help with.

1) Is there a limit on how many API nodes I can make the one mysqld connect to? Is this even a good idea?

At first, the cluster was setup with 5 APIs, one of which was being used by the mysqld. I then read about connection pools so I made the same mysqld connect to all the 5 API nodes. But performance didn't change. Any idea why? Currently using ndbmtd.

2) I'm using LockPagesInMainMemory=1 and I'm getting a warning in the cluster log saying, "Failed to memlock pages, error: 12 (Cannot allocate memory)". I checked ulimit -m and it's unlimited for the mysql user. What could I be missing?

Thank you

Mikael Ronstrom said...

mysqld nodes connect to DB nodes and a mysqld node is an API node.

The maximum number of DB nodes in a cluster is 48 and the max total
number of nodes is 255.

Not sure what you meant by connection pools and API nodes.

LockPagesInMemory failure can be due to not having sufficient privileges.

Anonymous said...

"LockPagesInMemory failure can be due to not having sufficient privileges."
Any chance of knowing what these privileges would be?
Also, with that error, would that be affecting performance?

Thank you again

Mikael Ronstrom said...

Privileges required to lock memory depends on OS.
Performance is affected only in cases of overloaded memory where swapping is avoided.
So for normal use case it has no effect, but if one starts up too many processes at the
same time it will ensure that the performance stays good even in overcommitted case.