Monday, June 04, 2018

Improvements from NDB wakeup threads


In MySQL Cluster 7.6 we introduced a new thread type in the NDB API.
Traditionally each cluster connection has one send thread that assists in
sending messages to NDB and one receive thread that can assist in
receiving messages from NDB.
There is also a connection thread that listens for new connections and
connects to the NDB data nodes.

There is also a set of user threads that is created by the application
that uses the NDB cluster connection.

Most of the sending is done by the user threads. The NDB API send
thread is used only when we are sending faster than the network
is able to handle.

We can process received messages in the user threads or in the NDB
API receive thread. The default behaviour is to use the user threads
until more than 8 threads are working at the same time. So at high
concurrency it is the receive thread that handles the signals, and at
low concurrency they are handled directly by the NDB API user threads.
The threshold of 8 is configurable through the MySQL server variable
ndb_recv_thread_activation_threshold.
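
The rule above can be sketched in a few lines of Python. This is only an
illustration of the behaviour described in this post, not the NDB API
code. The function name receiver_for is made up for the example, and the
exact boundary (strictly more than the threshold) is an assumption taken
from the wording above.

```python
def receiver_for(active_threads: int, threshold: int = 8) -> str:
    """Say which thread type handles incoming signals (illustrative only).

    `threshold` plays the role of ndb_recv_thread_activation_threshold.
    """
    # Assumption: the receive thread takes over once concurrency is
    # strictly higher than the threshold, as worded in the post.
    if active_threads > threshold:
        return "receive_thread"
    return "user_thread"


if __name__ == "__main__":
    for n in (1, 8, 9, 64):
        print(n, receiver_for(n))
```

Raising the threshold keeps receiving in the user threads for longer,
lowering it hands over to the receive thread earlier.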

Receiving in the NDB API is slightly faster from user threads when
only one thread is active: response time is 3-4% better in this
particular case. However, as more and more threads send data to the
NDB data nodes, the efficiency of using the NDB API receive thread
increases.

One problem in using the NDB API receive thread is that it is responsible
both for receiving the messages from the NDB data nodes and for waking up
the NDB API user threads. At low load this is not an issue. But when the
load on the NDB API receive thread reaches 90% and beyond, this becomes
an issue.

To avoid this problem we added a new thread in the NDB API in MySQL Cluster 7.6:
the wakeup thread. This thread has only one duty, which is to wake up other
threads. We experimented with a number of different variants to see which one
ensured that user threads are woken up as quickly as possible.

Our conclusion was that at low load it is optimal for the receive thread to
handle the wakeups itself, but at very high load it requires assistance from
one wakeup thread. As load increases the receive thread handles fewer and
fewer wakeups. At 99-100% load the receive thread offloads more or less all
wakeup calls to the wakeup thread.
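
To make the idea concrete, here is a toy model in Python of how wakeups
could be split between the two threads as the receive thread load grows.
The post does not describe the actual policy inside the NDB API, so the
linear ramp, the 90% starting point, and the name split_wakeups are all
assumptions made for this sketch.

```python
from typing import Tuple


def split_wakeups(pending: int, recv_load: float,
                  start: float = 0.90) -> Tuple[int, int]:
    """Split pending wakeups between receive thread and wakeup thread.

    Returns (handled_by_receive_thread, offloaded_to_wakeup_thread).
    Hypothetical policy: nothing is offloaded up to `start` load, then
    the offloaded share ramps linearly to everything at 100% load.
    """
    if pending < 0:
        raise ValueError("pending must be non-negative")
    if not 0.0 <= recv_load <= 1.0:
        raise ValueError("recv_load must be between 0.0 and 1.0")
    if recv_load <= start:
        share = 0.0
    else:
        share = (recv_load - start) / (1.0 - start)
    offloaded = round(pending * share)
    return pending - offloaded, offloaded


if __name__ == "__main__":
    for load in (0.50, 0.92, 0.97, 1.00):
        print(load, split_wakeups(100, load))
```

The only properties the sketch is meant to show are the ones stated
above: no offloading at low load, and close to full offloading when the
receive thread is saturated.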

In the figure above we compare 7.5.9 with 7.6.6 in a normal sysbench
OLTP RW experiment. As can be seen there is no difference until
we reach 32 connections. As we start to offload a subset of the wakeups
to the wakeup thread we improve performance of the application.

The throughput increases by 5% due to this new feature. With even more
threads the performance drops more slowly, such that we gain 15-20% more
performance at 512 connections.

The best performance is normally achieved by using the NDB API
receive thread and locking this thread to a specific CPU.
When starting the MySQL server one specifies these CPUs in the
configuration parameter ndb_recv_thread_cpu_mask. If the MySQL
Server uses several NDB cluster connections, the parameter
should specify one CPU per cluster connection.
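
As a small aid for building the mask value, the Python sketch below turns
a list of CPU ids into a hexadecimal bitmask. It assumes the parameter
takes a hexadecimal bitmask where bit n stands for CPU n, so check the
reference manual for your version before relying on it. The helper name
cpu_mask is made up for this example.

```python
from typing import Iterable


def cpu_mask(cpus: Iterable[int]) -> str:
    """Build a hexadecimal CPU bitmask from CPU ids (bit n = CPU n)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU ids must be non-negative")
        mask |= 1 << cpu  # set the bit that represents this CPU
    return hex(mask)


if __name__ == "__main__":
    # One cluster connection, receive thread on CPU 0.
    print(cpu_mask([0]))
    # Two cluster connections, receive threads on CPU 0 and CPU 1.
    print(cpu_mask([0, 1]))
```

With several cluster connections, pass one CPU id per connection so that
each receive thread gets its own CPU.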

When locking the NDB API receive thread to a CPU, it is important to
also lock the MySQL server process to other CPUs. If other processes
are running on the same machine, these also need to be locked to
CPUs that do not interfere with the NDB API receive thread.
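
Keeping the CPU sets disjoint is easy to get wrong by hand, so here is a
minimal Python sketch that splits the CPUs of a machine into the ones
reserved for receive threads and the ones left for everything else. The
function name split_cpus and the CPU numbering in the example are
assumptions for illustration only.

```python
from typing import Iterable, List, Tuple


def split_cpus(all_cpus: Iterable[int],
               recv_cpus: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Return (receive_thread_cpus, remaining_cpus) as disjoint lists."""
    available = set(all_cpus)
    recv = set(recv_cpus)
    unknown = recv - available
    if unknown:
        raise ValueError("not present on this machine: %s" % sorted(unknown))
    # Everything not reserved for receive threads is left for the
    # MySQL server process and any other processes.
    return sorted(recv), sorted(available - recv)


if __name__ == "__main__":
    recv, rest = split_cpus(range(8), [0])
    print("receive thread CPUs:", recv)
    print("CPUs for other work:", rest)
```

The second list is what one would hand to whatever tool is used to bind
the MySQL server process and other processes, for example taskset or
numactl on Linux.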

The figures above show the improvements when one of the CPU cores is
locked to handle the NDB API receive thread. Locking the receive
thread to a CPU adds another 5% to the total throughput and up to
20% more at high thread counts.

So what we have achieved with MySQL Cluster 7.6 is that we can increase
the throughput by at least 10%, and performance at high thread counts
can increase by as much as 40%. All these numbers are still using the
TCP transporter. In a coming blog post we will show how these numbers
increase even more when using the shared memory transporter. In addition
we will show how using the thread pool with NDB can further stabilise
throughput at high thread counts.

The experiments above were always run with one data node using 8 LDM
threads, with the data node locked to CPUs within one CPU socket. The
MySQL Server is locked to 30 CPUs (15 CPU cores). In all cases
the bottleneck is that we only use one cluster connection. In 7.5.9 this
cluster connection scales to about 18 CPUs and with 7.6.6 it scales to
more than 20 CPUs. So using one cluster connection per 8 CPU cores is
usually appropriate.
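
Applied as arithmetic, the rule of thumb looks like the Python sketch
below. The function name cluster_connections is made up for this
example, and rounding up to a whole connection is an assumption, since
the post only gives the ratio.

```python
import math


def cluster_connections(cpu_cores: int, cores_per_connection: int = 8) -> int:
    """Estimate cluster connections for a MySQL Server on `cpu_cores` cores."""
    if cpu_cores <= 0 or cores_per_connection <= 0:
        raise ValueError("core counts must be positive")
    # Round up so that no connection has to serve more than the ratio.
    return max(1, math.ceil(cpu_cores / cores_per_connection))


if __name__ == "__main__":
    # The MySQL Server in the experiment above was locked to 15 CPU cores.
    print(cluster_connections(15))
```

The resulting number is what one would set for the MySQL server option
ndb-cluster-connection-pool.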
