Wednesday, June 13, 2018

Node restart improvements in MySQL Cluster 7.6

A simple test to see how much faster restarts are in 7.6 compared to 7.5 is to
load a set of DBT2 warehouses and then perform a node restart of one of the
data nodes.

In the test we load 600 warehouses, which amounts to about 60 GByte of data
inserted into NDB in each data node.

A node restart of 7.5.10 using this data set takes 19 minutes and 11 seconds.
In 7.6.6 this restart takes 6 minutes and 49 seconds.

The difference can vary, especially since the restart time in 7.5 has a fairly high
variance. The restart time in 7.6.6 has much smaller variance.

We will look into the different restart phases and see how those are affected.

The first phase is to allocate the memory and ensure that the memory is allocated
to our process by touching the memory. This phase allocates roughly 2.7 GByte
per second. In this case it takes 26 seconds in both 7.5 and 7.6.

After that we have some preparatory steps. These are normally very quick and in
this case take about 1-2 seconds. These steps involve preparing the metadata to
use in the restart.

The next step is the restore phase where we recover data from a local checkpoint.
Given the introduction of Partial LCP in 7.6 this phase actually takes longer,
since there is some duplication in the restore where one row is restored multiple
times. This phase can be made faster by setting RecoveryWork to a smaller
number, but that instead increases the checkpointing work during normal
operation.

In this test 7.5 took 1 minute and 51 seconds to restore and 7.6.6 used 2 minutes
and 15 seconds.

The next phase is the REDO log execution phase. Since partial LCP means that
LCPs are shorter, this phase is shortened. In this particular benchmark it
doesn't shorten as much as it could, since there is heavy insert activity
right before the restart. But it still decreased from 51 seconds in 7.5 to
35 seconds in 7.6.

After this we execute the UNDO log. This particular benchmark has no disk data,
so both 7.5 and 7.6 spend no time in this phase. This phase has, however, been
improved by a factor of 5 with 4 LDM threads, and this benchmark uses 8 LDM
threads, so the phase is dramatically faster in 7.6 when disk data is used.

The next phase is the rebuilding of the ordered indexes. This phase executes the
same code as in 7.5, but we have changed the default configuration to ensure that
the rebuild is fully parallelised. This means that we have 16 CPUs
working on the rebuild instead of 8 CPUs as in 7.5. This gives a rebuild time
of 1 minute and 17 seconds in 7.6.6 compared to 2 minutes and 4 seconds in 7.5.
The reason the speedup isn't a full 2x is that we make use of hyperthreading to
speed things up.
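To my understanding the parameter involved here is BuildIndexThreads, so treat
the sketch below as an assumption rather than the authoritative setting. On
older versions the parallel rebuild can be requested explicitly in config.ini:

[ndbd default]
# Allow many threads to rebuild ordered indexes during a restart
# (128 is, as far as I know, the new default in 7.6).
BuildIndexThreads=128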

The copy fragment phase is more or less empty in both 7.5 and 7.6 since we
didn't perform any updates during the restart. We don't expect any major
differences in this phase between 7.5 and 7.6.

Next we come to the biggest gain in restart times in 7.6. This is the phase where
we wait for the local checkpoint to complete. In 7.5 we have to wait between
1 and 2 checkpoint times. In 7.5 the checkpoint in this benchmark takes about
11 minutes, although we have increased the disk write speed compared to the
default. In this execution 7.5 spent 13 minutes and 48 seconds in this phase.

In 7.6 we execute a checkpoint that is local to the starting node. This takes
2 minutes and 4 seconds. Finally we participate in a checkpoint where all
nodes participate. This is now extremely fast, it takes only 4 seconds since
no activity is ongoing. So the total time for this phase is 2 minutes and 8 seconds.

This wait-for-checkpoint phase is fairly constant in time in 7.6. In 7.5 it grows
with a growing database size and depends on the disk write speed we have
configured. Thus the gain in restart time in 7.6 is a bit variable, but this
experiment has been done with fairly conservative numbers.
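For reference, the checkpoint disk write speed is governed by a set of data node
parameters in config.ini; the values below are purely illustrative and not the
ones used in this benchmark:

[ndbd default]
# Bounds for checkpoint write speed during normal operation
MinDiskWriteSpeed=20M
MaxDiskWriteSpeed=40M
# Higher limits allowed while a restart is ongoing
MaxDiskWriteSpeedOwnRestart=400M
MaxDiskWriteSpeedOtherNodeRestart=100M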

The final restart phase is handing over the responsibility for event handling
between connected MySQL servers. This phase takes 5-6 seconds in both
7.5 and 7.6. It can take longer if some MySQL Server is down during the
restart.

As can be seen, we have significantly brought down restart times for a database
size that is common with 7.5; with larger sizes the difference is much bigger.

Tuesday, June 12, 2018

Benchmark of new cloud feature in MySQL Cluster 7.6


In previous blogs we have shown how MySQL Cluster can use the Read Backup
feature to improve performance when the MySQL Server and the NDB data
node are colocated.

There are two scenarios in a cloud setup where additional measures are
needed to ensure localized read accesses even when using the Read Backup
feature.

The first scenario is when data nodes and MySQL Servers are not colocated.
In this case by default we have no notion of closeness between nodes in
the cluster.

The second case is when we have multiple node groups and use colocated
data nodes and MySQL Servers. In this case we have a notion of closeness
to the data in the node group we are colocated with, but not to other
node groups.

In a cloud setup the closeness is dependent on whether two nodes are in
the same availability domain (availability zone in Amazon/Google) or not.
In your own network other scenarios could exist.

In MySQL Cluster 7.6 we added a new feature where it is possible
to configure nodes to be contained in a certain location domain.
Nodes that are close to each other should be configured to be part of
the same location domain. Nodes belonging to different location domains
are always considered to be further away than nodes in the same
location domain.

We will use this knowledge to always use a transaction coordinator placed
in the same location domain and if possible we will always read from a
replica placed in the same location domain as the transaction coordinator.
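As a sketch, the location domain is set per node in config.ini (the node ids
and hostnames below are made up for the example):

[ndbd]
NodeId=1
HostName=ndb1.example.com
LocationDomainId=1

[ndbd]
NodeId=2
HostName=ndb2.example.com
LocationDomainId=2

[mysqld]
NodeId=51
HostName=sql1.example.com
LocationDomainId=1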

We use this feature to direct reads to a replica that is contained
in the same availability domain.

This provides a much better throughput for read queries in MySQL Cluster
when the data nodes and MySQL servers span multiple availability domains.

In the figure below we see the setup: each sysbench application works
against one MySQL Server, and both of these are located in the same availability
domain. The MySQL Server works against a set of 3 replicas in the NDB data
nodes. Each of those 3 replicas resides in a different availability domain.

The graph above shows the difference between using location domain ids in
this setup compared to not using them. The missing measurements are absent
simply because there wasn't enough time to complete this particular
benchmark, but the measurements we have still show the improvement possible,
and the improvement is above 40%.

The bare metal servers used for the data nodes were DenseIO2 machines, and
the MySQL Servers used bare metal servers without any attached disks;
not even block storage is needed for the MySQL Server instances. The
MySQL Servers in an NDB setup are more or less stateless, since all the required
state is available in the NDB data nodes. Thus it is quite ok to start up a MySQL
Server from scratch all the time. The exception is when the MySQL Server
is used for replicating to another cluster, in which case the binlog state needs
to be persistent on the MySQL Server.

Monday, June 11, 2018

Impact of sharding on query performance in MySQL Cluster


A new week of blogs about our development in MySQL Cluster 7.6.
After working a long time on a set of new developments, there are a lot
of things to describe. I will continue this week with discussing sharding
and NDB, a new cloud feature in 7.6, and provide some benchmark
results on restart performance in 7.6 compared to 7.5. I am also planning
a comparative analysis for a few more versions of NDB.

In the blog series I have presented recently we have shown
the performance impact of various new features in MySQL Cluster
7.5 and 7.6. All these benchmarks were executed with tables that
used 1 partition. The idea behind this is that to build a
scalable application it is important to make it partition-aware.

A partition-aware application will ensure that all partitions
except one are pruned away from the query. Thus such queries get the same
performance as a query against a single-partition table.

Now in this blog we analyse the difference between using 1 partition
per table and using 8 partitions per table.

The execution difference is that with 8 partitions we have to
dive into the tree 8 times instead of once, and we have to
pay the startup cost of the scan for each partition as well. At the same time
using 8 partitions means that we get some amount of parallelism
in the query execution, and this speeds up query execution at
low concurrency.
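To make the comparison concrete, here is a sketch of the two table variants and
a partition-pruned query (table and column names are made up for the example):

CREATE TABLE t1_single (
  id INT NOT NULL,
  payload VARCHAR(100),
  PRIMARY KEY (id)
) ENGINE=NDB
  PARTITION BY KEY (id) PARTITIONS 1;

CREATE TABLE t1_multi (
  id INT NOT NULL,
  payload VARCHAR(100),
  PRIMARY KEY (id)
) ENGINE=NDB
  PARTITION BY KEY (id) PARTITIONS 8;

-- Partition-aware query: the equality condition on the partition key
-- lets NDB prune the scan to a single partition.
SELECT payload FROM t1_multi WHERE id = 17;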

Thus there are two main differences between single-partition scans
and multi-partition scans.

The first difference is that the parallelism decreases the latency
of query execution at low concurrency. More partitions means a higher
speedup.

The second difference is that the data node will spend more CPU to
execute the query for multi-partition scans compared to single-partition
scans.

Most of the benchmarks I have shown are limited by the cluster connection
used. Thus we haven't focused so much on the CPU usage in data nodes.

In the graph above the improvement of query speed is around 20% at
low concurrency. The performance difference at other concurrency levels
is small, but the multi-partition scans use more CPU. The multi-partition
scans are also a bit more variable in their throughput.

Tests where I focused more on data node performance showed around 10%
overhead for multi-partition scans compared to single-partition scans
in a similar setup.

An interesting observation is that although most of the applications
should be developed with partition-aware queries, those queries that
are not pruned to one partition will be automatically parallelised.

This is the advantage of the MySQL Cluster auto-sharded architecture.
In a sharded setup using any other DBMS it is necessary to ensure that
all queries are performed in only one shard since there is no automatic
handling of queries spanning many shards. This means that partition-aware
queries can be handled in only one data server, but the application has to
calculate where this data server resides. Cross-shard queries have to be
managed by the application itself, both sending queries in parallel to
many shards and merging the results from many shards.

With NDB all of this is automatic. If the query is partition-aware,
it will be automatically directed to the correct shard (node group
in NDB). If the query isn't partition-aware and thus a cross-shard
query, it is automatically parallelised. It is even possible to
push join queries down into the NDB data nodes to execute the
join queries using a parallel linked-join algorithm.

As we have shown in earlier blogs, and will show even more in coming
blogs, NDB using the Read Backup feature will ensure that read queries
are directed to a data node that is as local as possible to the MySQL
Server executing the query. This is true also for join queries pushed
down to the NDB data nodes.

Friday, June 08, 2018

Benchmark Read Backup Feature of NDB in the Oracle Cloud


The previous blog demonstrated the improvements from using the Read Backup
feature in NDB in a tightly connected on-premises installation. Now we will
show how the performance is impacted when running in the Oracle Cloud.

In both benchmarks we show here, the MySQL Server and the NDB data node
are colocated and we have one node group, where each combined data node
and MySQL Server is placed in its own availability domain. We use all
3 availability domains and thus we have 3 synchronous replicas of the
data in the database. This means that all communication between data nodes,
and all communication from the MySQL Server that doesn't use the local data
node, goes to another availability domain.

The figure below shows the mapping of processes to cloud machines. In this
experiment we used DenseIO2 Bare Metal Servers as machines for the
combined data node, MySQL Server and Sysbench application.
This machine has 52 CPU cores and a set of NVMe drives that provide
very fast local storage for the NDB data nodes; this ensures very speedy
recovery and in addition provides excellent hardware for placing some columns
in the disk data parts of NDB. The machine has 768 GByte of memory and
51.2 TByte of NVMe drives, thus providing highly available in-memory
storage of around 500 GByte and a few TBytes of disk data columns.



We used a standard Sysbench OLTP RO benchmark. This benchmark executes 16
SQL queries: 10 simple primary key lookups, 4 range scan queries that each get
100 rows from the database, and finally a BEGIN and a COMMIT statement.
To scale the Sysbench benchmark each Sysbench instance uses different tables
and different data. In a real setting there would likely be a load balancer
that directs connections to an application in one of the availability
domains. From this application the idea is to handle most traffic internally
in the same availability domain. My benchmark scripts ensure that the
different sysbench programs are started in a synchronous manner.

Thus the main difference when Sysbench is replaced by a real application
is that a load balancer sits in front of the application. Load balancers
are provided as a service in the Oracle Cloud. One could also use a
MySQL Router between the application and the MySQL Server. The JDBC
driver can handle failover between MySQL Servers, which would avoid this
extra network jump. As we will see in the benchmarks, each network jump
makes it harder to get the optimal performance out of the setup and it
increases the latency of application responses.

The graph at the top of the blog shows the results when running all
nodes colocated in the same server. There is some variation in the results
at low concurrency, but as the randomness of query placement evens out,
the improvement from using the Read Backup feature is stable at around 50%.
This improvement comes mostly from avoiding the latency of going over
to another availability domain for queries where the primary replica
is not in the same availability domain.

At extremely high concurrency the impact decreases, but at the same
time, when running with more than 2k concurrent connections the
response time is not so good anymore. At low concurrency each
transaction has a latency of around 2.5 milliseconds; at 2304
connections the latency has increased to 35 milliseconds. This is still
pretty good given that each transaction has 16 SQL queries to execute.

In the next benchmark we move the sysbench application to its own set
of machines. For the Sysbench application we used a VM2.16 instance
that has 16 CPU cores (thus 32 CPUs). The image below shows the
placement of the processes in the machines and availability domains
of the Oracle Cloud.



The graph below shows the performance numbers in this setup. In this
setup the difference between using Read Backup or not is smaller since
we introduce one more network hop: we have two network jumps between servers
in the same availability domain for each SQL query. This increases the
latency of each transaction by almost a factor of 2, and thus the difference
between using the Read Backup feature and not using it becomes smaller. In this
setup the difference is around 30%.


One thing to note about latency between servers in the Oracle Cloud, and
obviously in any cloud installation, is that the latency between servers
can differ. This is natural since the speed of light is a factor in the
latency, and thus the latency between servers in the cloud varies
depending on how far apart the availability domains are and on the placement
of the servers used. The latency numbers we experienced were a lot better
than the ones promised in the marketing material. Thus most applications will
be able to handle their latency requirements in the Oracle Cloud.

An interesting thing to note is that when running applications that use
a lot of network resources, it is important to configure the Linux networking
correctly. I had some issues with this in the last benchmark,
where the sysbench VMs could not deliver more than 40k TPS at first. After
searching around for the bottleneck I found it in the Linux interrupt handling
on the Sysbench VM: CPUs 30 and 31 were completely busy. After
issuing a few Linux commands, the performance immediately jumped up to
70k transactions when the soft interrupt work was moved to CPUs 0 through
15 using RPS in Linux. I described this in an earlier blog, and
my book on MySQL Cluster 7.5 also contains a chapter discussing those
configuration options in Linux.
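The exact commands are in that earlier blog; as a rough sketch of the idea,
RPS is enabled by writing a CPU mask to the receive queue of the NIC (the
interface name and the single receive queue here are assumptions, adjust to
the actual system):

# Spread receive softirq work over CPUs 0-15 (hex mask 0000ffff)
echo 0000ffff > /sys/class/net/ens3/queues/rx-0/rps_cpus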

In a later blog I will describe exactly what I did to set up those benchmarks.
This means that it will be possible for anyone with an Oracle Cloud account
to replicate these benchmarks. This is one of the cool features of a cloud
installation: it makes it possible to replicate any benchmark setup.

Thursday, June 07, 2018

Benchmarking the Read Backup feature in the NDB storage engine

Read Backup was a new feature in MySQL Cluster 7.5. When MySQL
Cluster 7.5 was released I was already busily engaged in working
on the partial LCP feature we have now released in 7.6, so I did not have
much time to produce benchmarks showing the impact of the
Read Backup feature.

Read Backup means that committed reads in NDB can use the backup
replicas as well. By default, reads of NDB tables are directed towards
the primary replica. The reason is that MySQL Cluster wants to
ensure that applications can trust that a reader sees its own
updates. Many modern NoSQL DBMSs lack this guarantee since they use
eventual replication and very flexible scheduling of which
replicas to read. NDB provides a stronger consistency guarantee
in that all applications can see their own updates and replication
is synchronous.

The reason that a read from a backup replica can fail to see its own
changes in NDB is that we release the locks on the primary replica first,
then we deliver the committed message to the application, and last
we release the locks on the backup replicas. This means that a
committed read (a read of the latest committed row without taking
locks) can only be guaranteed to see its own updates if it reads the
primary replica.

With tables that have the Read Backup feature we will delay the
sending of the committed message to the application until all rows
have been unlocked. This means that we can safely read any replica
for those tables and still see our own updates.

The Read Backup feature can be set for a table through a special syntax
in the COMMENT clause of the CREATE TABLE statement. A
much easier and likely more useful approach is to set the
ndb_read_backup configuration variable in the MySQL Server to 1. This
means that all tables created in this MySQL Server will have the
Read Backup feature. There is a similar variable for ensuring
that all tables are created with the fully replicated feature: in this
case the configuration variable ndb_fully_replicated is set to 1. In
MySQL Cluster 7.6 neither of these configuration variables is enabled by
default. But for SQL applications it is a good idea to always enable the
Read Backup feature, and for applications that focus on read scalability
with a fairly limited data size, the fully replicated feature can
also be enabled.
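As a sketch, the per-table syntax and the server-wide settings look like this
(the table definition is made up for the example):

CREATE TABLE t1 (
  id INT NOT NULL PRIMARY KEY,
  payload VARCHAR(100)
) ENGINE=NDB COMMENT="NDB_TABLE=READ_BACKUP=1";

In my.cnf under [mysqld], to get the same behaviour for all tables created by
this server (and, if wanted, fully replicated tables by default):

ndb_read_backup=1
ndb_fully_replicated=1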

Fully replicated tables have a replica in each data node and any replica
can be read for committed reads.

The benchmark we present here is performed in a setup with an optimal
network architecture. It uses two machines where the MySQL Server and the
NDB data node are colocated, and the nodes are connected with an
Infiniband network. This means that in almost every real case the
impact of using Read Backup in a colocated scenario is even bigger.

In this scenario the extra delay of going over the network is fairly small,
thus the impact at low concurrency is fairly small, but the extra overhead
of frequently going over the network shows its impact at higher concurrency
and grows all the way up to 30%.

MySQL Cluster 7.6 and the thread pool

Looking at the graphs in the previous blog post one can see that
MySQL Cluster 7.6 using the shared memory transporter can improve
performance at very high thread counts by more than 100%. Still,
the performance drops fairly significantly moving from
512 to 1536 threads. The MySQL Server using NDB
scales very well on all sorts of architectures and with very many
cores. But I have noted that when the number of connections goes
beyond some limit (in my benchmarks usually around 512 threads),
the performance starts to drop.

Actually, in the commercial version of MySQL Cluster help is available
to resolve this problem. The thread pool was developed by me and a team
of performance experts to ensure that MySQL using InnoDB would have
the same performance even with massive amounts of threads hitting the
MySQL Server. It still works for this purpose. I have never mentioned
the use of the thread pool for MySQL Cluster before, but the fact of the matter
is that it works perfectly fine to use the thread pool in combination
with MySQL Cluster.

There is one limitation in the current thread pool implementation: the maximum
number of thread groups is 64. This limit was set since MySQL didn't scale
beyond this number using InnoDB in those days. NDB is a distributed engine,
so it works a bit differently for NDB compared to InnoDB. It would be possible
to make the integration of the thread pool and NDB a bit tighter, but even with
the current implementation NDB can be used perfectly fine with the thread pool.
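As a sketch of how this is enabled (assuming the commercial thread pool plugin
that ships with MySQL Enterprise Edition), the plugin is loaded and sized in
my.cnf:

[mysqld]
# Load the thread pool plugin and use the maximum number of thread groups.
plugin-load-add=thread_pool.so
thread_pool_size=64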

The limit of 64 means that the thread pool won't really be so useful with
NDB in MySQL Servers that use more than 16 CPUs.
The graph above shows how the thread pool compares to the performance of
MySQL Cluster 7.6 on a small MySQL Server. It loses 1-2% at low thread
counts, but it continues to deliver good results even when passing the 512 thread limit.

The graph below shows how MySQL Cluster 7.6.6 using the thread pool compares to
MySQL Cluster 7.5.9 without the thread pool. We see the usual linear curve at
high concurrency for the thread pool; in this case, however, it is limited by
the 64 thread groups since the setup has access to 30 CPUs using
one cluster connection. I did some experiments where I moved the limit of 64
up a bit (a minor change). In most of those experiments a thread pool size of
128 was a good setting, and in that case the performance
actually increases a small bit as the number of threads increases.


Tuesday, June 05, 2018

Colocated MySQL Server and NDB data nodes


Historically the advice for MySQL Cluster has been to not colocate
the MySQL Server and the NDB data node for scalable applications.

There are still reasons to apply this principle in scalable setups
where the application isn't close to the data.

But with MySQL Cluster 7.6 we have added a number of reasons why it
makes sense to colocate the MySQL Server and the NDB data node.

Internally in the NDB development team we have had a debate about whether
to integrate the NDB data node inside the MySQL Server. The reason
for this is that the MySQL Server would be closer to the data. The
argument against is that the NDB data node and the MySQL Server are
designed with different recovery models. NDB data nodes are failfast:
as soon as we find a fault that is not supposed to happen, we
crash the data node. The MySQL Server on the other hand tries to
stay up as much as possible, since a crash of the MySQL Server brings
down data accessibility. In NDB we always expect another data
node to have a synchronous replica, thus data is accessible even in
the case of a crash.

With MySQL Cluster 7.6 we have gotten the best of both worlds. We
can now communicate from the MySQL Server to an NDB data node using
a shared memory transporter. This means that communication goes
entirely through the memory of the machine: the communication between
a thread in the MySQL Server and a thread in the NDB data node
goes through memory, and when a thread needs to wake up another thread a
mutex is used with a condition variable, exactly as in the MySQL
Server. Still, the NDB data node and the MySQL Server are separate
programs that can reside on machines independent of each other
and they can crash independently of each other.

So with the release of MySQL Cluster 7.6 it is possible to have
clusters with locality of reads. Already in MySQL Cluster 7.5 we
introduced the possibility to declare tables as readable from all
replicas (the Read Backup feature). In addition we
introduced fully replicated tables in MySQL Cluster 7.5.
For these fully replicated tables, access to the table is always local
to the data node we are in.

In MySQL Cluster 7.6 we are introducing a shared memory transporter
for efficient communication between a colocated MySQL Server and
an NDB data node. In addition we are introducing the possibility
to declare location domain ids for all nodes in the cluster. This
means that we can ensure that reads always stay local to the
Availability Domain in the Oracle Cloud (Availability Zone in Amazon
and Google clouds). Thus it is possible to design applications
without having to consider networking constraints as much as before
with NDB.

This means that we expect NDB to work very well in SQL applications.
We are also constantly working on improving the SQL performance of
NDB by supporting more and more pushdown of joins to the NDB data nodes.
We are working on improving the basic range scan mechanism in NDB and
on improving the interface between the NDB storage
engine and the MySQL Server. Finally we are also working hard to
integrate all the changes in MySQL 8.0 into MySQL Cluster as well.

I will describe a number of different scenarios for how to build
applications in the cloud using a setup where we have 3 data nodes,
one in each availability domain of the Oracle Cloud.

But in this blog and a few more blogs I will start by looking
specifically at how the shared memory transporter improves performance
of standard sysbench benchmarks.

In the previous blog I showed how we have improved the performance of
Sysbench OLTP RW even for the standard TCP transporter. This was
due to the use of a new wakeup thread and to locking the NDB API
receive thread to a CPU where it can work undisturbed. The receive
thread handles receipt of all messages from the NDB data nodes and
must be prioritised over the other MySQL Server threads; the best way
to achieve this is to use CPU locking. In the benchmarks we present in
this blog we always use this CPU locking.

In the figure above we show how the performance of a normal setup using
7.5.9 compares to 7.6.6 with the receive thread locked to a CPU, using
the TCP transporter. Next we have a curve that shows performance when
simply replacing the TCP transporter with a shared memory transporter.
Next we show a curve of what happens if we configure the shared memory
transporter to spin for a while before it goes to sleep.

The final curve shows the performance when also spinning in the TC
threads and the LDM threads in the NDB data node. Spinning in those
threads is not likely to be beneficial if those threads are not locked
to their own CPU core, thus in this case one should not use hyperthreading
for those threads.

The takeaways from the graph above are the following:

1) The shared memory transporter has similar performance at low
concurrency as the TCP transporter. As concurrency increases the
shared memory transporter has better performance; the improvement
is 10% at top performance and more than 40% at very high concurrency.

2) Using spinning in the configuration of the shared memory transporter
improves performance at low concurrency significantly, by more than
20%. Top performance is similar to not using spinning, but it is
easier to get to this top performance.

3) Using spinning in the TC threads and LDM threads improves performance
even more at low concurrency. Performance increases by more than 30% at
low concurrency compared to no spinning, and by 10% compared to spinning
only in the transporter. Performance at high concurrency is similar for all
variants using the shared memory transporter. So spinning helps the
MySQL Server reach high performance levels at lower concurrency.

We have added a graph below where we highlight the performance at 1 and 2
threads, since it is difficult to see those differences in the first
figure.

Configuring NDB to use a shared memory transporter is easy; the easiest
way is to simply set the new configuration variable UseShm to 1 on the
NDB data nodes. With this setting we create a shared memory transporter
between this node and any API node that shares the same hostname with
the data node. It is also possible to create a separate shared
memory section that describes the transporter setup between two specific
nodes in the cluster.

The spintime for the shared memory transporter is easiest to set up in the
default shared memory transporter section. The spintime for TC and LDM threads
in the NDB data nodes is configured using the ThreadConfig variable on the
NDB data nodes.
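As a rough sketch of the configuration (the parameter name for the shared
memory spintime and the numeric values are from my recollection, so treat
them as assumptions rather than exact settings):

[ndbd default]
UseShm=1
# spintime is given per thread type inside ThreadConfig (microseconds)
ThreadConfig=ldm={count=8,spintime=500},tc={count=4,spintime=500}

[shm default]
# Spin in the shared memory transporter before going to sleep
ShmSpinTime=200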

Since we are using a mutex and condition variables in shared memory, we only
support the shared memory transporter on Linux at the moment.

The conclusion is that using the shared memory transporter we can improve
performance at low concurrency by more than 30%, we can improve throughput
by 20%, and at very high concurrency (1536 threads) we get about 100%
improvement, all compared to the results using 7.5.9.

In the graph below we show only the 7.5.9 curve and compare it to the curve
achieved with all improvements in 7.6.


Monday, June 04, 2018

Improvements from NDB wakeup threads


In MySQL Cluster 7.6 we introduced a new thread type in the NDB API.
Traditionally each cluster connection has one send thread to assist in sending
messages to NDB and a receive thread that can assist in receiving messages
from NDB. There is also a connection thread that listens for new connections
and connects to the NDB data nodes.

There is also a set of user threads that are created by the application
that uses the NDB cluster connection.

Most of the sending is done by the user threads; the NDB API send
thread is only used when we are sending faster than the network
is able to handle.

We can process receive messages in the user threads or in the NDB
API receive thread. The default behaviour is to use the user threads
until the concurrency is higher than 8 threads working at the same
time. So at the highest concurrency it is the receive thread that
handles the signals and at low concurrency it is handled directly
by the NDB API user threads. The 8 is configurable through the
MySQL server variable ndb_recv_thread_activation_threshold.
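For example, to let the receive thread take over already at 4 concurrently
active user threads, one could set in my.cnf (illustrative value):

[mysqld]
ndb_recv_thread_activation_threshold=4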

Receiving in the user threads is slightly faster when only one thread is
active; it gives 3-4% better response time in this particular case. However,
as more and more threads are sending data to the NDB data nodes, the
efficiency of using the NDB API receive thread increases.

One problem in using the NDB API receive thread is that it is responsible
both for receiving the messages from the NDB data nodes and for waking up the
NDB API user threads. At low load this is not an issue. But when the
load on the NDB API receive thread reaches 90% and beyond, this becomes
an issue.

To avoid this problem we added a new thread to the NDB API in MySQL Cluster 7.6:
the wakeup thread. This thread has only one duty, which is to wake up
other threads. We experimented with a number of different variants to see which
ensured that user threads are woken up as quickly as possible.

Our conclusion was that at low load the optimal is that the receive thread
handles the wakeup, but at very high load it requires assistance from one
wakeup thread. As load increases the receive thread will handle less and less
wakeups. At 99-100% load the receive thread will more or less offload all
wakeup calls to the wakeup thread.

In the figure above we show a normal Sysbench OLTP RW experiment
comparing 7.5.9 with 7.6.6. As can be seen, there is no difference until
we reach 32 connections. As we start to offload a subset of the wakeups
to the wakeup thread, we improve the performance of the application.

The throughput increases 5% due to this new feature, and with even more
threads the performance drops more slowly, such that we gain 15-20% more
performance at 512 connections.

The best performance is normally achieved by using the NDB API
receive thread and locking this thread to a specific CPU.
When starting the MySQL Server one specifies these CPUs in the
configuration parameter ndb_recv_thread_cpu_mask. If the MySQL
Server uses several NDB cluster connections, the parameter
should specify one CPU per cluster connection.

When locking the NDB API receive thread to a CPU, it is important to
also lock the MySQL Server process to other CPUs, and if other processes
are running on the same machine, these also need to be locked to
CPUs that do not interfere with the NDB API receive thread.
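A minimal sketch of such a setup in my.cnf (the mask value is illustrative and
denotes CPU 0; the MySQL Server process itself would then be bound to the
remaining CPUs with a tool such as taskset or numactl):

[mysqld]
ndb_recv_thread_cpu_mask=1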

The figures above show the improvement when one CPU
core is locked to handle the NDB API receive thread. Locking the receive
thread to a CPU adds another 5% to the total throughput and up to
20% more at high thread counts.

So what we have achieved with MySQL Cluster 7.6 is that we can increase
the throughput by at least 10% and performance at high thread counts
can increase by as much as 40%. All these numbers are still using the
TCP transporter. In a coming blog we will show how these numbers increase
even more when using the shared memory transporter. In addition we will
show how using the thread pool with NDB can even further increase stability
of high throughputs at high thread counts.

The above experiment was always done with one data node using 8 LDM
threads, with the data node locked to CPUs within one CPU socket. The
MySQL Server is locked to 30 CPUs (15 CPU cores). In all cases
the bottleneck is that we only use one cluster connection. In 7.5.9 this
cluster connection scales to about 18 CPUs and with 7.6.6 it scales to
more than 20 CPUs. So using one cluster connection per 8 CPU cores is
usually appropriate.

Saturday, June 02, 2018

MySQL Cluster 7.6 future proof

MySQL Cluster 7.6 is designed to improve the restart times
for database sizes that MySQL Cluster 7.5 and earlier versions
support.

At the same time MySQL Cluster 7.6 is preparing for the innovations
in HW architecture. Between 2008 and 2012 I was heavily involved
in handling the previous change in HW architecture: the introduction
of multi-core architectures. Between 2008 and
2012 we scaled the MySQL Server from 4 CPUs to 64 CPUs.
The NDB data nodes were scaled from 2 CPUs to more than 50 CPUs in
the same timeframe.

The next major shift in the HW architecture is the introduction of
persistent memory; this means that we will get persistent memory
accessible at the same level as DRAM. We don't yet know all the
characteristics we will see for those persistent memories, but a
guesstimate of what to expect is:

1) About 10x more memory per DIMM
2) About 4x cheaper memory
3) About 10x slower access to the memory compared to DRAM
4) Memory will be persistent and survive a restart of the machine

As an example of this development Intel announced Optane persistent memory
to be fully available in 2019. These memory DIMMs will be available in 512 GB
DIMMs. A modern 2-socket server of today comes equipped with about 512 GB
memory. A high-end server of 2019 will be able to be shipped with 6 TByte of
persistent memory and on top of that also 400 GByte of DRAM memory.

MySQL Cluster 7.5 and earlier versions are a good fit for the modern servers
of today. MySQL Cluster 7.6 brings a much improved recovery architecture that
improves restart times by 4x with current database sizes.

At the same time MySQL Cluster 7.5 won't work very well on machines with
6 TByte of memory. This is due to the use of full checkpoints, where each
checkpoint has to write 6 TByte to disk.

Actually, almost every in-memory DBMS has the same issue, so all in-memory
DBMSs have to adapt to this new reality. MySQL Cluster leads the way here
by introducing partial checkpoints in MySQL Cluster 7.6. Even disk-based
DBMSs will have a fair number of issues to handle around checkpointing
when the page cache grows to multi-TByte sizes.

During development of partial checkpoints I analyzed the difference between
the method implemented in MySQL Cluster 7.6 and using a page cache. The
method used in MySQL Cluster 7.6 needed to write 100x less data to disk as
part of checkpoints.

To give you a feeling for the impact of the checkpointing times in MySQL
Cluster I will describe what will happen with full checkpoints using
a 6 TByte database size.

Assume that we write 100 MByte per second to disk for checkpoints.
In this case it will take 60,000 seconds to perform a checkpoint. This
means 16 hours and 40 minutes.

Now assume we perform a checkpoint in MySQL Cluster 7.6. In 7.6 we only
need to checkpoint those partitions that have changed any data. We assume
that half of the partitions in the database haven't changed since the last
checkpoint. We assume that we have a fair amount of updates, but since
checkpoints happen with intervals of around a minute, it means that only
a small portion of the 6 TByte will be updated. A partial checkpoint will
always checkpoint at least one part in 2048. This means that the minimum
size of a checkpoint in this scenario would be 1.5 GByte. Thus the
checkpoint will take 15 seconds in MySQL Cluster 7.6.

This means that the checkpoint time has decreased by a factor of 4000x
in this particular case.

The factor 2048 was chosen to ensure that we can maintain very short
checkpoint times all the way up to memories of 16 TByte, and even beyond this
it will still function very well.

Thus MySQL Cluster 7.6 is already prepared for the next generation of
HW architectures arriving in 2019.

As part of a node restart we perform a checkpoint and have to wait for the
previous checkpoint to complete. From this we can deduce that the improvement
of restart times is even bigger as we go towards bigger memories.

Friday, June 01, 2018

MySQL Cluster 7.6 in numbers

MySQL Cluster 7.6 is now released as GA. 7.6 contains a lot of interesting changes.
Here is a list of changes in numbers in the recent MySQL Cluster versions.

1) With the new changes to our checkpointing scheme the restart times in MySQL
Cluster are 3-4x faster for a database size of around 60 GByte.

2) The changes we have made to our parallel query execution have improved
some benchmark queries we have tested by up to a factor of 2.

3) A new wakeup thread in the NDB API means that each API node can deliver 10%
more throughput.

4) 7.6.6 introduces a new shared memory transporter to communicate between
colocated MySQL Servers and NDB data nodes. This improves throughput by 10% and
can improve throughput at low thread counts by as much as 60%.

5) Benchmarks using the Read Backup feature (released in 7.5) show that
performance increases by around 50% even in a network with very high bandwidth.

6) A benchmark in the Oracle Cloud using the new feature to ensure locality of
reads within the same availability domain shows 50% improved throughput.

7) Parallelisation of the UNDO log applier speeds up this phase of the NDB recovery
by a factor of 5.

8) We have changed the defaults for index builds during restarts, which can speed
up the index build phase of the restart by a factor of 2.

9) We have changed the default batch sizes for index builds during restore of a
backup, which can speed up the restore by at least a factor of 2.

As these numbers reflect, MySQL Cluster 7.6 has significantly improved
availability by decreasing the amount of time needed to restart a node. At the
same time we have also significantly improved the performance of SQL
applications on top of MySQL Cluster using the Read Backup feature, the Fully
Replicated feature, a new shared memory transporter, a new feature for locality
of reads in a cloud environment and finally improved throughput on each API node.

I will present those results in detail in the coming weeks. Some of those tests
were performed in the Oracle Cloud, which means that anyone can reproduce the
exact same tests themselves, which I think is a very nice feature of the
cloud.