Wednesday, June 13, 2018

Node restart improvements in MySQL Cluster 7.6

A simple test of how much faster restarts are in 7.6 compared to 7.5 is to
load a set of DBT2 warehouses and then perform a node restart of one of the
data nodes.

In this test we load 600 warehouses, which means about 60 GByte of data inserted
into NDB in each data node.

A node restart of 7.5.10 using this data set takes 19 minutes and 11 seconds.
In 7.6.6 this restart takes 6 minutes and 49 seconds.

The difference can vary, especially since the restart time in 7.5 has a fairly high
variance. The restart time in 7.6.6 has much smaller variance.

We will look into the different restart phases and see how those are affected.

The first phase is to allocate the memory and to ensure that the memory is
actually assigned to our process by touching it. This phase allocates roughly
2.7 GByte per second, so in this case it takes 26 seconds in both 7.5 and 7.6.

After that we have some preparatory steps. These are normally very quick and
in this case take about 1-2 seconds. These steps involve preparing the metadata
to use in the restart.

The next step is the restore phase where we recover data from a local checkpoint.
With the introduction of Partial LCP in 7.6 this phase actually takes longer,
since there is some duplication in the restore where one row is restored multiple
times. This phase can be made faster by setting RecoveryWork to a smaller
number, but that instead increases the checkpointing work during normal
operation.
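
This trade-off is controlled from the data node configuration. As a hedged
illustration (RecoveryWork is the actual 7.6 parameter discussed here, but the
value shown is just an example, not a recommendation):

  [ndbd default]
  # Lower values mean less duplicated restore work at recovery, at the cost of
  # more checkpoint work during normal operation. 7.6 accepts values 25-100.
  RecoveryWork=50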

In this test 7.5 took 1 minute and 51 seconds to restore and 7.6.6 used 2 minutes
and 15 seconds.

The next phase is the REDO log execution phase. Since Partial LCP means that
LCPs are shorter, this phase is also shortened. In this particular benchmark it
doesn't shorten as much as it could, since there is heavy insert activity
right before the restart. But it still decreased from 51 seconds in 7.5 to 35
seconds in 7.6.

After this we execute the UNDO log. This particular benchmark has no disk data,
so both 7.5 and 7.6 spend no time in this phase. However, this phase has been
sped up by a factor of 5 with 4 LDM threads, and this benchmark uses 8 LDM
threads, so this phase is dramatically faster in 7.6 when it is used.

The next phase is the rebuild of the ordered indexes. This phase executes the
same code as in 7.5, but we have changed the default configuration to ensure
that the rebuild is fully parallelised. This means that we have 16 CPUs
working on the rebuild instead of 8 CPUs as in 7.5. This gives a rebuild time
of 1 minute and 17 seconds in 7.6.6 compared to 2 minutes and 4 seconds in 7.5.
The reason the gain isn't a full 2x is that we make use of hyperthreading to
speed things up.

The copy fragment phase is more or less empty in both 7.5 and 7.6 since we
didn't perform any updates during the restart. We don't expect any major
differences in this phase between 7.5 and 7.6.

Next we come to the biggest gain in restart times in 7.6. This is the phase
where we wait for the local checkpoint to complete. In 7.5 we have to wait
between 1 and 2 checkpoint times. In this benchmark a 7.5 checkpoint takes
about 11 minutes, even though we have increased the disk write speed compared
to the default. In this execution 7.5 spent 13 minutes and 48 seconds in this
phase.

In 7.6 we execute a checkpoint that is local to the starting node. This takes
2 minutes and 4 seconds. Finally we participate in a checkpoint where all
nodes participate. This is now extremely fast, it takes only 4 seconds since
no activity is ongoing. So the total time for this phase is 2 minutes and
8 seconds.

This wait-for-checkpoint phase is fairly constant in time in 7.6. In 7.5 it
grows with a growing database size and depends on the configured disk write
speed. Thus the gain in restart time in 7.6 is somewhat variable, but this
experiment has been done with fairly conservative numbers.

The final restart phase is handing over responsibility for event handling
between the connected MySQL Servers. This phase takes 5-6 seconds in both
7.5 and 7.6. It can take longer if some MySQL Server is down during the
restart.

As can be seen we have significantly brought down the restart times for a
database size that is common in 7.5; with larger sizes the difference is much
bigger.

Tuesday, June 12, 2018

Benchmark of new cloud feature in MySQL Cluster 7.6


In previous blogs we have shown how MySQL Cluster can use the Read Backup
feature to improve performance when the MySQL Server and the NDB data
node are colocated.

There are two scenarios in a cloud setup where additional measures are
needed to ensure localized read accesses even when using the Read Backup
feature.

The first scenario is when data nodes and MySQL Servers are not colocated.
In this case by default we have no notion of closeness between nodes in
the cluster.

The second case is when we have multiple node groups and use colocated
data nodes and MySQL Servers. In this case we have a notion of closeness
to the data in the node group we are colocated with, but not to other
node groups.

In a cloud setup the closeness is dependent on whether two nodes are in
the same availability domain (availability zone in Amazon/Google) or not.
In your own network other scenarios could exist.

In MySQL Cluster 7.6 we added a new feature where it is possible
to configure nodes as belonging to a certain location domain.
Nodes that are close to each other should be configured to be part of
the same location domain. Nodes belonging to different location domains
are always considered to be further away than nodes in the same
location domain.
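
As a minimal sketch of how this looks in practice (LocationDomainId is the
actual 7.6 configuration parameter; node ids, hostnames and domain numbers
below are invented for illustration), each node in config.ini is tagged with
the availability domain it runs in:

  [ndbd]
  NodeId=1
  HostName=dn-ad1.example.com
  LocationDomainId=1

  [ndbd]
  NodeId=2
  HostName=dn-ad2.example.com
  LocationDomainId=2

  [mysqld]
  NodeId=51
  HostName=mysql-ad1.example.com
  LocationDomainId=1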

We will use this knowledge to always use a transaction coordinator placed
in the same location domain and if possible we will always read from a
replica placed in the same location domain as the transaction coordinator.

We use this feature to direct reads to a replica that is contained
in the same availability domain.

This provides a much better throughput for read queries in MySQL Cluster
when the data nodes and MySQL servers span multiple availability domains.

In the figure below we see the setup: each sysbench application works against
one MySQL Server, and both of these are located in the same availability
domain. The MySQL Server works against a set of 3 replicas in the NDB data
nodes. Each of those 3 replicas resides in a different availability domain.

The graph above shows the difference between using location domain ids in
this setup and not using them. The missing measurements are absent simply
because there wasn't enough time to complete this particular benchmark, but
the measurements we have still show the possible improvement, and the
improvement is above 40%.

The Bare Metal Server used for the data nodes was the DenseIO2 machine, and
the MySQL Server used a bare metal server without any attached disks; not even
block storage is needed for the MySQL Server instances. The MySQL Servers in
an NDB setup are more or less stateless, all the required state is available
in the NDB data nodes. Thus it is quite ok to start up a MySQL Server from
scratch at any time. The exception is when the MySQL Server is used for
replicating to another cluster, in which case the binlog state is required
to be persistent on the MySQL Server.

Monday, June 11, 2018

Impact of sharding on query performance in MySQL Cluster


A new week of blogs about our development in MySQL Cluster 7.6.
After working a long time on a set of new developments, there are a lot
of things to describe. I will continue this week by discussing sharding
and NDB, a new cloud feature in 7.6, and by providing some benchmark
results on restart performance in 7.6 compared to 7.5. I am also planning
a comparative analysis for a few more versions of NDB.

In the blog series I have presented recently we have shown
the performance impact of various new features in MySQL Cluster
7.5 and 7.6. All those benchmarks were executed with tables that
used 1 partition. The idea behind this is that to develop a
scalable application it is important to develop partition-aware
applications.

A partition-aware application will ensure that all partitions
except one are pruned away from the query. Thus such queries get the same
performance as a query on a single-partition table.

In this blog we analyse the difference between using 1 partition
per table and using 8 partitions per table.
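
To make the comparison concrete, here is a hedged sketch of the two table
variants and a partition-aware query (the table and column names are invented
for illustration; PARTITION BY KEY is the partitioning type NDB supports):

  CREATE TABLE account_1p (
    id BIGINT PRIMARY KEY,
    balance DECIMAL(12,2)
  ) ENGINE=NDBCLUSTER
    PARTITION BY KEY(id) PARTITIONS 1;

  CREATE TABLE account_8p (
    id BIGINT PRIMARY KEY,
    balance DECIMAL(12,2)
  ) ENGINE=NDBCLUSTER
    PARTITION BY KEY(id) PARTITIONS 8;

  -- Partition-aware query: the WHERE clause contains the full partition key,
  -- so all partitions except one are pruned away.
  SELECT balance FROM account_8p WHERE id = 12345;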

The execution difference is that with 8 partitions we have to
dive into the index tree 8 times instead of once, and we have to
take the scan startup cost for each of them. At the same time,
using 8 partitions means that we get some amount of parallelism
in the query execution, and this speeds up query execution at
low concurrency.

Thus there are two main differences between single-partition scans
and multi-partition scans.

The first difference is that the parallelism decreases the latency
of query execution at low concurrency. More partitions means a higher
speedup.

The second difference is that the data node will spend more CPU to
execute the query for multi-partition scans compared to single-partition
scans.

Most of the benchmarks I have shown are limited by the cluster connection
used. Thus we haven't focused so much on the CPU usage in data nodes.

Thus in the graph above the improvement in query speed is around 20% at
low concurrency. The performance difference at other concurrency levels
is small, but the multi-partition scans use more CPU. The multi-partition
scans are also a bit more variable in their throughput.

Tests where I focused more on data node performance showed around 10%
overhead for multi-partition scans compared to single-partition scans
in a similar setup.

An interesting observation is that although most of the applications
should be developed with partition-aware queries, those queries that
are not pruned to one partition will be automatically parallelised.

This is the advantage of the MySQL Cluster auto-sharded architecture.
In a sharded setup using any other DBMS it is necessary to ensure that
all queries are performed in only one shard since there are no automatic
queries over many shards. This means that partition-aware queries can
be handled by only one data server, but the application has to
calculate where this data server resides. Cross-shard queries, though,
have to be managed by the application itself, both sending queries in
parallel to many shards and merging the results from many shards.

With NDB all of this is automatic. If the query is partition-aware,
it will be automatically directed to the correct shard (node group
in NDB). If the query isn't partition-aware and thus a cross-shard
query, it is automatically parallelised. It is even possible to
push join queries down into the NDB data nodes to execute the
join queries using a parallel linked-join algorithm.

As we have shown in earlier blogs and will show even more in coming
blogs NDB using the Read Backup feature will ensure that read queries
are directed to a data node that is as local as possible to the MySQL
Server executing the query. This is true also for join queries being pushed
down to the NDB data nodes.

Friday, June 08, 2018

Benchmark Read Backup Feature of NDB in the Oracle Cloud


The previous blog demonstrated the improvements from using the Read Backup
feature in NDB in a tightly connected on-premise installation. Now we will
show how performance is impacted when running in the Oracle Cloud.

In both benchmarks shown here the MySQL Server and the NDB data node
are colocated, and we have one node group where each combined data node
and MySQL Server is placed in its own availability domain. We use all
3 availability domains and thus we have 3 synchronous replicas of the
data in the database. This means that all communication between data nodes,
and all communication from the MySQL Server that does not use the local
data node, goes to another availability domain.

The figure below shows the mapping of processes to cloud machines. In this
experiment we used the DenseIO2 Bare Metal Server as the machine for the
combined data node, MySQL Server and Sysbench application.
This machine has 52 CPU cores and a set of NVMe drives that provide
very fast local storage for the NDB data nodes; this ensures very speedy
recovery and in addition provides excellent hardware for placing some columns
in the disk data parts of NDB. The machine has 768 GByte of memory and
51.2 TByte of NVMe drives, thus providing a highly available in-memory
storage of around 500 GByte and a few TBytes of disk data columns.



We used a standard Sysbench OLTP RO benchmark. This benchmark executes 16
SQL queries: 10 simple primary key lookups, 4 range scan queries that each
fetch 100 rows from the database, and finally a BEGIN and a COMMIT statement.
To scale the Sysbench benchmark each Sysbench instance uses different tables
and different data. In a real setting there would likely be a load balancer
that directs connections to an application in one of the availability
domains. From this application the idea is to handle most traffic internally
in the same availability domain. My benchmark scripts ensure that the
different sysbench programs are started in a synchronous manner.

Thus the main difference when Sysbench is replaced by a real application
is that a load balancer sits in front of the application. Load balancers
are provided as a service in the Oracle Cloud. One could also use a
MySQL Router between the application and the MySQL Server. The JDBC
driver can handle failover between MySQL Servers, which would avoid this
extra network jump. As we will see in the benchmarks, each network jump
makes it harder to get optimal performance out of the setup and it
increases the latency of application responses.

The graph at the top of the blog shows the results when running all
nodes colocated in the same server. There is some variation in results
at low concurrency, but as the randomness of query placement evens out,
the improvement from using the Read Backup feature is stable at around 50%.
This improvement comes mostly from avoiding the latency of going over
to another availability domain for queries where the primary replica
is not in the same availability domain.

At extremely high concurrency the impact decreases, but at the same
time, when running with more than 2k concurrent connections, the
response time is not so good anymore; at low concurrency each
transaction has a latency of around 2.5 milliseconds, while at 2304
connections the latency has increased to 35 milliseconds. This is still
pretty good given that each transaction has 16 SQL queries to execute.

In the next benchmark we move the sysbench application to its own set
of machines. For the Sysbench application we used a VM2.16 instance
that has 16 CPU cores (thus 32 CPUs). The image below shows the
placement of the processes in the machines and availability domains
of the Oracle Cloud.



The graph below shows the performance numbers in this setup. Here the
difference between using Read Backup or not is smaller, since we introduce
one more network jump: there are two network jumps between servers
in the same availability domain for each SQL query. This increases the
latency of each transaction by almost a factor of 2. Thus the difference
between using the Read Backup feature and not using it becomes smaller;
in this setup the difference is around 30%.


One thing to note about latency between servers in the Oracle Cloud, and
obviously in any cloud installation, is that the latency between servers
can differ. This is natural since the speed of light is a factor in the
latency, and thus the latency between servers in the cloud varies
depending on how far apart the availability domains are and on the placement
of the servers used. The latency numbers we experienced were a lot better,
though, than the ones promised in the marketing. Thus most applications will
be able to meet their latency requirements in the Oracle Cloud.

An interesting thing to note is that when running applications that use
a lot of network resources, it is important to configure the Linux networking
correctly. Interestingly, I had some issues with this in the last benchmark,
where the sysbench VMs could not deliver more than 40k TPS at first. After
searching around for the bottleneck I found it in the Linux interrupt handling
on the Sysbench VM: CPU 30 and 31 were completely busy. I was able to
issue a few Linux commands, and performance immediately jumped up to
70k transactions when the soft interrupt part was moved to CPU 0 through
CPU 15 using RPS in Linux. I described this in an earlier blog and
my book on MySQL Cluster 7.5 also contains a chapter discussing those
configuration options in Linux.
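
As a hedged sketch of the kind of command involved (the sysfs interface for
RPS is standard Linux; the device name, queue and CPU mask are invented for
illustration and must match your own NIC and CPU layout):

  # 0xffff = CPU 0 through CPU 15; adjust mask, device and queue to your system
  echo ffff > /sys/class/net/ens3/queues/rx-0/rps_cpus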

In a later blog I will describe exactly what I did to set up those benchmarks.
This means that it will be possible for anyone with an Oracle Cloud account
to replicate these benchmarks. This is one of the cool features of a cloud
installation: it makes it possible to replicate any benchmark setup.

Thursday, June 07, 2018

Benchmarking the Read Backup feature in the NDB storage engine

Read Backup was a new feature in MySQL Cluster 7.5. When MySQL
Cluster 7.5 was released I was already busily engaged in working
on the Partial LCP feature we have now released in 7.6, so I did not
have much time to produce benchmarks showing the impact of the
Read Backup feature.

Read Backup means that committed reads in NDB can use the backup
replicas as well. Without it, reads on NDB tables are directed towards
the primary replica. The reason is that MySQL Cluster wants to
ensure that applications can trust that a reader sees its own
updates. Many modern NoSQL DBMSs lack this guarantee since they
use eventual replication and very flexible scheduling of which
replicas to read. NDB provides a stronger consistency guarantee
in that all applications can see their own updates and replication
is synchronous.

The reason that reading a backup replica can fail to see its own
changes in NDB is that we release the locks on the primary replica first,
next we deliver the committed message to the application and last
we release the locks on the backup replicas. This means that a
committed read (which reads the latest committed row without locks)
can only be guaranteed to see its own updates if it reads the
primary replica.

With tables that have the Read Backup feature we will delay the
sending of the committed message to the application until all rows
have been unlocked. This means that we can safely read any replica
for those tables and still see our own updates.

The Read Backup feature for a table can either be set through special
syntax in the COMMENT clause of the CREATE TABLE statement, or, a much
easier and likely more useful approach, by setting the ndb_read_backup
configuration variable in the MySQL Server to 1. The latter means that
all tables created from this MySQL Server will get the Read Backup
feature. There is a similar variable for ensuring that all tables are
created as fully replicated tables; in this case the configuration
variable ndb_fully_replicated is set to 1. In MySQL Cluster 7.6 neither
of these configuration variables is enabled by default. But for SQL
applications it is a good idea to always enable the Read Backup feature,
and for applications that focus on read scalability with a fairly limited
data size, the fully replicated feature can also be enabled.

Fully replicated tables have a replica in each data node and any replica
can be read for committed reads.
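
As a hedged sketch (the COMMENT syntax and the two server variables are the
mechanisms described above; the table definition itself is invented for
illustration), the two ways of enabling Read Backup look roughly like this:

  -- Per table, via the NDB_TABLE options in the table comment
  CREATE TABLE t1 (
    id INT PRIMARY KEY,
    val VARCHAR(100)
  ) ENGINE=NDBCLUSTER
    COMMENT='NDB_TABLE=READ_BACKUP=1';

and, for all tables created from a MySQL Server, in my.cnf:

  [mysqld]
  ndb_read_backup=1
  # optionally, for fully replicated tables:
  # ndb_fully_replicated=1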

The benchmark we present here was performed in a setup with the optimal
network architecture: two machines where the MySQL Server and the
NDB data node are colocated and the network between the nodes
uses Infiniband. This means that in almost every real case the
impact of using Read Backup in a colocated scenario is even bigger.

In this scenario the extra delay of going over the network is fairly small,
thus the impact at low concurrency is fairly small, but the extra overhead
of going over the network a lot shows its impact at higher concurrency and
grows all the way up to 30%.

MySQL Cluster 7.6 and the thread pool

Looking at the graphs in the previous blog post one can see that
MySQL Cluster 7.6 using the shared memory transporter can improve
performance at very high thread counts by more than 100%. Still,
performance drops fairly significantly when moving from
512 to 1536 threads. The MySQL Server using the NDB storage engine
scales very well on all sorts of architectures and with very many
cores. But I have noted that when the number of connections goes
beyond some limit (in my benchmarks usually around 512 threads),
the performance starts to drop.

Actually, in the commercial version of MySQL Cluster help is available
to resolve this problem. The thread pool was developed by me and a team
of performance experts to ensure that MySQL using InnoDB would deliver
the same performance even with massive numbers of threads hitting the
MySQL Server. It still works for this purpose. I have never mentioned
the use of the thread pool for MySQL Cluster before, but the fact of the
matter is that it works perfectly fine to use the thread pool in
combination with MySQL Cluster.

There is one limitation in the current thread pool implementation: the maximum
number of thread groups is 64. This limit was set since MySQL didn't scale
beyond this number using InnoDB in those days. NDB is a distributed engine,
so it works a bit differently compared to InnoDB. It would be possible
to make the integration of the thread pool and NDB a bit tighter, but even with
the current implementation NDB can be used perfectly fine with the thread pool.
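
For reference, a hedged sketch of how the commercial thread pool plugin is
typically enabled in my.cnf (whether the plugin is available, and its exact
file name, depends on your MySQL build, so treat this as an assumption to
verify against your distribution):

  [mysqld]
  plugin-load-add=thread_pool.so
  # One thread group per CPU is the usual starting point; the current
  # implementation caps the number of thread groups at 64.
  thread_pool_size=32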

The limit of 64 means that it won't really be so useful to use the thread pool
and NDB in combination with MySQL servers that use more than 16 CPUs.
The graph above shows how the thread pool compares to the performance of
MySQL Cluster 7.6 on a small MySQL Server. It loses 1-2% at low thread
counts, but it continues to deliver good results even when passing the 512
thread limit.

The graph below shows how MySQL Cluster 7.6.6 using the thread pool compares to
MySQL Cluster 7.5.9 without the thread pool. We see the usual linear curve at
high concurrency for the thread pool; in this case however it is limited by
the 64 thread groups, since the setup in this case has access to 30 CPUs using
one cluster connection. I did some experiments where I moved the limit of 64
up a bit (a minor change). For most experiments a thread pool size of 128 was
a good setting, and in that case the performance actually increases a small
bit as the number of threads increases.


Tuesday, June 05, 2018

Colocated MySQL Server and NDB data nodes


Historically the advice for MySQL Cluster has been not to colocate
the MySQL Server and the NDB data node for scalable applications.

There are still reasons to apply this principle in scalable setups
where the application isn't close to the data.

But with MySQL Cluster 7.6 we have added a number of reasons why it
makes sense to colocate the MySQL Server and the NDB data node.

Internally in the NDB development team we have had a debate about whether
to integrate the NDB data node inside the MySQL Server. The reason
for doing so is that the MySQL Server would be closer to the data. The
argument against is that the NDB data node and the MySQL Server are
designed with different recovery models. NDB data nodes are fail-fast:
as soon as we find a fault that is not supposed to happen we
crash the data node. The MySQL Server on the other hand tries to
stay up as much as possible, since a crash of the MySQL Server brings
down data accessibility. In NDB we always expect another data
node to have a synchronous replica, thus data is accessible even in
the case of a crash.

With MySQL Cluster 7.6 we have gotten the best of both worlds. We
can now communicate from the MySQL Server to an NDB data node using
a shared memory transporter. This means that communication goes
entirely through the memory of the machine: the communication between
a thread in the MySQL Server and a thread in the NDB data node
goes through memory, and when a thread needs to wake up another thread
a mutex is used with a condition variable, exactly as in the MySQL
Server. Still, the NDB data node and the MySQL Server are separate
programs that can reside on machines independent of each other
and they can crash independently of each other.

So with the release of MySQL Cluster 7.6 it is possible to have
clusters with locality of reads. Already in MySQL Cluster 7.5 we
introduced the possibility to declare tables as readable from all
replicas (the Read Backup feature). In addition we introduced
fully replicated tables in MySQL Cluster 7.5. For fully replicated
tables, access to the table is always local to the data node we
are in.

In MySQL Cluster 7.6 we are introducing a shared memory transporter
for efficient communication between a colocated MySQL Server and
an NDB data node. In addition we are introducing the possibility
to declare location domain ids for all nodes in the cluster. This
means that we can ensure that reads always stays local to the
Availability Domain in an Oracle Cloud (Availability Zone in Amazon
and Google clouds). Thus it is possible to design applications
without having to consider networking constraints as much as before
with NDB.

This means that we expect NDB to work very well in SQL applications.
We are also constantly working on improving the SQL performance of
NDB by supporting more and more push down of joins to the NDB data nodes.
We are working on improving the basic range scan mechanism in NDB,
we are working on improving the interface between the NDB storage
engine and the MySQL Server. Finally we are also working hard to
integrate all the changes in MySQL 8.0 into MySQL Cluster as well.

I will describe a number of different scenarios for how to build
applications in the cloud using a setup where we have 3 data nodes,
one in each availability domain of the Oracle Cloud.

But in this blog and a few more blogs I will start by looking
specifically at how the shared memory transporter improves performance
of standard sysbench benchmarks.

In the previous blog I showed how we have improved the performance of
Sysbench OLTP RW even for the standard TCP transporter. This was
due to the use of a new wakeup thread and to locking the NDB API
receive thread to a CPU where it can work undisturbed. The receive
thread handles reception of all messages from the NDB data nodes and
must be prioritised over the other MySQL Server threads; the best way
to achieve this is to use CPU locking. In the benchmarks we present in
this blog we always use this CPU locking.

In the figure above we show how the performance of a normal setup using
7.5.9 compares to the 7.6.6 with receive thread locked to a CPU using
the TCP transporter. Next we have a curve that shows performance when
simply replacing the TCP transporter with a shared memory transporter.
Next we show a curve of what happens if we configure the shared memory
transporter to use spinning for a while before it goes to sleep.

The final curve shows the performance when also spinning in the TC
threads and the LDM threads in the NDB data node. Spinning in those
threads is not likely to be beneficial if those threads are not locked
to their own CPU core, thus in this case one should not use hyperthreading
for those threads.

The takeaways from the graph above are the following:

1) The shared memory transporter has similar performance at low
concurrency to the TCP transporter. As concurrency increases the
shared memory transporter has better performance; the improvement
is 10% at top performance and more than 40% at very high concurrency.

2) Using spinning in the configuration of the shared memory transporter
improves performance at low concurrency significantly, by more than
20%. Top performance is similar to not using spinning, but it is
easier to get to this top performance.

3) Using spinning in the TC threads and LDM threads improves performance
even more at low concurrency. Performance increases by more than 30% at
low concurrency compared to no spinning and by 10% compared to spinning
only in transporter. Performance at high concurrency is similar for all
variants using shared memory transporter. So spinning helps to make the
MySQL Server need less concurrency to reach high performance levels.

We have added a graph below where we highlight the performance at 1 and 2
threads, since it is difficult to see those differences in the first
figure.

Configuring NDB to use a shared memory transporter is easy; the easiest
way is to simply set the new configuration variable UseShm to 1 on the
NDB data nodes. With this setting we create a shared memory transporter
between this node and all API nodes that share the same hostname as the
data node. It is also possible to create a separate shared memory section
to describe the transporter setup between two specific nodes in the
cluster.

Spintime for the shared memory transporter is easiest to set up using the
default shared memory transporter section. Spintime for TC and LDM threads in
the NDB data nodes is configured using the ThreadConfig variable of the data
nodes.
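
A hedged config.ini sketch combining these settings (UseShm and the spintime
attribute of ThreadConfig are the parameters referred to above; the thread
counts, CPU ids and spin times are purely illustrative):

  [ndbd default]
  UseShm=1
  ThreadConfig=ldm={count=8,cpubind=1-8,spintime=200},tc={count=4,cpubind=9-12,spintime=200},recv={count=1,cpubind=13}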

Since we are using mutex and condition variables in shared memory we are
only supporting shared memory transporters on Linux at the moment.

The conclusion is that using the shared memory transporter we can improve
performance at low concurrency by more than 30%, we can improve throughput
by 20%, and at very high concurrency (1536 threads) we get about 100%
improvement, all compared to the results using 7.5.9.

In the graph below we show only the 7.5.9 curve and compare it to the curve
achieved with all improvements in 7.6.


Monday, June 04, 2018

Improvements from NDB wakeup threads


In MySQL Cluster 7.6 we introduced a new thread type in the NDB API.
Traditionally each cluster connection has one send thread to assist in sending
messages to NDB and a receive thread that can assist in receiving messages
from NDB. There is also a connection thread that listens for new connections
and connects to the NDB data nodes.

In addition there is a set of user threads, created by the application,
that use the NDB cluster connection.

Most of the sending is done by the user threads; the NDB API send
thread is only used when we are sending faster than the network
is able to handle.

We can process received messages in the user threads or in the NDB
API receive thread. The default behaviour is to use the user threads
until the concurrency is higher than 8 threads working at the same
time. So at the highest concurrency it is the receive thread that
handles the signals, and at low concurrency they are handled directly
by the NDB API user threads. The threshold of 8 is configurable through
the MySQL Server variable ndb_recv_thread_activation_threshold.

Receiving in the user threads is slightly faster when only one thread is
active; it gives 3-4% better response time in this particular case.
However, as more and more threads are sending data to the NDB data nodes
the efficiency of using the NDB API receive thread increases.

One problem in using the NDB API receive thread is that it is responsible
both for receiving the messages from the NDB data nodes and for waking up
the NDB API user threads. At low load this is not an issue, but when the
load on the NDB API receive thread reaches 90% and beyond, it becomes
an issue.

To avoid this problem we added a new thread to the NDB API in MySQL Cluster
7.6: the wakeup thread. This thread has only one duty, which is to wake up
other threads. We experimented with a number of different variants to see
which one ensured that user threads are woken up as quickly as possible.

Our conclusion was that at low load the optimal is that the receive thread
handles the wakeups, but at very high load it requires assistance from one
wakeup thread. As load increases the receive thread will handle fewer and
fewer wakeups. At 99-100% load the receive thread will more or less offload
all wakeup calls to the wakeup thread.

In the figure above we compare a normal sysbench OLTP RW experiment
comparing 7.5.9 with 7.6.6. As can be seen there is no difference until
we reach 32 connections. As we start to offload a subset of the wakeups
to the wakeup thread we improve performance of the application.

The throughput increases by 5% due to this new feature, and with even more
threads the performance drops more slowly, such that we gain 15-20% more
performance at 512 connections.

The best performance is normally achieved by using the NDB API
receive thread and locking this thread to a specific CPU.
When starting the MySQL Server one specifies these CPUs in the
configuration parameter ndb_recv_thread_cpu_mask. If the MySQL
Server uses several NDB cluster connections, the parameter
should specify one CPU per cluster connection.

When locking the NDB API receive thread to a CPU, it is important to
also lock the MySQL Server process to other CPUs, and if other processes
are running on the same machine, these also need to be locked to
CPUs that do not interfere with the NDB API receive thread.
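
A hedged sketch of the relevant my.cnf settings (the variable names are the
ones mentioned above; the CPU mask value is illustrative, must match your own
CPU layout, and its exact format should be checked against the documentation):

  [mysqld]
  # Let the NDB API receive thread take over once 8 threads are active
  ndb_recv_thread_activation_threshold=8
  # Lock the receive thread of the (single) cluster connection to one CPU
  ndb_recv_thread_cpu_mask=0x1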

The figures above show the improvements when using one CPU core locked
to handle the NDB API receive thread. Locking the receive
thread to a CPU adds another 5% to the total throughput and up to
20% more at high thread counts.

So what we have achieved with MySQL Cluster 7.6 is that we can increase
the throughput by at least 10% and performance at high thread counts
can increase by as much as 40%. All these numbers are still using the
TCP transporter. In a coming blog we will show how these numbers increase
even more when using the shared memory transporter. In addition we will
show how using the thread pool with NDB can even further increase stability
of high throughputs at high thread counts.

The above experiment was always done with one data node using 8 LDM
threads, and the data node was locked to CPUs within one CPU socket. The
MySQL Server was locked to 30 CPUs (15 CPU cores). In all cases
the bottleneck is that we only use one cluster connection. In 7.5.9 this
cluster connection scales to about 18 CPUs and with 7.6.6 it scales to
more than 20 CPUs. So using one cluster connection per 8 CPU cores is
usually appropriate.

Saturday, June 02, 2018

MySQL Cluster 7.6 future proof

MySQL Cluster 7.6 is designed to improve the restart times
for database sizes that MySQL Cluster 7.5 and earlier versions
support.

At the same time MySQL Cluster 7.6 prepares for the innovations
in HW architecture. Between 2008 and 2012 I was heavily involved
in handling the previous change in HW architecture: the introduction
of multi-core architectures. Between 2008 and 2012 we scaled the
MySQL Server from 4 CPUs to 64 CPUs. The NDB data nodes were scaled
from 2 CPUs to more than 50 CPUs in the same timeframe.

The next major shift in HW architecture is the introduction of
persistent memory, which means that we will get persistent memory
accessible at the same level as DRAM. We don't yet know all the
characteristics we will see in those persistent memories, but a
guesstimate of what to expect is:

1) About 10x more memory per DIMM
2) About 4x cheaper memory
3) About 10x slower access to the memory compared to DRAM
4) Memory will be persistent and survive a restart of the machine

As an example of this development Intel announced Optane persistent memory
to be fully available in 2019. These memory DIMMs will be available in 512 GB
DIMMs. A modern 2-socket server of today comes equipped with about 512 GB
memory. A high-end server of 2019 will be able to be shipped with 6 TByte of
persistent memory and on top of that also 400 GByte of DRAM memory.

MySQL Cluster 7.5 and earlier versions have a good fit for the modern servers
of today. MySQL Cluster 7.6 brings a much improved recovery architecture that
will improve restart times by 4x using current database sizes.

At the same time MySQL Cluster 7.5 won't work very well on machines with
6 TByte of memory. This is due to the use of full checkpoints: each
checkpoint has to write 6 TBytes to disk.

Actually almost every in-memory DBMS has the same issue, so all in-memory
DBMSs have to adapt to this new reality. MySQL Cluster leads the way here
by introducing partial checkpoints in MySQL Cluster 7.6. Even disk-based
DBMSs will get a fair number of issues to handle around checkpointing
when the page cache grows to multi-TByte sizes.

During development of partial checkpoints I analyzed the difference between
the method implemented in MySQL Cluster 7.6 and using a page cache. The
method used in MySQL Cluster 7.6 needed to write 100x less data to disk as
part of checkpoints.

To give you a feeling for the impact of the checkpointing times in MySQL
Cluster I will describe what will happen with full checkpoints using
a 6 TByte database size.

Assume that we write 100 MByte per second to disk for checkpoints.
In this case it takes 60,000 seconds to perform a checkpoint. This
means 16 hours and 40 minutes.

Now assume we perform a checkpoint in MySQL Cluster 7.6. In 7.6 we only
need to checkpoint those partitions that have changed any data. We assume
that half of the partitions in the database haven't changed since the last
checkpoint. We assume that we have a fair amount of updates, but since
checkpoints happen at intervals of around a minute, only a small portion
of the 6 TByte will be updated. A partial checkpoint will always checkpoint
at least one part in 2048. This means that the minimum size of a checkpoint
in this scenario would be 1.5 GByte. Thus the checkpoint will take 15 seconds
in MySQL Cluster 7.6.

This means that the checkpoint time has decreased by a factor of 4000x
in this particular case.

The choice of the factor 2048 is to ensure that we can maintain very short
checkpoint times all the way up to memories of 16 TByte and even beyond this
it will still function very well.

Thus MySQL Cluster 7.6 is already prepared for the next generation of
HW architectures arriving in 2019.

As part of a node restart we perform a checkpoint and have to wait for the
previous checkpoint to complete. From this we can deduce that the improvement
in restart times becomes even bigger as we go towards bigger memories.

Friday, June 01, 2018

MySQL Cluster 7.6 in numbers

MySQL Cluster 7.6 is now released as GA. 7.6 contains a lot of interesting changes.
Here is a list of changes in numbers in the recent MySQL Cluster versions.

1) With the new changes of our checkpointing scheme the restart times in MySQL
Cluster are 3-4x faster for a database size of around 60 GByte.

2) The changes we have done in our parallel query execution parts have improved
some benchmark queries we have tested with up to a factor of 2.

3) A new wakeup thread in the NDB API means that each API node can deliver 10%
more throughput.

4) 7.6.6 introduces a new shared memory transporter for communication between
colocated MySQL Servers and NDB data nodes. This improves throughput by 10% and
can improve throughput at low thread counts by as much as 60%.

5) Benchmarks using the Read Backup feature (released in 7.5) shows that
performance increases around 50% even in a network with a very high bandwidth.

6) A benchmark in the Oracle Cloud using the new feature to ensure locality of
reads within the same availability domain shows 50% improved throughput.

7) Parallelisation of the UNDO log applier speeds up this phase of the NDB recovery
by a factor of 5.

8) We have changed the defaults for index build during restarts that can speed up
the index build phase of the restart by a factor of 2.

9) We have changed the default batch sizes of index builds during restore of a
backup that can speed up restore by at least a factor of 2.

As these numbers reflect, MySQL Cluster 7.6 has significantly improved
availability by decreasing the time it takes to restart a node. At the same
time we have also significantly improved the performance of SQL applications
on top of MySQL Cluster through the Read Backup feature, the Fully Replicated
feature, a new shared memory transporter, a new feature for locality of reads
in a cloud environment and finally improved throughput of each API node.

I will present those results in detail in the coming weeks. Some of those tests
were performed in the Oracle Cloud, which means that anyone can reproduce the
exact same tests themselves, which I think is a very nice feature of the
cloud.

Tuesday, May 01, 2018

Linux configuration for MySQL Cluster

NDB Cluster was designed from the ground up for real-time operations.
It has its origins in the telecom industry where predictability of performance
and latency is absolutely critical. In addition the telecom vendors are
competing very much on pricing of their product.

This means that it is important to get the most performance out of each
telecom server. Increasing the performance by 10% means that you need 10%
less network equipment. If many servers are shipped this can be a substantial
cost that is worth spending valuable engineering time to reduce.

Thus if you are using, or are planning to use, MySQL Cluster on many servers
it is a good idea to spend some time ensuring that you achieve optimal
performance. If you are operating MySQL Cluster on just a few servers the
payback on your time investment might not be as high.

Another reason to take a deep interest in performance of MySQL Cluster is
of course that you want to learn more about Linux and MySQL Cluster.

In this blog post I will mainly discuss one specific optimisation that can have
a large impact on your performance: placing the execution threads of
MySQL Cluster and Linux on the proper CPUs.

In my book MySQL Cluster 7.5 inside and out I devote 3 chapters to this
topic. One chapter on the usage of hyperthreading in data nodes, one
chapter on configuration of CPU locking for data nodes in NDB and
one chapter on configuring Linux interrupt handling.

In this blog post I will mainly discuss the Linux configuration.

NDB is a distributed DBMS where all operations pass through the network,
thus network operations are an essential part of a MySQL Cluster setup.

So how does a network packet arrive from the wire into MySQL Cluster and
out again?

The first thing that happens is that the network card receives a packet. It now
needs to inform a CPU about this packet such that Linux TCP/IP code can
handle the packet.

Now, the Linux TCP/IP handling is divided into 3 parts. The first part handles
the HW interrupt from the network card, the second part handles the
soft interrupt. The final part is a function call interrupt from the soft
interrupt to the device driver that performs most of the interrupt handling.

The function call interrupt part was introduced in Linux 2.6.35, thus in older
Linux kernels this concept doesn't exist.

Modern Linux kernels also use the NAPI mechanism; this means that the
HW interrupt is disabled while we process a set of HW interrupts.
This mechanism avoids overloading the CPUs with interrupt handling and
takes care of interrupts in batches at high load.

Now Linux interrupt setup is highly configurable and the defaults depends
among other things on the Linux distribution.

There are three main areas that can be configured for Linux. These are
RSS (Receive Side Scaling), RPS (Receive Packet Steering) and
RFS (Receive Flow Steering).

RSS handles the setup of the HW interrupt and the soft interrupt. Often the
default is to spread these interrupts over all CPUs. This is usually not a
good idea for MySQL Cluster since there are CPUs we want to
protect as much as possible from all OS activity (in particular the
CPUs running the LDM threads).

RPS is the newer mechanism introduced in Linux 2.6.35; with it one can
configure the CPUs where the function call interrupts are handled.

Finally, RFS tries to ensure that the function call interrupt is executed
on the same CPU where the application calls recv. In NDB data nodes
this happens in the recv threads. In a MySQL Server the communication
from the data nodes arrives in the NDB API receive threads (these can be
configured to be locked to specific CPUs). The communication from MySQL
clients arrives in the connection threads where the queries are executed,
so these are spread over all the CPUs the MySQL Server is executing on.
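
A hedged sketch of what this tuning can look like (the sysfs and procfs paths
are the standard Linux interfaces for RSS, RPS and RFS; the interface name,
IRQ number, CPU masks and table sizes are invented for illustration and must
be adapted to your own hardware and CPU layout):

  # RSS: steer the HW interrupt of the NIC away from the LDM thread CPUs
  echo 01 > /proc/irq/123/smp_affinity

  # RPS: run the function call interrupts on CPUs 0-7
  echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus

  # RFS: steer packets towards the CPU where the application calls recv
  echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
  echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt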

When the NDB data node has processed the messages arriving from the network
it will send some responses back onto the network. In the NDB data nodes this
happens either in all the threads, in a send thread, or in a combination
of the two.

Linux has the possibility to configure transmit interrupts as well through
something called XPS (Transmit Packet Steering). The optimal behaviour is
achieved if the transmit interrupts are handled by the same CPU as the send
call is done from.

Configuring Linux and MySQL Cluster together can have a major positive
impact on performance. To achieve the best performance it is important to
consider the flow of data through the servers to process the database requests
from NDB clients (e.g. the MySQL Server).

Saturday, March 17, 2018

NDB Cluster and disk columns

NDB is mainly an in-memory database. We do however also have the possibility
to store non-indexed columns on disk. This data uses a page cache just as any
other normal disk-based DBMS.

Interestingly, with the increase in memory sizes one could think that
disk data becomes less important for MySQL Cluster. Actually the opposite
is true.

The reason is again the HW development. NDB is designed with predictable
latency as a very basic requirement. In the past disks meant hard drives. Access
time to a hard disk was several milliseconds at best. Given that our requirement
was to handle complex transactions within 10 milliseconds disk data storage
was out of the question.

Modern HW is completely different: it uses SSD devices, first attached through
the SATA interface that enabled up to around 500 MByte per second and
a few thousand IO operations per second (IOPS). The second step was the
introduction of SSD devices on the PCI bus. This lifted performance up to more
than 1 GByte per second. These devices are extremely small and still very
powerful. I have an Intel NUC at home that has two of those devices.

Thus the performance difference between disk storage and RAM has decreased.

The next step on the way was to change the storage protocol and introduce NVMe
devices. These still use the same HW, but use a new standard that is designed for
the new type of storage devices. Given those devices we have now the ability to
execute millions of IOPS on a standard server box with access times of a few tens
of microseconds.

For NDB this means that this HW fits very well into the NDB architecture.
While developing the Partial LCP algorithm we also did a lot of work on
improving our disk data implementation. We see more and more people using
disk data columns in NDB.

The next step is even more interesting: it will bring storage onto the memory
bus with access times of around one microsecond. For NDB this storage can be
treated as memory to start with, thus making it possible to soon have multiple
TBytes of memory in standard boxes.

Thus HW development is making the NDB engine more and more interesting to use.

One notable example that uses disk data columns in NDB is HopsFS. They use the
disk data columns to store small files in the metadata server of the HopsFS
implementation of the Hadoop HDFS NameNode. This means much faster
access to small files. The tests they did showed that they could handle
hundreds of thousands of file reads and writes per second even using fairly
standard SSD disks on the servers.

The implementation of disk data in NDB is done such that each row can have
three parts. The fixed size memory part is accessed quickly using a row id.
The variable sized part is accessed through a pointer from the fixed size
part.

The disk columns are also accessed through a reference in the fixed size part.
This reference is an 8-byte value that refers to the page id and page index of
the disk columns.
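
As a hedged illustration of how disk data columns are declared (CREATE LOGFILE
GROUP, CREATE TABLESPACE and STORAGE DISK are the standard NDB SQL statements
for this; the file names, sizes and table definition are invented for
illustration):

  CREATE LOGFILE GROUP lg1
    ADD UNDOFILE 'undo1.log'
    INITIAL_SIZE 2G
    ENGINE NDBCLUSTER;

  CREATE TABLESPACE ts1
    ADD DATAFILE 'data1.dat'
    USE LOGFILE GROUP lg1
    INITIAL_SIZE 16G
    ENGINE NDBCLUSTER;

  -- Non-indexed columns are stored on disk; indexed columns (here the
  -- primary key) always stay in memory.
  CREATE TABLE small_files (
    inode_id BIGINT PRIMARY KEY,
    file_data VARBINARY(4000)
  ) TABLESPACE ts1 STORAGE DISK ENGINE NDBCLUSTER;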

Before we can access those pages we go through a page cache. The page cache is
based on caching techniques that were state of the art a few years ago.

The idea is quite simple. The page cache uses a normal hot page queue. Pages are
brought up in this queue when they are accessed. A single access will bring it up,
but to be more permanent in the page cache a page has to be accessed several times.

Now each page is represented in those queues by a page state record. The basis
of the page cache algorithm is that a page can be represented in a page state
record even if the page is not in the page cache.

NDB has a configuration variable called DiskPageBufferEntries; by default this
is set to 10. It is the multiplication factor for how many more pages we keep
page state records for, compared to the number of pages that fit in the page
cache.

So for example if we have set DiskPageBufferMemory to 10 GByte and
DiskPageBufferEntries to 10, we will have page state records covering
100 GBytes of pages in the queues. Thus even when a page is paged out we keep
it in the list, and thus we can see patterns of reuse that are longer than the
page cache we have access to. The factor of 10 means that the page state
records take about 3% of the size of the page cache itself. Thus the benefit
of the extra knowledge about page usage patterns comes at a fairly low cost.
The factor of 10 is configurable.
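
A hedged config.ini sketch of the example above (DiskPageBufferMemory and
DiskPageBufferEntries are the actual data node parameters; the values simply
restate the example):

  [ndbd default]
  # 10 GByte of page cache for disk data columns
  DiskPageBufferMemory=10G
  # Keep page state records for 10x as many pages as fit in the page cache
  DiskPageBufferEntries=10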

Many cloud servers come equipped with hundreds of GBytes of memory (some even
TBytes) and can also store a number of TBytes on NVMe devices. NDB is well
suited for those modern machines and MySQL Cluster 7.6 has been designed to be
suitable for this new generation of HW.

Friday, March 16, 2018

Discovering rows that have been updated since last checkpoint

One important problem that requires a solution is to decide whether
a row has been updated since the last checkpoint or not.

Most implementations use some kind of mechanism that requires extra
memory resources and/or CPU resources to handle this.

NDB uses the fact that each row is already stamped with a timestamp.
The timestamp is what we call a global checkpoint id. A new global
checkpoint is created about once every 2 seconds (can be faster or
slower by configuration).

Thus we will overestimate the number of rows written since the last checkpoint
by a little bit, but with checkpoints taking a few minutes, the extra overhead
of this is only around 1%.

Thus when we scan rows we check the global checkpoint id of the row; if
it is bigger than the global checkpoint id that the last checkpoint fully
covered, we write the row as changed since the last checkpoint. Actually
we also have the same information at the page level, thus we can check
the page header and very quickly scan past an entire page if it hasn't been
updated since the last checkpoint.

The same type of scanning is also used to bring a restarting node back in
sync with the live node. This algorithm has been present in NDB since
MySQL 5.1.

Partial LCPs and Read-only tables

In MySQL Cluster 7.5 we use Complete Checkpoints. In MySQL Cluster 7.6
we implement an approach where we only checkpoint a part of the database
in each checkpoint.

A special case is a checkpoint of a table partition where no changes
at all have happened since the last checkpoint. In this case we implemented
a special optimisation such that it is not necessary to checkpoint anything
at all for this table partition. It is only necessary to write a new LCP
control file, which is 4 kBytes in size for each table partition (it can grow
to 8 kBytes if the recovery will require more than 980 checkpoints to
recover).

This means that if your database contains a large set of read-only tables,
there will be no need to checkpoint those tables at all. This feature
is used also when setting EnablePartialLcp to false.

Partial LCPs and disk space

One of the main objectives of the new Partial LCP algorithm in MySQL
Cluster 7.6 is to keep up with the development of modern HW.

I have already described in previous blogs how Partial LCP can handle
nicely even database sizes of 10 TBytes of memory with a very modest
load on the disk devices.

Now modern HW has shifted from using hard drives to using SSDs.

The original approach in NDB assumes that the checkpoints and
REDO logs are stored on hard drives. In MySQL Cluster 7.5 the
disk space required for the REDO log is a bit larger than the
DataMemory size. The reason is that we want to survive also when
loading massive amounts of data.

In MySQL Cluster 7.5 we cannot remove any checkpoint files until
a checkpoint is fully completed. This means that we require around
4x the memory size of disk space for REDO logs and checkpoints.

With hard drives this is not a problem at all. As an example, my
development box has 32 GBytes of memory and 2 TBytes of disk
space, thus 64x more disk space than memory space.

With modern servers this size difference between memory and
disks is decreasing. For example many cloud VMs only have
a bit more than 2x the disk size compared to the memory size.

So one goal of MySQL Cluster 7.6 is to fit in much less disk
space.

The aim is to solve this with a three-pronged approach.

1) Partial LCP means that we can execute the checkpoints much
faster. Since REDO logs only need to be kept for around two
checkpoints this means a significant decrease of size requirements
for REDO logs. The aim is to only need around 10% of the disk
space of memory for the REDO logs. This work is not completed
in 7.6.4. As usual there are no guarantees when this work will be
completed.

2) Using Partial LCP we can throw away old LCP files as soon
as we have created a new recoverable LCP for the table partition.
Thus it is no longer necessary to store 2 LCPs on disk. At the
same time there is some overhead related to Partial LCPs. With the
default setting this overhead is 50% plus a bit more. Thus we should
always fit within about 1.6x the memory size.

It is possible to set EnablePartialLcp to false, in this case all
checkpoints will be Complete Checkpoints. This means more
writes to disk for checkpoints, but it will decrease the storage
space to around the same as the memory size.

3) Using CompressedLCP set to 1 we can decrease LCP storage
by another factor of 2-3x (usually around 2.7x). This feature has
existed for a long time in NDB.
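
A hedged config.ini sketch of the three knobs discussed above
(EnablePartialLcp, RecoveryWork and CompressedLCP are the actual 7.6 data node
parameters; the values shown simply restate the defaults and examples from the
text, they are not recommendations):

  [ndbd default]
  # Partial checkpoints; set to 0 to force complete checkpoints
  EnablePartialLcp=1
  # Checkpoint size vs recovery work trade-off, range 25-100, default 50
  RecoveryWork=50
  # Compress LCP files on disk, typically 2-3x smaller checkpoints
  CompressedLCP=1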

Thus it should be possible to significantly decrease the requirements
on storage space when running NDB using MySQL Cluster 7.6.

NDB Checkpoints vs Disk-based checkpoints

At one point during the development of the Partial LCP algorithm I started
wondering how it compares to an approach using standard page-based
checkpointing, as happens in a traditional disk-based DBMS.

The scenario I considered was a case where one has 10 TBytes of memory in
the system. In this case the likelihood of updating the same row twice
within a checkpoint is very small (except for hotspot rows of course).

With a row size of 250 bytes there will be 40 billion rows in the database.
I often assume in my modelling that a checkpoint takes about 5 minutes.
This means that even with 1 million writes per second less than 1% of the
data in the database is updated during a checkpoint.

A more normal update speed would be e.g. 100,000 writes per second.
With 250-byte rows this means that each checkpoint writes about 25 MBytes
of changed rows per second, and thus about 7.5 GBytes is written in 5 minutes.
This represents the delta, the changed records.

In addition, in the Partial LCP implementation a part of the database
must also be written in full. In this case we have written 0.075% of the
database since the last checkpoint. The default setting thus requires that we
select a number of parts that ensures that at least 0.15% of the database is
fully written. Thus 2 of the 2048 parts will be written and the
total checkpoint size will be 22.5 GByte. To write this in 5 minutes
we need to write 75 MByte per second of checkpoint data.

Now let us do the same experiment with a traditional disk-based
DBMS. We assume a standard page size of 8 kBytes. This means
that the page cache will have 1.28 billion pages. With 100,000
updates per second we will update 36 M pages, thus around 3% of
the pages, and it is therefore very unlikely that any large number of
pages receive more than one page write.

Thus a checkpoint must write each and every one of those 36 M pages.
This means a total checkpoint size of 288 GByte, so the
DBMS must write almost 1 GByte of checkpoint data per second, thus
more than 10x the amount NDB will write using the Partial LCP
algorithm.

In NDB it is possible to write even less during a checkpoint by setting
the configuration parameter RecoveryWork higher. Setting this to
its maximum value of 100 means in the above calculation that we
only need to write 15 GBytes per checkpoint and thus the checkpoint
speed is 50 MBytes per second.

The drawback of this setting is that we increase the work at recovery
instead. With the setting 100 the overhead stored on disk is 2x, and the
same overhead applies at recovery time. The default setting gives 50%
overhead in storage and at recovery.

It is possible to set it lower as well, down to 25. In this case we will
write more, in the example we would write 37.5 GBytes and thus
125 MBytes per second. That is still more than 6x better than the
disk-based DBMS. In this case the overhead in storage is 25% and
similarly the overhead at recovery.
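
The arithmetic behind the numbers in this example can be summarised in a few
lines of Python (just a rough model using the assumptions above, not anything
taken from the NDB code):

# Checkpoint write volume per 5 minute checkpoint for a 10 TByte database
# with 100,000 updates per second and 250 byte rows (the example above).

row_bytes       = 250
updates_per_sec = 100_000
lcp_seconds     = 300      # ~5 minute checkpoint

delta = updates_per_sec * lcp_seconds * row_bytes    # ~7.5 GBytes of changed rows

# NDB Partial LCP: total writes ~= write_factor * delta, using the
# approximate factors quoted in this example for each RecoveryWork setting.
for recovery_work, write_factor in [(50, 3), (100, 2), (25, 5)]:
    total = write_factor * delta
    print(f"RecoveryWork={recovery_work}: {total/1e9:.1f} GB "
          f"-> {total/lcp_seconds/1e6:.0f} MB/s")

# Page-based disk DBMS: almost every update dirties a distinct 8 kByte page.
page_bytes  = 8 * 1024
dirty_pages = updates_per_sec * lcp_seconds          # ~30M pages
total       = dirty_pages * page_bytes               # ~245 GBytes
print(f"page-based: {total/1e9:.0f} GB -> {total/lcp_seconds/1e6:.0f} MB/s")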

Although the overhead for restoring checkpoints is higher using
Partial LCP, recovery will be a lot faster in 7.6. Recovery includes
executing one LCP as part of the restart, and this LCP can be 100x
faster than executing an LCP in 7.5. Thus recovery will often be
significantly faster using 7.6.

Also the disk storage for LCPs is decreased in 7.6.

NDB Checkpoints and research on In-Memory Databases

I just read an article called Low-Overhead Asynchronous Checkpointing in
Main-Memory Database Systems. It was mentioned in a course in Database
Systems at Carnegie-Mellon University, see here.

In MySQL Cluster 7.6.4 we released a new variant of our checkpointing designed
for modern HW with TBytes of main memory. I think studying this implementation
will be very worthwhile both for users of NDB and for researchers in DBMS
implementations. It implements a new class of checkpoint algorithms that is currently
a research topic in the database research community.

It was interesting to compare our approach that I called Partial LCP with approaches
taken by other commercial in-memory databases and with the approach presented
in the paper.

LCP stands for Local CheckPoint, which is the name we use for our checkpoint
protocol in NDB.

The course presents a number of ideal properties of a checkpoint implementation.

The first property is that it doesn't slow down regular transaction processing.

In the case of NDB we execute checkpoints at a steady pace which consumes
around 5-10% of the available CPU resources. This will decrease even more with
the implementation in 7.6.

The second is that it doesn't introduce any latency spikes.

NDB checkpointing, both new and old, executes in steps of at most 10-20
microseconds. So the impact of checkpointing on transaction latency is
extremely small.

The third property is that it doesn't require excessive memory overhead.

NDB checkpointing consumes a configurable buffer in each database thread. The
ideal size of this is around 1 MByte. In addition we have a REDO log buffer that
is usually a bit bigger than that. That is all there is to it. There is no extra memory
space needed for checkpointing rows. The checkpointing performs a normal scan
of the rows and copies the memory content to the buffer and as soon as the buffer
is full it writes it to disk using sequential disk writes.

It is fair to say that NDB does a good job in handling those ideal properties.
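
As a simplified illustration of this buffering scheme (a sketch of the idea
only, not the actual NDB code), the checkpoint scan can be thought of as
follows:

# Simplified model of the LCP write path: scan rows, copy them into a small
# per-thread buffer, and flush the buffer to disk with sequential writes.
# This is only a sketch of the idea, not the actual NDB implementation.

BUFFER_SIZE = 1 * 1024 * 1024   # ~1 MByte checkpoint buffer per thread

def checkpoint_partition(rows, lcp_file):
    buffer = bytearray()
    for row in rows:                      # normal scan of the rows in memory
        buffer += row                     # copy the row content into the buffer
        if len(buffer) >= BUFFER_SIZE:    # buffer full: one sequential write
            lcp_file.write(buffer)
            buffer.clear()
    if buffer:                            # flush the tail of the buffer
        lcp_file.write(buffer)

# Usage sketch (table.rows() and row.pack() are hypothetical helpers):
# with open("lcp_part.dat", "wb") as f:
#     checkpoint_partition((row.pack() for row in table.rows()), f)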

The course presents two variants called fuzzy checkpoints and consistent checkpoints.
The course defines fuzzy checkpoints as a checkpoint that can write uncommitted
data. I would normally use the term fuzzy checkpoint to mean that the checkpoint
is not consistent at a database level, but can still be consistent on a row basis.

Actually NDB is a mix of the definitions provided in the course material. It is a
consistent checkpoint for each row. But different rows can be consistent at very
different points in time. So on a row basis NDB is consistent, but at the database
level the checkpoint is fuzzy. Thus to perform recovery one needs to install the
checkpoint and then apply the REDO log to get a consistent checkpoint restored.

Next the course presents two variants called Complete Checkpoints and Delta
Checkpoints. Complete Checkpoint means that the entire database is written in
each checkpoint. Delta Checkpoint means that only changes are written in a
checkpoint.

This is where MySQL Cluster 7.6 differs from 7.5. 7.5 uses a Complete Checkpoint
scheme. 7.6 uses a Partial Checkpoint scheme.

In my view the NDB variant is a third variant which is neither a Complete nor a
Delta Checkpoint. Partial means that it writes the Delta, that is, all changes
since the last checkpoint. But it does also write a Complete Checkpoint for a part
of the database, thus the name Partial Checkpoint. Thus it is similar to an
incremental backup scheme.

NDB divides each table partition into up to 2048 parts. A checkpoint can write
0 parts (only if no changes occurred in the table partition since the last
checkpoint), 1 part if the number of writes is very small, all 2048 parts if
almost all rows have been updated, or anything in between based on how many
rows were updated since the last checkpoint.
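
A rough sketch of this decision could look as follows (this is my own
simplification for illustration, the real formula is in the NDB source code):

# Decide how many of the 2048 parts to write fully in the next LCP, based on
# how much of the table partition changed since the last LCP. Simplified
# model only; the real formula lives in the NDB source code (Backup.cpp).

PARTS = 2048

def parts_to_write(rows, changed_rows, recovery_work=50):
    if changed_rows == 0:
        return 0                          # only a new control file is written
    change_ratio = changed_rows / rows
    # With RecoveryWork=50 we aim to fully write roughly 2x the changed
    # fraction of the database (so total writes are ~3x the changed rows).
    target_fraction = min(1.0, change_ratio * 100 / recovery_work)
    return max(1, min(PARTS, round(target_fraction * PARTS)))

print(parts_to_write(40_000_000, 0))           # 0 parts, nothing changed
print(parts_to_write(40_000_000, 30_000))      # a small number of parts
print(parts_to_write(40_000_000, 40_000_000))  # all 2048 parts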

Almost all commercial In-Memory DBMSs still use a complete checkpoint scheme.
As we move towards TBytes of memory this is no longer a plausible approach.

The NDB approach means that we can perform a checkpoint in a few minutes
even in a system with 16 TBytes of memory where we need to write about
8 GBytes plus the changes since the last checkpoint.

Thus NDB takes the step into a new world of massively large In-Memory DBMSs
with the introduction of MySQL Cluster 7.6 and its new Partial LCP implementation.

My new book "MySQL Cluster 7.5 inside and out" describes the LCP
implementation in 7.5. The description of the Partial LCP can be found in my
blogs and in some very detailed descriptions in the source code itself, among
other things a 10-page proof that the algorithm actually works :)

The nice thing with the Partial LCP approach in NDB is that it requires no
more work after writing the checkpoint. There is no need to merge checkpoints;
this happens automatically at recovery. There is some overhead in that a row
can appear in multiple checkpoints and thus there is some overhead at recovery.
We calculate the number of parts to use based on the amount of changes. We even
implemented an LCP simulator that calculates the overhead while inserting and
deleting large amounts of rows; it has been used to find the proper settings
for the configurable parameters of the algorithm.


Wednesday, February 28, 2018

UPDATE: MySQL Cluster 7.5 inside and out

Publishing a book internationally turned out to be a bit more complex than I
originally thought. Therefore there are three different ways to order the book
MySQL Cluster 7.5 inside and out.

The E-book, which is now available world-wide.
The paperback version, which is also now available world-wide.
Finally the bound version, which is available from the Nordic countries,
Germany and Switzerland.

The original idea was to publish it as an E-book and as a bound book.
Given that the book is 640 pages long I felt that I wanted a bound book
to ensure that I can read the book a lot. I've got a few copies of the bound
book at home and I have it on my desk all the time together with Jesper's
and Mikiya's Pro MySQL NDB Cluster book.

As it turned out the printer only had international agreements to
print paperback books with figures in black and white (the bound
version has color figures). To ensure that the book is world-wide
available I decided to also publish a paperback version.

So for example at the UK/US Amazon's bookshop the versions available are
the E-book and the paperback version.

Personally I still prefer the bound version. I discovered that a German
internet site has international delivery. So if you want to buy the bound version
of the book you can use this site: Hugendubel.de.

If you have any comments on the book, or any errata, you can publish a comment
on this blog. I will also publish comments on this blog every now and then when
I discover any errors.

Feel free to also provide ideas for inclusion in possible future editions of
this book.

Monday, February 12, 2018

Adaptive algorithms in NDB and in cars

The world is brimming with the ideas of self-driving cars and all sorts of
other concepts where computers are supposed to take care of
decision making.

This makes me a bit worried, for one because I simply like to drive and
would not want a computer to interfere with my driving. I am already
quite irritated by many automatic things in cars that don't really work
when winter is upon us in Sweden :)

Anyway, this post is not about that; it is more about the general
problem of designing adaptive algorithms.

I've been designing NDB software for more than 20 years. During the
course of these years I have learned a bit about what is optimal
when executing NDB. Most of the software I write today is about
putting this knowledge into the NDB software itself.

There is a trend in databases today to automate configuration handling
in a DBMS. In NDB we started this trend in MySQL Cluster 7.4
when we implemented a "top" facility inside the NDB data nodes.
At the same time we also keep track of lags in writing to disk.

We used this knowledge to design an adaptive algorithm that changes
the speed of writing local checkpoints based on the current CPU usage
and IO lag.

We moved on in 7.5 and implemented an adaptive algorithm to control
from where sending will happen. This algorithm is also based on
keeping track of CPU usage in the threads in the data node.

The new partial LCP algorithm is also highly adaptive where it decides
how incremental the LCP should be based on the writing in the
database.

There is also work ongoing on some adaptiveness in the NDB API
where some threads will start up to assist the receive thread in the NDB
API when it gets overloaded.

There is even more work ongoing to ensure that the checkpoint speed
adapts also to conditions where we are starting to run out of REDO log.

Now adaptive algorithms are clearly a good idea, but, and there
is a big but, there are two problems with ANY adaptive algorithm.

The first problem is oscillation. Adaptive algorithms work by changing
the environment based on input from the environment. When you look
at an adaptive algorithm that works it is actually quite impressive. By
merely finding the proper conditions based on the input you can get a
system that quickly adapts to any working condition and finds a new
optimal spot to work in.

My original studies at university were in mathematical statistics.
One important concept in mathematical statistics is that you
have stable states and you have transient states.

An adaptive algorithm will work fine as long as the frequency of
changes in the environment is not faster than the time it takes to
find a new stable state.

As an example, most of the adaptive algorithms in NDB take
decisions to change the environment about once per second.
One important thing to make those adaptive algorithms better
at adapting is to not change the controls too much. If one bases
the decision on what to do in the next second only on the last
second, the adaptive algorithm is quite likely to
self-oscillate.

Thus it is important to build in some inertia in the adaptive
algorithm. This protects the algorithm from going wild.
But it doesn't make it adapt to conditions that change
quicker than the change frequency. Adaptive algorithms
cannot handle that.

So this is the first problem, to ensure that the adaptive
algorithm is quick enough to change to handle the
changing environment, but not so quick that it starts to
self-oscillate.
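
To illustrate the inertia idea, here is a toy controller in Python that
smooths the measured CPU load before adjusting the checkpoint speed
(illustrative only, not the actual algorithm in NDB):

# Toy adaptive controller with inertia: adjust the checkpoint write speed
# once per second based on a smoothed CPU load instead of only the last
# second. Illustrative only; the real NDB algorithms are more involved.

class CheckpointSpeedController:
    def __init__(self, min_mb_s=10, max_mb_s=200, smoothing=0.2):
        self.speed = min_mb_s
        self.min_mb_s = min_mb_s
        self.max_mb_s = max_mb_s
        self.smoothing = smoothing        # low value = more inertia
        self.cpu_avg = 0.0

    def tick(self, cpu_load_last_second):
        # An exponential moving average provides the inertia that prevents
        # the controller from reacting to every one-second spike.
        self.cpu_avg += self.smoothing * (cpu_load_last_second - self.cpu_avg)
        if self.cpu_avg > 0.9:            # CPU is almost saturated: back off
            self.speed = max(self.min_mb_s, self.speed * 0.9)
        elif self.cpu_avg < 0.7:          # plenty of CPU left: speed up a bit
            self.speed = min(self.max_mb_s, self.speed * 1.1)
        return self.speed

ctrl = CheckpointSpeedController()
for load in [0.5, 0.95, 0.95, 0.6, 0.6, 0.6]:
    print(round(ctrl.tick(load), 1))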

The second problem is when two adaptive algorithms
crash into each other. As an example in NDB we have a
problem when CPU load is extremely high due to
application activity while at the same time we are
coming close to the limit of the REDO log. In this case
we have two adaptive algorithms that conflict, one wants
to decrease the checkpoint speed to leave CPU for the application
activity while the other algorithm tries to speed up the
checkpoint activity to avoid running out of REDO log.

Now in a car the stakes are higher, there are human lives involved.
A self-driving car will have to solve almost the same problem
when the driver has decided on the speed he wants
to travel while at the same time the control system of the car sees
dangers coming up ahead. These dangers could be other
cars, cliffs or any other thing.

Sometimes cars even have to make decisions on whether
their own passengers should survive or whether the bystander
should survive.

So the software of a self-driving car, and any other
self-controlling software, has two big problems
to solve.

1) How often should I take input from the environment and
decide to change the controller parameters?
2) How should I handle conflicting requirements?

Failure in handling 1) will lead to self-oscillating
behaviour and failure to handle 2) will lead to
crashes.

So hopefully any developer of self-driving cars has read up
a lot on adaptive algorithms and knows exactly when the
algorithm is safe and when it isn't.

Personally I always feel a bit uneasy about any adaptive
algorithm since I know that it is almost impossible to
predict exactly how it is going to behave in all situations.

The mathematics involved in understanding adaptive
algorithms requires a lot of understanding of differential
equations.

Thursday, February 08, 2018

Content of MySQL Cluster 7.5 inside and out

Here is a link to the Book content in the new book MySQL Cluster 7.5 inside and out.

MySQL Cluster 7.5 inside and out

A new book on MySQL Cluster is out. It is called MySQL Cluster 7.5 inside and out.
It is on its way out to all internet sites. Currently it is accessible on adlibris.se, on
BoD's bookshop and now also on cdon.com. These sites are all in Swedish, but with Google
Translate that should be possible to overcome. It will hit most other international book
stores within a week or less.

It is currently available for orders as a printed book and it will become available as an
eBook in 1-2 weeks. The printed book is in bound format since I wanted to make it
possible to read it frequently; it is 640 pages long.

I started development on NDB as my Ph.D. research. The first years I collected requirements
and tried to understand how database internals work. In 1996 the development started.
I wrote my Ph.D. thesis in 1998, which stated most of the ideas used in the early versions of
the NDB implementation.

The idea of writing a book about MySQL Cluster has come up for me every now
and then for more than 10 years. However, all the time I felt it was more important
to focus on one more feature to develop.

In 2015 I decided that it was more important to write down a description of the features in
NDB. So in 2016 I started writing this book. As usual with book projects they take a lot longer
than expected.

At about the same time Jesper Wisborg Krogh and Mikiya Okuno also started writing a
book about MySQL Cluster. It is called Pro MySQL NDB Cluster.

So the good news is that we now have two very good books about MySQL Cluster.
Jesper's and Mikiya's book is written from the perspective of the DBA that has
decided to use NDB.

My book explains why NDB was developed and describes a great number of applications
where it fits in. It also compares NDB to other clustering solutions for MySQL.

I wanted to make sure that the first step to install and get started with MySQL Cluster
isn't a showstopper, so I described in some detail how to install it and get up and running
on various platforms. This includes a chapter on MySQL Cluster and Docker. Later
there is also a chapter on using NDB in the cloud.

Next it goes through NDB from an SQL point of view and describes all the things that
are good to understand when working with MySQL Cluster. It goes through the direct
NDB APIs (C++ API, Java API and a Node.js API). It goes through how to import
and export data to/from NDB.

It explains the various ways you can replicate between clusters using MySQL Cluster.
It also explains why those solutions exist and what problems they are trying to solve.

We have developed quite a few ndbinfo tables that can be used to gain an understanding
of the cluster in operation. These tables and their purposes are explained.

Next I dive into some internals, describing the software architecture, the message flows
and the restarts in NDB. I provide some advice on how to optimise restart times.

Next I dive deep into the configuration of MySQL Cluster, both the cluster configuration
and the MySQL servers. I provide detailed descriptions of how to configure for optimal
performance. I also provide details on the memory impact of many configuration parameters.
The configuration chapters include detailed descriptions of how to setup an optimal
execution environment for NDB, this includes details on how to set up the Linux
infrastructure for optimal performance.

Finally I go through the testing frameworks that we make use of. I go through in detail
the benchmark framework called dbt2-0.37.50 that I have developed for more than 10 years
and that can be used to benchmark with sysbench, DBT2 and flexAsynch.

Lastly, the history of MySQL Cluster is provided.


Wednesday, January 31, 2018

Partial LCP in MySQL Cluster 7.6.4

Today MySQL Cluster 7.6.4 DMR is out. This new version contains some very interesting
new developments in the area of checkpointing.

When I developed the original NDB algorithms in the late 90s the computer I had access to
had 2 CPUs and 1 GByte of memory. At the time we were aiming at 10,000 updates per second.

So with average row sizes of 100 bytes this meant that we changed 1 MByte per second.
The local checkpoint algorithm was designed to be executed once per about 5 minutes.

This meant that most of the database would be changed at high loads, so the
checkpointing algorithm simply writes the entire database. This means that a lot of updates
are merged together in the checkpoint.

This algorithm has been used now in NDB for 20 years and it still works fine.

Now HW is developing in a number of ways.

1) Memory is getting larger and larger. Today it is not uncommon to find machines
with TBytes of memory.

2) The ratio between available disk space and memory size is decreasing.

In the past it was not uncommon to have 100 times as much disk space as memory space.
With SSDs this factor has decreased significantly. Particularly in servers where
NDB resides this factor has decreased to around 2-3 in many cases.

In addition the development of persistent memory is ongoing; this is likely to cause
memory sizes to grow by another 4x or so. This means that even tens of TBytes
in a machine is likely to become common.

When we started the development of the new recovery algorithm in NDB 2 years ago, the
requirement was thus to implement a new recovery algorithm that can handle
main memory sizes of at least 16 TBytes with disk sizes that are
about 2x the memory size.

These requirements lead to the following conclusions:
1) We need to implement some form of incremental checkpoint algorithm.
2) We can only maintain one copy of the data on disk.
3) We must have the ability to use REDO logs that are much smaller than the memory size.

Currently in NDB a checkpoint is not completed until all data have been written. This
means that we must have disk space to handle up to at least 2x the memory size for
checkpoints.

During massive inserts (e.g. importing data), it is necessary to have a very large REDO log
to ensure that we don't run out of REDO log during import of a database.

These requirements are ok if there is sufficient disk space, but we wanted to make sure
that we don't rely on large disk spaces in the future.

In addition we wanted to reengineer the LCP algorithm to take decisions locally and not
rely on all nodes participating in each decision about LCPs. This means that we can now
perform checkpoints locally in a node during restart without affecting LCPs in other
nodes. This is particularly interesting for initial node restarts where a new node will
take a very long time to execute an LCP whereas the live nodes can perform LCPs in
a very short time.

There were two main possible implementations for incremental checkpoints.
1) Use a standard page cache implementation also for main memory.
This would mean that we would store two pages for each page in main memory
and write each page such that we always keep the old page until the LCP
is completed.

2) A partial LCP where a part of the rows are fully checkpointed and for the rest
only the changed rows are checkpointed.

I did analyse the two algorithms and concluded that the standard page cache
algorithm writes far too much when rows are small.

When the memory size is TBytes in size, the likelihood of one page receiving more
than one write during a checkpoint is small, thus each row change will lead to one
page written in the LCP.

With a row size of 100 bytes and a page size of 32 kBytes this would lead to
a waste of more than 300x of the disk bandwidth.

In addition it would still require 2x the disk space.

So the choice was made to go with the partial LCP variant. Interestingly, the
problem shown by this analysis of standard page cache algorithms applies to all
disk-based DBMSs. The growth to larger page caches will mean that more
and more disk bandwidth is spent on writing checkpoints.
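
The write amplification estimate above is easy to verify with a couple of
lines (illustrative arithmetic only):

# Write amplification when a page-based checkpoint writes a whole page
# for what is typically a single changed row (the example above).

row_size  = 100                      # bytes per changed row
page_size = 32 * 1024                # 32 kByte pages

page_based_amplification  = page_size / row_size
partial_lcp_amplification = 3        # ~3x the changed rows at RecoveryWork=50

print(f"page based : {page_based_amplification:.0f}x the changed row data")
print(f"partial LCP: {partial_lcp_amplification}x the changed row data")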

So here is a short description of how the partial LCP algorithm works.

1) For each table partition we keep one or two LCP control files. This file
is normally 4 kBytes in size (can be 8 kBytes in some cases).
This file is used at recovery to know which checkpoint files to use, and it is
also used at the next LCP to know which checkpoint files to write.

2) We keep track of the number of rows in a table partition and we keep
track of the number of row changes since the last LCP was started on the
table partition. These two numbers are used to decide on how many parts
of the table partition to fully checkpoint.

If the number of changes is 0, we only write a new control file.
If there are changes we will write at least 1 part and at most a full
local checkpoint that writes all 2048 parts.

3) The table partition is divided into parts based on the row id. We use the
page part of the row id to decide which part a row belongs to.

4) The number of parts to write is decided using some mathematical formulas.
As it turns out there is an interesting relation here.
If we write fewer parts fully, the work at recovery increases and the
total size of all checkpoints increases, but at the same time the amount of
writes to disk decreases.
With more parts written per checkpoint we increase the amount of
writes to disk per checkpoint, but we decrease the checkpoint size
and the amount of work at recovery.

We decided to make the choice here configurable in a new configuration
parameter called RecoveryWork. This can be set between 25 and 100 and
defaults to 50.

At 50 the checkpoint size will be around 1.6 times the data size. The
amount of checkpoint writes to do will be around 3 times the size of
the changed rows.

Setting it to 25 means that the checkpoint size will be around 1.35 times
the data size. The checkpoint writes will be around 4 times the size of
the changed rows.

Setting it to 100 means that the checkpoint size will be around 2.1 times
the data size and the checkpoint writes will be around 2 times the size
of the changed rows.

Thus there is an exponential dependency on the amount of checkpoint
writes required to achieve the minimum restart time.

We have selected a balanced approach as the default setting.

It is also possible to set EnablePartialLcp to 0. In this case we always
write full checkpoints if any row changed. This means that the checkpoint
size will be equal to the data size. In this case it isn't possible to use a
small REDO log, since the checkpoint write speed limits how quickly an
LCP can complete. In this setup the REDO log should be 2x the data size
to ensure that we can survive even a large import of a database.

The above calculations assume that the amount of writes per LCP is very
small compared to the data size. The code contains large comments in
Backup.cpp that explain this in even more detail.

There is an additional 12.5% in the checkpoint size due to the fact that
we only delete whole files and a full checkpoint writes 8 files. In the worst
case we might have to keep a file where only 1 part is still relevant and the
rest is not needed anymore, which means that 1/8 of the space could be wasted.
Normally this is not the case, but we want to ensure that we can keep the disk
space within limits all the time, even in the worst case.
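
Pulling the factors above together, here is a small illustration (using the
approximate numbers stated in this post) of the disk size and write volume
at the different RecoveryWork settings:

# Checkpoint size and write volume factors for the RecoveryWork settings,
# using the approximate factors stated above. 'worst_case' adds the extra
# 12.5% that can be wasted because only whole files are deleted.

factors = {
    # RecoveryWork: (checkpoint size / data size, writes / changed rows)
    25:  (1.35, 4),
    50:  (1.60, 3),
    100: (2.10, 2),
}

data_gb    = 1000   # example: 1 TByte of DataMemory
changed_gb = 10     # example: 10 GBytes changed since the last LCP

for rw, (size_factor, write_factor) in factors.items():
    size       = size_factor * data_gb
    worst_case = size * 1.125
    writes     = write_factor * changed_gb
    print(f"RecoveryWork={rw}: ~{size:.0f} GB on disk "
          f"(worst case ~{worst_case:.0f} GB), ~{writes} GB written per LCP")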




MySQL Cluster 7.6.4 is out

MySQL Cluster 7.6.4 DMR is out.

This new version contains a number of goodies.

1) Local checkpoint algorithm has been rewritten
The new checkpointing is designed to scale to at least 16 TBytes of DataMemory sizes
Checkpoints will be much faster, this decreases recovery times significantly
Table fragments that are not updated will not need any new checkpoints written
Checkpoint size on disk is significantly decreased

2) MySQL Cluster Configurator (MCC, Auto Installer)
MCC is significantly improved, particularly for setups where you
have external computers either on-premise or in the cloud.

3) New cloud feature
In the cloud with availability domains/zones there can be a
10x difference between the latency inside an AD and the latency between
ADs. To run MySQL Cluster in a cloud region with synchronous
replication between ADs one can now configure nodes with a
LocationDomainId. This LocationDomainId will be used to
ensure that the transaction coordinator is placed in the same AD and
that we always prefer reading data from our own AD if possible.

4) New ODirectSyncFlag
When using ODirect there are a number of file systems that ensure that writes
are also synced to disk. If the user knows that he is working in such an
environment, setting ODirectSyncFlag can improve disk write speeds by
around 2-3x. This is particularly interesting when using hard drives.

5) Change default behaviour of restart configuration
We changed BuildIndexThreads from 0 to 128 to improve the speed of
index rebuilds. We added a new configuration setting to specify which
CPUs can be used for index rebuilds.

We increased batch sizes (and made them configurable) to improve
performance of unique index creation and online add node and some
other algorithms.

We changed the default algorithm for initial node restart to rebuild indexes
in a special phase.

All these changes can lead to a very significant reduction in restore times.

6) Many improvements to our parallel query implementation (pushdown join,
SPJ). The improvement depends on the queries, but in our special benchmark
query we have almost halved the query execution latency.

7) Parallel UNDO log applier for disk columns
The phase where we apply the UNDO log is now fully parallelised over all
LDM threads. For a scenario with 4 LDM threads we've seen a speed up of
5x for the UNDO log applier phase (this is only used to UNDO changes in
pages containing disk columns during a restart).

8) Bug fixes
We have continued our quality improvements to ensure that each new version
is even more stable compared to the previous one.