Friday, August 24, 2018

Manual for benchmark toolset dbt2-0.37.50.15

Manual for dbt2-0.37.50.15

My career has been focused on two important aspects of DBMSs. The
first is recovery algorithms that enable the DBMS to never be down.
The second is efficient execution of OLTP in the DBMS.

When I started a short career as a consultant in 2006 I noted that I had
to spend more and more time setting up and tearing down NDB Clusters
to perform benchmarks.

The particular benchmark I started developing was DBT2. I downloaded
the dbt2-0.37 version. I quickly noted that it was very hard to run a
benchmark in this version in an automated manner.

My long-term goal was to achieve a scenario where an entire benchmark
could be executed with only one command. This goal took several years
and many versions of this new dbt2-0.37 tree to achieve.

As I automated more and more, the scripts I developed became layered.
At the bottom are base scripts that start and stop individual nodes.
On top of those is a script that can start and stop a cluster of nodes.
In addition there is a set of scripts that execute the benchmarks.
On top of all of those sits a script that executes the entire thing,
as sketched below.
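To give an idea of the layering, an invocation could look roughly like
this (the script names and options here are purely hypothetical, just to
illustrate the structure; they are not the actual names in the tree):

# bottom layer: start/stop one node
start_node.sh --node=ndbd1
stop_node.sh --node=ndbd1

# middle layer: start/stop the whole cluster
start_cluster.sh --config=cluster.conf

# benchmark layer: run one benchmark against a running cluster
run_dbt2.sh --warehouses=600 --duration=120

# top layer: the single command that sets up, runs and tears down everything
run_benchmark.sh --benchmark=dbt2 --config=cluster.conf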

The dbt2-0.37.50 tree also fixed a bug in the benchmark execution and
ensured that the statistics handling in the benchmark keeps up even at
hundreds of thousands of transactions per second.

Later I also added support for executing Sysbench and flexAsynch. Sysbench
support was added by forking sysbench-0.4.12, fixing some scalability
issues in the benchmark and adding support for continuous reporting of
performance (defaulting to one report every 3 seconds).

Today I view the dbt2-0.37.50 tree and sysbench-0.4.12 as my toolbox.
It would be quite time consuming to analyse any new feature from a
performance point of view without this toolset. This means that I think
it is worthwhile to continue developing it for my own purposes.

About 10 years ago we decided to make this toolset publicly available
at the MySQL website. One reason for this was to ensure that anyone
who wants to replicate the benchmarks that I report on my blog is able
to do so.

Currently my focus is on developing MySQL Cluster, and thus the development
of this toolset is centered around these three benchmarks with NDB. But the
benchmark scripts still support running Sysbench and DBT2 for InnoDB as well.
For a few years I also worked on InnoDB performance improvements (and MySQL
server improvements) and this toolset was equally important at that time.

When I wrote my book I decided to write a chapter to document this
toolset. This chapter is really helpful to me as well, since it is easy
to forget some detail of how to use the toolset.

Above is a link to the PDF file of this chapter if anyone wants to
try out and use these benchmark toolsets.

My last set of benchmark blogs on benchmarks of MySQL Cluster
in the Oracle Cloud used these benchmark scripts.

The toolset is not intended for production setups of NDB, but I am
sure it can be used for this with some adaptation of the scripts.

For development setups of MySQL Cluster we are developing the
MySQL Cluster Configurator (MCC), sometimes called the
Auto Installer.

For production setups of NDB we are developing MySQL Cluster
Manager (MCM).

Happy benchmarking :)

Monday, August 20, 2018

Use cases for MySQL Cluster

Third chapter of "MySQL Cluster 7.5 inside and out".

NDB was designed for a number of networking applications. In particular the original
design focused a lot on telecom applications. Telecom applications have extreme
requirements on availability, as do many other networking applications.

This chapter goes through a number of application types where NDB has been used
and also presents a number of public examples of use cases.

These application types include DNS servers, DHCP servers, Intelligent Network (IN)
applications, 3G/4G/5G applications, RADIUS servers, number portability, DIAMETER
server, Video-on-demand applications, Payment Gateways, Software Defined
Networking (SDN), Network Function Virtualization (NFV), Voice over IP.

In addition there are many internet applications such as gaming servers, financial
applications such as stock quote servers, whitelist/blacklist handling, eCommerce
applications, payment services, web applications, fraud detection, online banking
and session databases.

In addition a number of frameworks have used NDB and presented it publicly, such
as a GE presentation, HopsFS (a Hadoop distribution), HopsYARN, Oracle OpenStack
and OpenLDAP.

With the possibility to handle more data in MySQL Cluster 7.6 it is likely that this
list of application types will grow even more.

Friday, August 17, 2018

Rationale for MySQL Cluster

Second chapter of "MySQL Cluster 7.5 inside and out".

When I started developing the ideas for NDB Cluster in my Ph.D. studies
about 25 years ago, the first thing I did was to perform a very thorough
study of the requirements.

At that time I participated in a European research project for UMTS. UMTS was
later marketed as 3G. My part of this big study group of more than 100
researchers was to simulate the network behaviour. We used the
protocols developed by other groups to see what the load would be on
the various nodes in the telecom network.

I was mostly interested in the load it contributed to the network databases.
Through these studies, and also by studying the AXE system developed
at Ericsson, I got a very good picture of the requirements on a DBMS
to be used for future telecom services.

In parallel with this I also studied a number of other areas such as
multimedia email servers, news-on-demand servers and genealogy
servers, which were expected to be other popular services in the telecom network.

In the above chapter I go through how those requirements were turned
into requirements on predictable response times, availability, throughput
and so forth.

In particular it covers how the requirements led to a model that divides the Data
Server functionality (NDB data nodes) from the Query Server
functionality (MySQL Server nodes).

It also covers how predictable response times are achieved by building on ideas
from the AXE architecture developed at Ericsson.

Thursday, August 16, 2018

What is special with MySQL Cluster

The first chapter from the book "MySQL Cluster 7.5 inside and out".
This chapter presents a number of key features that make NDB
unique.

Thursday, August 09, 2018

Optimising scan filter for checkpoints in NDB

While loading massive amounts of data into NDB to test the new
adaptive checkpoint speed, I noted that checkpoints slowed down as the
database size grew.

I could note in debug logs that the amount of checkpoint writes was
dropping significantly at times. After some investigation I discovered
the root cause.

The checkpoint algorithm in NDB requires all changed rows to be written
to the checkpoint, even in parts that are not fully checkpointed.
This means that each row has to be scanned to discover whether it has
been changed and thus needs to be written.

When loading 600 GByte of DBT2 data we have more than two billion rows
in the database. Scanning two billion rows takes around 15-20 seconds
when simultaneously handling lots of inserts.

This slowed down checkpoints and in addition used a lot of CPU.
Thus we wanted a more efficient scanning algorithm for this case.

The solution is based on dividing the database into larger segments.
When updating a row, one has to ensure that a flag on the larger
segment is also updated. A simple first approach is to implement
this on the page level for our fixed size pages. Every row has an entry
in the fixed size part of the page. This part contains the row header and all fixed
size columns that are not defined as using DYNAMIC storage.

In DBT2 this means that most fixed size pages have around 300 row
entries. Thus we can check one page and if no row has been changed
we can skip checking 300 row entries.

When data size grows to TBytes and we checkpoint every 10-20 seconds,
the probability of a row in a given page having been updated is actually fairly low.
Thus this simple optimisation brings the slowdown of the
checkpoints down to a small fraction of a second.

Obviously it is possible to use smaller regions and also larger regions
to control this if required.

This is an important improvement of the checkpointing in
MySQL Cluster 7.6.7.

Scheduling challenges of checkpoints in NDB

The NDB data nodes are implemented using asynchronous programming. The model is
quite simple. One can send asynchronous messages on two priority levels: the
A-level is for high priority messages that are mainly used for various management
actions, while the B-level represents the normal priority level where all normal
messages handling transactions are executed.

It is also possible to send delayed signals that will wait for a certain
number of milliseconds before being delivered to the receiver.

When developing MySQL Cluster 7.4 we noted a problem with local checkpoints. If
the transaction load was extremely high, the checkpoints almost stopped. If such a
situation persists for too long, we will run out of REDO log.

To handle this we introduced a special version of delayed signals. This new
signal will be scheduled such that at most around 75 messages are executed
before this message is delivered. There can be thousands of messages waiting
in queue, so this gives a higher priority to this signal type.

This feature was used to get control of checkpoint execution and introduced in
MySQL Cluster 7.4.7. With this feature each LDM thread will at least be able
to deliver 10 MBytes of checkpoint writes per second.

With the introduction of adaptive checkpoint speed this wasn't enough. In a
situation where we load data into NDB Cluster we might need to write much
more data to the checkpoints.

To solve this we keep track of how much data we need to write per second to
ensure that we don't run out of REDO log.

If the REDO log comes to a critical point where the risk of running out of
REDO log is high, we will raise priority of checkpointing even higher such
that we can ensure that we don't run out of REDO log.

This means that during a critical situation, normal transaction throughput
will decrease since we will put a lot of effort into ensuring that we don't
get into a situation of a complete stop due to running out of REDO log.

We solve this by executing checkpoint scans without real-time breaks for a
number of rows, and if we need to continue writing checkpoints we send a
message on A-level to ourselves to continue without giving transactions a
chance to come in. When we have written enough, we give the transactions a
chance again by sending the new special delayed signal.

The challenge that we get here is that checkpoints must be prioritised over
normal transactions in many situations. At the same time we want the
prioritisation to be smooth to avoid start and stop situations that can
easily cause ripple effects in a large cluster.

This improved scheduling of checkpoints was one part of the solution to
the adaptive checkpoint speed that is introduced in MySQL Cluster 7.6.7.

Wednesday, August 08, 2018

Loading data into TByte sized NDB data nodes

One of the main design goals with MySQL Cluster 7.6 was to
support much larger data sets in each data node. The
checkpoint algorithm makes it hard to manage data nodes in
7.5 and earlier versions with many hundreds of GBytes of
data. Using MySQL Cluster 7.6.7 the algorithms scale to
multiple TBytes per data node.

However the largest machines currently at my disposal have
1 TByte of RAM. Thus I went about testing loading data into
a cluster on two such machines. There is no comparison to
older versions; it is possible to load this data into earlier
versions, but it is not very practical with such large
data nodes in earlier versions.

7.6.7 comes with both partial checkpoints and adaptive
control of disk write speed. This means that we can load
terabytes of data into the cluster even with moderately
sized REDO logs.

To experiment with these I used the same test as in the
previous blog.

Here is the setup.

2 data nodes, one per machine where each machine is equipped
with 1 TByte of RAM, 60 CPU cores distributed on 4 CPU sockets.

The NDB configuration used 750 GByte of DataMemory and 8 LDM threads,
and the data node only used CPUs from one CPU socket. It was essential
to set Numa=1 in the configuration to be able to use memory from
all four CPU sockets.

We used the default setup for MinDiskWriteSpeed and MaxDiskWriteSpeed*
configuration parameters. The REDO log size was 2 GByte per LDM thread.
We thus set the following REDO log parameters:
NoOfFragmentLogParts=8 (equal to number of LDM threads)
NoOfFragmentLogFiles=8
FragmentLogFileSize=256M

To enable the new algorithm for adaptive disk write speed it is
necessary to set EnableRedoControl=1 in the configuration.
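Putting the settings discussed above together, the relevant part of config.ini
for this setup could look roughly as follows (only the parameters discussed here
are shown; the thread configuration that gives 8 LDM threads is left out):

[ndbd default]
DataMemory=750G
Numa=1
NoOfFragmentLogParts=8
NoOfFragmentLogFiles=8
FragmentLogFileSize=256M
EnableRedoControl=1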

Interestingly, when loading the data we were limited by the read
bandwidth of the disk where the CSV files of DBT2 were stored. We therefore
also created one CSV file per table using SELECT INTO OUTFILE
and used ndb_import to load the data into NDB. Using ndb_import we were able
to load data faster than using LOAD DATA INFILE since we were then able
to use files stored on RAIDed SSD drives.
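As an illustration (table name, file path and delimiter are only examples,
and the exact ndb_import options depend on the CSV format used), the per-table
CSV files were produced and loaded with commands roughly of this form:

SELECT * FROM warehouse
  INTO OUTFILE '/data/csv/warehouse.csv'
  FIELDS TERMINATED BY ',';

ndb_import dbt2 /data/csv/warehouse.csv --ndb-connectstring=mgm_host \
  --fields-terminated-by=','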

One challenge with loading data through DBT2 is that we are using the warehouse
id as the partition key. This means that all rows from one CSV file go
into one LDM thread, so the load among the LDM threads isn't
balanced. This complicates the checkpointing scheme a bit: we ensure that
any REDO log part that becomes critical also makes sure that all other LDM
threads in the cluster know that we are in a critical state.

Loading through ndb_import is thus easier since the load is balanced over
the LDM threads.

With 8 LDM threads we load 1-2 warehouses per second and one warehouse is
around 100 MByte of data spread into about 370k rows.

In this test we loaded 6000 DBT2 warehouses, thus around 600 GByte in
database size.

The restart proceeds very similarly to the recovery after loading 600 warehouses.
Touching the memory now takes a bit longer: it takes 4 minutes and 44 seconds
plus another 28 seconds when starting the first recovery phases.

The restore phase took 27 minutes and 40 seconds, the REDO phase was still less
than a second and finally rebuilding the ordered indexes took 12 minutes and
13 seconds.

The copy fragment phase increased to 17 seconds since we needed to scan more data
this time. The phase waiting for checkpoints to complete was still 5 seconds
and the phase that waited for replication subscriptions to be configured also
took 5 seconds. The latter is usually around 5-6 seconds unless a MySQL server is
down, in which case it can take up to 2 minutes before this phase is completed.

Total restart time thus became 45 minutes and 32 seconds.

As we can see, we can recover a TByte-sized data node within an hour, and this is
with 8 LDM threads. If we instead use 24 LDM threads, the restore and rebuild
index phases will go about 3 times faster. This would cut restart time by
another 25 minutes, bringing the restart time to about 20 minutes in this case,
and we would even be able to restart more than 2 TBytes within an hour.

Analysis of restart improvements in MySQL Cluster 7.6.7

To test restart times I am using the DBT2 test suite that
I have been developing based on DBT2 0.37 since 2006.

The following test setup is used:
DataMemory: 100 GByte (90 Gbyte in 7.5 and earlier)
IndexMemory: 10 GByte (in 7.5 and earlier versions)
NoOfFragmentLogParts=8
NoOfFragmentLogFiles=50
FragmentLogFileSize=256M
MinDiskWriteSpeed=20M (two times the default)
MaxDiskWriteSpeed=40M (two times the default)

I load the database using LOAD DATA INFILE with
precreated CSV files, each containing the data for one
warehouse and table. This means that there are 8 CSV
files per warehouse. I load 600 warehouses into NDB.
This means a database size of around 60 GByte with
around 280 million rows.
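For illustration (file path, table name and delimiter are only examples),
each CSV file is loaded with a statement of this form:

LOAD DATA INFILE '/data/csv/warehouse_1.csv'
  INTO TABLE warehouse
  FIELDS TERMINATED BY ',';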

Next I run the DBT2 from 2 MySQL servers using 16 threads
each for two minutes. This creates a load of almost
100k TPM.

The next step is to restart one of the 2 data nodes and
measure the time it takes to restart the data node.

Obviously an even more realistic benchmark would be to
restart while running the benchmark, but the effect on
the restart analysis would not be substantial.

I tested using the latest GA versions from 7.4 and 7.5
and also both 7.6 GA versions (7.6.6 and 7.6.7).

First the results:
7.4.21: 31 minutes 29 seconds
7.5.11: 44 minutes 9 seconds
7.6.6:  18 minutes 2 seconds
7.6.7:  4 minutes 45 seconds

In order to better understand these numbers we will look
at the most important restart phases and analyse the numbers
for each of them.

It takes about 3 seconds to stop a node and start it again.
This time is constant in all versions.

The next step is allocating memory and touching the memory.
Allocating memory doesn't actually commit the memory to RAM.
It only ensures that there is space in RAM or in the swap file
allocated for the memory. So in order to commit the memory
to RAM, it is necessary to touch the memory (read or write
it). The speed of this touching of memory is fairly
constant and depends on the Linux version (slight speedup in
newer Linux versions). My measurements show that this
touching of memory handles about 2.5-3.5 GByte of memory
per second. Thus this part of the restart time depends on the
DataMemory size and other memory consuming parts of the
NDB data node.

NDB data nodes always allocate and commit all the memory
as part of the restart. It is even possible to lock the
memory to RAM through setting LockPagesInMemory to 1 in
the configuration.

This step takes 26 seconds for all versions.

A major step in the recovery is to recreate the database that
was in the data node at the time of the node stop. This is
performed in 3 phases.

1) Restore data from a checkpoint
2) Execute REDO log
3) Rebuild ordered indexes

The times of all these phases depend on the version.
In 7.3 and earlier versions there was also a lot of time
spent waiting for the metadata lock when copying the
metadata to the starting node. This meant waiting for the
current checkpoint to complete (a checkpoint in 7.3 with
these settings and this database size takes about 20 minutes).

Thus 7.3 would add approximately 10 minutes to the restart times.

After restoring the local database from disk the next major
phase is the synchronisation phase. This phase takes longer
if there have been updates during the restart. The time spent
in this phase is not expected to have changed in any material
fashion in 7.6.

The final phase is to wait for one checkpoint to complete to
ensure that the node is recoverable even if the other node
should fail completely.

The restore phase in 7.4.21 and 7.5.11 only takes about 5-10 seconds.
The reason is that the last completed checkpoint happened early
in the load phase. Thus almost the entire database has to be
recovered from the REDO log.

The combined time of the restore phase plus the REDO phase is
3 minutes and 48 seconds in 7.4.21 and 3 minutes and 30 seconds
in 7.5.11.

In 7.6.6 the restore phase takes considerably longer and thus
the REDO phase is shortened. 7.6.6 makes partial checkpoints and
can thus write checkpoints a bit faster. But the disk write speed
is too slow to keep up with the insert rate. Actually setting
the MaxDiskWriteSpeed to 100M in 7.6.6 speeds up restarts by a
factor of 3. The time for the restore phase in 7.6.6 is
1 minute and 12 seconds and the REDO phase is 2 minutes and
20 seconds. Thus the total time of these two phases is
3 minutes and 32 seconds.

So what we can conclude here is that 7.6.6 requires a higher
setting of the disk write speed to materially improve the
restart times in these two phases for restarts during massive
inserts.

Now the restore phase in 7.6.7 recovers almost all the data
since checkpoints are executed at 15-20 second intervals.
The restore phase consumes 2 minutes and 48 seconds and the
REDO phase takes less than one second. The speedup of these
two phases comes from the fact that the restore phase is faster
per row than executing the REDO log, although more data has to
be restored in 7.6.

Next we analyse the phase that rebuilds the ordered indexes.
The change here actually comes from configuration changes and
the ability to lock the index build threads to more CPUs.

In 7.6 we changed the default of BuildIndexThreads to 128.
This means that the index rebuild of each fragment of a table
can be executed in parallel. The default
in 7.5 and earlier meant that all index rebuilds happened
in the LDM threads. Thus in this case the 7.5 and 7.4 versions
could use 8 CPUs to rebuild indexes while 7.6 could use
16 CPUs to rebuild indexes. The parallelisation of
index rebuilds can be configured in 7.5 as well, but 7.6 ensures
that we lock to more CPUs compared to 7.5.

This change meant that the times of 7.4.21 (2 minutes
19 seconds) and 7.5.11 (2 minutes 12 seconds) were significantly
improved. In 7.6.6 the time was 1 minute 20 seconds and in
7.6.7 it was 1 minute and 19 seconds. Thus this phase is
significantly improved in 7.6.

The synchronisation phase takes 1-2 seconds in all versions.
Since no changes happened during the restart, this time is spent
scanning the 280 million rows to see whether any changes have
occurred during the restart (this happens in the live node).

Now we come to the phase where the big change in 7.6.6 happened
and where it happens even more in 7.6.7. This is the phase where
we wait for a checkpoint that we participated in to complete.

Actually this means first waiting for the ongoing checkpoint to
complete and then participating in another checkpoint.

The time to execute a checkpoint is fairly constant in 7.5 and
earlier versions. In this particular setup it takes about 22
minutes. This means that this wait phase can take anywhere
between 22 minutes and 44 minutes depending on the timing of
the restart.

This is the reason why 7.4.21 restarted so much faster than
7.5.11. It was pure luck in timing the checkpoints.

In 7.6.6 the time to execute checkpoints is much lower compared
to 7.5. Thus this phase is much shorter here. However the time
for a checkpoint varies depending on how many changes have happened
since the last checkpoint. In this particular case the phase took
12 minutes and 26 seconds.

With 7.6.7 we adapt the checkpoint speed to ensure that we can
survive even with very small REDO logs. The second reason
for this is to make checkpoints much faster. During execution of
this benchmark no checkpoint took more than 25 seconds, most
of them took about 15-20 seconds, and while the DBT2 benchmark
executed they took about 10-15 seconds. In idle mode a checkpoint
is executed within a few seconds.

This means that waiting for a checkpoint to complete and then executing
a new one is very fast. In this benchmark it took only 5 seconds.

Thus in this particular case, restarts of 7.6.7 were almost 10 times
faster compared to 7.4 and 7.5 and even 4 times faster than
restarts of 7.6.6.

Thus most of the restart times are now linearly dependent on the
database size and the size of the ordered indexes.

There are more phases in NDB restarts that can consume time, for
instance with disk data we have to UNDO disk data changes. This
phase was improved by a factor of 5 (with 4 LDM threads) in 7.6. There
are also various steps where the restart can occasionally be blocked
due to metadata locks and other reasons.

More automated control in MySQL Cluster 7.6.7

Apart from bug fixes the 7.6.7 version of MySQL Cluster also brings
a major improvement of restart times through adaptively controlling
checkpoint speed.

Many DBMSs work hard on automating management of the database nodes.
In NDB automated management was a design point from the very first
version. This means that nodes crash and restart without operator
assistance.

For the last years we have also worked on developing algorithms that
require less configuration. This will greatly simplify the configuration
of NDB Cluster.

In 7.6.7 we have made it much easier to configure handling of checkpoints
(LCPs) and REDO logging.

In earlier versions of NDB the checkpoint speed has been controlled by
two things. The first is based on the following configuration variables:

MinDiskWriteSpeed: This is the minimum disk write speed we will attempt
to write during a checkpoint, even in the presence of CPU overload and
disk overload. It defaults to 10 MByte per second; this is the sum
over all LDM threads, thus for the entire data node.

MaxDiskWriteSpeed: This is the maximum disk write speed we will attempt
to write during a checkpoint. If no CPU overload or disk overload is
seen, this is the checkpoint speed that will be used in normal operation.
It defaults to 20 MByte per second.

MaxDiskWriteSpeedOtherNodeRestart: This is the maximum disk write speed
we will write during a checkpoint when another node is restarting. It
defaults to 50 MByte per second.

MaxDiskWriteSpeedOwnNodeRestart: This is the maximum disk write speed
we will write during a checkpoint when our node is restarting. It defaults to
200 MByte per second.
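Written out as configuration settings, the defaults described above correspond to:

MinDiskWriteSpeed=10M
MaxDiskWriteSpeed=20M
MaxDiskWriteSpeedOtherNodeRestart=50M
MaxDiskWriteSpeedOwnNodeRestart=200M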

The actual disk write speed achieved is derived from those configuration variables
in combination with an adaptive algorithm that decreases the checkpoint
speed when the CPU or the disk is overloaded.

These parameters also exist in 7.6.7, but there is very little reason to
change them from their default values if the new configuration variable
EnableRedoControl is set to 1. By default this variable is set to 0 to
avoid changes of behaviour in a GA released version of MySQL Cluster.

In earlier versions of NDB it was necessary to have very large REDO logs.
The reason is that earlier versions (7.5 and earlier) wrote the entire
database to disk in each checkpoint. This meant that checkpoints during
massive inserts got larger and larger, and to ensure successful insertion
of the entire data set it was necessary to have REDO logs that were about
twice the size of the DataMemory.

Now in 7.6.7 it should be quite enough to have 2-4 GByte of REDO log per
REDO log part (normally equal to the number of LDM threads). This REDO
log size works perfectly even when loading TBytes of data into NDB.
Remember that EnableRedoControl needs to be set to 1 for this to work.

Thus in MySQL Cluster 7.6.7 one can simplify the configuration of REDO logs
and checkpointing.

In earlier versions we needed to set the following variables:
NoOfFragmentLogParts (always set equal to number of LDM threads)
NoOfFragmentLogFiles
FragmentLogFileSize
MinDiskWriteSpeed
MaxDiskWriteSpeed
MaxDiskWriteSpeedOtherNodeRestart
MaxDiskWriteSpeedOwnNodeRestart

The product of NoOfFragmentLogParts, NoOfFragmentLogFiles and
FragmentLogFileSize is the size of the REDO log. In earlier versions
this product should be roughly two times the setting of DataMemory.

The default setting of FragmentLogFileSize is 16 MByte. Personally I always
increase this setting to 256 MByte (set to 256M).

So e.g. with a DataMemory of 100 GByte and 8 LDM threads one can set those to
NoOfFragmentLogParts=8
NoOfFragmentLogFiles=50
FragmentLogFileSize=256M

This gives a REDO log size of 100 GByte (8 * 50 * 256 MByte); to reach the
full two times DataMemory one would instead set NoOfFragmentLogFiles=100.

The setting of disk write speed will be discussed a bit more in a coming blog.

In 7.6.7 one can instead configure as follows.

EnableRedoControl=1
NoOfFragmentLogParts=8
NoOfFragmentLogFiles=8
FragmentLogFileSize=256M

The setting of disk write speed variables need not be considered. The setting
of NoOfFragmentLogFiles to 8 and FragmentLogFileSize to 256M should work for
almost all setups of NDB. Only when dealing with data nodes larger than
one terabyte could it be considered to increase the REDO log size. The
NoOfFragmentLogParts should still be set to the number of LDM threads.

Thus in 7.6.7 a lot less thought has to go into configuration of REDO logs
and disk write speeds. Disk write speed still affects backup write speeds
as well though, so it could be a good idea to consider how fast you want to
write your backups using the variables MinDiskWriteSpeed and MaxDiskWriteSpeed.

The reason that the disk write speed for checkpoints is less important to consider
is that we calculate how fast we need to write the checkpoints based on the
write activity in NDB. This means that when setting EnableRedoControl the write
speed to the disk can be quite substantial. So this setting will not work very
well unless the disk subsystem is able to handle the load. The disk subsystem
should be able to handle around 100 MByte of disk writes per LDM thread.

With modern HW this should not be an issue, in particular not when using NVMe
drives. In our benchmarking we are using a RAID 0 setup of 6 SSD drives. With
8 LDM threads inserting at full speed we use about 50% of the disk bandwidth
in this case (500 MByte per second).