Wednesday, October 24, 2018

MySQL Cluster 7.6.8 performance jump of up to 240%

In February I added a new feature to the Sysbench version that I use in
my MySQL Cluster testing. This feature adds a new column to the
table, called filter, which contains the same value as the primary key.

With this new column I can easily change the range scan queries in
sysbench from returning 100 rows to scanning 100 rows and returning
1 row. This means that sysbench can benchmark the filtering
performance of the underlying database engine.

Next I ran tests where I set the number of rows in the range to
10,000 rows. This new test was a perfect vehicle for improving
scan filtering performance in NDB.

Filtering one row in 7.6.7 in this sysbench test costs about 750 ns.

When I started optimising away these 750 ns I didn't expect so
much improvement, but using perf it was possible to pinpoint the
wasted CPU time at a very fine-grained level. One interesting find
was a bitmask class that zeroed itself in its constructor; it turned
out that this constructor was called twice while filtering a row,
and neither call was required. Fixing this simple thing removed
about 20 ns of CPU usage, in this case a 3-4% performance improvement.
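
A minimal sketch of this kind of fix (the Bitmask class and filter functions here are hypothetical stand-ins, not the actual NDB code):

```cpp
#include <cassert>

static int g_zeroing_calls = 0;  // counts how often the constructor zeroes

// Hypothetical stand-in for the real NDB bitmask class.
struct Bitmask {
  unsigned bits[2];
  Bitmask() : bits{0, 0} { ++g_zeroing_calls; }  // zeroing in the constructor
  void set_all(unsigned v) { bits[0] = bits[1] = v; }
};

// Before the fix: the filter path constructed two Bitmask objects per
// row and immediately overwrote both, so both zeroings were wasted work.
void filter_row_before() {
  Bitmask a;
  Bitmask b;
  a.set_all(0xffu);
  b.set_all(0x0fu);
}

// After the fix: reuse caller-provided objects, so no per-row
// construction (and no redundant zeroing) happens at all.
void filter_row_after(Bitmask &a, Bitmask &b) {
  a.set_all(0xffu);
  b.set_all(0x0fu);
}
```

Per call the saving is tiny, but multiplied by millions of filtered rows per second it becomes measurable, which is exactly what perf exposed.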

As you can see, these are micro-optimisations, and for those perf is a
splendid tool.

One of the biggest causes of bad performance in modern software
applications is instruction cache misses. Most modern software
is packed with features, and handling all of them requires a lot of
code. The compiler has a hard time knowing which code is the common
path and which is the error handling path.

In the MySQL code we have two macros, likely and unlikely, that
can hint to the compiler which code path to optimise for.
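
These macros are, roughly, thin wrappers around GCC's __builtin_expect (a sketch of how such hints are typically defined and used; filter_row is a made-up example, not NDB code):

```cpp
#include <cassert>

// Typical definitions for GCC/Clang: !! normalises the expression to
// 0 or 1, and __builtin_expect tells the compiler which value to
// expect, so the expected branch becomes the straight-line path.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

// Made-up filter function: errors are rare, so the error check is
// marked unlikely and its code is kept out of the hot instruction path.
int filter_row(int value, int bound) {
  if (unlikely(value < 0))
    return -1;              // cold error path
  if (likely(value < bound))
    return 1;               // common case: row passes the filter
  return 0;                 // row filtered out
}
```

With such hints the compiler can lay out the hot path contiguously, which is one way to reduce instruction cache misses.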

In the code path I was optimising I found roughly 1 billion
instruction cache misses over a short period (20 seconds if I remember
correctly). With numerous changes I managed to decrease the number
of instruction cache misses to 100 million over the same amount of time.

I also found some simple fixes that cut away a third of the processing time.

In the end I found myself looking at the cost being brought down to around
250 ns. So comparing the performance of this scan filtering with 7.5.10, we
have optimised this particular code path by 240%.
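
As a back-of-the-envelope check on these numbers: going from 750 ns to 250 ns within 7.6 is a 200% improvement, so the 240% figure against 7.5.10 implies a baseline of roughly 850 ns per filtered row in 7.5.10 (my inference, not a number stated in the post):

```cpp
#include <cassert>
#include <cmath>

// Percentage improvement when per-row cost drops from old_ns to new_ns.
double improvement_pct(double old_ns, double new_ns) {
  return (old_ns / new_ns - 1.0) * 100.0;
}
```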

During the development of these scan filtering improvements, I discovered
that some of the optimisations could also be applied to searching in our
ordered indexes. One consequence is that the index rebuild phase of a restart
will go faster; I haven't measured the exact impact of this yet. It also means
that any application using ordered indexes will go a lot faster.

For example, the performance of a standard Sysbench OLTP RW benchmark
with one added ordered index column improves by 70% in 7.6.8
compared to earlier versions of 7.6 and 7.5.

Monday, September 10, 2018

Non-blocking Two-phase commit in NDB Cluster

Non-blocking 2PC protocol

Many of the new DBMSs developed in the last 10 years have abandoned the
two-phase commit protocol and instead relied on replication protocols.

One of the main reasons for this has been the notion that the two-phase
commit protocol is a blocking protocol. This is true for the classic version
of the protocol.

When NDB Cluster was developed in the 1990s we had requirements that
the replication protocol could not be blocking. A competitor at the time,
ClustRa, solved this by using a backup transaction coordinator. Given that
NDB Cluster had requirements to survive multiple simultaneous node failures,
this wasn't sufficient.

Thus a new two-phase commit protocol was developed that is completely
non-blocking. The main idea is to use a take-over protocol: any number of
nodes can crash and we can still handle it, as long as there are enough
nodes left to keep all data available.
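
The take-over idea can be sketched as follows (a deliberately simplified illustration under the assumption that a commit is never acknowledged before at least one participant has recorded it; this is not NDB's actual protocol code):

```cpp
#include <cassert>
#include <vector>

enum class TxnState { Prepared, Committed, Aborted };

// A surviving node takes over for a failed transaction coordinator by
// polling the remaining participants and deciding the outcome from
// their states alone, so no single failure can block the transaction.
TxnState take_over_decision(const std::vector<TxnState>& participants) {
  for (TxnState s : participants)
    if (s == TxnState::Committed)
      return TxnState::Committed;  // a commit decision was already reached
  // No participant saw a commit decision: aborting is safe here, since
  // (by assumption) no commit was acknowledged to the application.
  return TxnState::Aborted;
}
```

The key property is that the decision is recoverable from the surviving participants themselves, so there is no single node whose failure leaves the transaction in limbo.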

In addition, NDB Cluster is designed for both Disk Durable transactions
and Network Durable transactions. Disk Durable transactions require
data to be durable on disk when the transaction has committed, and
Network Durable transactions require that the transaction is on at least
2 computers when it is committed.

Due to the response time requirements of the applications NDB Cluster
was designed for, we implemented it such that when the application
receives the response, the transaction is Network Durable.

The Disk Durability is handled in a background phase where data is
consistently flushed to disk such that we can always recover a consistent
version of the data, even in the presence of a complete failure of the
cluster.
This part is handled by the Global Checkpoint protocol. The PDF above
describes the transaction protocol and the global checkpoint protocol that
together implement the Network Durability and Disk Durability of NDB.

Thursday, September 06, 2018

Basics of MySQL Cluster

Basics of MySQL Cluster

This PDF introduces the basic architecture of MySQL Cluster and how to
access it with various APIs.

Friday, August 24, 2018

Manual for benchmark toolset dbt2-

Manual for dbt2-

My career has been focused on two important aspects of DBMSs. The
first is the recovery algorithms to enable the DBMS to never be down.
The second is efficient execution of OLTP in the DBMS.

When I started a short career as a consultant in 2006 I noted that I had
to spend more and more time setting up and tearing down NDB Clusters
to perform benchmarks.

The particular benchmark I started developing was DBT2. I downloaded
the dbt2-0.37 version. I quickly noted that it was very hard to run a
benchmark in this version in an automated manner.

My long-term goal was to achieve a scenario where an entire benchmark
could be executed with a single command. This goal took several years
and many versions of this new dbt2-0.37 tree to achieve.

As I automated more and more, the scripts I developed became layered:
I have base scripts that start and stop individual nodes. On top of
these I have another script that can start and stop a cluster of nodes.
In addition I have a set of scripts that execute benchmarks. On top of
all those scripts I have one script that executes the entire thing.

The dbt2-0.37.50 tree also fixed one bug in the benchmark execution and
ensured that I could handle hundreds of thousands of transactions per second
and still handle the statistics in the benchmark.

Later I also added support for executing Sysbench and flexAsynch. Sysbench
support was added by forking sysbench-0.4.12, fixing some scalability
issues in the benchmark, and adding support for continuous reporting of
performance (by default one report every 3 seconds).

Today I view the dbt2-0.37.50 tree and sysbench-0.4.12 as my toolbox.
It would be quite time consuming to analyse any new feature from a
performance point of view without this toolset. This means that I think
it is worthwhile to continue developing it for my own purposes.

About 10 years ago we decided to make this toolset publicly available
at the MySQL website. One reason for this was to ensure that anyone
that wants to replicate the benchmarks that I report on my blog is able to
do this.

Currently my focus is on developing MySQL Cluster, and thus the
development of this toolset is centered around these 3 benchmarks
with NDB. But the benchmark scripts still support running Sysbench and
DBT2 for InnoDB as well. For a few years I also developed InnoDB
performance improvements (and MySQL server improvements), and this
toolset was equally important at that time.

When I wrote my book I decided to write a chapter documenting this
toolset. This chapter is really helpful to me, since it is easy
to forget some detail of how to use it.

Above is a link to the PDF file of this chapter if anyone wants to
try out and use these benchmark toolsets.

My last set of benchmark blogs on benchmarks of MySQL Cluster
in the Oracle Cloud used these benchmark scripts.

The toolset is not intended for production setups of NDB, but I am
sure it can be used for this with some adaptation of the scripts.

For development setups of MySQL Cluster we are developing
MySQL Cluster Configurator (MCC) sometimes called
Auto Installer.

For production setups of NDB we are developing MySQL Cluster
Manager (MCM).

Happy benchmarking :)