Wednesday, June 13, 2018

Node restart improvements in MySQL Cluster 7.6

A simple test to see how much faster restarts are in 7.6 compared to 7.5 is to
load a set of DBT2 warehouses and then perform a node restart of one of the
data nodes.

In the test we load 600 warehouses, which means about 60 GByte of data inserted
into NDB in each data node.

A node restart of 7.5.10 using this data set takes 19 minutes and 11 seconds.
In 7.6.6 this restart takes 6 minutes and 49 seconds.

The difference can vary, especially since the restart time in 7.5 has a fairly high
variance. The restart time in 7.6.6 has much smaller variance.

We will look into the different restart phases and see how those are affected.

The first phase is to allocate the memory and ensure that the memory is actually
assigned to our process by touching it. This phase allocates roughly 2.7 GByte
per second. In this case it takes 26 seconds in both 7.5 and 7.6.

After that we have some preparatory steps; these are normally very quick and in
this case take about 1-2 seconds. These steps involve preparing the metadata to
use in the restart.

The next step is the restore phase where we recover data from a local checkpoint.
Given the introduction of Partial LCP in 7.6 this phase actually takes longer,
since there is some duplication in the restore where one row is restored multiple
times. This phase can be made faster by setting RecoveryWork to a smaller value,
but this instead increases the checkpointing work during normal operation.
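
As a minimal sketch, assuming the 7.6 default of 60, this trade-off could be
configured in config.ini like this (the value is illustrative, not the one
used in the benchmark):

  [ndbd default]
  # A smaller RecoveryWork means less duplication to restore at
  # restart (a faster restore phase) at the cost of more checkpoint
  # writing during normal operation.
  RecoveryWork=40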

In this test the restore took 1 minute and 51 seconds in 7.5 and 2 minutes
and 15 seconds in 7.6.6.

The next phase is the REDO log execution phase. Since Partial LCP means that
LCPs are shorter, this phase is shortened. In this particular benchmark the
phase doesn't shorten as much as it could, since there is heavy insert activity
right before the restart. But it still decreased from 51 seconds in 7.5 to
35 seconds in 7.6.

After this we execute the UNDO log. This particular benchmark has no disk data,
so both 7.5 and 7.6 take no time in this phase. But this phase has improved by
5 times with 4 LDM threads, and this benchmark uses 8 LDM threads, so this
phase is dramatically faster in 7.6 when disk data is used.

The next phase is the rebuilding of the ordered indexes. This phase executes the
same code as in 7.5, but we have changed the default configuration to ensure that
the rebuild is fully parallelised. This means that we have 16 CPUs
working on the rebuild instead of 8 CPUs as in 7.5. This gives a rebuild time
of 1 minute and 17 seconds in 7.6.6 compared to 2 minutes and 4 seconds in 7.5.
The reason the gain isn't a full factor of 2 is that we make use of
hyperthreading to speed things up.
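
If my memory serves, the configuration default that changed here is
BuildIndexThreads; a config.ini sketch of setting it explicitly (the value
is illustrative):

  [ndbd default]
  # Allow the ordered index rebuild to use many threads in parallel.
  BuildIndexThreads=128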

The copy fragment phase is more or less empty in both 7.5 and 7.6 since we
didn't perform any updates during the restart. We don't expect any major
differences in this phase between 7.5 and 7.6.

Next we come to the biggest gain in restart times in 7.6. This is the phase where
we wait for the local checkpoint to complete. In 7.5 we have to wait between
1 and 2 checkpoint times. In this benchmark the checkpoint in 7.5 takes about
11 minutes, although we have increased the disk write speed compared to the
default. In this execution 7.5 spent 13 minutes and 48 seconds in this phase.
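
The disk write speed is governed by a set of configuration parameters; a
sketch of the kind of increase involved (the defaults are 20M during normal
operation and 200M during the node's own restart; the exact values used in
the benchmark are not shown here):

  [ndbd default]
  # Illustrative values only.
  MaxDiskWriteSpeed=40M
  MaxDiskWriteSpeedOwnRestart=400M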

In 7.6 we execute a checkpoint that is local to the starting node. This takes
2 minutes and 4 seconds. Finally we participate in a checkpoint where all
nodes participate. This is now extremely fast, it takes only 4 seconds since
no activity is ongoing. So the total time for this phase is 2 minutes and
8 seconds.

This wait-for-checkpoint phase is fairly constant in time in 7.6; in 7.5 it
grows with the database size and depends on the disk write speed we have
configured. Thus the gain in restart time in 7.6 is a bit variable, but this
experiment has been done with fairly conservative numbers.

The final restart phase hands over the responsibility for event handling
between the connected MySQL servers. This phase takes 5-6 seconds in both
7.5 and 7.6. It can take longer if some MySQL Server is down during the
restart.

As can be seen we have significantly brought down the restart times for a
database size that is common in 7.5 installations; with larger sizes the
difference is much bigger.

Tuesday, June 12, 2018

Benchmark of new cloud feature in MySQL Cluster 7.6


In previous blogs we have shown how MySQL Cluster can use the Read Backup
feature to improve performance when the MySQL Server and the NDB data
node are colocated.

There are two scenarios in a cloud setup where additional measures are
needed to ensure localized read accesses even when using the Read Backup
feature.

The first scenario is when data nodes and MySQL Servers are not colocated.
In this case by default we have no notion of closeness between nodes in
the cluster.

The second case is when we have multiple node groups with colocated
data nodes and MySQL Servers. In this case we have a notion of closeness
to the data in the node group we are colocated with, but not to other
node groups.

In a cloud setup the closeness is dependent on whether two nodes are in
the same availability domain (availability zone in Amazon/Google) or not.
In your own network other scenarios could exist.

In MySQL Cluster 7.6 we added a new feature where it is possible
to configure nodes to be contained in a certain location domain.
Nodes that are close to each other should be configured to be part of
the same location domain. Nodes belonging to different location domains
are always considered to be further away than nodes in the same
location domain.
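
A minimal config.ini sketch of this, with one data node and one MySQL
Server per availability domain (node IDs and host names are made up):

  [ndbd]
  NodeId=1
  HostName=host-in-ad1
  LocationDomainId=1

  [ndbd]
  NodeId=2
  HostName=host-in-ad2
  LocationDomainId=2

  [ndbd]
  NodeId=3
  HostName=host-in-ad3
  LocationDomainId=3

  [mysqld]
  NodeId=51
  HostName=host-in-ad1
  LocationDomainId=1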

We will use this knowledge to always use a transaction coordinator placed
in the same location domain and if possible we will always read from a
replica placed in the same location domain as the transaction coordinator.

We use this feature to direct reads to a replica that is contained
in the same availability domain.

This provides a much better throughput for read queries in MySQL Cluster
when the data nodes and MySQL servers span multiple availability domains.

In the figure below we see the setup: each sysbench application works
against one MySQL Server, both of them located in the same availability
domain. The MySQL Server works against a set of 3 replicas in the NDB data
nodes. Each of those 3 replicas resides in a different availability domain.

The graph above shows the difference between using location domain ids in
this setup and not using them. The missing measurements are absent simply
because there wasn't enough time to complete this particular benchmark,
but the measurements we have still show the improvement possible, and the
improvement is above 40%.

The Bare Metal Servers used for the data nodes were DenseIO2 machines, and
the MySQL Servers used bare metal servers without any attached disks; not
even block storage is needed for the MySQL Server instances. The MySQL
Servers in an NDB setup are more or less stateless, as all the required state
is available in the NDB data nodes. Thus it is quite ok to start up a MySQL
Server from scratch all the time. The exception is when the MySQL Server
is used for replicating to another cluster, in which case the binlog state is
required to be persistent on the MySQL Server.

Monday, June 11, 2018

Impact of sharding on query performance in MySQL Cluster


A new week of blogs about our development in MySQL Cluster 7.6.
After working a long time on a set of new developments, there are a lot
of things to describe. I will continue this week by discussing sharding
and NDB, a new cloud feature in 7.6, and some benchmark
results on restart performance in 7.6 compared to 7.5. I am also planning
a comparative analysis of a few more versions of NDB.

In the blog series I have presented recently we have shown
the performance impact of various new features in MySQL Cluster
7.5 and 7.6. All these benchmarks were executed with tables that
used 1 partition. The idea behind this is that to develop a
scalable application it is important to develop partition-aware
applications.

A partition-aware application will ensure that all partitions
except one are pruned away from the query. Thus its queries get the same
performance as a query on a single-partition table.

Now in this blog we analyse the difference between using 1 partition
per table and using 8 partitions per table.
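
As a sketch in SQL (table and column names are made up), the two cases
correspond to table definitions like these:

  -- Single-partition table: every scan touches exactly one partition.
  CREATE TABLE t1 (
    id INT NOT NULL PRIMARY KEY,
    payload VARCHAR(100)
  ) ENGINE=NDB
    PARTITION BY KEY (id) PARTITIONS 1;

  -- 8-partition table: a scan that isn't pruned by the partition key
  -- visits all 8 partitions, but the partitions are scanned in parallel.
  CREATE TABLE t8 (
    id INT NOT NULL PRIMARY KEY,
    payload VARCHAR(100)
  ) ENGINE=NDB
    PARTITION BY KEY (id) PARTITIONS 8;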

The execution difference is that with 8 partitions we have to
dive into the index tree 8 times instead of once, and we have to
pay the startup cost of the scan for each partition as well. At the
same time using 8 partitions means that we get some amount of
parallelism in the query execution, and this speeds up query execution
at low concurrency.

Thus there are two main differences between single-partition scans
and multi-partition scans.

The first difference is that the parallelism decreases the latency
of query execution at low concurrency. More partitions means a higher
speedup.

The second difference is that the data node will spend more CPU to
execute the query for multi-partition scans compared to single-partition
scans.

Most of the benchmarks I have shown are limited by the cluster connection
used. Thus we haven't focused so much on the CPU usage in data nodes.

Thus in the graph above the improvement in query speed is around 20% at
low concurrency. The performance difference at other concurrency levels
is small, but the multi-partition scans use more CPU. Multi-partition
scans are, however, a bit more variable in their throughput.

Tests where I focused more on data node performance showed around 10%
overhead for multi-partition scans compared to single-partition scans
in a similar setup.

An interesting observation is that although most applications
should be developed with partition-aware queries, the queries that
are not pruned to one partition will be automatically parallelised.

This is the advantage of the MySQL Cluster auto-sharded architecture.
In a sharded setup using any other DBMS it is necessary to ensure that
all queries are performed in only one shard, since there are no automatic
queries over many shards. This means that partition-aware queries can
be handled in only one data server, but the application will have to
calculate where this data server resides. Cross-shard queries have to be
manually managed though, both sending queries in parallel to
many shards and merging the results from many shards.

With NDB all of this is automatic. If the query is partition-aware,
it will be automatically directed to the correct shard (node group
in NDB). If the query isn't partition-aware and thus a cross-shard
query, it is automatically parallelised. It is even possible to
push join queries down into the NDB data nodes to execute the
join queries using a parallel linked-join algorithm.
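
As a sketch (the table names are made up; the ndb_join_pushdown variable
that controls this is on by default):

  -- A join that isn't pruned to one partition can be pushed down to
  -- the data nodes and executed there as a parallel linked join.
  -- EXPLAIN indicates when a table takes part in a pushed join.
  EXPLAIN
  SELECT c.balance, o.total
  FROM customer c
  JOIN orders o ON o.customer_id = c.customer_id
  WHERE c.balance < 0;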

As we have shown in earlier blogs, and will show even more in coming
blogs, NDB with the Read Backup feature ensures that read queries
are directed to a data node that is as local as possible to the MySQL
Server executing the query. This is true also for join queries pushed
down to the NDB data nodes.

Friday, June 08, 2018

Benchmark Read Backup Feature of NDB in the Oracle Cloud


The previous blog demonstrated the improvements from using the Read Backup
feature in NDB in a tightly connected on-premises installation. Now we will
show how the performance is impacted when running in the Oracle Cloud.

In both benchmarks we show here, the MySQL Server and the NDB data node
are colocated, and we have one node group where the combined data nodes
and MySQL Servers are placed in their own availability domains. We use all
3 availability domains and thus we have 3 synchronous replicas of the
data in the database. This means that all communication between data nodes,
and all communication from a MySQL Server that doesn't use its local data
node, crosses into another availability domain.

The below figure shows the mapping of processes to cloud machines. In this
experiment we used DenseIO2 Bare Metal Servers as machines for the
combined data node, MySQL Server and Sysbench application.
This machine has 52 CPU cores and a set of NVMe drives that provide
very fast local storage for the NDB data nodes; this ensures very speedy
recovery and in addition provides excellent hardware on which to place some
columns in the disk data parts of NDB. The machine has 768 GByte of memory
and 51.2 TByte of NVMe drives, thus providing highly available in-memory
storage of around 500 GByte and a few TBytes of disk data columns.



We used a standard Sysbench OLTP RO benchmark. This benchmark executes 16
SQL queries: 10 simple primary key lookups, 4 range scan queries that each
fetch 100 rows from the database, and finally a BEGIN and a COMMIT statement.
To scale the Sysbench benchmark each Sysbench instance uses different tables
and different data. In a real setting there would likely be a load balancer
that directs connections to an application in one of the availability
domains. From this application the idea is to handle most traffic internally
in the same availability domain. My benchmark scripts ensure that the
different sysbench programs are started in a synchronous manner.
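
For reference, a generic sysbench 1.0 invocation of this kind of read-only
workload looks roughly like the following; my benchmark scripts use a
modified sysbench, so the host name and parameters here are only
illustrative:

  sysbench oltp_read_only \
    --mysql-host=mysql-ad1 --mysql-user=bench \
    --tables=1 --table-size=1000000 \
    --threads=64 --time=300 run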

Thus the main difference when Sysbench is replaced by a real application
is that a load balancer sits in front of the application. Load balancers
are provided as a service in the Oracle Cloud. One could also use
MySQL Router between the application and the MySQL Server. Alternatively,
the JDBC driver can handle failover between MySQL Servers itself, which
avoids this extra network jump. As we will see in the benchmarks, each
network jump makes it harder to get the optimal performance out of the
setup and it increases the latency of application responses.

The graph at the top of the blog shows the results when running all
nodes colocated in the same server. There is some variation in the results
at low concurrency, but as the randomness of query placement evens out,
the improvement from using the Read Backup feature stabilises at around 50%.
This improvement comes mostly from avoiding the latency of going over
to another availability domain for queries where the primary replica
is not in the same availability domain.

At extremely high concurrency the impact decreases, but at the same
time, when running with more than 2k concurrent connections, the
response times are not so good anymore: at low concurrency each
transaction has a latency of around 2.5 milliseconds, while at 2304
connections the latency has increased to 35 milliseconds. This is still
pretty good given that each transaction executes 16 SQL queries.

In the next benchmark we move the sysbench application to its own set
of machines. For the Sysbench application we used a VM2.16 instance
that has 16 CPU cores (thus 32 CPUs). The image below shows the
placement of the processes in the machines and availability domains
of the Oracle Cloud.



The graph below shows the performance numbers in this setup. In this
setup the difference between using Read Backup or not is smaller, since
we introduce one more source of latency: there are two network jumps
between servers in the same availability domain for each SQL query. This
increases the latency of each transaction by almost a factor of 2, and
thus the difference between using the Read Backup feature and not using
it becomes smaller. In this setup the difference is around 30%.


One thing to note about latency between servers in the Oracle Cloud, and
obviously in any cloud installation, is that the latency between servers
can differ. This is natural since the speed of light is a factor in the
latency, so the latency between servers in the cloud varies depending on
how far apart the availability domains are and on the placement of the
servers used. The latency numbers I experienced were a lot better than
the ones promised in the marketing, though. Thus most applications will
be able to meet their latency requirements in the Oracle Cloud.

An interesting thing to note is that when running applications that use
a lot of network resources, it is important to configure the Linux networking
correctly. I had some issues with this in the last benchmark,
where the sysbench VMs could not deliver more than 40k TPS at first. After
searching around for the bottleneck I found it in the Linux interrupt
handling on the Sysbench VM: CPU 30 and 31 were completely busy. By issuing
a few Linux commands to move the soft interrupt part to CPU 0 through
CPU 15 using RPS, the performance immediately jumped up to 70k TPS.
I described this in an earlier blog, and my book on MySQL Cluster 7.5
contains a chapter discussing those configuration options in Linux.
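
A sketch of the kind of RPS setting involved (the network device name and
the CPU mask are illustrative, not the exact ones used):

  # Spread softirq network processing of all receive queues over
  # CPUs 0-15 (bitmask 0xffff) instead of leaving it on CPUs 30-31.
  for rxq in /sys/class/net/ens3/queues/rx-*; do
    echo ffff > $rxq/rps_cpus
  done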

In a later blog I will describe exactly what I did to set up those benchmarks.
This means that it will be possible for anyone with an Oracle Cloud account
to replicate these benchmarks. This is one of the cool features of a cloud
installation: it makes it possible to replicate any benchmark setup.