Thursday, January 16, 2020

Preview of upcoming NDB Cluster benchmark


Just a fun image from running a benchmark in the Oracle Cloud. The image above
shows 6 hours of a benchmark run in a data node on a Bare Metal Server: first creating
the disk data tablespaces, next loading the data and finally running the benchmark.

During loading the network was loaded to 1.8 GByte per second and the disks were
writing 4 GByte per second. During the benchmark run the disks were writing 5 GByte
per second in addition to reading 1.5 GByte per second.

All this while the CPUs were never loaded to more than 20 percent. There are many
interesting things to consider when running benchmarks against modern disk drives.
Bottlenecks can appear in CPUs, in disk drives, in networks, and of course it is
possible to create bottlenecks in your software. But I am pretty satisfied with the
run above in that we are close to the physical limits of both the network and the
disk drives.

Wednesday, January 15, 2020

Support 3-4 replicas in NDB Cluster 8.0

One thing that we announced in the GA release of MySQL Cluster 8.0 is
that we now support using 3 and 4 replicas. Actually, the NDB software
was designed to support 1 through 4 replicas from the very get-go in the
1990s. The reason it has not been supported is simply that we haven't
had any regular testing effort attached to it. Thus we haven't felt that
we could support such an important feature without having proper testing
of it.
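
For reference, here is a minimal config.ini sketch of how the replica count is
selected. The host names and the DataMemory size are just placeholder values for
illustration:

[ndbd default]
# 3 and 4 replicas are now supported in NDB 8.0
NoOfReplicas=3
# placeholder value, adjust to the machine
DataMemory=2G

[ndb_mgmd]
HostName=mgm_host

[ndbd]
HostName=data_host_1

[ndbd]
HostName=data_host_2

[ndbd]
HostName=data_host_3

With NoOfReplicas=3 the number of data nodes must be a multiple of 3, which is why
three [ndbd] sections are shown above.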

In the development of MySQL Cluster 8.0 we added a number of new test
configurations with both 3 replicas and with 4 replicas. These configurations
will now be used in our daily testing of NDB.

In the process of working on this we found an issue with arbitration that was
present already in MySQL Cluster 7.6. This issue has been fixed in 7.6.
So the way we decide how to handle a node failure in 7.6 and onwards
is the following (a small sketch of these rules in code follows the list):

1) If all nodes in one node group are down then the cluster will fail.
2) If 1) is false AND at least one node group has all nodes still up
then the cluster will survive.
3) If 1) AND 2) are false AND a majority of the nodes are still up
then the cluster will survive.
4) If 1) AND 2) AND 3) are false AND a minority of the nodes are
still up then the cluster will fail.
5) If 1) AND 2) AND 3) AND 4) are false then half the nodes are
still up and arbitration is required to decide whether the cluster
will stay up.
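
Here is the small sketch mentioned above. It is not NDB code, just a plain Python
illustration of the five rules, where node_groups describes which nodes are still up
and the alive-node counts are taken before and after the failure:

def cluster_survives(node_groups, alive_before, alive_after, ask_arbitrator):
    # node_groups: list of lists of booleans, True if that node is still up.
    # alive_before/alive_after: number of alive nodes before and after the failure.
    # ask_arbitrator: function returning True if the arbitrator says yes (rule 5).

    # Rule 1: if any node group has lost all of its nodes, the cluster fails.
    if any(not any(group) for group in node_groups):
        return False
    # Rule 2: if at least one node group has all of its nodes up, the cluster survives.
    if any(all(group) for group in node_groups):
        return True
    # Rule 3: a majority of the previously alive nodes are still up -> survive.
    if 2 * alive_after > alive_before:
        return True
    # Rule 4: a minority of the previously alive nodes are still up -> fail.
    if 2 * alive_after < alive_before:
        return False
    # Rule 5: exactly half the nodes are up -> arbitration decides.
    return ask_arbitrator()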

Arbitration means that we contact a predefined arbitrator (set up dynamically
by the cluster based on the configuration), which is a management node or an
API node. This node will say yes to the first node requesting arbitration and
will respond no to any further requests. After that a new arbitrator has to
be selected.
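
As a simple illustration of that first-request-wins behaviour (again just a sketch
in Python, not the actual protocol):

class Arbitrator:
    def __init__(self):
        self.has_answered = False

    def request(self):
        # The first requester gets a yes, every later requester gets a no.
        if not self.has_answered:
            self.has_answered = True
            return True
        return False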

When deciding whether a majority of the nodes are up we use the previous
number of alive nodes and compare it with the number of nodes
up after the failure.

This means that a cluster with 4 replicas can survive a crash of
2 replicas simultaneously. After this crash has been handled we
can survive one more crash, and thus a 4 replica cluster can still
be operational even with just 1 replica left.

Our testing of 3-4 replicas found 4 bugs that have all
been fixed in MySQL Cluster 8.0. So using 3-4 replicas in 7.6
means that one can hit any of those bugs.

Properties of a 3 replica cluster

Updates in a 3 replica cluster take a bit more time. NDB employs a linear
commit protocol across the replicas, which means that each additional replica
adds linearly to the latency of the transaction: a write that passes through
2 replicas in sequence instead passes through 3. So one should expect at least
50% longer latency for write operations and commits when using 3 replicas
compared to when using 2 replicas.
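
A toy model of that scaling (my own simplification in Python, assuming the
per-replica step cost stays the same):

def linear_commit_latency(num_replicas, per_replica_step_ms):
    # In a linear commit the prepare/commit passes sequentially through
    # each replica, so latency grows linearly with the number of replicas.
    return num_replicas * per_replica_step_ms

# 3 replicas vs 2 replicas with the same per-step cost -> 1.5x, i.e. 50% longer.
print(linear_commit_latency(3, 1.0) / linear_commit_latency(2, 1.0))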

3 replicas means 50% more CPU bandwidth available for reads when all nodes
are up.

Previous blogs that I wrote already in June 2018 using 7.6 show some performance
numbers when using 3 replicas in combination with the Read Backup feature introduced
in 7.5. These blogs also show the use of LocationDomainId to ensure that reads
in cloud environments are sent to replicas within the same
Availability Domain/Zone.

16 TB data node instances in NDB Cluster 8.0

As you probably noticed, MySQL Cluster 8.0 was released yesterday as a GA release.
One important thing that we changed in the GA release is that we made it possible to
set DataMemory as high as 16 TB.
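
In config.ini terms this is simply a larger value for the DataMemory parameter.
The value below is purely an illustration of a value around the new upper limit:

[ndbd default]
# roughly 16 TB, around the new upper limit in NDB 8.0
DataMemory=16384G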

We are currently working with Intel on machines that can handle 6 TB of memory,
and using these machines we have already been able to load more than 5 TB of user
data in the DBT2 benchmark. We will publish more details on the results of those
benchmark investigations later.

Given that we have also improved support for disk data so much in MySQL Cluster 8.0
it means that it is perfectly sensible to store 10s of TB of data in each data node and
even up to 100 TB. So this means that a cluster with 144 data nodes, which with
3 replicas gives 48 node groups, would be able to store all the way up to 5 PB of data
(48 node groups times roughly 100 TB of unique data per node group).
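
A back-of-the-envelope helper for that calculation (my own Python sketch, not
anything from the product):

def cluster_capacity_tb(data_nodes, replicas, per_node_tb):
    # All nodes in a node group hold the same data, so the usable capacity is
    # the per-node capacity times the number of node groups.
    node_groups = data_nodes // replicas
    return node_groups * per_node_tb

# 144 data nodes, 3 replicas, ~100 TB per node -> about 4800 TB, close to 5 PB.
print(cluster_capacity_tb(144, 3, 100))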

The changes that made this possible were actually done already in MySQL Cluster 7.6
as part of the development of Partial LCP, which was a key feature in 7.6.
Now that machines with more than 1 TB of memory are becoming common it is
important for MySQL Cluster to support even larger memories.

Partial LCP was developed to handle up to 16 TB of DataMemory
efficiently. It will still work with even bigger DataMemory, but supporting larger
memories efficiently would require some minor changes in the product.