Thursday, January 16, 2020

Preview of upcoming NDB Cluster benchmark


Just a fun image from running a benchmark in the Oracle Cloud. The image above
shows 6 hours of a benchmark run on a data node on a Bare Metal Server: first creating
the disk data tablespaces, next loading the data and finally running the benchmark.

During loading the network was loaded to 1.8 GByte per second and the disks were
writing 4 GByte per second. During the benchmark run the disks were writing 5 GByte
per second in addition to reading 1.5 GByte per second.

All this while the CPUs were never loaded to more than 20 percent. There are many
interesting things to consider when running benchmarks against modern disk drives.
Bottlenecks can appear in CPUs, in disk drives, in networks and of course it is
possible to create bottlenecks in your software. But I am pretty satisfied with the run
above in that we're close to the physical limits of both the network and the disk drives.
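
For readers who have not used NDB disk data tablespaces before, the creation step
mentioned above amounts to DDL along these lines (a minimal sketch with made-up
names and sizes; the benchmark setup uses far larger and more numerous files):

CREATE LOGFILE GROUP lg1
    ADD UNDOFILE 'undo_1.log'
    INITIAL_SIZE 4G
    UNDO_BUFFER_SIZE 64M
    ENGINE NDBCLUSTER;

CREATE TABLESPACE ts1
    ADD DATAFILE 'data_1.dat'
    USE LOGFILE GROUP lg1
    INITIAL_SIZE 64G
    ENGINE NDBCLUSTER;

-- Tables whose non-indexed columns should reside on disk then specify:
-- CREATE TABLE t1 (...) TABLESPACE ts1 STORAGE DISK ENGINE NDBCLUSTER;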

Wednesday, January 15, 2020

Support 3-4 replicas in NDB Cluster 8.0

One thing that we announced in the GA release of MySQL Cluster 8.0 is
that we now support using 3 and 4 replicas. Actually, the NDB software
was designed to support 1 through 4 replicas from the very beginning in the
1990s. The reason it has not been supported is simply that we haven't
had any regular testing effort attached to it. Thus we haven't felt that
we could support such an important feature without having proper testing
of it.

During the development of MySQL Cluster 8.0 we added a number of new test
configurations with both 3 replicas and 4 replicas. These configurations are
now used in our daily testing of NDB.
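
As a reminder of how the replica count is configured, it is the NoOfReplicas parameter
in the cluster configuration file; a minimal config.ini sketch:

[ndbd default]
NoOfReplicas=3
# The number of [ndbd] sections in config.ini must be a multiple of
# NoOfReplicas; every 3 data nodes then form one node group.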

In the process of working on this we found some issues with arbitration that
existed already in MySQL Cluster 7.6. These issues have been fixed in 7.6 as well.
So the way a node failure is handled in 7.6 and onwards
is the following:

1) If all nodes in one node group are down, then the cluster will fail.
2) If 1) is false AND at least one node group has all nodes still up,
then the cluster will survive.
3) If 1) and 2) are false AND a majority of the nodes are still up,
the cluster will survive.
4) If 1), 2) and 3) are false AND a minority of the nodes are
still up, the cluster will fail.
5) If 1), 2), 3) and 4) are false, then exactly half the nodes are
still up and arbitration is required to decide whether the cluster
will stay up.

Arbitration means that we contact a predefined arbitrator, a management node or
API node that the cluster selects dynamically based on the configuration. This
node will say yes to the first node requesting arbitration and will
respond no to any further requests. After that a new arbitrator has to
be selected.

When deciding whether a majority of the nodes are up, we compare the number of
nodes that were alive before the failure with the number of nodes still up
after the failure.

This means that a cluster with 4 replicas can survive a crash of
2 replicas simultaneously. After this crash has been handled we
can survive one more crash, and thus a 4-replica cluster can still
be operational with just 1 replica left.

Our testing of 3-4 replicas found 4 bugs, all of which
have been fixed in MySQL Cluster 8.0. So using 3-4 replicas in 7.6
means that one risks hitting any of those bugs.

Properties of 3 replica cluster

Updates in a 3-replica cluster take a bit more time. NDB employs a linear
commit protocol across the replicas, which means that each additional replica
adds linearly to the latency of the transaction. So one should expect at least
50% longer latency for write operations and commits when using 3 replicas
compared to when using 2 replicas.

3 replicas also mean 50% more CPU capacity available for reads when all nodes
are up, compared to 2 replicas.

Blogs I wrote already in June 2018 using 7.6 show some performance numbers
when using 3 replicas in combination with the Read Backup feature introduced
in 7.5. These blogs also show the use of LocationDomainId to ensure that reads
in cloud environments are sent to replicas within the same
Availability Domain/Zone.
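
For reference, the Read Backup property is a per-table setting in the table comment,
while LocationDomainId is a node parameter in the cluster configuration. A small
sketch with a hypothetical table:

CREATE TABLE t1 (
    id INT NOT NULL PRIMARY KEY,
    payload VARCHAR(100)
) ENGINE=NDBCLUSTER COMMENT="NDB_TABLE=READ_BACKUP=1";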

16 TB data node instances in NDB Cluster 8.0

As you probably noticed, MySQL Cluster 8.0 was released yesterday as a GA release.
One important thing that we changed in the GA release is that we made it possible to
set DataMemory to up to 16 TB.
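
DataMemory is set in the cluster configuration file; a minimal config.ini sketch,
assuming the machine actually has this much memory available for the data node:

[ndbd default]
# 16 TB, the new upper limit in MySQL Cluster 8.0
DataMemory=16384G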

We are currently working with Intel on machines that can handle 6 TB of memory,
and using these machines we have already been able to load more than 5 TB of user
data in the DBT2 benchmark. We will publish more details on the results of those
benchmark investigations later.

Given that we have also improved the support for disk data so much in MySQL Cluster 8.0,
it is perfectly sensible to store 10s of TB of data in each data node and
even up to 100 TB. So this means that a cluster with 144 data nodes, which with 3 replicas
forms 48 node groups, would be able to store all the way up to around 5 PB of data
(48 node groups x 100 TB).

The changes that made this possible were actually already made in MySQL Cluster 7.6
as part of the development of Partial LCP, which was a key feature in 7.6.
Now that machines with more than 1 TB of memory are becoming common, it is
important for MySQL Cluster to support even larger memories.

Partial LCP was designed to handle up to 16 TB of DataMemory
efficiently. It will still work with an even bigger DataMemory, but supporting larger
memories efficiently would require some minor changes in the product.

Wednesday, December 11, 2019

NDB Parallel Query, part 5

In this part we are going to analyze a bit more complex query than before.
This query is a 6-way join.

The query is:
SELECT
        supp_nation,
        cust_nation,
        l_year,
        SUM(volume) AS revenue
FROM
        (
                SELECT
                        n1.n_name AS supp_nation,
                        n2.n_name AS cust_nation,
                        extract(year FROM l_shipdate) as l_year,
                        l_extendedprice * (1 - l_discount) AS volume
                FROM
                        supplier,
                        lineitem,
                        orders,
                        customer,
                        nation n1,
                        nation n2
                WHERE
                        s_suppkey = l_suppkey
                        AND o_orderkey = l_orderkey
                        AND c_custkey = o_custkey
                        AND s_nationkey = n1.n_nationkey
                        AND c_nationkey = n2.n_nationkey
                        AND (
                                (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
                                OR (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
                        )
                        AND l_shipdate BETWEEN '1995-01-01' AND '1996-12-31'
        ) AS shipping
GROUP BY
        supp_nation,
        cust_nation,
        l_year
ORDER BY
        supp_nation,
        cust_nation,
        l_year;

It is the inner SELECT that is the 6-way join. The outer part only deals with the
GROUP BY aggregation and ORDER BY of the result set from the inner
SELECT. As mentioned before the GROUP BY aggregation and ORDER BY
parts are handled by the MySQL Server. So the NDB join pushdown only deals
with the inner select.

In the previous queries we analysed, the join order was pretty obvious. In this
case it isn't that obvious. But the selection of join order is still fairly
straightforward. The selected join order is
n1 -> supplier -> lineitem -> orders -> customer -> n2.
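
As a side note, one way to check that the inner select really is pushed down to NDB
is to look at the Extra column of the EXPLAIN output for the query, or to inspect the
pushed-join status variables; a small usage sketch (the counters are cumulative):

-- Run the query first, then check how many joins were pushed down and executed:
SHOW STATUS LIKE 'Ndb_pushed%';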

Query analysis

The query starts by reading 2 rows from the nation table. These 2 rows are either
coming from the same TC thread or from separate TC threads, and they generate the
data for the next scan, which is on the supplier table. The supplier table will return
798 rows that are used in the scan against the
lineitem table. This assumes scale factor 1.

This brings up a new thing to discuss. If this query had been executed in the
MySQL Server, we would only be able to handle one row from the supplier table at a
time. There have been some improvements in the storage engine API to handle this,
using the multi-range read API, but it still means a lot of
communication back and forth and starting up new scans. With the NDB join
processing we will instead send a multi-range scan to the lineitem table. This means that we
will send one scan message that contains many different ranges. There will still be a
new walk through the index tree for each range, but there is no need to send the
scan messages again and again.

Creation of these multi-ranges is handled as part of the join processing in the
DBSPJ module.

The join between the supplier table and the lineitem table contains one more interesting
aspect. Here we join towards the column l_suppkey in the lineitem table. However, in many
queries in TPC-H the join against the lineitem table uses the order key as the join
column. The order key is the first part of the primary key and is thus a candidate to
use as partition key. The TPC-H queries definitely improve by using the order key as
partition key instead of the primary key. This means that an order and all lineitems
for that order are stored in the same LDM thread.
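
A sketch of what this could look like for the lineitem table (column list abbreviated;
in TPC-H the primary key is (l_orderkey, l_linenumber), so the order key is a legal
partition key):

CREATE TABLE lineitem (
    l_orderkey BIGINT NOT NULL,
    l_linenumber INT NOT NULL,
    -- ... remaining TPC-H columns ...
    PRIMARY KEY (l_orderkey, l_linenumber)
) ENGINE=NDBCLUSTER
PARTITION BY KEY (l_orderkey);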

The scan on the lineitem table will produce 145,703 rows to join with the orders table.
The remaining joins are performed through primary key lookups. Thus we will perform
145,703 key lookups in the orders table, there will be 145,703 key lookups in the
customer table and finally there will be 145,703 lookups against the nation table.
The only filtering here happens on the last table, which decreases the number of result
rows sent to the MySQL Server; the end result is 5,924 rows.

This raises another point: it would be possible to increase parallelism in this
query by storing the result rows in DBSPJ. However, this would increase the
overhead, so it would improve parallelism at the cost of efficiency.

Scalability impact

If we make sure that the lineitem table is partitioned on the order key, this query will
scale nicely. There will be a fairly small impact from more partitions, since only the scan
against the supplier table will be more costly in a larger cluster.

One thing that will make the query more costly is when the primary key lookups are
distributed instead of local. One table for which it definitely is a good idea to use
FULLY REPLICATED is the nation table. This means that all those 145,703 key
lookups will be handled inside a data node instead of going over the network.

The supplier table has only 10,000 rows compared to the lineitem table, which has
6M rows. Thus it should definitely be possible to use FULLY REPLICATED also
for this table. The customer table has 150,000 rows and is another candidate
for FULLY REPLICATED.
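
Fully replicated tables are also defined through the table comment; a sketch for the
nation table using the standard TPC-H columns (the same comment can be applied to an
existing NDB table with ALTER TABLE):

CREATE TABLE nation (
    n_nationkey INT NOT NULL PRIMARY KEY,
    n_name CHAR(25) NOT NULL,
    n_regionkey INT NOT NULL,
    n_comment VARCHAR(152)
) ENGINE=NDBCLUSTER COMMENT="NDB_TABLE=FULLY_REPLICATED=1";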

Since the MySQL Server will have to handle more than 300,000 rows in this query,
this will be the main bottleneck for parallelism. This means that the query will have a
parallelism of about 5. This is also the speedup we see compared to a single-threaded
storage engine for this query. This bottleneck will be about the same even with
larger clusters.

Next Part

I will take a break in this sequence of blogs for now and come back later with a
description of somewhat more involved queries and how NDB handles pushing down
subqueries and parts of a join query.