Monday, October 19, 2020

New things coming in MySQL Cluster 8.0.22

In version 8.0.22 we have added a number of important new features. The first one is support for encrypted backups. This includes the ability to create encrypted backups, restore encrypted backups and also encrypt backups that were originally created without encryption.
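As a rough sketch of how this is used (the password, backup id and paths below are placeholder values, and the exact option names should be verified against the 8.0.22 documentation), an encrypted backup is started from the ndb_mgm management client and restored with ndb_restore; the new ndbxfrm utility can be used to encrypt a backup that was originally taken without encryption:

ndb_mgm> START BACKUP ENCRYPT PASSWORD='my_secret_password'

ndb_restore --nodeid=1 --backupid=1 --backup-path=/backups/BACKUP/BACKUP-1 \
            --decrypt --backup-password='my_secret_password' --restore-data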


The second is that we have added support for IPv6 addresses. If it is necessary to mix IPv4 and IPv6 addresses in a cluster, it is important to be careful and follow the rules laid out in the documentation.
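As an illustration, node addresses in config.ini can now be given as IPv6 literals (the addresses and node ids below are hypothetical); a mixed setup such as the API node in this sketch is exactly the case where the documented rules need to be followed:

[ndbd]
NodeId=1
HostName=2001:db8::10

[ndbd]
NodeId=2
HostName=2001:db8::11

[mysqld]
NodeId=51
HostName=192.0.2.20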


Performance of replication of BLOB columns has been significantly improved, and similarly for SELECT statements using BLOB columns. This has been achieved by decreasing the number of round trips used to perform operations involving BLOB columns, which significantly decreases the latency of those operations.


Also, in 8.0.21 an important new feature was added that makes some complex queries significantly faster: the ability to push antijoin and semijoin execution down to the NDB data nodes. One example is Q22 in TPC-H, shown below. This query executes 3.5x faster due to this change, and there are likely queries that will benefit even more from it.

select
        cntrycode,
        count(*) as numcust,
        sum(c_acctbal) as totacctbal
from
        (
                select
                        substr(c_phone from 1 for 2) as cntrycode,
                        c_acctbal
                from
                        customer
                where
                        substr(c_phone from 1 for 2) in
                                ('13', '31', '23', '29', '30', '18', '17')
                        and c_acctbal > (
                                select
                                        avg(c_acctbal)
                                from
                                        customer
                                where
                                        c_acctbal > 0.00
                                        and substr(c_phone from 1 for 2) in
                                                ('13', '31', '23', '29', '30', '18', '17')
                        )
                        and not exists (
                                select
                                        *
                                from
                                        orders
                                where
                                        o_custkey = c_custkey
                        )
        ) as custsale
group by
        cntrycode
order by
        cntrycode;
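Whether the optimizer actually managed to push a join like this down to the data nodes can be checked from the MySQL Server. A small sketch: the Extra column of EXPLAIN mentions the pushed join for the tables that take part in it, pushdown is (to my understanding) governed by the ndb_join_pushdown optimizer switch which is on by default, and a set of status variables counts pushed queries.

SET optimizer_switch = 'ndb_join_pushdown=on';

EXPLAIN SELECT ...;  -- the query above; look for the pushed join in the Extra column

SHOW GLOBAL STATUS LIKE 'Ndb_pushed%';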


Wednesday, October 07, 2020

DBT2 benchmarks with NDB Cluster

This blog post refers to the uploaded slides.


DBT2 is based on the standard benchmark TPC-C. This benchmark is a mix of read and write activity and is much more write intensive than many other benchmarks. As a result, performance in previous versions of NDB only scaled to nodes with around 6 LDM threads. With the introduction of multiple sockets for communication between nodes we have been able to remove this limitation.


Thus DBT2 serves as a good tool to verify that MySQL Cluster 8.0.20 has improved

its capability to scale writes.


For these tests we had access to a set of bare metal servers in the Oracle Cloud (OCI). The data nodes used DenseIO2 bare metal servers with 52 CPU cores, 768 GByte of memory and 8 NVMe drives. We had 6 such servers, spread across 3 different availability domains.


The queries were executed by MySQL Servers. We used 15 bare metal servers with 36 CPU cores each to run the MySQL Servers, spread such that 5 servers were located in each availability domain.


The actual benchmark driver and client were executed on their own bare metal server.


DBT2 has five different transaction types. Each transaction is a complex transaction consisting of a set of SQL queries; on average each transaction consists of around 30-35 SQL queries. Only NewOrder transactions are counted in the results, and these constitute about 45% of the transactions. This means that if the report says 45,000 TPM (transactions per minute), around 100,000 transactions were actually executed.


DBT2 doesn't follow the standard TPC-C since each terminal continues immediately without waiting before it starts a new transaction. This makes it possible to also use DBT2 to test in-memory DBMSs. Interestingly, with persistent memory a server can have 6 TB of memory, which means that more than 50,000 warehouses can be used and thus results beyond 500,000 TPM can be handled even with an in-memory setup.


Each SQL query is created in the DBT2 driver, sent to the DBT2 client, and from there to a MySQL Server, which accesses a set of NDB data nodes to perform the actual query.


Since the servers are located in different availability domains the main cause of

latency comes from the communication between availability domains. When executing

DBT2 in a local network one can achieve latency down to 5-10 ms for a transaction

that contains around 30 SQL statements. In this environment we get latencies of

around 30 ms. Most of this time is spent communicating from the DBT2 client to the

MySQL Servers when they are located in a different availability domain.


NDB has a configuration option that ensures that reads always stay local to the availability domain of the MySQL Server they were sent to. Updates need to update all replicas and this might require communication across availability domains.
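The option referred to here is, to the best of my knowledge, LocationDomainId, which tells NDB which availability domain each node resides in so that reads can be routed to a local replica. A minimal config.ini sketch with hypothetical node ids and host names:

[ndbd]
NodeId=1
HostName=10.0.0.10
LocationDomainId=1

[ndbd]
NodeId=2
HostName=10.0.1.10
LocationDomainId=2

[mysqld]
NodeId=51
LocationDomainId=1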


We have selected 3 different configurations to execute DBT2 in.


The first configuration is shown in slide 5. It uses 2 data nodes, both located in the same availability domain. They are however located in different failure domains and thus only share the electricity network. We use 10 MySQL Servers from 2 different availability domains.


The second configuration shows a setup where we still have 2 replicas, but now we have one "shard" per availability domain. Thus both replicas of a shard are still in the same availability domain. This means that we use all 6 data node servers. We also use all 15 MySQL Servers in 3 different availability domains.


In the third configuration we use 3 replicas, where the replicas are in different availability domains. In this case we get 2 node groups using the 6 data nodes. There are still 15 MySQL Servers in 3 different availability domains.


The first configuration reaches almost 1.4M TPM at 2560 connections. The second configuration with 3 node groups scales almost linearly and reaches a bit beyond 4M TPM with 12000 connections.


Finally the last configuration reaches around 2.5M TPM at 8400 connections.


Thus we can see that performance is fairly constant per node group. The latency

differs and the number of connections required to reach optimal throughput

differs a bit.


It is very clear that the multi-socket solution has improved the scalability of writes, thus ensuring that NDB can reach very high numbers using DBT2.


In this benchmark we showed NDB scaling to use more than 1000 CPUs in the MySQL Servers and the NDB data nodes. MySQL NDB Cluster 8.0 can scale to 144 data nodes and thus can scale far beyond those numbers.

TPC-H Improvements in NDB Cluster

This blog post is a presentation based on the uploaded slides.


We have focused a lot of energy on improving our capability to execute complex queries in an efficient manner. We took the first step of these improvements in MySQL Cluster 7.6, but even more improvements have been made in MySQL Cluster 8.0.


There are four types of improvements that we have made. One is that we have added

the option to use a shared memory transporter between the MySQL Server and the

NDB data nodes when they are located on the same machine.
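As a sketch of how this colocated setup can be configured (to my understanding the relevant data node parameter is UseShm; verify against the manual for your version), shared memory connections to API nodes on the same host are enabled in config.ini:

[ndbd default]
# Use shared memory connections to API/MySQL Server nodes running on the same host
UseShm=1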


The second improvement is the adaptive CPU spinning described in blog post 1.


The third improvement is that we made Read Backup the default, which means that in complex queries any replica can be used for reading.


Finally, we have added a number of algorithmic improvements; these mainly focus on enabling more parts of the queries to be pushed down to the NDB data nodes. We described the improvements in 8.0.20 in blog post 2.


Slide 4 shows a 2-node setup optimised for query latency.


In slide 5 we show the improvements in TPC-H comparing MySQL Cluster 8.0.20 vs MySQL Cluster 7.6.12. 15 out of 22 queries have improved by more than 100%, and most of those by a lot more than 100%. 3 queries have improved between 50% and 100%. Only 4 queries are unaffected by these improvements.


In slide 6 we see the latency improvements that come from colocating the MySQL Server and the NDB data nodes using a shared memory transporter. 14 out of 22 queries improve in this scenario, some of them by almost 100% and even above 100%.


Slide 7 shows the improvements based on using LatencyOptimisedSpinning as SpinMethod, as described in blog post 3. 13 out of 22 queries see an improvement in this scenario, most of them by around 15-20%, with a few improving by up to 60%.


Slide 8 shows the sum of the improvements of colocation and CPU spinning.


Slide 9 shows that some queries can benefit from more LDMs, but some also get a negative impact from this. Thus we conclude that there is a limit to how many fragments we split tables into for optimal latency. This is an important thing to consider in any distributed DBMS.
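The number of LDM threads, and thus the default number of fragments per table, is set through the data node thread configuration. A minimal sketch, assuming the simpler MaxNoOfExecutionThreads knob rather than a full ThreadConfig specification (the value below is just an example, not a recommendation):

[ndbd default]
# More execution threads means more LDM threads and thus more fragments per table
MaxNoOfExecutionThreads=16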

Tuesday, October 06, 2020

YCSB Disk Data Benchmark with NDB Cluster

As mentioned in blog post 1 we have improved the capability to handle very large write bandwidth, and as described in blog post 2 we have improved the checkpointing of disk data columns in MySQL Cluster 8.0.20.


We wanted to verify that these changes were successful. To do this we chose to use the YCSB benchmark. This is a very simple benchmark; it contains one table with, in our variant, 2 columns. The first column is the primary key, and the second column contains the payload data in a VARCHAR(29500) column that is stored on disk.
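A sketch of what such a table definition can look like (the table and column names are hypothetical, and ts1 is assumed to be a disk data tablespace created as in blog post 3); with table-level STORAGE DISK the indexed primary key column stays in memory while the non-indexed payload column is stored in the tablespace:

CREATE TABLE usertable (
        ycsb_key VARCHAR(255) NOT NULL PRIMARY KEY,
        payload VARCHAR(29500)
) ENGINE=NDBCLUSTER
  TABLESPACE ts1
  STORAGE DISK;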


The first part of the benchmark fills the database. The database is mostly stored in a tablespace that is set up in accordance with blog post 3. This means that we had a tablespace size of 20 TBytes. We loaded 600M rows, thus creating a database size of 18 TByte.


The load phase inserted 44,500 rows per second. This means that we loaded 1.25 GByte per second into the database. The bottleneck in both the load phase and the benchmark run phase was mainly the NVMe drives, but in some cases the 25G Ethernet also became the bottleneck. The CPUs were never loaded more than 20% and thus never became a bottleneck.


From this we can conclude that the setup and use of the NVMe drives

is the most important part of achieving extreme write rates for

use cases where NDB Cluster is used for file write loads.


The cluster setup used 2 data nodes; each data node was a bare metal server in the Oracle Cloud (OCI) with 8 NVMe drives (2 used for logs and checkpoints and 6 used for the tablespace). The servers had 52 CPU cores each. Instead of setting up a RAID on the 6 NVMe drives, we opted for one file system per NVMe drive and added one data file per NVMe drive to NDB. This meant that NDB handled the spread of writes over the different data files, so no complex RAID solution was required. However, to get the best possible performance it was necessary to use SSD overprovisioning.
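A sketch of how a tablespace with one data file per NVMe file system can be declared (paths, sizes and names are hypothetical placeholders, and a logfile group lg1 is assumed to exist already):

CREATE TABLESPACE ts1
        ADD DATAFILE '/nvme1/ts1_data_1.dat'
        USE LOGFILE GROUP lg1
        INITIAL_SIZE 512G
        ENGINE=NDBCLUSTER;

ALTER TABLESPACE ts1
        ADD DATAFILE '/nvme2/ts1_data_2.dat'
        INITIAL_SIZE 512G
        ENGINE=NDBCLUSTER;

-- ... one ADD DATAFILE per NVMe file system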


The disk, CPU and network usage during this benchmark can be found in blog post 4. The mean latency of the transactions was a bit more than 2 milliseconds, where reads took a bit more than 1 ms and writes around 4-5 milliseconds.


The actual benchmark consisted of 50% reads and 50% writes. Here we achieved almost 70,000 transactions per second. This meant that we read 1 GByte per second in parallel with writing 1 GByte per second.

Parallel execution of Outer/Semi joins in NDB Cluster

In MySQL Cluster 8.0.20 we added more query types that can be pushed down to the NDB data nodes: outer join queries and semi-join queries using the FirstMatch strategy.


These queries can be pushed if the conditions on those parts of the query that involve the outer join can be pushed as well, based on a concept in the MySQL Server called join nests.


A pushed query means that the join execution is pushed down to the NDB data nodes. This means that the rows needed to process the second table in the join are sent directly from the LDM threads to the join processor, which starts the select on the second table. The join can contain up to 32 tables. Many filters can be pushed as well, and it is also possible to push parts of a join query. We are actively working on supporting more and more query variants for pushdown to the NDB data nodes.


The main benefit of pushing the queries down to the data nodes is that it enables the queries to be parallelised, since both the accesses to the data-owning parts (LDM threads) and the join processors (tc threads) can be executed in parallel on multiple nodes and multiple threads in each node. This means that queries can return their results much quicker.


Some queries can become more than 50 times faster using this approach

compared to when queries are executed entirely in the MySQL Server.


In our development of these new pushdown join features we measured the results of executing the queries in TPC-H. In TPC-H there is one outer join query that looks like this:


select
        c_count,
        count(*) as custdist
from
        (
                select
                        c_custkey,
                        count(o_orderkey) as c_count
                from
                        customer left outer join orders on
                                c_custkey = o_custkey
                                and o_comment not like '%special%requests%'
                group by
                        c_custkey
        ) as c_orders
group by
        c_count
order by
        custdist desc,
        c_count desc;


This query benefits greatly from the push of outer joins, from the new CPU spinning, and from using the shared memory transporter between the data node and the MySQL Server. Combined, these 3 things improve the latency of this query (Q13) in TPC-H by a factor of 10x compared to executing it in MySQL Cluster 7.6.


Another query that benefits greatly from these changes is Q4 in TPC-H.


select
        o_orderpriority,
        count(*) as order_count
from
        orders
where
        o_orderdate >= '1993-07-01'
        and o_orderdate < date_add('1993-07-01', interval '3' month)
        and exists (
                select
                        *
                from
                        lineitem
                where
                        l_orderkey = o_orderkey
                        and l_commitdate < l_receiptdate
        )
group by
        o_orderpriority
order by
        o_orderpriority;


This query executes more than 50 times faster in MySQL Cluster 8.0.20 compared to MySQL Cluster 7.6, and this benefit comes strictly from changing the execution algorithm.

Adaptive CPU spinning in NDB Cluster

CPU spinning is something that has been around in NDB for a long time. This CPU spinning has been static, which means that each time a thread wakes up it will spin waiting for incoming signals. Thus if the spin time is set to e.g. 500 microseconds we will spin 5% of the time even when there is no activity, since each thread wakes up every 10 ms to perform timer-based activities.


In the new spinning approach we wanted to achieve the following.


- Power efficient CPU spinning
  NDB Cluster data nodes can execute queries very efficiently, which leads to efficient
  power usage. However, CPU spinning requires careful control to ensure that one
  doesn't waste power while spinning.
- No spinning unless there is something to gain from it
  Static spinning will always spin, independently of whether it makes sense.
  With the new approach we gather statistics about the usefulness of spinning. This
  means that when a thread wakes up again we collect statistics on how long it slept.
  With this information we are able to calculate the gains from spinning. We also
  calculate the cost of going to sleep and waking up again (this cost is usually higher
  in virtual machines compared to bare metal servers).


One important part of CPU spinning is what we do while spinning. In x86 and ARM CPUs there exist CPU instructions for spinning. While spinning we will spend around one microsecond in the spinning state before checking for input again. Going to sleep and waking up again can consume between 10 and 50 microseconds depending on the OS, the hardware and whether we are running in a VM, in Docker and so forth. These special assembler instructions ensure that the CPU doesn't use as much power, and in hyperthreaded CPUs the other hyperthread(s) get almost exclusive access to the CPU core while spinning is performed.


The adaptive part of our new CPU spinning is that we keep track of the workload and its ability to benefit from CPU spinning. We do so through a number of configuration parameters. However, for ease of use there are only 4 variants that one needs to choose between.


The configuration parameter SpinMethod is used to set the variant of CPU spinning to use. The following settings are available (a configuration sketch follows after the list):


- CostBasedSpinning
This is the default setting and means that we will only spin when we actually benefit from it: if we spend 10 microseconds spinning, we gain 10 microseconds by avoiding the cost of going to sleep and waking up. This is a conservative setting that ensures good power efficiency and avoids issues in situations where the data nodes execute in a shared environment.


- LatencyOptimisedSpinning
This setting is a very good trade-off between best latency and best power efficiency. Here we will spin if 10 microseconds of spinning can gain 1-2 microseconds of avoided costs of going to sleep and waking up. This is recommended in cases where the data nodes execute in an exclusive environment, e.g. in their own virtual machine, in their own Docker container or even exclusively using the entire server. This provides a very distinct advantage in query latency, as we will show in a later blog.


- DatabaseMachineSpinning
This setting provides the optimal latency at the expense of decreased power efficiency. It is intended for use cases where latency is of utmost importance and one is willing to spend some extra power to achieve this.


- StaticSpinning
This disables the adaptive spinning and retains the backwards-compatible static spinning behaviour.
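A minimal config.ini sketch of how one of these variants is selected (to my understanding SpinMethod is set in the data node section; the chosen value below is just an example):

[ndbd default]
# Spin only when the measured gain clearly outweighs the spin time
SpinMethod=LatencyOptimisedSpinning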

Adaptive checkpoints of Disk Data columns in NDB Cluster

In MySQL Cluster 8.0.20 we made it possible to write to disk data columns at considerably higher write rates than before. In MySQL Cluster 7.6 we introduced adaptive checkpointing of in-memory rows to ensure that restarts can be faster.


In order to handle very high write rates (up to more than 1 GByte per second)

on disk data columns in NDB it is necessary to control write rates more

efficiently.


To get the most efficient use of disk bandwidth it is important to keep write rates fairly constant, so that the available disk bandwidth is used continuously. If the load on the disks goes up and down too fast we will not be able to make use of all the available bandwidth of the disks.


With modern NVMe devices found in e.g. the Oracle Cloud one can achieve 10-20 GByte per second of reads and writes to a set of 8 NVMe devices using specialised disk benchmarks. Achieving this with a DBMS is not as easy, but we were able to see NDB use 7 GByte per second of reads and writes in some YCSB benchmarks.


This represents one more step in the direction of adaptive algorithms in NDB Cluster, ensuring that users will always get the optimal behaviour out of their hardware when using NDB.


Some previous steps are:

- Adaptive send assistance
  This feature, introduced in 7.5, made it possible for all threads to assist the send
  threads in sending messages on sockets. The main input to the adaptive algorithm is
  the amount of CPU time spent in each thread on normal execution, sleeping and
  assisting the send threads.
- Adaptive in-memory checkpoint speed
  This feature adapts the speed of checkpoints based on the amount of write
  activity in NDB Cluster.
- Adaptive CPU spinning
  This was introduced in 8.0.20 and will be presented in a separate blog.

Multi-socket feature in NDB Cluster

In MySQL Cluster 8.0.20 a new feature was added that enables higher write rates in NDB Cluster. In previous releases the communication between nodes in the same node group was handled by one socket. This single socket was able to handle about 200-300 MBytes per second of write traffic between the replicas inside one node group. Thus in clusters with many node groups the write rates can be much higher.


For most applications this doesn't represent any problem at all. However, there are certain applications that require vast amounts of writes.


One such application is when NDB is used to store file data. File writes can easily

go beyond 200-300 MBytes per second. Another application is when NDB is used to

store data from measurements, stock data or other applications that require millions of

writes per second.


The solution to this problem is to simply use multiple sockets when communicating

between nodes in the same cluster. This removes the bottleneck on write rates

between nodes in the same node group. The solution is applied by default since we

haven't seen any negative effects of using it. The number of sockets will be

derived from the number of LDM threads used in the data nodes.
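My understanding is that the number of sockets can also be set explicitly with the NodeGroupTransporters data node parameter added in 8.0.20 (treat the name and value below as something to verify against the manual); the default of 0 lets NDB derive the number from the LDM thread count as described above:

[ndbd default]
# Number of sockets between data nodes in the same node group (0 = derive from LDM threads)
NodeGroupTransporters=8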


We have verified the feature through the use of YCSB where the data is stored in

a disk data column. This is a good representation of writes of large amounts of

file data to NDB.


The second type of verification is through DBT2, an open source variant of TPC-C

where each benchmark thread starts a new operation immediately after completing

the previous transaction. This benchmark was previously limited by the single

socket such that scaling beyond 6 LDM threads didn't improve numbers at all.

With this change we can see scaling to full bare metal servers in the Oracle Cloud

with 52 CPU cores.


Reports of those benchmarks will be in separate blogs.