Tuesday, October 06, 2020

Parallel execution of Outer/Semi joins in NDB Cluster

In MySQL Cluster 8.0.20 we added more query types that can be pushed down to NDB data nodes: outer join queries and semi-join queries using the FirstMatch strategy.


These queries can be pushed when the conditions on the parts of the query that involve the outer join can themselves be pushed, which relies on a concept in the MySQL Server called join nests.


A pushed query means that the join execution is pushed down to the NDB data nodes. The rows needed to process the second table in the join are sent directly from the LDM threads to the join processor, which starts the select on the second table. The join can contain up to 32 tables, and many filters can be pushed as well. It is also possible to push parts of a join query. We are actively working on supporting more and more query variants for push down to NDB data nodes.


The main benefit of pushing queries down to the data nodes is that it enables the queries to be parallelised: both the accesses to the data-owning parts (LDM threads) and the join processors (tc threads) can execute in parallel on multiple nodes and on multiple threads in each node. This means that queries can return their results much more quickly.
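
Whether the optimizer actually pushed a join can be checked from the MySQL client. A minimal sketch, using the ndb_join_pushdown session variable and the Ndb_pushed% status counters:

-- Join pushdown is enabled by default but can be toggled per session.
SET SESSION ndb_join_pushdown = ON;

-- After running a join query, these counters show how many joins were
-- prepared for pushdown, executed as pushed joins, or dropped, and how
-- many reads the pushed joins performed.
SHOW STATUS LIKE 'Ndb_pushed%';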


Some queries can become more than 50 times faster with this approach compared to executing them entirely in the MySQL Server.


In our development of these new pushdown join features we measure the results of executing the queries in TPC-H. TPC-H contains one outer join query (Q13) that looks like this:


select
        c_count,
        count(*) as custdist
from
        (
                select
                        c_custkey,
                        count(o_orderkey) as c_count
                from
                        customer left outer join orders on
                                c_custkey = o_custkey
                                and o_comment not like '%special%requests%'
                group by
                        c_custkey
        ) as c_orders
group by
        c_count
order by
        custdist desc,
        c_count desc;


This query benefits greatly from the push of outer joins; it also benefits greatly from the new CPU spinning and from using the shared memory transporter between the data node and the MySQL Server. Combined, these three improvements reduce the latency of this query (Q13 in TPC-H) by a factor of 10 compared to executing it in MySQL Cluster 7.6.


Another query that benefits greatly from these changes is Q4 in TPC-H.


select
        o_orderpriority,
        count(*) as order_count
from
        orders
where
        o_orderdate >= '1993-07-01'
        and o_orderdate < date_add('1993-07-01', interval '3' month)
        and exists (
                select
                        *
                from
                        lineitem
                where
                        l_orderkey = o_orderkey
                        and l_commitdate < l_receiptdate
        )
group by
        o_orderpriority
order by
        o_orderpriority;


This query executes more than 50 times faster in MySQL Cluster 8.0.20 than in MySQL Cluster 7.6, and this gain comes strictly from the change in execution algorithm.
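
One way to confirm that a query such as Q4 was actually executed as a pushed join is to compare the pushed-query counters before and after running it; a small sketch, using the Ndb_pushed% status variables mentioned earlier:

-- Snapshot the pushed-join counter, run Q4, then check it again.
SHOW STATUS LIKE 'Ndb_pushed_queries_executed';

-- ... run the Q4 query shown above ...

-- If the semi-join was pushed down, the counter has increased;
-- Ndb_pushed_queries_dropped increases instead when a prepared push
-- could not be used at execution time.
SHOW STATUS LIKE 'Ndb_pushed_queries_executed';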

Adaptive CPU spinning in NDB Cluster

CPU spinning has been around in NDB for a long time. This CPU spinning has been static, which means that each time a thread wakes up it will spin waiting for incoming signals. Thus if the spin time is set to e.g. 500 microseconds we will spin 5% of the time even when there is no activity, since each thread wakes up every 10 ms to perform timer-based activities.


With the new spinning approach we wanted to achieve the following:


- Power-efficient CPU spinning
  NDB Cluster data nodes can execute queries very efficiently, which leads to efficient power usage. However, CPU spinning requires careful control to ensure that we don't waste power while spinning.
- No spinning unless there is something to gain from it
  Static spinning always spins, independently of whether it makes sense. With the new approach we gather statistics about the usefulness of spinning: when a thread wakes up again we record how long it slept, and from this information we can calculate the gain from spinning. We also calculate the cost of going to sleep and waking up again (this cost is usually higher in virtual machines than on bare metal servers).


One important part of CPU spinning is what we do while spinning. x86 and ARM CPUs provide special instructions for spin loops. While spinning we spend around one microsecond in the spinning state before checking for input again. Going to sleep and waking up again can consume between 10 and 50 microseconds, depending on the OS, the hardware, and whether we are running in a VM, in Docker and so forth. The special assembler instructions ensure that the CPU doesn't draw as much power, and on hyperthreaded CPUs the other hyperthread(s) get almost exclusive access to the CPU core while spinning is performed.


The adaptive part of our new CPU spinning is that we keep track of the workload and its ability to benefit from spinning. This is controlled through a number of configuration parameters; however, for ease of use there are only 4 variants that one can configure.


The configuration parameter SpinMethod is used to set the variant of CPU spinning to use. The following settings are available:

- CostBasedSpinning
This is the default setting and means that we only spin when we actually benefit from it: if we spend 10 microseconds spinning, we must gain 10 microseconds by avoiding the cost of going to sleep and waking up. This is a conservative setting that ensures good power efficiency and avoids issues when the data nodes execute in a shared environment.


- LatencyOptimisedSpinning
This setting is a very good trade-off between best latency and best power efficiency. Here we spin if 10 microseconds of spinning can gain 1-2 microseconds by avoiding the cost of going to sleep and waking up. It is recommended in cases where the data node executes in an exclusive environment, e.g. in its own virtual machine or Docker container, or even using the entire server exclusively. This provides a very distinct latency advantage for queries, as we will show in a later blog.


- DatabaseMachineSpinning
This setting provides the best latency at the expense of decreased power efficiency. It is intended for use cases where latency is of utmost importance and one is willing to spend some extra power to achieve it.


- StaticSpinning
This disables the adaptive spinning and makes NDB Cluster backwards compatible.
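
SpinMethod is a data node configuration parameter, so the chosen variant is set in the cluster configuration file (config.ini) read by the management server. A minimal sketch, with all other parameters omitted:

# Apply the same spin method to all data nodes in the cluster.
[ndbd default]
SpinMethod=LatencyOptimisedSpinning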

Adaptive checkpoints of Disk Data columns in NDB Cluster

In MySQL Cluster 8.0.20 we made it possible to write to disk data columns at considerably higher write rates than before. In MySQL Cluster 7.6 we introduced adaptive checkpointing of in-memory rows to ensure that restarts can be faster.
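
For context, a disk data column in NDB is a non-indexed column stored in a tablespace on disk instead of in memory. A minimal sketch of how such a table is declared (the names and sizes are only illustrative):

CREATE LOGFILE GROUP lg_1
        ADD UNDOFILE 'undo_1.log'
        INITIAL_SIZE 128M
        ENGINE NDBCLUSTER;

CREATE TABLESPACE ts_1
        ADD DATAFILE 'data_1.dat'
        USE LOGFILE GROUP lg_1
        INITIAL_SIZE 256M
        ENGINE NDBCLUSTER;

-- The primary key is indexed and therefore stays in memory;
-- the payload column is stored on disk in tablespace ts_1.
CREATE TABLE file_data (
        file_id BIGINT NOT NULL,
        block_no INT NOT NULL,
        payload VARBINARY(4000),
        PRIMARY KEY (file_id, block_no)
) TABLESPACE ts_1 STORAGE DISK
  ENGINE NDBCLUSTER;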


In order to handle very high write rates (up to more than 1 GByte per second) on disk data columns in NDB it is necessary to control write rates more efficiently.


To make the most efficient use of disk bandwidth it is important to keep write rates fairly constant, so that the disks are used continuously. If the load on the disks fluctuates too quickly we will not be able to make use of all the available disk bandwidth.


With modern NVMe devices, such as those found in the Oracle Cloud, one can achieve 10-20 GBytes per second of reads and writes to a set of 8 NVMe devices using specialised disk benchmarks. Achieving this with a DBMS is not as easy, but we were able to see NDB sustain 7 GBytes per second of reads and writes in some YCSB benchmarks.


This represents one more step in the direction of adaptive algorithms in NDB Cluster, ensuring that users always get the optimal behaviour out of their hardware when using NDB.


Some previous steps are:
- Adaptive send assistance
  This feature, introduced in 7.5, made it possible for all threads to assist the send threads in sending messages on sockets. The main input to the adaptive algorithm is the amount of CPU time spent in each thread on normal execution, sleeping and assisting send threads.
- Adaptive in-memory checkpoint speed
  This feature adapts the speed of checkpoints to the amount of write activity in NDB Cluster.
- Adaptive CPU spinning
  This was introduced in 8.0.20 and is presented in a separate blog.

Multi-socket feature in NDB Cluster

In MySQL Cluster 8.0.20 a new feature was added that enables higher write rates in NDB Cluster. In previous releases the communication between nodes in the same node group was handled by one socket. This single socket was able to handle about 200-300 MBytes per second of write traffic between the replicas inside one node group. Thus in clusters with many node groups the total write rates can be much higher.


For most applications this doesn't represent any problem at all. However, there are certain applications that require vast amounts of writes.


One such application is when NDB is used to store file data; file writes can easily go beyond 200-300 MBytes per second. Another is when NDB is used to store measurement data, stock data or other data that requires millions of writes per second.


The solution to this problem is simply to use multiple sockets for the communication between nodes in the same node group, which removes this bottleneck on write rates. The solution is enabled by default, since we haven't seen any negative effects of using it; the number of sockets is derived from the number of LDM threads used in the data nodes.
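
The transporter connections and their traffic can be observed from the ndbinfo database; a small sketch using the ndbinfo.transporters table:

-- Show the transporter connections of data node 1 together with the
-- amount of data sent and received on them. With the multi-socket
-- feature, the traffic between two nodes in the same node group is
-- spread over several sockets underneath.
SELECT node_id, remote_node_id, status, bytes_sent, bytes_received
FROM ndbinfo.transporters
WHERE node_id = 1;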


We have verified the feature through YCSB, with the data stored in a disk data column. This is a good representation of writing large amounts of file data to NDB.


The second type of verification is through DBT2, an open source variant of TPC-C in which each benchmark thread starts a new operation immediately after completing the previous transaction. This benchmark was previously limited by the single socket, such that scaling beyond 6 LDM threads didn't improve the numbers at all. With this change we see scaling all the way to full bare metal servers in the Oracle Cloud with 52 CPU cores.


Reports of those benchmarks will be in separate blogs.