Thursday, August 09, 2018

Scheduling challenges of checkpoints in NDB

The NDB data nodes are implemented using asynchronous programming. The model is
quite simple. One can send asynchronous messages on two priority levels, the
A-level is high priority messages that are mainly used for various management
actions. The B-level represents the normal priority level where all normal
messages handling transactions are executed.

It is also possible to send delayed signals that will wait for a certain
number of milliseconds before being delivered to the receiver.

When developing MySQL Cluster 7.4 we noted a problem with local checkpoints. If
transaction load was extremely high, the checkpoints almost stopped. If such a
situation stays for too long, we will run out of REDO log.

To handle this we introduced a special version of delayed signals. This new
signal will be scheduled such that at most around 75 messages are executed
before this message is delivered. There can be thousands of messages waiting
in queue, so this gives a higher priority to this signal type.

This feature was used to get control of checkpoint execution and introduced in
MySQL Cluster 7.4.7. With this feature each LDM thread will at least be able
to deliver 10 MBytes of checkpoint writes per second.

With the introduction of adaptive checkpoint speed this wasn't enough. In a
situation where we load data into NDB Cluster we might need to write much
more data to the checkpoints.

To solve this we keep track of how much data we need to write per second to
ensure that we don't run out of REDO log.

If the REDO log comes to a critical point where the risk of running out of
REDO log is high, we will raise priority of checkpointing even higher such
that we can ensure that we don't run out of REDO log.

This means that during a critical situation, normal transaction throughput
will decrease since we will put a lot of effort into ensuring that we don't
get into a situation of a complete stop due to running out of REDO log.

We solve this by executing checkpoint scans without real-time breaks for a
number of rows and if we need to continue writing checkpoints we send a
message on A-level to ourself to continue without giving transactions a
chance to come in. When we written enough we will give the transactions a
chance again by sending the new special delayed signal.

The challenge that we get here is that checkpoints must be prioritised over
normal transactions in many situations. At the same time we want the
prioritisation to be smooth to avoid start and stop situations that can
easily cause ripple effects in a large cluster.

This improved scheduling of checkpoints was one part of the solution to
the adaptive checkpoint speed that is introduced in MySQL Cluster 7.6.7.

No comments: