Friday, March 20, 2015

Controlling checkpoint speed in MySQL Cluster 7.4

A question from digitalpoint on a previous blog about 7.4 restarts requires
an explanation of how we control checkpoint speed in MySQL Cluster 7.4.
Since the explanation is fairly lengthy I will do it in this blog instead
of as a reply to a blog comment.

First of all some introduction into checkpoints in MySQL Cluster 7.4. We
actually use the term checkpoint for 2 things. We have LCPs (local
checkpoints) and GCPs (Global checkpoints). LCPs are the traditional
checkpoints and the one that will be described in this blog. GCPs are
about forming groups of transactions that will be durable after a
restart. GCPs happens very often whereas an LCP is a fairly long process.

So first I'll introduce why we're doing checkpoints (LCPs) in the first
place. There are two reasons for doing LCPs. The first is that MySQL
Cluster uses a log-based approach for recovery. Every update, insert,
delete and write is logged into the REDO log in MySQL Cluster. We also
have an UNDO log for disk data. To avoid that the logs grow to infinity
we use LCPs to be able to cut the log tail.

The logs have a fixed size, the REDO logs cannot be changed in size whereas
the UNDO log you can add new files to a logfile group. When a log is full
no updates are allowed anymore since we cannot recover the database anymore
if we allow non-logged updates.

All the above is true for tables that require durability. MySQL Cluster
actually also supports non-durable tables. These tables will be available
as long as the cluster survives, if the entire cluster fails, then those
tables will be restored to an empty state. Most applications need durable
tables, but there are also applications where durability doesn't make
sense and therefore it is better to run those tables without durability.

Checkpointing is obviously only relevant to durable tables where we recover
based on a combination of logs and checkpoints.

An LCP is used to cut the log tail. This means that we always need to write
the checkpoint at such a speed that we don't come into a point where we
run out of log (both REDO log and UNDO log). As an example assume that we
can execute 100.000 updates per second at top speed that updates 20 fields
of 200 bytes in size. The REDO log in this case will be about 200 bytes
plus 20 * 4 bytes for field info and around 70 bytes of fixed overhead for
a REDO log record. So about 350 bytes times 100.000 per second of REDO log
records created which is 35 MByte per second. Assuming that the data size is
50 GByte. We can either calculate the REDO log size based on the choice
of checkpoint write speed or the other way around that we calculate the
checkpoint write speed based on the REDO log size. If we start with a
checkpoint write speed of 100 MByte per second then it will take about
500 seconds to write an LCP. We need to have enough REDO log for at least
2 checkpoints and it is good have at least 100% safety margin. So then
the REDO log size needs to be big enough for 2000 seconds of log generation.
So 70 GByte of REDO log should be sufficient. The size of the REDO log is
a multiplication of number of log parts, number of log files and log file

So being able to cut the log tail is the only reason for running LCPs in
normal operation when there are no restarts ongoing. Obviously cutting the
log to be short also means shorter restart times since there is less
log to execute. So increasing checkpoint write speed means faster restart
times at the expense of more work in normal operation.

We also use the LCPs as part of restarts. We need to run an entire LCP where
the starting node participates to ensure that the starting node is
recoverable. In this scenario we simply want the LCP to run as fast as
possible to complete the restart as soon as possible.

So this means that we have two different speeds for checkpoint write speeds,
we have the speed needed during normal operation and we have the speed needed
for restarts. Actually we can even separate out a total cluster restart
(we call this system restart) from node restarts. The reason is that during
an LCP in a system restart there is no other action other than to perform
the LCP. Thus we can set the checkpoint write speed to the absolute maximum
possible. During another nodes restart we also need to execute user
transactions and so we want to find a balance between speeding up the restart
and being able to service current users of the cluster.

So this means that we need three config parameters. We call them
MaxDiskWriteSpeed (speed during normal operation),
MaxDiskWriteSpeedOtherNodeRestart (speed during another node performing
a node restart), MaxDiskWriteSpeedOwnNodeRestart (speed during our own node
restart, but also more importantly of all nodes during a system restart).

Obviously there is a cap to the write speed we can handle, there is both a
limit to what the disk drives can handle and there is also a limit to how
much each CPU running an LDM thread can output.

My experiments shows that one CPU that runs an LDM thread can output about
100 MByte per second if used only for generation of checkpoint writes.
The checkpoint format is essentially packed rows in a format that can be
restored by simply running normal inserts. We have changed a bit back and
forth between different checkpoint formats, we have used page-based schemes
that required UNDO logging also of memory tables in earlier versions of
MySQL Cluster. The benefit of the current scheme is that enables us to
avoid UNDO logs entirely for memory tables.

Modern disks can write a few hundred MBytes per second. Even cheap SSDs can
handle up to about 500 MBytes per second of writes. The disk has to handle
both writing of checkpoints and writes of the REDO log (and also of the
UNDO log for disk data).

In 7.3 and earlier the above config parameters was named differently.
They were named DiskCheckpointSpeed (more or less equal to MaxDiskWriteSpeed
in 7.4) and DiskCheckpointSpeedRestart (more or less equal to
MaxDiskWriteSpeedOwnRestart). MaxDiskWriteOtherNodeRestart have no
similitude in 7.3 and earlier versions, one uses DiskCheckpointSpeed also
when another node is restarting. This is one major factor in how we can
speed up node restarts and rolling upgrades.

One more thing is that the execution of LCPs have been improved in 7.4.
In 7.4 all LDM threads can write checkpoints in parallel. The config
parameter specifies the total speed of the node, thus each LDM will
use a part of this speed. So with write speed set to 100 MByte with
4 LDM threads means that each LDM thread will write at about 25 Mbyte
per second and thus about 25% of the available CPU capacity in the
LDM thread for checkpoints. In earlier versions up to 2 LDM threads
could write in parallel and each such LDM thread used the config
parameter DiskCheckpointSpeed/DiskCheckpointSpeedRestart. Thus the
write speed of the node was usually twice as much as the config
parameter was set to.

The change in 7.4 to enable all LDM threads to write in parallel means that
we can set the write speed higher in 7.4 than earlier. The CPU overhead of
checkpointing is spread on all LDM threads and this ensures that we are
able to increase the speed without detrimenting normal user operation.

There is also one more innovation in 7.4. This is seen from the config
parameter MinDiskWriteSpeed.

The problem here is that checkpoint writes compete for CPU capacity with
normal user operations. So if the user needs to run transactions at the
very highest speed, it is a good idea to slow down the CPU usage from
from writing checkpoints.

Additionally if we experience an IO lag where the disk gets overloaded
we also want to slow down the writing of checkpoints.

What we have done here is that we have implemented an adaptive algorithm
that will slow down the checkpoint speed when we either have a CPU
overload situation or we have an IO overload situation. To get a balanced
adaptive algorithm we decrease speed faster than we increase the speed.

For CPU overload we start reacting when the CPU usage reaches 95% and more.
We use a mechanism to see how much CPU time the LDM thread has had each
second. This means that this algorithm will work best in an environment
where the LDM threads always have access to CPU time if they need to.
If the LDM threads are sharing CPU time with many other threads and processes
the data node will still function but this particular adaptive algorithm
won't work very well.

Personally I always set up my data nodes using CPU locking with the
ThreadConfig config parameter to get the optimal stable performance from
the data nodes.

It is actually even possible to setup a real-time environment with MySQL
Cluster, more on the how-tos of that in a separate blog.

IO lagging is discovered by analysing how many seconds of REDO log is
on its way to the disk which hasn't yet been written to disk.

The adaptive algorithm will always ensure that we never go below
MinDiskWriteSpeed independent of CPU load and IO load. This parameter should
be set at such a level to never jeopardize the system and not risking that
the log tail meets the log head due to not completing LCPs quick enough.

The maximum checkpoint write speed is then based on whether we are in normal
operation, whether another node is currently restarting or whether we are
restarting our own node.

So by this thorough introduction to execution of LCPs in MySQL Cluster 7.4
it is possible to answer digitalpoint's question.

In normal operation when there is no high load on the node, there is no
overload of the IO subsystem, then we will write with LCPs at a constant
speed of MaxDiskWriteSpeed. The MinDiskWriteSpeed we will reach in a
scenario with overload either of the CPUs or the IO subsystem due to
many transactions being executed by the users of MySQL Cluster.

So MinDiskWriteSpeed should be set at the minimum level required to ensure
that we don't run out of REDO log space between. MaxDiskWriteSpeed should
be set to a level dependent on how fast restarts we want to have and this
is the speed for the most part so also one should take into account the
possibility of wearing out the disk when setting this parameter.

MaxDiskWriteSpeedOtherNodeRestart should be set quite high to enable fast
restarts. We should still not set it drastically high to avoid that we
run into unnecessary overload situations during the node restart.

MaxDiskWriteSpeedOwnRestart can be set to a level that can be handled by
the IO subsystem and CPU subsystem since it has very little other
activity to handle during a restart.

The adaptive algorithm makes it possible to fairly aggressive in setting
the last two config parameters since the adaptive algorithm will ensure
that we still maintain a healthy level of CPU usage for user transactions.

A final word is that it is possible to change those parameters in any restart.
Actually we even implemented an undocumented feature whereby you can change
these parameters in live nodes. The reason we haven't documented those is
that they are intended for persons that completely understand how their
system behaves and thus we consider that they are able to read the code of
Backup.cpp to understand what to do to change this parameters. The change
done through this interface won't survive a restart, for this one needs to
change the configuration.

We have also added a new NDBINFO table making it possible to follow how
the checkpoint write speed and other disk writes is occuring in your own
cluster. More on that in a later blog.


digitalpoint said...

Much appreciated... Makes perfect sense now that I know what it's trying to accomplish underneath it all.

I ended up lowering my MaxDiskWriteSpeed (which is cake with how fast the restarts are now).

The only reason I cranked it up as high as I did was for backup speed, but I guess that really doesn't matter too much. Would be nice if there was a separate option like MaxDiskWriteSpeedBackup.

Mikael Ronstrom said...

I'll look into it, I thought there was a similar parameter for that purpose, but I'll come back to it on Monday, if not it's a good idea to add one.