Wednesday, March 04, 2015

Restart phases of a node restart in MySQL Cluster

In MySQL Cluster 7.4 we have worked on improving the speed of restarts.
In addition we have made it easier to track how fast the restarts are
proceeding and to see which phases of the restart take a long time.

For this purpose we have added a new Ndbinfo table in MySQL Cluster 7.4.
In a series of blog posts I will try to explain the restart phases and how
they map to the columns in this Ndbinfo table.

First, a short introduction to some concepts in MySQL Cluster. LCP stands
for local checkpoint; this is the mechanism by which the main memory content
of tables is written to disk at regular intervals. There is also a GCP, the
global checkpoint, a protocol that synchronises the REDO log writes and thus
provides durability of transactions. LDM stands for Local Data Manager and
is the type of thread that runs all the code maintaining the actual data in
MySQL Cluster. This includes row storage, REDO logging, the hash index, the
T-tree index (ordered index) and various internal triggering functions that
support high-level functionality such as unique indexes, foreign keys and
so forth.

A node restart involves a number of nodes: at a minimum two, but often more.
At least one node is a starting node and one live node always acts as the
master node. There can also be other live nodes that have already started
and that participate in some phases of the restart but not in all of them.
There can also be other starting nodes that have to be synchronised to some
extent with the starting node.

A node restart always begins with a node failing. This can be a controlled
failure, as used in e.g. a rolling upgrade of the MySQL Cluster software.
In that case there are usually multiple nodes set to fail at the same time;
often half the nodes are stopped at a time as part of a software upgrade.
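
As a point of reference, a controlled restart of a single data node is
normally driven from the NDB management client. A minimal sketch, assuming
a data node with node id 2 (the node id is just an example):

  ndb_mgm> 2 RESTART

or, non-interactively from the shell:

  ndb_mgm -e "2 RESTART"

The RESTART command stops the data node in a controlled manner and starts
it again, which triggers exactly the node restart phases described below.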

So the first of the node restart phases is to complete the node failure.
This involves handling in-progress transactions that involve the failed
node. It involves taking over as master if the failed node was a master
node. It also involves completing a number of protocols in the data nodes
where other nodes are waiting for the failed node. In 7.3 and older versions
this can at times be a fairly lengthy phase, in particular when taking over
the master role of the local checkpoint protocol. With the restart
improvements in MySQL Cluster 7.4 it should be a rare event that this phase
takes a long time.

The next node restart phase is that the crashed node starts up again. The
first thing the starting node does is attempt to allocate a node id. The
allocation of a node id has to wait for the node failure phase to complete
first, so this phase can take some time, but normally it proceeds very
quickly.
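
While a starting node is waiting here, the management client can be used to
see which node ids are in use and which nodes have already rejoined the
cluster. A small sketch (output omitted):

  ndb_mgm -e "SHOW"

The SHOW command lists the data nodes, management nodes and API nodes
together with their node ids and current status.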

When the new node has received a node id, the next step is to be included
in the heartbeat protocol. Only one node can enter the heartbeat protocol
every 3 seconds, so if many nodes start up at the same time a node can be
held waiting here until the cluster accepts it.

Once the node has been included in the heartbeat protocol, which is the
lowest node handling layer in MySQL Cluster, it asks for permission to be
included in the cluster at the next level of maintenance. The node can be
blocked from being included at this level, for example if other nodes are
still removing it from local checkpoints after an initial node restart, or
if other include-node protocols are still going on. So the node can be held
in this layer at times, but this is a rare event.

Only one node at a time can be in this phase; the next node is allowed to
proceed once the current node has completed this phase and the following
phase, where it receives a copy of the meta data in the cluster.

After being accepted into the cluster the node has to wait until the LCP
protocol has been paused. This is a new feature of MySQL Cluster 7.4, added
to avoid having to wait in this state until a local checkpoint completes
before proceeding. In 7.4 we can pause the checkpoint execution (for the
most part this does not block checkpoint execution, since a major part of
the checkpoint is off-loaded to the LDM threads, which can execute up to 64
local checkpoint fragments without getting any new requests from the master
node).

As soon as the node has managed to pause the local checkpoint execution, it
can start to receive the meta data from the master node. Copying the meta
data can take some time if there are very many tables in the cluster, but
normally it is a very quick step.

Once our node has copied the meta data, the next node can proceed with that
phase. Our node continues by being included in the global checkpoint
protocol and a number of other node handling protocols. This is normally a
quick phase.

Now that the node has been included in the protocols, the master node tells
it which fragments to restore from disk. This phase is usually also fast.

The node has now reached the first phase that normally takes a bit longer.
In this phase the node restores all requested fragments from a checkpoint
stored on disk. All LDM threads work in parallel on the restore, so the node
should be able to restore at a speed of at least a few million records per
second.
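
As a rough illustration of what that means (the numbers here are purely
hypothetical): with 4 LDM threads each restoring 1 million records per
second, 100 million records would be restored in about 100 / 4 = 25 seconds,
ignoring all other phases of the restart.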

After the fragment checkpoints have been restored, the next phase is to
execute the UNDO logs against all disk data fragments. If there are no disk
data tables in the cluster this phase is very quick, otherwise it can take
some time.

After restoring the fragments and applying the disk data UNDO log, the node
is ready to apply the logical REDO log. All LDM threads execute the REDO
logs in parallel. There are 4 phases of REDO log execution, but in reality
only the first phase is used. The time spent in this phase depends on how
frequently checkpoints were written before the crash: the faster we execute
checkpoints, the less REDO log needs to be executed as part of a node
restart.

Also here the node should be able to process some millions of log records
per second, since the execution runs in parallel in the LDM threads.

Now that a global checkpoint has been restored for all fragments, the node
is ready to rebuild all ordered indexes. Fragments are checkpointed with
only their data; indexes are not checkpointed, so they are rebuilt as part
of a node restart. The primary key hash index is rebuilt as part of
restoring the fragments and executing the REDO log. The ordered indexes are
rebuilt in a separate phase after the REDO log has been applied.

This is also a phase that can take some time. All LDM threads execute in
parallel in this phase; the more records and the more ordered indexes there
are, the longer it takes.

The node has now managed to restore an old but consistent version of the
fragments it was asked to restore.

The next phase is to synchronise this old data with the live data in the
nodes that are up and running and contain the latest version of each
record.

As a new feature of MySQL Cluster 7.4, this phase can be parallelised among
the LDM threads, so the node can synchronise with the data in the live nodes
using many threads in parallel.

This phase can be delayed when there are many tables and many concurrent
node restarts, since a lock on a fragment is needed before copying its meta
data to new starting nodes, and the same lock is needed before starting the
copy phase that synchronises the fragment with the live nodes.

At this point in time the node has an up-to-date version of all restored
fragments. It does not, however, have a durable version yet. For the
fragments to be restorable on the starting node, the node needs to
participate in a full local checkpoint.

Waiting for local checkpoints in various ways is the most common cause of
delayed restarts in 7.3 and older versions of MySQL Cluster. In MySQL
Cluster 7.4 we have managed to decrease the amount of waiting for local
checkpoints, and we have also been able to speed up local checkpointing in
general and more specifically while node restarts are executing. These two
things, plus the parallelisation of the synchronisation phase, are the main
reasons we managed to decrease restart times by a factor of 5.5 in MySQL
Cluster 7.4.

In order to ensure that as many nodes as possible get started together
after waiting for the same local checkpoint execution, we have the ability
to block the start of a local checkpoint if there are nodes that are close
to reaching this wait state. Given that we have statistics on how long the
restarts take, we can predict whether it is useful to wait for another node
to reach this state before we start a new local checkpoint.

Finally, when we have waited for a complete LCP, it is time for the final
phase, which is to wait for subscription handover to occur. This relates to
the MySQL replication support in MySQL Cluster, where all MySQL Servers in
the cluster have to be contacted and informed that the new node can also
generate replication events.

After this final phase the restart is completed.

As an example, we measured that a node restart of around 26 GByte of data
took around 3-4 minutes in MySQL Cluster 7.4.

The speed of node restarts is highly dependent on local checkpoint speed.
We have introduced a number of new configuration parameters to control this
speed, as well as a new Ndbinfo table to track the checkpoint speed. We will
describe these in a separate blog entry.
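
As a rough sketch of what such a configuration can look like, the disk
write speed parameters (also discussed in the comments below) go in the
[ndbd default] section of config.ini. The values here are only illustrative
and need to be tuned for your hardware:

  [ndbd default]
  MinDiskWriteSpeed=10M
  MaxDiskWriteSpeed=20M
  MaxDiskWriteSpeedOtherNodeRestart=50M
  MaxDiskWriteSpeedOwnRestart=200M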

Each of the phases described above is represented as a column in the
ndbinfo.restart_info table. We report the time spent in each phase, measured
in seconds. There is a row for each node restart that has been observed,
which means that not all nodes are present in this table. After an initial
start or after a cluster restart the table is even empty.
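
The table can be queried from any MySQL Server connected to the cluster,
for example (a minimal sketch; see the reference manual for the full list
of columns):

  SELECT * FROM ndbinfo.restart_info;

Each row describes one observed node restart, with one column per restart
phase giving the number of seconds spent in that phase.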

In a separate blog entry we will also describe the additional log printouts
that we have added to the node logs, describing what is going on in a node
during a node restart.

9 comments:

Kadaan said...

Is any of the info available when starting up an entire cluster through the ndbapi? The mysql api nodes won't be able to connect until the cluster is started.

Mikael Ronstrom said...

No information is available during a cluster restart for the reason you mentioned. However during a node restart information about ongoing node restarts is available.

digitalpoint said...

Looking forward to the blog entry about checkpoint speed config parameters.

Side note - for me, 60GB data nodes were taking about 50 minutes to restart under 7.3. Under 7.4, the first one is up to 70 minutes. Maybe a rolling restart from 7.3 to 7.4 is slower (hopefully that's what it is)? {shrug}

Mikael Ronstrom said...

Checkpoint parameters are important to set properly.
There are now 4 ones to consider:
MinDiskWriteSpeed
MaxDiskWriteSpeed
MaxDiskWriteSpeedOtherNodeRestart
MaxDiskWriteSpeedOwnRestart

The old parameter
DiskCheckpointSpeed is deprecated in 7.4.

The speed in normal operation goes between MinDiskWriteSpeed and MaxDiskWriteSpeed and
in a node restart all nodes speed up to at most
MaxDiskWriteSpeedOtherNodeRestart whereas the starting node can go all the way up to MaxDiskWriteSpeedOwnRestart.

Most of the speedups that make one jump a few wait states require that the master node (the oldest node) is on 7.4. So this won't be the case in a rolling upgrade from 7.3 to 7.4. So hopefully the next one should be faster, but it is still important to set the config parameters above to appropriate numbers.

digitalpoint said...

Just noticed the new parameters require a full cluster shutdown/restart...

MinDiskWriteSpeed
MaxDiskWriteSpeed
MaxDiskWriteSpeedOtherNodeRestart
MaxDiskWriteSpeedOwnRestart

"System restart: The cluster must be shut down completely, then restarted, to effect a change in this parameter."

I guess they stay at default for us. :)

Mikael Ronstrom said...

Good catch, they can definitely be changed as part of a node restart, so there is no need to shut down the cluster for this. I will post a bug report about the docs for this.

Also for the brave :) I added some undocumented DUMP codes to update those even in a live node
as part of the 7.4 development.

Read Backup::execDUMP_STATE_ORD for how to do it. These codes are mostly intended as a last resort; we haven't developed a proper interface for changing these kinds of things online yet. They can only be used against 7.4 nodes and will be ignored by 7.3 nodes.

digitalpoint said...

Just a followup... did a rolling restart today for the first time with everything already on 7.4. 60GB nodes restarting in about 20 minutes now. A massive improvement to be able to do all 8 nodes in 2.5 hours, vs. closer to 8 hours total under 7.3.

Now I just have to figure out why the data nodes are using so much disk i/o. No matter what's going on, they will use 100% of MaxDiskWriteSpeed. Maybe I'm just not understanding what it's intended for, but seems strange that if you set it to 400MB, ndbd uses 400MB constantly... set it to 20MB stuff still runs fine, but it only does 20MB disk i/o. What is it doing that it has to saturate whatever disk i/o is allowed? lol

Mikael Ronstrom said...

The comment about MaxDiskWriteSpeed requires a lengthy reply, which I will give in a new blog post instead. I was actually planning to write that blog anyway, so a new blog on the topic will arrive later today.

If you do an upgrade you can actually also upgrade more than 1 node at a time. Many phases (but not all) are running in parallel. So oftentimes people run upgrades of 8 nodes by restarting 4 nodes (1 node per node group) simultaneously. So this means that you can probably slice your upgrade time even down to 1 hour.

digitalpoint said...

Restarting more than 1 node at a time freaks me out a bit... I know it's strictly psychological, but still... hah