Wednesday, August 08, 2018

Analysis of restart improvements in MySQL Cluster 7.6.7

To test restart times I am using the DBT2 test suite that
I developed based on DBT2 0.37 since 2006.

The following test setup is used:
DataMemory: 100 GByte (90 Gbyte in 7.5 and earlier)
IndexMemory: 10 GByte in 7.5 and earlier versions)
MinDiskWriteSpeed=20M (two times the default)
MaxDiskWriteSpeed=40M (two times the default)

I load the database using LOAD DATA FROM INFILE with
precreated CSV files that contains the data for one
warehouse and table. This means that there are 8 CSV
files per warehouse. I load 600 warehouses into NDB.
This means a database size of around 60 GByte with
around 280 million rows.

Next I run the DBT2 from 2 MySQL servers using 16 threads
each for two minutes. This creates a load of almost
100k TPM.

The next step is to restart one of the 2 data nodes and
measure the time it takes to restart the data node.

Obviously an even more realistic benchmark would be to
restart while running the benchmark, but the effect on
the restart analysis would not be substantial.

I tested using the latest GA versions from 7.4 and 7.5
and also both 7.6 GA versions (7.6.6 and 7.6.7).

First the results:
7.4.21: 31 minutes 29 seconds
7.5.11: 44 minutes 9 seconds
7.6.6:  18 minutes 2 seconds
7.6.7:  4 minutes 45 seconds

In order to understand better these numbers we will look
at the most important restart phases and analyse numbers

It takes about 3 seconds to stop a node and start it again.
This time is constant in all versions.

The next step is allocating memory and touching the memory.
Allocating memory doesn't actually commit the memory to RAM.
It only ensures that there is space in RAM or in swap file
allocated for the memory. So in order to commit the memory
to RAM, it is necessary to touch the memory (read or write
from it). The speed of this touching of memory is fairly
constant and depends on Linux version (slight speedup in
newer Linux versions). My measurements shows that this
touching of memory handles about 2.5-3.5 GByte of memory
per second. Thus the restart time is dependent on the
DataMemory size and other memory consuming parts of the
NDB data node.

NDB data nodes always allocate and commit all the memory
as part of the restart. It is even possible to lock the
memory to RAM through setting LockPagesInMemory to 1 in
the configuration.

This step takes 26 seconds for all versions.

A major step in the recovery is to recreate the database that
was in the data node at the time of the node stop. This is
performed in 3 phases.

1) Restore data from a checkpoint
2) Execute REDO log
3) Rebuild ordered indexes

The time of all these phases are dependent on the version.
In 7.3 and earlier versions there was also a lot of time
spent waiting for the metadata lock when copying the
metadata to the starting node. This meant waiting for the
current checkpoint to complete (a checkpoint in 7.3 with
these settings and the database size takes about 20 minutes).

Thus 7.3 would add approximately 10 minutes to the restart times.

After restoring the local database from disk the next major
phase is the synchronisation phase. This phase will take longer
time if there has been updates during the restart. The time spent
in this phase is not expected to have changed in any material
fashion in 7.6.

The final phase is to wait for one checkpoint to complete to
ensure that the node is recoverable even if the other node
should fail completely.

Restore phase in 7.4.21 and 7.5.11 only takes about 5-10 seconds.
The reason is that the last checkpoint completed happened early
in the load phase. Thus almost the entire database has to be
loaded from the REDO log.

The numbers on the restore phase plus the REDO phase is
3 minutes and 48 seconds in 7.4.21 and 3 minutes and 30 seconds
in 7.5.11.

In 7.6.6 the restore phase takes considerably longer and thus
the REDO phase is shortened. 7.6.6 makes partial checkpoints and
can thus write checkpoints a bit faster. But the disk write speed
is too slow to keep up with the insert rate. Actually setting
the MaxDiskWriteSpeed to 100M in 7.6.6 speeds up restarts by a
factor of 3. The time for the restore phase in 7.6.6 is
1 minute and 12 seconds and the REDO phase is 2 minutes and
20 seconds. Thus the total time of these two phases are
3 minutes and 32 seconds.

So what we can conclude here is that 7.6.6 requires a higher
setting of the disk write speed to materially improve the
restart times in these two phases for restarts during massive

Now the restore phase in 7.6.7 recovers almost all the data
since checkpoints are executed with 15-20 seconds intervals.
The restore phase consumes 2 minutes and 48 seconds and the
REDO phase takes less than one second. The speed up of these
two comes from that the restore phase is faster per row
compared to executing the REDO log although more data has to
be restored in 7.6.

Next we analyse the phase that rebuilds the ordered indexes.
The change here actually comes from configuration changes and
the ability to lock the index build threads to more CPUs.

In 7.6 we changed the default of BuildIndexThreads to 128.
This means that each fragment that requires rebuild of an
index for a table can be executed in parallel. The default
in 7.5 and earlier meant that all rebuild of indexes happened
in LDM threads. Thus in this case the 7.5 and 7.4 versions
could use 8 CPUs to rebuild indexes while 7.6 could use
16 CPUs to rebuild indexes. The parallelisation of
rebuild indexes can happen in 7.5, but 7.6 ensures that we
lock to more CPUs compared in 7.5.

This change meant that the times of 7.4.21 (2 minutes
19 seconds) and 7.5.11 (2 minutes 12 seconds) was significantly
improved. In 7.6.6 the time was 1 minutes 20 seconds and in
7.6.7 1 minutes and 19 seconds. Thus a significant improvement
of this phase in 7.6.

The synchronisation phase takes 1-2 seconds in all versions.
Since no changes happened during the restart this second is spent
in scanning the 280 million rows to see if any changes have
occurred during the restart (happens in live node).

Now we come to the phase where the big change in 7.6.6 happened
and where it happens even more in 7.6.7. This phase is where
we wait for a checkpoint to complete that we participated in.

Actually this means first waiting for the ongoing checkpoint to
complete and next to participate in another checkpoint.

The time to execute a checkpoint is fairly constant in 7.5 and
earlier versions. In this particular setup it takes about 22
minutes. This means that this wait phase can take anywhere
between 22 minutes and 44 minutes dependent on the timing of
the restart.

This is the reason why 7.4.21 restarted so much faster than
7.5.11. It was pure luck in timing the checkpoints.

In 7.6.6 the time to execute checkpoints is much lower compared
to in 7.5. Thus this phase is much shorter here. However the time
for a checkpoint varies dependent on how much changes have happened
since the last checkpoint. In this particular case the phase took
12 minutes and 26 seconds.

With 7.6.7 we adapt checkpoint speed to ensure that we can
survive even with very small REDO logs. The second reason
for this is to make checkpoints much faster. During execution of
this benchmark no checkpoints took more than 25 seconds and most
of them took about 15-20 seconds and when the DBT2 benchmark
executed it took about 10-15 seconds. In idle mode a checkpoint
is executed within a few seconds.

This means that waiting for a checkpoint to complete and execute
a new one is very fast. In this benchmark it took only 5 seconds.

Thus in this particular restarts of 7.6.7 was almost 10 times
faster compared to 7.4 and 7.5 and even 4 times faster than
restarts of 7.6.6.

Thus most of the restart times are now linearly dependent on the
database size and the size of the ordered indexes.

There are more phases in NDB restarts that can consume time, for
instance with disk data we have to UNDO disk data changes. This
phase was improved 5 times with 4 LDM threads in 7.6. There
are also various steps where occasionally the restart can be blocked
due to metadata locks and other reasons.

No comments: