Wednesday, January 31, 2018

Partial LCP in MySQL Cluster 7.6.4

Today MySQL Cluster 7.6.4 DMR is out. This new version contains some very interesting
new developments in the area of checkpointing.

When I developed the original NDB algorithms in the late 90s the computer I had access to
had 2 CPUs, 1 GByte of memory. At the time we were aiming at 10.000 updates per second.

So with average row sizes of 100 bytes this meant that we changed 1 MByte per second.
The local checkpoint algorithm was designed to be executed once per about 5 minutes.

So this meant that most of the database would be changed at high loads. So the
checkpointing algorithm writes the entire database. This means that a lot of updates
are merged together in the checkpoint.

This algorithm has been used now in NDB for 20 years and it still works fine.

Now HW is developing in a number of ways.

1) Memory is getting larger and larger. Today it is not uncommon to find machines
with TBytes of memory.

2) The ratio between available disk space is decreasing.

In the past it was not uncommon to have 100 times as much disk space as memory space.
With SSDs this factor have been decreased significantly. Particularly in servers where
NDB resides this factor have decreased to around 2-3 in many cases.

In addition the development of persistent memory is ongoing, this is likely to cause
memory to grow with a jump of another 4x or so. This means that even tens of TBytes
in a machine is likely to be common.

When starting the development of the new recovery algorithm in NDB 2 years ago the
requirement was thus to implement a new recovery algorithm that will handle
main memory sizes of up to at least 16 TByte of memory and with disk sizes that are
about 2x the memory size.

These requirements leads to the following conclusions:
1) We need to implement some form of incremental checkpoint algorithms.
2) We can only maintain one copy of the data on disk
3) We must have the ability to use REDO logs that are much smaller than the memory size

Currently in NDB a checkpoint is not completed until all data have been written. This
means that we must have disk space to handle up to at least 2x the memory size for
checkpoints.

During massive inserts (e.g. importing data), it is necessary to have a very large REDO log
to ensure that we don't run out of REDO log during import of a database.

These requirements are ok if there is sufficient disk space, but we wanted to make sure
that we don't rely on large disk spaces in the future.

In addition we wanted reengineer the LCP algorithm to take decisions locally and not
rely on all nodes participating in each decision about LCPs. This means that we can now
perform checkpoints locally in a node during restart without affecting LCPs in other
nodes. This is particularly interesting for initial node restarts where a new node will
take a very long time to execute an LCP whereas the live nodes can perform LCPs in
a very short time.

There were two main possible implementations for incremental checkpoints.
1) Use a standard page cache implementation also for main memory
This would mean that we would store two pages for each page in main memory
and write each page such that we always keep the old page until the LCP
is completed.

2) A partial LCP where a part of the rows are fully checkpointed and the rest only
checkpoints the changed rows.

I did analyse the two algorithms and concluded that the standard page cache
algorithm writes far too with small rows.

When the memory size is TBytes in size, the likelihood of one page having more
than write in a checkpoint is small, thus each row change will lead to one
page written in LCP.

With a row size of 100 bytes and a page size of 32 kBytes this would lead to
a waste of more than 300x of the disk bandwidth.

In addition it would still require 2x the disk space.

So the choice was taken to go with the partial LCP variant. Interestingly the
analysis of the standard page cache algorithms will be a problem for all
disk-based DBMSs. The growth to larger page caches will mean that more
and more disk bandwidth is spent on writing checkpoints.

So here is a short description of how the partial LCP algorithm works.

1) For each table partition we keep one or two LCP control files. This file
is normally 4 kBytes in size (can be 8 kBytes in some cases).
This file is used at recovery to know which checkpoint files to use in recovery,
it is also used at the next LCP to know which checkpoint files to write.

2) We keep track of the number of rows in a table partition and we keep
track of the number of row changes since the last LCP was started on the
table partition. These two numbers are used to decide on how many parts
of the table partition to fully checkpoint.

If the number of changes is 0, we only write a new control file.
If there are changes we will write at least 1 part and at most a full
local checkpoint that writes all 2048 parts.

3) The table partition is divided into parts based on the row id. We use the
page part of the row id to decide on which part a row id is part of.

4) The number of parts to write uses some mathematical formulas.
As it turns out there is an interesting relation here.
If we write less parts fully the work at recovery is increasing and the
size of all checkpoints increases but at the same time the amount of
writes to disk is decreasing.
With more parts written per checkpoint we increase the amount of
writes to disk per checkpoint, but we decrease the checkpoint size
and the amount of work at recovery.

We decided to make the choice here configurable in a new configuration
parameter called RecoveryWork. This can be set between 25 and 100 and
defaults to 50.

At 50 the checkpoint size will be around 1.6 times the data size. The
amount of checkpoint writes to do will be around 3 times the size of
the changed rows.

Setting it to 25 means that the checkpoint size will be around 1.35 times
the data size. The checkpoint writes will be around 4 times the size of
the changed rows.

Setting it to 100 means that the checkpoint size will be around 2.1 times
the data size and the checkpoint writes will be around 2 times the size
of the changed rows.

Thus there is an exponential dependency on the amount of checkpoint
writes required to achieve the minimum restart time.

We have selected a balanced approach as the default setting.

It is also possible to set EnablePartialLcp to 0. In this case we always
write full checkpoints if any row changed. This means that the checkpoint
size will be equal to the data size. In this case it isn't possible to use a
small REDO log since checkpoint write speed will ensure that we can
complete an LCP in a certain time. In this setup the REDO log should
be 2x the size of data size to ensure that we can handle survive even a
large import of a database.

The above calculation is based on calculation under the assumption on
that the amount of writes is very small per LCP compared to the data size.
The code contains large comments in Backup.cpp that explains this in
even more detail.

There is an additional 12.5% in the checkpoint size due to the fact that
we only delete files and a full checkpoint writes 8 files, so in the worst
case we might have to keep a file that contains only 1 part that is relevant
and the rest is not needed anymore, this means that 1/8 could in the worst
case be wasted space. Normally this would not be the case, but we want
to ensure that we can keep the disk space within limits all the time, even
in the worst case.




MySQL Cluster 7.6.4 is out

MySQL Cluster 7.6.4 DMR is out.

This new version contains a number of goodies.

1) Local checkpoint algorithm have been rewritten
The new checkpointing is designed to scale to at least 16 TBytes of DataMemory sizes
Checkpoints will be much faster, this decreases recovery times significantly
Table fragments that are not updated will not need any new checkpoints written
Checkpoint size on disk is significantly decreased

2) MySQL Cluster Configurator (MCC, Auto Installer)
MCC is significantly improved. Particularly for setups where you
have external computers either on-premise or in the cloud.

3) New cloud feature
In the cloud with availability domains/zones it is possible to have
10x difference between latency inside an AD compared to between
ADs. To run MySQL Cluster in a cloud region with synchronous
replication between ADs one can now configure nodes with a
LocationDomainId. This LocationDomainId will be used to
ensure that transaction coordinator is placed in the same AD and
that we always prefer reading data from our own AD if possible.

4) New ODirectSyncFlag
When using ODirect there are a number of file systems that ensures that writes
are also synched to disk. If the user knows that he is working in such an
environment setting ODirectSyncFlag can improve disk write speeds by
around 2-3x. This is particularly interesting when using hard drives.

5) Change default behaviour of restart configuration
We changed the BuildIndexThreads from 0 to 128 to improve speed of
index rebuilds. We added a new configuration setting to specify which
CPUs that can be used for index rebuilds.

We increased batch sizes (and made them configurable) to improve
performance of unique index creation and online add node and some
other algorithms.

We changed the default algorithm for initial node restart to rebuild indexes
in a special phase.

All these changes can lead to a very significant reduction in restore times.

6) Many improvements to our parallel query implementation (pushdown join,
SPJ). The improvement depends on the queries, but in our special benchmark
query we have improved latency of query execution to almost half.

7) Parallel UNDO log applier for disk columns
The phase where we apply the UNDO log is now fully parallelised over all
LDM threads. For a scenario with 4 LDM threads we've seen a speed up of
5x for the UNDO log applier phase (this is only used to UNDO changes in
pages containing disk columns during a restart).

8) Bug fixes
We have continued our quality improvements to ensure that each new version
is even more stable compared to the previous one.