Saturday, June 02, 2018

MySQL Cluster 7.6 future proof

MySQL Cluster 7.6 is designed to improve the restart times
for database sizes that MySQL Cluster 7.5 and earlier versions
support.

At the same time MySQL Cluster 7.6 is preparing for the innovations
in HW architecture. Between 2008 and 2012 I was heavily involved
in handling the previous change in HW architecture. This change
was the introduction of multi-core architectures. Between 2008 and
2012 we scaled the MySQL Server from 4 CPUs to 64 CPUs.
The NDB data nodes was scaled from 2 CPUs to more than 50 CPUs in
the same timeframe.

The next major shift in the HW architecture is the introduction of
persistent memory, this means that we will get persistent memory
accessible at the same level as DRAM. We don't know yet all
characteristics we will see on those persistent memories, but a
guestimate on what to expect are:

1) About 10x more memory per DIMM
2) About 4x cheaper memory
3) About 10x slower access to the memory compared to DRAM
4) Memory will be persistent and survive a restart of the machine

As an example of this development Intel announced Optane persistent memory
to be fully available in 2019. These memory DIMMs will be available in 512 GB
DIMMs. A modern 2-socket server of today comes equipped with about 512 GB
memory. A high-end server of 2019 will be able to be shipped with 6 TByte of
persistent memory and on top of that also 400 GByte of DRAM memory.

MySQL Cluster 7.5 and earlier versions have a good fit for the modern servers
of today. MySQL Cluster 7.6 brings a much improved recovery architecture that
will improve restart times by 4x using current database sizes.

At the same time MySQL Cluster 7.5 won't work very well with machines with
6 TByte machines. This is due to the use of full checkpoints, thus each
checkpoint will have to write 6 TBytes to disk.

Actually most every in-memory DBMS have the same issue, so all in-memory
DBMSs have to adapt to this new reality. MySQL Cluster leads the way here
by introducing partial checkpoints in MySQL Cluster 7.6. Even disk-based
DBMS will get a fair amount of issues to handle around checkpointing
when the page cache grows to multi-TByte sizes.

During development of partial checkpoints I analyzed the difference between
the method implemented in MySQL Cluster 7.6 and using a page cache. The
method used in MySQL Cluster 7.6 needed to write 100x less data to disk as
part of checkpoints.

To give you a feeling for the impact of the checkpointing times in MySQL
Cluster I will describe what will happen with full checkpoints using
a 6 TByte database size.

Assume that we will write 100 MByte per second to disk for checkpoints.
In this case it will take 60.000 seconds to perform a checkpoint. This
means 16 hours and 40 minutes.

Now assume we perform a checkpoint in MySQL Cluster 7.6. In 7.6 we only
need to checkpoint those partitions that have changed any data. We assume
that half of the partitions in the database haven't changed since the last
checkpoint. We assume that we have a fair amount of updates, but since
checkpoints happen with intervals of around a minute, it means that only
a small portion of the 6 TByte will be updated. A partial checkpoint will
always checkpoint at least one part in 2048. This means that the minimum
size of a checkpoint in this scenario would be 1.5 GByte. Thus the
checkpoint will take 15 seconds in MySQL Cluster 7.6.

This means that the checkpoint time has decreased by a factor of 4000x
in this particular case.

The choice of the factor 2048 is to ensure that we can maintain very short
checkpoint times all the way up to memories of 16 TByte and even beyond this
it will still function very well.

Thus MySQL Cluster 7.6 is already prepared for the next generation of
HW architectures arriving in 2019.

As part of a node restart we perform a checkpoint and have to wait for the
previous checkpoint to complete. From this we can deduce that the improvement
of restart times is even bigger as we go towards bigger memories.

No comments: