Saturday, March 17, 2018

NDB Cluster and disk columns

NDB is mainly an In-memory database. We have however also the possibility to
store non-indexed columns on disk. This data uses a page cache as any
other normal disk-based DBMS.

Interestingly with the increases of memory sizes one could think that
disk data becomes less important for MySQL Cluster. The answer is actually
the opposite.

The reason is again the HW development. NDB is designed with predictable
latency as a very basic requirement. In the past disks meant hard drives. Access
time to a hard disk was several milliseconds at best. Given that our requirement
was to handle complex transactions within 10 milliseconds disk data storage
was out of the question.

Modern HW is completely different, they use SSD devices, first attached through
the SATA interface that enabled up to around 500 MByte per second and
a few thousand IO operations per second (IOPS). The second step was the
introduction of SSD devices on the PCI bus. This lifted the performance up to more
than  1 GByte per second. These devices are extremely small and still very powerful.
I have an Intel NUC at home that has two of those devices.

Thus the performance difference between disk storage and RAM has decreased.

The next step on the way was to change the storage protocol and introduce NVMe
devices. These still use the same HW, but use a new standard that is designed for
the new type of storage devices. Given those devices we have now the ability to
execute millions of IOPS on a standard server box with access times of a few tens
of microseconds.

For NDB this means that this HW fits very well into the NDB architecture. The work
we did on developing the Partial LCP algorithm did also a lot of work on improving
our disk data implementation. We see more and more people that use disk data
columns in NDB.

The next step is even more interesting, this will bring storage into the memory bus and
access times of around one microsecond. For NDB this disk storage can be treated as
memory to start with, thus making it possible to soon have multiple TBytes of memory
in standard boxes.

Thus HW development is making the NDB engine more and more interesting to use.

One notable example that uses disk data columns in NDB is HopsFS. They use the
disk data columns to store small files in the meta data server of the HopsFS
implementation of the Hadoop HDFS Name Server. This means much faster
access to small files. The tests they did showed that they could handled hundreds
of thousands of file reads and writes per second even using fairly standard SSD disks
on the servers.

The implementation of disk data in NDB is done such that each row can have three
parts. The fixed memory part that is accessed quickly using a row id. The variable
sized part that is accessed through a pointer from the fixed size part.

The disk columns are also accessed through a reference in the fixed size part. This
reference is an 8-bit value that refers to the page id and page index of the disk
columns.

Before we can access those pages we go through a page cache. The page cache was
implemented on caching techniques that was state of the art a few years ago.

The idea is quite simple. The page cache uses a normal hot page queue. Pages are
brought up in this queue when they are accessed. A single access will bring it up,
but to be more permanent in the page cache a page has to be accessed several times.

Now each page is represented in those queues by a page state record. The basis
of the page cache algorithm is that a page can be represented in a page state
record even if the page is not in the page cache.

NDB has a configuration variable called DiskPageBufferEntries, by default this is
set to 10. It is the multiplication factor of how many more pages we have
page state records compared to the amount of pages we have in the page cache.

So for example if we have set DiskPageBufferMemory to 10 GByte and we have
set DiskPageBufferEntries we will have page state records that holds pages of
100 GBytes in the queues. Thus even when a page is paged out we keep it in the
list and thus we can see patterns of reuse that are longer than the page cache
we have access to. The factor of 10 means that the page state records are of
about 3% of the size of the page cache itself. Thus the benefits of the extra
knowledge about page usage patterns comes at a fairly low cost. The factor
10 is configurable.

Many cloud servers comes equipped with hundreds of GBytes (some even TBytes)
and can also store a number of TBytes on NVMe devices. NDB is well suited
for those modern machines and MySQL Cluster 7.6 have been designed to be
suitable for this new generation of HW.

Friday, March 16, 2018

Discovering rows that have been updated since last checkpoint

One important problem that requires a solution is to decide whether
a row has been updated since the last checkpoint or not.

Most implementations use some kind of mechanism that requires extra
memory resources and/or CPU resources to handle this.

NDB uses the fact that each row is already stamped with a timestamp.
The timestamp is what we call a global checkpoint id. A new global
checkpoint is created about once every 2 seconds (can be faster or
slower by configuration).

Thus we will overestimate the number of rows written since last checkpoint
with a little bit, but with checkpoints taking a few minutes, the extra overhead
of this is only around 1%.

Thus when we scan rows we check the global checkpoint id of the row, if
it is bigger than the global checkpoint that the last checkpoint had fully
covered we will write the row as changed since last checkpoint. Actually
we also have the same information on the page level, thus we can check
the page header and very quickly scan past an entire page if it hasn't been
updated since last checkpoint.

The same type of scanning is used also to bring a restarting node up to
synch with the live node. This algorithm has been present in NDB since
MySQL 5.1.

Partial LCPs and Read-only tables

In MySQL Cluster 7.5 we use Complete Checkpoints. In MySQL Cluster 7.6
we implement an approach where we only checkpoint a part of the database
in each checkpoint.

A special case is a checkpoint of a table partition where no changes
at all have happened since the last checkpoint. In this case we implemented
a special optimisation such that it is not necessary to checkpoint anything
at all for this table partition. It is only necessary to write a new LCP
control file which is 4 kBytes in size for each table partition (can grow to
8 kBytes if the recovery will require more than 980 checkpoints to
recover.

This means that if your database contains a large set of read-only tables,
there will be no need to checkpoint those tables at all. This feature
is used also when setting EnablePartialLcp to false.

Partial LCPs and disk space

One of the main objectives of the new Partial LCP algorithm in MySQL
Cluster 7.6 is to keep up with the development of modern HW.

I have already described in previous blogs how Partial LCP can handle
nicely even database sizes of 10 TBytes of memory with a very modest
load on the disk devices.

Now modern HW has shifted from using hard drives to using SSDs.

The original approach in NDB is assuming that the checkpoints and
REDO logs are stored on hard drives. In MySQL Cluster 7.5 the
disk space required for the REDO log is that it is a bit larger than the
DataMemory size. The reason is that we want to survive also when
loading massive amounts of data.

In MySQL Cluster 7.5 we cannot remove any checkpoint files until
a checkpoint is fully completed. This means that we require around
4x the memory size of disk space for REDO logs and checkpoints.

With hard drives this is not a problem at all. As an example my
development box has 32 GBytes of memory with 2 TByte of disk
space. Thus 64x more disk space compared to the memory space.

With modern servers this size difference between memory and
disks is decreasing. For example many cloud VMs only have
a bit more than 2x the disk size compared to the memory size.

So one goal of MySQL Cluster 7.6 is to fit in much less disk
space.

The aim is to solve this with a three-thronged approach.

1) Partial LCP means that we can execute the checkpoints much
faster. Since REDO logs only need to be kept for around two
checkpoints this means a significant decrease of size requirements
for REDO logs. The aim is to only need around 10% of the disk
space of memory for the REDO logs. This work is not completed
in 7.6.4. As usual there are no guarantees when this work will be
completed.

2) Using Partial LCP we can throw away old LCP files as soon
as we have created a new recoverable LCP for the table partition.
Thus it is no longer necessary to store 2 LCPs on disk. At the
same time there is some overhead related to Partial LCPs. By default
setting this overhead is 50% plus a bit more. Thus we should always
fit within about 1.6x times the memory size.

It is possible to set EnablePartialLcp to false, in this case all
checkpoints will be Complete Checkpoints. This means more
writes to disk for checkpoints, but it will decrease the storage
space to around the same as the memory size.

3) Using CompressedLCP set to 1 we can decrease LCP storage
by another factor of 2-3x (usually around 2.7x). This feature has
existed for a long time in NDB.

Thus it should be possible to significantly decrease the requirements
on storage space when running NDB using MySQL Cluster 7.6.

NDB Checkpoints vs Disk-based checkpoints

At one point in the development of the Partial LCP algorithm I started
wondering how it compares to an approach where one uses a standard
page-based checkpointing as happens in a traditional disk-based DBMS.

The scenario I thought was a case where one have a 10 TByte memory in
the system. In this case the likelihood of updating the same row twice
in a checkpoint is very small (except for hotspot rows of course).

With a row size of 250 bytes there will be 50 billion rows in the database.
I often assume that the checkpoint takes about 5 minutes in my modelling.
This means that even with 1 million writes per second less than 1% of the
data in the database is updated during a checkpoint.

A more normal update speed would be e.g. 100.000 writes per second.
In this case each checkpoint will write 10 Mbytes of rows per second
and thus about 6 GBytes is written in 5 minutes. This represents the
Delta, the changed records.

In addition in the Partial LCP implementation a part of the database
must also be written. In this case we have written 0.075% of the database
since the last checkpoints. The defaults requires thus that we select
a number of parts that ensures that at least 0.15% of the database is
fully written. Thus 2 of the 2048 parts will be written and thus the
total checkpoint size will be 22.5 GByte. To write this in 5 minutes
we need to write 75 Mbyte per second checkpoint data.

Now let us do the same experiment with a traditional disk-based
DBMS. We assume a standard page size of 8 kBytes. This means
that the page cache will have 1.28 billion pages. Thus with 100.000
updates per second we will update 36 M pages. Thus around 3% of
the pages. Thus it is very unlikely that any large number of pages
have more than one page write.

Thus a checkpoint must write each and every one of those 36M pages.
This means a total checkpoint size of 288 GByte. This means that the
DBMS must write almost 1 Gbyte of checkpoints per second, thus
more than 10x the amount NDB will write using the Partial LCP
algorithm.

In NDB it is possible to write even less during a checkpoint by setting
the configuration parameter RecoveryWork higher. Setting this to
its maximum size 100 means in the above calculation that we
only need to write 15 GByte per checkpoint and thus checkpoint
speed is 50 MBytes per second.

The drawback of this setting is that we increase the work at recovery
instead. The overhead stored on disk with setting 100 is 2x and this
overhead will amount to the same overhead at recovery time. The
default setting is 50% overhead in storage and at recovery.

It is possible to set it lower as well, down to 25%. In this case we will
write more, in the example we would write 37.5GByte and thus
125 MByte per second. So still 8x better than the disk-based
DBMS. In this case the overhead in storage is 25% and similarly
the overhead at recovery.

Although the overhead for restoring checkpoints is higher using
Partial LCP, the recovery will be a lot faster in 7.6. Recovery
contains running one LCP as part of recovery. This LCP
can be 100x faster compared to executing an LCP in 7.5.
Thus recovery will often be significantly faster using 7.6.

Also the disk storage for LCPs is decreased in 7.6.

NDB Checkpoints and research on In-Memory Databases

I just read an article called Low-Overhead Asynchronous Checkpointing in
Main-Memory Database Systems. It was mentioned in a course in Database
Systems at Carnegie-Mellon University, see here.

In MySQL Cluster 7.6.4 we released a new variant of our checkpointing designed
for modern HW with TBytes of main memory. I think studying this implementation
will be very worthwhile both for users of NDB, but also for researchers in DBMS
implementations. It implements a new class of checkpoint algorithms that is currently
a research topic in the database research community.

It was interesting to compare our approach that I called Partial LCP with approaches
taken by other commercial in-memory databases and with the approach presented
in the paper.

LCP is Local CheckPoint which is the name we use for our checkpoint protocol
in NDB.

The course presents a number of ideal properties of a checkpoint implementation.

The first property is that doesn't slow down regular transaction processing.

In the case of NDB we execute checkpoints at a steady pace which consumes
around 5-10% of the available CPU resources. This will decrease even more with
the implementation in 7.6.

The second is that it doesn't introduce any latency spikes.

NDB checkpointing both new and old executes in steps of at most 10-20
microseconds. So there will be extremely small impact on latency of
transactions due to checkpointing.

The third property is that it doesn't require excessive memory overhead.

NDB checkpointing consumes a configurable buffer in each database thread. The
ideal size of this is around 1 MByte. In addition we have a REDO log buffer that
is usually a bit bigger than that. That is all there is to it. There is no extra memory
space needed for checkpointing rows. The checkpointing performs a normal scan
of the rows and copies the memory content to the buffer and as soon as the buffer
is full it writes it to disk using sequential disk writes.

It is fair to say that NDB does a good job in handling those ideal properties.

The course presents two variants called fuzzy checkpoints and consistent checkpoints.
The course defines fuzzy checkpoints as a checkpoint that can write uncommitted
data. I would normally use the term fuzzy checkpoint to mean that the checkpoint
is not consistent at a database level, but can still be consistent on a row basis.

Actually NDB is a mix of the definition provided in the course material. It is a
consistent checkpoint for each row. But different rows can be consistent at very
different points in time. So on a row basis NDB is consistent, but at the database
level the checkpoint is fuzzy. Thus to perform recovery one needs to install the
checkpoint and then apply the REDO log to get a consistent checkpoint restored.

Next the course presents two variants called Complete Checkpoints and Delta
Checkpoints. Complete Checkpoint means that the entire database is written in
each checkpoint. Delta Checkpoint means that only changes are written in a
checkpoint.

This is where MySQL Cluster 7.6 differs from 7.5. 7.5 uses a Complete Checkpoint
scheme. 7.6 uses a Partial Checkpoint scheme.

In my view the NDB variant is a third variant which is not complete and not a
Delta Checkpoint. Partial means that it writes the Delta, that is it writes all changes
since the last checkpoint. But it does also write a Complete Checkpoint for a part
of the database, thus the name Partial Checkpoint. Thus it is similar to an
incremental backup scheme.

NDB can divide the database up in up to 2048 parts, each checkpoint can write
0 parts (only if no changes occurred in the table partition since last checkpoint).
It can write 1 part if the number of writes is very small, it can write all 2048 parts
if almost all rows have been updated and it can write anywhere between 1 and
2048 based on how many rows were updated since last checkpoint.

Almost all commercial In-Memory DBMSs still use a complete checkpoint scheme.
As we move towards TBytes of memory this is no longer a plausible approach.

The NDB approach means that we can perform a checkpoint in a few minutes
even in a system with 16 TBytes of memory where we need to write about
8 GBytes plus the changes since the last checkpoint.

Thus NDB takes the step into a new world of massively large In-Memory DBMSs
with the introduction of MySQL Cluster 7.6 and its new Partial LCP implementation.

My new book "MySQL Cluster 7.5 inside and out" describes the LCP
implementation in 7.5, the description of the Partial LCP can be found in my blogs
and also some very detailed descriptions in the source code itself. Among other
things a 10-page proof of that the algorithm actually works :)

The nice thing with the Partial LCP approach in NDB is that it requires no
more work after writing the checkpoint. There is no need of merging checkpoints.
This happens automatically at recovery. There is some amount of overhead in
that the checkpoints can have some rows in multiple checkpoints and thus there is
some amount of overhead at recovery. We calculate the number of parts to use
based on the amount of changes. We even implemented a LCP simulator that
calculates the overhead while inserting and deleting large amounts of row
and has been used to find the proper configurable parameters for the algorithm.