I just read an article called Low-Overhead Asynchronous Checkpointing in
Main-Memory Database Systems. It was mentioned in a course in Database
Systems at Carnegie-Mellon University, see here.
In MySQL Cluster 7.6.4 we released a new variant of our checkpointing designed
for modern HW with TBytes of main memory. I think studying this implementation
will be very worthwhile both for users of NDB, but also for researchers in DBMS
implementations. It implements a new class of checkpoint algorithms that is currently
a research topic in the database research community.
It was interesting to compare our approach that I called Partial LCP with approaches
taken by other commercial in-memory databases and with the approach presented
in the paper.
LCP is Local CheckPoint which is the name we use for our checkpoint protocol
in NDB.
The course presents a number of ideal properties of a checkpoint implementation.
The first property is that doesn't slow down regular transaction processing.
In the case of NDB we execute checkpoints at a steady pace which consumes
around 5-10% of the available CPU resources. This will decrease even more with
the implementation in 7.6.
The second is that it doesn't introduce any latency spikes.
NDB checkpointing both new and old executes in steps of at most 10-20
microseconds. So there will be extremely small impact on latency of
transactions due to checkpointing.
The third property is that it doesn't require excessive memory overhead.
NDB checkpointing consumes a configurable buffer in each database thread. The
ideal size of this is around 1 MByte. In addition we have a REDO log buffer that
is usually a bit bigger than that. That is all there is to it. There is no extra memory
space needed for checkpointing rows. The checkpointing performs a normal scan
of the rows and copies the memory content to the buffer and as soon as the buffer
is full it writes it to disk using sequential disk writes.
It is fair to say that NDB does a good job in handling those ideal properties.
The course presents two variants called fuzzy checkpoints and consistent checkpoints.
The course defines fuzzy checkpoints as a checkpoint that can write uncommitted
data. I would normally use the term fuzzy checkpoint to mean that the checkpoint
is not consistent at a database level, but can still be consistent on a row basis.
Actually NDB is a mix of the definition provided in the course material. It is a
consistent checkpoint for each row. But different rows can be consistent at very
different points in time. So on a row basis NDB is consistent, but at the database
level the checkpoint is fuzzy. Thus to perform recovery one needs to install the
checkpoint and then apply the REDO log to get a consistent checkpoint restored.
Next the course presents two variants called Complete Checkpoints and Delta
Checkpoints. Complete Checkpoint means that the entire database is written in
each checkpoint. Delta Checkpoint means that only changes are written in a
checkpoint.
This is where MySQL Cluster 7.6 differs from 7.5. 7.5 uses a Complete Checkpoint
scheme. 7.6 uses a Partial Checkpoint scheme.
In my view the NDB variant is a third variant which is not complete and not a
Delta Checkpoint. Partial means that it writes the Delta, that is it writes all changes
since the last checkpoint. But it does also write a Complete Checkpoint for a part
of the database, thus the name Partial Checkpoint. Thus it is similar to an
incremental backup scheme.
NDB can divide the database up in up to 2048 parts, each checkpoint can write
0 parts (only if no changes occurred in the table partition since last checkpoint).
It can write 1 part if the number of writes is very small, it can write all 2048 parts
if almost all rows have been updated and it can write anywhere between 1 and
2048 based on how many rows were updated since last checkpoint.
Almost all commercial In-Memory DBMSs still use a complete checkpoint scheme.
As we move towards TBytes of memory this is no longer a plausible approach.
The NDB approach means that we can perform a checkpoint in a few minutes
even in a system with 16 TBytes of memory where we need to write about
8 GBytes plus the changes since the last checkpoint.
Thus NDB takes the step into a new world of massively large In-Memory DBMSs
with the introduction of MySQL Cluster 7.6 and its new Partial LCP implementation.
My new book "MySQL Cluster 7.5 inside and out" describes the LCP
implementation in 7.5, the description of the Partial LCP can be found in my blogs
and also some very detailed descriptions in the source code itself. Among other
things a 10-page proof of that the algorithm actually works :)
The nice thing with the Partial LCP approach in NDB is that it requires no
more work after writing the checkpoint. There is no need of merging checkpoints.
This happens automatically at recovery. There is some amount of overhead in
that the checkpoints can have some rows in multiple checkpoints and thus there is
some amount of overhead at recovery. We calculate the number of parts to use
based on the amount of changes. We even implemented a LCP simulator that
calculates the overhead while inserting and deleting large amounts of row
and has been used to find the proper configurable parameters for the algorithm.
Main-Memory Database Systems. It was mentioned in a course in Database
Systems at Carnegie-Mellon University, see here.
In MySQL Cluster 7.6.4 we released a new variant of our checkpointing designed
for modern HW with TBytes of main memory. I think studying this implementation
will be very worthwhile both for users of NDB, but also for researchers in DBMS
implementations. It implements a new class of checkpoint algorithms that is currently
a research topic in the database research community.
It was interesting to compare our approach that I called Partial LCP with approaches
taken by other commercial in-memory databases and with the approach presented
in the paper.
LCP is Local CheckPoint which is the name we use for our checkpoint protocol
in NDB.
The course presents a number of ideal properties of a checkpoint implementation.
The first property is that doesn't slow down regular transaction processing.
In the case of NDB we execute checkpoints at a steady pace which consumes
around 5-10% of the available CPU resources. This will decrease even more with
the implementation in 7.6.
The second is that it doesn't introduce any latency spikes.
NDB checkpointing both new and old executes in steps of at most 10-20
microseconds. So there will be extremely small impact on latency of
transactions due to checkpointing.
The third property is that it doesn't require excessive memory overhead.
NDB checkpointing consumes a configurable buffer in each database thread. The
ideal size of this is around 1 MByte. In addition we have a REDO log buffer that
is usually a bit bigger than that. That is all there is to it. There is no extra memory
space needed for checkpointing rows. The checkpointing performs a normal scan
of the rows and copies the memory content to the buffer and as soon as the buffer
is full it writes it to disk using sequential disk writes.
It is fair to say that NDB does a good job in handling those ideal properties.
The course presents two variants called fuzzy checkpoints and consistent checkpoints.
The course defines fuzzy checkpoints as a checkpoint that can write uncommitted
data. I would normally use the term fuzzy checkpoint to mean that the checkpoint
is not consistent at a database level, but can still be consistent on a row basis.
Actually NDB is a mix of the definition provided in the course material. It is a
consistent checkpoint for each row. But different rows can be consistent at very
different points in time. So on a row basis NDB is consistent, but at the database
level the checkpoint is fuzzy. Thus to perform recovery one needs to install the
checkpoint and then apply the REDO log to get a consistent checkpoint restored.
Next the course presents two variants called Complete Checkpoints and Delta
Checkpoints. Complete Checkpoint means that the entire database is written in
each checkpoint. Delta Checkpoint means that only changes are written in a
checkpoint.
This is where MySQL Cluster 7.6 differs from 7.5. 7.5 uses a Complete Checkpoint
scheme. 7.6 uses a Partial Checkpoint scheme.
In my view the NDB variant is a third variant which is not complete and not a
Delta Checkpoint. Partial means that it writes the Delta, that is it writes all changes
since the last checkpoint. But it does also write a Complete Checkpoint for a part
of the database, thus the name Partial Checkpoint. Thus it is similar to an
incremental backup scheme.
NDB can divide the database up in up to 2048 parts, each checkpoint can write
0 parts (only if no changes occurred in the table partition since last checkpoint).
It can write 1 part if the number of writes is very small, it can write all 2048 parts
if almost all rows have been updated and it can write anywhere between 1 and
2048 based on how many rows were updated since last checkpoint.
Almost all commercial In-Memory DBMSs still use a complete checkpoint scheme.
As we move towards TBytes of memory this is no longer a plausible approach.
The NDB approach means that we can perform a checkpoint in a few minutes
even in a system with 16 TBytes of memory where we need to write about
8 GBytes plus the changes since the last checkpoint.
Thus NDB takes the step into a new world of massively large In-Memory DBMSs
with the introduction of MySQL Cluster 7.6 and its new Partial LCP implementation.
My new book "MySQL Cluster 7.5 inside and out" describes the LCP
implementation in 7.5, the description of the Partial LCP can be found in my blogs
and also some very detailed descriptions in the source code itself. Among other
things a 10-page proof of that the algorithm actually works :)
The nice thing with the Partial LCP approach in NDB is that it requires no
more work after writing the checkpoint. There is no need of merging checkpoints.
This happens automatically at recovery. There is some amount of overhead in
that the checkpoints can have some rows in multiple checkpoints and thus there is
some amount of overhead at recovery. We calculate the number of parts to use
based on the amount of changes. We even implemented a LCP simulator that
calculates the overhead while inserting and deleting large amounts of row
and has been used to find the proper configurable parameters for the algorithm.
Hi Mikael,
ReplyDeleteWith this new checkpoint algorithm,the log is still flushed to persistent storage every second and in a system-wide failure we can still recover to everything less the last second of data as before, right? If the answer is yes, I am wondering in the flush if you flush everything in the log available at that time or up to the boundary of an epoch. In other words, if there are committed transactions in the last epoch that hasn't completed, do you flush them too or just everything up to second to the last epoch?
Alex
We flush at least up to the latest epoch, but also any pages that are ready
ReplyDeletefor writing. Even if a log record survives, it will only be restored if its
epoch is restored. So the restore will always restore a consistent checkpoint.