Saturday, March 17, 2018

NDB Cluster and disk columns

NDB is mainly an In-memory database. We have however also the possibility to
store non-indexed columns on disk. This data uses a page cache as any
other normal disk-based DBMS.

Interestingly with the increases of memory sizes one could think that
disk data becomes less important for MySQL Cluster. The answer is actually
the opposite.

The reason is again the HW development. NDB is designed with predictable
latency as a very basic requirement. In the past disks meant hard drives. Access
time to a hard disk was several milliseconds at best. Given that our requirement
was to handle complex transactions within 10 milliseconds disk data storage
was out of the question.

Modern HW is completely different, they use SSD devices, first attached through
the SATA interface that enabled up to around 500 MByte per second and
a few thousand IO operations per second (IOPS). The second step was the
introduction of SSD devices on the PCI bus. This lifted the performance up to more
than  1 GByte per second. These devices are extremely small and still very powerful.
I have an Intel NUC at home that has two of those devices.

Thus the performance difference between disk storage and RAM has decreased.

The next step on the way was to change the storage protocol and introduce NVMe
devices. These still use the same HW, but use a new standard that is designed for
the new type of storage devices. Given those devices we have now the ability to
execute millions of IOPS on a standard server box with access times of a few tens
of microseconds.

For NDB this means that this HW fits very well into the NDB architecture. The work
we did on developing the Partial LCP algorithm did also a lot of work on improving
our disk data implementation. We see more and more people that use disk data
columns in NDB.

The next step is even more interesting, this will bring storage into the memory bus and
access times of around one microsecond. For NDB this disk storage can be treated as
memory to start with, thus making it possible to soon have multiple TBytes of memory
in standard boxes.

Thus HW development is making the NDB engine more and more interesting to use.

One notable example that uses disk data columns in NDB is HopsFS. They use the
disk data columns to store small files in the meta data server of the HopsFS
implementation of the Hadoop HDFS Name Server. This means much faster
access to small files. The tests they did showed that they could handled hundreds
of thousands of file reads and writes per second even using fairly standard SSD disks
on the servers.

The implementation of disk data in NDB is done such that each row can have three
parts. The fixed memory part that is accessed quickly using a row id. The variable
sized part that is accessed through a pointer from the fixed size part.

The disk columns are also accessed through a reference in the fixed size part. This
reference is an 8-bit value that refers to the page id and page index of the disk
columns.

Before we can access those pages we go through a page cache. The page cache was
implemented on caching techniques that was state of the art a few years ago.

The idea is quite simple. The page cache uses a normal hot page queue. Pages are
brought up in this queue when they are accessed. A single access will bring it up,
but to be more permanent in the page cache a page has to be accessed several times.

Now each page is represented in those queues by a page state record. The basis
of the page cache algorithm is that a page can be represented in a page state
record even if the page is not in the page cache.

NDB has a configuration variable called DiskPageBufferEntries, by default this is
set to 10. It is the multiplication factor of how many more pages we have
page state records compared to the amount of pages we have in the page cache.

So for example if we have set DiskPageBufferMemory to 10 GByte and we have
set DiskPageBufferEntries we will have page state records that holds pages of
100 GBytes in the queues. Thus even when a page is paged out we keep it in the
list and thus we can see patterns of reuse that are longer than the page cache
we have access to. The factor of 10 means that the page state records are of
about 3% of the size of the page cache itself. Thus the benefits of the extra
knowledge about page usage patterns comes at a fairly low cost. The factor
10 is configurable.

Many cloud servers comes equipped with hundreds of GBytes (some even TBytes)
and can also store a number of TBytes on NVMe devices. NDB is well suited
for those modern machines and MySQL Cluster 7.6 have been designed to be
suitable for this new generation of HW.

No comments: