Wednesday, August 08, 2018

Loading data into TByte-sized NDB data nodes

One of the main design goals with MySQL Cluster 7.6 was to
support much larger data sets in each data node. The
checkpoint algorithm in 7.5 and earlier versions makes it
hard to manage data nodes with many hundreds of GBytes of
data. With MySQL Cluster 7.6.7 the algorithms scale to
multiple TBytes per data node.

However, the largest machines currently at my disposal have
1 TByte of RAM, so I set out to load data into a cluster
running on two such machines. There is no comparison to
older versions here: it is possible to load this much data
into earlier versions, but with such large data nodes it is
not very practical.

7.6.7 comes with both partial checkpoints and adaptive
control of disk write speed. This means that we can load
terabytes of data into the cluster even with moderately
sized REDO logs.

To experiment with these features I used the same test as in
the previous blog post.

Here is the setup.

2 data nodes, one per machine, where each machine is equipped
with 1 TByte of RAM and 60 CPU cores spread over 4 CPU sockets.

The NDB configuration used 750 GByte of DataMemory and 8 LDM threads,
and the data node only used CPUs from one CPU socket. It was essential
to set Numa=1 in the configuration to be able to use memory from
all four CPU sockets.

We used the default settings for the MinDiskWriteSpeed and MaxDiskWriteSpeed*
configuration parameters. The REDO log size was 2 GByte per LDM thread.
We thus set the following REDO log parameters:
NoOfFragmentLogParts=8 (equal to number of LDM threads)
NoOfFragmentLogFiles=8
FragmentLogFileSize=256M

To enable the new algorithm for adaptive disk write speed it is
necessary to set EnableRedoControl=1 in the configuration.
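
Putting these settings together, a minimal sketch of the data node
section of config.ini could look roughly as follows. The NoOfReplicas
and ThreadConfig lines are illustrative assumptions; the exact thread
configuration and CPU bindings used in the test are not listed above.

[ndbd default]
# Assumed: two data nodes forming one node group with two replicas
NoOfReplicas=2

# Memory settings discussed above
DataMemory=750G
Numa=1

# 8 LDM threads; this ThreadConfig is an illustrative assumption,
# the exact thread setup used in the test is not given above
ThreadConfig=ldm={count=8},tc={count=4},recv={count=2},send={count=2},main={count=1},rep={count=1}

# REDO log: 8 log parts with 8 files of 256 MByte each,
# that is 2 GByte of REDO log per log part (and LDM thread)
NoOfFragmentLogParts=8
NoOfFragmentLogFiles=8
FragmentLogFileSize=256M

# Adaptive control of disk write speed; MinDiskWriteSpeed and
# the MaxDiskWriteSpeed* parameters are left at their defaults
EnableRedoControl=1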

Interestingly, when loading the data we were limited by the read
bandwidth of the disk where the DBT2 CSV files were stored. We therefore
also created one CSV file per table using SELECT INTO OUTFILE
and used ndb_import to load the data into NDB. With ndb_import we were able
to load data faster than with LOAD DATA INFILE, since we could then
use files stored on RAIDed SSD drives.
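
As a rough illustration, the per-table export and import could look like
the sketch below. The file paths, the connect string and the CSV options
are assumptions; the ndb_import options simply have to match the format
that SELECT ... INTO OUTFILE produced.

-- On a MySQL server: dump one table to a CSV file on the RAIDed SSDs
SELECT * INTO OUTFILE '/data/ssd/dbt2/warehouse.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
FROM warehouse;

# From the command line: load the CSV file directly into NDB,
# bypassing the MySQL server (the table name is taken from the file name)
ndb_import dbt2 /data/ssd/dbt2/warehouse.csv \
  --ndb-connectstring=mgm_host:1186 \
  --fields-terminated-by=',' --fields-optionally-enclosed-by='"' \
  --lines-terminated-by='\n'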

One challenge with loading data through DBT2 is that the warehouse
id is used as the partition key. This means that all rows from one CSV file
go to a single LDM thread, so the load among the LDM threads isn't
balanced. This complicates the checkpointing scheme a bit: we ensure that
when any REDO log part becomes critical, all other LDM threads in the
cluster know that we are in a critical state.

Loading through ndb_import is thus easier since the load is balanced over
the LDM threads.

With 8 LDM threads we load 1-2 warehouses per second, and one warehouse is
around 100 MByte of data spread over about 370k rows.

In this test we loaded 6000 DBT2 warehouses, thus around 600 GByte of
data in the database.
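
At 1-2 warehouses per second and roughly 100 MByte per warehouse, that is
an ingest rate on the order of 100-200 MByte per second, so loading the
6000 warehouses takes somewhere between 50 and 100 minutes.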

The restart is very similar to the recovery after loading 600 warehouses.
Touching the memory now takes a bit longer: 4 minutes and 44 seconds,
plus another 28 seconds when starting the first recovery phases.

The restore phase took 27 minutes and 40 seconds, the REDO phase was still less
than a second and finally rebuilding the ordered indexes took 12 minutes and
13 seconds.

The copy fragment phase increased to 17 seconds since we needed to scan more
data this time. The phase waiting for checkpoints to complete was still 5
seconds, and the phase that waits for replication subscriptions to be
configured also took 5 seconds. The latter is usually around 5-6 seconds,
unless a MySQL server is down, in which case it can take up to 2 minutes
before this phase is completed.

Total restart time thus became 45 minutes and 32 seconds.
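
For reference, the phases discussed above add up to that total:

Touch memory: 4 minutes 44 seconds
Start of first recovery phases: 28 seconds
Restore: 27 minutes 40 seconds
Execute REDO log: less than 1 second
Rebuild ordered indexes: 12 minutes 13 seconds
Copy fragment: 17 seconds
Wait for checkpoints to complete: 5 seconds
Wait for replication subscriptions: 5 seconds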

As we can see, we can recover a TByte-sized data node within an hour, and this
is with 8 LDM threads. If we instead use 24 LDM threads, the restore and
rebuild index phases will go about 3 times faster, which cuts the restart time
by another 25 minutes. The restart time would then be about 20 minutes in this
case, and we would even be able to restart more than 2 TBytes within an hour.
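
As a rough sanity check on that estimate: the restore and rebuild index phases
account for close to 40 of the 45 minutes above, so running them about 3 times
faster would save roughly 26-27 minutes and leave a restart time of roughly
19-20 minutes.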
