Monday, February 10, 2020

Benchmarking a 5 TB Data Node in NDB Cluster

Through the courtesy of Intel I have access to a machine with 6 TB of Intel
Optane DC Persistent Memory. This memory can be used either as
persistent memory in App Direct Mode or as a very large extension of
DRAM in Memory Mode.

Slides for a presentation of this work are available at slideshare.net.

This memory can be larger than DRAM, but it has somewhat different
characteristics than DRAM. Due to these different characteristics, all accesses
to this memory go through a cache, and in Memory Mode that cache is the entire
DRAM of the machine.

In the test machine, 768 GB of DRAM acted as a cache for the 6 TB of
persistent memory. When an access misses in the DRAM cache, it has to go to
the persistent memory instead, which has higher latency and lower throughput
than DRAM. Thus it is important as a programmer to ensure that your product
works well with this new memory.

Performance-wise one can expect results similar to plain DRAM as long as the
working set is smaller than DRAM. As the working set grows beyond the DRAM
cache, performance is expected to drop a bit, but not in a very significant
manner.

We tested NDB Cluster using the DBT2 benchmark, which is based on the
standard TPC-C benchmark but uses zero think time between transactions in
the benchmark client.

This benchmark has two phases. The first phase loads the data using 32 threads,
where each thread loads one warehouse at a time. Each warehouse contains
almost 500,000 rows spread over a number of tables.

The second phase executes the benchmark, where a number of threads execute
transactions in parallel against the database using 5 different transaction types.

The result is reported as the number of new-order transactions processed per
minute. Each such transaction requires more than 50 SQL statements to be
executed, the majority of which are UPDATEs and SELECT ... FOR UPDATE
statements.
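
To give a feeling for this statement mix, the sketch below shows a few
statements of the kind a TPC-C style new-order transaction issues. The table
and column names follow the standard TPC-C schema and the values are
invented; the actual DBT2 implementation runs more statements than shown
here, so this is an illustration rather than the exact code used in the benchmark.

    -- Condensed illustration of the new-order statement mix (TPC-C style
    -- schema, invented values, some columns omitted for brevity).
    BEGIN;

    -- Lock the district row to fetch and bump the next order id.
    SELECT d_tax, d_next_o_id FROM district
      WHERE d_w_id = 1 AND d_id = 5 FOR UPDATE;
    UPDATE district SET d_next_o_id = d_next_o_id + 1
      WHERE d_w_id = 1 AND d_id = 5;

    -- Repeated for each of the 5-15 order lines: lock and update the
    -- stock row, then insert the order line.
    SELECT s_quantity FROM stock
      WHERE s_w_id = 1 AND s_i_id = 1234 FOR UPDATE;
    UPDATE stock SET s_quantity = s_quantity - 7, s_ytd = s_ytd + 7
      WHERE s_w_id = 1 AND s_i_id = 1234;
    INSERT INTO order_line (ol_o_id, ol_d_id, ol_w_id, ol_number,
                            ol_i_id, ol_quantity, ol_amount)
      VALUES (3001, 5, 1, 1, 1234, 7, 345.67);

    COMMIT;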

Through experiments on the same machine using only DRAM, it was verified
that, with a working set smaller than the DRAM size, performance was the same
within a margin of a few percent.

Next we compared results from running against a database of almost 5 TB in
size with results from a benchmark that touched only warehouses fitting in the
DRAM cache.

Our finding was that the latency of DBT2 transactions increased by 10-12%
when using the full data set of the machine. However, the benchmark was
limited by the CPUs available to run the MySQL Server, and thus the
throughput was the same.

NDB Cluster worked like a charm during these tests. We found a minor issue in
the local checkpoint processing where we prefetched some cache lines that
weren't going to be used. This had a negative performance effect, in particular
when loading. This is fixed in MySQL Cluster 8.0.20.

This benchmark proves two things. First, that MySQL Cluster 8.0 works fine
with Intel Optane DC Persistent Memory in Memory Mode. Second, that NDB
can work with very large memories; here we tested with more than 5 TB of
data in a single data node. The DataMemory configuration parameter supports
settings up to 16 TB. Beyond 16 TB there are some constants in the checkpoint
processing that would require tweaking, so the current product is designed to
work very well up to 16 TB, and it can work with even larger memories
as well.
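
As an illustration of how this is configured, the excerpt below shows how one
might size DataMemory in the cluster configuration file (config.ini) for a data
node on a machine like this. The values are an assumed example, not the exact
configuration used in this benchmark; in Memory Mode the persistent memory
simply appears to the operating system as ordinary RAM, so nothing
NDB-specific is needed beyond a large DataMemory setting.

    # Illustrative config.ini excerpt (assumed values, not the actual
    # configuration used in the benchmark).
    [ndbd default]
    NoOfReplicas=2
    # Give most of the 6 TB of Memory Mode memory to the data node;
    # DataMemory accepts settings up to 16 TB.
    DataMemory=5120G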

Thus, with support for up to 144 data nodes, and thus 72 node groups with two
replicas, we can support more than 1 PB of in-memory data (72 node groups ×
16 TB is roughly 1.15 PB). On top of this one can also use disk data of even
larger size, making it possible to handle multiple PBs of data in one NDB
Cluster.
