Thursday, February 13, 2020

NDB Cluster, the World's Fastest Key-Value Store

Using numbers produced already with MySQL Cluster 7.6.10 we have
shown that NDB Cluster is the world's fastest Key-Value store using
the Yahoo Cloud Serving Benchmark (YCSB) Workload A.

Presentation at Slideshare.net.

We reached 1.4M operations using 2 Data Nodes and 2.8M operations
using a 4 Data Node setup. All this using a standard JDBC driver.
Obviously using a specialised ClusterJ client will improve performance
further. These benchmarks was executed by Bernd Ocklin.

The benchmark was executed in the Oracle Cloud. Each Data Node used
a Bare Metal Server using DenseIO which have 52 CPU cores with
8 NVMe drives.

The MySQL Servers and Benchmark clients was executed on Bare Metal
servers with 2 MySQL Server per server (1 MySQL Server per CPU socket).
These Bare Metal servers contained 36 CPU cores each.

All servers used Oracle Linux 7.

YCSB Workload A means that 50% of the operations are reads that read the
full rows (1 kByte in size) and 50% perform updates of one of the fields
(100 bytes in size).

Oracle Cloud contains 3 different levels of domains. The first level is that
servers are placed in different Failure Domains within the same Availability
Domains. This means essentially that servers are not relying on the same
switches and power supplies. But they can still be in the same building.

The second level is Availability Domains that are in the same region, but each
Availability Domain is failing independently of the other Availability Domains
in the same region.

The third level is regions that are separated by long distances as well.

Most applications of NDB Cluster relies on a model that would use 2 or
more NDB Clusters in different regions, but each cluster contained inside
an Availability Domain. Next global replication between the NDB Clusters
is used for fail-over when one region or availability domain fails.

With Oracle Cloud one can also setup a cluster to have Data Nodes in
different Availability Domain. This increases the availability of the
cluster at the expense of higher latency for write operations. NDB Cluster
have configuration options to ensure that one always performs local
reads on either the same server or at least in the same Availability/Failure
Domain.

The Oracle Cloud have the most competitive real-time characteristics of
the enterprise clouds. Our experience is that the Oracle Cloud provides
2-4x better latency compared to other cloud vendors. Thus the Oracle
Cloud is perfectly suitable for NDB Cluster.

The DenseIO Bare Metal Servers or DenseIO VMs are suitable for
use for NDB Data Nodes or a NDB Data Nodes colocated with
MySQL Server. These servers have excellent CPU combined with
25G Ethernet links and extremely high performing NVMe drives.

This benchmark reported here stores the table as In-Memory tables.
We will later report on some benchmarks where we use a slightly
modified YCSB benchmark to show numbers when we instead use
Disk-based tables with much heavier update loads.

The Oracle Cloud contains a number of variants of Bare Metal servers
and VMs that are suitable for MySQL Servers and applications.

In NDB Cluster the MySQL Servers are actually stateless since all
the state is in the NDB Data Node. The only exception to this rule
is the MySQL Servers used for replication to another cluster that
requires disk storage for the MySQL binlog.

So usually a standard server can be setup without any special extra
disks for MySQL Servers and clients.

In the presentation we show the following important results.

The latency of DBMS operations is independent of the data size. We
get the same latency when Data set have 300M rows as when there are
600M rows.

We show that scaling to 8 Data Nodes with 4 Data Nodes in each Availability
Domains scales from 4 Data Nodes in the same Availability Domain. But
the extra latency increases the latency and this also some loss in throughput.
Still we reach 3.7M operations per second for this 8-node setup.

We show that an important decision for the cluster setup is the number of
LDM threads. These are the threads doing the actual database work. We get
best scalability when going for the maximum number of LDM threads which
is 32. Using 32 LDM threads can increase latency at low number of clients,
but when the clients increase the 32 LDM setup will scale much longer than
the 16 LDM thread setup.

In MySQL Cluster 8.0.20 we have made more efforts to improve scaling to
many LDM threads. So we expect the performance of large installations to
scale even further in 8.0.20.

The benchmark report above gives very detailed numbers of latency in various
situations. As can seen there we can handle 1.3M operations per second with
latency of reads below 1 ms and updates having latency below 2 ms!

Finally the benchmark report also shows the impact of various NUMA settings
on performance and latency. It is shown that Interlaced NUMA settings have
a slight performance disadvantage, but since it means that we get access to the
full DRAM and the full machine, it is definitely a good idea to use this setting.
In NDB Cluster this is the default setting.

The YCSB benchmark shows NDB Cluster in its home turf with enormous
throughput of key operations, both read and write, with predictable and low
latency.

Couple with the high availability features that have been proven in the field
with more than 15 years of continous operations with better than Class 6
availability we feel confident to claim that NDB Cluster is the World's
Fastest and Most Available Key-Value Store!

The YCSB benchmark is a standard benchmark, so any competing solution
is free to challenge our claim. We used a standard YCSB client of version
0.15.0 using a standard MySQL JDBC driver.

NDB Cluster supports full SQL through the MySQL Server, it can push joins
down to the NDB Data Nodes for parallel query filtering and joining.
NDB Cluster supports sharding transparently and complex SQL queries
executes cross-shard joins which most competing Key-Value stores don't
support.

One interesting example using NDB Cluster as a Key-Value Store is HopsFS
that implements a hierarchical file system based on Hadoop HDFS. It has been
shown to scale to 1.6M file operations per second and small files can be stored
in the NDB Cluster for low latency access to small files.

Monday, February 10, 2020

Benchmarking a 5 TB Data Node in NDB Cluster

Through the courtesy of Intel I have access to a machine with 6 TB of Intel
Optane DC Persistent Memory. This is memory that can be used both as
persistent memory in App Direct Mode or simply used as a very large
DRAM in Memory Mode.

Slides for a presentation of this is available at slideshare.net.

This memory can be bigger than DRAM, but has some different characteristics
compared to DRAM. Due to this different characteristics all accesses to this
memory goes through a cache and here the cache is the entire DRAM in the
machine.

In the test machine there was a 768 GB DRAM acting as a cache for the
6 TB of persistent memory. When a miss happens in the DRAM cache
one has to go towards the persistent memory instead. The persistent memory
has higher latency and lower throughput. Thus it is important as a programmer
to ensure that your product can work with this new memory.

What one can expect performance-wise is that performance will be similar to
using DRAM as long as the working set is smaller than DRAM. As the working
set grows one expects the performance to drop a bit, but not in a very significant
manner.

We tested NDB Cluster using the DBT2 benchmark which is based on the
standard TPC-C benchmark but uses zero latency between transactions in
the benchmark client.

This benchmark has two phases, the first phase loads the data from 32 threads
where each threads loads one warehouse at a time. Each warehouse contains
almost 500.000 rows in a number of tables.

The second phase executes the benchmark where a number of threads execute
transactions in parallel towards the database using 5 different transactions.

The result is based on how many new order transactions can be processed per
minute. Each such transaction report requires more than 50 SQL statements to be
executed where the majority is UPDATE's and SELECT FOR UPDATE.

Through experiments using the same machines with only DRAM it was
verified that performance running a benchmark with a working set smaller
than DRAM size the performance was within a few percent's margin the
same.

Next we performed benchmarks comparing results when running in a database
of almost 5 TB in size and comparing it to a benchmark that executed only on
warehouses that fit in the DRAM cache.

Our findings was that latency of DBT2 transactions increased by 10-12% when
using the full data set of the machine. However the benchmark was limited by
the CPUs available to run the MySQL Server and thus the throughput was
the same.

NDB Cluster worked like a charm during these tests. We found a minor issue in
the local checkpoint processing where we prefetched some cache lines that
wasn't going to be used. This had a negative performance effect, in particular
when loading. This is fixed in MySQL Cluster 8.0.20.

This benchmark proves two things. First that MySQL Cluster 8.0 works fine
with Intel Optane DC Persistent Memory in Memory Mode. Second it proves
that NDB can work with very large memories, here we tested with up to
more than 5 TB of data in a single data node. The configuration parameter
for DataMemory supports settings up to 16 TB. Beyond 16 TB there are some
constants in the checkpoint processing that would require tweaking. The
current product is designed to work very well up to 16 TB and even work
with even larger memories.

Thus with support for up to 144 data nodes and thus 72 node groups we can
support up to more than 1 PB of in-memory data. On top of this one can also
use disk data of even bigger sizes making it possible to handle multiple
PBs of data in one NDB Cluster.

Friday, January 31, 2020

Thursday, January 16, 2020

Preview of upcoming NDB Cluster benchmark


Just a fun image from running a benchmark in the Oracle Cloud. The image above
shows 6 hours of benchmark run in a data node on a Bare Metal Server. First creating
the disk data tablespaces, next loading the data and finally running the benchmark.

During loading the network was loaded to 1.8 GByte per second, the disks was writing
4 Gbyte per second. During the benchmark run the disks was writing 5 GByte per
second in addition to reading 1.5 Gbyte per second.

All this while CPUs were never loaded to more than 20 percent. Many interesting
things to consider when running benchmarks against modern disk drives.
Bottlenecks can appear in CPUs, in disk drives, in networks and of course it is
possible to create bottlenecks in your software. But pretty satisfied above in that
we're close to the physical limits of both network and disk drives.