In a blog post published on the 16th of January I showed some graphs of CPU
usage, network usage and disk usage. The benchmark that was running was
a modified variant of YCSB (Yahoo Cloud Serving Benchmark), based on version
0.15.0 of YCSB.
In this blog post I will describe the setup of the NVMe drives for this
benchmark using DenseIO machines in the Oracle Cloud.
Oracle Cloud has a variety of machines available. In this benchmark we wanted
to show NDB with a database size of around 20 TByte of user data in a replicated
setup.
There are numerous ways to connect disk drives to Oracle Cloud machines. One
option is to use block storage. In this case the actual storage is on separate storage
servers and all disk traffic has to traverse the network. The other option is to
use DenseIO machines that have 8 NVMe drives with a total of 51.2 TByte of disk
space.
In this benchmark we opted for the DenseIO machines given that they can handle more
than twice the disk load compared to block storage. Block storage is limited by its
Ethernet connection at 25 Gb per second. In addition, the Oracle Cloud puts limits
on the number of IOPS that are allowed for block storage. So in our case the DenseIO
machines were the obvious candidate.
We used the Bare Metal variant called BM.DenseIO2.52.
This DenseIO machine has the following HW setup. It has 2 CPU sockets that each
contain an Intel Xeon Platinum 8167, each with 26 CPU cores, for a total of
52 CPU cores with 104 CPUs (my terminology is that CPUs are contained in CPU cores
that are contained in CPU sockets that are contained in a computer/server).
It is equipped with 768 GB of memory. In addition it has 8 NVMe drives,
each with 6.4 TByte of disk space. Each such disk is capable of achieving around
1-2 GB/sec read/write speed depending on the workload.
An NDB data node requires disks for the following things:
- Checkpoints of in-memory data
- REDO log for in-memory data and disk data records
- UNDO log for disk data records
- Tablespaces for disk data records
The first 3 use cases are all write-only in normal operation and read-only in recovery.
Thus they mainly require disk bandwidth for sequential writes, issued in sizes that
make optimal use of the disk bandwidth. The in-memory data is in this case a
very small part of the disk load.
We strive to reach beyond 1 GByte per second insert and update rates for each data
node. For insert loads the main load comes from checkpointing the tablespaces for
disk data records and from writing the REDO log for disk data records.
For update loads there will also be a fair amount of UNDO log records to write. Thus
the worst case for the first three parts is when performing updates, since then we have
to write the record to both the REDO log and to the UNDO log.
So now we come to the point of how to set up the 8 NVMe drives. OCI provides
those 8 devices as bare devices, so we need to decide how to set them up to get
a proper file system.
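Before setting anything up it is worth checking what the devices look like. A quick
way to list them is shown below; lsblk is available out of the box, while the nvme
command is an assumption on my part and requires the nvme-cli package.
#List all block devices; the 8 NVMe drives show up as /dev/nvme0n1 through /dev/nvme7n1
lsblk
#Optionally install nvme-cli and list the NVMe drives with model and capacity
sudo yum install nvme-cli -y
sudo nvme list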
One approach would be to simply create one file system spanning all 8 drives. Given that
the first 3 use cases are completely sequential in nature while the tablespace does
lots of small reads and writes, we opted to split into at least 2 file systems.
At most the sequential part will generate 2-3 GByte per second of disk writes and
thus 2 of the 8 NVMe drives will handle this nicely.
The commands to create this file system are the following:
#Make sure the mdadm tool is installed
sudo yum install mdadm -y
#Create a RAID0 device from the last 2 NVMe devices with a chunk size of 256 kByte, since
#most writes are fairly large sequential writes
sudo mdadm --create /dev/md0 --chunk=256 --raid-devices=2 \
--level=raid0 /dev/nvme6n1 /dev/nvme7n1
#Create an XFS file system on this device
#Block size is 4 kB; these are NVMe devices, so there is no reason to use a smaller block size
#sunit and swidth are specified in 512-byte sectors, so sunit=512 means a 256 kByte stripe
#unit, aligned to the chunk size; there are many parallel writes going on, so there is no
#need for a wider stripe unit to parallelise individual writes
#With 2 disks the stripe width is 2 * 256 kByte, thus swidth=1024 (512 kByte)
sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=1024 /dev/md0
#Use /ndb as the directory for the sequential file system
sudo mkdir -p /ndb
#Mount the file system on this directory; no need to track access times on files and directories
sudo mount /dev/md0 /ndb -o noatime,nodiratime
#Ensure that the Oracle Cloud user and group opc owns this directory
sudo chown opc /ndb
sudo chgrp opc /ndb
Now we have a directory /ndb that we can use for the NDB data directory.
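To verify the array and make the mount survive a reboot, something like the following
can be used. The /etc/fstab line is a sketch of my own; using a UUID from blkid
instead of the device path is more robust.
#Verify that the RAID0 array is up and running over the 2 devices
cat /proc/mdstat
#Make the mount persistent across reboots
echo '/dev/md0 /ndb xfs noatime,nodiratime 0 0' | sudo tee -a /etc/fstab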
The next step is to set up the file system for the tablespace data for disk records.
One way to do this is to follow the above scheme with a smaller chunk size and
stripe unit.
This was also what I did as my first attempt, but it had some issues. The 6 remaining
drives provide 6 * 6.4 TByte, and I created a tablespace file of around 35 TByte on
them. Everything went well while loading the data, and the benchmark continued to run
well for a few minutes. But after some time of running the benchmark, performance
dropped to half.
The problem is that NVMe and SSD devices need some free space to prepare
for new writes. At the disk write speeds NDB achieves in this benchmark,
the NVMe devices simply couldn't keep up with preparing free space to write into.
Thus when the file system was close to full the performance started dropping.
It stabilised at around half the workload it could handle with a non-full tablespace.
So what to do? Google as usual found the answer. The solution was to use the parted
tool to ensure that 40% of the disk space is not usable by the file system and thus
always available for new writes. This gave the NVMe devices a sufficient amount of
time to prepare new writes, even in benchmarks that ran for many hours with a
consistent load of many GBytes per second of disk writes.
Obviously the more disk space that is removed from file system usage, the better disk
bandwidth one gets, but there is also less disk space available for the application.
In this case I was running a benchmark and wanted optimal performance, and thus
used 60% of the disk space for the file system.
Using the full disk space cut performance in half; probably most of the performance
would still be available with 80% of the disk used for the file system.
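One caveat worth adding (my own note, not part of the original setup): leaving 40%
unpartitioned only helps if those blocks are actually unwritten, so on drives that
have seen prior use it can make sense to discard all blocks before partitioning.
#Discard all blocks on the device so the reserved 40% starts out empty
#WARNING: this erases everything on the drive
sudo blkdiscard /dev/nvme0n1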
Here I also decided to skip RAID for these devices. Instead I created 6 different
file systems. This will work in MySQL Cluster 8.0.20, where we will spread the use of
the different tablespace files on a round robin basis.
So here are the commands to create those 6 file systems.
#Use the parted tool to create a partition covering only 60% of each device, leaving 40% unused
sudo parted -a opt --script /dev/nvme0n1 mklabel gpt mkpart primary 0% 60%
sudo parted -a opt --script /dev/nvme1n1 mklabel gpt mkpart primary 0% 60%
sudo parted -a opt --script /dev/nvme2n1 mklabel gpt mkpart primary 0% 60%
sudo parted -a opt --script /dev/nvme3n1 mklabel gpt mkpart primary 0% 60%
sudo parted -a opt --script /dev/nvme4n1 mklabel gpt mkpart primary 0% 60%
sudo parted -a opt --script /dev/nvme5n1 mklabel gpt mkpart primary 0% 60%
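#Verify the result: each device should now show a single partition covering 60% of
#its capacity (verification step added by me, not in the original post)
sudo parted /dev/nvme0n1 print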
#Create 6 file systems, each with 4 kB blocks and a 256 kByte stripe unit
#(sunit/swidth are specified in 512-byte sectors)
sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme0n1p1
sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme1n1p1
sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme2n1p1
sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme3n1p1
sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme4n1p1
sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme5n1p1
#Create the 6 directories for the tablespace files
sudo mkdir -p /ndb_data1
sudo mkdir -p /ndb_data2
sudo mkdir -p /ndb_data3
sudo mkdir -p /ndb_data4
sudo mkdir -p /ndb_data5
sudo mkdir -p /ndb_data6
#Mount those 6 file systems
sudo mount /dev/nvme0n1p1 /ndb_data1 -o noatime,nodiratime
sudo mount /dev/nvme1n1p1 /ndb_data2 -o noatime,nodiratime
sudo mount /dev/nvme2n1p1 /ndb_data3 -o noatime,nodiratime
sudo mount /dev/nvme3n1p1 /ndb_data4 -o noatime,nodiratime
sudo mount /dev/nvme4n1p1 /ndb_data5 -o noatime,nodiratime
sudo mount /dev/nvme5n1p1 /ndb_data6 -o noatime,nodiratime
#Move ownership of the file systems to the Oracle Cloud user
sudo chown opc /ndb_data1
sudo chgrp opc /ndb_data1
sudo chown opc /ndb_data2
sudo chgrp opc /ndb_data2
sudo chown opc /ndb_data3
sudo chgrp opc /ndb_data3
sudo chown opc /ndb_data4
sudo chgrp opc /ndb_data4
sudo chown opc /ndb_data5
sudo chgrp opc /ndb_data5
sudo chown opc /ndb_data6
sudo chgrp opc /ndb_data6
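A quick sanity check at this point: each of the 6 file systems should report roughly
3.5 TiB of total size (60% of 6.4 TByte).
#Verify that all 6 tablespace file systems are mounted with the expected size
df -h /ndb_data1 /ndb_data2 /ndb_data3 /ndb_data4 /ndb_data5 /ndb_data6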
When NDB has been started, the following commands are used to create the
tablespace on these file systems. These commands will take some time since
NDB initialises the files to ensure that the disk space is really allocated,
to avoid running out of disk space later.
#First create the UNDO log file group
#We set the size to 512G; we have plenty of disk space for logs, so there is no need
#to use a very small file. We allocate 4 GByte of memory for the UNDO
#log buffer; the machine has 768 GByte of memory, so memory
#is abundant and there is no need to save on UNDO buffer memory.
CREATE LOGFILE GROUP lg1
ADD UNDOFILE 'undofile.dat'
INITIAL_SIZE 512G
UNDO_BUFFER_SIZE 4G
ENGINE NDB;
#Next create the tablespace and the first data file
#We set the size to more than 3 TByte of usable space
CREATE TABLESPACE ts1
ADD DATAFILE '/ndb_data1/datafile.dat'
USE LOGFILE GROUP lg1
INITIAL_SIZE 3200G
ENGINE NDB;
#Now add the remaining 5 data files
#Each of the same size
ALTER TABLESPACE ts1
ADD DATAFILE '/ndb_data2/datafile.dat'
INITIAL_SIZE 3200G
ENGINE NDB;
ALTER TABLESPACE ts1
ADD DATAFILE '/ndb_data3/datafile.dat'
INITIAL_SIZE 3200G
ENGINE NDB;
ALTER TABLESPACE ts1
ADD DATAFILE '/ndb_data4/datafile.dat'
INITIAL_SIZE 3200G
ENGINE NDB;
ALTER TABLESPACE ts1
ADD DATAFILE '/ndb_data5/datafile.dat'
INITIAL_SIZE 3200G
ENGINE NDB;
ALTER TABLESPACE ts1
ADD DATAFILE '/ndb_data6/datafile.dat'
INITIAL_SIZE 3200G
ENGINE NDB;
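Once the tablespace is created, the data files can be checked from the MySQL server
through the INFORMATION_SCHEMA.FILES table, which lists NDB disk data files with
their free and total extents. The query below is a sketch; EXTENT_SIZE is reported
in bytes.
#Check the 6 data files and how much free space each one has (in GBytes)
SELECT FILE_NAME,
       FREE_EXTENTS * EXTENT_SIZE / (1024*1024*1024) AS free_gb,
       TOTAL_EXTENTS * EXTENT_SIZE / (1024*1024*1024) AS total_gb
  FROM INFORMATION_SCHEMA.FILES
 WHERE TABLESPACE_NAME = 'ts1' AND FILE_TYPE = 'DATAFILE';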
In the next blog post we will show how to write the configuration file for this
NDB Cluster.
This blog post showed a highly optimised setup for a key-value store
that can store 20 TB in a 2-node setup and handle more than 1 GByte per second
insert rate as well as around 1 GByte per second update rate.
From the graphs in the earlier blog post one can see that performance is a function
of the performance of the NVMe drives and the available network bandwidth.
The CPU usage is never more than 20% of the available CPU power.