Wednesday, August 12, 2020

Setting up NVMe drives on Oracle Cloud for NDB Cluster

In a blog post published on the 16th of January I showed some graphs of CPU usage, network usage and disk usage. The benchmark that was running was a modified variant of YCSB (Yahoo Cloud Serving Benchmark), based on YCSB version 0.15.0.


In this blog post I will describe the setup of the NVMe drives for this

benchmark using DenseIO machines in the Oracle Cloud.


Oracle Cloud has a variety of machines available. In this benchmark we wanted

to show NDB with a database size of around 20 TByte of user data in a replicated

setup.


There are numerous ways to attach disk drives to Oracle Cloud machines. One option is to use block storage. In that case the actual storage is on separate storage servers and all disk traffic goes over the network. The other option is to use DenseIO machines that have 8 NVMe drives with a total of 51.2 TByte of disk space.


In this benchmark we opted for the DenseIO machines, given that they can handle more than twice the disk load of block storage. Block storage is limited by its 25 Gb per second Ethernet connection, and in addition Oracle Cloud limits the number of IOPS allowed for block storage. So in our case the DenseIO machines were the obvious choice.


We used the Bare Metal variant called BM.DenseIO2.52.


This DenseIO machine has the following HW setup. It has 2 CPU sockets that each

contain an Intel Xeon Platinum 8167 that each have 26 CPU cores for a total of

52 CPU cores with 104 CPUs (my terminology is CPUs are contained in CPU cores

that are contained in CPU sockets that are contained in a computer/server).

It is equipped with 768 GB of memory. In addition it has 8 NVMe drives,

each with 6.4 TByte of disk space. Each such drive can achieve around 1-2 GByte/sec read/write speed depending on the workload.
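
This layout is easy to verify on the machine itself with standard Linux tools; a quick sanity check, not part of the setup:

#Show sockets, cores per socket and total CPUs
lscpu | grep -E 'Socket|Core|^CPU\(s\)'

#Show the amount of memory in GByte
free -g

#List the NVMe devices and their sizes
lsblk -d -o NAME,SIZE | grep nvme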


An NDB data node requires disks for the following things:

  1. Checkpoints of in-memory data
  2. REDO log for in-memory data and disk data records
  3. UNDO log for disk data records
  4. Tablespaces for disk data records


The first 3 use cases are all write-only in normal operation and read-only in recovery. Thus they mainly require disk bandwidth for large sequential writes, of sizes that use the disk bandwidth in an optimal manner. The in-memory data is in this case a very small part of the disk load.


We strive to reach beyond 1 GByte per second insert and update rates for each data node. For insert loads the main load comes from checkpointing the tablespaces for disk data records and writing the REDO log for disk data records.


For update loads there will also be a fair amount of UNDO log records to write. Thus

the worst case for the first three parts is when performing updates since then we have

to write the record to both the REDO log and to the UNDO log.


So now we come to the point of how to set up the 8 NVMe drives. OCI provides those 8 devices as bare devices, so we need to decide how to set them up to get a proper file system.


One approach would be to simply create one file system spanning all 8 drives. Given that the first 3 use cases are completely sequential in nature while the tablespace does lots of small reads and writes, we opted to split into at least 2 file systems.


At most, the sequential part will generate 2-3 GByte per second of disk writes (with updates at 1 GByte per second, the REDO log and the UNDO log together receive roughly 2 GByte per second, plus checkpoint writes), and thus 2 of the 8 NVMe drives will handle this nicely.


The commands to create this file system are the following:

#Make sure the mdadm tool is installed

sudo yum install mdadm -y


#Create a RAID0 device using the last 2 NVMe devices with a chunk size of 256 kByte, since
#most writes are fairly large sequential writes

sudo mdadm --create /dev/md0 --chunk=256 --raid-devices=2 \

           --level=raid0 /dev/nvme6n1 /dev/nvme7n1


#Create an XFS file system on this device

#Block size is 4kB since these are NVMe devices that have no reason to use a smaller block size

#The stripe unit is 256 kByte, aligned with the RAID chunk size (sunit and swidth are
#given in 512-byte sectors, so sunit=512 means 256 kByte)
#There are many parallel writes going on, so there is no need to parallelise further
#With 2 disks our stripe width is 2 * 256 kByte, thus 512 kByte (swidth=1024)

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=1024 /dev/md0


#Use /ndb as the directory for the sequential file system

sudo mkdir -p /ndb


#Mount the file system on this directory, no need to keep times on files and directories

sudo mount /dev/md0 /ndb -o noatime,nodiratime
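
#Optionally verify the stripe geometry the file system picked up
#(xfs_info reports sunit/swidth in 4 kByte blocks: sunit=64 blks is 256 kByte,
#swidth=128 blks is 512 kByte)
xfs_info /ndb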


#Ensure that the Oracle Cloud user and group opc owns this directory

sudo chown opc /ndb

sudo chgrp opc /ndb


Now we have a directory /ndb that we can use as the NDB data directory.
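
Note that neither the RAID device nor the mount is persistent across reboots as set up above. A sketch of how to make them persistent, assuming the usual Oracle Linux location of the mdadm configuration file (the UUID must be taken from the blkid output on the actual machine):

#Record the RAID configuration so the device is assembled at boot
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf

#Look up the UUID of the file system
sudo blkid /dev/md0

#Then add a line like this to /etc/fstab (the UUID is machine-specific)
#UUID=<uuid-from-blkid> /ndb xfs noatime,nodiratime 0 0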


The next step is to set up the file systems for the tablespace data for disk records. One way to do this is to follow the above scheme with a smaller chunk size and stripe unit.


This was also what I did as my first attempt. It had some issues. The 6 drives provide 6 * 6.4 TByte of raw space, and I created a tablespace file of around 35 TByte. Everything went well when loading the data, and it continued to run well during the first minutes of the benchmark. But after some time of running the benchmark, performance dropped to half.


The problem is that NVMe devices and SSD devices need some free space to prepare

for new writes. At the disk write speeds NDB achieves in this benchmark,

the NVMe devices simply couldn't keep up with preparing free space to write into.

Thus when the file system was close to full the performance started dropping.

It stabilised at around half the workload it could handle with a non-full tablespace.
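
This effect is easy to observe while the benchmark runs. One way to watch the per-device write bandwidth is iostat from the sysstat package; a monitoring aid, not part of the setup itself:

#Install the sysstat package if not already present
sudo yum install sysstat -y

#Report extended per-device statistics in MByte/sec every 5 seconds
iostat -xm 5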


So what to do? Google as usual found the answer. The solution was to use the parted tool to ensure that 40% of the disk space is not usable by the file system and is thus always available for new writes. This gave the NVMe devices sufficient time to prepare new writes, even in benchmarks that ran for many hours with a consistent load of many GBytes per second of disk writes.


Obviously, the more disk space that is removed from file system usage, the better disk bandwidth one gets, but there is also less disk space available for the application. In this case I was running a benchmark and wanted optimal performance, and thus used 60% of the disk space for the file system.


Using the full disk space cut performance in half; most of the performance would probably also be available with 80% of the disk used for the file system.


Here I also decided to skip using RAID across the devices. Instead I created 6 different file systems. This works with MySQL Cluster 8.0.20, where the use of the different tablespace files is spread on a round-robin basis.


So here are the commands to create those 6 file systems.

#Use the parted tool to create a partition covering only 60% of each device,
#leaving 40% of the space unusable by the file system

sudo parted -a opt --script /dev/nvme0n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme1n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme2n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme3n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme4n1 mklabel gpt mkpart primary 0% 60%

sudo parted -a opt --script /dev/nvme5n1 mklabel gpt mkpart primary 0% 60%


#Create 6 file systems, each with 4kB blocks and a 256 kByte stripe unit
#(a single device per file system, so swidth equals sunit)

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme0n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme1n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme2n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme3n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme4n1p1

sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 /dev/nvme5n1p1


#Create the 6 directories for the tablespace files

sudo mkdir -p /ndb_data1

sudo mkdir -p /ndb_data2

sudo mkdir -p /ndb_data3

sudo mkdir -p /ndb_data4

sudo mkdir -p /ndb_data5

sudo mkdir -p /ndb_data6


#Mount those 6 file systems

sudo mount /dev/nvme0n1p1 /ndb_data1 -o noatime,nodiratime

sudo mount /dev/nvme1n1p1 /ndb_data2 -o noatime,nodiratime

sudo mount /dev/nvme2n1p1 /ndb_data3 -o noatime,nodiratime

sudo mount /dev/nvme3n1p1 /ndb_data4 -o noatime,nodiratime

sudo mount /dev/nvme4n1p1 /ndb_data5 -o noatime,nodiratime

sudo mount /dev/nvme5n1p1 /ndb_data6 -o noatime,nodiratime


#Move ownership of the file systems to the Oracle Cloud user

sudo chown opc /ndb_data1

sudo chgrp opc /ndb_data1

sudo chown opc /ndb_data2

sudo chgrp opc /ndb_data2

sudo chown opc /ndb_data3

sudo chgrp opc /ndb_data3

sudo chown opc /ndb_data4

sudo chgrp opc /ndb_data4

sudo chown opc /ndb_data5

sudo chgrp opc /ndb_data5

sudo chown opc /ndb_data6
sudo chgrp opc /ndb_data6
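
For reference, the repetitive per-device commands above can be collapsed into a short shell loop; a minimal equivalent sketch, assuming the same device naming as above:

#Partition, format, mount and change ownership of all 6 devices in one loop
for i in 0 1 2 3 4 5; do
  dev=/dev/nvme${i}n1
  dir=/ndb_data$((i + 1))
  sudo parted -a opt --script $dev mklabel gpt mkpart primary 0% 60%
  sudo mkfs.xfs -b size=4096 -d sunit=512,swidth=512 ${dev}p1
  sudo mkdir -p $dir
  sudo mount ${dev}p1 $dir -o noatime,nodiratime
  sudo chown opc $dir
  sudo chgrp opc $dir
done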


When NDB has been started, the following commands are used to create the tablespace on these file systems. These commands will take some time, since NDB initialises the files to ensure that the disk space is really allocated and we cannot run out of disk space later.


#First create the UNDO log file group

#We set the size to 512G; we have plenty of disk space for logs, so there is no need
#to use a very small file. We allocate 4 GByte of memory for the UNDO
#log buffer; the machine has 768 GByte of memory, so memory
#is abundant and there is no need to save on UNDO buffer memory.

CREATE LOGFILE GROUP lg1

ADD UNDOFILE 'undofile.dat'

INITIAL_SIZE 512G

UNDO_BUFFER_SIZE 4G

ENGINE NDB;


#Next create the tablespace and the first data file

#We set the size to more than 3 TByte of usable space

CREATE TABLESPACE ts1

ADD DATAFILE '/ndb_data1/datafile.dat'

USE LOGFILE GROUP lg1

INITIAL_SIZE 3200G

ENGINE NDB;


#Now add the remaining 5 data files

#Each of the same size

ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data2/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data3/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data4/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data5/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;


ALTER TABLESPACE ts1

ADD DATAFILE '/ndb_data6/datafile.dat'

INITIAL_SIZE 3200G

ENGINE NDB;
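
Once these commands have completed, one can verify that the log file group and the data files are in place by querying the standard INFORMATION_SCHEMA.FILES table from any MySQL server connected to the cluster; a quick read-only check:

#List the NDB disk data files with their allocated size and free extents
SELECT FILE_NAME, FILE_TYPE,
       TOTAL_EXTENTS * EXTENT_SIZE / (1024*1024*1024) AS size_gb,
       FREE_EXTENTS
FROM INFORMATION_SCHEMA.FILES
WHERE ENGINE = 'ndbcluster';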


In the next blog post we will show how to write the configuration file for this

NDB Cluster.


This blog post showed a highly optimised setup for a key-value store that stores 20 TByte of user data in a 2-node replicated setup and can handle more than 1 GByte per second of inserts as well as around 1 GByte per second of updates.


From the graphs in the blog post one can see that performance is a function of the

performance of the NVMe drives and the available network bandwidth.

The CPU usage is never more than 20% of the available CPU power.