Friday, July 29, 2022

The world's first LATS benchmark

 LATS stands for low Latency, high Availability, high Throughput and scalable Storage. When testing an OLTP DBMS it is important to look at all those aspects. This means that the benchmark should test how the DBMS works in scenarios where data fits in memory, where data doesn't fit in memory. In addition tests should run measuring both throughput and latency. Finally it isn't enough to run the benchmarks while the DBMS operates in normal operation. There should also be tests that verify the performance when node fails and when nodes rejoin the cluster.

We have executed a battery of tests using RonDB, a key value store with SQL capabilities that makes it a complete LATS benchmark. We used the Yahoo! Cloud Serving Benchmark (YCSB) for this. These benchmark were executed using Amazon EC2 servers with 16 VCPUs, 122 GBytes, 2x1900 GB NVMe drives and with up to 10 Gb Ethernet. These virtual machines were used both for in-memory tests and tests of on-disk performance.

Link to full benchmark presentation. The full benchmark presentation contains lots of graphs and a lot more detail about the benchmarks. Here I will present a short summary of the results we saw.

YCSB contains 6 different workloads and there were tests of all 6 workloads in different aspects. In most workloads the average latency is around 600 microseconds and 99 percentile is usually around 1 millisecond and almost always below 2 milliseconds.

The availability tests starts by shutting down one of the RonDB data nodes. The ongoing transactions that are affected by this node failure will see a latency of up to a few seconds since node failure handling requires the transaction state to be rebuilt to decide if the transaction should be committed or aborted. New transactions can start as soon as the cluster has been reconfigured to remove the failed node. After discovering the node failures this reconfiguration only takes about one millisecond. The time to discover depends on how the failure occurs, if the failure is a software failure in the data node it will be discovered by the OS and the discovery is immediate since the connection is broken. If there is a HW failure this could lead to the heartbeat mechanism discovering the failure. The time to discover failures using heartbeats depends on the configured heartbeat interval.

After the node failure has been handled the throughput decreases around 10% and latency goes up by about 30%. The main reason for this is that we now have less data nodes to serve the reads. The impact will be higher with 2 replicas compared to using 3 replicas. When the recovery reaches the synchronisation phase where the starting node is synchronising its data with the live data nodes sees a minor decrease of throughput which actually leads to a shorter latency. Finally when the process is completed and the starting node can serve reads again the throughput and latency returns to normal levels.

Thus it is clear from those numbers that one should design the RonDB clusters with a small amount of extra capacity to handle node recovery, but it is not very high, a bit more if using 2 replicas compared to when using 3 replicas.

Performance when data doesn't fit in memory decreases significantly. The reason for this is that it is limited by how many IOPS the NVMe drives can sustain. We have done similar experiments a few years ago and saw that RonDB performance can scale to even 8 NVMe drives and handle read and write workloads of more than a GByte per second using YCSB. The HW development of NVMe drives is even faster than for CPUs, so this bottleneck is likely to diminish as HW development proceeds.

The latency for reads is higher, but the update latency is substantially higher for on-disk storage. The update latency at high throughput reaches up to 10 milliseconds. We expect latency and throughput to improve substantially using the new generation of VMs using substantially improved NVMe drives. It will be even more interesting to see how this performance improves when moving to PCI Express 5.0 NVMe drives.

No comments: