Friday, July 29, 2022

The world's first LATS benchmark

LATS stands for low Latency, high Availability, high Throughput and scalable Storage. When testing an OLTP DBMS it is important to look at all of these aspects. This means that the benchmark should test how the DBMS works both in scenarios where data fits in memory and in scenarios where it doesn't. In addition, tests should measure both throughput and latency. Finally, it isn't enough to run the benchmarks while the DBMS is in normal operation. There should also be tests that verify the performance when a node fails and when nodes rejoin the cluster.

We have executed a battery of tests using RonDB, a key-value store with SQL capabilities, making this a complete LATS benchmark. We used the Yahoo! Cloud Serving Benchmark (YCSB) for this. These benchmarks were executed on Amazon EC2 servers with 16 VCPUs, 122 GBytes of memory, 2x1900 GB NVMe drives and up to 10 Gb Ethernet. These virtual machines were used both for the in-memory tests and for the tests of on-disk performance.

Link to full benchmark presentation. The full benchmark presentation contains lots of graphs and a lot more detail about the benchmarks. Here I will present a short summary of the results we saw.

YCSB contains 6 different workloads and all 6 were tested from different aspects. In most workloads the average latency is around 600 microseconds, and the 99th percentile is usually around 1 millisecond and almost always below 2 milliseconds.

The availability tests start by shutting down one of the RonDB data nodes. Ongoing transactions that are affected by this node failure will see a latency of up to a few seconds, since node failure handling requires the transaction state to be rebuilt to decide whether the transaction should be committed or aborted. New transactions can start as soon as the cluster has been reconfigured to remove the failed node. Once the node failure has been discovered, this reconfiguration only takes about one millisecond. The time to discover the failure depends on how it occurs. If it is a software failure in the data node, it is discovered by the OS and the discovery is immediate since the connection is broken. A HW failure may instead have to be discovered by the heartbeat mechanism, and the time to discover failures using heartbeats depends on the configured heartbeat interval.

After the node failure has been handled, throughput decreases by around 10% and latency goes up by about 30%. The main reason for this is that we now have fewer data nodes to serve the reads. The impact is higher with 2 replicas than with 3 replicas. When the recovery reaches the synchronisation phase, where the starting node synchronises its data with the live data nodes, we see a minor decrease in throughput, which actually leads to shorter latency. Finally, when the process is completed and the starting node can serve reads again, throughput and latency return to normal levels.
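For those who want to follow where a data node is in this failure and recovery cycle, the state can be checked from any MySQL server in the cluster. A minimal sketch, assuming the standard ndbinfo.nodes table that RonDB inherits from MySQL NDB Cluster:

  -- One row per data node: status shows e.g. STARTED or STARTING and
  -- start_phase is non-zero while a restarting node is still recovering.
  SELECT node_id, status, start_phase, uptime
  FROM ndbinfo.nodes;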

Thus it is clear from these numbers that one should design RonDB clusters with a small amount of extra capacity to handle node recovery. The required headroom is not very high, but a bit more is needed when using 2 replicas than when using 3 replicas.

Performance decreases significantly when data doesn't fit in memory. The reason is that performance is then limited by how many IOPS the NVMe drives can sustain. We did similar experiments a few years ago and saw that RonDB can scale to as many as 8 NVMe drives and handle read and write workloads of more than a GByte per second using YCSB. The HW development of NVMe drives is even faster than that of CPUs, so this bottleneck is likely to diminish as HW development proceeds.

Read latency is higher with on-disk storage, but update latency is substantially higher; at high throughput the update latency reaches up to 10 milliseconds. We expect latency and throughput to improve with the new generation of VMs that use substantially improved NVMe drives. It will be even more interesting to see how this performance improves when moving to PCI Express 5.0 NVMe drives.
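For readers who want to set up a similar on-disk test themselves, the tables need a tablespace backed by the NVMe drives and columns declared with disk storage. A rough sketch using standard NDB syntax; the file names, sizes and the YCSB-like schema are assumptions on my part, not the exact setup used in the benchmark:

  -- Undo log and tablespace placed on the NVMe drives (sizes are illustrative).
  CREATE LOGFILE GROUP lg1
    ADD UNDOFILE 'undo_1.log'
    INITIAL_SIZE 4G
    UNDO_BUFFER_SIZE 128M
    ENGINE NDBCLUSTER;

  CREATE TABLESPACE ts1
    ADD DATAFILE 'data_1.dat'
    USE LOGFILE GROUP lg1
    INITIAL_SIZE 256G
    ENGINE NDBCLUSTER;

  -- YCSB-style table with disk columns. The primary key and any indexed
  -- columns always stay in DataMemory; only non-indexed columns go to disk.
  CREATE TABLE usertable (
    ycsb_key VARCHAR(255) NOT NULL PRIMARY KEY,
    field0 VARCHAR(2048),
    field1 VARCHAR(2048)
  ) TABLESPACE ts1
    STORAGE DISK
    ENGINE NDBCLUSTER;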

Thursday, July 28, 2022

New stable release of RonDB, RonDB 21.04.8

Today we released a new version of RonDB 21.04, the stable release series of RonDB. RonDB 21.04.8 fixes a few critical bugs and adds two new features. See the docs for more details on this release.

Make it possible to use IPv4 sockets between ndbmtd and API nodes

In MySQL NDB Cluster all sockets have been converted to use the IPv6 format even when communicating over IPv4. This led to MySQL NDB Cluster no longer being able to interact with device drivers that only work with IPv4 sockets. This is the case for Dolphin SuperSockets.

Dolphin SuperSockets makes it possible to use extremely low latency HW to connect the nodes in a cluster and thus improve latency significantly. With this feature, RonDB 21.04.8 can make use of interconnect cards from Dolphin through Dolphin SuperSockets. RonDB has been tested and benchmarked using Dolphin SuperSockets, and we will soon release a benchmark report on this.

Two new ndbinfo tables to check memory usage

RonDB is now used by app.hopsworks.ai, a Serverless Feature Store. This means that thousands of users can share RonDB. To ensure this multi-tenant usage of RonDB works well, we have introduced two new ndbinfo tables that make it possible to track exactly how much memory a specific user is using. A user in Hopsworks is mapped to a project, and a project uses its own database in RonDB. Thus these two new tables make it possible to implement quotas both on the user level and on the Feature Group level.

Two new ndbinfo tables are created, ndb$table_map and ndb$table_memory_usage. ndb$table_memory_usage lists four properties for every table fragment replica:

- in_memory_bytes: the number of bytes used by the table fragment replica in DataMemory.
- free_in_memory_bytes: the number of those bytes that are free; these bytes are always in the variable sized part.
- disk_memory_bytes: the number of bytes used for disk columns, essentially the number of extents allocated to the table fragment replica times the size of the extents in the tablespace.
- free_disk_memory_bytes: the number of bytes free in the disk memory for disk columns.

Since each table fragment replica provides its own row, we use a GROUP BY on table id and fragment id and take the MAX of those columns to ensure we get only one row per table fragment, roughly as in the sketch below.
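A minimal sketch of that aggregation; the column names table_id and fragment_id in ndb$table_memory_usage are assumptions on my part, while the four byte counters are the ones listed above:

  -- Collapse the per-replica rows into one row per table fragment.
  SELECT table_id,
         fragment_id,
         MAX(in_memory_bytes)        AS in_memory_bytes,
         MAX(free_in_memory_bytes)   AS free_in_memory_bytes,
         MAX(disk_memory_bytes)      AS disk_memory_bytes,
         MAX(free_disk_memory_bytes) AS free_disk_memory_bytes
  FROM ndbinfo.ndb$table_memory_usage
  GROUP BY table_id, fragment_id;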

We want to provide the memory usage in memory and in disk memory per table or per database. However, a table in RonDB is spread out over several internal tables, and there are four places where a table can use memory. First, the table itself uses memory for rows and for a hash index; when disk columns are used this table also makes use of disk memory. Second, there are ordered indexes that use memory for the index information. Third, there are unique indexes that use memory for the rows in the unique index table (a unique index is simply a table with the unique key as primary key and the primary key as columns) and for the hash index of that table; this table is not necessarily colocated with the table itself. Finally, there are also BLOB tables that can contain a hash index, row storage and even disk memory usage.

The user isn't particularly interested in this level of detail, so we want to display memory usage for the tables and databases that the user sees. Thus we have to gather data for this, and the tool to gather the data is the new ndbinfo table ndb$table_map. Given a table id, this table lists the table name and database name. The table id can be that of a table, an ordered index, a unique index or a BLOB table, but ndb$table_map will always present the name of the actual table defined by the user, not the name of the index table or BLOB table.

Using those two tables we create two ndbinfo views. The first view, table_memory_usage, lists the database name, the table name and the above 4 properties for each table in the cluster. The second view, database_memory_usage, lists the database name and the 4 properties summed over all table fragments in all tables, including the BLOB and index tables RonDB creates on behalf of the user.
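A rough sketch of how the first view can be composed from the two base tables; the column names in ndb$table_map (table_id, database_name, table_name) are assumptions on my part:

  -- Map every internal table id (base table, index or BLOB table) back to
  -- the user-visible database and table name, then sum per user table.
  SELECT m.database_name,
         m.table_name,
         SUM(f.in_memory_bytes)   AS in_memory_bytes,
         SUM(f.disk_memory_bytes) AS disk_memory_bytes
  FROM ndbinfo.ndb$table_map AS m
  JOIN (
      -- one row per table fragment, as in the earlier sketch
      SELECT table_id,
             fragment_id,
             MAX(in_memory_bytes)   AS in_memory_bytes,
             MAX(disk_memory_bytes) AS disk_memory_bytes
      FROM ndbinfo.ndb$table_memory_usage
      GROUP BY table_id, fragment_id
  ) AS f ON f.table_id = m.table_id
  GROUP BY m.database_name, m.table_name;

database_memory_usage is then essentially the same result grouped on the database name alone.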

To make things a bit more efficient we keep track of all ordered indexes attached to a table internally in RonDB. Thus ndb$table_memory_usage lists the memory usage of a table plus the ordered indexes on that table; there will be no separate rows presenting the memory usage of an ordered index.

These two views make it easy for users to see how much memory they are using in a certain table or database, which is useful when managing a RonDB cluster.
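In practice a user, or the quota logic in Hopsworks, only needs a simple query against the views. A small example, reusing the assumed column names from the sketch above; 'project_fraud' is a hypothetical database name:

  -- Memory used per table in one project database, largest first.
  SELECT table_name, in_memory_bytes, disk_memory_bytes
  FROM ndbinfo.table_memory_usage
  WHERE database_name = 'project_fraud'
  ORDER BY in_memory_bytes DESC;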