Friday, July 29, 2022

The world's first LATS benchmark

LATS stands for low Latency, high Availability, high Throughput and scalable Storage. When testing an OLTP DBMS it is important to look at all of those aspects. This means that the benchmark should test how the DBMS works both in scenarios where the data fits in memory and in scenarios where it doesn't. In addition, tests should measure both throughput and latency. Finally, it isn't enough to run the benchmarks while the DBMS is in normal operation. There should also be tests that verify the performance when a node fails and when nodes rejoin the cluster.

We have executed a battery of tests using RonDB, a key value store with SQL capabilities, that together make this a complete LATS benchmark. We used the Yahoo! Cloud Serving Benchmark (YCSB) for this. These benchmarks were executed using Amazon EC2 servers with 16 VCPUs, 122 GBytes of memory, 2x1900 GB NVMe drives and up to 10 Gb Ethernet. These virtual machines were used both for the in-memory tests and for the tests of on-disk performance.

Link to full benchmark presentation. The full benchmark presentation contains lots of graphs and a lot more detail about the benchmarks. Here I will present a short summary of the results we saw.

YCSB contains 6 different workloads and all 6 workloads were tested in the different scenarios described above. In most workloads the average latency is around 600 microseconds and the 99th percentile is usually around 1 millisecond and almost always below 2 milliseconds.

The availability tests start by shutting down one of the RonDB data nodes. The ongoing transactions that are affected by this node failure will see a latency of up to a few seconds, since node failure handling requires the transaction state to be rebuilt to decide whether the transaction should be committed or aborted. New transactions can start as soon as the cluster has been reconfigured to remove the failed node. After the node failure has been discovered, this reconfiguration only takes about one millisecond. The time to discover the failure depends on how it occurs. If the failure is a software failure in the data node it will be discovered by the OS, and the discovery is immediate since the connection is broken. A HW failure, on the other hand, may have to be discovered by the heartbeat mechanism, and the time to discover failures using heartbeats depends on the configured heartbeat interval.

After the node failure has been handled, throughput decreases by around 10% and latency goes up by about 30%. The main reason for this is that we now have fewer data nodes to serve the reads. The impact is higher with 2 replicas compared to using 3 replicas. When the recovery reaches the synchronisation phase, where the starting node synchronises its data with the live data nodes, we see a minor decrease in throughput, which actually leads to a shorter latency. Finally, when the process is completed and the starting node can serve reads again, throughput and latency return to normal levels.

Thus it is clear from those numbers that one should design RonDB clusters with a small amount of extra capacity to handle node recovery. The required margin is not very high, and it is a bit higher when using 2 replicas compared to when using 3 replicas.

Performance decreases significantly when the data doesn't fit in memory. The reason is that performance is then limited by how many IOPS the NVMe drives can sustain. We did similar experiments a few years ago and saw that RonDB performance can scale to as many as 8 NVMe drives and handle read and write workloads of more than a GByte per second using YCSB. The HW development of NVMe drives is even faster than that of CPUs, so this bottleneck is likely to diminish as HW development proceeds.

The read latency is higher for on-disk storage, and the update latency is substantially higher; at high throughput the update latency reaches up to 10 milliseconds. We expect latency and throughput to improve substantially with the new generation of VMs that use substantially improved NVMe drives. It will be even more interesting to see how this performance improves when moving to PCI Express 5.0 NVMe drives.

Thursday, July 28, 2022

New stable release of RonDB, RonDB 21.04.8

Today we released a new version of RonDB 21.04, the stable release series of RonDB. RonDB 21.04.8 fixes a few critical bugs and adds two new features. See the docs for more details on this released version.

Make it possible to use IPv4 sockets between ndbmtd and API nodes

In MySQL NDB Cluster all sockets have been converted to use the IPv6 format even when IPv4 sockets are used. This meant that MySQL NDB Cluster could no longer interact with device drivers that only work with IPv4 sockets. This is the case for Dolphin SuperSockets.

Dolphin SuperSockets make it possible to use extremely low latency HW to connect the nodes in a cluster and thus improve latency significantly. This feature makes it possible for RonDB 21.04.8 to make use of interconnect cards from Dolphin through Dolphin SuperSockets. RonDB has been tested and benchmarked using Dolphin SuperSockets, and we will soon release a benchmark report on this.

Two new ndbinfo tables to check memory usage

RonDB is now used by app.hopsworks.ai, a Serverless Feature Store. This means that thousands of users can share RonDB. To ensure that this multi-tenant usage of RonDB works well, we have introduced two new ndbinfo tables that make it possible to track exactly how much memory a specific user is using. A user in Hopsworks is mapped to a project and a project uses its own database in RonDB. Thus those two new tables make it possible to implement quotas both on the user level and on the Feature Group level.

Two new ndbinfo tables are created, ndb$table_map and ndb$table_memory_usage. The ndb$table_memory_usage table lists four properties for all table fragment replicas:

  1. in_memory_bytes: the number of bytes used by the table fragment replica in DataMemory
  2. free_in_memory_bytes: the number of those bytes that are free; these bytes are always in the variable sized part
  3. disk_memory_bytes: the number of bytes used by the disk columns, essentially the number of extents allocated to the table fragment replica times the size of the extents in the tablespace
  4. free_disk_memory_bytes: the number of free bytes in the disk memory for the disk columns

Since each table fragment replica provides one row, we use a GROUP BY on table id and fragment id and take the MAX of those columns to ensure that we only get one row per table fragment.
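As a rough sketch, that deduplication step could look like the query below. The column names table_id and fragment_id are assumptions made for illustration; the exact column names in ndb$table_memory_usage may differ.

  -- Collapse the per-replica rows into one row per table fragment.
  -- Column names table_id and fragment_id are assumed for illustration.
  SELECT table_id,
         fragment_id,
         MAX(in_memory_bytes)        AS in_memory_bytes,
         MAX(free_in_memory_bytes)   AS free_in_memory_bytes,
         MAX(disk_memory_bytes)      AS disk_memory_bytes,
         MAX(free_disk_memory_bytes) AS free_disk_memory_bytes
    FROM ndbinfo.ndb$table_memory_usage
   GROUP BY table_id, fragment_id;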

We want to provide the memory usage in memory and in disk memory per table or per database. However, a table in RonDB is spread out over several internal tables, and there are four places where a table can use memory. First, the table itself uses memory for rows and for a hash index; when disk columns are used this table also makes use of disk memory. Second, there are ordered indexes that use memory for the index information. Third, there are unique indexes that use memory for the rows in the unique index (a unique index is simply a table with the unique key as primary key and the primary key as columns) and for the hash index of the unique index table. This table is not necessarily colocated with the table itself. Finally, there are also BLOB tables that can contain a hash index, row storage and even disk memory usage.
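As an illustration only, here is a hypothetical NDB table whose definition touches all four kinds of memory consumers described above. The table, its columns and the tablespace ts_1 are made-up names, and the tablespace is assumed to already exist.

  -- Hypothetical example table; the tablespace ts_1 is assumed to exist.
  CREATE TABLE feature_group (
    fg_id      BIGINT NOT NULL PRIMARY KEY,   -- row storage + hash index in the base table
    name       VARCHAR(100) NOT NULL,
    created_at DATETIME NOT NULL,
    stats      BLOB,                          -- stored in a separate BLOB table
    payload    VARBINARY(4000) STORAGE DISK,  -- disk column, uses disk memory
    UNIQUE KEY uk_name (name),                -- backed by a separate unique index table
    KEY ix_created (created_at)               -- ordered index, in-memory only
  ) ENGINE=NDBCLUSTER TABLESPACE ts_1;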

The user isn't particularly interested in this level of detail, so we want to display information about memory usage for the tables and databases that the user sees. Thus we have to gather data for this. The tool to gather the data is the new ndbinfo table ndb$table_map. This table lists the table name and database name given a table id. The table id can be the table id of a table, an ordered index, a unique index or a BLOB table, but the map will always present the name of the actual table defined by the user, not the name of the index table or BLOB table.
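A hedged sketch of how this mapping can be combined with the per-fragment usage, so that index and BLOB tables are accounted to the user-visible table; database_name and table_name are again assumed column names for illustration.

  -- Sum memory usage per user-visible table, attributing index and BLOB
  -- tables to their parent table via ndb$table_map.
  SELECT m.database_name,
         m.table_name,
         SUM(f.in_memory_bytes)   AS in_memory_bytes,
         SUM(f.disk_memory_bytes) AS disk_memory_bytes
    FROM ndbinfo.ndb$table_map AS m
    JOIN (SELECT table_id,
                 fragment_id,
                 MAX(in_memory_bytes)   AS in_memory_bytes,
                 MAX(disk_memory_bytes) AS disk_memory_bytes
            FROM ndbinfo.ndb$table_memory_usage
           GROUP BY table_id, fragment_id) AS f
      ON f.table_id = m.table_id
   GROUP BY m.database_name, m.table_name;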

Using those two tables we create two ndbinfo views. The first view, table_memory_usage, lists the database name, the table name and the above 4 properties for each table in the cluster. The second view, database_memory_usage, lists the database name and the 4 properties summed over all table fragments in all the tables that RonDB created for the user, including the BLOB tables and index tables.

To make things a bit more efficient we keep track of all ordered indexes attached to a table internally in RonDB. Thus ndb$table_memory_usage lists the memory usage of a table plus the ordered indexes on the table; there are no separate rows presenting the memory usage of an ordered index.

These two tables make it easy for users to see how much memory they are using in a certain table or database. This is useful in managing a RonDB cluster.
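For example, once these views are in place, a user could check their memory usage from any MySQL client along the following lines; the database name my_project_db is made up for illustration.

  -- Per-table usage for one project's database (hypothetical database name).
  SELECT * FROM ndbinfo.table_memory_usage
   WHERE database_name = 'my_project_db';

  -- Per-database usage across the whole cluster.
  SELECT * FROM ndbinfo.database_memory_usage
   ORDER BY in_memory_bytes DESC;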

Saturday, April 23, 2022

Variable sized disk rows in RonDB

RonDB was originally a pure in-memory database engine. The main reason for this was to support low latency applications in the telecom business. However, already in 2005 we presented a design at VLDB in Trondheim for the introduction of columns stored on disk. These columns cannot be indexed, but they are very suitable for columns with large sizes.

RonDB is currently targeting Feature Store applications. These applications often access data through a set of primary key lookups where each row can have hundreds of columns with varying size.

In RonDB 21.04 the support for disk columns uses a fixed size disk row. This works very well for handling small files in HopsFS. HopsFS is a distributed file system that can handle petabytes of storage in an efficient manner. On top of it, Hopsworks builds the offline Feature Store applications.

The small files are stored in a set of fixed size rows of suitable sizes in RonDB. YCSB benchmarks have shown that RonDB can handle writes of up to several GBytes per second. Thus the disk implementation of RonDB is very efficient.

Applications using the online Feature Store will however store much of their data in variable sized columns. These work perfectly well as in-memory columns. They also work in the disk columns in RonDB 21.04. However, to make storage more efficient we are designing a new version of RonDB where the row parts on disk are stored on variable sized disk pages.

These pages use the same data structure as the in-memory variable sized pages, so the new format only affects the handling of free space and the handling of recovery. This design has now reached a state where it is passing our functional test suites. We will still add more tests, perform system tests and search for even more problems before we release it for production usage.

One interesting challenge with variable sized rows is that an operation might need more space on the data page. If this space isn't available we have to find a new page where space is available. It becomes an interesting challenge when taking into account that we can abort some operations on a row while still committing other operations on the same row. The conclusion is that one can never release any allocated resources until the transaction is fully committed or fully aborted.

This type of challenge is one reason why it is so interesting to work with the internals of a distributed database engine. After 30 years of education, development and support, there are still new interesting challenges to handle.

Another challenge we faced was that we may need to page in multiple data pages to handle an operation on a row. This means that we have to ensure that, while paging in one data page, the pages we have already paged in aren't paged out before we have completed our work on the row. This work also prepares the stage for handling rows that span multiple disk pages. RonDB already supports rows that span multiple in-memory pages and one disk page.

If you want to learn more about RonDB requirements, LATS properties, use cases and internal algorithms, join us on Monday for the CMU Vaccination Database presentation. Managed RonDB is supported on AWS, Azure and GCP as well as on-premises.

If you would like to join the effort to develop RonDB and a managed RonDB version, we have open positions at Hopsworks AB. Contact me on LinkedIn if you are interested.

Monday, January 31, 2022

RonDB receives ARM64 support and large transaction support

 RonDB is the base platform for all applications in Hopsworks. Hopsworks is a machine learning platform featuring a Feature Store that can be used in online applications as well as offline applications.

This means that RonDB development is driven towards ensuring that RonDB operates as well as possible in this environment.

RonDB is designed for millions of small transactions reading and writing data. However, occasionally applications perform rather large transactions. Previous versions of RonDB had some weaknesses in this area. The new versions of RonDB now also support large transactions, although the focus is still on many smaller transactions.

Designing this new support for large transactions required a fairly large development effort. Doing this in a stable release is a challenge; therefore it was decided to combine this effort with a heavy testing period focused on fixing bugs.

This effort has been focused on achieving three objectives. First, to stabilise the new RonDB 21.04 releases, which form the stable release series of RonDB. Second, to stabilise the next RonDB release at the same level as RonDB 21.04. Third, we also wanted the same level of support for ARM64 machines.

We are now proud to release RonDB 21.04.3, a new stable release of RonDB that supports much larger transactions. Since the release of RonDB 21.04.1 in July 2021 we have fixed more than 50 bugs in RonDB and we are very satisfied with the stability also on ARM64 machines.

The original plan was to release the next version of RonDB in October 2021; however, we didn't want to release a new version with any less stability than the RonDB 21.04 release. Thus we instead release this new version of RonDB now, RonDB 22.01.0.

ARM64 support covers both RonDB 21.04.3 and RonDB 22.01.0. RonDB is now supported on both Linux and Mac OS X, and on Windows it is supported using WSL 2 (Linux on Windows) on Windows 11. We have extensively tested RonDB on the following platforms:

  1. Mac OS X 11.6 x86_64
  2. Mac OS X 12.2 ARM64
  3. Windows WSL 2 Ubuntu x86_64
  4. Ubuntu 21.04 x86_64
  5. Oracle Linux 8 Cloud Developer version ARM64

It is used in production on AWS and Azure and has been extensively tested also on GCP and Oracle Cloud.

As part of the new RonDB release we have also updated the documentation of RonDB at docs.rondb.com. Among other things it contains a new section on Contributing to RonDB that shows how you can build, test and develop extensions to RonDB. In the documentation you will also find an extensive list of the improvements made in the two new RonDB releases.

ARM64 support is still in beta phase, our plan is to make it available for production use in Q2 2022. There are no known bugs, but we want to give it a bit more time before we assign it to production workloads. This includes adding more test machines and also performing benchmarks on ARM64 VMs.

Our experience with ARM64 machines so far is that the platform is fairly stable, but it isn't yet at the same level as x86; it is still possible to find bugs in the compilers. The support around ARM64 is, however, maturing very quickly, and not surprisingly Mac OS X is leading the way here since Mac OS X has fully committed its future to ARM. We also get great help from participating in the OCI ARM Accelerator program, which provides access to ARM VMs in the Oracle Cloud and makes it possible to test on Oracle Linux using ARM with both small and large VMs.

RonDB 22.01.0 comes with a set of new features:

  1. Now possible to scale reads using locks onto more threads
  2. Improved placement of primary replicas to enable
  3. All major memory areas now managed by global memory manager
  4. Even more flexibility in thread configurations
  5. Removing a scalability hog in index statistics handling
  6. Merged with MySQL Cluster 8.0.28

You can download RonDB tarballs either from https://github.com/logicalclocks/rondb or from https://repo.hops.works/master. For exact links to the various versions of the binary tarballs, see the Release Notes for each version.