Tuesday, January 10, 2023

The flagship feature in the new LTS version RonDB 22.10.0

In RonDB 22.10.0 we added a major new feature to RonDB: variable-sized disk columns are now stored in variable-sized rows instead of fixed-size rows.
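To make the feature concrete, the sketch below shows how a disk column is declared in NDB-style SQL: columns marked STORAGE DISK live in a tablespace backed by a logfile group. The object names, file names, and sizes are illustrative, not from the original post.

```sql
-- Disk columns require a logfile group (for undo logging) and a tablespace.
CREATE LOGFILE GROUP lg1
  ADD UNDOFILE 'undo1.log'
  INITIAL_SIZE 128M
  ENGINE NDBCLUSTER;

CREATE TABLESPACE ts1
  ADD DATAFILE 'data1.dat'
  USE LOGFILE GROUP lg1
  INITIAL_SIZE 1G
  ENGINE NDBCLUSTER;

-- The variable-sized payload column is stored on disk; with 22.10.0
-- it occupies variable-sized rows instead of max-size fixed rows.
CREATE TABLE small_files (
  file_id BIGINT PRIMARY KEY,
  payload VARBINARY(8000) STORAGE DISK
) ENGINE NDBCLUSTER TABLESPACE ts1;
```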

The history of disk data in RonDB starts back in 2004, when the NDB team at Ericsson had been acquired by MySQL AB. NDB Cluster was originally designed as an in-memory DBMS, because a disk-based DBMS couldn't handle the latency requirements of telecom applications.

Thus NDB was developed with a distributed architecture based on Network Durability: a transaction is made durable by writing it into the memory of several computers in a network. Long-term durability of data is achieved by a background process that ensures the data is written to disk.

When the NDB team joined MySQL we looked at many other application categories as well, so increasing the database sizes NDB could handle was seen as important. We therefore started developing support for disk-based columns. The design was described in a paper accepted at VLDB 2005 in Trondheim.

The use of this feature didn't really take off in any significant manner for a few years, since the latency and throughput of hard drives were too far from the performance of in-memory data.

That problem has been solved by the technology development of SSDs, the introduction of NVMe drives, and newer versions of PCI Express 3, 4, and now 5. As an anecdote, I installed a set of NVMe drives on my workstation capable of handling millions of IOPS and delivering 66 GBytes per second of bandwidth. While installing, however, I discovered that I had only one memory card, which meant I had 3x more bandwidth to my NVMe drives than to my memory. So in order to make use of those NVMe drives, I had to install a number of memory cards to get the memory bandwidth required to keep up with them.

So with the introduction of NVMe drives the feature became genuinely useful. In fact, one of its main users is HopsFS, a distributed file system in the Hopsworks platform that uses RonDB for metadata management. HopsFS can use disk columns in RonDB to store small files.

Performance of disk columns is really good. This blog presents a benchmark with YCSB using disk-based columns in NDB Cluster. We get a bandwidth of more than 1 GByte per second of application data read and written.

The latency of NVMe drives is 100x lower than that of hard drives. Previously, latency on hard drives was far more than 100x higher than in-memory latency for database operations; with modern NVMe drives, the latency difference between in-memory columns and disk columns is down to a factor of 2. We analysed performance and latency using the YCSB benchmark and compared disk columns to in-memory columns in this blog.

One problem with the original implementation was that disk columns were always stored in fixed-size rows. In HopsFS we worked around this by using multiple tables for different row sizes.

In traditional applications and in the Feature Store it is very common to store data in variable-sized columns. To ensure that the data fits, the maximum size of a column may be set 10x higher than its average size, so we can easily waste 90% of the disk space. To use disk columns in Feature Store applications, we therefore had to add support for variable-sized rows on disk.
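The arithmetic behind that waste estimate can be sketched as follows; the 10x ratio is the example figure from the text, and the row count and sizes are illustrative:

```python
# Fixed-size rows must reserve the maximum column size for every row,
# while variable-sized rows only consume the actual payload size.
avg_size = 1_000      # average payload bytes per row (illustrative)
max_size = 10_000     # column maximum, 10x the average

rows = 1_000_000
fixed_storage = rows * max_size       # disk used with fixed-size rows
variable_storage = rows * avg_size    # disk used with variable-sized rows

wasted_fraction = 1 - variable_storage / fixed_storage
print(wasted_fraction)  # 0.9, i.e. 90% of the disk space wasted
```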

Thus, with the release of the new LTS version RonDB 22.10.0, disk columns are now as useful as in-memory columns. They have excellent performance; their latency is very good, even better than the in-memory latency of some competitors; and their storage efficiency is now high as well.

This means that with RonDB 22.10.0 we can handle nodes with terabytes of in-memory data and many tens of terabytes of disk columns. Thus RonDB can scale all the way up to database sizes at the petabyte level, with read and write latencies below a millisecond.

Summary of RonDB 21.04.9 changes

The main use case of RonDB 21.04 is being the base of the data management platform in Hopsworks. As such, new requirements on RonDB emerge every now and then. But the most important focus of RonDB 21.04 development is, of course, stability.

Hopsworks provides a free serverless offering to try out the Hopsworks platform. Check it out at https://app.hopsworks.ai. Each user gets their own database in RonDB and can create a number of tables. One can then load data from various sources using OnlineFS, a feature that uses Kafka and ClusterJ to load data from external sources into Feature Groups (a Feature Group is a table in RonDB).

Previously, ClusterJ was limited to using only one database per cluster connection, which led to many unnecessary connects and disconnects against the RonDB cluster. In RonDB 21.04.9 it is now possible for one cluster connection to use any number of databases.
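For context, a ClusterJ SessionFactory is configured through connection properties like the fragment below; before this change, the single database named here was the only one the cluster connection could serve, so each additional database required another connection. The property names are ClusterJ's standard ones, while the host and database names are illustrative:

```
com.mysql.clusterj.connectstring=mgmd-1:1186
com.mysql.clusterj.database=tenant_db_17
com.mysql.clusterj.connection.pool.size=1
```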

In addition we did a few changes to RonDB to make it easier to manage RonDB in our managed platform.

In preparation for releasing Hopsworks 3.1, which includes RonDB 21.04.9, we extended the tests for the Hopsworks platform, among other things for HopsFS, a distributed file system that uses RonDB to store metadata and small files. We fixed all issues found in these extended tests, as well as any other problems found in the last couple of months.

Monday, January 09, 2023

RonDB News

The RonDB team was busy with development throughout 2022. Now it is time to start releasing things. There are five things we plan to release in Q1 2023.

RonDB 21.04.9: A new version of RonDB with a few new features required by the Hopsworks 3.1 release and a number of bug fixes. This is released today and will be described in a separate blog.

RonDB 22.10.0: This is a new Long-Term Support (LTS) version that will be maintained until at least 2025. It is also released today. It has a number of new features on top of RonDB 21.04, the most important of which is support for variable-sized disk columns, which makes RonDB much more interesting to use with large data sets. More on this feature in a separate blog post.

In addition, RonDB 22.10.0 is based on MySQL 8.0.31, while RonDB 21.04 is based on MySQL 8.0.23. I will post a separate blog with more about the content of RonDB 22.10.0.

The release content is shown in detail in the release notes and new features chapters in the RonDB docs.

We will very soon release a completely revamped version of RonDB Docker using Docker Compose. It is intended to support developing applications on top of RonDB in your local development environment. RonDB developers use it to develop new features, but it is also very useful for developing any type of application on top of RonDB through any of the numerous APIs by which you can connect to it.

We are also close to finishing the first version of our completely new RonDB REST API, which makes it possible to issue REST requests towards RonDB as well as the same queries via gRPC calls. The first version will support primary key lookups and batched key lookups. Batched key lookups are very useful in some Feature Store applications where it is necessary to read hundreds of rows in RonDB to rank query results. Our plan is to develop this REST API service further so that it can also be used efficiently in multi-tenant setups, enabling the use of RonDB in serverless applications.
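As an illustration of what a batched key lookup request might look like, the sketch below builds a JSON body that bundles several primary-key reads into one call. The path layout, operation fields, and column names here are my assumptions for illustration, not the published API:

```python
import json

def batch_pk_read_payload(db, table, keys):
    """Build a hypothetical batched primary-key read request body."""
    return {
        "operations": [
            {
                "method": "POST",
                # Path layout is an assumption modelled on typical REST key-value APIs.
                "relative-url": f"{db}/{table}/pk-read",
                "body": {"filters": [{"column": "id", "value": k}]},
            }
            for k in keys
        ]
    }

# One HTTP round trip instead of three separate lookups.
payload = batch_pk_read_payload("feature_store", "user_features", [101, 102, 103])
print(json.dumps(payload, indent=2))
```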

Finally, we have completed the development and test phases of RonDB Reconfiguration in the Hopsworks cloud on AWS. The Hopsworks cloud is implemented using AWS Amplify, so the Hopsworks cloud service is handled by Amplify even if the actual Hopsworks cluster runs in GCP or Azure. RonDB Reconfiguration means that you can start by creating a Hopsworks cluster with 2 data node VMs with 8 VCPUs and 64 GB of memory and 2 MySQL Server VMs with 16 VCPUs. When the cluster needs to grow, you simply tell the Hopsworks UI that you want, e.g., 3 data node VMs with 16 VCPUs and 128 GB of memory each and 3 MySQL Server VMs with 32 VCPUs each. The Hopsworks cloud service then reconfigures the cluster as an online operation, with no downtime. Some queries may get temporary errors, but those can simply be retried.

The Hopsworks cloud applications use virtual service names through Consul, which means that services using the MySQL service will automatically pick up the new MySQL Servers as they come online and use them in a round-robin fashion.

It is possible to scale data node VM sizes upwards; we currently don't support scaling them downwards. The number of replicas can be scaled up and down between 1 and 3. The number of MySQL Servers can be increased or decreased, and the size of the MySQL Server VMs can go both up and down. At the moment we don't allow adding more node groups of data nodes as an online operation; this requires an offline change.

This reconfiguration feature is going to be integrated into Hopsworks cloud in the near future.

Friday, July 29, 2022

The world's first LATS benchmark

LATS stands for low Latency, high Availability, high Throughput and scalable Storage. When testing an OLTP DBMS it is important to look at all of these aspects. This means the benchmark should test how the DBMS works both when data fits in memory and when it doesn't. In addition, tests should measure both throughput and latency. Finally, it isn't enough to run benchmarks only while the DBMS is in normal operation; there should also be tests that verify performance when nodes fail and when they rejoin the cluster.

We executed a battery of such tests on RonDB, a key-value store with SQL capabilities, which together make up a complete LATS benchmark. We used the Yahoo! Cloud Serving Benchmark (YCSB). The benchmarks were executed on Amazon EC2 servers with 16 VCPUs, 122 GBytes of memory, 2x1900 GB NVMe drives, and up to 10 Gb Ethernet. The same virtual machines were used both for the in-memory tests and for the on-disk performance tests.

Link to full benchmark presentation. The full benchmark presentation contains lots of graphs and a lot more detail about the benchmarks. Here I will present a short summary of the results we saw.

YCSB contains 6 different workloads, and we tested all 6 in different aspects. In most workloads the average latency is around 600 microseconds; the 99th percentile is usually around 1 millisecond and almost always below 2 milliseconds.

The availability tests start by shutting down one of the RonDB data nodes. Ongoing transactions affected by the node failure see a latency of up to a few seconds, since node failure handling requires the transaction state to be rebuilt to decide whether each transaction should be committed or aborted. New transactions can start as soon as the cluster has been reconfigured to remove the failed node; after the node failure has been discovered, this reconfiguration only takes about one millisecond. The time to discover the failure depends on how it occurs. If it is a software failure in the data node, it is discovered by the OS, and discovery is immediate since the connection is broken. A hardware failure may instead have to be discovered by the heartbeat mechanism, in which case the discovery time depends on the configured heartbeat interval.
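The heartbeat-based detection time can be estimated as below. In NDB, a node that misses three consecutive heartbeats is declared dead, so worst-case discovery takes up to four heartbeat intervals; the 5000 ms figure is the default of the HeartbeatIntervalDbDb parameter, and your configuration may differ:

```python
# Worst-case failure detection via heartbeats: after three missed
# heartbeats in a row the node is declared dead, so discovery can take
# up to four heartbeat intervals in the worst case.
heartbeat_interval_ms = 5000   # HeartbeatIntervalDbDb default
missed_heartbeats_allowed = 3

worst_case_detection_ms = (missed_heartbeats_allowed + 1) * heartbeat_interval_ms
print(worst_case_detection_ms)  # 20000 ms with the default interval
```

Lowering the heartbeat interval shortens detection time at the cost of more false positives on overloaded nodes, which is why the software-failure path (broken connection, immediate discovery) is preferable when available.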

After the node failure has been handled, throughput decreases by around 10% and latency goes up by about 30%. The main reason is that there are now fewer data nodes to serve reads; the impact is higher with 2 replicas than with 3. When recovery reaches the synchronisation phase, where the starting node synchronises its data with the live data nodes, we see a minor decrease in throughput, which actually leads to shorter latency. Finally, when the process is completed and the starting node can serve reads again, throughput and latency return to normal levels.

Thus it is clear from these numbers that one should design RonDB clusters with a small amount of extra capacity to handle node recovery. The required headroom is not very high, though a bit more is needed with 2 replicas than with 3.

Performance decreases significantly when data doesn't fit in memory, since it is then limited by how many IOPS the NVMe drives can sustain. We did similar experiments a few years ago and saw that RonDB performance can scale to as many as 8 NVMe drives and handle read and write workloads of more than a GByte per second using YCSB. The hardware development of NVMe drives is even faster than that of CPUs, so this bottleneck is likely to diminish as hardware development proceeds.

Read latency is higher for on-disk storage, and update latency is substantially higher, reaching up to 10 milliseconds at high throughput. We expect latency and throughput to improve substantially with the new generation of VMs using much-improved NVMe drives. It will be even more interesting to see how this performance improves when moving to PCI Express 5.0 NVMe drives.