Thursday, September 25, 2025

Datagraph releases an extension of RonDB, a Common Lisp NDB API

Datagraph develops a graph database called Dydra that can handle SPARQL, GraphQL and the Linked Data Platform (LDP). Dydra stores a revisioned history, which means that you have access to the full history of your data. This could be the development of a document, a piece of software, a piece of HW like a SoC (System-on-a-Chip), a building, or something else. Essentially any data.

This blog post describes the development done by the team at Datagraph.

Traditionally Dydra has used a memory-mapped key-value store for this. Hopsworks and Dydra have now worked together for a while to provide features in RonDB that make it possible to run Dydra on a highly available, distributed platform that can parallelise many of the searches.

RonDB is a distributed key-value store with SQL capabilities. Traditionally, distributed key-value stores offer the possibility to read and write data in a highly efficient manner using key lookups. RonDB offers this capability as well with extremely good performance (RonDB has shown how to achieve 100M key lookups per second using a REST API). However, the SQL capabilities mean that RonDB can also push down filters and projections to the RonDB data nodes. This means that searches can be parallelised.

Thus, RonDB will also be able to handle complex joins efficiently in many cases. Some of the queries in TPC-H (a standard analytical database benchmark) can be executed 50x faster in RonDB compared to using MySQL/InnoDB.

While working with Dydra on their searches, we realised that they store data structures in columns using the data type VARBINARY. SQL doesn't really have any way to express searches on complex data structures inside a VARBINARY column.

There are many ways to access RonDB. Many people find the MySQL APIs to be the preferable method; these APIs are well known and there is plenty of literature on how to use them. However, RonDB is also a key-value store, which means that a lower-level interface is much more efficient.

The base of all interactions with RonDB is the C++ NDB API. On top of this API there is a Java API called ClusterJ and a NodeJS API called Database Jones. As mentioned, there are also the MySQL APIs, since RonDB acts as an NDB storage engine for MySQL. With RonDB 24.10 we also introduced a C++ REST API server that can be used to retrieve batches of key lookups at very low latency. There is even an experimental Redis interface for RonDB that we call Rondis; it is integrated into the RonDB REST API server (RDRS).
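As a rough illustration of what a key lookup through RDRS looks like from a client, here is a small libcurl sketch. Treat the endpoint path, port and JSON field names as assumptions to check against the RonDB REST API documentation; the database, table and column names are made up for the example.

#include <curl/curl.h>
#include <cstdio>

// Hedged sketch of a single primary-key read against the RonDB REST API
// server (RDRS). The URL ("/0.1.0/<db>/<table>/pk-read"), the default port
// 4406 and the JSON field names are assumptions to verify against the RDRS
// docs; the database "mydb", table "users" and column "user_id" are made up.
int main() {
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL *curl = curl_easy_init();
  if (!curl) return 1;

  const char *url  = "http://localhost:4406/0.1.0/mydb/users/pk-read";
  const char *body =
      "{ \"filters\": [ { \"column\": \"user_id\", \"value\": 42 } ] }";

  struct curl_slist *headers = nullptr;
  headers = curl_slist_append(headers, "Content-Type: application/json");

  curl_easy_setopt(curl, CURLOPT_URL, url);
  curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
  curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);   // POST the JSON body

  CURLcode res = curl_easy_perform(curl);             // response JSON goes to stdout
  if (res != CURLE_OK)
    std::fprintf(stderr, "pk-read failed: %s\n", curl_easy_strerror(res));

  curl_slist_free_all(headers);
  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return res == CURLE_OK ? 0 : 1;
}

Batches of such lookups go through the corresponding batch endpoint, which is what the 100M lookups per second benchmark mentioned above exercises.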

In 2022, Datagraph released one more option: Common Lisp bindings for the C++ NDB API. With the release of RonDB 24.10, they have just released a much improved version of cl-ndbapi.

As discussed above, a Dydra query often entails a scan operation where one has to analyse the content of a VARBINARY column. As a first step, all of this analysis was performed by shipping the VARBINARY to the Common Lisp environment. This gave pretty decent performance, but we realised we could do better.

RonDB has long had a simple interpreter for things such as filters, auto increment, and the like. However, to perform complex analysis of VARBINARY columns we needed to extend the RonDB interpreter.

MySQL has a similar feature, user-defined functions (UDFs), where one can integrate a C program into MySQL. However, using a similar approach for RonDB has two implications. First, it is a safety issue: such a program could easily crash the RonDB data nodes, which conflicts with the high-availability features of RonDB. Second, RonDB is a distributed architecture, so the program would be required on every RonDB data node, complicating the installation process of RonDB.

Instead we opted for the approach of extending the RonDB interpreter. The RonDB interpreter has 8 registers; these registers store 64-bit signed integers. An interpreted execution always has access to a single row; it cannot access any other rows or data outside the interpreter. Interpreted execution has several steps: one can first read columns, then execute the interpreted program, then write some columns and finally read columns again. In this manner one can combine normal simple reads with an interpreted program. In MySQL the interpreted program is used to execute WHERE clauses, filtering away the rows that are not of interest to the query. The program can also have a section of input parameters, making it possible to reuse an interpreted program with different inputs. It is also possible to return calculated results using output parameters.
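To make this concrete, here is a minimal sketch of such a filter expressed through the public NdbInterpretedCode class in the C++ NDB API, the layer that the Common Lisp bindings wrap. The table and the "status" column are invented for the example; the program accepts a row only when the column equals a given value.

#include <NdbApi.hpp>

// Build an interpreted program that keeps only rows where the 32-bit column
// "status" (a made-up column) equals match_value. The caller constructs the
// NdbInterpretedCode object with the table, e.g. NdbInterpretedCode code(tab);
int build_status_filter(const NdbDictionary::Table *tab,
                        NdbInterpretedCode &code,
                        Uint32 match_value)
{
  const NdbDictionary::Column *col = tab->getColumn("status");
  if (col == nullptr) return -1;
  const Uint32 attr_id = col->getColumnNo();

  // Jump to label 0 if the column value equals match_value ...
  code.branch_col_eq(&match_value, sizeof(match_value), attr_id, 0);
  // ... otherwise tell the data node to skip this row.
  code.interpret_exit_nok();
  // Label 0: the row passed the filter, return it to the client.
  code.def_label(0);
  code.interpret_exit_ok();

  // Turn the instruction stream into its final executable form.
  return code.finalise();
}

The finalised program is then attached to a scan, for example via NdbScanOperation::setInterpretedCode(&code), before the scan is executed, and the data nodes run it once per scanned row. The RonDB 24.10 extensions described below build on this same model.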

To handle the new requirements the RonDB interpreter was extended with a memory area of a bit more than 64 kB.

To ensure that one can handle a generic program, RonDB added a long list of new operations such as shift left/right, multiplication, division, modulo and so forth. In addition, there are instructions to read columns into the memory area, and even to read only parts of a column if desired. Similarly, there are instructions to write columns.

Dydra used these new approaches and saw a handsome improvement to the results delivered by RonDB.

Analysing the Dydra use case further, we found that they used variants of binary search on parts of the VARBINARY columns. Thus RonDB also implemented a set of higher-level instructions such as binary search, interval search, memory copy and text-to-number conversion (and vice versa).
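As a client-side illustration of what such a binary search over a VARBINARY value amounts to (the in-node instruction does the equivalent work without shipping the column to the client; its actual encoding is not shown here), consider a column holding a sorted array of fixed-width big-endian keys:

#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Find a 4-byte big-endian key inside a VARBINARY value that stores a sorted
// array of such keys. Returns the position of the key in the array, if found.
std::optional<std::size_t> find_key(const std::vector<uint8_t> &varbinary,
                                    uint32_t key)
{
  const std::size_t width = 4;                        // bytes per key
  std::size_t lo = 0, hi = varbinary.size() / width;  // search range [lo, hi)

  while (lo < hi) {
    const std::size_t mid = lo + (hi - lo) / 2;
    const uint8_t *p = varbinary.data() + mid * width;
    const uint32_t val = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                         (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
    if (val == key) return mid;
    if (val < key)  lo = mid + 1;
    else            hi = mid;
  }
  return std::nullopt;                                 // key not present
}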

Using those new instructions, Dydra saw further improvements. The new instructions also make the interpreted programs quicker to develop. As requirements for other algorithms arise, it is fairly easy to add new instructions to the RonDB interpreter, and this should be possible for other community developers as well.

The most innovative part of the new Common Lisp NDB API is the handling of the interpreted instructions. It contains a language-to-language compiler, so you can write the interpreted program as a Lisp program using normal IF, WHEN and COND (the usual conditional constructs in Lisp). You can even decide whether to run the program in the client using Lisp (mainly for testing and debugging) or to push it down to RonDB for execution in the RonDB data nodes (for speed).

One benchmark that Dydra used to evaluate RonDB performance compared MySQL/InnoDB using a UDF against RonDB using pushdown of the evaluation. The data set consisted of 4.5M rows; essentially all rows were scanned, and for each row a program was executed that checked whether the row was visible in the requested revision. About 2% of the rows were returned.

In MySQL/InnoDB the query took 8.89 seconds to execute, while in RonDB it took 0.51 seconds, a nice speedup of around 17 times (8.89 / 0.51 ≈ 17.4). Most of the speedup comes from the amount of parallelism used in RonDB; the MySQL execution is single-threaded. The cost of scanning one row is very similar in MySQL/InnoDB and in RonDB. RonDB is a bit faster, but there is no major difference in per-row speed.

Wednesday, August 27, 2025

How to design a DBMS for Telco requirements

My colleague Zhao Song presented a walkthrough of the evolution of DBMSs and how it relates to Google Spanner, Aurora, PolarDB and MySQL NDB Cluster.

I had some interesting discussions with him on the topic, and it makes sense to return to the 1990s, when I designed NDB Cluster, and to how the requirements for a Telco DBMS shaped its recovery algorithms.

A Telco DBMS is a DBMS that operates in a Telco environment. This DBMS is involved in every smartphone interaction with the Telco system, such as call setup, location updates, SMS and mobile data. If the DBMS is down, there is no service available for smartphones. Obviously there is no time of day or night when it is ok to be down; thus even a few seconds of downtime is important to avoid.

Thus in the design of NDB Cluster I had to take into account the following events:

  • DBMS Software Upgrade
  • Application Software Upgrade
  • SW Failure in DBMS Node
  • SW Failure in Application Service
  • HW Failure in DBMS Node
  • HW Failure in Application Service
  • SW Failure in DBMS Cluster
  • Region Failure
It was clear that the design had to be a distributed DBMS. In Telcos it was not uncommon to build HW-redundant solutions with a single node but redundant HW. However, such a solution obviously has difficulties with SW failures. It also requires very specialised HW which costs hundreds of millions of dollars to develop. Today this solution is very rarely used.

One of the first design decisions was to choose between a disk-based DBMS and an in-memory DBMS. This was settled by the latency requirement: transactions involving tens of rows had to complete within around 10 milliseconds, which was not really possible with the hard drives of those days. Today, with the introduction of SSDs and NVMe drives, there is still a latency impact of at least 3x when using disk drives compared to an in-memory DBMS.

If we play with the thought of using a Shared Disk DBMS on modern HW, we still have a problem. The Shared Disk solution requires a storage solution which is itself a Shared Nothing solution. In addition, a Shared Disk DBMS commits by writing the REDO log to the Shared Disk. This means that at recovery we need to replay part of the REDO log to allow a node to take over after a failed node. Since the latest state of some disk pages is only available in the Shared Disk, we cannot serve any transactions touching those pages until we have replayed the REDO log. This used to be a period of around 30 seconds; it is shorter now, but it is still not good enough for the Telco requirements.

Thus we have settled on a Shared Nothing DBMS solution using in-memory tables. The next problem is how to handle replication in a Shared Nothing architecture. The replication sends REDO logs, or something similar, towards the backup replicas. Now one has a choice: either apply the REDO logs immediately, or only write them to the REDO log and apply them later.

Again, applying them later means that we will suffer downtime if the backup replica is forced to take over as primary replica. Thus we have to apply the REDO logs immediately as part of the transaction execution. This means we are able to take over within milliseconds after a node failure.

Failures can happen in two ways. Most SW failures will be discovered by the other nodes in the cluster immediately, so in this case node failures are detected very quickly. However, HW failures in particular can lead to silent failures; here one is required to use some sort of I-am-alive protocol (heartbeats in NDB). The discovery time in this case is a product of the real-time properties of the operating system and of the DBMS.
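A minimal sketch of the idea is shown below. This is not the actual NDB implementation (NDB declares a node dead after it has missed a number of consecutive heartbeat intervals, tuned by configuration parameters such as HeartbeatIntervalDbDb); the interval and the threshold of 4 missed heartbeats are illustrative assumptions.

#include <chrono>
#include <cstdio>
#include <thread>
#include <unordered_map>

// Toy heartbeat monitor. Every node is expected to send a heartbeat each
// interval; a peer that misses max_missed consecutive intervals is declared
// dead and failure handling (take-over of its transactions) is started.
struct HeartbeatMonitor {
  using clock = std::chrono::steady_clock;
  std::chrono::milliseconds interval{100};                // heartbeat period
  int max_missed = 4;                                     // intervals before declaring death
  std::unordered_map<int, clock::time_point> last_seen;   // node id -> last heartbeat

  void heartbeat(int node_id) { last_seen[node_id] = clock::now(); }

  void check(clock::time_point now) {
    for (auto &entry : last_seen) {
      if (now - entry.second > interval * max_missed)
        std::printf("node %d declared dead, starting take-over\n", entry.first);
    }
  }
};

int main() {
  HeartbeatMonitor mon;
  mon.heartbeat(2);                                       // node 2 reports in once ...
  std::this_thread::sleep_for(mon.interval * (mon.max_missed + 1));
  mon.check(HeartbeatMonitor::clock::now());              // ... and is now considered dead
}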

Now, transaction execution can be done using a replication protocol such as Paxos, where a global order of transactions is maintained, or through a non-blocking 2PC protocol. Both are required to handle failures of the coordinator through a leader-selection algorithm and to handle the ongoing transactions affected by such a failure.

The benefit of non-blocking 2PC is that it can handle millions of concurrent transactions, since the coordinator role can be taken by any node in the cluster; there is no central point limiting the transaction throughput. To be non-blocking, the 2PC protocol must handle failed coordinators by finishing their ongoing transactions using a take-over protocol. To handle cluster recovery, an epoch transaction regularly creates consistent recovery points. This epoch mechanism can also be used to replicate to other regions, even supporting Active-Active replication using various conflict detection algorithms.

So the conclusion of how to design a DBMS for Telco requirements is:
  • Use an in-memory DBMS
  • Use a Shared Nothing DBMS
  • Apply the changes on both the primary replica and the backup replica as part of the transaction
  • Use non-blocking 2PC for higher concurrency of write transactions
  • Implement Heartbeat protocol to discover silent failures in both APIs and DBMS nodes
  • Implement Take-over protocols for each distributed protocol, especially Leader-Selection
  • Implement Software Upgrade mechanisms in both APIs and DBMS nodes
  • Implement Failure Handling of APIs and DBMS nodes
  • Support Online Schema Changes
  • Support Regional Replication
The above implementation makes it possible to run a DBMS with Class 6 availability (less than about 30 seconds of downtime per year). This means that all SW, HW and regional failures, including the catastrophic ones, are accounted for within those 30 seconds per year.
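To put a number on it: Class 6 corresponds to 99.9999% availability, and since a year has about 31.5 million seconds, the downtime budget is 31,536,000 × 0.000001 ≈ 31.5 seconds, roughly the half minute quoted above.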

MySQL NDB Cluster has been used at this level for more than 20 years and continues to serve billions of people with a highly available service.

At Hopsworks, MySQL NDB Cluster was selected as the platform on which to build a highly available real-time AI platform. To make MySQL NDB Cluster accessible for anyone to use, we forked it and call it RonDB. RonDB has made many improvements in ease of use and scalable reads, and we have created a managed service that makes it possible to easily install and manage RonDB. We have also added a set of new interfaces: a REST server to handle batches of lookups, both generic database lookups and feature store lookups; RonSQL to handle optimised aggregation queries that are very common in AI applications; and finally an experimental Redis interface called Rondis.

Check out rondb.com for more information; you can try it out, and if you want to read the latest scalability benchmark, go directly here. If you want a walkthrough of the challenges in running a highly scalable benchmark, you can find it here.

Happy reading!

Thursday, August 21, 2025

How to reach 100M Key lookups using REST server with Python clients

A few months ago I decided to run a benchmark to showcase how RonDB 24.10 can handle 100M key lookups per second using our REST API server from Python clients. This exercise is meant to show how RonDB can scale to handle both the throughput and the latency requirements of personalised recommendation systems, which are commonly used by companies such as Spotify, e-commerce sites and so forth.

The exercise started at 2M key lookups per second. Running a large benchmark like this means that you hit all sorts of bottlenecks: some are due to configuration issues, some to load balancers, some to quota constraints and networking within the cloud vendor, some are due to bugs, and yet others required new features in RonDB. The exercise also includes a comparison of VM types using Intel, AMD and ARM CPUs, as well as managing multiple Availability Zones.

I thought reporting on this exercise could be an interesting lesson for others as well, so the whole process can be found in this blog post.

At rondb.com you can find other blogs about RonDB 24.10 and you can even try out RonDB in a Test Cluster. You can start a small benchmark and check 12 dashboards of monitoring information about RonDB while it is running.