Wednesday, August 27, 2025

How to design a DBMS for Telco requirements

My colleague Zhao Song presented a walkthrough of the evolution of DBMSs and how it relates to Google Spanner, Aurora, PolarDB and MySQL NDB Cluster.

I had some interesting discussions with him on the topic, so it makes sense to return to the 1990s when I designed NDB Cluster and look at how the requirements for a Telco DBMS shaped its recovery algorithms.

A Telco DBMS is a DBMS that operates in a Telco environment. It is involved in every interaction between smartphones and the Telco network, such as call setup, location updates, SMS and mobile data. If the DBMS is down there is no service for smartphones. Obviously there is no time of day or night when it is OK to be down, so even a few seconds of downtime must be avoided.

Thus in the design of NDB Cluster I had to take into account the following events:

  • DBMS Software Upgrade
  • Application Software Upgrade
  • SW Failure in DBMS Node
  • SW Failure in Application Service
  • HW Failure in DBMS Node
  • HW Failure in Application Service
  • SW Failure in DBMS Cluster
  • Region Failure
It was clear that the design had to be a distributed DBMS. In Telcos it was not uncommon to build solutions with a single node and redundant HW, but such a solution obviously has difficulties with SW failures. It also requires very specialised HW that costs hundreds of millions of dollars to develop. Today this approach is very rarely used.

One of the first design decisions was to choose between a disk-based DBMS and an in-memory DBMS. This was settled by the latency requirement: transactions involving tens of rows had to complete within around 10 milliseconds, which was not really possible with the hard drives of those days. Even today, with SSDs and NVMe drives, there is still a latency penalty of at least 3x for a disk-based DBMS compared to an in-memory DBMS.
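
To make this arithmetic concrete, here is a back-of-envelope sketch with assumed ballpark device latencies (the numbers are illustrative, not measurements). The point is simply that a transaction touching tens of rows blows through a 10 millisecond budget as soon as every row access costs a random disk read:

#include <cstdio>

// Rough latency estimate for a transaction touching N rows, assuming each
// row access misses all caches and costs one random device read. The
// per-device latencies are assumed ballpark figures, not measurements.
int main() {
  const int rows = 20;             // "tens of rows" per transaction
  const double hdd_ms  = 8.0;      // 1990s hard drive seek + rotation
  const double nvme_ms = 0.1;      // modern NVMe random read
  const double dram_ms = 0.0002;   // in-memory access incl. cache misses
  std::printf("HDD:  %.1f ms\n", rows * hdd_ms);   // far above a 10 ms budget
  std::printf("NVMe: %.1f ms\n", rows * nvme_ms);  // still adds up per row
  std::printf("DRAM: %.3f ms\n", rows * dram_ms);  // well within budget
  return 0;
}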

If we play with the thought of using a Shared Disk DBMS on modern HW we still have a problem. The Shared Disk approach requires a storage layer that is itself a Shared Nothing solution. In addition, a Shared Disk DBMS commits by writing the REDO log to the shared disk. This means that at recovery we need to replay part of the REDO log before a surviving node can take over for a failed node. Since the latest state of some disk pages is only available on the shared disk, we cannot serve any transactions on those pages until the REDO log has been replayed. This used to take around 30 seconds; it is shorter now, but it is still not good enough for the Telco requirements.

Thus we settled on a Shared Nothing DBMS using in-memory tables. The next problem is how to handle replication in a Shared Nothing architecture. Replication sends REDO log records, or something similar, to the backup replicas. Now one has a choice: either apply the REDO log records immediately, or only write them to the REDO log and apply them later.

Applying them later means that we will suffer downtime if the backup replica is forced to take over as primary replica. Thus we have to apply the REDO log records immediately as part of the transaction execution. This means we are able to take over within milliseconds after a node failure.
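
As a minimal illustration of the idea (an in-process toy, not the actual NDB protocol), the backup applies the change before the commit is acknowledged, so its in-memory state can serve reads immediately after a take-over:

#include <iostream>
#include <map>
#include <string>

// Toy model of synchronous replication: primary and backup both apply the
// change before the commit is acknowledged. After a primary failure the
// backup can serve the row at once, with no REDO log replay needed.
struct Replica {
  std::map<std::string, std::string> rows;  // in-memory table
  void Apply(const std::string& key, const std::string& value) {
    rows[key] = value;
  }
};

int main() {
  Replica primary, backup;
  // Commit path: apply on primary and backup, then acknowledge the client.
  primary.Apply("subscriber:42", "location=cell-17");
  backup.Apply("subscriber:42", "location=cell-17");
  std::cout << "commit acknowledged" << std::endl;
  // Primary fails; the backup takes over and serves the row immediately.
  std::cout << "backup read: " << backup.rows["subscriber:42"] << std::endl;
  return 0;
}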

Failures can happen in two ways. Most SW failures will be discovered immediately by the other nodes in the cluster, so such node failures are detected very quickly. HW failures in particular, however, can lead to silent failures, where one needs some sort of I-am-alive protocol (heartbeats in NDB). The discovery time here depends on the real-time properties of both the operating system and the DBMS.
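
A minimal sketch of such an I-am-alive check is shown below; the intervals are made up for the example, and the real heartbeat protocol in NDB is of course more involved:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// A peer is declared failed after it has missed a number of consecutive
// heartbeat periods. Real discovery time also depends on the real-time
// behaviour of the OS scheduler and the DBMS threads.
int main() {
  using namespace std::chrono;
  auto now_ms = [] {
    return duration_cast<milliseconds>(
        steady_clock::now().time_since_epoch()).count();
  };
  std::atomic<long long> last_heartbeat_ms{now_ms()};

  // Peer thread: sends a heartbeat every 100 ms, then fails silently.
  std::thread peer([&] {
    for (int i = 0; i < 5; i++) {
      last_heartbeat_ms = now_ms();
      std::this_thread::sleep_for(milliseconds(100));
    }
    // Silent failure: no more heartbeats and no goodbye message.
  });

  // Monitor: declares the peer failed after 400 ms without a heartbeat.
  const long long timeout_ms = 400;
  for (;;) {
    if (now_ms() - last_heartbeat_ms > timeout_ms) {
      std::cout << "peer declared failed, start take-over" << std::endl;
      break;
    }
    std::this_thread::sleep_for(milliseconds(50));
  }
  peer.join();
  return 0;
}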

Transaction execution can be done using a replication protocol such as Paxos, where a global order of transactions is maintained, or through a non-blocking 2PC protocol. Both have to handle failures of the coordinator through a leader-selection algorithm and by taking care of the ongoing transactions affected by the failure.

The benefit of non-blocking 2PC is that it can handle millions of concurrent transactions, since the coordinator role can be taken by any node in the cluster. There is no central point limiting transaction throughput. To be non-blocking, the 2PC protocol must handle failed coordinators by finishing ongoing transactions using a take-over protocol. To handle cluster recovery, an epoch transaction is created that regularly produces consistent recovery points. This epoch transaction can also be used to replicate to other regions, even supporting Active-Active replication using various conflict detection algorithms.
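
Below is a much-simplified sketch of the take-over decision, assuming (as the real protocol ensures) that a commit decision is always recorded at the participants before it is acknowledged, so a new coordinator can deduce the outcome of an in-doubt transaction from the surviving participants' states alone:

#include <iostream>
#include <vector>

// Per-participant view of a transaction in a simplified commit protocol.
enum class TxState { kPreparing, kPrepared, kCommitted, kAborted };

// Take-over: the new coordinator polls every surviving participant for its
// state of the in-doubt transaction and completes it accordingly.
TxState TakeOverDecision(const std::vector<TxState>& participant_states) {
  for (TxState s : participant_states) {
    if (s == TxState::kCommitted) return TxState::kCommitted;  // finish commit
    if (s == TxState::kAborted) return TxState::kAborted;      // finish abort
  }
  // No participant saw a commit decision, so the failed coordinator cannot
  // have acknowledged the commit: aborting is safe and nothing blocks.
  return TxState::kAborted;
}

int main() {
  std::vector<TxState> in_doubt  = {TxState::kPrepared, TxState::kCommitted};
  std::vector<TxState> undecided = {TxState::kPrepared, TxState::kPreparing};
  std::cout << (TakeOverDecision(in_doubt) == TxState::kCommitted
                    ? "commit everywhere" : "abort everywhere") << std::endl;
  std::cout << (TakeOverDecision(undecided) == TxState::kCommitted
                    ? "commit everywhere" : "abort everywhere") << std::endl;
  return 0;
}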

So the conclusion of how to design a DBMS for Telco requirements is:
  • Use an in-memory DBMS
  • Use a Shared Nothing DBMS
  • Apply the changes on both the primary replica and the backup replicas as part of the transaction
  • Use non-blocking 2PC for higher concurrency of write transactions
  • Implement Heartbeat protocol to discover silent failures in both APIs and DBMS nodes
  • Implement Take-over protocols for each distributed protocol, especially Leader-Selection
  • Implement Software Upgrade mechanisms in both APIs and DBMS nodes
  • Implement Failure Handling of APIs and DBMS nodes
  • Support Online Schema Changes
  • Support Regional Replication
The above design makes it possible to run a DBMS with Class 6 availability (99.9999%, i.e. less than around 30 seconds of downtime per year). This means that all SW, HW and regional failures, including the catastrophic ones, are accounted for within these 30 seconds per year.
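
The arithmetic behind that number is simply that Class 6 availability allows one millionth of a year of downtime:

#include <cstdio>

// Class 6 availability means 99.9999% uptime, so the allowed downtime per
// year is one millionth of a year, roughly 30 seconds.
int main() {
  const double seconds_per_year = 365.25 * 24 * 3600;  // about 31.6 million
  const double unavailability = 1.0 - 0.999999;        // Class 6
  std::printf("%.1f seconds of downtime per year\n",
              seconds_per_year * unavailability);
  return 0;
}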

MySQL NDB Cluster has been used at this level for more than 20 years and continues to serve billions of people with a highly available service.

At Hopsworks, MySQL NDB Cluster was selected as the platform on which to build a highly available real-time AI platform. To make MySQL NDB Cluster accessible for anyone to use we forked it and call it RonDB. RonDB adds many improvements around ease of use and scalable reads, and comes with a managed service that makes it easy to install and manage RonDB. We have also added a set of new interfaces: a REST server that handles batches of key lookups, both generic database lookups and feature store lookups; RonSQL, which handles optimised aggregation queries that are very common in AI applications; and finally an experimental Redis-compatible interface called Rondis.

Check out rondb.com for more information. You can try it out, and if you want to read about the latest scalability benchmark you can go directly here. If you want a walkthrough of the challenges in running a highly scalable benchmark you can find it here.

Happy reading!

Thursday, August 21, 2025

How to reach 100M Key lookups using REST server with Python clients

A few months ago I decided to run a benchmark to showcase how RonDB 24.10 can handle 100M key lookups per second through our REST API server from Python clients. The exercise is meant to show how RonDB can scale to meet both the throughput and the latency requirements of personalised recommendation systems, which are commonly used by companies such as Spotify, e-commerce sites and so forth.

The exercise started at 2M key lookups per second. Running a large benchmark like this means that you hit all sorts of bottlenecks: some are due to configuration issues, some to load balancers, some to quota constraints and networking within the cloud vendor, some to bugs, and some required new features in RonDB. The exercise also included a comparison of VM types using Intel, AMD and ARM CPUs, as well as managing multiple Availability Zones.

I thought reporting on this exercise could be interesting learning for others as well, so the whole process can be found in this blog.

At rondb.com you can find other blogs about RonDB 24.10 and you can even try out RonDB in a Test Cluster. You can start a small benchmark and check 12 dashboards of monitoring information about RonDB while it is running.

Thursday, December 05, 2024

Release of RonDB 22.10.7

Today we made a new release in the stable series of RonDB. RonDB 22.10.7 is mostly a bug fix release, but it also contains a few new features that were required for the Kubernetes integration of RonDB.

The major development effort in RonDB is currently around RonDB 24.10, which is aimed at a first release in 1-2 months.

RonDB 22.10.7 contains the following new features:

RONDB-789: Find out memory availability in a container

RonDB uses Automatic Memory Configuration by default. In this setting RonDB discovers the amount of memory available and allocates most of it to the RonDB data nodes. On Linux, using VMs or bare metal servers, this information is found in /proc/meminfo. However, when running in a container the information is instead found in /sys/fs/cgroup/memory.max. The amount of memory made available is set in the RonDB Helm charts and can thus now be detected automatically without extra configuration variables. Setting TotalMemoryConfig will still override the discovered memory size.
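
A minimal sketch of the idea (not the actual RonDB code): prefer the cgroup v2 limit when running in a container, and fall back to MemTotal from /proc/meminfo otherwise:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

// Read the cgroup v2 memory limit; returns 0 if there is no limit ("max")
// or the file does not exist (e.g. not running in a container).
static uint64_t ReadCgroupLimit() {
  std::ifstream f("/sys/fs/cgroup/memory.max");
  std::string line;
  if (f && std::getline(f, line) && line != "max") {
    return std::stoull(line);  // limit in bytes
  }
  return 0;
}

// Read MemTotal from /proc/meminfo (reported in kB) and convert to bytes.
static uint64_t ReadMemTotal() {
  std::ifstream f("/proc/meminfo");
  std::string key;
  uint64_t value_kb = 0;
  while (f >> key >> value_kb) {
    if (key == "MemTotal:") return value_kb * 1024;
    std::string rest;
    std::getline(f, rest);  // skip the unit and the rest of the line
  }
  return 0;
}

int main() {
  uint64_t limit = ReadCgroupLimit();
  uint64_t total = (limit != 0) ? limit : ReadMemTotal();
  std::cout << "Memory available: " << total / (1024 * 1024) << " MB" << std::endl;
  return 0;
}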

RONDB-785: Set LocationDomainId dynamically

With RonDB Kubernetes support it is very easy to set up the cluster in such a way that the nodes in the RonDB cluster are spread across several Availability Zones (Availability Domains in Oracle Cloud). In order to avoid sending network messages over Availability Zone boundaries more than necessary, we try to locate the transaction coordinator in our domain and read data from our domain if possible.

To avoid complex Kubernetes setups this required the ability to set the domain in the RonDB data node container using a RonDB management client command. In RonDB we use a Location Domain Id to figure out which Availability Zone we are in. How this is set is up to the management software (Kubernetes and containers in our case).

This feature makes it possible to access RonDB through a network load balancer that chooses a MySQL Server (or RDRS server) in the same domain; this MySQL Server will contact a RonDB data node in the same domain, and finally the RonDB data node will ensure that it reads data from the same domain. Thus we can completely avoid network messages that pass over domain boundaries for key lookups that read data.
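
As a hypothetical illustration of the principle (the names and types are invented for the example, not the actual RonDB code), replica selection simply prefers a replica whose Location Domain Id matches the local one:

#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct Replica {
  uint32_t node_id;
  uint32_t location_domain_id;
};

// Prefer a replica in the local domain; only cross a domain boundary when
// no replica exists in the local domain.
std::optional<Replica> PickReplica(const std::vector<Replica>& replicas,
                                   uint32_t local_domain) {
  for (const Replica& r : replicas) {
    if (r.location_domain_id == local_domain) return r;  // stay in-domain
  }
  if (!replicas.empty()) return replicas.front();  // fall back to any replica
  return std::nullopt;
}

int main() {
  std::vector<Replica> replicas = {{1, 1}, {2, 2}};
  auto chosen = PickReplica(replicas, 2);
  if (chosen) std::cout << "Read from node " << chosen->node_id << std::endl;
  return 0;
}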

RONDB-784: Performance improvement for Complex Features in Go REST API Server

  • Bump Go version to 1.22.9.
  • Use the hamba Avro library as a replacement for the linkedin Avro library to deserialize complex features.
  • Avoid json.Unmarshal when parsing complex feature fields.
  • Use the Sonic library to serialize JSON before sending results to the client.

These changes cut the latency of complex feature processing to half of what it used to be, which significantly improves the latency of Feature Store REST API lookups.

RONDB-776: Changed hopsworks.schema to use TEXT data type

Impacts the Go REST API Server and its Feature Store REST API.


Wednesday, October 16, 2024

Coroutines in RonDB

A while ago the C++ standard added a new feature in C++20 called coroutines. I thought it was an interesting thing to try out for RonDB and spent some time this summer reading up on it. My finding was that C++ coroutines can only be used for special tasks that require no stack, since a coroutine cannot save its stack.

My hope was to find that a single thread could work using fibers, where each fiber belongs to one thread and the thread can switch between different fibers. The main goal of fibers would be to improve throughput by doing useful work instead of blocking the CPU on cache misses. However, fibers can also be a tool for building a scalable server where the process runs in a single thread when the load is low and scales up to hundreds of threads (if the process has access to that many CPUs) when required. This provides better power efficiency and better latency at low loads. Also, if we ever get VMs that can dynamically scale the number of CPUs up and down, we can use fibers to scale the number of threads up and down accordingly.

My conclusion from reading up on C++20 coroutines was that they could not deliver on this.

However, my reading did turn up an intriguingly simple and elegant solution to the problem. See this blog for a description, and here is the GitHub tree with the code. A small header file of around 300 lines solves the problem elegantly for x86_64, both on Macs and on Linux, and similarly for ARM64, thus covering all the platforms RonDB supports. The header file can also be used on Windows (Windows supports fibers).

I developed a small test program to see the code in action:

#include <iostream>
#include "tiny_fiber.h"

/**
 * A very simple test program checking how fibers and threads interact.
 * The program will printout the following:
 *   hello from fibermain
 *   hello from main
 *   hello from fibermain 2
 *   hello from main 2
 */

tiny_fiber::FiberHandle thread_fiber;
tiny_fiber::FiberHandle fiber;

void fibermain(void* arg) {
  tiny_fiber::FiberHandle fiber =
    *reinterpret_cast<tiny_fiber::FiberHandle*>(arg);
  std::cout << "hello from fibermain" << std::endl;
  tiny_fiber::SwitchFiber(fiber, thread_fiber);
  std::cout << "hello from fibermain 2" << std::endl;
  tiny_fiber::SwitchFiber(fiber, thread_fiber);
}

int main(int argc, char** argv) {
  const int stack_size = 1024 * 16;
  thread_fiber = tiny_fiber::CreateFiberFromThread();
  fiber = tiny_fiber::CreateFiber(stack_size, fibermain, &fiber);
  tiny_fiber::SwitchFiber(thread_fiber, fiber);
  std::cout << "hello from main" << std::endl;
  tiny_fiber::SwitchFiber(thread_fiber, fiber);
  std::cout << "hello from main 2" << std::endl;
  return 0;
}

Equipped with this I have the tools I need to develop an experiment and see how fibers work with RonDB. The good news is that I don't need to learn any complex C++ syntax to do this; it is all low-level systems programming. I have learnt through long experience that having a theory is no guarantee of success. A computer is sufficiently complex that one cannot fully predict the impact of the changes one makes. So I am excited to see how this particular new idea works out.

The concept of fibers fits very nicely into the RonDB runtime scheduler and its division of work between threads. It even makes it possible to turn a thread into a fiber, move it to another OS thread, and later return it to its original thread.