Friday, October 04, 2024

Early design choices for RonDB and InnoDB

I have had many interesting discussions with Zhao Song about RonDB and its internals. Since both Zhao and I have also worked on MySQL/InnoDB features, it is natural that we sometimes compare the features of RonDB with the features in InnoDB.

In this blog I will discuss the basis for the very different solutions that we have in RonDB compared to what we find in InnoDB. Since this blog is mostly about the early history of RonDB and InnoDB, I will use the NDB name, which is still the name of the product Oracle develops and that RonDB is a fork of.

The story starts more than 30 years ago in the early 1990s. I was just starting my Ph.D. studies in databases. I was working at Ericsson, the world's leading telecom provider. In the late 1980s the telecom industry started using databases in telecom applications. The first application was the network databases for enterprise companies, the SCPs (Service Control Points). These network databases were used when calling a company number to intelligently control who would pick up the phone. These numbers are still in use today, 020 numbers in Sweden and 800 numbers in the US.

Traditionally DBMSs had not been used in the telecom industry due to the real-time requirements that the applications placed on the DBMS. This started to change in the early 1990s when many managers at Ericsson started to understand how important it was to be on top of this new technology in the telecom networks. Actually Ericsson started 4 projects within 10 years to develop new DBMS engines. The first was DBS, an internal SQL database used in AXE switches. The second was DBN, developed as part of a large research project called AXE-N. The third DBMS developed was NDB Cluster and finally the fourth one was Mnesia. All of these DBMSs are still in active use: DBS still in AXE systems, DBN renamed to TelORB and used in a number of legacy applications within Ericsson, NDB Cluster nowadays known as MySQL NDB Cluster with RonDB as a fork of this DBMS, and finally Mnesia, an open source DBMS for Erlang applications.

At the same time the database industry had introduced the relational database, and Oracle had won this category in the early 1990s. However, Oracle wasn't something that could be used in telecom applications in the 1990s.

My personal start in the database market was to develop a course on DBS (Database Subsystem), a new subsystem in the telecom switch AXE that used a compiler to compile SQL queries to assembly (always primary key lookup queries). Actually I still think it might be the most efficient SQL implementation in the world. A SQL SELECT query could be translated to as few as 2 assembler instructions!

However, my research started in earnest with an EU research project on 3G mobile networks that I participated in from 1992 to 1995. I worked in the network simulation task, where we analysed the requirements on the various network nodes in the telecom network. I focused on the requirements of the network databases in the 3G mobile network. The network database was used to track where the mobile was through location updates, and the services of the mobile were also stored in the network database.

From these requirements, derived through network simulations, it was clear that the network database had to handle multiple queries within a time span of about 10 milliseconds. The telecom network would be completely dependent on these network databases being available at all times. Downtime meant no phone calls for the mobiles, obviously unacceptable.

There was lots of speculation about the killer application for mobiles; today we know this is the smartphone with all its apps. At the time we speculated on multimedia email and on-demand news, and I also looked into genealogy applications.

This meant that I had a pretty good idea about the performance requirements, the latency requirements, the availability requirements and the storage requirements of any future telecom DBMS.

In the database industry at the same time there had been lots of research on how to build recovery algorithms. The most important report was written by C. Mohan at IBM and presented the ARIES algorithm.

Both InnoDB and NDB (yes, it is short for Network DataBase) were invented and developed in the 1990s (Ericsson was really keen on three-letter abbreviations). I don't know all the details about the reasons behind InnoDB, but it developed a fairly straightforward architecture based on the knowledge of the 1990s. Thus it used many ideas from ARIES and from the development of scalable disk-based B+-trees, essentially providing an open source version of the way enterprise databases were built at the time.

NDB, on the other hand, tried to solve a new problem: how to write an always available DBMS with very low latency that made extremely efficient use of CPUs to ensure that the performance requirements could be met. The early prototypes of NDB used special hardware (Dolphin SCI) for networking, which made it possible already in the 1990s to achieve response times of less than a millisecond.

The requirements and the research led to the following decisions.

  1. The DBMS must primarily be an in-memory DBMS to meet both performance and latency requirements.
  2. The Always Available requirement means that it must have failover times that are instant for software failures and at most a few seconds for hardware failures.
  3. The complexity of the availability requirements meant that traditional 2-phase locking was used, since it simplified the reasoning about recovery algorithms.
  4. The DBMS must avoid context switches as much as possible. This led to NDB becoming the first database that used an internal asynchronous programming style, which gave both superior performance and latency. NDB never lost a benchmark competition, not even when the rules were set by a competing department! The basis of this asynchronous technology came from the AXE system, which used blocks (modules) and signals (messages) as the main concepts in its programming style. A minimal sketch of this style follows after the list.
  5. The most important query in a telecom DBMS is the key lookup for reads and writes.
  6. The DBMS must have a hash index, built for efficient use of CPU caches, as the main index.
  7. SQL was not a suitable programming API for the network database, since it required too much overhead for key lookup queries.
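
To make point 4 a bit more concrete, here is a minimal sketch of the block-and-signal style. This is not actual NDB code and all names are invented for illustration: each block owns its own data, and work is driven by executing signals from a job buffer on a single thread, so no mutexes and no context switches are needed while a query executes.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>
#include <unordered_map>

// A signal is just a small message routed to a block (module).
struct Signal {
  uint32_t receiver_block;   // which block should execute this signal
  uint32_t signal_number;    // what the block should do
  uint64_t key;              // payload: primary key in this toy example
};

// A block owns its data and only touches it while executing signals,
// so no locking is needed as long as one thread drives the scheduler.
class TupleBlock {
 public:
  static constexpr uint32_t READ_REQ = 1;
  void execute(const Signal& sig) {
    if (sig.signal_number == READ_REQ) {
      auto it = rows_.find(sig.key);
      std::cout << "key " << sig.key << " -> "
                << (it != rows_.end() ? it->second : -1) << "\n";
    }
  }
  void insert(uint64_t key, int value) { rows_[key] = value; }
 private:
  std::unordered_map<uint64_t, int> rows_;
};

// A single-threaded scheduler: signals are queued and executed in order,
// so a query runs to completion without any context switch.
int main() {
  TupleBlock tup;
  tup.insert(1, 100);
  tup.insert(2, 200);

  std::deque<Signal> job_buffer;
  job_buffer.push_back({/*receiver*/ 0, TupleBlock::READ_REQ, 1});
  job_buffer.push_back({0, TupleBlock::READ_REQ, 2});

  while (!job_buffer.empty()) {
    Signal sig = job_buffer.front();
    job_buffer.pop_front();
    tup.execute(sig);   // in a real system, routed by receiver_block
  }
  return 0;
}
```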

If you read those requirements, you probably understand why NDB Cluster became the first key-value data store, 10-15 years before the concept was even introduced. Our sales people went into a market where a database was equivalent to the use of SQL, so the marketing was mainly done by showing off significant performance benefits, and this continues to this day: RonDB is still the most performant key-value DBMS in the world.

In ARIES the idea was to use the REDO log to roll forward and then use the UNDO log to roll back to a state where only completed transactions are left. In addition, Compensation Log Records were used. The writing of the log had to abide by the WAL (Write-Ahead Logging) principle.

Both the REDO and UNDO logs were page-oriented, although they could sometimes use logical writes within each page.
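
To make the page-oriented, WAL-based approach a bit more concrete, here is a minimal sketch of what an ARIES-style log record and the WAL rule look like. These are invented types, not InnoDB code: every record is tied to a page and carries both a redo and an undo image, and a dirty page may only be flushed once the log covering it is durable.

```cpp
#include <cstdint>
#include <vector>

// A simplified ARIES-style log record: tied to a page, carrying both the
// redo image and the undo image of the bytes that were changed.
struct AriesLogRecord {
  uint64_t lsn;            // log sequence number
  uint64_t prev_lsn;       // previous record of the same transaction
  uint32_t page_id;        // the page this record applies to
  uint16_t offset;         // byte offset within the page
  std::vector<uint8_t> redo_image;  // bytes after the change
  std::vector<uint8_t> undo_image;  // bytes before the change
};

struct Page {
  uint32_t page_id;
  uint64_t page_lsn;       // LSN of the last record applied to this page
  bool dirty;
};

// The WAL rule: a dirty page may only be flushed to disk after all log
// records up to its page_lsn have been made durable.
bool can_flush_page(const Page& page, uint64_t durable_lsn) {
  return !page.dirty || page.page_lsn <= durable_lsn;
}
```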

This meant extra overhead on logging for updating transactions. It is normal that 100 transactions each updating 1 row in a DBMS are much more costly than running 1 transaction that updates 100 rows. Not so in NDB: in NDB the cost is more or less the same for the two, and it can even be more efficient to split the work into small transactions in a large cluster.

NDB stored transaction state in memory, and thus no dirty state was written to the database memory before commit. Hence only a REDO log was required, and this log was logical, storing only the changed columns; there was no need to store all the columns unless they were all updated.
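
As a contrast to the page-oriented record sketched above, here is a minimal sketch of a logical REDO record, again with invented types rather than actual NDB code: the row is identified by table and primary key, only the changed columns are logged, and there is no undo image at all, since no dirty state reaches the database memory before commit.

```cpp
#include <cstdint>
#include <vector>

// One changed column: which column it is and its new value.
struct ColumnChange {
  uint16_t column_id;
  std::vector<uint8_t> new_value;
};

// A logical REDO record: no page id, no byte offsets, no undo image.
// The row is identified by table and primary key, and only the columns
// that were actually updated are logged.
struct LogicalRedoRecord {
  uint64_t checkpoint_id;              // checkpoint the record belongs to
  uint32_t table_id;
  std::vector<uint8_t> primary_key;
  std::vector<ColumnChange> changes;   // typically a small subset of the row
};
```

Because nothing uncommitted is ever made durable in this model, recovery only needs to replay such a log forward from the latest checkpoint; there is no rollback phase.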

NDB used a two-phase commit protocol: a combination of a linear 2-phase commit protocol between the replicas of a row and a normal 2-phase commit protocol between operations. To avoid blocking states, the transaction state can be rebuilt at node failures to ensure that transactions can be finished quickly even in crash situations.
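
The following is a heavily simplified sketch of the idea behind a linear two-phase commit over a chain of replicas, with invented names and none of the coordinator or failure-handling machinery of the real protocol: the prepare request flows forward along the replica chain, and the commit decision then flows back along the same chain, so each replica only exchanges messages with its neighbours.

```cpp
#include <iostream>
#include <string>
#include <vector>

// A heavily simplified linear two-phase commit over a chain of replicas:
// prepare is passed forward along the chain, and once the last replica has
// prepared, the commit decision is passed back along the chain.
struct Replica {
  std::string name;
  bool prepared = false;
  bool committed = false;
};

void linear_two_phase_commit(std::vector<Replica>& chain) {
  // Phase 1: prepare, forwarded from the first replica to the last.
  for (auto& r : chain) {
    r.prepared = true;                       // apply the change provisionally
    std::cout << r.name << " prepared\n";
  }
  // Phase 2: commit, flowing back from the last replica to the first.
  for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
    it->committed = true;
    std::cout << it->name << " committed\n";
  }
}

int main() {
  std::vector<Replica> chain = {{"primary"}, {"backup"}};
  linear_two_phase_commit(chain);
  return 0;
}
```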

Thus we are ready to make some comparisons between InnoDB and NDB.

1. InnoDB uses a traditional ARIES algorithm with REDO and UNDO logs.

   NDB uses a logical REDO log.

2. InnoDB was designed as a single-node DBMS that uses replication algorithms on top to handle availability.

   NDB is a distributed DBMS designed for high-availability environments; HA is built into the product, and thus replication is part of the architecture.

3. InnoDB is a traditional disk-based DBMS.

   NDB is an in-memory DBMS.

4. InnoDB failover times depend on the time it takes to roll forward the log after a crash.

   NDB failover time is instant.

5. InnoDB uses a traditional B+-tree, and is thus best used for a small number of larger queries.

   NDB tables always have a distributed hash table as the main index for efficient key lookups; already 11 years ago NDB showed the ability to handle 200M queries per second. A minimal sketch of such a lookup follows after this list.

6. InnoDB uses a traditional OS model with lots of threads interacting through mutexes, condition variables and atomic variables. (The author spent a few years making this architecture scale to 64 CPUs.)

   NDB used as its base a single-threaded model without context switches to execute a query. It has since scaled this up to a set of single-CPU threads that interact and can scale to hundreds of CPUs.

7. InnoDB was designed with most of the focus on making access to disk pages as efficient as possible.

   NDB was designed with data structures that focus on minimising the number of CPU cache misses. NDB can still execute 2-4x more instructions per CPU cycle compared to traditional DBMSs. In the early days the difference was even bigger. All disk writes are sequential in nature and thus very efficient.
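
To illustrate point 5, here is a minimal sketch of the idea behind a distributed hash table for key lookups, with invented names rather than actual NDB code: the key is hashed once to pick a partition (which in a real cluster lives on a particular data node), and a hash index inside the partition then finds the row, so both reads and writes avoid any tree traversal.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// A toy distributed hash table: the key hash selects a partition (which in
// a real cluster lives on a particular node), and a hash index inside the
// partition finds the row. Both steps are O(1), with no tree traversal.
class PartitionedTable {
 public:
  explicit PartitionedTable(size_t num_partitions)
      : partitions_(num_partitions) {}

  void write(uint64_t key, const std::string& value) {
    partitions_[partition_of(key)][key] = value;
  }

  const std::string* read(uint64_t key) const {
    const auto& part = partitions_[partition_of(key)];
    auto it = part.find(key);
    return it != part.end() ? &it->second : nullptr;
  }

 private:
  size_t partition_of(uint64_t key) const {
    return std::hash<uint64_t>{}(key) % partitions_.size();
  }
  std::vector<std::unordered_map<uint64_t, std::string>> partitions_;
};

int main() {
  PartitionedTable table(4);   // e.g. 4 partitions spread over data nodes
  table.write(12345, "subscriber A");
  if (const std::string* v = table.read(12345)) {
    std::cout << *v << "\n";
  }
  return 0;
}
```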


Both NDB and InnoDB have developed a lot since their early days in the 1990s, and both Zhao and I have participated in both InnoDB and NDB development. The main reason for their differences is that they focus on solving quite different problems.

Zhao has spent the last year implementing pushdown aggregates in RonDB, meaning that aggregate queries can be evaluated right at the time we access the data and can also be parallelised. This means that those queries can be 10-20x faster in RonDB than previously in NDB and about 5-10x faster than in MySQL/InnoDB.
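
Here is a minimal sketch of the pushdown idea, with invented names rather than the actual RonDB implementation: each data node computes a partial aggregate over its own partition while scanning it, possibly in parallel, and the coordinator merges only the small partial results instead of pulling every row across the network.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Partial aggregate computed locally on a data node while scanning its
// partition; only this small struct crosses the network.
struct PartialSum {
  uint64_t count = 0;
  int64_t sum = 0;
};

// Runs on each data node, possibly in parallel on all partitions.
PartialSum aggregate_partition(const std::vector<int64_t>& rows) {
  PartialSum p;
  for (int64_t v : rows) {
    p.count += 1;
    p.sum += v;
  }
  return p;
}

// Runs on the coordinator: merge partial results, e.g. for SELECT AVG(col).
double merge_average(const std::vector<PartialSum>& partials) {
  uint64_t count = 0;
  int64_t sum = 0;
  for (const auto& p : partials) {
    count += p.count;
    sum += p.sum;
  }
  return count ? static_cast<double>(sum) / count : 0.0;
}

int main() {
  // Two partitions on two data nodes, aggregated locally and then merged.
  std::vector<PartialSum> partials = {
      aggregate_partition({10, 20, 30}),
      aggregate_partition({40, 50}),
  };
  std::cout << "AVG = " << merge_average(partials) << "\n";  // prints 30
  return 0;
}
```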

RonDB is now focused on the requirements of AI applications. This still means a focus on key lookups, but also specialised aggregate queries and a lot of data changes that flow in and out of the RonDB database. High availability is still a very important requirement. Nowadays RonDB can also store large parts of the rows on disk using a traditional disk page cache.

In conclusion, InnoDB is a very capable database backend for traditional database applications. It has been able to handle competition from many competing products, both within MySQL and outside it.

RonDB, based on NDB, is also a very capable database backend for real-time applications, with high availability surpassing what is accomplished with the Oracle DBMS, and a perfect fit for the new era of AI applications. It has been used to build an LDAP server, an SQL database, a distributed file system and many HA applications. Thus RonDB is an important tool in your toolbox as a developer of the most demanding applications.

But InnoDB and RonDB serve very different customer segments; their differences simply come from serving different customer requirements, and this has led to quite different technology choices.
