Tuesday, September 28, 2021

Memory Management in RonDB

Most of the memory allocated in RonDB is handled by the global memory manager. Exceptions are architecture objects and some fixed-size data structures. In this presentation we will focus on the parts handled by the global memory manager.

In the global memory manager we have 13 different memory regions as shown in the figure below:



- DataMemory

- DiskPageBufferCache

- RedoBuffer

- UndoBuffer

- JobBuffer

- SendBuffers

- BackupSchemaMemory

- TransactionMemory

- ReplicationMemory

- SchemaMemory

- SchemaTransactionMemory

- QueryMemory

- DiskOperationRecords

One could divide those regions by a set of qualities. Some regions are fixed in size. Some are critical and cannot handle a failure to allocate memory. Some have no natural upper limit and are unlimited in size. Some are flexible in size and can work together to achieve the best use of memory. We can also divide regions based on whether their memory is short term or long term. Each region can belong to multiple categories.

To handle these qualities we have a priority on each memory region. This priority can be affected by the amount of memory that the region has allocated.

Fixed regions have a fixed size. This is used for database objects, the Redo log buffer, the Undo log buffer, the DataMemory and the DiskPageBufferCache (the page cache for disk pages). There is code to ensure that we queue up when those resources are no longer available. DataMemory is a bit special and we will describe it separately below.

Critical regions are regions where a failed request to allocate memory would cause a crash. This relates to the job buffer, which is used for internal messages inside a node, and to send buffers, which are used for messages to other nodes. DataMemory is a critical region during recovery: if we fail to allocate memory for database objects during recovery we would not be able to recover the database. Thus DataMemory is a critical region in the startup phase, but not during normal operation. DiskOperationRecords are also a critical resource since otherwise we cannot maintain the disk data columns. Finally we also treat BackupSchemaMemory as critical since not being able to perform a backup would make it very hard to manage RonDB.

Unlimited regions have no natural upper limit. Thus as long as memory is available at the right priority level, the memory region can continue to grow. The regions in this category are BackupSchemaMemory, QueryMemory and SchemaTransactionMemory. QueryMemory is memory used to handle complex SQL queries such as large join queries. SchemaTransactionMemory can grow indefinitely, but the meta data operations try to avoid growing too big.

Flexible regions are regions that can grow indefinitely but that have to set limits on their own growth to ensure that other flexible regions are also allowed to grow. Thus one flexible region isn't allowed to grab all the shared memory resources. There are limits to how much memory a region can grab before its priority is significantly lowered.

Flexible regions are TransactionMemory, ReplicationMemory, SchemaMemory, QueryMemory, SchemaTransactionMemory, SendBuffers, BackupSchemaMemory and DiskOperationRecords.
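The categories above can be written down as a small table. This is only a sketch to make the overlap between categories visible; the `Quality` flag and the `regions_with` helper are names made up for this illustration and are not taken from the RonDB source code.

```python
from enum import Flag, auto


class Quality(Flag):
    """Hypothetical flags for the region qualities described in the text."""
    FIXED = auto()
    CRITICAL = auto()
    UNLIMITED = auto()
    FLEXIBLE = auto()


# Membership as listed in the text. DataMemory is additionally critical,
# but only during recovery, so it is not flagged CRITICAL here.
REGION_QUALITIES = {
    "DataMemory": Quality.FIXED,
    "DiskPageBufferCache": Quality.FIXED,
    "RedoBuffer": Quality.FIXED,
    "UndoBuffer": Quality.FIXED,
    "JobBuffer": Quality.CRITICAL,
    "SendBuffers": Quality.CRITICAL | Quality.FLEXIBLE,
    "BackupSchemaMemory": Quality.CRITICAL | Quality.UNLIMITED | Quality.FLEXIBLE,
    "TransactionMemory": Quality.FLEXIBLE,
    "ReplicationMemory": Quality.FLEXIBLE,
    "SchemaMemory": Quality.FLEXIBLE,
    "SchemaTransactionMemory": Quality.UNLIMITED | Quality.FLEXIBLE,
    "QueryMemory": Quality.UNLIMITED | Quality.FLEXIBLE,
    "DiskOperationRecords": Quality.CRITICAL | Quality.FLEXIBLE,
}


def regions_with(quality):
    """Return the sorted names of all regions that carry the given quality."""
    return sorted(name for name, q in REGION_QUALITIES.items() if quality in q)


print(regions_with(Quality.UNLIMITED))
```

Using flags rather than a single category per region reflects that a region such as BackupSchemaMemory belongs to several categories at once.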

Finally we have short term versus long term memory regions. A short term memory region allocation is of smaller significance compared to a long term memory region allocation. In particular this relates to SchemaMemory. SchemaMemory contains metadata about tables, indexes, columns, triggers, foreign keys and so forth. This memory, once allocated, will stay for a very long time. Thus if we allow it to grow too much into the shared memory we will not have space to handle large transactions that require TransactionMemory.

Each region has a reserved space, a maximum space and a priority. In some cases a region can also have a limit where its priority is lowered.

4% of the shared global memory is only accessible to the highest priority regions plus half of the reserved space for job buffers and communication buffers.

10% of the shared global memory is only available to high prio requesters. The remainder of the shared global memory is accessible to all memory regions that are allowed to allocate from the shared global memory.

The actual limits might change over time as we learn more about how to adapt the memory allocations.
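To make the tiers concrete, here is a minimal sketch of the admission check. It assumes that the 4% and the 10% are two separate slices stacked on top of each other, so that low priority requests see 86% of the shared global memory, high priority requests see 96% and the highest priority requests see all of it. It ignores the extra reserved space for job buffers and communication buffers. The `Priority` names and `may_allocate` are made up for this illustration.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority levels for requests to shared global memory."""
    LOW = 0
    HIGH = 1
    HIGHEST = 2


# Percent of the shared global memory that is held back from each level
# (assumption: the 4% and 10% slices are separate and add up for LOW).
HELD_BACK_PERCENT = {
    Priority.LOW: 14,
    Priority.HIGH: 4,
    Priority.HIGHEST: 0,
}


def may_allocate(priority, used_pages, total_pages, request_pages=1):
    """Return True if the request fits below the limit for this priority."""
    held_back = total_pages * HELD_BACK_PERCENT[priority] // 100
    return used_pages + request_pages <= total_pages - held_back


# With 1000 shared pages and 900 already in use, only the higher
# priority levels can still allocate.
for prio in Priority:
    print(prio.name, may_allocate(prio, used_pages=900, total_pages=1000))
```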

Most regions also have access to the shared global memory. A region will first use its reserved memory, and if there is shared global memory available it can allocate from this as well.
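The order of allocation can be sketched as follows. The `SharedPool` and `Region` classes are hypothetical and only track page counts; priorities are left out to keep the example small.

```python
class SharedPool:
    """Hypothetical counter for free pages in the shared global memory."""

    def __init__(self, pages):
        self.free = pages


class Region:
    """Hypothetical region with reserved pages and access to a shared pool."""

    def __init__(self, name, reserved, shared):
        self.name = name
        self.reserved = reserved
        self.reserved_used = 0
        self.shared_used = 0
        self.shared = shared

    def alloc_page(self):
        """Allocate one page: reserved memory first, then shared memory.

        Returns "reserved", "shared" or None if no page is available.
        """
        if self.reserved_used < self.reserved:
            self.reserved_used += 1
            return "reserved"
        if self.shared.free > 0:
            self.shared.free -= 1
            self.shared_used += 1
            return "shared"
        return None


pool = SharedPool(pages=2)
region = Region("TransactionMemory", reserved=1, shared=pool)
print([region.alloc_page() for _ in range(4)])
```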

The most important regions are DataMemory and DiskPageBufferMemory. Every row stored in memory and all indexes in RonDB are stored in the DataMemory. The DiskPageBufferMemory contains the page cache for data stored on disk. To ensure that we can always handle recovery, DataMemory is fixed in size, and since recovery can sometimes grow the data size a bit, we don't allow the DataMemory to be filled beyond 95% in normal operation. In recovery it can use the full DataMemory size. Those extra 5% of memory are also reserved for critical operations such as growing the cluster with more nodes and reorganising the data inside RonDB. The DiskPageBufferCache is fixed in size; operations towards the disk are queued by using DiskOperationRecords.
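The 95% rule for DataMemory is simple arithmetic, shown here as a small sketch. The function name and its parameters are made up for this illustration, and the page count is just an example.

```python
def data_memory_limit(total_pages, in_recovery=False, critical_operation=False):
    """Return how many DataMemory pages may be in use.

    Normal operation is capped at 95% of the region. Recovery and critical
    operations (for example reorganising data) may use the full size.
    """
    if in_recovery or critical_operation:
        return total_pages
    return total_pages * 95 // 100


total = 1000
print(data_memory_limit(total))                    # normal operation
print(data_memory_limit(total, in_recovery=True))  # recovery
```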

Critical regions have higher priority to get memory compared to the rest of the regions. These are job buffers used for sending messages between modules inside a data node, send buffers used for sending messages between nodes in the cluster, the meta data required for handling backup operations and finally operation records to access disk data.

These regions will be able to allocate memory even when all other regions will fail to allocate memory. Failure to access memory for those regions would lead to failure of the data node or failure to backup the data which are not events that are acceptable in a DBMS.

We have 2 more regions that are fixed in size: the Redo log buffer and the Undo log buffer (the Undo log is only used for operations on disk pages). Those allocate memory at startup and use that memory. There is some functionality to handle overload on those buffers by queueing operations when those buffers are full.
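The idea of queueing operations when a fixed buffer is full can be sketched like this. It is not how the Redo and Undo log buffers are implemented in RonDB; the `LogBuffer` class and its methods are hypothetical and only show the general technique of a fixed-size buffer with a wait queue.

```python
from collections import deque


class LogBuffer:
    """Hypothetical fixed-size buffer that queues operations when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.waiting = deque()  # (operation, size) pairs waiting for space
        self.written = []       # operations admitted to the buffer, in order

    def write(self, op, size):
        """Admit the operation if it fits, otherwise queue it.

        Operations are also queued while others are waiting, to keep order.
        """
        if self.waiting or self.used + size > self.capacity:
            self.waiting.append((op, size))
            return False
        self.used += size
        self.written.append(op)
        return True

    def flushed(self, size):
        """Free space after a flush to disk and admit queued operations."""
        self.used = max(0, self.used - size)
        while self.waiting and self.used + self.waiting[0][1] <= self.capacity:
            op, op_size = self.waiting.popleft()
            self.used += op_size
            self.written.append(op)


buf = LogBuffer(capacity=10)
buf.write("a", 6)   # fits
buf.write("b", 6)   # buffer full, queued
buf.flushed(6)      # space freed, "b" is admitted
print(buf.written)
```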

We will go through the remaining 4 regions in detail.

The first one is TransactionMemory. This memory region is used for all sorts of operations such as transaction records, scan records, key operation records and many more records used to handle the queries issued towards RonDB.

The TransactionMemory region has a reserved space, but it can grow up to 50% of the shared global memory beyond that. It can even grow beyond that, but in this case it only has access to the lowest priority region of the shared global memory. Failure to allocate memory in this region leads to aborted transactions.
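As a sketch, the priority drop for TransactionMemory can be expressed as a check against the 50% mark. The function name and the labels "normal" and "lowest" are made up for this illustration, and it assumes the 50% is measured on the pages the region holds in the shared global memory, not counting its reserved space.

```python
def transaction_memory_priority(shared_pages_held, shared_total_pages):
    """Return the priority used for the next shared memory request.

    Below 50% of the shared global memory the region keeps its normal
    priority. From 50% on it can only use the lowest priority part.
    """
    soft_limit = shared_total_pages * 50 // 100
    if shared_pages_held < soft_limit:
        return "normal"
    return "lowest"


print(transaction_memory_priority(200, 1000))
print(transaction_memory_priority(600, 1000))
```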

The second region in this category is SchemaMemory. This region contains a lot of meta data objects representing tables, fragments, fragment replicas, columns, and triggers. These are long-term objects. Thus we want this region to be flexible in size, but we don't want it to grow so much that it diminishes the possibility to execute queries. Thus we calculate a reserved part and allow the region to grow into at most 20% of the shared global memory in addition to its reserved part. This region cannot access the higher priority parts of the shared global memory.

Failure to allocate SchemaMemory causes meta data operations to be aborted.

Next region in this category is ReplicationMemory. These are memory structures used to represent replication towards other clusters supporting Global Replication. It can also be used to replicate changes from RonDB to other systems such as ElasticSearch. The memory in this region is of temporary nature with memory buffers used to store the changes that are being replicated. The meta data of the replication is stored in the SchemaMemory region.

This region has a reserved space, but it can also grow to use up to 30% of the shared global memory. After that it will only have access to the lower priority regions of the shared global memory.

Failure to allocate memory in this region leads to failed replication. Thus replication has to be set up again. This is a fairly critical error, but it is something that can be handled.

The final region in this category is QueryMemory. This memory has no reserved space; it can only use the lower priority parts of the shared global memory. This memory is used to handle complex SQL queries. Failure to allocate memory in this region will lead to complex queries being aborted.
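The four regions just described can be summarised in a small table. This is only a restatement of the text above as data; the `FlexibleRegion` class and its field names are made up for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlexibleRegion:
    """Hypothetical summary record for one of the four regions."""
    name: str
    has_reserved: bool
    shared_limit_percent: Optional[int]  # None: no percentage given in the text
    beyond_limit: str                    # what happens past the percentage
    on_failure: str                      # effect of a failed allocation


REGIONS = {
    r.name: r
    for r in (
        FlexibleRegion("TransactionMemory", True, 50,
                       "lowest priority shared memory only",
                       "transactions are aborted"),
        FlexibleRegion("SchemaMemory", True, 20,
                       "no further growth",
                       "meta data operations are aborted"),
        FlexibleRegion("ReplicationMemory", True, 30,
                       "lower priority shared memory only",
                       "replication fails and has to be set up again"),
        FlexibleRegion("QueryMemory", False, None,
                       "lower priority shared memory only",
                       "complex queries are aborted"),
    )
}

for region in REGIONS.values():
    print(f"{region.name}: {region.on_failure}")
```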

This blog presents the memory management architecture in RonDB that is currently in a branch called schema_mem_21102. This branch is intended for RonDB 21.10.2, but could also be postponed to RonDB 22.04. The main difference in RonDB 21.04 is that the SchemaMemory and ReplicationMemory are fixed in size and cannot use the shared global memory. The BackupSchemaMemory is also introduced in this branch. It was previously part of the TransactionMemory.

In the next blog on this topic I will discuss how one configures the automatic memory in RonDB.

Friday, September 24, 2021

Automatic Memory Management in RonDB

RonDB has now grown up to the same level of memory management as you find in expensive commercial DBMSs like Oracle, IBM DB2 and Microsoft SQL Server.

Today I made the last development steps in this large project. This project started with a prototype effort by Jonas Oreland already in 2013 after being discussed for a long time before that. After he left for Google the project was taken over by Mauritz Sundell, who implemented the first steps for operational records in the transaction manager.

Last year I added the rest of the operational records in NDB. Today I completed the programming of the final step in RonDB. This last step meant moving around 30 more internal data structures towards using the global memory manager. These memory structures are used to represent meta data about tables, fragments, fragment replicas, triggers and global replication objects.

One interesting part of this work is a malloc-like implementation that interacts with all the record-level data structures that are already in RonDB to handle linked lists, hash tables and so forth for internal data structures.

So after more than 5 years it feels like a major step forward in the development of RonDB.

What does this mean for a user of RonDB? It means that the user won't have to bother much with memory management configuration. If RonDB is started in a cloud VM, it will simply use all memory in the VM and ensure that the memory is handled as a global resource that can be used by all parts of RonDB. This feature already exists in RonDB 21.04. What this new step means is that the memory management is even more flexible: there is no need to allocate more memory than needed for meta data objects (and vice versa, if more memory is needed, it is likely to be accessible).

Thus memory can be used for other purposes as well. The end result is that more memory is made available in all parts of RonDB, both to store data and to perform more parallel transactions and more query handling.

Another important point is that this step opens up for many new developments to handle larger objects in various parts of RonDB.

In later blogs we will describe how the memory management in RonDB works. This new development will either appear in RonDB 21.10 or in RonDB 22.04.