This blog gives some background to the decisions made when designing the
storage engine NDB Cluster used in MySQL Cluster around how to support
real-time requirements (or as I sometime refer to it, predictable response
time requirements).
databases two things were important. The first requirement is that we need to
be able to respond to queries within a few milliseconds (today even down to
tens of microseconds). The second requirement is that we need to do this while
at the same time supporting a mix of simple traffic queries combined with a
number of more complex queries running at the same time.
The first requirement was the main requirement that led to NDB Cluster using a
main memory storage model with durability on disk using a REDO log and
various checkpoints. Today we also support storing non-indexed columns on
disk in combination with columns stored in main memory.
in an extremely large environment with many CPUs can be done by allowing the
traffic queries and management queries to run on different CPUs. This model will
however not work at all in a confined environment with only 1-2 CPUs and it will
even be hard to put to work in a large environment since the usage of the
management queries will come and go quickly.
compared to our requirement of responding within parts of a millisecond.
So this model won't work very well either.
Yet another possibility would be to use a real-time operating system, but this would
marginalise the product too much.
Most DBMS today use the OS to handle the requirements on reponse times. So as an
example if one uses MySQL/InnoDB and send various queries to the MySQL Server,
some traffic queries and some management queries, MySQL will use different threads
for each query. MySQL will deliver good throughput even in the context of very
varying workloads since the OS will use time-sharing to fairly split the CPU usage
amongst the various threads. However it will not be able to handle response time
requirements of parts of a millisecond with a mixed load of simple and complex
queries.
within Ericsson. In Ericsson a real-time telecom switch had been developed in the
70s, the AXE. The AXE is still in popular use today and new versions of it are still
developed. AXE had a solution to this problem which was built around a message
passing machine.
I spent a good deal of the 90s developing a virtual machine for AXE called AXE VM
that later turned into a real product called APZ VM (APZ is the name of the CPU
subsystem in the AXE). This virtual machine was able to execute on any machine.
The AXE VM used a model where execution was handled as execution of signals. A
signal is simply a message, this message contains an address label, it contains
a signal number and it contains data of various sizes. A signal is executed inside
a block, a block is a module that is self-contained, it owns all its data and
the only manner to get to the data in the block is through sending a signal to the
block.
So effectively the AXE VM implemented a real-time operating system inside a normal
operating system such as Windows, Linux, Solaris or Mac OS X.
The AXE VM also had a lot of handling of the language used in AXE called PLEX. This
is no longer present in NDB. But NDB still is implemented using signals and blocks.
The blocks are implemented in C++ and in AXE VM it was possible to have such
blocks, they were called simulated blocks. In NDB all blocks are nowadays simulated
blocks.
in a highly loaded system. First of all it is important to state that NDB does
handle this. We have very demanding customers both in the telecom, networking and
in financial sectors and lately also in the storage world that expects to run
complex transactions involving tens of different key lookups and scan queries and
that expects these transactions to complete within a few milliseconds even at
90-95% load in the system.
As an example in the financial sector missing the deadline might mean that you miss
the opportunity to buy or sell some stock equity in real-time trading. In the telecom
sector your telephone call setup and other telco services depends on immediate
response to complex transactions.
At the same time these systems also need to ensure that they can analyse the data
in real-time, these queries obviously have less demanding response time
requirements, but they are not allowed to impact the response time of the traffic queries.
The virtual machine model implements this by using a design technique where each
signal is only allowed to execute for a few microseconds. A typical key lookup
query in modern CPUs takes less than two microseconds to execute. Scanning a table
is divided up into scanning a few rows at a time where each such scan takes less
than ten microseconds. All other maintenance work to handle restarts, node failures,
aborts, creating new tables and so forth is similarly implemented with the same
requirements on signal execution.
So what this means is that a typical traffic transaction is normally handled by one
key lookup or a short scan query and then the response is sent back to the API node.
A transaction consists of a number of such interactions normally on the order of
tens of such queries. This means that each interaction needs to complete within
100-200 microseconds in order to handle response times of a few millseconds
for the transaction.
NDB can handle this response time requirement even when 20-30 messages are
queued up before the message given that each message will only take on the order
of 1-2 microseconds to execute. So most of the time is still spent in the transporter
layer sending the message and receiving the message.
A complex query will execute in this model by being split into many small signal
executions. Each time a signal is completed it will put itself back into the queue
of signals and wait for its next turn.
So traffic queries will always have the ability to meet strict requirements on
response time. Another nice thing with this model is that it will adapt to
varying workloads within a few microseconds. So if there is currently no traffic
queries to execute, then the complex query will get the CPU to itself since the
next signal will execute immediately after being put on the queue.
things happens. First we execute more and more signals every time we have
received a set of signals. This means that the overhead to collect each
signal decreases. Second executing larger and larger sets of signals means
that we send larger and larger packets. This means that the cost per packet
decreases. Thus actually NDB data nodes executes more and more efficiently
as load increases. This is a very important characteristic that avoids many
overload problems.
possible to use different Query Server for traffic queries to the ones used
for complex queries. So in the MySQL Cluster model this means that you can
use a set of MySQL Servers in the cluster to handle short real-time queries.
You can use a different set of MySQL Servers to handle complex queries.
Thus MySQL Cluster can handle real-time requirements in a proper configuration
of the cluster even when operating using SQL queries.
signals, the most common ones are TCKEYREQ that implements all types of
key lookups using the primary key and SCAN_TABREQ that implements all types
of scan queries (also including join queries that have been pushed down to
data nodes). There is a protocol to carry these signals that currently uses
TCP/IP sockets but have historically also been carried by SCI, Infiniband
and shared memory transporters.
So the separation of Data Server and Query Server functionality might mean
that MySQL Cluster have slightly longer minimum response time compared to
a local storage engine in MySQL, but MySQL Cluster will continue to deliver
low and predictable response times even using varying workloads and even
when executing at very high loads.
One experiment that was done when developing pushdown join functionality
showed that the performance of those pushed down joins was the same
when executing in an otherwise idle cluster as when executing in a cluster
that performed 50.000 update queries per second.
NDB has been designed such that with some work of configuring it properly
it can be extremely reliable in delivering predictable response times. At the
same time we're working hard to make it easier and easier to configure also
when you don't want to control every bell and whistle. One step in this direction
is the introduction of the ability to read also from backup replicas and the
adaptive control of which threads that help out in sending.
storage engine NDB Cluster used in MySQL Cluster around how to support
real-time requirements (or as I sometime refer to it, predictable response
time requirements).
Requirement analysis
When analysing the requirements for NDB Cluster based on its usage in telecomdatabases two things were important. The first requirement is that we need to
be able to respond to queries within a few milliseconds (today even down to
tens of microseconds). The second requirement is that we need to do this while
at the same time supporting a mix of simple traffic queries combined with a
number of more complex queries running at the same time.
The first requirement was the main requirement that led to NDB Cluster using a
main memory storage model with durability on disk using a REDO log and
various checkpoints. Today we also support storing non-indexed columns on
disk in combination with columns stored in main memory.
Potential solutions
The second requirement was a bit harder to handle. To solve the second requirementin an extremely large environment with many CPUs can be done by allowing the
traffic queries and management queries to run on different CPUs. This model will
however not work at all in a confined environment with only 1-2 CPUs and it will
even be hard to put to work in a large environment since the usage of the
management queries will come and go quickly.
The next potential solution is to simply leave the problem to the OS. Modern OSs
of today use a time-sharing model. However each time quanta is fairly longcompared to our requirement of responding within parts of a millisecond.
So this model won't work very well either.
Yet another possibility would be to use a real-time operating system, but this would
marginalise the product too much.
Most DBMS today use the OS to handle the requirements on reponse times. So as an
example if one uses MySQL/InnoDB and send various queries to the MySQL Server,
some traffic queries and some management queries, MySQL will use different threads
for each query. MySQL will deliver good throughput even in the context of very
varying workloads since the OS will use time-sharing to fairly split the CPU usage
amongst the various threads. However it will not be able to handle response time
requirements of parts of a millisecond with a mixed load of simple and complex
queries.
AXE VM
So when designing NDB Cluster we wanted to avoid this problem. NDB was designedwithin Ericsson. In Ericsson a real-time telecom switch had been developed in the
70s, the AXE. The AXE is still in popular use today and new versions of it are still
developed. AXE had a solution to this problem which was built around a message
passing machine.
I spent a good deal of the 90s developing a virtual machine for AXE called AXE VM
that later turned into a real product called APZ VM (APZ is the name of the CPU
subsystem in the AXE). This virtual machine was able to execute on any machine.
The AXE VM used a model where execution was handled as execution of signals. A
signal is simply a message, this message contains an address label, it contains
a signal number and it contains data of various sizes. A signal is executed inside
a block, a block is a module that is self-contained, it owns all its data and
the only manner to get to the data in the block is through sending a signal to the
block.
So effectively the AXE VM implemented a real-time operating system inside a normal
operating system such as Windows, Linux, Solaris or Mac OS X.
The AXE VM also had a lot of handling of the language used in AXE called PLEX. This
is no longer present in NDB. But NDB still is implemented using signals and blocks.
The blocks are implemented in C++ and in AXE VM it was possible to have such
blocks, they were called simulated blocks. In NDB all blocks are nowadays simulated
blocks.
How does NDB solve the real-time problem
So how does this model enable response times of down to parts of a millisecond evenin a highly loaded system. First of all it is important to state that NDB does
handle this. We have very demanding customers both in the telecom, networking and
in financial sectors and lately also in the storage world that expects to run
complex transactions involving tens of different key lookups and scan queries and
that expects these transactions to complete within a few milliseconds even at
90-95% load in the system.
As an example in the financial sector missing the deadline might mean that you miss
the opportunity to buy or sell some stock equity in real-time trading. In the telecom
sector your telephone call setup and other telco services depends on immediate
response to complex transactions.
At the same time these systems also need to ensure that they can analyse the data
in real-time, these queries obviously have less demanding response time
requirements, but they are not allowed to impact the response time of the traffic queries.
The virtual machine model implements this by using a design technique where each
signal is only allowed to execute for a few microseconds. A typical key lookup
query in modern CPUs takes less than two microseconds to execute. Scanning a table
is divided up into scanning a few rows at a time where each such scan takes less
than ten microseconds. All other maintenance work to handle restarts, node failures,
aborts, creating new tables and so forth is similarly implemented with the same
requirements on signal execution.
So what this means is that a typical traffic transaction is normally handled by one
key lookup or a short scan query and then the response is sent back to the API node.
A transaction consists of a number of such interactions normally on the order of
tens of such queries. This means that each interaction needs to complete within
100-200 microseconds in order to handle response times of a few millseconds
for the transaction.
NDB can handle this response time requirement even when 20-30 messages are
queued up before the message given that each message will only take on the order
of 1-2 microseconds to execute. So most of the time is still spent in the transporter
layer sending the message and receiving the message.
A complex query will execute in this model by being split into many small signal
executions. Each time a signal is completed it will put itself back into the queue
of signals and wait for its next turn.
So traffic queries will always have the ability to meet strict requirements on
response time. Another nice thing with this model is that it will adapt to
varying workloads within a few microseconds. So if there is currently no traffic
queries to execute, then the complex query will get the CPU to itself since the
next signal will execute immediately after being put on the queue.
Handling memory allocation
One more important factor in ensuring that NDB can always operate in an optimal
manner and deliver the expected throughput is that we control memory. All the
memory is allocated at startup, this means that we cannot get into a situation where
we oversubscribe the main memory of the machine. NDB even have a number of
config parameters to ensure that the memory used by NDB data nodes is never
paged out.
Locking of CPUs
One more manner of ensuring that NDB always operates in an optimal manner is to
control the placement of threads onto different CPUs.
Behaviour of NDB at high load
There is one more very important aspect of this model. As load increases twothings happens. First we execute more and more signals every time we have
received a set of signals. This means that the overhead to collect each
signal decreases. Second executing larger and larger sets of signals means
that we send larger and larger packets. This means that the cost per packet
decreases. Thus actually NDB data nodes executes more and more efficiently
as load increases. This is a very important characteristic that avoids many
overload problems.
Building a mixed environment for traffic and management queries
Finally the separation of Data Server and Query Server functionality makes itpossible to use different Query Server for traffic queries to the ones used
for complex queries. So in the MySQL Cluster model this means that you can
use a set of MySQL Servers in the cluster to handle short real-time queries.
You can use a different set of MySQL Servers to handle complex queries.
Thus MySQL Cluster can handle real-time requirements in a proper configuration
of the cluster even when operating using SQL queries.
Conclusion
The interface to the Data Server is as you can now see implemented on top ofsignals, the most common ones are TCKEYREQ that implements all types of
key lookups using the primary key and SCAN_TABREQ that implements all types
of scan queries (also including join queries that have been pushed down to
data nodes). There is a protocol to carry these signals that currently uses
TCP/IP sockets but have historically also been carried by SCI, Infiniband
and shared memory transporters.
So the separation of Data Server and Query Server functionality might mean
that MySQL Cluster have slightly longer minimum response time compared to
a local storage engine in MySQL, but MySQL Cluster will continue to deliver
low and predictable response times even using varying workloads and even
when executing at very high loads.
One experiment that was done when developing pushdown join functionality
showed that the performance of those pushed down joins was the same
when executing in an otherwise idle cluster as when executing in a cluster
that performed 50.000 update queries per second.
NDB has been designed such that with some work of configuring it properly
it can be extremely reliable in delivering predictable response times. At the
same time we're working hard to make it easier and easier to configure also
when you don't want to control every bell and whistle. One step in this direction
is the introduction of the ability to read also from backup replicas and the
adaptive control of which threads that help out in sending.