Monday, March 31, 2014

MySQL Cluster 7.4.0 Lab Release Improvements

The year 2014 seems to be the year of the In-Memory Database. We've seen a lot of commotion around the Hekaton product from Microsoft and other new products. At the same time everyone talks about WebScale and how to build distributed databases that can handle millions of queries per second. In addition we've seen the development of quite efficient communication mechanisms over the last few years. Databases that make use of In-Memory technology, WebScale and efficient communication mechanisms are the talk of today in the database world.

The background of NDB

The design of the NDB storage engine started out with exactly those base requirements more than 20 years ago, when the aim was to build the next generation telecom database. The base requirements were a shared-nothing database (for superior scalability and for meeting telecom requirements on fail-over times). Already in those days one could buy a machine equipped with 1 GB of memory. Given that the telecom databases were used for extremely quick lookups of small amounts of data, it was natural to consider an in-memory database design. My research into databases at the time showed that they spent a considerable amount of time in the operating system to handle communication and hard disks. So by moving data in-memory and by using communication mechanisms that avoided the operating system we were able to deliver extremely efficient database operations already in the late 90s.

NDB Today

Today NDB is the storage engine of MySQL Cluster and has been in production use for more than 10 years. Almost everyone on the globe is touched by its operation in some telecom system, in some computer game or in some other type of web property application. We have already delivered benchmarks with billions of reads per minute, and the scalability of MySQL Cluster is so high that we simply don't have big enough computers or computer sites to show off its limits. A while ago we had access to a computer lab with hundreds of computers connected using InfiniBand with a total bandwidth between the machines of 1 Tbit/sec. This means we can transport roughly 128 GBytes per second between the machines. However MySQL Cluster could theoretically produce enough parallel read operations to swamp even such a network. So we're getting to the point where it becomes less and less interesting to show the scalability limits of MySQL Cluster, which means that we also want to focus on efficiency and not only on scalability.

The Many-Core CPU Challenge

I gave a presentation to a set of students at Uppsala University about MySQL and MySQL Cluster. In this presentation I showed how the development of multi-core CPUs presented a tremendous challenge to software designers. In a short time of only 8 years, Intel and other HW developers gave us the challenge to scale our software 60x. At MySQL we were up to the challenge: we've increased the scalability of MySQL using InnoDB 20x in a time span of 5 years. At the same time we've actually increased the scalability of the NDB storage engine more than 30x, which means that MySQL Cluster, where we use MySQL together with the NDB storage engine, has actually scaled 60x in total. This means that my test machine with 8 sockets and 96 CPU threads is now the limiting factor in my benchmarks.

The Many-Core CPU Solution

How have we achieved this? With the MySQL Server it has been a long series of fixes to various bottlenecks, such as splitting the InnoDB buffer pool, handling the LOCK_open mutex and many more changes that collectively have made it possible to scale far beyond what our software in 2008 could achieve. This improvement of scalability continues, so stay tuned for more; there are blogs to read now about what is currently going on in MySQL 5.7 development.

With the NDB storage engine the solution has been quite different. From the beginning we built a distributed database consisting of a set of nodes that replicate data synchronously using transactions. In order to avoid use of the operating system we built the architecture as a set of independent modules that interact through messages. This was built on the architecture of AXE, a telecom switch operating system of unique efficiency. The first version had each node implemented in a single signal execution thread. Since the development was based on independent modules, we inherited a simple task of dividing this thread into a number of functional parts. Currently we have separated it into the local database part (the LDM threads), the transaction part, the network send part, the network receive part, an asynchronous event part and finally the main part containing features for meta data handling. Given that we developed a shared-nothing architecture it was simple to continue the partitioning to gain even more independent LDM parts by having each LDM thread handle a different part of the data. The transaction part can simply use a round-robin scheme, and the network parts can easily be divided per socket we handle. In the future we could perform even more divisions of some functions.
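To make the division concrete, here is a minimal sketch, not actual NDB code, of the three routing rules just described: data-dependent work is hashed to an LDM thread, new transactions are spread round-robin over the transaction threads, and network work follows the socket. All names and types are hypothetical.

  #include <atomic>
  #include <cstddef>
  #include <functional>
  #include <string>

  struct ThreadCounts { std::size_t ldm, tc, recv; };

  // A hash of the partition key decides which LDM thread owns the row.
  std::size_t ldm_for_key(const std::string &partition_key, const ThreadCounts &t) {
    return std::hash<std::string>{}(partition_key) % t.ldm;
  }

  // Transactions have no data affinity, so a simple round-robin is enough.
  std::size_t tc_for_new_transaction(const ThreadCounts &t) {
    static std::atomic<std::size_t> next(0);
    return next++ % t.tc;
  }

  // Each socket is handled by one receive thread, so the socket id decides.
  std::size_t recv_thread_for_socket(std::size_t socket_id, const ThreadCounts &t) {
    return socket_id % t.recv;
  }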


NDB Layer by Layer approach


So what does this mean in effect? It means that we have actually built a distributed database inside each node of our distributed database. Given that we can also replicate using MySQL replication we can go even further and have multiple clusters connected together. Those who want to think even further about how NDB could be used to build systems with millions of CPUs can google the word iClaustron. iClaustron is a hobby project I've played with since 2006, and I presented the aims of the project in a tech talk at Google which is available on YouTube.

The world is organised down into microcosmos and continues growing into macrocosmos. So why would software be different? We need to build systems of any size by using layers upon layers of distribution.

So building MySQL Cluster is an interesting project in building distribution into the system, layer by layer.


The big challenge ahead of us


So what could be the next challenge that the hardware engineers will deliver to us software engineers? Personally I am preparing for it already now. The challenge that I hope they will bring to us is persistent memory. This means that we will have to build databases where all of a sudden we can make persistent writes at a speed similar to what we currently get when writing to main memory. This will be an interesting challenge, and personally I think that main memory databases have a unique advantage here since they already work at memory speed. So I feel a bit like a horse in the gates before a race, kicking and eagerly waiting to get onto the track to see how fast we can run home the next big challenge. But we have to wait until the hardware engineers first settle which technology will be the winner in this category and can be commercialised.


So after these small philosophical thoughts, let's get into what we're doing in the first Lab Release of MySQL Cluster version 7.4 to get further on the path to these goals.

The improvements in MySQL Cluster 7.4.0 Labs release

As mentioned we are working on improving the efficiency of MySQL Cluster. We have specifically worked on the scans in the NDB storage engine, which have been heavily optimised. In benchmarks using a lot of scans, such as Sysbench, we have managed to scale up performance per data node by 46% comparing 7.4.0 to 7.3.5. Compared to 7.2.16 the difference is even bigger than 100%, but going from 7.2 to 7.3 it was mainly inefficiencies in the MySQL Server that were fixed.





Another important thing we've done in 7.4.0 is to add a lot of documentation about both our restarts and our scans in the form of extended comments in the code. We've also gone through the log messages presented to the operator during restarts and made them much more accessible and extensive.

MySQL Cluster 7.4.0 improvements for virtual machine environments

With 7.4 we're working hard on making MySQL Cluster more stable even when the underlying system isn't as stable as we would expect. MySQL Cluster is designed for high availability environments, and now we're working on making sure that the system can continue to operate even when systems are overcommitted. When we're working in a virtual machine environment where we cannot be certain of the exact resources we have available, it is hard to operate a high availability environment, but we still want to work as reliably as possible.

MySQL Cluster 7.4.0 Stability improvements

We have also been working on improving the availability of the system by improving restart times. There are many areas where we can work on this: we can remove subtle delays that add up to longer restart times, and we can use more parallelism in certain phases of the restarts. We have also made our local checkpoints more parallelised, which gives a more balanced load on the various LDM threads in our system. This has the nice side effect of paying off in 5-10% improved performance for any application. Naturally it also means that we can run local checkpoints faster, since doing so no longer risks creating imbalances in the CPU load.

Another unique feature of MySQL Cluster is supporting Active-Active environments using MySQL replication. We've been working to extend the support of this feature even further.


Benchmark environment description

We executed a set of benchmarks using Sysbench 0.4.12.6 in our dbt2-0.37.50.6 environment. We used a big machine with 8 sockets of Intel Xeon CPUs running at 2 GHz. Each socket has 6 cores and 12 CPU threads. In most cases we run with hyperthreading enabled, but we have found that running LDM threads without hyperthreading is a good idea. This decreases the number of partitions to manage and the number of threads to manage, which has a positive effect on performance. We used 8 LDM threads, and in this case the NDB data node used 2 sockets while the benchmark program and the MySQL Server had access to 5 sockets. The MySQL Server used about 40 CPU threads out of the 60 it had access to, so in this configuration we had spare resources to use. But in the next step, where we went to 12 LDM threads, we could not use the full potential of the SW. In this case the data node needed 3 sockets and the benchmark program used 1 socket, so the MySQL Server only had access to 4 sockets. This meant that it could increase performance by 25% and not the 50% made possible by going to 12 LDM threads (actually we squeezed a bit and made 52 CPU threads available to the MySQL Server and thus got about a 30% improvement over 8 LDM threads). Using 7.3 the data nodes are less efficient, so here we could scale the LDM threads all the way to the 50% improvement (actually we even got to a 52.7% improvement, so perfect scaling of performance as more LDM threads are added).

So with 12 LDM threads we need a 54-core machine to make full use of the potential of the data node. With 16 LDM threads we need even more: we need 4 sockets for the data node, 2 sockets for the benchmark program and 6 sockets to run the MySQL Server, thus a total of 12 sockets or 72 cores. This is probably as far as MySQL 5.6 can help us scale before the MySQL Server can no longer scale. But this is an important area of focus for MySQL 5.7, which has already had a set of improvements implemented in the recently released 5.7.4 DMR.

Final words

So with this kind of scalability we are now in a position to deliver more performance from one single node than we previously could deliver from an entire cluster. Imagine what performance we can then get when connecting many nodes together, and we're still working on making each MySQL Cluster thread more efficient in its execution.

Monday, November 25, 2013

How to make an efficient Scalable Key Lookup engine of MySQL Cluster

MySQL Cluster has all the ingredients to be designed as a very scalable and extremely efficient key lookup engine for the Cloud. As shown in earlier entries of my blog, we've been able to scale MySQL Cluster 7.2 to handle 72 million key lookups per second, or 4.3 billion key lookups per minute. This benchmark was actually limited by the NDB API nodes not being able to handle more than about 300k lookups per second each, so with a maximum of 255 nodes we got to around 72 million per second in total. However, in MySQL Cluster 7.3 we have removed this limitation, and we have also enabled scaling to even bigger data nodes, so it should be possible now to reach even higher numbers.

The aim of this blog is however not to give any new benchmark results; rather it provides details about how the benchmark program works and how this benchmark program architecture can be used to design an efficient, scalable key lookup data store.

To obtain the best possible performance we want to ensure that the data node can operate as efficiently as possible. This is done by ensuring that a connection to the data node sends many key lookups bundled together. Operating on individual key lookups is of course possible, but as usual it is more efficient to operate on bigger entities than one key lookup at a time. To provide this we use a concept we call the Executor Thread. The Executor Thread will only execute key lookups aimed for a certain data node. So this means that the number of Executor Threads will be a multiple of the number of data nodes (there could be more than one thread per data node if necessary). The Executor Thread will receive key lookups from an internal queue handled by the application program (in our case the flexAsynch benchmark program). The key lookups are prepared by the Definer Threads. The Definer Thread will receive a key lookup aimed for any data node; it will take this key lookup and calculate the receiving data node for it (there are API calls in the NDB API to handle this). Based on this calculation the Definer Thread will put the key lookup in the queue of the proper Executor Thread.

The architecture in front of the Definer Thread depends on the application. In the figure provided here we have shown one possible architecture where we have one receive thread that receives a flow of messages from somewhere. To process those messages we need to interpret the packets, and this could entail one or more key lookups. In the figure we have assumed there is one key lookup per message and that the Executor Thread can format the packet back to the sender based on the information in the internal key lookup order.



So the important part of the architecture is the Executor Thread that handles messages to one data node based on an internal data structure that defines one key lookup and defines how to process the response (this thread should do as little work as possible to ensure it can focus on communication with the data node). There should also be a Definer Thread that prepares the key lookup request and puts the request in the queue of the proper Executor Thread. The Definer Thread could also do other things and there could be few or many Definer Threads in the architecture.

So how does flexAsynch work? In this case we don't have any input traffic; we generate the key lookups in the Definer Threads. The Definer Thread has a very simple operation. It starts by preparing a set of key lookups to any data node. For each of those key lookups it puts the request in the queue of the proper Executor Thread. After placing all requests in a queue, it starts waiting for all operations to complete. After all requests have received their response from the Executor Threads we simply continue with the next batch.

The operation of the Executor Thread is also very simple: it gets the current set of key lookups waiting in its queue and prepares those for execution through the NDB API. It sends off all the operations to the data node. When the data node has executed all of the operations, the Executor Thread reports the results back to the Definer Threads and updates some benchmark statistics, then it continues with the next batch of key lookups.
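To make the structure concrete, here is a minimal, self-contained sketch of the Definer/Executor pattern using standard C++ threads and queues. It is not the flexAsynch code: all names are made up, and the NDB API work is replaced by placeholders, since the point is only to show one queue per data node, batched hand-over and batched execution.

  #include <condition_variable>
  #include <cstddef>
  #include <cstdio>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  struct KeyLookup { unsigned long key; };      // one queued key lookup request
  static const unsigned long kStopKey = ~0UL;   // sentinel so the sketch can terminate

  class ExecutorQueue {                         // one queue per Executor Thread (per data node)
  public:
    void push(const KeyLookup &op) {
      std::lock_guard<std::mutex> lock(m_);
      q_.push(op);
      cv_.notify_one();
    }
    std::vector<KeyLookup> pop_batch() {        // drain everything queued so far as one batch
      std::unique_lock<std::mutex> lock(m_);
      cv_.wait(lock, [this] { return !q_.empty(); });
      std::vector<KeyLookup> batch;
      while (!q_.empty()) { batch.push_back(q_.front()); q_.pop(); }
      return batch;
    }
  private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<KeyLookup> q_;
  };

  // Placeholder: the real program prepares, sends and polls the batch through the NDB API here.
  static void execute_batch_on_data_node(int node, const std::vector<KeyLookup> &batch) {
    std::printf("node %d executed a batch of %zu key lookups\n", node, batch.size());
  }

  // Placeholder: flexAsynch uses NDB API calls to compute the owning data node;
  // a modulo is used here only to keep the sketch self-contained.
  static int compute_data_node(unsigned long key, int num_nodes) {
    return static_cast<int>(key % num_nodes);
  }

  static void executor_thread(int node, ExecutorQueue *queue) {
    for (;;) {
      std::vector<KeyLookup> batch = queue->pop_batch();
      bool stop = (batch.back().key == kStopKey);  // single producer, so the sentinel is last
      if (stop) batch.pop_back();
      if (!batch.empty()) execute_batch_on_data_node(node, batch);
      // ...report results back to the Definer Thread and update benchmark statistics...
      if (stop) return;
    }
  }

  static void definer_thread(std::vector<ExecutorQueue> *queues, int rounds, int batch_size) {
    unsigned long next_key = 0;
    for (int r = 0; r < rounds; r++) {
      for (int i = 0; i < batch_size; i++) {
        KeyLookup op = { next_key++ };
        int node = compute_data_node(op.key, static_cast<int>(queues->size()));
        (*queues)[node].push(op);                // hand over to the proper Executor Thread
      }
      // ...a real Definer Thread waits here until all responses of the batch have arrived...
    }
    for (std::size_t n = 0; n < queues->size(); n++) {  // tell each Executor Thread to finish
      KeyLookup stop = { kStopKey };
      (*queues)[n].push(stop);
    }
  }

  int main() {
    const int num_nodes = 2;                     // flexAsynch would use the real node count
    std::vector<ExecutorQueue> queues(num_nodes);
    std::vector<std::thread> executors;
    for (int n = 0; n < num_nodes; n++)
      executors.emplace_back(executor_thread, n, &queues[n]);
    std::thread definer(definer_thread, &queues, 10, 100);
    definer.join();
    for (auto &t : executors) t.join();
    return 0;
  }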

So the operation of an efficient key lookup data store is not difficult at all. To make it scale one can then add up to 48 data nodes per cluster (each capable of handling more than 5 million key lookups per second of around 100 bytes in size). Each cluster can handle a total of 255 nodes. Obviously it is also straightforward to operate more than one cluster to scale even further.
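To put rough numbers on that, under the figures above: 48 data nodes times 5 million key lookups per second is about 240 million key lookups per second for a single cluster, before even considering running multiple clusters.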

The benchmark code is found in storage/ndb/test/ndbapi/flexAsynch.cpp; the interesting code is in the NEW module (the file also contains a lot of legacy code for old variants of the flexAsynch benchmark).

Tuesday, November 19, 2013

MySQL Cluster run-time environment: Part 3: Configuration recommendations

Binding threads to CPUs in the MySQL Cluster data nodes can have great benefits. So what about hyperthreading, should we use all CPU threads, or only 1 CPU thread per CPU core? The recommendation actually differs. For the most part it is beneficial to use hyperthreading. For most thread types it gives about 40% higher performance with 2 CPU threads using hyperthreading compared to 1 CPU thread not using hyperthreading. There are a few cases where it might be beneficial not to use hyperthreading though.

The first example is the LDM threads. Using hyperthreading means we will increase the number of partitions of a table by a factor of two. In many cases this isn't beneficial, particularly when the number of LDM threads is high. I tried using 24 LDM threads with hyperthreading on 12 CPU cores and compared it to 12 LDM threads on 12 CPU cores. This case didn't benefit from using hyperthreading for the LDM threads. However, if the number of LDM threads is low it would probably still pay off, so going from 2 to 4 LDM threads is still beneficial, probably also going from 4 to 8. But going from 8 to 16 is less likely to be beneficial.

I have tested send and recv threads with and without hyperthreading, and my benchmarks have always improved by using hyperthreading. I've seen the same with tc threads. Obviously, if the main thread or the rep thread for some reason becomes the major bottleneck, then it makes sense to remove the use of hyperthreading there.

Avoiding hyperthreading can be done simply by not configuring any threads to use the second CPU on each of the CPU cores we want to avoid hyperthreading on. As an example we will configure a machine with one data node; the machine has 4 sockets with 24 cores and 48 CPU threads. CPUs 0-5 represent thread 0 on cores 0-5 of socket 0, and CPUs 24-29 represent thread 1 on the same cores. So if we want to configure with 12 LDM threads not using hyperthreading we could use the config:

ThreadConfig="ldm={count=12,cpubind=0-11},tc={count=4,cpuset=12-17,36-41},send={count=4,cpuset=12-17,36-41},recv={count=4,cpuset=12-17,36-41},io={cpuset=18-19,42-43},main={count=1,cpuset=18-19,42-43},rep={count=1,cpuset=18-19,42-43},wd={cpuset=18-19,42-43}"

In this configuration the LDM threads will use CPUs 0-11, which covers all cores on sockets 0 and 1. No other thread is configured to use any CPU thread on those CPU sockets. The OS might still decide to use some of them, but we have left a number of empty CPUs that the OS hopefully discovers as idle and uses to schedule OS activities. We can actually do even better than this: there is a boot config variable in Linux whereby one can restrict which CPUs the OS uses for general scheduling. Similarly there is a config variable for irqbalance to ensure interrupts are not scheduled on any CPU used by the MySQL Cluster data node. The tc, send and recv threads are scheduled on any of the CPU threads on socket 2, and the main, rep, wd and io threads are using 2 cores on socket 3. The OS and other processes aren't blocked from using other CPUs, but will most likely be scheduled on any of those free CPUs with no activity on them.

The first question when specifying the MySQL Cluster run-time environment is how many ldm threads to use. This is dependent on how many CPUs are accessible to the data nodes. So assume we have access to 24 CPU threads with hyperthreading. In this case it would be natural to start with the number of ldm threads set to 6 and not use hyperthreading for them; a natural initial start is to use half of the available CPU cores for ldm threads. Next one assigns about a quarter of the ldm CPU resources to tc threads, in this case landing at 3 tc threads. Next one assigns a similar number of CPUs to send and recv threads. Then one assigns the rest of the CPUs to the main, rep, io and wd threads. This should also give a fair amount of resources to the OS.
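Purely as an illustration of this heuristic, and assuming a hypothetical CPU numbering where CPUs 0-11 are the first hyperthread of each of the 12 cores and CPUs 12-23 the second (the real numbering must be checked with /proc/cpuinfo), the starting point could look something like:

ThreadConfig="ldm={count=6,cpubind=0-5},tc={count=3,cpuset=6-8,18-20},send={count=2,cpuset=6-8,18-20},recv={count=2,cpuset=6-8,18-20},main={count=1,cpuset=9-10,21-22},rep={count=1,cpuset=9-10,21-22},io={cpuset=9-10,21-22},wd={cpuset=9-10,21-22}"

Here the ldm threads get cores 0-5 with their hyperthread siblings (CPUs 12-17) deliberately left unused, tc, send and recv share a three-core cpuset, main, rep, io and wd share two more cores, and CPUs 11 and 23 are left for the OS. Remember to also set NoOfFragmentLogParts=6 to match the six ldm threads.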

After this initial assignment it is a good idea to run some benchmark which is close to the workload of your application to see whether the config works well. For this test run one should use cpubind for all thread types to ensure that we know how much CPU resources each thread type consumes (this can easily be derived by looking at top in per-CPU load mode when using cpubind). First check whether the ldm threads are a bottleneck; if they are, then check if it is possible to increase to the next level. In this example this would mean going to 8 ldm threads, then using 2/3 of the CPU resources. If this isn't possible then just make sure that the rest of the thread types have a fair amount of CPU resources to avoid any unneeded bottlenecks.

If the bottleneck isn't the ldm threads, then assign more resources to the bottlenecked thread type, in most cases by removing resources from other non-ldm threads. There could be cases where less than half of the CPU resources are needed by the ldm threads, but I would deem those very unusual. Given that the actual database processing is done in the ldm threads, it would be an exceptional case if other threads consumed more than half of the resources.

Always remember to update the NoOfFragmentLogParts variable if changing the number of ldm threads.

After a few trials we have most likely found a decent configuration. After finding this configuration we can also consider where to use cpuset and if any threads should use the realtime or spintime variables.

So the next question is when to use cpubind, cpuset, realtime and spintime.

Both cpubind and cpuset are about locking threads to an individual CPU or a set of CPUs. We could actually consider even using no cpubind and cpuset as a form of CPU locking: we are locking the threads to the set of CPUs available to the OS. Given that we might be running in a virtualised OS, this might already be a subset of the existing CPUs. To make any configuration of CPUs using cpubind/cpuset one has to have knowledge of how CPU ids map to CPU sockets and CPU cores.

So the default configuration, not using any cpubind/cpuset, is to allow the OS to schedule the threads onto any of the available CPUs. The OS scheduler is optimised towards an interactive environment where processes need to react to human interaction. It also does a good job in server environments where a fairly high number of threads compete for a small number of CPUs, and it does a decent job of handling server environments where there are only a handful of threads competing for a number of CPUs.

The type of threads that normal OS schedulers have the most problems handling are long-running threads that consume a lot of CPU resources, in particular threads that run more or less constantly. What happens is that the OS scheduler eventually downgrades their priority such that other processes are given a chance to execute; since the thread still wants to execute, the OS then searches for a free CPU to use. This is often successful, the problem however is that it means we migrate the thread onto a new CPU. This happens many hundreds of times per second for those busy threads. The effect is that the thread arrives at a new CPU where it has no data or instructions cached in the CPU caches. So a migrated thread will spend quite a lot of time warming up the CPU caches before it can run as efficiently as it did before the migration. In addition this requires more bandwidth on the memory bus, which in some cases can become a bottleneck.

In the above case it is better to stay on the same CPU even if a new job is scheduled, at least if we can be sure that the new process isn't yet another long-running thread. So this means that in order to optimise the MySQL Cluster run-time environment we need to be in control of all usage of the CPUs on the machine we're using. In many cases we colocate data nodes and MySQL Server processes. We can control placement of MySQL Server processes using numactl or taskset, applied either when starting the process or via the pid when the process is already started. This ensures that the MySQL Server process is never scheduled outside the set of CPUs we gave it access to through taskset/numactl. This is how I control the environment when running any MySQL Cluster benchmark. Similarly I also control the benchmark processes (sysbench/dbt2 client processes/flexAsynch...).

In this manner I am certain that no application process is using the CPUs provided for execution of the data node threads. So the only threads that will execute on those CPUs are either OS kernel threads or interrupt handling. Even this can be controlled by using the special boot option isolcpus, which lists the cpu numbers that the OS should isolate from normal scheduling. When this option is set, only the CPUs not listed are handled by the normal scheduler with the possibility to migrate threads between them; the isolated CPUs are only used if an application asks for them explicitly through a locking call such as cpuset/cpubind or taskset/numactl. The final thing to control is the execution of interrupts. This is normally handled by the irqbalance process, and it can be configured to avoid a set of CPUs specified by the bitmap in the irqbalance configuration variable IRQBALANCE_BANNED_CPUS. So it is possible to create a completely compartmentalised load with interrupts on certain CPUs, the OS and other applications on another set of CPUs, a set of CPUs for the data node, a set of CPUs for the MySQL Server and a set of CPUs for any other CPU-consuming application process. Providing this is a combination of boot options, irqbalance configuration, MySQL Cluster configuration and finally using taskset/numactl on certain processes.
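As an illustration only (the CPU numbers and file locations are examples, not a recommendation), reserving CPUs 0-11 for the data node and keeping everything else on CPUs 12-23 could look something like this:

Kernel boot command line addition: isolcpus=0-11
irqbalance configuration (e.g. /etc/sysconfig/irqbalance or /etc/default/irqbalance): IRQBALANCE_BANNED_CPUS=00000fff
Starting a colocated MySQL Server on the remaining CPUs: numactl --physcpubind=12-23 mysqld ... (or taskset -c 12-23 mysqld ...)
Restricting an already running process: taskset -pc 12-23 <pid>

The ThreadConfig cpubind/cpuset values would then point the data node threads at the isolated CPUs 0-11.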

So when to use cpubind and when to use cpuset? cpubind should mainly be used on threads that can consume up to 100% of a CPU. This would normally be the ldm threads, but can definitely be the case also for other threads depending on the workload and the configuration. So one way of configuring MySQL Cluster environments is to use cpubind for ldm threads and put the tc, send and recv threads in a common cpuset. The nice thing with this configuration is that we can run with more threads than strictly necessary, so e.g. we can run with 4 tc threads, 4 send threads and 4 recv threads and put these into a cpuset consisting of e.g. 8 CPUs. In this manner the OS can easily handle it if a certain thread type requires more resources for a while.
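A sketch of such a configuration, with illustrative CPU numbers (8 ldm threads bound to their own CPUs, and 4+4+4 tc/send/recv threads sharing an 8-CPU cpuset), could look like:

ThreadConfig="ldm={count=8,cpubind=0-7},tc={count=4,cpuset=8-15},send={count=4,cpuset=8-15},recv={count=4,cpuset=8-15},main={count=1,cpuset=16-17},rep={count=1,cpuset=16-17},io={cpuset=16-17},wd={cpuset=16-17}"

With 8 ldm threads, NoOfFragmentLogParts should then be set to 8 as well.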

So configuring a MySQL Cluster run-time environment is about making use of static scheduling for the most critical resources and using the flexibility of OS scheduling to handle the less critical resources. A good design philosophy for all environments like this is to design the run-time environment with a well-known bottleneck. For MySQL Cluster data nodes we recommend always making the ldm threads the bottleneck. The nice thing about this is that it makes it easier to understand how to handle overload and to reason about it. As an example, if the ldm threads are overloaded and the tc threads have available resources, we can ensure that the tc threads handle sending error messages about overload without even contacting the ldm threads. This can be achieved by decreasing the available resources in ldm threads through configuration, at least for scan operations.

Hopefully these recommendations will help you find a configuration that is both optimal and safe. Default configurations will work for most installations, but to get the last 10-100% of performance out of the system one might need to dive a bit deeper into the configuration of the MySQL Cluster run-time environment.

The next configuration item to consider is the realtime setting. This can now be set on each thread type. Traffic queries normally arrive in the recv thread, are executed by the tc or ldm threads, and the reply is then sent through the send thread. The main thread is mainly involved in meta-data operations, which rarely are time-critical. The rep thread can have a high load, but it is not part of any critical path except for asynchronous replication to other clusters. The io thread is only involved in time-critical operations if disk data is used in MySQL Cluster. The wd threads are different; obviously it is important that the watchdog thread gets an opportunity to execute every now and then, so if other threads are using realtime it's a good idea to use realtime also on the wd thread type. The recv, send and tc threads are usually doing small jobs, and thus realtime scheduling might be beneficial for those threads to cut the response time. For ldm threads that execute close to 100% of the time it's debatable whether realtime is such a good idea. The OS cannot handle threads executing at realtime priority for a long time, so we have implemented protection for this in the data nodes: we decrease the priority to normal user priority even for realtime threads if they execute for too long. There are rarely any throughput benefits from using the realtime configuration; it is mainly intended to enable less variation in response time.

Another use case for realtime is when there is a mix of data node threads and other application threads, where the application threads have lower priority. In this case we ensure that the data node threads get prioritised access to CPUs before the other application threads do.

The final configuration item to consider is the spintime. This means that the data node thread will execute for a bit longer before entering sleep mode, so if the recv thread or any other thread sends a new message to the thread during this spintime, we decrease the wake-up time. The only case where this can increase throughput is if the spinning thread is the bottleneck of the data node and the spinning doesn't steal resources from any other critical thread. The main usage of spintime is as a tool to improve response time in an environment with plentiful CPU resources. It is important to consider that the use of spintime will increase the CPU usage of those threads it is set on. It only applies to the ldm, tc, main, rep, send and recv thread types. One should in most cases avoid mixing spintime and realtime settings on the same thread.

MySQL Cluster run-time environment: Part 2: How to configure

The first selection in choosing the run-time environment is done by selecting the binary. The original NDB run-time environment was a single-threaded environment where we executed everything except IO threads in a single thread. This environment actually still exists and is selected by running with the binary ndbd. The new multithreaded environment is selected by running with the binary ndbmtd. ndbmtd can clearly scale far beyond ndbd, but ndbd can still have advantages in environments which have a low load and need optimisation for latency. Since ndbd does socket receive, signal execution and socket send in the same thread, it can have shorter latency at the cost of scalability. Most of the rest of this description discusses ndbmtd configuration. The ThreadConfig variable doesn't apply to running with ndbd since ndbd uses a hard-coded thread configuration consisting of 1 main thread.

When configuring the MySQL Cluster run-time environment the most important variable is ThreadConfig. One can alternatively also use MaxNoOfExecutionThreads and LockExecuteThreadToCPU, as we will show below. There is also LockMaintThreadsToCPU that can be used to bind IO threads to a specific CPU; it is now recommended to do this locking of io threads through ThreadConfig instead, since it has more options for how to do it. RealtimeScheduler and SchedulerSpinTimer can be used to set real-time scheduling and spin time on threads. If ThreadConfig is also used then these variables are only used as default settings that can be overridden by the ThreadConfig setting. SchedulerExecutionTimer is not applicable when running with ndbmtd; it is only applicable when running with ndbd.

LockPagesInMemory is part of the MySQL Cluster run-time environment and can have a heavy impact on response times. Setting this hasn't changed, and its setting is independent of the rest of the settings discussed in this blog.

NoOfFragmentLogParts is an important variable that should be set equal to the number of ldm threads if the number of ldm threads is larger than 4; it cannot be set to anything smaller than 4. One can set NoOfFragmentLogParts larger than the number of ldm threads, but there is no advantage to this that I can think of. It can be set to the following values: 4, 6, 8, 12, 16, 24, 32.

The types of threads we have are the ldm, tc, main, rep, send, recv, io and wd threads. These were discussed in part 1 of this blog. main and rep have a fixed number of threads equal to one. The wd threads are always three with specific roles, however from a configuration point of view the number of wd threads is one. The ldm, tc, send and recv threads have a configurable number of threads. There can be 1, 2, 4, 6, 8, 12, 16, 24 or 32 ldm threads, anywhere from 1 to 32 tc threads, anywhere from 1 to 16 send threads and likewise from 1 to 16 recv threads. The number of ldm, tc, send, recv, main, rep and wd threads is fixed once the configuration is given; when the ndbmtd process is started these threads are started and not stopped until the process is stopped. The io threads are different: they are handled by a dynamic pool of threads.

The io thread pool is controlled by two variables: InitialNoOfOpenFiles, which gives the initial number of io threads (one thread handles one open file), and MaxNoOfOpenFiles, which specifies the maximum number of io threads that will be created. DiskIoThreadPool can be used to set the number of parallel accesses to disk data files; it doesn't affect the number of io threads, it only means that more than one thread at a time can access a disk data file.
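For illustration, these parameters go into the data node section of config.ini; the values below are just examples, not recommendations:

[ndbd default]
InitialNoOfOpenFiles=27
MaxNoOfOpenFiles=40
DiskIoThreadPool=8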

The preferred manner of configuring most of this environment is by using the ThreadConfig variable. This variable accepts a string and the string is parsed. I will start by giving an example.

ThreadConfig="ldm={count=4,cpubind=0-3,spintime=50},tc={count=1,cpubind=4,realtime=1},main={count=1},rep={count=1},send={count=1,cpubind=5},recv={count=1,cpubind=6},io={cpubind=7},wd={cpubind=7}"

In this example we see that for each thread type we can specify the following information: count, which is the number of threads of this type. count must be 1 for main and rep and also for io and wd. For io and wd threads the count is actually ignored, since io threads are handled with a dynamic pool and the wd threads are 3 specific threads (the WatchDog, SocketServer and SocketClient threads).

cpubind can be given for all thread types. cpubind is a comma-separated list of CPUs where one can also use '-' to indicate a range of CPUs, so e.g. 0-3 is equivalent to 0,1,2,3. Threads are assigned to CPUs in order, so for the ldm's in the above config the first ldm thread will be bound to CPU 0, the second ldm thread to CPU 1, the next to CPU 2 and the last ldm thread to CPU 3. The CPU numbering here is the ordering imposed by the OS. On Linux one can get details on this by running 'cat /proc/cpuinfo'; this will provide exact information about the CPU socket the CPU belongs to, the CPU core the CPU belongs to and which of the hyperthreads it runs on, and it will also list a bunch of other things. One must have at least one CPU per thread assigned here, otherwise an error will occur; having more CPUs specified is ok, but only the first ones provided will actually be used. cpubind always means that the thread can only execute on the specified CPU.

Setting realtime=1 means that the thread will be using the real-time scheduler of the OS. To protect the OS from starvation we will yield in this case if we execute for too long with real-time priority, and allow other threads to get some work done as well. As can be seen from the example we can use the real-time setting on a subset of the threads. The global configuration variable RealtimeScheduler is still used and sets the default value for a thread, so if this is set to 1, then we need to set realtime=0 on a thread to ensure we don't get real-time scheduling on it. Use of the real-time scheduler is mainly intended to lower the variance of the latency. Using it isn't likely to improve performance except possibly in a highly loaded system where other processes do not require real-time scheduling. Given that the OS expects real-time threads to execute for a short time with high requirements on short response times, it is not recommended to mix this configuration option with spintime on the same thread.

Setting spintime=50 means that we will spend at least 50 microseconds executing signals and waiting for more signals before putting the thread to sleep again. This parameter is also mainly intended to improve latency; by keeping the threads ready to go, we can react faster to a query. We cannot set the spintime higher than 500 microseconds. Setting spintime will consume considerably more CPU resources and provides only a tiny bit of extra throughput.

In the latest version of MySQL Cluster 7.x we also introduced the ability to bind threads to a set of CPUs. Each CPU can only belong to one CPU set. On Solaris a cpu set is exclusive to the data node; on Linux it only determines where the MySQL Cluster data node threads run, and other threads can still be scheduled on the same CPUs.

As an example we could use the following config:
ThreadConfig="ldm={count=4,cpuset=0-3},tc={count=1,cpuset=4-7},main={count=1,cpuset=4-7},rep={count=1,cpuset=4-7},send={count=1,cpuset=4-7},recv={count=1,cpuset=4-7},io={cpuset=4-7},wd={cpuset=4-7}"

In this example we have the same number of threads as in the previous example, but we have divided the threads into two CPU sets. The first CPU set covers CPUs 0-3 and is only used by the 4 LDM threads. The other CPU set covers CPUs 4-7 and takes care of the remaining threads in the data node. In this manner we can arbitrarily configure scheduling domains. cpuset has the advantage that the OS gets some liberty to do dynamic scheduling; the default config with no CPU sets or CPU bindings means that we have one CPU set consisting of all CPUs. By using cpuset and cpubind we can mix usage of the OS scheduler and our own fixed scheduling. Usually we want the threads with the most load to have their own CPUs; for other threads that have more variable load it makes sense to use cpusets such that the OS can dynamically schedule the threads dependent on the current load of the system.

The thread types that have a variable number of threads are currently the ldm, tc, send and recv threads. In 7.3.3 we have extended the support of threads such that we can have 32 ldm threads, 32 tc threads, 16 send threads and 16 recv threads. tc, send and recv threads can have an arbitrary number, and for these it would also work to change the ThreadConfig variable, at least after an initial node restart. However the number of LDM threads has more restrictions: first of all it cannot be changed other than by starting up a completely new cluster, and we only allow for the following numbers of LDM threads: 1, 2, 4, 6, 8, 12, 16, 24 and 32. A very important part of the config which is easy to forget is that if we increase the number of LDM threads beyond 4, then we also need to set NoOfFragmentLogParts to at least the same as the number of LDM threads. Normally one would simply set this variable to the same value as the number of LDM threads.

A final note on the usage of ThreadConfig is that one can even divide one thread type into several groups, so one could write ldm={cpubind=0},ldm={cpubind=1},ldm={cpubind=2},ldm={cpubind=3} as equivalent to ldm={count=4,cpubind=0-3}.

If one wants to avoid discovering the best run-time configuration, one can also still use the MaxNoOfExecutionThreads variable together with LockExecuteThreadToCPU to get a simpler, but less flexible, way to configure the run-time environment. The number of ldm, tc, send and recv threads is in this case dependent on MaxNoOfExecutionThreads through a lookup in a table found in the code (there is a program to generate this table).
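As an example of the simpler style, the following config.ini settings (values purely illustrative) combined with the table below would give 8 ldm, 3 tc, 1 send and 2 recv threads, with the execution threads locked to CPUs 0-15:

[ndbd default]
MaxNoOfExecutionThreads=16
LockExecuteThreadToCPU=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
NoOfFragmentLogParts=8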

Here is the table (it is found in mt_thr_config.cpp, together with the program generating it). Each row in the table has five fields: the first one is the value of MaxNoOfExecutionThreads, and the next four are the number of ldm threads, tc threads, send threads and recv threads respectively.

  static const struct entry
  {
    Uint32 M;
    Uint32 lqh;
    Uint32 tc;
    Uint32 send;
    Uint32 recv;
  } table[] = {
    { 9, 4, 2, 0, 1 },
    { 10, 4, 2, 1, 1 },
    { 11, 4, 3, 1, 1 },
    { 12, 6, 2, 1, 1 },
    { 13, 6, 3, 1, 1 },
    { 14, 6, 3, 1, 2 },
    { 15, 6, 3, 2, 2 },
    { 16, 8, 3, 1, 2 },
    { 17, 8, 4, 1, 2 },
    { 18, 8, 4, 2, 2 },
    { 19, 8, 5, 2, 2 },
    { 20, 8, 5, 2, 3 },
    { 21, 8, 5, 3, 3 },
    { 22, 8, 6, 3, 3 },
    { 23, 8, 7, 3, 3 },
    { 24, 12, 5, 2, 3 },
    { 25, 12, 6, 2, 3 },
    { 26, 12, 6, 3, 3 },
    { 27, 12, 7, 3, 3 },
    { 28, 12, 7, 3, 4 },
    { 29, 12, 8, 3, 4 },
    { 30, 12, 8, 4, 4 },
    { 31, 12, 9, 4, 4 },
    { 32, 16, 8, 3, 3 },
    { 33, 16, 8, 3, 4 },
    { 34, 16, 8, 4, 4 },
    { 35, 16, 9, 4, 4 },
    { 36, 16, 10, 4, 4 },
    { 37, 16, 10, 4, 5 },
    { 38, 16, 11, 4, 5 },
    { 39, 16, 11, 5, 5 },
    { 40, 16, 12, 5, 5 },
    { 41, 16, 12, 5, 6 },
    { 42, 16, 13, 5, 6 },
    { 43, 16, 13, 6, 6 },
    { 44, 16, 14, 6, 6 },
    { 45, 16, 14, 6, 7 },
    { 46, 16, 15, 6, 7 },
    { 47, 16, 15, 7, 7 },
    { 48, 24, 12, 5, 5 },
    { 49, 24, 12, 5, 6 },
    { 50, 24, 13, 5, 6 },
    { 51, 24, 13, 6, 6 },
    { 52, 24, 14, 6, 6 },
    { 53, 24, 14, 6, 7 },
    { 54, 24, 15, 6, 7 },
    { 55, 24, 15, 7, 7 },
    { 56, 24, 16, 7, 7 },
    { 57, 24, 16, 7, 8 },
    { 58, 24, 17, 7, 8 },
    { 59, 24, 17, 8, 8 },
    { 60, 24, 18, 8, 8 },
    { 61, 24, 18, 8, 9 },
    { 62, 24, 19, 8, 9 },
    { 63, 24, 19, 9, 9 },
    { 64, 32, 16, 7, 7 },
    { 65, 32, 16, 7, 8 },
    { 66, 32, 17, 7, 8 },
    { 67, 32, 17, 8, 8 },
    { 68, 32, 18, 8, 8 },
    { 69, 32, 18, 8, 9 },
    { 70, 32, 19, 8, 9 },
    { 71, 32, 20, 8, 9 },
    { 72, 32, 20, 8, 10}