Tuesday, December 21, 2010

MySQL Server and NUMA architectures

When you run MySQL on a large NUMA box it is possible to control memory placement and the use of CPUs through numactl. Most modern servers are NUMA boxes nowadays.

numactl works with a concept called NUMA nodes. One NUMA node contains CPUs and memory, where all the CPUs can access the memory in this NUMA node with equal delay. Accessing memory in a different NUMA node, however, is typically slower, often by 50% or even 100% compared to the local NUMA node. One NUMA node is typically one chip with a memory bus shared by all CPU cores in the chip, and there can be multiple chips in one socket.
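One can inspect the NUMA layout of a machine with numactl itself:

numactl --hardware

This prints which CPUs and how much memory belong to each NUMA node, and at the end a distance matrix giving the relative cost of reaching each node's memory from every other node (the local node is listed as 10, remote nodes with higher numbers).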

With numactl the default policy is to allocate memory from the NUMA node that the CPU running the allocating thread is connected to. There is also an option to interleave memory allocations across the different NUMA nodes of the machine by using the interleave option.
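As a minimal sketch of the two policies (the mysqld arguments are elided here):

numactl --localalloc mysqld ...
numactl --interleave=all mysqld ...

--localalloc is the default policy, so the first line mostly makes the default explicit, while --interleave=all spreads the allocations page by page across all NUMA nodes.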

Memory allocation actually happens in two steps. The first step is the call to malloc. This invokes a library linked with your application; this could be e.g. the libc library or a library containing tcmalloc, jemalloc or some other malloc implementation. The malloc implementation is very important for the performance of the MySQL Server, but in most cases the malloc library doesn't control the placement of the allocated memory.
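As an illustration, a different malloc implementation can typically be swapped in without recompiling the server by preloading it; the library path below is just an assumption and differs between distributions:

LD_PRELOAD=/usr/lib/libtcmalloc.so numactl --interleave=all mysqld ...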

The allocation of physical memory happens when the memory area is touched, either for the first time or when a page fault happens after the memory has been swapped out. This is the moment a page is assigned to the NUMA node it is actually going to live on. To control how the Linux OS decides on this placement one can use the numactl program.
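One way to see where the pages of a running MySQL Server actually ended up is the numa_maps file in /proc:

cat /proc/$(pidof mysqld)/numa_maps

Each line describes one memory area of the process, with N&lt;node&gt;=&lt;pages&gt; counters telling how many pages of that area currently live on each NUMA node.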

numactl provides options to decide whether to use interleaved or local memory allocation. The problem with local memory is easy to see if we consider that the first thing that happens in the MySQL Server is InnoDB recovery. This recovery is single-threaded and will thus cause a large piece of the buffer pool memory to be attached to the NUMA node where the recovery thread happened to run. Using interleaved allocation means that we get a better spread of the memory allocation.

We can also use the interleave option to specify which NUMA nodes the memory should be taken from. Thus the interleave option acts both as a way of binding the MySQL Server's memory to certain NUMA nodes and as a way of interleaving the allocations across all the NUMA nodes the server is bound to.

Finally, numactl also provides the ability to bind the MySQL Server to specific CPUs in the computer, either by binding to NUMA nodes or by binding to individual CPU cores.
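Both variants are available directly in numactl; the node and core numbers here are just examples:

numactl --cpunodebind=2,3 mysqld ...
numactl --physcpubind=8-15 mysqld ...

--cpunodebind takes NUMA node numbers and binds to all CPUs in those nodes, while --physcpubind takes individual CPU core numbers as the OS sees them.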

So e.g. on a machine with 8 NUMA nodes one might start the MySQL Server like this:
numactl --interleave=2-7 --cpunodebind=2-7 mysqld ....
This leaves NUMA nodes 0 and 1 free, so e.g. a benchmark program running on the same machine can use them without interfering with the MySQL Server. If we want to use the normal local memory allocation it should more or less be sufficient to remove the interleave option: since we have bound the MySQL Server to NUMA nodes 2-7, there is very slim risk that the memory is allocated elsewhere. We could however also use --membind=2-7 to ensure that the memory allocation happens on the desired NUMA nodes.
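The full command line with explicit memory binding would then look like:

numactl --membind=2-7 --cpunodebind=2-7 mysqld ...

Note that --membind is strict: if the chosen nodes run out of memory the allocation fails rather than spilling over to nodes 0 and 1.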

So how effective is numactl compared to e.g. using taskset? From a benchmark performance point of view there is not much difference, unless memory becomes very unbalanced through a long recovery at the start of the MySQL Server. Since taskset binds the server to certain CPU cores, with the default local allocation policy it effectively also binds the memory to the NUMA nodes of those CPUs.
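For comparison, the equivalent taskset invocation binds to CPU core numbers rather than NUMA nodes, so the core list (just an example here) has to be matched to the machine's topology by hand, e.g. using the output of numactl --hardware:

taskset -c 16-47 mysqld ...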

However, binding to a subset of the NUMA nodes or CPUs in the computer is definitely a good idea. On a large NUMA box one can gain at least 10% performance by locking the MySQL Server to a subset of the machine compared to allowing it to freely use the entire machine.

Binding the MySQL Server also improves the stability of the performance. Binding to certain CPUs can also be an instrument in ensuring that different applications running on the same computer don't interfere with each other. Naturally this can also be done by using virtual machines.