There are two major achievements in the development of MySQL Cluster 7.3 that have led to significant scalability improvements. The first is the move to MySQL 5.6, which removes the infamous LOCK_open bottleneck that limited the scalability of MySQL Cluster 7.2 (based on MySQL 5.5) and all earlier versions. The second is a set of scalability improvements in the NDB API. For sysbench we can now process more than three times as many transactions per NDB API connection. In 7.3 it is also possible to do the receive processing in separate threads that can be locked to specific CPUs, which enables even higher throughput.
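To make this concrete, these capabilities are controlled from the MySQL Server side through a few mysqld options. The fragment below is a minimal sketch with illustrative values, not the exact settings used in this benchmark (the full benchmark configuration is posted in the comments at the end):

[mysqld]
ndb-cluster-connection-pool=4             # several NDB API connections per MySQL Server
ndb-recv-thread-activation-threshold=1    # hand receive processing over to dedicated receive threads early
ndb-recv-thread-cpu-mask=0x0F             # lock the receive threads to specific CPUs (hexadecimal CPU bitmask)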
When comparing MySQL Cluster 7.3 with MySQL Cluster 7.2 using sysbench, the NDB API scalability limitation can be overcome in 7.2 simply by using many more NDB API connections. The limitation imposed by LOCK_open, however, cannot be overcome; for sysbench this means that we can scale to about 40 CPU threads, and beyond that there is no additional gain from having more CPUs. At such a high load it is also a challenge to handle the traffic in the data nodes. What we discover is that the main bottleneck lies in the local data manager threads (LDM threads). It turns out that for this particular thread type it does not pay off to use hyperthreading, so the best results are achieved with 16 LDM threads that do not use hyperthreading. The problem is that when we add hyperthreading we also have to increase the number of LDM threads to achieve the same performance; with sysbench this means more partitions and thus more work to process. So this is a rare case for MySQL where hyperthreading does not pay off. Using this setup, MySQL Cluster 7.2 reaches 7096 TPS for Sysbench RW and 9371 TPS for Sysbench RO.
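The data node thread setup is expressed with the ThreadConfig parameter in the cluster configuration. The line below is taken from the benchmark configuration posted in the comments at the end; the cpubind values place the 16 LDM threads on their own CPUs so that they avoid sharing cores through hyperthreading, in line with the observation above:

ThreadConfig=io={cpubind=90},ldm={count=16,cpubind=24-39},tc={count=2,cpubind=40,88},send={count=3,cpubind=41,89,90},recv={count=1,cpubind=42},main={count=1,cpubind=90},rep={count=1,cpubind=90}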
When we take this configuration to MySQL Cluster 7.3 we can easily increase the performance, since the LOCK_open bottleneck is now removed. In fact, with the machine I have access to (a standard x86/Linux box with 8 sockets and a total of 96 CPU threads on 48 cores) I can no longer get the MySQL Server to become a bottleneck for this type of standard sysbench test. For point selects I can still reach this bottleneck, since MDL locking still has some limitations in MySQL 5.6. As can be seen from the MySQL 5.7.2 DMR, this problem is also going away, so there will be even fewer obstacles to scaling further.
So with MySQL Cluster 7.3 we are actually limited by the hardware of a 96 CPU thread box and not by any software limitation. With this box we are able to reach 8932 TPS on Sysbench RW and 11327 TPS on Sysbench RO. At this load we are using 8 CPU threads for the benchmark, 20 cores for the data node and the remaining 48 CPU threads for the MySQL Server. In order to increase throughput on this machine we simply have to make better use of the hardware resources at hand. With MySQL Cluster 7.3 we have made it possible to scale data nodes all the way to 32 LDM threads, so if we had access to a sufficiently big box we would be able to scale MySQL Cluster 7.3 performance even further. It is likely that we could scale to about 70-80 CPU threads for the MySQL Server, which would require about 30 cores for the data node.
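The CPU partitioning is achieved by binding each process to its own set of CPUs. The commands below are a sketch of what such a binding can look like with numactl, under the assumption that the benchmark scripts turn the SERVER_CPUS and BENCHMARK_CPUS settings shown in the configuration in the comments into invocations along these lines; the data node CPUs are bound via ThreadConfig as shown above:

numactl --physcpubind=0-3,48-51 --localalloc sysbench ...                  # benchmark client CPUs, local memory
numactl --physcpubind=4-23,52-71,46-47,94-95 --interleave=all mysqld ...   # MySQL Server CPUs, interleaved memory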
What we see is that we are quickly reaching a point where MySQL Cluster scalability goes beyond what most customers need, so it is important to also start focusing on CPU efficiency. To understand where the CPU time is spent we make good use of the Linux perf tool, which can pinpoint problems with branch prediction, tight loops and cache misses, and gives us an idea of where software prefetching can improve code efficiency. Since we program in C++, the compiler can often be assisted by introducing local variables, but one has to take care that those local variables are not spilled to the stack, in which case they only harm the efficiency of the program. Our initial experiments with increasing efficiency in the MySQL Cluster data nodes using this approach have been very successful.
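As an illustration, a typical perf workflow on the data node looks roughly like the following. This is a sketch rather than the exact commands used in this work; the event list is an example, and it assumes a single multithreaded ndbmtd data node process:

perf stat -e cycles,instructions,branch-misses,cache-misses -p $(pidof ndbmtd) -- sleep 10
perf record -p $(pidof ndbmtd) -- sleep 10    # sample the data node under load
perf report                                   # find the hottest functions
perf annotate                                 # inspect the generated code for branch and cache problems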
NOTE: All of the above benchmarks were run on one machine using a single MySQL Server and a single data node. Obviously MySQL Cluster is also capable of scaling in the number of nodes in addition to scaling on one machine; in this benchmark, however, we have focused on how far we can scale MySQL Cluster on a single machine.
Hi, Mikael!
Could you share your cluster settings and a description of how you measured?
Thanks, Michael
I use the latest version of my dbt2-0.37.50 scripts (I will publish this version soon). Below I have pasted the configuration file this uses; hopefully this brings some light. Based on this config file I generate the NDB config file, the MySQL Server startup command line and the other info needed to start the various processes, including the benchmark program sysbench itself. The scripts mentioned above are published in an earlier version and are available at http://dev.mysql.com/downloads/benchmarks.html.
BENCHMARK_TO_RUN="sysbench"
TARBALL_DIR="/export/home/mronstrom/bench/tarballs"
USE_BINARY_MYSQL_TARBALL="no"
MYSQL_VERSION="mysql-cluster-gpl-7.3.3"
SYSBENCH_VERSION="sysbench-0.4.12.6"
DBT2_VERSION="dbt2-0.37.50.5"
SERVER_HOST="127.0.0.1"
SERVER_PORT="3316"
DATA_DIR_BASE="/export/home/mronstrom/ndb"
WINDOWS_REMOTE="no"
BUILD_REMOTE="no"
REMOTE_BIN_INSTALL_DIR="/export/home/mronstrom/bench/mysql"
REMOTE_SRC_INSTALL_DIR="/export/home/mronstrom/bench/src_build"
ENGINE="ndb"
MYSQL_BASE="5.6"
TEST_DESCRIPTION="mysql-cluster-gpl-7.3.3"
#Compiler parameters
STATIC_LINKING_FLAG="no"
CORE_FILE_USED="yes"
DEBUG_FLAG="yes"
#Use high optimisation level and optimize for CPU used
USE_FAST_MYSQL="yes"
PERFSCHEMA_FLAG="no"
#Mandatory parameters if NDB used
NDB_MGMD_NODES="127.0.0.1"
NDBD_NODES="127.0.0.1"
#Optional parameters for NDB config
NDB_DISKLESS="yes"
#Set NDB_MULTI_CONNECTION to use multiple connects to data nodes from MySQL Server
#or APIs, set to number of connects per MySQL Server/API.
NDB_MULTI_CONNECTION="6"
NDB_FORCE_SEND="yes"
NDB_REALTIME_SCHEDULER="no"
NDB_SCHEDULER_SPIN_TIMER="0"
ENGINE_CONDITION_PUSHDOWN_OPTION="no"
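#Receive thread settings for the NDB API: when to activate the dedicated receive threads
#and which CPUs to lock them to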
NDB_RECV_THREAD_ACTIVATION_THRESHOLD="1"
NDB_RECV_THREAD_CPU_MASK="43-45,91-93"
NDB_REPLICAS="1"
NDB_DATA_MEMORY="5G"
NDB_INDEX_MEMORY="2G"
DISK_CHECKPOINT_SPEED="15M"
NDB_NO_OF_FRAGMENT_LOG_PARTS="16"
NDB_SEND_BUFFER_MEMORY="12M"
NDB_MAX_NO_OF_CONCURRENT_OPERATIONS="2M"
NDB_NO_OF_FRAGMENT_LOG_FILES="100"
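#Data node thread setup: 16 LDM threads bound to their own cores plus tc/send/recv/main/rep/io threads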
NDB_THREAD_CONFIG="io={cpubind=90},ldm={count=16,cpubind=24-39},tc={count=2,cpubind=40,88},send={count=3,cpubind=41,89,90},recv={count=1,cpubind=42},main={count=1,cpubind=90},rep={count=1,cpubind=90}"
#Optional iClaustron parameters
USE_MALLOC_LIB="yes"
MALLOC_LIB="/usr/local/lib/libjemalloc.so.1"
TABLE_CACHE_SIZE="4000"
TABLE_CACHE_INSTANCES="16"
#Optional Sysbench parameters (only used when running sysbench)
RUN_RW="yes"
RUN_RO="yes"
SB_NUM_PARTITIONS="16"
SB_NUM_TABLES="8"
SYSBENCH_ROWS="500000"
THREAD_COUNTS_TO_RUN="64;128;256;312;448;512;640"
MAX_TIME="60"
NUM_TEST_RUNS="1"
TASKSET="numactl"
SERVER_CPUS="4-23,52-71,46-47,94-95"
SERVER_BIND="0-3,7"
SERVER_MEM_POLICY="interleaved"
BENCHMARK_CPUS="0-3,48-51"
BENCHMARK_BIND="0"
BENCHMARK_MEM_POLICY="local"
#BETWEEN_RUNS="25"
AFTER_INITIAL_RUN="10"
AFTER_SERVER_START="30"
It's a condensed version of the config file; I removed most of the comments that showed default values and the longer explanatory comments. It wasn't possible to publish the entire file in a comment.