Tuesday, March 25, 2008

Speeding up MySQL by 36% on the T2000

This post focuses on the performance tuning work that we have been doing since December 2007 on the Sun T2000 server. We got a nice speedup of 36% with fairly little effort, and we have good hope that we can improve performance a great deal more. This work is part of a new effort at MySQL to improve performance on both Solaris and Linux platforms, and to some extent on Windows as well. This report focuses on the T2000 running Solaris.

T1000 and T2000 are the first CoolThreads servers from Sun with the UltraSPARC T1 processor. The T1 is very energy efficient, which is extremely important to modern datacenters. On the other hand, leveraging the massive amount of thread-level parallelism (32 concurrent threads) provided by the CoolThreads servers is the key to getting good performance. As the CoolThreads servers are used by many Sun customers to run web-facing workloads, making sure that MySQL runs well on this platform is important to Sun and MySQL customers, and also to the success of the CoolThreads servers and MySQL.

Note: This work was started long before it was known that MySQL was to be acquired by Sun Microsystems. The actual tuning work was done by Rayson in the performance team at MySQL.

The workload that we used was sysbench, which is a very simple benchmark. In particular, we only ran read-only OLTP sysbench to perform this tuning work. The reasoning is that if MySQL does not scale well with a simple read-only OLTP workload, then it will not scale well with more complex workloads either, while a more complex workload would need more time to set up and run.

This is a list of things that we tried.

1) Hardware setup and software versions used
The compiler version:
> cc -V
cc: Sun C 5.9 SunOS_sparc Build47_dlight 2007/05/22
usage: cc [ options] files. Use 'cc -flags' for details

Solaris version:
> cat /etc/release
Solaris 10 11/06 s10s_u3wos_10 SPARC
Copyright 2006 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 14 November 2006

For each run, 5 results were collected; we discarded the best and the worst results and averaged the remaining 3. sysbench was invoked as follows:
> ./sysbench --test=oltp --num-threads=32 --max-time=60 --max-requests=0 --oltp-read-only=on run
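The averaging rule described above (drop the best and the worst of five results, average the remaining three) can be sketched as follows. This is only an illustration in Python, not the script used for these measurements; the function name and the sample values are made up.

```python
def trimmed_mean(results):
    """Average the results after dropping the single lowest and single highest."""
    if len(results) < 3:
        raise ValueError("need at least three results")
    ordered = sorted(results)
    kept = ordered[1:-1]  # discard the worst (lowest) and best (highest) value
    return sum(kept) / len(kept)


# Made-up throughputs from five runs, for illustration only.
sample_runs = [98.0, 120.0, 100.0, 104.0, 102.0]
print(trimmed_mean(sample_runs))  # 102.0
```

Dropping the extremes keeps a single unusually fast or slow run from skewing the reported number.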

Using the default configuration of MySQL 5.0.45 and read-only OLTP sysbench 0.4.8 on a Sun T2000 running at 1GHz, the throughput measured was 1209 transactions per second.

2) Compiling with -fast
Since the workload is CPU intensive with very few I/O operations, we knew that compiler optimizations would be very beneficial to performance. As Sun used the -fast flag for compiling other CPU intensive benchmarks (e.g. SPEC CPU), using -fast was the first thing we tried; this was done by setting CFLAGS and CXXFLAGS to -fast before we ran the configure script.

The throughput measured was 1241 transactions per second, or an improvement of 2.6%.

3) Fixing headers for Sun Studio
As using a higher optimization level gave us a small but nice improvement, we then looked for other opportunities from compiler optimizations. The first thing we noticed was that there were compiler directives that were not recognized by Sun Studio, and that inlining was disabled as well.

As the Sun Studio compiler supports inlining, we enabled it in InnoDB by modifying the header file: univ.i

The throughput went up 3.1% to 1279 transactions per second.

We also enabled prefetching by using "sparc_prefetch_read_many()" and "sparc_prefetch_write_many()". This actually caused a small performance degradation: the throughput decreased by 0.47% to 1273 transactions per second. We do enable prefetching on Linux when gcc is used as the build compiler, so we believe the difference is that the Niagara has enough MLP (Memory Level Parallelism) that it does not need much help from prefetching. However, we will see if this could benefit other SPARC servers (UltraSPARC IV+ and SPARC64 come to mind), or x64 servers running Solaris (when Sun Studio is used as the build compiler).

4) Locks in MySQL
We then used plockstat to locate contended mutex locks in the system. Surprisingly, memory management in libc accounted for a lot of the lock contention. Since the default malloc/free is not optimized for threaded applications, we switched to mtmalloc. mtmalloc can be used without recompiling or relinking: we simply set the LD_PRELOAD environment variable in the shell that was used to start the MySQL server to interpose malloc/free calls.

> setenv LD_PRELOAD /usr/lib/libmtmalloc.so

The gain was 8.1% to 1376 transactions per second.

5) Caching Memory Inside MySQL
After we switched to mtmalloc, we still found memory allocation and free patterns that were not efficient. We modified the code so that memory is cached inside MySQL instead of being repeatedly allocated and freed. The idea is to trade memory usage for performance. However, since most memory allocator implementations cache memory freed by the application instead of returning it to the operating system, having MySQL cache the memory would not only speed up the code but would also have no impact on memory usage.

Using DTrace, we found that there were over 20 places where malloc and free were called repeatedly. We picked one of the hot spots and modified the code.

This change gave us a gain of 1.5% to 1396 transactions per second.
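To make the caching idea concrete, here is a minimal sketch of the pattern: released buffers are kept on a free list and handed out again instead of going back to the allocator. It is written in Python purely for illustration; the actual change was made in the MySQL server source and is not shown in this post, and the class name, buffer size and cache limit are all hypothetical.

```python
class BufferCache:
    """Keep released buffers for reuse instead of allocating a new one each time."""

    def __init__(self, buffer_size, max_cached=32):
        self.buffer_size = buffer_size
        self.max_cached = max_cached
        self._free = []        # buffers released by callers, ready for reuse
        self.allocations = 0   # number of times a fresh buffer had to be created

    def acquire(self):
        if self._free:
            return self._free.pop()  # reuse a cached buffer (contents are stale)
        self.allocations += 1
        return bytearray(self.buffer_size)

    def release(self, buf):
        if len(self._free) < self.max_cached:
            self._free.append(buf)
        # otherwise the buffer is dropped and reclaimed by the runtime


cache = BufferCache(buffer_size=4096)
for _ in range(1000):
    buf = cache.acquire()
    buf[0] = 1          # stand-in for real work on the buffer
    cache.release(buf)
print(cache.allocations)  # 1
```

In a threaded server, a cache like this would need to be per-thread or protected by a lock, otherwise it could reintroduce the kind of contention the change is meant to avoid.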

6) Using Largepages
Using largepages on the UltraSPARC T1 platform can be beneficial to performance, as the TLBs in the T1 processor are shared by the 32 hardware threads.

We used the environment variable MPSSHEAP to tell the operating system that we wanted to use largepages for the memory heap:

> setenv LD_PRELOAD mpss.so.1
> setenv MPSSHEAP 4M

This change gave us a gain of 4.2% in throughput to 1455 transactions per second.
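One detail worth noting: in csh, a second setenv LD_PRELOAD replaces the earlier value, so keeping mtmalloc interposed together with mpss.so.1 requires listing both libraries in one variable. The post does not say how the two were combined, so the following is only a sketch under that assumption. It builds such an environment in Python, relying on the Solaris runtime linker accepting a whitespace-separated LD_PRELOAD list; the helper name is hypothetical.

```python
import os


def preload_env(libraries, extra=None, base=None):
    """Return a copy of the environment with LD_PRELOAD listing all libraries."""
    env = dict(os.environ if base is None else base)
    env["LD_PRELOAD"] = " ".join(libraries)  # whitespace-separated list
    env.update(extra or {})
    return env


# Assumption: both libraries are meant to be active at the same time.
env = preload_env(
    ["/usr/lib/libmtmalloc.so", "mpss.so.1"],
    extra={"MPSSHEAP": "4M"},
    base={},
)
print(env["LD_PRELOAD"])
```

The resulting dictionary could then be passed as the env argument of subprocess.Popen when starting the server.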

7) Removing strdup() calls
Later on, we also found that there was an unnecessary strdup/free pattern in the code in mf_cache.c. Since the character string was not modified in the code, we removed the strdup call and simply passed the pointer to the string instead.

This change gave us a gain of 0.34% to 1460 transactions per second.

8) Feedback Profile and Link Time Optimizations
We then compiled the MySQL server with feedback profile compiler optimization and link time optimization. We trained MySQL in a training run and then recompiled, so that the compiler could use the information (execution behavior) collected during the training run. The compiler flags used: -xipo, -xprofile, -xlinkopt, -fast

The combination of the compiler flags gave us a gain of 10.5% to 1614 transactions per second.

9) Configuration File Tuning
While tuning values in the configuration file is the most common way to get higher performance for MySQL, we did not spend a lot of time on it. The reason is that we were more interested in finding the bottlenecks in the code. Nevertheless, we did use a few flags:

> cat my.cnf

And the final throughput was 1649 transactions per second.

10) Things That Did Not Work as Expected
We also tried to use atomic instructions and ISM (Intimate Shared Memory), but neither of them gave us a performance improvement.

Conclusion (for now)
This was the initial work done to optimize MySQL on the Sun CoolThreads platform, and we got 36% better throughput than the default installation. As MySQL is now part of Sun, I expect that working with Sun engineers will allow MySQL to get even better performance and throughput.
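As a quick arithmetic check, the per-step gains and the overall 36% can be recomputed from the throughputs reported in the sections above. The short step labels below are shorthand for those sections; each gain is measured against the previous figure in the list, which is how the percentages quoted in this post work out (for example, the 8.1% for mtmalloc is relative to the 1273 measured with prefetching enabled).

```python
# Throughputs (transactions per second) as reported in this post, in order.
steps = [
    ("baseline", 1209),
    ("-fast", 1241),
    ("inlining", 1279),
    ("prefetching", 1273),
    ("mtmalloc", 1376),
    ("memory caching", 1396),
    ("largepages", 1455),
    ("strdup removal", 1460),
    ("feedback profile + link time opt", 1614),
    ("my.cnf tuning", 1649),
]


def percent_gain(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def step_gains(steps):
    """Gain of each step relative to the measurement just before it."""
    return [(name, percent_gain(prev, cur))
            for (_, prev), (name, cur) in zip(steps, steps[1:])]


for name, gain in step_gains(steps):
    print(f"{name:34s} {gain:+.2f}%")
print(f"overall: {percent_gain(steps[0][1], steps[-1][1]):.1f}%")  # overall: 36.4%
```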

Currently, caching memory inside MySQL looks promising. We got a 1.5% improvement by modifying only one place inside MySQL. Since there are quite a few places where we could apply this optimization, there is still room for further performance improvement!

Finally, I should mention that some of the optimizations above also improved MySQL on x64 Linux and Solaris. I will update everyone here in the near future. :-)

1 comment:

Unknown said...

Mikael, I noticed some improvements just with compiler differences on both platforms (see my post here on comparing speeds for 32-bit/64-bit, although I didn't use my T1000 for the tests).

Also, if you haven't already seen Krish's comments on x86/64 compiler options, they're worth a read.