Wednesday, March 26, 2008

Visited Hadoop Conference

NOTE: Any comments in this blog entry are based on my personal thoughts after visiting the Hadoop conference and don't represent any current plans within MySQL.

I visited the Hadoop conference today, which was a very interesting event. The room was filled to its limit, with people standing for lack of chairs; probably around 300 people attended.

It was interesting to see the wide scope of web-scale problems that can be attacked using Hadoop. The major disruptive feature of Hadoop is its MapReduce approach to solving parallel data analysis problems.

One thing I started thinking about was how one could introduce MapReduce into SQL. A presentation of Hive showed an interesting approach to this problem. I thought a bit about how one could integrate a MapReduce solution into MySQL; there are certainly a lot of problems to solve, but I got a few interesting ideas.

The concept of being able to query both business data stored in a database and web logs and other types of massive data sets is certainly an interesting problem to consider.

In principle, what one adds by introducing MapReduce into MySQL is the ability to handle streaming queries: queries that use dataflows as their input table(s) and a dataflow as their output table.

However, the actual implementations of Hadoop and HBase were still very much in their infancy, so availability and reliability were far from always-on, and performance wasn't yet a focus.

Tuesday, March 25, 2008

MySQL Architecture Workshop

We had a workshop in Stockholm in early March to discuss what can be done to innovate MySQL in a number of areas. Most of the work here will not result in usable code for a year or two, and it will take a lot longer before it is used in Enterprise Ready binaries. Obviously there is no guarantee that this early work will reach production binaries. This work is part of an aim to advance the MySQL architecture over the next few years.

One interesting topic we discussed was Pushdown of Query Fragments to Storage Engine.

A Query Fragment is a piece of an SQL query; for example, in a 3-way join, any join of 2 of its tables is a Query Fragment, and so is the full query. As part of this interface, the storage engine can decide to perform its own optimisation using a new interface, or it can rely on the MySQL Server to handle the optimisation. If the storage engine decides it can handle the Query Fragment, and the optimiser decides to use it, then the Query Fragment will be executed through the traditional Storage Engine API as if it were a normal table.
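To make the offer/accept flow concrete, here is a minimal sketch in C of what such hooks could look like. Every name below is hypothetical; this is not an existing MySQL interface, just an illustration of the idea described above.

/* Hypothetical Query Fragment pushdown hooks; none of these names
   exist in the MySQL source, they only illustrate the design. */

typedef struct query_fragment QUERY_FRAGMENT; /* opaque join subtree */

struct fragment_pushdown_hooks
{
  /* Called during optimisation: return an estimated cost if the
     engine can execute this fragment itself, or a negative value
     to let the MySQL optimiser handle it in the normal way. */
  double (*offer_fragment)(QUERY_FRAGMENT *fragment);

  /* Called at execution time for an accepted fragment; its result
     rows are then read back through the traditional Storage Engine
     API, as if the fragment were a normal table. */
  int (*open_fragment)(QUERY_FRAGMENT *fragment);
};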

There are many engines that could make use of this new interface. Another interesting use is to implement parallel query support in the MySQL Server on top of it. We hope to build a prototype of this sometime this year.

Please provide comments on this development here on the blog; the development is in such an early phase that input is very welcome.

Visiting Internal Sun Technology Conference

My first real chance to meet my new colleagues at Sun was an internal Sun technology conference. It was interesting to listen to what's cooking within Sun.

We got a presentation on data centers and their impact on the environment, which immediately got me thinking about how we could interact with power-saving functions from the MySQL code. It was also interesting to see slides on how computer architecture is developing; this feeds into thinking about how the MySQL architecture should progress over the next few years.

Visiting Family History Technology Workshop at BYU

On the 13th of March I attended an interesting workshop on technology for genealogy. My interest in this is based on an interest in genealogy itself (my family tree currently contains about 3,000 persons from various parts of Sweden, going back to some farmers in northern Sweden born around 1400) and my interest in technology, in particular how MySQL and MySQL Cluster can be used for genealogy applications. Being LDS myself also adds to my interest in the subject.

The LDS church has developed a Web API, the FamilySearch API, through which genealogists can use their genealogy software to work on a common database, adding and editing information about our ancestors. The system currently contains 2.2 PB of data and is going to grow significantly as images and more genealogy information are added.

There were quite a few interesting discussions on how to link the source information (scanned images of historical documents), the transcribed information from those sources, and the derived family trees. The most complex problem in this application is the fuzziness of the base data, and the fact that different genealogists can have many different opinions about how to interpret it. Thus, to solve the problem, one somehow has to model the quality of genealogists.

From a database point of view, this application requires a huge system with large clusters of information. One part is the base data (the scanned images), typically stored in a large clustered file system containing many petabytes of data. The derived data is smaller, but given that all versions need to be stored it will still be a really huge data set; it is a fairly traditional relational database with large numbers of relations between the data.

So what I take home from the workshop are ideas on what MySQL and MySQL Cluster should support 3-5 years from now to be able to serve applications like this one.

Speeding up MySQL by 36% on the T2000

This post will focus on the performance tuning work we've been doing since December 2007 on the Sun T2000 server. We got a nice speedup of 36% with fairly small effort, and we have good hopes that we can improve performance a great deal more. This effort is part of a new push at MySQL to improve performance on the Solaris and Linux platforms, and to some extent Windows as well. This report focuses on the T2000 using Solaris.

T1000 and T2000 are the first CoolThreads servers from Sun with the UltraSPARC T1 processors. The T1 is very energy efficient, which is extremely important to modern datacenters. On the other hand, leveraging the massive amount of thread-level parallelism (32 concurrent threads) provided by the CoolThreads servers is the key to getting good performance. As the CoolThreads servers are used by many Sun customers to run web facing workloads, making sure that MySQL runs well on this platform is important to Sun and MySQL customers, and also to the success of the CoolThreads servers and MySQL.

Note: This work was started long before it was known that MySQL was to be acquired by Sun Microsystems. The actual work done for this tuning was done by Rayson in the performance team at MySQL.

The workload that we used was sysbench, which is a very simple benchmark. In particular, we only ran read-only OLTP sysbench for this tuning work. The reasoning is that if MySQL does not scale well with a simple read-only OLTP workload, it will not scale well with more complex workloads either, while a more complex workload would need more time to set up and run.

This is a list of things that we tried.

1) Hardware setup and software versions used
============================================
The compiler version:
> cc -V
cc: Sun C 5.9 SunOS_sparc Build47_dlight 2007/05/22
usage: cc [ options] files. Use 'cc -flags' for details

Solaris version:
> cat /etc/release
Solaris 10 11/06 s10s_u3wos_10 SPARC
Copyright 2006 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 14 November 2006

For each data point, 5 results were collected; we discarded the best and the worst results and averaged the remaining 3. sysbench was invoked as follows:
> ./sysbench --test=oltp --num-threads=32 --max-time=60 --max-requests=0 --oltp-read-only=on run

Using default configuration of MySQL 5.0.45 and read-only OLTP sysbench 0.4.8 on a Sun T2000 running at 1GHz, the throughput measured was 1209 transactions per second.

2) Compiling with -fast
=======================
Since the workload is CPU intensive with very few I/O operations, we knew that compiler optimizations would be very beneficial to performance. As Sun used the -fast flag for compiling other CPU intensive benchmarks (e.g. SPEC CPU), using -fast was the first thing we tried; this was done by setting CFLAGS and CXXFLAGS to -fast before we ran the configure script.

The throughput measured was 1241 transactions per second, or an improvement of 2.6%.

3) Fixing headers for Sun Studio
================================
As using a higher optimization level gave us a small but nice improvement, we then looked for other opportunities in compiler optimizations. The first thing we noticed was that some compiler directives were not recognized by Sun Studio, and that inlining was disabled as well.

As the Sun Studio compiler supports inlining, we enabled it in InnoDB by modifying the header file univ.i.
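The exact patch isn't reproduced here, but the shape of the change was roughly the following sketch (treat the details as an approximation of what univ.i needed, not the literal diff):

/* Approximate sketch: the inline definition was effectively only
   enabled for gcc, so Sun Studio builds got ordinary out-of-line
   functions. Recognising Sun Studio's predefined macros turns
   inlining back on. */
#if !defined(UNIV_MUST_NOT_INLINE)
# if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#  define UNIV_INLINE static inline
# endif
#endif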

The throughput went up 3.1% to 1279 transactions per second.

We also enabled prefetching, using "sparc_prefetch_read_many()" and "sparc_prefetch_write_many()". This in fact caused a small performance degradation: throughput decreased by 0.47% to 1273 transactions per second. Since we do enable prefetching on Linux when gcc is the build compiler, we believe that the Niagara extracts enough MLP (Memory Level Parallelism) on its own and does not need much help from prefetching. However, we will see whether prefetching could benefit other SPARC servers (UltraSPARC IV+ and SPARC64 come to mind), or x64 servers running Solaris (when Sun Studio is the build compiler).
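For illustration, this is the kind of call site where such hints go (the loop and the header name are illustrative assumptions, not the actual MySQL code):

#include <stddef.h>
#include <sun_prefetch.h> /* Sun Studio prefetch intrinsics */

static long sum_rows(const long *rows, size_t n)
{
  long total = 0;
  size_t i;

  for (i = 0; i < n; i++) {
    /* Hint that data a few cache lines ahead will soon be read;
       on a processor that already extracts enough memory-level
       parallelism this can be a wash or even a slight loss,
       as we measured. */
    sparc_prefetch_read_many((void *) &rows[i + 16]);
    total += rows[i];
  }
  return total;
}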

4) Locks in MySQL
=================
We then used plockstat to locate contended mutex locks in the system. Surprisingly, memory management in libc accounted for a lot of the lock contention. Since the default malloc/free is not optimized for threaded applications, we switched to mtmalloc. mtmalloc can be used without recompiling or relinking: we simply set the LD_PRELOAD environment variable in the shell used to start the MySQL server, interposing the malloc/free calls.

> setenv LD_PRELOAD /usr/lib/libmtmalloc.so

The gain was 8.1% to 1376 transactions per second.

5) Caching Memory Inside MySQL
==============================
After we switched to mtmalloc, we still found memory allocation and free patterns that were not efficient. We modified the code so that memory is cached inside MySQL instead of being repeatedly allocated and freed. The idea is to trade memory usage for performance; but since most allocator implementations already cache memory freed by the application instead of returning it to the operating system, caching inside MySQL not only speeds up the code but also has little real impact on memory usage.

Using DTrace, we found that there were over 20 places where malloc and free were called repeatedly. We picked one of the hot spots and modified the code.
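As an illustration of the idea, here is a minimal, single-threaded sketch in C (not the actual MySQL patch; a real multi-threaded server would need a lock or per-thread lists, and the block size is a hypothetical hot allocation size):

#include <stdlib.h>

#define BLOCK_SIZE 512 /* hypothetical hot allocation size */

struct cached_block { struct cached_block *next; };

static struct cached_block *free_list = NULL;

/* Reuse a cached block if one is available, else fall back to malloc. */
static void *block_get(void)
{
  if (free_list != NULL) {
    struct cached_block *b = free_list;
    free_list = b->next;
    return b;
  }
  return malloc(BLOCK_SIZE);
}

/* Cache the block for later reuse instead of handing it back to the
   allocator on every call. */
static void block_put(void *p)
{
  struct cached_block *b = (struct cached_block *) p;
  b->next = free_list;
  free_list = b;
}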

The change above gave us a 1.5% gain, to 1396 transactions per second.

6) Using Largepages
===================
Using large pages on the UltraSPARC T1 platform can be beneficial to performance, as the TLBs in the T1 processor are shared by all 32 hardware threads.

We used the environment variable MPSSHEAP to tell the operating system that we wanted to use large pages for the memory heap:

> setenv LD_PRELOAD mpss.so.1
> setenv MPSSHEAP 4M

This change gave us a gain of 4.2% in throughput to 1455 transactions per second.

7) Removing strdup() calls
==========================
Later on, we also found that there was an unnecessary strdup/free pattern in the code in mf_cache.c. Since the character string was not modified in the code, we removed the strdup call and simply passed the pointer to the string instead.
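The pattern looked roughly like this (the function names are illustrative, not the actual mf_cache.c code):

#include <stdlib.h>
#include <string.h>

static size_t use_name(const char *name) { return strlen(name); }

/* Before: every call duplicated the string just to read it. */
static size_t lookup_before(const char *name)
{
  char *copy = strdup(name); /* unnecessary copy */
  size_t r = use_name(copy);
  free(copy);
  return r;
}

/* After: the callee never modifies the string, so pass it directly. */
static size_t lookup_after(const char *name)
{
  return use_name(name);
}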

This change gave us a gain of 0.34% to 1460 transactions per second.

8) Feedback Profile and Link Time Optimizations
===============================================
We then compiled the MySQL server with profile feedback and link-time optimizations. We first trained MySQL in a training run, and then recompiled so that the compiler could use the information (execution behavior) collected during the training run. The compiler flags used were -xipo, -xprofile, -xlinkopt and -fast.

The combination of the compiler flags gave us a gain of 10.5% to 1614 transactions per second.

9) Configuration File Tuning
============================
While tuning values in the configuration file is the most common way to get higher performance from MySQL, we did not spend a lot of time on it. The reason is that we were more interested in finding the bottlenecks in the code. Nevertheless, we did set a few variables:

> cat my.cnf
[server]
query_cache_size=0
innodb_thread_concurrency=0
innodb_buffer_pool_size=100M
innodb_additional_mem_pool_size=20M

And the final throughput was 1649 transactions per second.

10) Things That Did Not Work as Expected
========================================
We also tried using atomic instructions and ISM (Intimate Shared Memory), but neither of them gave us a performance improvement.


Conclusion (for now)
====================
This was the initial work done to optimize MySQL on the Sun CoolThreads platform, and we got 36% better throughput than the default installation. As MySQL is now part of Sun, I expect that working with Sun engineers will allow MySQL to reach even better performance and throughput.

Currently, caching memory inside MySQL looks promising. We got a 1.5% improvement by modifying only one place inside MySQL. Since there are quite a few places where we could apply this optimization, there is still room for further performance improvement!

Finally, I should mention that some of the optimizations above also improved MySQL on x64 Linux and Solaris. I will update everyone here in the near future. :-)

Performance Guide for MySQL Cluster @ MySQL Users Conference

A new MySQL Users Conference is coming up. MySQL was recently acquired by Sun Microsystems, and thus innovation will happen at an even faster rate than before. The Users Conference will contain a lot of interesting presentations on how to develop your MySQL applications. So come to Santa Clara on 15-17 April to take part in the development and discuss with many MySQLers how MySQL will be developed in the next few years. I've prepared a set of blogs that I will publish over the next few days to give you an idea of what's cooking within MySQL, and I hope some of them can persuade you to come there and give your opinion on where future development should be heading.

Personally, I'll contribute a talk at the MySQL Users Conference on what to think about when building a high-performance application based on MySQL Cluster. MySQL Cluster technology has matured over the last few years and is being used in more and more application categories. I even visited a conference on Family History Technology at BYU where I bumped into Matt Garner from FindMyPast; he told me how they had used MySQL Cluster for their data mining application and sustained a continuous flow of 75,000 queries per second.

In my talk I'm planning to cover how partitioning your application data can improve performance, how the use of cluster interconnects can improve response times by as much as a factor of 8, when to use the native NDB APIs and when to use SQL, and how to use some recently developed features.

For a few years, MySQL Cluster development was very focused on building a feature set for the telecom space; over the last year, development has started focusing more on general features, to ensure improved performance on complex SQL queries as well. We are also developing improved support for computers with high numbers of cores and execution threads (e.g. the Niagara processors from Sun), along with a number of other performance improvements.

The talk will be very much focused on how you, as an application developer, can make use of the enormous performance capabilities MySQL Cluster provides. I also hope to present some impressive benchmark numbers using a large cluster Intel has made available for our use.