Wednesday, May 15, 2013

Year of Anniversaries

Many people write blogs and emails about various anniversaries. I had one myself when I started writing this blog on the 2nd of May, celebrating 23 years since my start date according to my employment contract. The 2nd of May 1990 was the first day I worked at Ericsson, where I started working on databases and where NDB Cluster, which later grew into MySQL Cluster, was born.

Actually this is a year of anniversaries for me. I started by becoming half a century old a few months ago, then I celebrated a quarter of a century as a member of the LDS church, my wife and I just celebrated our 25th wedding anniversary, and in September I will celebrate 10 years of working with MySQL together with the other people that joined MySQL from Ericsson 10 years ago.

So I thought this was an appropriate time to write down a little history of my career, particularly focused on how it relates to MySQL and MySQL Cluster. The purpose of this isn't to publish any new technical facts; the purpose is rather to provide some input to those interested in the history of software development. I am personally very interested in history and always enjoy telling a good story, particularly when it is a true story. I think however that the most important reason to write this blog is to help young software engineers understand both the challenges they will face and that the path to success sometimes, or even most of the time, is a road that contains many road blocks and even complete changes of direction.

I will start the story at the university, where I studied mathematical statistics. I learned to enjoy hard work and challenges when I somewhat by chance found myself studying at a university while still attending the last year of gymnasium (similar to the last year of high school) and also working half-time. This taught me to work hard and that overcoming challenges is quite fun (a lot more fun than easy work) when you come out on top.

I remember that when I started working after university I was looking at two options: either work with data communication or work with databases. I chose to work with data communication! This meant that I learned a lot about low-level details, working on X.25 protocols, SNA protocols and often programming in obscure assembler code. This has been a very important competence all through my career since it means that I have always had a good understanding of the performance aspects of the programs I have analysed and written.

In 1989 I had some personal inspiration that I would study for a Ph.D. in mathematics and databases. I didn't have any possibility to start this immediately. However, when I started working at Ericsson on the 2nd of May 1990 in their education department I could quickly extend my competence by learning object-oriented programming, C++, telecommunication, advanced system testing and many other things. In March 1991 the opportunity to start learning about databases appeared as we were to develop a new course on an internal Ericsson database. Later the same year I saw an internal job offering at the Ericsson systems department, working with system aspects of this internal Ericsson database. So in 1991 my career in databases got off to a start. In late 1991 I heard of a large research project about UMTS (what is today called 3G) in which Ericsson participated. I went to my manager and asked him if I could spend a third of my time over the next few years on this project. He accepted, and a few months later he also accepted that I went up to half my time on this research project. My manager Martin Hänström and my department manager Göran Ahlforn and their vision of future telecom databases were crucial to the development of MySQL Cluster, although they were never at any time involved in its actual development.

This project was called RACE MONET and involved hundreds of researchers in Europe, and I learned a lot about network databases and the requirements they would face over the next few decades. This meant that I could start my Ph.D. studies in mathematics and databases. I spent a few years reading database theory and understanding more about future telecommunication systems. I developed a few high-level models for how to build a distributed database, but it took a few years to come up with the model now used in MySQL Cluster data nodes. In those days I actually focused so much on reading and thinking that I didn't even see a need to have a computer on my desk.

I had some interesting collaboration with a Norwegian team of database researchers, which meant that we were very close to joining forces in 1994 on a project to develop a distributed database. We did eventually join forces 14 years later when MySQL was acquired by Sun and the ClustRa team (ClustRa had been acquired by Sun a few years earlier) joined the MySQL team as the Sun database team. Today this team is an important part of all areas of MySQL development, but through history we have had many occasions of both collaborating and competing in the marketplace.

After studying a lot for 3-4 years I had a very interesting period in 1995 and 1996 where all my studying went into the idea phase. Through understanding of both the requirements on a telecom database and database theory I had a period of constant innovation: a new idea popped up more than once per week, and many of them came to me while I was out running.

With all these ideas I had the theoretical foundation of NDB Cluster, which later turned into MySQL Cluster. Not a single line of code had yet been written, but most of the high-level algorithms were written down in my Ph.D. thesis, which I eventually finalised in 1998.

Now the time had come to actually start developing NDB Cluster. I remember driving by a new building that we were supposed to move into a few months later and telling a friend that in this new building my 5th kid, NDB, would be born. In the beginning I had no resources, but at Ericsson it was easy to hire summer students and master thesis students. So over a few years I worked with 30-40 summer students and 10 master thesis students. Most of these students focused on special areas and prototyped some idea. This period was extremely important since many of the early ideas weren't so bright and the student projects proved when we were on the wrong path. I remember especially a prototype of the communication protocol to the database; the prototype used BER encoding and the protocol code alone was a few thousand lines of code. It was clearly not the right path and we went for a much more fixed protocol, which evolved through six different versions in a short time to end up where it is today. The NDB protocol is entirely based on 32-bit unsigned integers and most data is in fixed positions, with a few exceptions where bits are used to shave off a few words from the message.
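To give a feel for what a fixed-format protocol of this kind looks like, here is a small illustration. The message name and fields below are hypothetical and only sketch the general idea of laying out a request as fixed 32-bit words; they are not the actual NDB signal definitions.

```cpp
#include <cstdint>

// Hypothetical example of a fixed-format request laid out as 32-bit words.
// Every field sits at a known word position, so the receiver only needs to
// read words at fixed offsets instead of parsing a variable encoding.
struct KeyReadRequest {
  uint32_t senderRef;    // word 0: identifies the sender (block/connection)
  uint32_t tableId;      // word 1: table to read from
  uint32_t transId;      // word 2: transaction identifier
  uint32_t keyLen;       // word 3: number of key words that follow
  uint32_t keyData[4];   // words 4..7: primary key, padded to a fixed size
};

static_assert(sizeof(KeyReadRequest) == 8 * sizeof(uint32_t),
              "the message is exactly eight 32-bit words");
```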

The first steps in the NDB development were completely intertwined with the other half of my work at Ericsson. This work entailed building a virtual machine for the APZ processor, an Ericsson-developed CPU that powered the AXE switches, which were a very important part of the Ericsson offering. In 1998 we developed a prototype of this virtual machine which was faster than any other APZ processor available at the time. This project was eventually brought into real products, although personally I haven't done anything in this area for the last 10 years. This intertwining of projects was important to get funding for the first phases of the NDB development, and in the second half of 1997 we started up a real development project for NDB Cluster. The aim of the project was to build a prototype together with an operator to demonstrate low-latency communication towards an external database from AXE, thus providing easy maintenance of various telecom services. The intertwining happened since NDB was executed in this virtual machine. There are still remnants of this virtual machine in the NDB code in the form of signals between blocks.

We ran this project at full speed for one year with up to 30 people involved. This project brought NDB from version 0.1 to version 0.3, and much of the code for version 0.4 was also written. The first NDB API was developed here based on some ideas from the telecom operator, as were much of the transaction protocols and data structures. The project was brought to a successful conclusion. There was however no continuation of the project since Ericsson decided to focus their resources on another internal database project. So for most of 1999 the project continued as a skunk-work project. For about half a year I was the only person involved in the project and spent a lot of time fixing various bugs, mostly in the hash index part, and I also replaced the hash function with the MD5 cryptographic hash function.

The NDB API is based on a very simple record access model that makes it possible to access individual rows, scan tables and scan using an ordered index. Most relational databases have a similar functional split in their software. Since my research into requirements on telecom databases showed that almost all queries are very simple queries that are part of the traffic protocols, it seemed like a very good idea to make this interface visible to the application developers. Research into other application areas such as genealogy, email servers and charging applications showed similar needs.
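To give an idea of what this record-level model looks like from an application, here is a minimal sketch of a primary key read using the NDB API. The table and column names are made up for the example, and error handling is cut down to the bare minimum.

```cpp
#include <NdbApi.hpp>

// Minimal sketch: read one row by primary key through the NDB API.
// A table "subscriber" with columns "id" and "balance" is assumed to exist.
int read_balance(Ndb_cluster_connection &conn, unsigned int id)
{
  Ndb ndb(&conn, "test");            // database name assumed to be "test"
  if (ndb.init() != 0) return -1;

  NdbTransaction *trans = ndb.startTransaction();
  if (trans == nullptr) return -1;

  NdbOperation *op = trans->getNdbOperation("subscriber");
  op->readTuple(NdbOperation::LM_Read);   // shared read lock
  op->equal("id", id);                    // primary key lookup
  NdbRecAttr *balance = op->getValue("balance", nullptr);

  int res = trans->execute(NdbTransaction::Commit);
  int value = (res == 0) ? balance->int32_value() : -1;
  ndb.closeTransaction(trans);
  return value;
}
```

The point of the model is visible even in this tiny sketch: the application talks in terms of rows and keys rather than SQL, which is exactly what the simple queries of telecom traffic protocols need.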

Today MySQL Cluster still benefits from this record-level API, which makes it easy to integrate MySQL Cluster also with big data applications. It has also made it easy to build other types of interfaces to the data nodes, such as LDAP interfaces, Java interfaces and so forth.

Eventually I wrote an email to the Ericsson CEO. The email was focused on the use of NDB for a web cache server that we prototyped together with the Ericsson IT department. The email was sent back to my organisation, which quickly ignored it, but the CEO also sent the email on to a new Ericsson organisation, Ericsson Business Development. I met with one of the key people in this organisation, Gunnar Tyrsing, whom I had got to know when we both participated in a Ph.D. course on data communication. He had made a fast career developing the Japanese 2G mobile system. He was therefore a manager that understood both technology and management, and he quickly understood the possibilities of NDB. He forwarded the assignment of working with me to Ton Keppel, who became the manager of NDB development for the next 3 years.

Together with Ton Keppel I built up a new NDB organisation and we hired a lot of people in 2000-2002, many of whom are still working at MySQL, some of them in MySQL Cluster development but also a few in management positions. One of the first things we did was to convert the NDB code into a C++ program.

Those days were interesting but very hectic. I even had my own personal secretary for a period when I had meetings all the time: developer meetings, management meetings, meetings with our owner Ericsson Business Innovation, meetings with potential customers and many, many meetings with potential investors. I participated in a course in 2000 on innovation in a large organisation where I learned a lot about marketing, and we visited a number of VCs in the US. While visiting one of these VCs in April 2001 I remember him stating that two weeks previously the market for business-to-business web sites had imploded. This was the first sign I saw of the IT crash. Half a year later the problems came to us. In 2000 we had access to an almost unlimited budget, and at the end of 2001 we started working with a more normal and constrained budget. At this time the number of ventures was decreased from 30 to 5 in half a year. So every week a new venture was closed and we all knew that our venture could be closed at any time. Somehow we managed to survive and we were one of the 5 ventures left after the downsizing completed.

We refocused our efforts towards the telecom market after the IT crash; our first target market had been the stock exchange market, which wasn't a bright market after the stock market took a turn for the worse. This meant that we finalised all the work on node recovery and system recovery that had been started in 1998-1999. We found a first telecom customer in Bredbandsbolaget, and they developed a number of their key network applications using MySQL Cluster, something many other customers have since replicated.

At this time in history all telecom vendors were bleeding and downsizing. Ericsson downsized from 110,000 employees to around 47,000 employees in a two-year period. So eventually the turn came to Ericsson Business Innovation as well. All remaining ventures, including us, had to start searching for new owners. Actually the quality of the ventures was at this point very high, since all of them managed to find new owners in just a few summer months in 2003.

So on the 2nd of September 2003 the acquisition of NDB into MySQL was completed and the NDB team was now part of the MySQL team. Our team came into MySQL at a very early point and increased the number of employees in MySQL significantly at the time.

Our team continued working with existing customers and we also started on integrating the NDB Cluster storage engine into MySQL. We decided to call the combined MySQL and NDB product MySQL Cluster and this name is what we still use.

After less than a year of working at MySQL, the VP of Engineering Maurizio Gianola approached me; he had decided that I should work on new assignments. This happened exactly at the time when the commercial success of MySQL Cluster started to happen. Obviously, after working on a project for more than 10 years it was hard to all of a sudden start working on a new project. Interestingly, I think it was good for me as a person to do this shift. What happened was that NDB was ready for success and needed a lot of detailed work on bugs, support, sales and new features for the customers. I have always been an innovator, and moving into a new project meant that I could innovate something new.

The new project I got involved in was MySQL partitioning. At first I had to learn the MySQL Server inside out, and in particular I had to learn the MySQL parser, which was far away from my areas of expertise. I had great help from Anthony Curtis during this time, and I spent about one year developing the code and one year fixing the bugs before it was released as part of MySQL 5.1. I had seriously good help on the QA part of this project, where Matthias Leich helped me by writing many new test programs that exercised the new partitioning code. Also, Peter Gulutzan had a special genius for finding out how to write test cases that found bugs. When I saw his test cases I quickly concluded that he had a lot of imagination about how SQL can be used.

Working with MySQL code improved my skills as a programmer. In the Ericsson environment I learned to work on an architectural level. At Ericsson I developed mental models that I finally "dumped" into source code. At MySQL I learned more about the actual coding skills.

At this point in time I decided to do something revolutionary in my life: I resigned from MySQL and started working as an independent consultant. I always want to have a feeling that things are moving upwards, and at this point in my life I didn't get this feeling. So I decided to be a bit revolutionary and take some risks. I had a large family with 5 kids at the time, so financially it was a bit of an adventure. It did however work out very well. I still continued to dabble with MySQL development as a part-time consultant, but I also worked with other companies. I worked a lot with Dolphin, a Norwegian company, which I helped develop marketing material for their network cards that could be used in combination with MySQL Cluster. I also worked as a MySQL consultant in the Stockholm area; I even developed a completely new 3-day MySQL course in 12 days, producing 300 new slides in a short time. It was a fun time, the company went well and I booked around 45 consultancy hours per week, so financially I came out quite ok.

As part of this company I also started developing my own hobby project, iClaustron, which I am still developing. It's mainly an educational project for me, but eventually I might release the code; after 7 years I have produced around 60,000 lines of code and it is actually starting to do something useful :).

Meanwhile MySQL had found a new VP of Engineering in Jeffrey Pugh and he was looking for a Senior MySQL Architect. He gave me an offer I could not refuse; it meant both that I returned to having a feeling of moving upwards in my career and that it was financially ok. Initially I worked mostly on getting a lot of rock stars to communicate. We had a number of strong, outspoken architects in the organisation, many of whom were very well-known in the MySQL world, and we also had a number of competent but less outspoken architects. So my work was to try to get these people to reach an agreement on technical development, which wasn't always so easy. We did however manage to shield the developers from the technical debates and ensure that, instead of debating each new feature in a large group, we allowed the developers to work with one or two architects in developing the new features. This has paid off well in that our developers are much more independent in the current development organisation.

At the end of 2008 it became clear that MySQL was in great need of a scalability project. At this time we were part of Sun after being acquired early in 2008, and we had a skilled Sun team assisting us in this project. Given my background in performance-oriented design I dedicated much of my time over the next two years to this project. We delivered a breakthrough scalability improvement about once per year at the MySQL conference. At the first one in 2009 we also got word of the Oracle acquisition; I personally met a person coming out of the elevator at 7am on the way to breakfast who asked how I felt about being acquired by Oracle. We have continued on this trend of delivering new scalability enhancements all the way until now, when MySQL 5.6 was released with substantial performance enhancements, and we've come a really long way on this path since we started this development almost 5 years ago.

In the meantime I also worked with Kelly Long to develop the MySQL Thread Pool design. There was a previous design in the MySQL 6.0 project, but this design didn't scale and we opted for a completely new design based on splitting connections into thread groups. Kelly Long left MySQL to do a new startup in the networked storage business, and this week I read how he sold his new venture for $119M, so working with MySQL means working with many talented people. The thread pool design was recently updated for MySQL 5.6 and its design rocks also at very much higher throughput.

The last few years have seen tremendous performance improvements in various areas. Much of the scalability improvement now happens as part of the normal engineering process and I mainly focus on helping out in small, directed projects. So I helped out with the split of the LOCK_open mutex, with the increased scalability of the binlog group commit writes and with resolving the bottleneck in InnoDB we called G5, which essentially was a one-liner problem.

Other areas I have focused my attention on have been back in MySQL Cluster. In MySQL Cluster 7.2 we increased the scalability of MySQL Cluster by almost 10x by adding support for multiple send threads, multiple receive threads, multiple TC threads and scaling to even more local database threads (LDM threads). In MySQL Cluster 7.3 I just completed work on much improved scalability of the NDB API.

For me the last few years have been interesting. I have had some tough health issues, but at the same time the efforts we have made have paid off in ways never seen before. We've increased the performance of the MySQL Server by 3.3x, we've increased the performance of MySQL Cluster by almost 10x, we've demonstrated 1 billion updates per minute in MySQL Cluster and we're ready to scale MySQL Cluster 7.3 to hundreds of millions of reads per second.

I tend to call the things I've done the last few years surgery. I go into working code, find the hot-spots and find ways to remove them. The patches to do this are usually very small, but one needs to be very careful with the changes since they touch parts of the code that are extremely central. As with real surgery the impact of these surgical patches can be astounding.

So where does the future take us? Well, there are still many interesting things to develop further. Every time we put some effort into parallelising some part of the MySQL code we can expect at least a 10x speedup. There are possibilities to speed up local code in most areas, and there are possibilities to use the dramatic surge of CPU power to apply compression even to main-memory data. There are lots of tools that we can build around MySQL that provide more scalability and more high availability. We also know that the HW industry is innovating and they are likely to put things on our table that enable new things we never dreamed of doing before. One thing I particularly look forward to is when the HW industry can replace hard drives with something similar to DRAM but with persistent memory. When this happens it can change a lot of things we currently take for granted.

I have found in my career that as an innovator it's important to be ready to let go of your development and put it into other people's competent hands. This means that one can move on to new areas. If one continues to work in the same organisation one can then always return to the old work areas. So I still do a lot of work in MySQL Cluster, I still review work done on MySQL partitioning, I still participate in working on the MySQL thread pool, I still help out on various MySQL scalability projects and there are even some new projects yet to be released where I continue in the review role after helping out in the early phases.

I did however keep one project solely to myself, and this is the iClaustron project. It's nice to be able to work on something completely without regard to any other person's view and do exactly as I please. It's refreshing to do this every now and then, and for me it has served as a very important tool to keep me up-to-date on build tools, code organisation, modularisation of code and many other aspects of software engineering.

Tuesday, May 14, 2013

MySQL Cluster 7.3 Improvements - Connection Thread Scalability

As many have noted, we have released another milestone release of MySQL Cluster 7.3. One of the main features of 7.3 is obviously foreign keys. In this post I am going to describe one more feature added to MySQL Cluster in the second milestone release, called Connection Thread Scalability.

http://dev.mysql.com/tech-resources/articles/cluster-7.3-dmr2.html

Almost all software designed for multithreaded use cases in the 1990s has some sort of big kernel mutex; as a matter of fact, this is also true for some hyped new software written in this millennium and even in this decade. Linux had its big kernel lock, InnoDB had its kernel mutex, the MySQL Server had its LOCK_open mutex. All these mutexes are characterized by the fact that they protect many things that often have no connection with each other. Most of these mutexes have been fixed by now: the Linux big kernel lock is almost gone, LOCK_open is more or less removed as a bottleneck, and the InnoDB kernel mutex has been split into ten different mutexes.

In MySQL Cluster we have had two types of kernel mutexes. In the data nodes the "kernel mutex" was actually a single-threaded execution model. This model was very efficient but limited scalability. In MySQL Cluster 7.0 we extended the single-threaded data nodes to be able to run up to 8 threads in parallel instead. In MySQL Cluster 7.2 we extended this to support 32 or more threads.

The "kernel mutex" in the NDB API we call the transporter mutex. This mutex meant that all communication from a certain process, using a specific API node, used one protected region that protected communication with all other data nodes. This mutex could in some cases be held for substantial times.

This has meant that there has been a limit on how much throughput can be processed using one API node. It has been possible to achieve high throughput anyway by using multiple API nodes per process (this is the configuration parameter ndb-cluster-connection-pool).

What we have done in MySQL Cluster 7.3 is fix this bottleneck. We have split the transporter mutex and replaced it with mutexes that protect sending to a specific data node, mutexes that protect receiving from a certain data node, mutexes that protect memory buffers and mutexes that protect execution on behalf of a certain NDB API connection.
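Conceptually the change looks something like the sketch below. The class and member names are invented for illustration and do not match the actual NDB API source; the point is simply that one global mutex is replaced by several smaller ones, each guarding an independent resource.

```cpp
#include <mutex>

// Before (conceptually): one mutex guarded everything the API node did.
struct OldTransporterState {
  std::mutex transporter_mutex;   // held for sends, receives, buffers, ...
};

// After (conceptually): independent mutexes per resource, so sending to one
// data node no longer blocks receiving from another.
struct PerNodeConnection {
  std::mutex send_mutex;          // protects the send buffer to this node
  std::mutex recv_mutex;          // protects receive processing from this node
};

struct NewTransporterState {
  static const int kMaxDataNodes = 48;       // illustrative fixed maximum
  PerNodeConnection nodes[kMaxDataNodes];    // one entry per data node
  std::mutex buffer_pool_mutex;              // protects shared memory buffers
};
```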

This means a significant improvement of throughput per API node. If we run the flexAsynch benchmark with just one data node, where one API node handles around 300-400k transactions per second, this improvement increases throughput by around 50%. A Sysbench benchmark with one data node is improved by a factor of 3.3x. Finally, a DBT2 benchmark with one data node is improved by a factor of 7.5x.

The bottleneck for an API node is that only one thread can process incoming messages for one connection between an API node and a data node. For flexAsynch there is a lot of message processing per node connection; it is much smaller in Sysbench and even smaller in DBT2, and thus these benchmarks see a much higher improvement from this new feature.

If we run with multiple data nodes the improvement increases even more since the connections to different data nodes from one API node are now more or less independent of each other.

The feature will improve the performance of applications without any changes to the application; the changes are entirely inside the NDB API and thus improve the performance of NDB API applications as well as MySQL applications.

It is still possible to use multiple API nodes for a certain client process. But the need to do so is much smaller and in many cases even removed.

Monday, May 13, 2013

MySQL Thread Pool in 5.6

MySQL Enterprise Edition included the thread pool in its MySQL 5.5 version. We have now updated the thread pool also for the MySQL 5.6 Enterprise Edition.

You can try it for free at trials

The MySQL thread pool is developed as a plugin, and all the interfaces needed by the thread pool are part of the MySQL community server, enabling anyone to develop their own version of the MySQL thread pool. As part of the development of the thread pool we did a lot of work to make it possible to deliver stand-alone plugins using the interfaces provided by the MySQL server. Most of these interfaces were already available in MySQL 5.1, but we extended them and made them more production ready as part of MySQL 5.5 development. So a plugin can easily define its own configuration variables and its own information schema tables, and also easily use performance schema extensions. A plugin can also have its own set of tests.

Thus one can build a new plugin by adding a new directory in the plugin directory. What is then needed is the source code of the plugin, the tests of the plugin and finally a CMakeLists.txt file to describe any special things needed by the build process. The public services published by the MySQL server are found in the include/mysql directory. There are currently six services published by the MySQL Server in 5.6: the my_snprintf, thd_alloc, thd_wait, thread_scheduler, my_plugin_log and mysql_string services. One usually also needs to integrate with a bit more of the MySQL server, since modularizing the MySQL server is a work in progress. To help other developers understand what private interfaces are used by a plugin, some plugins also provide a plugin_name_private.h file. There is currently such a file for the thread pool and for InnoDB.
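As a rough illustration of what a plugin declaration looks like, here is a minimal daemon plugin skeleton with one configuration variable. The names (example_plugin, queue_size) and the init/deinit bodies are made up for the example, and the exact macro arguments can differ between server versions, so treat this as a sketch rather than a reference.

```cpp
#include <mysql/plugin.h>

// Hypothetical configuration variable, visible as example_plugin_queue_size.
static int queue_size;
static MYSQL_SYSVAR_INT(queue_size, queue_size,
                        PLUGIN_VAR_RQCMDARG,
                        "Size of the example queue",
                        NULL, NULL,       /* check and update functions */
                        64, 1, 1024, 0);  /* default, min, max, block size */

static struct st_mysql_sys_var *example_system_variables[] = {
  MYSQL_SYSVAR(queue_size),
  NULL
};

static int example_plugin_init(void *)   { return 0; }  // start threads etc.
static int example_plugin_deinit(void *) { return 0; }  // stop threads etc.

static struct st_mysql_daemon example_plugin_descriptor =
{ MYSQL_DAEMON_INTERFACE_VERSION };

mysql_declare_plugin(example_plugin)
{
  MYSQL_DAEMON_PLUGIN,
  &example_plugin_descriptor,
  "example_plugin",
  "Some Author",
  "An example daemon plugin",
  PLUGIN_LICENSE_GPL,
  example_plugin_init,
  example_plugin_deinit,
  0x0100,                       /* version 1.0 */
  NULL,                         /* status variables */
  example_system_variables,     /* system variables */
  NULL,                         /* reserved */
  0                             /* flags */
}
mysql_declare_plugin_end;
```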

Buying and reading the book MySQL 5.1 Plugin Development will be helpful if you want to develop a new MySQL server plugin. The MySQL documentation also has a chapter on extending MySQL at http://dev.mysql.com/doc/refman/5.6/en/extending-mysql.html

The thread pool has now been updated for the MySQL 5.6 version. Obviously, with the much higher concurrency of the MySQL Server in 5.6, it's important that the thread pool doesn't add any new concurrency problem when scaling up to 60 CPU threads. The good news is that the thread pool works even better in MySQL 5.6 than in MySQL 5.5. MySQL 5.6 has fixed even more issues when it comes to executing many concurrent queries, and this means that the thread pool provides even more stable throughput almost independent of the number of queries sent to it in parallel.

Our benchmark numbers using Sysbench OLTP RW and Sysbench OLTP RO show that the thread pool continues to deliver at least 97% of the top performance even when going up to 8192 concurrent queries. The thread pool has a slight overhead of 2-3% at low concurrency, but the benefit it provides at high concurrency is breathtaking. At 8192 concurrent queries the thread pool delivers 13x more throughput compared to standard MySQL 5.6.



So let's step back and see in which cases the thread pool is useful. Most production servers are used at low loads most of the time. However, many web servers have high-load cases where a massive load happens every now and then. This could be due to more users accessing the site, but it could also be due to changes of the web site that mean that many web caches need to be refreshed. In those cases the MySQL Servers can all of a sudden see a burst of queries, which means that the MySQL Server has to execute thousands of queries simultaneously for a short time.

Executing thousands of queries simultaneously is not a good use of the computer resources. The problem is that there is a high chance that we run out of memory in the machine and have to start using virtual memory that is not resident in RAM. This will have a severe impact on performance. Also, switching between thousands of threads means a high cost in CPU cache usage, since each time a thread is swapped in it has to build up the CPU caches again, and this means more work for the CPUs. There are also other scalability issues within the MySQL server code that make contention on certain hot-spots higher when many queries are executed concurrently.

So how does the thread pool go about resolving this problem to ensure that the MySQL Server always operates at optimum throughput? The first step is that when many concurrent queries arrive, these queries are to some extent serialised: rather than executing all queries at once, we put the queries in a queue and execute them one by one. Obviously there needs to be some special handling of long-running queries to ensure that these queries don't block short queries for too long. Some special handling is also needed for queries that block for some reason, such as on a row lock, a disk I/O or some other wait event.

The thread pool divides connections into thread groups. The number of thread groups is configurable; in really big machines it can pay off to have as many as 60 thread groups with MySQL 5.6. Each thread group normally executes either 0 or 1 query. Only in the case of blocked queries and long-running queries does a thread group execute more than 1 query for a short time.
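A very simplified sketch of this idea is shown below. The real thread pool has much more machinery (listener threads, stall detection, timers and so on); this just illustrates how a connection maps to a thread group and how each group executes its queued queries one at a time. All names are invented for the illustration.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

// One unit of work: execute a query on behalf of a connection.
using QueryTask = std::function<void()>;

// Simplified thread group: queued tasks are executed one at a time.
class ThreadGroup {
public:
  void enqueue(QueryTask task) {
    std::lock_guard<std::mutex> guard(lock_);
    queue_.push(std::move(task));
  }
  // Called by the group's worker thread: run at most one query, then return.
  void run_one() {
    QueryTask task;
    {
      std::lock_guard<std::mutex> guard(lock_);
      if (queue_.empty()) return;
      task = std::move(queue_.front());
      queue_.pop();
    }
    task();  // at most one query per group executes at a time
  }
private:
  std::mutex lock_;
  std::queue<QueryTask> queue_;
};

// Simplified pool: a connection is assigned to a group by its id.
class SimpleThreadPool {
public:
  explicit SimpleThreadPool(std::size_t groups) : groups_(groups) {}
  void submit(std::size_t connection_id, QueryTask task) {
    groups_[connection_id % groups_.size()].enqueue(std::move(task));
  }
private:
  std::vector<ThreadGroup> groups_;
};
```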

The above description serves to avoid the problems with swapping when using too much memory, and it also fixes the issue with too many context switches taxing the CPUs.

There is however still one problem that can cause issues, and this is the fact that transactions hold resources even when no query is running at the moment. To solve this problem the thread pool puts a higher priority on connections that have already started a transaction. Started transactions can hold locks on rows that block other transactions from proceeding, and they occupy sensitive hot-spot protected regions that can slow down the execution of all other transactions. So giving started transactions a higher priority makes a lot of sense. Indeed, this feature means that we can sustain 97% of the optimum throughput even with 8192 concurrent queries running; without this feature we would not be able to handle more than around 70% of the optimum throughput at such high concurrency.
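Continuing the simplified sketch from above, the priority handling can be pictured as two queues per thread group, where work from connections that already have an open transaction is picked first. Again, the names and structure are invented for illustration and are not the actual thread pool implementation.

```cpp
#include <functional>
#include <mutex>
#include <queue>

using QueryTask = std::function<void()>;

// Simplified: each thread group keeps a high-priority queue for connections
// that already have a transaction open, and a normal queue for the rest.
class PrioritizedThreadGroup {
public:
  void enqueue(QueryTask task, bool connection_in_transaction) {
    std::lock_guard<std::mutex> guard(lock_);
    (connection_in_transaction ? high_prio_ : normal_).push(std::move(task));
  }
  void run_one() {
    QueryTask task;
    {
      std::lock_guard<std::mutex> guard(lock_);
      std::queue<QueryTask> &q = !high_prio_.empty() ? high_prio_ : normal_;
      if (q.empty()) return;            // nothing to do
      task = std::move(q.front());
      q.pop();
    }
    task();  // finish ongoing transactions first so they release their locks
  }
private:
  std::mutex lock_;
  std::queue<QueryTask> high_prio_;  // connections with an open transaction
  std::queue<QueryTask> normal_;     // connections without one
};
```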

One way of describing the thread pool is as an insurance. If you run without the thread pool your system will run just fine for most of the time. But in the event of spike loads where the system is taking a severe hit the thread pool serves as a protection measure to ensure that the system continues to deliver optimum throughput also in the context of massive amounts of incoming queries.

The above description fits the use cases for InnoDB. It is also possible to use the thread pool in MySQL Cluster 7.2 and onwards. For MySQL Cluster the thread pool has a slightly different use case: here the thread pool ensures that the data nodes of MySQL Cluster are protected from overload. So setting the number of thread groups when using MySQL Cluster means that this MySQL Server can only process that number of queries concurrently, thus ensuring that the data nodes are protected from overload and also that we always operate in an optimal manner. Naturally, with MySQL Cluster one can have many MySQL Servers within one cluster, and thus the thread pool can provide a more stable throughput also in highly scalable use cases of MySQL Cluster.