Monday, October 08, 2012

An architect's view of the MySQL Development model

When I stepped into my current job as Senior MySQL Architect in the summer of 2007, one of the first things on the agenda for discussion was the development model. We had released 5.0 almost 2 years earlier and we were still more than a year from releasing MySQL 5.1. The model we used for developing MySQL 5.0 and MySQL 5.1 didn't work as well as we wanted, so we had many discussions on how to improve it.

Fast forward 5 years to today: we are now using the Milestone release model and are just about to release the second GA version that has been developed using this model.


Differences in old and new development model

So what is the main difference between the new development model and the old development model?

I think the most important attribute of our current model is that the development tree always has to be in a releasable state every 2-4 months, and it has to be possible to bring this releasable state to a GA version in around 6 months of stabilisation. Our definition of releasable state is that the quality is at Release Candidate level. The old model effectively only required a releasable state when the version was reaching Release Candidate state, since new features were still being added in the beta stage. So a releasable state was only reached every 2 years instead of every 2-4 months.




Impact on new code in MySQL

So what does this mean for how we need to develop new code in MySQL? It means that new code must be added in fairly small chunks. Even more, it means that the new code cannot be too intermingled with old code, since that would introduce too many new bugs. So if we need to add a new feature in an area where the complexity of the MySQL Server is high, then we first need to reengineer this area before we can proceed with pushing the actual new feature. So in a sense the highest priority is always given to maintaining a stable development tree. In the past, getting a new feature into the MySQL Server could sometimes have higher priority than the stability of the development tree.

So interestingly, by prioritizing tree stability before new features, we are able to develop more new features! But we are not always able to develop a specific new feature as quickly as we would like. If a new feature requires tampering with complex MySQL Server code, then it might require a number of preparatory projects before it can actually be turned into a real feature. But anything that can be done outside of the complex parts of the MySQL Server can be added quite quickly.


Development of major features in new model

How does one handle the addition of new major features in this model? The main consideration is that all new code must be fairly well structured. So when adding a new feature we can't simply start coding along; we first need to consider whether the current code structure can be used to develop this new feature. As an example, one of the last things added in MySQL 5.6 was a split of LOCK_open. If we had tried to add this feature in earlier versions of the MySQL Server, we would have had a new feature that touched many functional parts of the MySQL Server, and it would have introduced a significant set of new bugs to fix before any new GA version could have been released. Instead we went about things differently. First a number of reengineering projects narrowed the role of LOCK_open so that it only protects metadata operations on tables and caching of metadata in connections. When these steps were completed, about halfway through the MySQL 5.6 development, it was very straightforward to make the actual split of LOCK_open in a fairly small patch, which was a safe change.

Inspiration to new development model

Our selection of this new development model obviously was inspired by lessons from other successful open source projects such as the Linux kernel.


Change from old to new model

The change from the old model to the new model was probably the most difficult part. At the time of the change we had a new development tree which had some fairly significant new features implemented in it. But the tree was very far away from getting to a releasable state, so the tough decision was taken to restart development with a new tree based on the MySQL 5.1 development tree. This tree was the start of development using the new milestone model. The features from the old development tree were moved over to the new development tree step by step. This process actually took a few years, since there are very high quality requirements on new features making it into the new development tree. But now we've moved all the features from this old development tree that we will move. Some of the features have been dropped or simply completely reimplemented.

Impact on future MySQL development

So what does this mean for current and future MySQL development? In MySQL 5.6 we added 200 new features, so obviously the new development model leads to a high pace of new interesting changes that will benefit our users and customers. We expect that the development model will continue to benefit us and that we will continue delivering new interesting features at a high pace to the MySQL community. As an architect I am proud to see how the MySQL development team continuously improves the MySQL architecture, thus making it possible to add new features at an even higher pace.

Thursday, October 04, 2012

My personal list of new features in MySQL 5.6

This blog post simply lists and describes a few of the most interesting new features in MySQL 5.6. The list simply reflects my preferences, based on my development interests and what I think are very interesting features in general.

In MySQL we develop new features using worklogs. The development is divided into milestones, and the MySQL 5.6 development was divided into 6 different milestones. In total there were around 200 worklogs developed in MySQL 5.6, which is a quite impressive set of new features. Naturally there is also a ton of bug fixes in MySQL 5.6, but I won't go into those here.

In this blog I will only mention 30-40 of those worklogs, but even this is a fairly large number of interesting features.

I will discuss features in areas of interest.

Partitioning

Let's start with partitioning. We developed partitioning originally as a 5.1 feature. Now development is mostly handled by Mattias Jonsson and we meet regularly to discuss new bugs and features in partitioning. MySQL 5.6 saw a couple of significant new partitioning features. With MySQL 5.6 it is now possible to import and export partitions to and from tables as a very quick operation (except when you import and want to check rows for correctness as part of the import). This means you can insert your data into a single table and then decide to move the table into a partitioned table after a while. This feature can be used for a variety of things.

We have also worked hard on improving performance with many partitions. We have fixed a number of performance issues in the algorithms, and we have improved the locking so that we only lock those partitions that are actually used in the query. This improvement required some reengineering effort in the MySQL Server, which is beneficial to many parts other than partitioning. Another thing we added is that it is now possible to specify in queries which partitions one wants to use, thus partitions can in a sense be treated as tables in queries if need be.
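To make the idea of locking only the partitions a query actually uses a bit more concrete, here is a minimal sketch in Python. It is not the MySQL Server code. It assumes a table partitioned by range, where each partition has an exclusive upper bound, and the function name `partitions_to_lock` and its parameters are made up for the illustration.

```python
import bisect


def partitions_to_lock(upper_bounds, low, high):
    """Return indexes of the range partitions that can hold low <= key <= high.

    upper_bounds[i] is the exclusive upper bound of partition i, sorted
    ascending. Use float("inf") as the last bound for a catch-all partition.
    (Hypothetical helper, only meant to illustrate the pruning idea.)
    """
    if low > high:
        return []
    # A key k belongs to the first partition whose upper bound is > k.
    first = bisect.bisect_right(upper_bounds, low)
    last = bisect.bisect_right(upper_bounds, high)
    # Keys beyond the last bound have no partition; clamp to the last index.
    last = min(last, len(upper_bounds) - 1)
    return list(range(first, last + 1))


bounds = [100, 200, 300, float("inf")]
print(partitions_to_lock(bounds, 150, 250))  # prints [1, 2]
```

With four partitions, a query on keys 150 to 250 only needs to touch partitions 1 and 2, so under this scheme the other two would not need to be locked at all.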

Optimizer

The next area is the optimizer. Many new things have happened here and I am not the main expert in this area, but even from my view the improvements in subquery handling, index condition pushdown, persistence of optimizer stats (both in the server and in InnoDB), new algorithms for multi-range reads focused on disk sweeps and an improved sort function are very interesting, and many of them give extreme performance improvements for certain queries. A feature I also think is very important is the optimizer trace. This feature came up a while ago in a reengineering project and it's nice to see it completed now. It gives anyone the ability to understand exactly how the optimizer makes its decisions, and thus makes it possible for both users and developers to understand how complex queries can be improved.

The new ability to use EXPLAIN on UPDATE and DELETE statements as well is obviously a very important addition for users of MySQL.

Replication

The next area is replication, which contains a large number of functional features that will make it easier to use. There are a number of improvements that I consider very interesting. The first is the improved scalability of binlog writing, which makes the scalability balanced such that we're scalable both for replicated and non-replicated scenarios. The introduction of GTIDs is also a very important feature that will make life much easier for developers of various sharding and HA solutions based on MySQL replication. Finally, the parallel slave applier is of course another important addition to the scalability of MySQL replication. Many new MySQL 5.6 features make it easier to use MySQL as a component in larger systems.

MySQL Server internals

In the MySQL Server I also of course take an interest in the split of LOCK_open and the work on metadata locks. These changes are very significant for MySQL Server scalability. There are also many new features in the performance schema area, making it possible to monitor MySQL in numerous new ways not possible in the past. A major effort was also made in the area of on-line alter table changes. Now there is an API in the MySQL Server to make any change on-line if the storage engine supports it, and InnoDB specifically supports on-line add index.

InnoDB

Given my background in developing MySQL Cluster, and my general interest in everything around performance, it's not surprising that almost everything that happens in the InnoDB storage engine is something I consider interesting.

InnoDB has developed extensively in MySQL 5.6 both in terms of scalability and functionality.

In the area of scalability InnoDB has seen improvements of:
- RW-locks for buffer pool hash
- Split of kernel mutex
- Multi-threaded purge
- Reduced contention on log mutex
- Increased max size of REDO log files
- Improvements to InnoDB thread concurrency feature
- Read-only transaction optimisations

Furthermore InnoDB also has new major features such as fulltext search, preloading of buffer pool, separate tablespaces for UNDO log, flexible page sizes, memcached support for InnoDB and transportable tablespaces.

Finally InnoDB has improved stability of performance through improved flushing, and it has new low-level monitoring (InnoDB metrics).

As a final thing we have also updated default configurations to make it even easier to manage MySQL Servers.

So MySQL 5.6 is a release that will be very useful for both new and old MySQL users.

Wednesday, October 03, 2012

Scalability improvements in MySQL 5.6

At MySQL Connect last weekend MySQL 5.6 was released as a Release Candidate. It contains a large set of improvements over MySQL 5.5. In this blog I will focus on what has changed in the area of scalability.

There are four main areas of scalability improvement in MySQL 5.6:
1) Splitting InnoDB Kernel mutex
2) The G5 discovery
3) Splitting LOCK_open mutex
4) Binlog group commit

In InnoDB as in so many other products there was an infamous kernel mutex. This mutex was a sort of catch all mutex used to protect the system from concurrent use of various data structures. For an implementer it was easy enough to use the kernel mutex as a way to achieve safe concurrency. The problem of course is that it hurts our scalability. The kernel mutex was a major bottleneck in many benchmarks, and even more important it also made it very difficult to resolve other scalability issues.

Already early in the MySQL 5.6 development an effort in the InnoDB development team, led by Sunny Bains, took the kernel mutex and split it into around 10 new mutexes. So effectively the kernel mutex was previously used for 10 different things. Now, through the split, each new mutex protects only one logical entity. This naturally had a great impact on scalability in the MySQL 5.6 release, but it will also make future developments easier. Since each new mutex now protects only one logical entity, it is much easier to see how to improve handling of each of those 10 new mutexes when they become scalability issues. The new mutex that has the most contention now is the lock mutex. This mutex mainly protects the metadata of tables.

The G5 discovery refers to a name we gave a patch internally. After splitting the kernel mutex we made several experiments to see which of the remaining mutexes would make most sense to fix next. So some experimental patches were developed that entirely removed a number of mutexes (obviously not a fully working MySQL version, but sufficiently working for a simple benchmark). The puzzle we met was that whatever we removed, we had no success in improving performance. We looked at this problem using various tools without finding the root cause. Finally we used oprofile and the new Linux perf tool to see which lines of code we spent time in, and we eventually discovered the issue. The problem was that every time a row was read in InnoDB we updated a statistics counter. In the case of Sysbench we have a number of scans where each scan hits 100 rows, so for each Sysbench transaction we read around 410 rows. Modern 4-socket servers are fast, but if one tries to update the same variable from all CPUs at once, then the cache line in which this variable resides will bounce back and forth between the CPUs in the system. The systems we used for testing could handle 2-4 million updates per second of the same variable when updates were applied from CPUs residing in different sockets. So effectively this statistics counter kept the number of rows read in a MySQL Server down to 2-4 million per second, or converted to Sysbench transactions, we could not deliver more than 5,000-10,000 TPS. This variable has no major effect in MySQL 5.5, since 5.5 has other scalability issues that hide this one. But in MySQL 5.6 we solved so many scalability issues that this variable became a major bottleneck. A simple solution would have been to just remove this line, but we developed a more proper solution that we also used in many other similar areas where this could potentially become an issue.

Finding and resolving this issue was a team effort made possible by the InnoDB team, the MySQL Server team and the MySQL performance experts. So the development resources available in the Oracle MySQL organisation make it possible to continue scaling MySQL towards new heights.
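The arithmetic behind the TPS ceiling is easy to check, and the general shape of a fix can be sketched as well. The Python below is only an illustration. The first function just divides the figures quoted above. The `ShardedCounter` class shows one common way to avoid a single hot variable (spreading updates over several slots and summing them on read); that this resembles the actual fix is an assumption, and all names here are made up. In C or C++ each slot would also be padded to its own cache line, which Python cannot express, so the sketch shows the structure and not the hardware effect.

```python
import threading

ROWS_PER_TRANSACTION = 410  # rows read per Sysbench transaction (from the text)


def tps_ceiling(counter_updates_per_second,
                rows_per_transaction=ROWS_PER_TRANSACTION):
    """Upper bound on TPS when every row read updates one shared counter."""
    return counter_updates_per_second / rows_per_transaction


class ShardedCounter:
    """Statistics counter split into slots so writers rarely share a slot.

    Hypothetical illustration, not the InnoDB implementation.
    """

    def __init__(self, slots=64):
        self._slots = [0] * slots
        self._locks = [threading.Lock() for _ in range(slots)]

    def add(self, writer_id, amount=1):
        # Each writer (thread, connection or CPU) maps to its own slot.
        i = writer_id % len(self._slots)
        with self._locks[i]:
            self._slots[i] += amount

    def value(self):
        # Reading sums all slots; that is fine for rarely read statistics.
        return sum(self._slots)


print(round(tps_ceiling(2_000_000)), round(tps_ceiling(4_000_000)))
# prints 4878 9756
```

So 2-4 million counter updates per second at 410 rows per transaction works out to roughly 4,900-9,800 transactions per second, which is where the 5,000-10,000 TPS limit comes from.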

Another sort of kernel mutex is the LOCK_open mutex in MySQL. This mutex used to be a mutex to protect the metadata of a table within the MySQL Server. It was however used for many other things as well. This meant that splitting this mutex required a number of reengineering projects before we were ready to perform the actual split. Already early in the MySQL 5.6 development the ground work was finalised to perform the actual split through the work of the MySQL runtime team. The idea for the actual split came in a development meeting in Trondheim, where Dmitry Lenev and I quickly came up with an idea to protect TABLE objects separately through an array of mutexes and keep the LOCK_open only for creating new TABLE objects and other activities around table metadata changes.

Finally this idea also became a new addition to the MySQL 5.6 release. Standing on the shoulders of the G5 discovery and the split of the kernel mutex, the split of the LOCK_open mutex made the jump in performance very significant. In a number of benchmarks we were able to improve performance by 70% and more based on the LOCK_open split alone.
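The idea of an array of mutexes for TABLE objects, with one global mutex kept for creating new objects, can be sketched roughly as follows. This is a simplified Python illustration and not the server code. The class name, the choice of picking a stripe from a connection id, and the dictionary standing in for a TABLE object are all assumptions made for the example.

```python
import threading


class StripedTableCache:
    """Cache of table objects protected by an array of locks.

    global_lock plays the role LOCK_open keeps after the split: it is only
    taken when a new table object has to be created. (Hypothetical sketch.)
    """

    def __init__(self, stripes=8):
        self.global_lock = threading.Lock()
        self.stripe_locks = [threading.Lock() for _ in range(stripes)]
        # One free list per stripe: table name -> unused table objects.
        self.free = [dict() for _ in range(stripes)]
        self.created = 0

    def _stripe(self, connection_id):
        return connection_id % len(self.stripe_locks)

    def acquire(self, connection_id, table_name):
        i = self._stripe(connection_id)
        with self.stripe_locks[i]:
            unused = self.free[i].get(table_name)
            if unused:
                return unused.pop()  # common case: only a stripe lock taken
        with self.global_lock:  # rare case: create a new object
            self.created += 1
            return {"table": table_name, "id": self.created}

    def release(self, connection_id, table_obj):
        i = self._stripe(connection_id)
        with self.stripe_locks[i]:
            self.free[i].setdefault(table_obj["table"], []).append(table_obj)
```

Once the cache is warm, connections that map to different stripes never touch the same lock, even when they all use the same table. The price is that the same table may be cached once per stripe.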

What we have discovered with MySQL 5.6 and its benchmarking is that the standard Sysbench has a limiting factor. The problem is that Sysbench only uses 1 table, which means that any protection of this single table will be part of the scalability issues in running Sysbench. Since we assume that most, if not all, applications don't direct all queries towards one table, we also started benchmarking MySQL 5.6 using 2 tables and more in Sysbench. In this case we avoid bottlenecks related to use of only a single table.

In MySQL 5.6 we also decided to spend a serious effort on improving replication scalability. We were happy with the improvements of scalability in MySQL 5.5, but saw a need to also move more attention to the replication area. We set out with a modest goal of scaling replication to 16 CPUs. We actually managed to scale replication on the server side to 48 CPUs, as far as the MySQL Server itself is concerned. We managed this even with sync_binlog set to 1! So in MySQL 5.6 the binlog writing flies. The slave side also saw major improvements in scalability in MySQL 5.6.

The problem related to writing of the binlog was that we only had one mutex protecting those writes and all its phases. This meant among other things that this mutex also protected file writes to the binlog and even sync to the disks of the file writes.

In order to solve this issue Mats Kindahl and I experimented with a number of approaches to discover what worked best. We eventually came up with a sort of queueing system. So writes to the binlog were divided into phases. The first phase happens as part of transaction execution (as it does in MySQL 5.5) where binlog writes are gathered in a data structure operated by the connection. Thus there is no concurrency issue at all in this phase. The next step is to write this data into the file buffers operated by the operating system through write system calls. The next step is the sync of the writes to the disk. The last step is the commit that happens in the storage engines. Each of those steps is handled similarly but with minor differences.

Each step has two phases, a queue phase and an execution phase, and the two phases are protected by separate mutexes. Whenever a connection arrives at a step it grabs the queue mutex and places itself in the queue. If it is the first to arrive in the queue, then after releasing the queue mutex it also tries to grab the execution mutex. This means that while it waits for the execution mutex, more connections can place their data in the queue. They then wait for the execution owner to signal them when the work is done (this could happen after more than one step is executed). When the execution mutex is acquired, its owner also grabs the queue mutex, takes the queue, sets the queue to empty and releases the queue mutex. After this it executes all tasks in the queue.
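A single step of this queueing scheme can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions and not the binlog code: the class name `Stage` and the `work` callback are made up, each waiting connection is signalled right after this one step (the real design may signal after several steps), and error handling is left out.

```python
import threading


class Stage:
    """One step of the pipeline: a queue mutex plus an execution mutex."""

    def __init__(self, work):
        self.queue_lock = threading.Lock()
        self.exec_lock = threading.Lock()
        self.queue = []
        self.work = work  # called once per batch with a list of items

    def enter(self, item):
        """Queue an item; return True if this caller executed the batch."""
        done = threading.Event()
        with self.queue_lock:
            first = not self.queue  # an empty queue makes us the executor
            self.queue.append((item, done))
        if not first:
            done.wait()  # someone else executes our item and signals us
            return False
        with self.exec_lock:
            # Others can keep queueing while we wait for the execution mutex.
            with self.queue_lock:
                batch, self.queue = self.queue, []
            self.work([queued_item for queued_item, _ in batch])
            for _, event in batch:
                event.set()
        return True


batches = []
stage = Stage(lambda batch: batches.append(list(batch)))
threads = [threading.Thread(target=stage.enter, args=(n,)) for n in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(len(b) for b in batches))  # prints 20
```

How the 20 items are grouped into batches varies from run to run, but every item is executed exactly once, and the slower the work in the execution phase is, the larger the batches tend to become.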

The final commit step can be parallelised if desired, thus multiple connections can commit simultaneously. The only limitation to this is when a concurrent hot backup is happening at the same time.

What is interesting with this architecture is that it is flexible to where the bottleneck is. For some workloads the bottleneck is in the commit phase, in this case it makes sense to parallelise this part which is possible. For others it is the sync phase which is the bottleneck and for yet other workloads the writing to the buffer is the bottleneck. Wherever the bottleneck resides this architecture makes the best of the situation and scales MySQL 5.6 replication to new heights.

So with this impressive set of new improvements in MySQL 5.6, what is left to do? As usual when one develops a new feature or performance improvement, one also finds out about a ton of other things one wants to improve. So the list of ideas is not empty, and there are some pretty interesting scalability improvements lined up for MySQL 5.7 as well. We currently have access to machines with 96 CPU threads and bigger machines are likely to appear. We also cooperate with other parts of Oracle where even larger systems are possible. So our aim continues to be to keep MySQL scalable on commodity servers. Naturally we have noted that many of our customers are sharding their data sets to handle even higher workloads. A long career in distributed systems has taught me that it is extremely important to do proper partitioning of data sets to achieve scalability at the network level. But it is still extremely important to make each node in the distributed system as large as possible. This decreases maintenance activities, and it minimises issues in sharding by minimising the number of shards. We are pursuing this effort both in the MySQL Cluster area and in the MySQL area where InnoDB is used as a storage engine.