Thursday, October 20, 2016

Read any replica in MySQL Cluster 7.5

MySQL Cluster uses a distributed architecture where we have one primary
replica and one or more backup replicas of all data in the cluster.
In previous versions we have not supported reading from the backup
replicas. The reason for this comes from the transaction algorithm in
the NDB storage engine. When a transaction has committed, the rows in
the primary replica have been unlocked, but the locks in the backup
replica are not released until the complete phase.

The impact of this is that we could potentially perform an update
of a row and then immediately read it and not see our own write. To
avoid this problem we have avoided reading the backup replica.

If it is important to read the backup replica, then we need to ensure
that transactions are not reported back to the user until the complete
phase has finished.

When we decided to support reads from backup replicas we considered how
to handle this. One way would be to simply have a flag in the
updating transactions that signals that we want the response after the
complete phase instead of after the commit phase.

We decided that this approach would be possible, but we opted for a
safer approach where a table property defines whether the backup
replicas of a table can be read or not.

Given that there is a latency cost in waiting until the complete phase
is done, this feature is not enabled for all tables. In 7.5 the default
is that reading from backup replicas is not enabled.

There are a number of ways to enable reading from backup replicas.

Let's start with the case where you haven't created the table yet.
In this case the first way to handle it is to create the table
with the read backup flag set. This is done by using a special
flag set in the COMMENT section of the CREATE TABLE statement.
To set the flag we use the following comment:
COMMENT="... NDB_TABLE=READ_BACKUP=1 ... "
We run a parser on the COMMENT section where we look for the
NDB_TABLE keyword, and after that we look for an equals sign
followed by a new keyword READ_BACKUP which can be set to 1 or
0. It is also possible to set several table properties in one comment,
in which case we separate the properties with a comma. So e.g.
COMMENT="NDB_TABLE=READ_BACKUP=1,PARTITION_BALANCE=FOR_RA_BY_NODE"

The above feature is a good approach when you want to read from the
backup replicas of some tables, but not of all tables. A more likely
approach is to set the new configuration variable ndb-read-backup
in the MySQL Server, which ensures that all tables get created with
the read backup flag set.

So basically it is necessary to decide whether you want to focus on
read performance or on low latency of updating transactions. Given
that NDB is very often used for high write loads we have kept the
default to not set the read backup flag. But many applications
are read-focused, so in that case it makes sense to set the
config variable ndb_read_backup to 1.
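
Since ndb-read-backup is a MySQL Server configuration variable, a
minimal sketch of enabling and checking it could look as follows (the
SHOW VARIABLES statement assumes the underscore form of the variable
name):

-- In my.cnf under [mysqld] of every MySQL Server:
--   ndb-read-backup=1
-- From SQL, the setting can be inspected as a system variable:
SHOW VARIABLES LIKE 'ndb_read_backup';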

There is one more case: already existing tables that have been
upgraded from an earlier MySQL Cluster version. In this case it is
possible to set the read backup flag on those tables online.

The syntax to use this for a table called t1 would be:
ALTER TABLE t1 algorithm=inplace, COMMENT="NDB_TABLE=READ_BACKUP=1";
This will run an online change that can happen concurrently with
reading and writing transactions.

Now, equipped with the read backup flag set on your tables, it is
possible to localise your application. Assume that you have a MySQL
Cluster setup where MySQL Server 1 is colocated with NDB data node 1,
MySQL Server 2 with NDB data node 2, and so on.

Now we want to ensure that MySQL Server 1 always uses NDB data node 1
for reading, and vice versa. If MySQL Server 1 and NDB data node 1 are
using the same hostname or IP address we will automatically detect that
they are on the same host, so in this case we will always prefer to
send reads to the same host. If they are on the same host but use
different IP addresses, then we can tell MySQL Server 1 that its
closest data node neighbour is NDB data node 1 by setting the node id
of NDB data node 1 in the MySQL configuration variable
ndb-data-node-neighbour.
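
A minimal sketch of this, assuming NDB data node 1 has node id 1 (the
SQL statement merely inspects the corresponding system variable):

-- In the my.cnf of MySQL Server 1, under [mysqld]:
--   ndb-data-node-neighbour=1
-- From SQL, the current setting can be inspected:
SHOW VARIABLES LIKE 'ndb_data_node_neighbour';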

So with these settings we are able to more or less build a read-optimised
architecture using MySQL Cluster.

Many people that use MySQL/InnoDB and try out MySQL Cluster
will most likely expect the behaviour provided with the read
backup flag set. So users of MySQL Cluster that come from the
MySQL world should most likely set the ndb-read-backup flag
to get the expected behaviour.

HopsFS based on MySQL Cluster 7.5 delivers a scalable HDFS

The Swedish research institute SICS has worked hard for a few years on
developing a scalable and highly available Hadoop implementation using
MySQL Cluster to store the metadata. In particular they have focused on
the Hadoop file system (HDFS) and YARN. Using features of MySQL
Cluster 7.5 they were able to achieve linear scaling in the number of name
nodes as well as in the number of NDB data nodes, up to the number of nodes
available for the experiment (72 machines). Read the press release from
SICS here

The existing metadata layer of HDFS is based on a single Java server
that acts as name node in HDFS. There are implementations that give
this metadata layer HA by using a backup name node, ZooKeeper for
heartbeats, and a number of journalling nodes to ensure that logs of
changes to the metadata are stored safely.

With MySQL Cluster 7.5 all these nodes are replaced by a MySQL Cluster
installation with 2 data nodes (or more NDB data nodes if needed to scale
higher) to achieve the same availability. This solution scales by using many
HDFS name nodes; every 2 NDB data nodes scale to supporting around
10 name nodes. SICS made an experiment where they managed to
scale HDFS to 12 NDB data nodes and 60 name nodes, where they
achieved 1.2 million file operations per second. The workload is based on
real-world data from a company delivering a cloud-based service
based on Hadoop. Most file operations are a combination of a number of
key lookups and a number of scan operations. We have not found any
limiting factor for scaling even further with more machines
available.

This application uses ClusterJ, a Java API that accesses
the MySQL Cluster data nodes directly using a very simple interface.

The application uses a lot of scans to get the data, so it
takes advantage of the improved scalability of scans
present in 7.5. Given that it is a highly distributed application,
a lot of the CPU time is spent on communication, so the new adaptive
algorithms for sending ensure that performance scales nicely.

SICS has also developed a framework for installing MySQL Cluster in
the cloud (Amazon, Google, OpenStack) or on bare metal. I tried this
out and got the entire HopsFS installed on my laptop by doing a few
clicks on a web page on my desktop and pointing out the address of my
laptop. This uses a combination of Karamel, a number of Chef cookbooks
for MySQL Cluster, and a number of cookbooks for installing HopsFS.
Karamel uses JClouds to start up VMs in a number of different clouds.

MySQL Cluster 7.5 is GA, best cluster release ever

I have been fairly quiet on my blog for some time. We've been very busy
developing new features for MySQL Cluster 7.5 and ensuring that the
quality is improved even further.

We're now very pleased to release a new version of MySQL Cluster.

MySQL Cluster 7.5 contains a number of new things that make MySQL
Cluster even better.
1) You can declare a table as a READ_BACKUP table. This means that
updating transactions will receive the commit acknowledgement
a little bit later, which ensures that we can always use any of the
replicas for reading. We will use the nearest replica for
committed reads; for locking reads we will still use the primary
replica to avoid deadlocks (see the sketch below).

For applications that are mostly read-focused one can make this easier
by setting the ndb-read-backup config variable
to 1 in the MySQL Server configuration. This means that all tables
created from this server will be created with the READ_BACKUP
flag set to true.
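
As a small sketch of the difference between the two kinds of reads,
assuming a table t1 created with the READ_BACKUP flag and a primary key
column id (the names are placeholders):

-- Committed read: may be served by the nearest replica.
SELECT status FROM t1 WHERE id = 10;
-- Locking read: still routed to the primary replica to avoid deadlocks.
SELECT status FROM t1 WHERE id = 10 FOR UPDATE;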

2) You can declare a table as a FULLY_REPLICATED table. This means
that this table will get as many replicas as there are nodes in the
cluster.

Fully replicated tables (global tables) provide a very important way
to scale your applications using MySQL Cluster. Many applications have
small tables that are mainly read; these tables can now be fully
replicated and thus be read from any data node in the cluster.

Another potential use of fully replicated tables is to use MySQL Cluster
for read scalability. You can start out with a 2-node cluster where both
nodes are able to read the data. If you then need to scale to 4 nodes,
you add 2 more nodes to the cluster and create one more node group
for the 2 new nodes. Then you can reorganise the tables such that they
are fully replicated on all 4 nodes. You can continue in this manner
all the way until you have 48 nodes. Throughout this process any
MySQL Server can be used to both read and write the data in the cluster.

If the application is an SQL application it is a good idea to place
the MySQL Server on the same computer as a data node.
In this manner MySQL Cluster gives you a very easy way of scaling up
your application for reads.

A third variant is to place hot-spot rows that are often read and
seldom updated into fully replicated tables.

It is possible to set the config variable ndb-fully-replicated to 1 to
ensure that all tables are created as fully replicated. This should
obviously only be used for read-scaling applications.
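
As with READ_BACKUP, the property can also be set per table in the
COMMENT section. A minimal sketch (the table and column names are
placeholders):

CREATE TABLE lookup_codes (
  code INT PRIMARY KEY,
  description VARCHAR(64)
) ENGINE=NDBCLUSTER
  COMMENT="NDB_TABLE=FULLY_REPLICATED=1";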

3) We have made it easier to vary the number of partitions in a table.
Normally we have 2 partitions per LDM thread per node group. This
means that in a 4-node cluster with 2 node groups and 4 LDMs per
node, a normal table would have 16 partitions.

Now it is possible to have fewer partitions that are still
linked to the number of node groups in the cluster. So one can define
e.g. 1 partition per node group. This means that the command
ALTER TABLE ... REORGANIZE PARTITION will add more partitions if node
groups have been added since the table was last altered; it will keep
the number at, in this case, 1 partition per node group. We support
seven different variants here.

The idea with this is to be able to decrease the number of partitions
for smaller tables while still using the automated partitioning
features of MySQL Cluster.
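
These variants are chosen with the PARTITION_BALANCE property in the
COMMENT section. A minimal sketch, reusing the FOR_RA_BY_NODE variant
shown earlier (table and column names are placeholders, and the exact
form of the REORGANIZE PARTITION statement should be checked against
the manual):

CREATE TABLE small_table (
  id INT PRIMARY KEY,
  payload VARCHAR(128)
) ENGINE=NDBCLUSTER
  COMMENT="NDB_TABLE=PARTITION_BALANCE=FOR_RA_BY_NODE";

-- After adding node groups, redistribute the partitions online:
ALTER TABLE small_table ALGORITHM=INPLACE, REORGANIZE PARTITION;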

4) In earlier versions of MySQL Cluster we introduced the possibility to
use special send threads to do all the sending. In 7.5 the send threads
have been redesigned such that they now share the burden of sending with
all the other block threads in the NDB data nodes. The block threads will
assist when they are not very loaded. So the send thread assistance is
adaptive based on the CPU usage in the threads.

In order to develop this feature the NDB data nodes now have a very good
view of how much CPU each thread is using. This information is used
for internal algorithms, but it is also available in new ndbinfo tables.
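
As a sketch, these statistics can be queried from any MySQL Server in
the cluster, assuming the cpustat and threads tables that ndbinfo
provides in 7.5:

-- Recent CPU usage per thread in each data node:
SELECT * FROM ndbinfo.cpustat;
-- Mapping from thread numbers to thread names (ldm, tc, send, recv, ...):
SELECT * FROM ndbinfo.threads;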

5) Earlier versions of MySQL Cluster had a scaling issue when executing
extremely many scans in parallel. In MySQL Cluster 7.5 we have effectively
removed this limitation. Now it is possible to handle many millions of
scans per second for each table on each data node, so the main limitation
now is the CPU available for processing in general.

6) We have also continued working on improving our testing to ensure that
we always deliver better quality with new releases.

7) We have also worked on improving the scalability of the NDB API to
ensure that we deliver high performance even in situations with hundreds
or even thousands of concurrent threads working against the cluster.

8) Last but not least MySQL Cluster 7.5 is based on MySQL 5.7. So this means
that we get a lot of new optimizer features from the MySQL Server, we get
improved scalability of the MySQL Server parts and a lot more.

My personal favorites among those features are definitely the fully
replicated tables and the read backup tables. They open up a new category
of application optimisations in a scalable system.