Mikael Ronstrom: parallel MySQL

Showing posts with label parallel MySQL. Show all posts

Monday, May 27, 2024

875X improvement from RonDB 21.04.17 to 22.10.4

At Hopsworks we are working on ensuring that the online feature store will be able to perform complex join operations in real-time. This means that queries that could use data from multiple tables can be easily integrated into machine learning applications.

Today most feature stores use key-value stores like Redis and DynamoDB. These systems have no capability to issue complex join queries, if this is required the feature store will have to write complex code to handle this and this is likely to involve multiple roundtrips and thus cause unwanted latency.

Hopsworks feature store uses RonDB as its online feature store. RonDB can handle any SQL operations that MySQL can handle. Actually RonDB has even support for parallelising the join queries and pushing the filtering and joining down to the RonDB data nodes where data resides.

This means that users of the Hopsworks feature store can integrate more features from multiple feature groups in online inferencing requests. This means that things credit fraud detection can be made much more intelligent by taking more features into account in the inferencing requests.

This means that performance of real-time join queries becomes more important in RonDB. To evaluate how RonDB develops in this are I ran a set of tests using TPC-H queries from DBT3 against RonDB 21.04.17 and RonDB 22.10.4 (not released yet). I also ran tests against MySQL 8.0.35 (RonDB 22.10.4 is based on MySQL 8.0.35 with loads of added RonDB features).

The results were interesting, the improvement in Q20 was the highest I have seen in my career. The performance improved from 70 seconds to 80 milliseconds, thus an 875x speedup or 87500% improvement. Q2 had a 360x improvement. So RonDB 22.10.4 is much better equipped for more complex queries compared to RonDB 21.04. MySQL 8.0.35 had similar performance to RonDB 22.10.4 with an average of around 20% slower, this is mostly due to performance improvements in RonDB, not algorithmic changes.

When using complex queries the query optimiser tries to find an optimal plan, sometimes however better plans are available and one can add hints in the SQL query to ensure a better plan is used.

The RonDB team isn't satisfied with this however, we have realised that evaluating aggregation is also very important when the online feature store stores a time window of certain features. This means that RonDB can compute aggregate dynamically and thus provide more accurate predictions.

Early tests of some simple single table queries showed an improvement of 4-5x and we expect we will be able to get to 10-20x improvements in quite a few queries of this sort.

Thursday, April 23, 2009

Join Executor for MySQL Cluster

Jonas in the Cluster team reported on his work on executing
joins in the NDB kernel for MySQL Cluster here.

This is a very interesting work we have in progress at MySQL.
We are working on an extension of the Storage Engine API
where the MySQL Server will present an Abstract Query Tree
to the Storage Engine. The Storage Engine can then decide to
execute the query on his own or decide that the MySQL Server
should execute it in the classic manner. In the first prototype
the optimisation will be done as usual and only after the
optimisation phase will we present the join to the storage
engine. However the specification also covers work on
integrating this with the optimiser and also enabling the
possibility for the storage engine to execute parts of the
query and not the entire one. The specification of this
work can be found here.

Jonas is working on the backend part for this interface in
MySQL Cluster.

What is interesting with pushing joins to the NDB kernel is that
it becomes very easy to parallelize the join execution. So what
will happen when this feature is ready is that MySQL Cluster
will shine on join performance and enable very good
performance on all sorts of application using SQL.

The reason that MySQL Cluster can so easily parallelize the query
execution of the join is due to the software architecture of the
NDB kernel. The NDB kernel is entirely developed as a message
passing architecture. So to start a thread of execution in the
NDB kernel one simply sends two messages when executing one
message and to stop a thread one simply doesn't send any messages
when executing a message. The problem then is more on that one
should not parallelize too much to run out of resources in the
system.

So with this development MySQL Cluster will also be shining at
Data Mining in an OLTP database. MySQL Cluster is designed for
systems where you need massive amounts of read and write
bandwidth (the cost of writing your data is close to the cost
of reading the data). So with the new features it will be
possible to do Data Mining on data updated in Real-time. Most
Data Mining is performed on a specialised Data Warehousing
solution. But to achieve this you need to transfer the data to
the Data Warehouse. With MySQL Cluster it will be possible to
both use the database for OLTP applications with heavy updates
always occuring while still querying the data with parallel
queries in parallel. MySQL Cluster is very efficient at
executing individual queries in the NDB kernel and can also
scale to very many machines and CPU cores.

Tuesday, November 25, 2008

Impressive numbers of Next Gen MySQL Cluster

I had a very interesting conversation on the phone with Jonas
Oreland today (he also blogged about it on his blog at
http://jonasoreland.blogspot.com).

There is a lot of interesting features coming up in MySQL Cluster
version 6.4. Online Add Node is one of those, which can be done
without any downtime and even with almost no additional memory
needed other than the memory in the new machines added into the
cluster. This is a feature I started thinking almost 10 years ago
so it's nice to see the fourth version of the solution actually be
implemented and it's a really neat solution to the problem,
definitely fitting the word innovative.

The next interesting feature is to use a more efficient protocol
for handling large operations towards the data nodes. This makes it
use less bits on the wire, but even more it saves a number of copy
stages internally in the NDB data nodes. So this has a dramatic
effect on performance of reads and writes of large records. It
doubles the throughput for large records.

In addition the new 6.4 version also adds multithreading to the
data nodes. Previously the data nodes was a very efficient single
thread which handled all the code blocks and also the send and
receive handling. In the new 6.4 version the data nodes are split
into at least 4 threads for database handling, one thread for send
and receive and the usual assistance threads for file writes and
so forth. This means that a data node will fit nicely into a 8-core
server since also 1-2 cpu's are required for interrupt handling and
other operating system activity.

Jonas benchmarked using a benchmark from our popular flex-series
of benchmark. It started with that I developed flexBench more
than 10 years ago, it's been followed by flexAsynch, flexTT and a
lot more variants of the same type. It can vary the number of
threads, the size of the records, the number of operations per
batch per thread and a number of other things. flexAsynch is
really good at generating extremely high loads to the database
without doing anything useful itself :)

So what Jonas demonstrated today was a flexAsynch run where he
managed to do more than 1 million reads per second using only
one data node. MySQL Cluster is a clustered system so you can
guess what happens when we have 16, 32 or 48 of those nodes
tied together. It will do many tens of millions of reads per
second. An interesting take on this is an article in
Datateknik 3.0 (a magazine no longer around) where I was
discussing how we had reached or was about to reach 1 million
reads per second. I think
this was sometime 2001 or 2002. I was asked where we
were going next and I said that 100 million reads per
second was the next goal. We're actually in range of
achieving this now since I also have a patch lying
around which can increase the number of data nodes
in the cluster to 128 data nodes whereby with good
scalability a 100 million reads per second per
cluster is achievable.

When Jonas called he had achieved 950k reads and then I told
him to try out using the Dolphin DX cards which were also
available on the machines. Then we managed to increase the
performance to inch over 1 million upto 1.070.000.
Quite nice. Maybe even more impressive that it also was
possible to do more 600.000 write operations per second
(these are all transactional).

This run of flexAsynch was focused on seeing how many operations
per second one could get through. I then decided I was
interested in seeing also how much bandwidth we could handle
in the system. So we changed the record size from 8 bytes to
2000 Bytes. When trying it out with Gigabit Ethernet we reached
60,000 reads and 55.000 inserts/updates per second. A quick
calculation shows that we're doing almost 120 MBytes of reads
and 110 MBytes of writes to the data node. This is obviously
where the limit of Gigabit Ethernet goes so an easy catch of
the bottleneck.

Then we tried the same thing using the Dolphin DX cards. We got
250.000 reads per second and more than 200.000 writes per
second. This corresponds to almost 500 MBytes per second of
reads from the database and more than 400 MBytes of writes to
the data nodes.

I had to check whether this was actually the limit of the set-up
I had for the Dolphin cards (they can be set-up to use either
x4 or x8 on the PCI Express). Interestingly enough after working
in various ways with Dolphin cards for 15 years it's the first
time I really cared about the bandwidth it could chunk through.
The performance of MySQL Cluster have never been close to
saturating the Dolphin links in the past.

However today we managed to saturate the links. The maximum
bandwidth achievable by a microbenchmark with a single process
was 510 MBytes per second and we achieved almost 95% of this
number. Very impressive indeed I think. What's even more
interesting is that the Dolphin card used the x4 configuration
so it can actually do 2x the bandwidth in the x8 setting and
the CPU's were fairly lightly loaded on the system so it's
likely that we could come very close to saturating the load
even using a x8 configuration of the Dolphin cards. So that's
a milestone to me, that MySQL Cluster have managed to
saturate even the bandwidth of a cluster interconnect with
very decent bandwidth.

This actually imposes an interesting database recovery
solution problem into the MySQL Cluster architecture. How
does one handle 1 GBytes of writes to each data node in
the system when used with persistent tables which has
to be checkpointed and logged to disk. This requires
bandwidth to the disk subsystem in multiple GBytes per
second. It's only reasonable to even consider doing this
with the upcoming new high-performance SSD drives. I
heard an old colleague nowadays working for a disk
company mention that he had demonstrated 6 GBytes
per second to local disks, so this actually is a
very nice fit. Turns out that this problem can also be
solved.

Actually SSD drives is also a very nice fit with also
the disk data part of MySQL Cluster. Here it makes all
the sense in the world to use SSD drives as the place
to put the tablespaces for the disk part of MySQL
Cluster. This way also the disk data becomes part of
the real-time system and you can fairly easy build a
terabyte database with an exceedingly high
performance. Maybe this is to some extent a reply
Mark Callaghans request for a data warehouse based
on MySQL Cluster link. Not that we really focused so
much on it, but the parallelism and performance
available in a large MySQL Cluster based on 6.4 will
be breathtaking even to me with 15 years of thinking
into this behind me. A final word on this is that
we are actually also working on a parallel query
capability towards MySQL Cluster. This is going to
based on some new advanced additions to the storage
engine interface we're currently working on
(Pushdown Query Fragment for those that joined the
storage engine summit at Google in April this year).

A nice thing with being part of Sun is that they're
building the HW which is required to build these
very large systems and are very interested in doing
showcases for them. So all the technology to do
what has been discussed above is available within
Sun.

Sorry for writing a very long blog. I know it's
better to write short and to the point blogs,
however I found so many interesting tilts on the
subject.

Thursday, October 30, 2008

CACHE INDEX per partition for MyISAM

The newest development in the partitioning code
is WL#4571.

This new feature makes it possible to tie a
partition using MyISAM to a specific cache index.
The syntax for how to do is available in the
above worklog entry.

We found this feature to be useful for enabling
higher performance of parallel ALTER TABLE
(WL#2550). When adding
a primary key to a MyISAM table the key cache in
MyISAM limited scalability of Parallel ALTER TABLE
severely, so adding several key caches, essentially
one per partition we can ensure that the ALTER TABLE
can be fully parallelised (all other ALTER TABLE
on MyISAM already scales perfectly).

We also have some ideas on how to solve the base
problem in how to make the key cache more scalable
by dividing the mutex on the key cache into one
mutex per a range of key cache pages.

Mikael Ronstrom