
Saturday, April 23, 2022

Variable sized disk rows in RonDB

RonDB was originally a pure in-memory database engine. The main reason for this was to support low-latency applications in the telecom business. However, already in 2005 we presented a design at VLDB in Trondheim for the introduction of columns stored on disk. These columns cannot be indexed, but they are very suitable for columns of large size.

RonDB is currently targeting Feature Store applications. These applications often access data through a set of primary key lookups where each row can have hundreds of columns with varying size.

In RonDB 21.04 the support for disk columns uses a fixed size disk row format. This works very well for handling small files in HopsFS. HopsFS is a distributed file system that can handle petabytes of storage in an efficient manner. On top of it, Hopsworks builds the offline Feature Store applications.

The small files are stored in RonDB in a set of fixed size rows of suitable sizes. YCSB benchmarks have shown that RonDB can handle writes of up to several GBytes per second. Thus the disk implementation of RonDB is very efficient.

Applications using the online Feature Store will however store much of their data in variable sized columns. These work perfectly well as in-memory columns. They also work as disk columns in RonDB 21.04. However, to make storage more efficient we are designing a new version of RonDB where the row parts on disk are stored on variable sized disk pages.
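
To make this concrete, here is a minimal sketch of how a variable sized column is placed on disk with standard MySQL DDL, executed here over JDBC. The table, column and file names are made up for illustration, and a logfile group and tablespace must exist before any disk column can be created:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateDiskColumnTable {
      public static void main(String[] args) throws Exception {
        // Placeholder connection to a MySQL Server in the cluster.
        try (Connection c = DriverManager.getConnection(
                 "jdbc:mysql://127.0.0.1:3306/test", "user", "password");
             Statement s = c.createStatement()) {
          // Disk columns need an undo log and a tablespace for the data pages.
          s.executeUpdate("CREATE LOGFILE GROUP lg_1"
              + " ADD UNDOFILE 'undo_1.log' INITIAL_SIZE 128M"
              + " ENGINE NDBCLUSTER");
          s.executeUpdate("CREATE TABLESPACE ts_1"
              + " ADD DATAFILE 'data_1.dat' USE LOGFILE GROUP lg_1"
              + " INITIAL_SIZE 256M ENGINE NDBCLUSTER");
          // The primary key stays in memory (indexed columns cannot be
          // disk columns); the wide variable sized column goes to disk.
          s.executeUpdate("CREATE TABLE feature_row ("
              + " entity_id BIGINT NOT NULL PRIMARY KEY,"
              + " f1 DOUBLE,"
              + " payload VARBINARY(4000) STORAGE DISK"
              + ") TABLESPACE ts_1 ENGINE NDBCLUSTER");
        }
      }
    }

In RonDB 21.04 the disk part of the payload column above is stored in a fixed size disk row; with the new design it will instead be stored on variable sized disk pages.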

These pages use the same data structure as the in-memory variable sized pages. So the new format only affects the handling of free space and the handling of recovery. This design has now reached a state where it is passing our functional test suites. We will still add more tests, perform system tests and search for even more problems before we release it for production usage.

One interesting challenge with variable sized rows is that an update might have to use more space in a data page. If this space isn't available we have to find a new page where space is available. It becomes an interesting challenge when taking into account that we can abort some operations on a row while still committing other operations on the same row. The conclusion here is that one can never release any allocated resources until the transaction is fully committed or fully aborted.

This type of challenge is one reason why it is so interesting to work with the internals of a distributed database engine. After 30 years of education, development and support, there are still new interesting challenges to handle.

Another challenge we faced was that we might need to page in multiple data pages to handle an operation on a row. This means that we have to ensure that, while paging in one data page, other pages that we already paged in won't be paged out before we have completed our work on the row. This work also sets the stage for handling rows that span multiple disk pages. RonDB already supports rows that span multiple in-memory pages and one disk page.

If you want to learn more about RonDB requirements, LATS properties, use cases and internal algorithms, join us on Monday for the RonDB presentation in the CMU database talk series. Managed RonDB is supported on AWS, Azure and GCP as well as on-prem.

If you would like to join the effort to develop RonDB and a managed RonDB version, we have open positions at Hopsworks AB. Contact me on LinkedIn if you are interested.

Thursday, February 13, 2020

NDB Cluster, the World's Fastest Key-Value Store

Using numbers produced already with MySQL Cluster 7.6.10 we have
shown that NDB Cluster is the world's fastest Key-Value store using
the Yahoo Cloud Serving Benchmark (YCSB) Workload A.

Presentation at Slideshare.net.

We reached 1.4M operations per second using 2 Data Nodes and 2.8M
operations per second using a 4 Data Node setup. All this using a
standard JDBC driver. Obviously, using a specialised ClusterJ client
will improve performance further. These benchmarks were executed by
Bernd Ocklin.

The benchmark was executed in the Oracle Cloud. Each Data Node used
a Bare Metal DenseIO server which has 52 CPU cores and 8 NVMe drives.

The MySQL Servers and benchmark clients were executed on Bare Metal
servers with 2 MySQL Servers per server (1 MySQL Server per CPU socket).
These Bare Metal servers contained 36 CPU cores each.

All servers used Oracle Linux 7.

YCSB Workload A means that 50% of the operations are reads that read the
full row (1 kByte in size) and 50% are updates of one of the fields
(100 bytes in size).
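
For readers unfamiliar with the workload, the sketch below shows roughly
what one read and one update of Workload A look like through the MySQL
JDBC driver. The schema follows the defaults of the YCSB JDBC binding as
far as I recall (a usertable with a key column and ten 100-byte fields);
the real benchmark client of course generates keys, mixes operations and
measures latency:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class WorkloadASketch {
      public static void main(String[] args) throws Exception {
        // Placeholder connection to one of the MySQL Servers.
        try (Connection c = DriverManager.getConnection(
            "jdbc:mysql://127.0.0.1:3306/ycsb", "user", "password")) {
          // 50% of operations: read the full 1 kByte row by primary key.
          try (PreparedStatement read = c.prepareStatement(
              "SELECT * FROM usertable WHERE YCSB_KEY = ?")) {
            read.setString(1, "user1234");
            try (ResultSet rs = read.executeQuery()) {
              while (rs.next()) {
                // Consume the ten 100-byte fields of the row.
              }
            }
          }
          // 50% of operations: update one 100-byte field of the row.
          try (PreparedStatement update = c.prepareStatement(
              "UPDATE usertable SET FIELD3 = ? WHERE YCSB_KEY = ?")) {
            update.setString(1, "x".repeat(100)); // 100-byte value (Java 11+)
            update.setString(2, "user1234");
            update.executeUpdate();
          }
        }
      }
    }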

Oracle Cloud contains 3 different levels of domains. The first level is that
servers are placed in different Failure Domains within the same Availability
Domain. This essentially means that the servers do not rely on the same
switches and power supplies, but they can still be in the same building.

The second level is Availability Domains that are in the same region, where
each Availability Domain fails independently of the other Availability
Domains in the same region.

The third level is regions, which are also separated by long distances.

Most applications of NDB Cluster rely on a model with 2 or more NDB
Clusters in different regions, where each cluster is contained inside
an Availability Domain. Global replication between the NDB Clusters is
then used for fail-over when one region or Availability Domain fails.

With Oracle Cloud one can also set up a cluster to have Data Nodes in
different Availability Domains. This increases the availability of the
cluster at the expense of higher latency for write operations. NDB Cluster
has configuration options to ensure that one always performs local
reads, either on the same server or at least in the same Availability/Failure
Domain.
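
As a rough illustration (table name and node id are made up), the main
knobs are the READ_BACKUP table option, which allows committed reads to be
served from backup replicas, and the ndb_data_node_neighbour variable in
the MySQL Server, which names its closest data node. The Availability
Domain of each data node is declared with LocationDomainId in the cluster
configuration, which is not shown here:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class LocalReadSetup {
      public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                 "jdbc:mysql://127.0.0.1:3306/test", "user", "password");
             Statement s = c.createStatement()) {
          // Allow reads from any replica of this table, not only the primary.
          s.executeUpdate("CREATE TABLE kv ("
              + " k BIGINT NOT NULL PRIMARY KEY,"
              + " v VARBINARY(1024)"
              + ") ENGINE NDBCLUSTER COMMENT='NDB_TABLE=READ_BACKUP=1'");
          // Tell this MySQL Server which data node is closest to it, so
          // transactions are coordinated on the local/nearby node.
          s.executeUpdate("SET GLOBAL ndb_data_node_neighbour = 1");
        }
      }
    }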

The Oracle Cloud has the most competitive real-time characteristics of
the enterprise clouds. Our experience is that the Oracle Cloud provides
2-4x better latency compared to other cloud vendors. Thus the Oracle
Cloud is perfectly suitable for NDB Cluster.

The DenseIO Bare Metal Servers or DenseIO VMs are suitable for
NDB Data Nodes, or for NDB Data Nodes colocated with MySQL Servers.
These servers have excellent CPUs combined with 25G Ethernet links
and extremely high performing NVMe drives.

The benchmark reported here stores the tables as In-Memory tables.
We will later report on some benchmarks where we use a slightly
modified YCSB benchmark to show numbers when we instead use
Disk-based tables with much heavier update loads.

The Oracle Cloud contains a number of variants of Bare Metal servers
and VMs that are suitable for MySQL Servers and applications.

In NDB Cluster the MySQL Servers are actually stateless since all
the state is in the NDB Data Nodes. The only exception to this rule
is the MySQL Servers used for replication to another cluster, which
require disk storage for the MySQL binlog.

So usually a standard server can be set up without any special extra
disks for MySQL Servers and clients.

In the presentation we show the following important results.

The latency of DBMS operations is independent of the data size. We
get the same latency when the data set has 300M rows as when it has
600M rows.

We show that scaling to 8 Data Nodes, with 4 Data Nodes in each of two
Availability Domains, gives higher throughput than 4 Data Nodes in the
same Availability Domain. But the distance between the domains increases
the latency and this also causes some loss in throughput. Still we reach
3.7M operations per second for this 8-node setup.

We show that an important decision for the cluster setup is the number of
LDM threads. These are the threads doing the actual database work. We get
the best scalability when going for the maximum number of LDM threads,
which is 32. Using 32 LDM threads can increase latency with a low number
of clients, but as the number of clients grows the 32 LDM setup will scale
much further than the 16 LDM thread setup.
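
A quick way to verify the LDM configuration, sketched below under the
assumption that the ndbinfo schema is reachable through one of the MySQL
Servers, is to count the LDM threads reported per data node. The count
itself is set in the cluster configuration, for example through
MaxNoOfExecutionThreads or ThreadConfig:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CountLdmThreads {
      public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                 "jdbc:mysql://127.0.0.1:3306/ndbinfo", "user", "password");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                 "SELECT node_id, COUNT(*) AS ldm_threads"
                 + " FROM ndbinfo.threads WHERE thread_name = 'ldm'"
                 + " GROUP BY node_id")) {
          while (rs.next()) {
            // One line per data node with its number of LDM threads.
            System.out.println("node " + rs.getInt("node_id")
                + ": " + rs.getInt("ldm_threads") + " LDM threads");
          }
        }
      }
    }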

In MySQL Cluster 8.0.20 we have made more efforts to improve scaling to
many LDM threads. So we expect the performance of large installations to
scale even further in 8.0.20.

The benchmark report above gives very detailed latency numbers for various
situations. As can be seen there, we can handle 1.3M operations per second
with read latency below 1 ms and update latency below 2 ms!

Finally, the benchmark report also shows the impact of various NUMA settings
on performance and latency. It is shown that interleaved NUMA settings have
a slight performance disadvantage, but since this means that we get access
to the full DRAM and the full machine, it is definitely a good idea to use
this setting. In NDB Cluster this is the default setting.

The YCSB benchmark shows NDB Cluster on its home turf, with enormous
throughput of key operations, both read and write, with predictable and
low latency.

Coupled with the high availability features that have been proven in the
field, with more than 15 years of continuous operation at better than
Class 6 availability, we feel confident to claim that NDB Cluster is the
World's Fastest and Most Available Key-Value Store!

The YCSB benchmark is a standard benchmark, so any competing solution
is free to challenge our claim. We used a standard YCSB client of version
0.15.0 using a standard MySQL JDBC driver.

NDB Cluster supports full SQL through the MySQL Server, and it can push
joins down to the NDB Data Nodes for parallel query filtering and joining.
NDB Cluster supports sharding transparently, and complex SQL queries
execute cross-shard joins, which most competing Key-Value stores don't
support.
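
One way to see the pushdown in action, sketched here with made-up table
names, is to run EXPLAIN on a join between NDB tables and look for the
pushed join note in the Extra column:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PushedJoinCheck {
      public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                 "jdbc:mysql://127.0.0.1:3306/test", "user", "password");
             Statement s = c.createStatement()) {
          // Join pushdown is normally on by default; set here for clarity.
          s.execute("SET ndb_join_pushdown = ON");
          try (ResultSet rs = s.executeQuery(
              "EXPLAIN SELECT * FROM orders o"
              + " JOIN order_lines l ON l.order_id = o.order_id"
              + " WHERE o.customer_id = 17")) {
            while (rs.next()) {
              // A pushed join is reported in the Extra column.
              System.out.println(rs.getString("table") + ": "
                  + rs.getString("Extra"));
            }
          }
        }
      }
    }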

One interesting example of using NDB Cluster as a Key-Value Store is HopsFS,
which implements a hierarchical file system based on Hadoop HDFS. It has been
shown to scale to 1.6M file operations per second, and small files can be
stored in NDB Cluster for low latency access.

Thursday, October 20, 2016

HopsFS based on MySQL Cluster 7.5 delivers a scalable HDFS

The Swedish research institute SICS has worked hard for a few years on
developing a scalable and highly available Hadoop implementation using
MySQL Cluster to store the metadata. In particular they have focused on the
Hadoop file system (HDFS) and YARN. Using features of MySQL
Cluster 7.5 they were able to achieve linear scaling in the number of name
nodes as well as in the number of NDB data nodes, up to the number of nodes
available for the experiment (72 machines). Read the press release from
SICS here.

The existing metadata layer of HDFS is based on a single Java server
that acts as the name node in HDFS. There are implementations that make
this metadata layer highly available by using a backup name node,
ZooKeeper for heartbeats, and a number of journalling nodes to
ensure that logs of changes to the metadata are safely stored.

With MySQL Cluster 7.5 all these nodes are replaced by a MySQL Cluster
installation with 2 data nodes (or more NDB data nodes if needed to scale
higher) to achieve the same availability. This solution scales by using many
HDFS name nodes. Each pair of NDB data nodes scales to supporting around
10 name nodes. SICS made an experiment where they managed to
scale HDFS to 12 NDB data nodes and 60 name nodes, where they
achieved 1.2 million file operations per second. The workload is based on
real-world data from a company delivering a cloud-based service
based on Hadoop. Most file operations are a combination of a number of
key lookups and a number of scan operations. We have not found any
limiting factor for scaling even further with more machines
available.

This application uses ClusterJ, a Java API that accesses the MySQL
Cluster data nodes directly through a very simple interface.
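
A minimal ClusterJ sketch, using a made-up key-value table rather than the
actual HopsFS schema, looks roughly like this. The point is that a primary
key read goes straight to the data nodes without passing through a MySQL
Server:

    import java.util.Properties;

    import com.mysql.clusterj.ClusterJHelper;
    import com.mysql.clusterj.Session;
    import com.mysql.clusterj.SessionFactory;
    import com.mysql.clusterj.annotation.PersistenceCapable;
    import com.mysql.clusterj.annotation.PrimaryKey;

    public class ClusterJSketch {

      // Interface mapped to a hypothetical NDB table kv(k BIGINT PRIMARY KEY,
      // v VARBINARY(1024)); ClusterJ implements it with a dynamic proxy.
      @PersistenceCapable(table = "kv")
      public interface KeyValue {
        @PrimaryKey
        long getK();
        void setK(long k);

        byte[] getV();
        void setV(byte[] v);
      }

      public static void main(String[] args) {
        Properties props = new Properties();
        // Address of the NDB management server and the database of the table.
        props.put("com.mysql.clusterj.connectstring", "127.0.0.1:1186");
        props.put("com.mysql.clusterj.database", "test");

        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();
        try {
          // Write one row, then read it back by primary key; without an
          // explicit transaction each operation is auto-committed.
          KeyValue row = session.newInstance(KeyValue.class);
          row.setK(42L);
          row.setV("hello".getBytes());
          session.persist(row);

          KeyValue found = session.find(KeyValue.class, 42L);
          System.out.println("value length = " + found.getV().length);
        } finally {
          session.close();
        }
      }
    }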

The application uses a lot of scans to get the data, so it takes
advantage of the improved scalability of scans present in 7.5. Given
that it is a highly distributed application, a lot of the CPU time is
spent on communication, so the new adaptive algorithms for sending
ensure that performance scales nicely.

SICS has also developed a framework for installing MySQL Cluster in
the cloud (Amazon, Google, OpenStack) or on bare metal. I tried this
out and got the entire HopsFS installed on my laptop by doing a few
clicks on a web page on my desktop and pointing out the address of my
laptop. This uses a combination of Karamel, a number of Chef cookbooks
for MySQL Cluster, and a number of cookbooks for installing HopsFS.
Karamel uses JClouds to start up VMs in a number of different clouds.