Datagraph develops a graph database called Dydra that supports SPARQL, GraphQL and the Linked Data Platform (LDP). Dydra stores a revisioned history, which means that you have access to the full history of your data. This could be the development of a document, a piece of software, a piece of hardware such as a SoC (System-on-a-Chip), a building, or something else. Essentially any data.
This blog post describes the development work done together with the team at Datagraph.
Traditionally, Dydra has used a memory-mapped key-value store for this. Hopsworks and Dydra have now worked together for a while to provide features in RonDB that make it possible to run Dydra on a highly available, distributed platform that can parallelise many of the searches.
RonDB is a distributed key-value store with SQL capabilities. Traditionally, distributed key-value stores offer the possibility to read and write data very efficiently using key lookups. RonDB offers this capability as well, with extremely good performance (RonDB showed how to achieve 100M key lookups per second using a REST API). However, the SQL capabilities mean that RonDB can also push down filters and projections to the RonDB data nodes. This means that searches can be parallelised.
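As a rough illustration of why filter pushdown combines well with a distributed architecture, the sketch below (plain Python, with an invented partitioning scheme and filter) scans four partitions concurrently and returns only the matching rows, which is the effect that pushing filters to the data nodes has on network traffic:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch only: the partitioning and the filter are invented
# for the example, not RonDB's actual internals.

# 1000 rows spread round-robin across 4 "data node" partitions.
partitions = [
    [{"id": i, "score": i % 10} for i in range(p, 1000, 4)]
    for p in range(4)
]

def scan_partition(rows, predicate):
    # The filter runs "inside" the data node, next to the data;
    # only matching rows travel back to the client.
    return [r for r in rows if predicate(r)]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(scan_partition, part, lambda r: r["score"] >= 8)
               for part in partitions]
    matches = [r for f in futures for r in f.result()]

print(len(matches))   # 200 of the 1000 rows are returned
```

The four partition scans proceed in parallel, and the client only ever sees the 20% of rows that pass the filter.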
Thus, RonDB can also handle complex joins efficiently in many cases. Some of the queries in TPC-H (a standard analytical database benchmark) execute 50x faster in RonDB than in MySQL/InnoDB.
Working with Dydra on their searches, we realised that they store data structures in columns using the data type VARBINARY. SQL has no real way to express searches over complex data structures inside a VARBINARY.
There are many ways to access RonDB. Many people find the MySQL APIs to be the preferable method: these APIs are well known and there is plenty of literature on how to use them. However, since RonDB is also a key-value store, a lower-level interface is much more efficient.
The base of all interactions with RonDB is the C++ NDB API. On top of this API there is a Java API called ClusterJ and a NodeJS API called Database Jones. As mentioned, there are also the MySQL APIs, since RonDB provides the NDB storage engine for MySQL. With RonDB 24.10 we also introduced a C++ REST API server that can be used to retrieve batches of key lookups at very low latency. There is even an experimental Redis interface for RonDB that we call Rondis; it is integrated in the RonDB REST API server (RDRS).
In 2022, Datagraph released one more option: Common Lisp bindings for the C++ NDB API. Alongside the release of RonDB 24.10, they have just released a much improved version of cl-ndbapi for RonDB 24.10.
As discussed above, a Dydra query often entails a scan operation where one has to analyse the content of a VARBINARY column. In a first step, all of this analysis was performed by shipping the VARBINARY to the Common Lisp environment. This gave fairly decent performance, but we realised we could do better.
RonDB has long had a simple interpreter for things such as filters, auto-increment, and the like. However, to perform complex analysis of VARBINARY columns we needed to extend the RonDB interpreter.
MySQL has a similar feature, user-defined functions (UDFs), where one can integrate a C program into MySQL. However, a similar mechanism in RonDB would have two implications. First, it is a security issue: such a program could easily crash the RonDB data nodes, which conflicts with the high availability features of RonDB. Second, RonDB has a distributed architecture, so the program would be required on every RonDB data node, complicating the installation process of RonDB.
Instead, we opted for the approach of extending the RonDB interpreter. The RonDB interpreter has eight registers; these registers store 64-bit signed integers. An interpreted execution always has access to a single row; it cannot access any other rows or any data outside the interpreter. Interpreted execution has several steps: one can first read columns, next execute the interpreted program, next write some columns, and finally read columns again. In this manner one can combine normal simple reads with an interpreted program. In MySQL, the interpreted program is used to execute WHERE clauses to filter away those rows that are not interesting for the query. The program can also have a section of input parameters, making it possible to reuse an interpreted program with different input. It is also possible to return calculated results using output parameters.
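The register-and-branch model described above can be sketched in a few lines of Python. The opcode names and their encoding here are invented for the example; they are not the actual RonDB instruction set or NDB API:

```python
# Minimal sketch of a register-based row interpreter, loosely modelled on
# the description above. Opcodes are illustrative, not RonDB's real ones.

def run_program(program, row, params):
    regs = [0] * 8          # eight 64-bit signed integer registers
    pc = 0                  # program counter
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "LOAD_CONST":      # regs[a] = constant
            regs[args[0]] = args[1]
        elif op == "LOAD_PARAM":    # regs[a] = input parameter b
            regs[args[0]] = params[args[1]]
        elif op == "READ_COL":      # regs[a] = column value of current row
            regs[args[0]] = row[args[1]]
        elif op == "ADD":           # regs[a] = regs[b] + regs[c]
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "BRANCH_LT":     # jump to c if regs[a] < regs[b]
            if regs[args[0]] < regs[args[1]]:
                pc = args[2]
        elif op == "EXIT_OK":       # keep the row
            return True
        elif op == "EXIT_NOK":      # filter the row away
            return False
    return False

# A tiny filter: keep the row if row["a"] + row["b"] >= threshold parameter.
program = [
    ("READ_COL", 0, "a"),
    ("READ_COL", 1, "b"),
    ("ADD", 2, 0, 1),
    ("LOAD_PARAM", 3, 0),
    ("BRANCH_LT", 2, 3, 6),   # if sum < threshold, jump to EXIT_NOK
    ("EXIT_OK",),
    ("EXIT_NOK",),
]

print(run_program(program, {"a": 5, "b": 7}, [10]))   # True:  12 >= 10
print(run_program(program, {"a": 1, "b": 2}, [10]))   # False:  3 < 10
```

Note how the input-parameter section lets the same program body be reused with a different threshold, which mirrors how an interpreted program in RonDB can be reused with different inputs.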
To handle the new requirements, the RonDB interpreter was extended with a memory area of slightly more than 64 kB.
To ensure that one can handle a generic program, RonDB added a long list of new operations such as shift left/right, multiplication, division, modulo and so forth. In addition, there are instructions to read columns into the memory area, and even to read only parts of a column if desired. Similarly, there are instructions to write columns.
Dydra used these new capabilities and saw a handsome improvement in the results delivered by RonDB.
Analysing the Dydra use case further, we found that they used variants of binary search on parts of the VARBINARY columns. Thus RonDB also implemented a set of higher-level instructions such as binary search, search intervals, memory copy, and text-to-number conversion and vice versa.
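To make the binary-search idea concrete, the sketch below searches sorted fixed-width entries packed into a VARBINARY-style byte string. The layout (big-endian 8-byte unsigned keys) is an assumption chosen for the example, not Dydra's actual encoding:

```python
import struct

# Sketch of a binary search over sorted fixed-width entries inside a
# VARBINARY value. The big-endian 8-byte key layout is illustrative only.

def varbinary_binary_search(blob: bytes, key: int, entry_size: int = 8) -> int:
    """Return the index of `key` among the fixed-width entries, or -1."""
    lo, hi = 0, len(blob) // entry_size - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        # Decode one entry in place, without copying the whole blob.
        (entry,) = struct.unpack_from(">Q", blob, mid * entry_size)
        if entry == key:
            return mid
        if entry < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# Pack the sorted keys 10, 20, 30, 40 into one VARBINARY-style blob.
blob = b"".join(struct.pack(">Q", k) for k in (10, 20, 30, 40))
print(varbinary_binary_search(blob, 30))   # 2
print(varbinary_binary_search(blob, 25))   # -1
```

Running this kind of search inside the data node, next to the stored bytes, avoids shipping the whole VARBINARY to the client just to probe a few entries.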
Using those new instructions, Dydra saw further improvements. The new instructions also ensure that interpreted programs are quicker to develop. As requirements for other algorithms arise, it is fairly easy to add new instructions to the RonDB interpreter, and this should be possible for other community developers as well.
The most innovative part of the new Common Lisp NDB API is the handling of the interpreted instructions. It contains a language-to-language compiler, so you can write the interpreted program as a Lisp program using ordinary Lisp conditionals such as IF, WHEN and COND. You can even decide to run the program in the client using Lisp (mainly for testing and debugging) or push it down to RonDB for execution in the RonDB data nodes (for speed).
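The dual-execution idea can be sketched in Python: the same predicate, written once as a small expression tree, is either evaluated directly in the client or compiled into a flat instruction list of the kind a row interpreter would execute. All names here are illustrative; cl-ndbapi expresses this with ordinary Lisp forms rather than tuples:

```python
# Sketch: one predicate, two execution paths (client-side vs. "pushdown").
# The expression-tree format and opcode names are invented for the example.

# Predicate: column "version" <= 42.
predicate = ("<=", ("col", "version"), ("const", 42))

def eval_local(expr, row):
    """Interpret the expression tree directly against a row dict."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "const":
        return expr[1]
    if op == "<=":
        return eval_local(expr[1], row) <= eval_local(expr[2], row)
    raise ValueError(f"unknown op {op!r}")

def compile_pushdown(expr, out=None):
    """Flatten the same tree into instructions for a stack-style interpreter."""
    if out is None:
        out = []
    op = expr[0]
    if op == "col":
        out.append(("PUSH_COL", expr[1]))
    elif op == "const":
        out.append(("PUSH_CONST", expr[1]))
    elif op == "<=":
        compile_pushdown(expr[1], out)
        compile_pushdown(expr[2], out)
        out.append(("CMP_LE",))
    return out

print(eval_local(predicate, {"version": 17}))   # True: runs in the client
print(compile_pushdown(predicate))              # same logic, as instructions
```

The value of this design is that the exact same source form can be debugged locally and then pushed down unchanged for execution next to the data.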
One benchmark that Dydra used to evaluate RonDB performance compared MySQL/InnoDB using a UDF against RonDB using pushdown of the evaluation. The data set consisted of 4.5M rows; essentially all rows were scanned, and for each row a program checked whether the row was visible in the requested revision. About 2% of the rows were returned.
In MySQL/InnoDB the query took 8.89 seconds to execute; in RonDB it took 0.51 seconds. Thus a nice speedup of around 17 times. Most of the speedup comes from the amount of parallelism used in RonDB, as the MySQL execution is single-threaded. The cost of scanning one row is very similar in MySQL/InnoDB and in RonDB; RonDB is a bit faster, but there is no major difference in per-row speed.