Friday, December 06, 2019

NDB Parallel Query, part 1

I will describe how NDB handles complex SQL queries in a number of
blogs. NDB has the ability to parallelise parts of join processing.
Ensuring that your queries make the best possible use of these
parallelisation features enables applications to boost their
performance significantly. These blogs will also be a good base for
explaining any improvements we add to the query processing in NDB Cluster.

NDB was designed from the beginning for extremely efficient key lookups
and for extreme availability (less than 30 seconds of downtime per year
including time for software changes, metadata changes and crashes).

Originally the model was single-threaded and optimised for 1-2 CPUs.
The execution model uses an architecture where messages are sent
between modules. This made it very straightforward to extend the
architecture to support multi-threaded execution when CPUs with
many cores became prevalent. The first multi-threaded version of NDB
was version 7.0. This supported up to 7 threads working in parallel
plus a large number of threads handling interaction with the file
system.

With the introduction of 7.0, scans of a table, either using a
range scan on an index or scanning the entire table, were automatically
parallelised. So NDB has supported a limited form of parallel query
since the release of 7.0 (around 2011 I think).

Now let's use an example query, Q6 from DBT3 that mimics TPC-H.

SELECT
    SUM(l_extendedprice * l_discount) AS revenue
FROM
    lineitem
WHERE
    l_shipdate >= '1994-01-01'
    AND l_shipdate < DATE_ADD( '1994-01-01' , INTERVAL '1' year)
    AND l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01
    AND l_quantity < 24;

The execution of this query will use a range scan on the index on
l_shipdate. This is a perfectly normal range scan in NDB. Since range
scans are parallelised, this query will execute using 1 CPU for each
partition of the table. Assuming that we set up a cluster with the
default setup and with 8 LDM threads, the table will be partitioned
into 16 partitions. Each of those partitions will have its primary
partition handled by a different CPU. This means that the range scans
will execute on 16 CPUs in parallel.
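
As a quick sanity check, you can count the fragments the lineitem table
was created with. This is just a sketch, assuming the DBT3 tables live
in a schema named dbt3:

SELECT COUNT(*) AS fragments
FROM information_schema.partitions
WHERE table_schema = 'dbt3'
  AND table_name = 'lineitem';

With the default setup described above this should report 16 fragments.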

LDM (Local Data Manager) is the name of the threads in the NDB data
nodes that manage the actual data in NDB. An LDM thread contains a
distributed hash index for the primary keys and unique keys, an ordered
index implemented as a T-tree, and a query handler that controls
execution of lookups and scans, handles checkpointing and also handles
the REDO log. Finally the LDM thread contains the row storage, which
has 4 parts: fixed size parts of the row in memory, variable sized
parts of the row in memory, dynamic parts of the row (absence of a
column here means that it is NULL, so this provides the ability to ADD
a column as an online operation) and finally a fixed size part that is
stored on disk using a page cache. The row storage also contains an
interpreter that can evaluate conditions and perform simple operations
like add, to support efficient auto increment.
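
The dynamic row part is what makes an online ADD COLUMN possible. A
minimal sketch of such an operation follows; the column name is made up
for illustration and the usual NDB restrictions on online column adds
apply (the new column ends up in the in-memory dynamic part):

-- Add a nullable column without copying the table
ALTER TABLE lineitem
    ADD COLUMN l_note VARCHAR(100) NULL,
    ALGORITHM = INPLACE;

Existing rows simply lack the column in their dynamic part and thus
report NULL for it.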

The first implementation of the NDB storage engine was done such that
all condition evaluation was performed in the MySQL Server. This
meant that although we could scan the table in parallel, we still had
a single thread evaluating the conditions. To handle this query
efficiently, condition pushdown is therefore required. Condition
pushdown was added to the MySQL storage engine API a fairly long time
ago as part of the NDB development and can also benefit any other
storage engine that can handle condition evaluation.
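
A simple way to see whether the condition is pushed to the data nodes
(a sketch, assuming a reasonably recent MySQL/NDB version) is to check
the optimizer switch and look at the Extra column of EXPLAIN, which
mentions a pushed condition when the pushdown applies:

-- On by default in recent versions
SET optimizer_switch = 'engine_condition_pushdown=on';

EXPLAIN SELECT
    SUM(l_extendedprice * l_discount) AS revenue
FROM
    lineitem
WHERE
    l_shipdate >= '1994-01-01'
    AND l_shipdate < DATE_ADD('1994-01-01', INTERVAL '1' year)
    AND l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01
    AND l_quantity < 24;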

So the above query contains 3 parts that can be parallelised
individually: scanning the data, evaluating the condition and finally
performing the sum on the rows that match the condition.

NDB currently parallelises the scan part and the condition evaluation
part. The sum is handled by the MySQL Server. In this case the
filtering factor is high, so the sum part is not a bottleneck in this
query. The bottleneck in this query is scanning the data and
evaluating the condition.

In the terminology of relational algebra this means that NDB supports
a parallelised SELECT operator for some filters. NDB also supports a
parallel PROJECT operator. NDB doesn't yet support a parallel
AGGREGATE operator.

The bottleneck in this query is how fast one can scan the data and
evaluate the condition. In version 7.6 we made a substantial
optimisation of this part where we managed to improve a simple
query by 240% through low-level optimisations of the code.
With this optimisation NDB can handle more than 2 million rows
per second per CPU with a very simple condition to evaluate. This
query greatly benefits from this greater efficiency. Executing this
query with scale factor 10 (60M rows in the lineitem table) takes
about 1.5 seconds with the configuration above where 16 CPUs
concurrently perform the scan and condition evaluation.

A single-threaded storage engine is around 20x slower. With more
CPUs available in the LDM threads the parallelisation will be even
higher.

Obviously there are other DBMSs focused on analytical queries that can
handle this query even faster, but NDB is focused on online
applications with high write scalability and the highest availability.
We are, however, working to also make query execution of complex SQL
much faster, so that online applications can analyze data in real-time.

Query Execution

In the figure below we describe the execution flow for this query. As usual
the query starts with parsing (unless it is a prepared statement) and after
that the query is optimised.

This query is executed as a single range scan against the lineitem table. Scans
are controlled by a TC thread that ensures that all the fragments of the table are
scanned. It is possible to control the parallelism of the query through the
NDB API. In most cases full parallelism will be used. Each thread
has a real-time scheduler, and the scan in the LDM threads will be split up into
multiple executions that are interleaved with the execution of other queries
running in parallel.

This means that in an idle system this query will be able to execute at full speed.
However, even if there are lots of other queries going on in parallel, the query will
execute almost as fast, as long as the CPUs are not overloaded.

In the figure below we also show that control of the scan goes through the TC
thread, but the result row is sent directly from the LDM thread to the NDB API.
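
The thread types involved here can also be inspected from SQL. A small
sketch, assuming the ndbinfo database is installed (which it is by
default), that lists the tc and ldm threads of the data nodes:

-- Show the tc and ldm threads per data node
SELECT node_id, thr_no, thread_name
FROM ndbinfo.threads
WHERE thread_name IN ('tc', 'ldm')
ORDER BY node_id, thr_no;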

The NDB storage engine gets the row from the NDB API
and returns it to the MySQL Server for the sum function and result processing.

Query Analysis

The query reads the lineitem table, which has about 6M rows at scale
factor 1. It reads them using an index on l_shipdate. The range
consists of 909,455 rows to analyse, and of those 114,160 rows are
produced to calculate the result of the sum. In the above configuration
it takes about 0.15 seconds for NDB to execute the query. There are
some limitations to getting full use of all CPUs involved even in this
query, related to batch handling. I will describe this in a
later blog.

Scalability impact

This query is only positively impacted by any type of scaling. The
more fragments the lineitem table is partitioned into, the more
parallelism the query will use. So the only limitation to scaling
is when the sum part starts to become the bottleneck.
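
If you want to experiment with this, one way (a sketch, for NDB 7.5 and
later; check what fits your cluster before relying on it) is to ask for
more fragments per LDM thread through the PARTITION_BALANCE table
option:

-- Request more fragments per LDM thread for this table
ALTER TABLE lineitem
    COMMENT = 'NDB_TABLE=PARTITION_BALANCE=FOR_RA_BY_LDM_X_2';

More fragments means more LDM threads can work on the scan of a single
query like Q6.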

Next part

In the next part we will discuss how NDB can parallelise a very
simple 2-way join from the DBT3 benchmark. This is Q12 from
TPC-H that looks like this.

SELECT
        l_shipmode,
        SUM(CASE
                WHEN o_orderpriority = '1-URGENT'
                        OR o_orderpriority = '2-HIGH'
                        THEN 1
                ELSE 0
        END) AS high_line_count,
        SUM(CASE
                WHEN o_orderpriority <> '1-URGENT'
                        AND o_orderpriority <> '2-HIGH'
                        THEN 1
                ELSE 0
        END) AS low_line_count
FROM
        orders,
        lineitem
WHERE
        o_orderkey = l_orderkey
        AND l_shipmode IN ('MAIL', 'SHIP')
        AND l_commitdate < l_receiptdate
        AND l_shipdate < l_commitdate
        AND l_receiptdate >= '1994-01-01'
        AND l_receiptdate < DATE_ADD( '1994-01-01', INTERVAL '1' year)
GROUP BY
        l_shipmode
ORDER BY
        l_shipmode;

This query introduces 3 additional relational algebra operators:
a JOIN operator, a GROUP BY operator and a SORT operator.
