Tuesday, October 27, 2020

Using CTEs in TPC-H in NDB Cluster

 In the previous post I showed some improvements based on using

Windows functions in rewritten TPC-H queries. In this blog post

I will discuss the improvements made in Q15 using a CTE (Common

Table Expression). This is again based on inspiration from new and

old blogs written by Öystein Grövlen (should be norwegian ö's).


Here is Q15 in TPC-H:

create view revenue (supplier_no, total_revenue) as

        select

                l_suppkey,

                sum(l_extendedprice * (1 - l_discount))

        from

                lineitem

        where

                l_shipdate >= '1996-01-01'

                and l_shipdate < date_add('1996-01-01', interval '90' day)

        group by

                l_suppkey;


select

        s_suppkey,

        s_name,

        s_address,

        s_phone,

        total_revenue

from

        supplier,

        revenue

where

        s_suppkey = supplier_no

        and total_revenue = (

                select

                        max(total_revenue)

                from

                        revenue

        )

order by

        s_suppkey;


drop view revenue;


It creates a view and uses this view in 2 places in the query.

The same query using a CTE:

WITH revenue0(supplier_no , total_revenue) AS (

    SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount))

    FROM lineitem

    WHERE l_shipdate >= '1996-01-01'

      AND l_shipdate < DATE_ADD('1996-01-01', INTERVAL '90' DAY)

    GROUP BY l_suppkey )

SELECT s_suppkey, s_name, s_address, s_phone, total_revenue

FROM supplier, revenue0

WHERE s_suppkey = supplier_no

  AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0)

ORDER BY s_suppkey;

Here is the output from EXPLAIN for the original query:

+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+--------------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

| id | select_type | table      | partitions              | type   | possible_keys                  | key        | key_len | ref                      | rows   | filtered | Extra                                                                                                                                                                                        |

+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+--------------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

|  1 | PRIMARY     | <derived3> | NULL                    | ALL    | NULL                           | NULL       | NULL    | NULL                     | 223448 |    10.00 | Using where; Using temporary; Using filesort                                                                                                                                                 |

|  1 | PRIMARY     | supplier   | p0,p1,p2,p3,p4,p5,p6,p7 | eq_ref | PRIMARY                        | PRIMARY    | 4       | dbt3.revenue.supplier_no |      1 |   100.00 | NULL                                                                                                                                                                                         |

|  3 | DERIVED     | lineitem   | p0,p1,p2,p3,p4,p5,p6,p7 | range  | l_shipDATE,l_partkey,l_suppkey | l_shipDATE | 4       | NULL                     | 223448 |   100.00 | Using pushed condition ((`dbt3`.`lineitem`.`l_shipDATE` >= DATE'1996-01-01') and (`dbt3`.`lineitem`.`l_shipDATE` < <cache>(('1996-01-01' + interval '90' day)))); Using MRR; Using temporary |

|  2 | SUBQUERY    | <derived4> | NULL                    | ALL    | NULL                           | NULL       | NULL    | NULL                     | 223448 |   100.00 | NULL                                                                                                                                                                                         |

|  4 | DERIVED     | lineitem   | p0,p1,p2,p3,p4,p5,p6,p7 | range  | l_shipDATE,l_partkey,l_suppkey | l_shipDATE | 4       | NULL                     | 223448 |   100.00 | Using pushed condition ((`dbt3`.`lineitem`.`l_shipDATE` >= DATE'1996-01-01') and (`dbt3`.`lineitem`.`l_shipDATE` < <cache>(('1996-01-01' + interval '90' day)))); Using MRR; Using temporary |

+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+--------------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

5 rows in set, 1 warning (0.00 sec)


What is clear in the original query is that we execute the view in both places where

it is used.

Here is the EXPLAIN from the CTE query:

+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+----------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

| id | select_type | table      | partitions              | type   | possible_keys                  | key        | key_len | ref                  | rows   | filtered | Extra                                                                                                                                                                                        |

+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+----------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

|  1 | PRIMARY     | <derived2> | NULL                    | ALL    | NULL                           | NULL       | NULL    | NULL                 | 223448 |    10.00 | Using where; Using temporary; Using filesort                                                                                                                                                 |

|  1 | PRIMARY     | supplier   | p0,p1,p2,p3,p4,p5,p6,p7 | eq_ref | PRIMARY                        | PRIMARY    | 4       | revenue0.supplier_no |      1 |   100.00 | NULL                                                                                                                                                                                         |

|  3 | SUBQUERY    | <derived2> | NULL                    | ALL    | NULL                           | NULL       | NULL    | NULL                 | 223448 |   100.00 | NULL                                                                                                                                                                                         |

|  2 | DERIVED     | lineitem   | p0,p1,p2,p3,p4,p5,p6,p7 | range  | l_shipDATE,l_partkey,l_suppkey | l_shipDATE | 4       | NULL                 | 223448 |   100.00 | Using pushed condition ((`dbt3`.`lineitem`.`l_shipDATE` >= DATE'1996-01-01') and (`dbt3`.`lineitem`.`l_shipDATE` < <cache>(('1996-01-01' + interval '90' day)))); Using MRR; Using temporary |

+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+----------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


Here we only execute the scan on the lineitem table once and reuse it in both places

where it is used.

Thus what we see here is that CTEs as well as windows functions makes it easier

for MySQL to discover repeated patterns.

Since we only need to scan once instead of twice the query execution times is cut in

half.

No comments: