In the previous post I showed some improvements based on using
Windows functions in rewritten TPC-H queries. In this blog post
I will discuss the improvements made in Q15 using a CTE (Common
Table Expression). This is again based on inspiration from new and
old blogs written by Öystein Grövlen (should be norwegian ö's).
Here is Q15 in TPC-H:
create view revenue (supplier_no, total_revenue) as
select
l_suppkey,
sum(l_extendedprice * (1 - l_discount))
from
lineitem
where
l_shipdate >= '1996-01-01'
and l_shipdate < date_add('1996-01-01', interval '90' day)
group by
l_suppkey;
select
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
from
supplier,
revenue
where
s_suppkey = supplier_no
and total_revenue = (
select
max(total_revenue)
from
revenue
)
order by
s_suppkey;
drop view revenue;
It creates a view and uses this view in 2 places in the query.
The same query using a CTE:
WITH revenue0(supplier_no , total_revenue) AS (
SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount))
FROM lineitem
WHERE l_shipdate >= '1996-01-01'
AND l_shipdate < DATE_ADD('1996-01-01', INTERVAL '90' DAY)
GROUP BY l_suppkey )
SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
FROM supplier, revenue0
WHERE s_suppkey = supplier_no
AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0)
ORDER BY s_suppkey;
Here is the output from EXPLAIN for the original query:
+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+--------------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+--------------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1 | PRIMARY | <derived3> | NULL | ALL | NULL | NULL | NULL | NULL | 223448 | 10.00 | Using where; Using temporary; Using filesort |
| 1 | PRIMARY | supplier | p0,p1,p2,p3,p4,p5,p6,p7 | eq_ref | PRIMARY | PRIMARY | 4 | dbt3.revenue.supplier_no | 1 | 100.00 | NULL |
| 3 | DERIVED | lineitem | p0,p1,p2,p3,p4,p5,p6,p7 | range | l_shipDATE,l_partkey,l_suppkey | l_shipDATE | 4 | NULL | 223448 | 100.00 | Using pushed condition ((`dbt3`.`lineitem`.`l_shipDATE` >= DATE'1996-01-01') and (`dbt3`.`lineitem`.`l_shipDATE` < <cache>(('1996-01-01' + interval '90' day)))); Using MRR; Using temporary |
| 2 | SUBQUERY | <derived4> | NULL | ALL | NULL | NULL | NULL | NULL | 223448 | 100.00 | NULL |
| 4 | DERIVED | lineitem | p0,p1,p2,p3,p4,p5,p6,p7 | range | l_shipDATE,l_partkey,l_suppkey | l_shipDATE | 4 | NULL | 223448 | 100.00 | Using pushed condition ((`dbt3`.`lineitem`.`l_shipDATE` >= DATE'1996-01-01') and (`dbt3`.`lineitem`.`l_shipDATE` < <cache>(('1996-01-01' + interval '90' day)))); Using MRR; Using temporary |
+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+--------------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
5 rows in set, 1 warning (0.00 sec)
What is clear in the original query is that we execute the view in both places where
it is used.
Here is the EXPLAIN from the CTE query:
+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+----------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+----------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 223448 | 10.00 | Using where; Using temporary; Using filesort |
| 1 | PRIMARY | supplier | p0,p1,p2,p3,p4,p5,p6,p7 | eq_ref | PRIMARY | PRIMARY | 4 | revenue0.supplier_no | 1 | 100.00 | NULL |
| 3 | SUBQUERY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 223448 | 100.00 | NULL |
| 2 | DERIVED | lineitem | p0,p1,p2,p3,p4,p5,p6,p7 | range | l_shipDATE,l_partkey,l_suppkey | l_shipDATE | 4 | NULL | 223448 | 100.00 | Using pushed condition ((`dbt3`.`lineitem`.`l_shipDATE` >= DATE'1996-01-01') and (`dbt3`.`lineitem`.`l_shipDATE` < <cache>(('1996-01-01' + interval '90' day)))); Using MRR; Using temporary |
+----+-------------+------------+-------------------------+--------+--------------------------------+------------+---------+----------------------+--------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Here we only execute the scan on the lineitem table once and reuse it in both places
where it is used.
Thus what we see here is that CTEs as well as windows functions makes it easier
for MySQL to discover repeated patterns.
Since we only need to scan once instead of twice the query execution times is cut in
half.