Monday, December 05, 2005
The first alpha release of partitioning is now available.
This includes more or less all of the partitioning syntax. There are
still some optimisation patches in the works for coming 5.1 alpha releases.
There is still a small window of opportunity to get smaller additions in,
so please use the MySQL partitioning forum to give feedback on any ideas
you have for improvements.
Partitioning is designed to work with all MySQL storage engines. It is,
however, not possible to mix storage engines within a table. It is also possible
to use partitioning with MySQL Cluster. This gives the DBA the possibility to
control data placement in the cluster and thus create heterogeneous clusters.
Most of the documentation for partitioning is already written and can be found
in the MySQL 5.1 Reference Manual thanks to some hard work by Jon Stephens.
Suggestions for improvements are welcome here as well.
Thursday, October 20, 2005
Calculating parameters for local checkpoints in MySQL Cluster
There are a couple of parameters in the configuration of local checkpoints
that have an intimate relationship with each other. All the parameters
are documented individually, but the following example can serve to
improve the understanding of how they are used together.
The example uses an application that does
50,000 selects per hour
15,000 updates per hour
15,000 deletes per hour
and 15,000 inserts per hour
(LCP = Local Checkpoint)
What I usually do is fix the time an LCP is to take; as a rule of thumb I
use 5 minutes. This means that a system restart will take at most
2-3 minutes to execute the REDO log.
Then I take the maximum size of the data in a node. For simplicity one could
use DataMemory here.
Then I take DataMemory / 300 as the amount of data to checkpoint per
second. Say I had 2 GB of DataMemory => ~6.7 MB/sec => ~667 kB/100 msec =>
~83 pages of 8 kB per 100 msec.
The parameter to set for this is NoOfDiskPagesToDiskAfterRestartTUP.
If the data to checkpoint to disk is close to the disk bandwidth available I
would ease down on the 5 minutes above and go for a longer checkpoint
time. Remember that the disk is also needed to handle REDO log writes and
UNDO log writes.
Then a similar calculation for IndexMemory.
The parameter to set for this is NoOfDiskPagesToDiskAfterRestartACC.
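As a quick sanity check of the arithmetic, the figures above can be computed with a plain SELECT. The 2 GB of DataMemory and the 300 second LCP time are just the assumptions from this example, so adjust them to your own configuration:
-- ~6.7 MB/sec to checkpoint, ~667 kB per 100 msec, ~83 pages of 8 kB each
SELECT
  2000 / 300 AS mb_per_sec,
  2000 * 1000 / 300 / 10 AS kb_per_100msec,
  ROUND(2000 * 1000 / 300 / 10 / 8) AS pages_per_100msec;
-- pages_per_100msec is the value for NoOfDiskPagesToDiskAfterRestartTUP;
-- repeating the calculation with IndexMemory gives NoOfDiskPagesToDiskAfterRestartACC.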
After performing these calculations I have come to a position where I know
the LCP time. The next step is to go about calculating the number of
fragment log files.
The parameter to set for this is NoOfFragmentLogFiles.
The fragment log files (REDO log files) need to keep log records for at
least 3 LCPs. Since there are a number of uncertainties, for example whether
the disk will always actually run at the full configured speed, I usually
go for a conservative figure and estimate a size for 6 LCPs.
If I choose 300 secs (5 minutes) as the time of a LCP, this means that I need to
support writing log records at full speed for 6 * 300 secs = 1800 secs.
The size of a REDO log record is:
72 bytes + 4 bytes * number of fields updated + max size of all fields updated
There is one such record for each record updated in a transaction in each node
where the data resides.
Using the case above we get:
50,000 selects/hour => 0 log records, since SELECTs are not REDO logged
15,000 deletes/hour => ~5 deletes per second = 5 * 72 = 360 bytes/second
15,000 updates/hour =>
~5 updates per second = 5 * (72 + 4 * 5 + 5 * 32) = 1260 bytes/second
(assuming 5 fields of 32 bytes each are updated)
15,000 inserts/hour =>
~5 inserts per second = 5 * (72 + 4 * 40 + 40 * 32) = 7560 bytes/second
(assuming a table with 40 fields of 32 bytes each and no NULLs in the insert)
Thus a total of 360 + 1260 + 7560 = 9180 bytes per second
=> 9180 * 1800 = 16 524 000 bytes
Thus roughly 16.5 MByte of REDO log would suffice in this case. The minimum
setting for the number of fragment log files is 3 (where each file is 64 MByte),
so the default setting of 8 is quite OK here and there is no need to change it.
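The same REDO arithmetic can be expressed as a SELECT, using the per-second rates and record sizes assumed above:
-- 5/sec each of deletes, updates (5 fields of 32 bytes) and inserts (40 fields of 32 bytes)
SELECT
  5 * 72 AS delete_bytes_per_sec,
  5 * (72 + 4 * 5 + 5 * 32) AS update_bytes_per_sec,
  5 * (72 + 4 * 40 + 40 * 32) AS insert_bytes_per_sec,
  (5 * 72 + 5 * (72 + 4 * 5 + 5 * 32) + 5 * (72 + 4 * 40 + 40 * 32)) * 6 * 300 AS redo_bytes_for_6_lcps;
-- roughly 16.5 MByte in total, well below a single 64 MByte fragment log file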
(With an application that does 15,000 updates/deletes/inserts per second
the figures obviously change quite a lot.)
The UNDO log file I usually don't worry about at all, only in the sense that
there needs to be sufficient disk space for it. In this case a few Mbytes
should suffice, but given sizes of disks today I can't see why one would
allocate less than a GByte for it.
Wednesday, October 19, 2005
Partition Defaults
It's been an interesting exercise to define the defaults when creating and
altering partitioned tables.
If one creates a table like this:
CREATE TABLE t1 (a int)
PARTITION BY KEY(a);
three questions arise:
1) How many partitions are there?
2) What will the names of the partitions be?
3) Which engine is used?
In this case the answers will be:
1) 1
2) p0
3) Most likely MyISAM; it depends on which engine is the default in the server
(if NDB is the default the answers will differ)
The reasoning here is that 1 partition is a minimalistic approach, and it can
still be useful since it makes it possible to grow the table later using
ADD PARTITION and so forth.
For RANGE/LIST partitioned tables it is not possible to avoid defining the
partitions. The reason is that the VALUES LESS THAN/VALUES IN clauses must be
specified; otherwise we don't know the ranges or the list values (these are
too hard to guess at for a default).
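For example, a RANGE-partitioned table has to be created with explicit partition definitions along these lines (the table name and year boundaries here are made up for illustration):
CREATE TABLE t2 (a date)
PARTITION BY RANGE (year(a))
(PARTITION p0 VALUES LESS THAN (2000),
 PARTITION p1 VALUES LESS THAN (2005),
 PARTITION p2 VALUES LESS THAN MAXVALUE);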
However, subpartitions can again be left as defaults:
CREATE TABLE t1 (a date)
PARTITION BY RANGE (year(a))
SUBPARTITION BY KEY(a)
(PARTITION p0 VALUES LESS THAN (10));
In this case we will again set the default number of subpartitions to 1, which
in this case isn't very useful since we cannot change the subpartitioning with
any partition management commands in this version.
CREATE TABLE t1 (a date)
PARTITION BY RANGE (year(a))
SUBPARTITION BY KEY(a)
SUBPARTITIONS 2
(PARTITION p0 VALUES LESS THAN (10));
This is more useful, and the names of the subpartitions will be sp0 and sp1.
Now to even trickier parts. What if a table created with all defaults has
partitions added to it?
ALTER TABLE t1 ADD PARTITION 1;
Assuming that the table had all defaults, this means that we had 1
partition and added another. So from now on we no longer have a default number
of partitions, and the table is the same as if it had been created by:
CREATE TABLE t1 (a int)
PARTITION BY KEY(a)
PARTITIONS 2;
Thus the partitions and their names are still default, but the number of
partitions is not.
If we do instead
ALTER TABLE t1 ADD PARTITION (PARTITION x0);
In this case we have specified one new partition and we have even specified its
name and thus we don't use defaults anymore. Thus the table is now the same
as if created with:
CREATE TABLE t1 (a int)
PARTITION BY KEY(a)
(PARTITION p0, PARTITION x0);
For partitioned tables using NDB as the storage engine (MySQL Cluster) the
default number of partitions will be equal to the number of nodes in the
cluster, except when MAX_ROWS has been set very high, in which case it can be
set higher, to a multiple of the number of nodes in the cluster.
Tuesday, October 18, 2005
Key parameters for MySQL Cluster
Two parameters that you want to keep track of for MySQL Cluster are:
ndb-use-exact-count
If this is set to 1 you get very fast SELECT COUNT(*) but most other
queries become much slower. So if your normal query scenario is
primary key lookups then set this parameter to 0. It can have a
radical effect on performance. This parameter exists in 4.1 and
onwards.
engine_condition_pushdown
This parameter was introduced in 5.0 and, if set, enables condition pushdown.
As an example, if one does a query like:
SELECT * FROM t1 WHERE name LIKE '%adam';
then the LIKE expression can be pushed down to the data nodes and evaluated
in parallel, for much higher execution speed.
So both of these parameters can have a radical effect on performance for certain query types.
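As a rough illustration, and assuming the variable forms ndb_use_exact_count and engine_condition_pushdown are available in your server version, a session tuned for primary key lookups with pushdown enabled could look like:
SET ndb_use_exact_count = 0;        -- favour fast PK lookups over fast SELECT COUNT(*)
SET engine_condition_pushdown = 1;  -- let the data nodes evaluate the LIKE below
SELECT * FROM t1 WHERE name LIKE '%adam';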
Monday, October 17, 2005
Partitioning on non-PK field
An interesting comment from Timo mentioned that he wanted to have
partitioning on a non-PK field. The problem with this is that it
requires a global index on the PK to be able to support the primary key
constraint on the table, and global indexes on partitioned tables will not
be supported in version 5.1.
What could be done here is to change the PK into a composite index
consisting of the original PK fields plus the partitioning field. Thus we get
partitioning based on the desired field while the PK constraint remains
possible to maintain.
If this is done then, voila, queries on the partitioning field will only use
the server where the data resides and won't need to access all parts.
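A sketch of that workaround, with made-up table and column names, could look like this; the partitioning column is simply included in the primary key:
CREATE TABLE t1 (
  id INT NOT NULL,
  created DATE NOT NULL,
  data VARCHAR(100),
  PRIMARY KEY (id, created)   -- original PK field plus the partitioning field
)
PARTITION BY RANGE (year(created))
(PARTITION p0 VALUES LESS THAN (2004),
 PARTITION p1 VALUES LESS THAN (2006));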
Also, dynamic repartitioning and the possibility to add nodes online are very
cool features that would be nice to have. They are certainly on the roadmap,
but as usual which version they actually arrive in depends on priorities.
Friday, October 14, 2005
Default Partitioning of MySQL Cluster 5.1
In MySQL Cluster versions 4.1 and 5.0, tables are automatically partitioned
using a hash function on the primary key as the partition function. If there is
no primary key, a hidden key is introduced and a hash function on that hidden
key is used as the partition function.
The number of partitions, as well as the placement of the partitions, is also
selected automatically. The number of partitions is the same as the number of
nodes, except when MAX_ROWS has been set really high, in which case it can be
n * the number of nodes, where n <= 8.
Assume that we have 4 nodes in the cluster with 2 replicas. In this case we
have two node groups with 2 nodes in each.
The first partition will be placed in node group 0 with the first node in the
node group as primary; the second partition in node group 1 with the first node
in the node group as primary; the third partition in node group 0 again, this
time with the second node in the node group as primary; and finally the fourth
partition in node group 1 again, with the second node in the node group as
primary.
This automatic partitioning is designed to spread data and load evenly over
the cluster nodes, and a homogeneous cluster is assumed.
In MySQL 5.1 we are introducing manual partitioning. One of the things one
can do with partitioned tables is add, drop and reorganise partitions.
To be able to do this also on tables defined with ENGINE=NDB but without an
explicit partitioning clause, we decided that in 5.1 all cluster tables are
partitioned and that their partitioning can change.
Thus when you define a table as ENGINE=NDB it will automatically be a
partitioned table defined as:
PARTITION BY KEY()
KEY() here refers to either primary key partitioning or hidden key partitioning
as defined above (for other engines, which don't have native partitioning,
it means partitioning on the primary key fields).
The automatic behaviour described above still applies to the number of
partitions and their placement.
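In other words (the table definition is invented for illustration), the following two definitions end up being treated the same way in 5.1:
CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=NDB;
-- is treated as if it had been created with
CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=NDB
PARTITION BY KEY();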
Later on it will be possible to execute (the discussion here applies to
partitioning in general):
ALTER TABLE t1 COALESCE PARTITION 1;
This will merge the last partition into the remaining partitions. Assume the
automatic partitioning at table creation created 4 partitions. After this
command the number of partitions is 3. This is no longer the default number
of partitions, thus the partition definition is changed to:
PARTITION BY KEY()
PARTITIONS 3
Later on we execute:
ALTER TABLE t1 REORGANIZE PARTITION (p0) INTO
(PARTITION x0 NODEGROUP 1);
The name p0 is the default name of the first partition (followed by p1, p2, ...).
In this case we have decided to specifically move the first partition to the
second node group. Thus there is no longer any default partitioning, and after
this change the partitioning is completely manual and specified as:
PARTITION BY KEY()
PARTITIONS 3
(PARTITION x0 NODEGROUP 1,
PARTITION p1 NODEGROUP 1,
PARTITION p2 NODEGROUP 0)
So one can either have completely automatic partitioning of the table, have a
defined number of partitions where all partitions use default settings, or
have a completely manual partitioning.
Thursday, October 13, 2005
Posted proposals for MySQL Users Conference 2006
So I made my proposals for the Users Conference in good time, long before the deadline of 7 November.
I have some great ideas about how I want to present MySQL Partitioning.
I want to explain the concepts of partitioning:
Partition Basics:
RANGE/LIST/HASH partitioning
and what one can do with it.
Partition Management:
How you can manage your partitions
ALTER TABLE t1 ADD/DROP/REORGANIZE/REBUILD/COALESCE/OPTIMIZE/ANALYZE/CHECK/REPAIR PARTITION ...
I'm particularly fond of the things you can do with REORGANIZE.
Partition pruning:
What the MySQL Server does to avoid touching partitions that are not a part of your query
Partition hints:
When you want to hint to the MySQL Server which partitions to use instead of relying on the optimizer.
After going through those things I will give some examples of how it could be applied in reality.
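As a small invented illustration of pruning: with a table partitioned by RANGE (year(a)), a query such as the one below only needs to touch the partitions whose ranges can contain matching rows.
SELECT * FROM t1 WHERE a >= '2005-01-01';
-- only the partitions covering years 2005 and later are scanned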
I am particularly excited about what you will be able to do with partitioning once disk data for cluster comes around.
It will be possible to have more heterogeneous clusters, where some machines hold memory-based tables and
other machines full of disks handle the disk-related tables.
Friday, August 05, 2005
Backwards programming
I noticed an interesting pattern in how I am programming right now. I am programming and testing all the error cases before
even attempting to perform one successful variant. This is a little backwards from what I usually have done and also
backwards from what I've seen other programmers do. It's kind of nice, however, since once the successful tests work
the work is completed, rather than having all the boring test cases still left to do.
Also, when adapting code others have written (or your own, where some time has passed), this works well since you check
that your assumptions were correct and that the code fails where you expect it to fail, not before that, and does not pass
the expected failures. Thus one is always in full control.
Thursday, August 04, 2005
Great idea on Partition Management
I sat down thinking about ways to manage the partitions in a partitioned table.
Being able to add and drop partitions in various ways for the various types of partitioning
is more or less obvious. But there are a great number of other desirable changes, like
merging partitions, splitting partitions, changing ranges of partitions and so forth. One
option would be to define a whole battery of special commands for every special situation
(and naturally write code for each and every one of them as well).
So I found a solution where one defines the set of partitions to reorganise and then defines
the partitions to reorganise them into. This single method can be specialised to merge
partitions, split partitions, change range conditions, change list of values for partitions and
so forth.
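As an invented example of how that single mechanism can express a split: take a RANGE-partitioned table and reorganise one partition into two with tighter boundaries.
ALTER TABLE t1 REORGANIZE PARTITION p1 INTO
(PARTITION p1a VALUES LESS THAN (1995),
 PARTITION p1b VALUES LESS THAN (2000));
-- assuming p1 previously held VALUES LESS THAN (2000)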
Thursday, July 14, 2005
Blog Test
This is my first blog post and it was created while working in +30C. It is
a hot summer here in Sweden. I am working hard on an interesting
lock bug while trying to put together my work on partitioning for
MySQL 5.1.