Friday, March 06, 2015

The design of the SCAN algorithm within one LDM instance

As part of optimising scan execution in MySQL Cluster 7.4, where we
improved scan performance by about 50%, we also made an effort to write
down a description of the scan protocol as used locally within the LDM
threads. This is documented in the source code of MySQL Cluster 7.4,
and here is an excerpt from that source code.

DBLQH controls the execution of scans on tables on behalf of DBTC and
DBSPJ. Here follows a signal overview of how a scan is performed within
one LDM instance. For a description of the global scan protocol
see DbtcMain.cpp as a comment before execSCAN_TABREQ.

DBLQH only controls execution of a scan towards one partition of a
table. DBTC/DBSPJ is responsible for execution of scans toward the
entire table and ensuring that the API sees a consistent view of the
table.

There are currently four types of scans implemented in one LDM
instance:

Full table scan using the hash index. This is implemented in DBACC.
Full table scan row by row. This is implemented in DBTUP.
Full table scan row by row in disk order. This is implemented in DBTUP.
Index scan using one or several ranges. This is implemented in DBTUX.

DBLQH controls execution of one partition scan. Depending on the scan
type, DBACC/DBTUP/DBTUX is responsible for getting the row references to
the tuples scanned. DBTUP is responsible for reading those rows, and
finally DBACC is responsible for any locking of rows required as part of
the scan.
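As an illustration, the division of labour between the blocks could be sketched as a dispatch on scan type. The enum and function names below are hypothetical, not taken from the actual NDB source:

```cpp
#include <cassert>
#include <string>

// Hypothetical enum mirroring the four scan types described above.
enum ScanType {
  FULL_SCAN_HASH,    // full table scan using the hash index (DBACC)
  FULL_SCAN_ROW,     // full table scan row by row (DBTUP)
  FULL_SCAN_DISK,    // full table scan row by row in disk order (DBTUP)
  INDEX_RANGE_SCAN   // index scan over one or several ranges (DBTUX)
};

// Returns the block that produces row references for a given scan type.
std::string row_source_block(ScanType t) {
  switch (t) {
    case FULL_SCAN_HASH:   return "DBACC";
    case FULL_SCAN_ROW:
    case FULL_SCAN_DISK:   return "DBTUP";
    case INDEX_RANGE_SCAN: return "DBTUX";
  }
  return "";
}
```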

Each scan is controlled by an interpreted program created by the API
and transported down to DBTUP. This program is sent as part of the
SCAN_FRAGREQ signal and passed to DBTUP in the STORED_PROCREQ signal.
This program is applied to each row reference passed to DBTUP through
execution of the execTUPKEYREQ signal.
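The role of the interpreted program as a per-row filter can be sketched as follows. This is a simplified model, not the real DBTUP code; the function and names are illustrative:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch: the stored interpreted program acts as a per-row filter.
// For every row reference handed to DBTUP via execTUPKEYREQ, the
// program is applied; only matching rows are sent on to the API
// (as TRANSID_AI signals).
int apply_stored_program(const std::vector<int>& row_refs,
                         const std::function<bool(int)>& program) {
  int rows_sent = 0;
  for (int row : row_refs) {
    if (program(row)) {
      ++rows_sent;  // row matched: TRANSID_AI would be sent to the API
    }
  }
  return rows_sent;
}
```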

For index scans, one or more ranges are sent in the keyinfo part of the
SCAN_FRAGREQ signal. This range information is sent to DBTUX one range
at a time. With multiple ranges, DBLQH will actually treat each range as
a separate scan towards the other blocks, so a scan will be started and
closed towards DBACC/DBTUP/DBTUX for each range involved.
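The per-range behaviour could be modelled roughly like this, under the assumption that each range is scanned and closed in turn; all names here are hypothetical:

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-range driver: DBLQH opens a scan toward
// DBACC/DBTUP/DBTUX for each range, scans it, and closes it before
// moving to the next range. STORED_PROCREQ is sent only once, since
// the same interpreted program is reused for every range.
struct Range { int low, high; };

int scan_all_ranges(const std::vector<Range>& ranges) {
  int rows = 0;
  for (const Range& r : ranges) {
    // ACC_SCANREQ: initialise an operation record for this range.
    for (int key = r.low; key <= r.high; ++key) {
      ++rows;  // one NEXT_SCANREQ / NEXT_SCANCONF pair per row reference
    }
    // NEXT_SCANREQ(ZSCAN_CLOSE): close the scan for this range.
  }
  return rows;
}
```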

As an optimisation, all signals local to one LDM instance have been
converted to direct signals.
The following signals are used as part of the scan of one partition.
ACC_SCANREQ:
  This signal initialises an operation record in DBACC/DBTUP/DBTUX for a
  scan of one range or a full partition. Always sent as a direct signal
  and returned immediately through the signal object on return.

STORED_PROCREQ:
  This signal stores the interpreted program used to read each tuple as
  part of the scan. The same signal is also used to deallocate the
  interpreted program when the entire scan of all ranges has been
  completed. Always sent as a direct signal and returned immediately
  through the signal object on return.

ACC_LOCKREQ:
  Certain scans require a lock on the row before the row is read; this
  signal acquires such a lock. Always sent as a direct signal. The
  return signal is not always sent immediately.

ACCKEYCONF:
  Signal returned when the lock has been acquired. The signal is
  normally sent directly when the row is not locked, but for a locked
  row the signal can be delayed by a second or more. When sent, it is
  sent as a direct signal.

ACCKEYREF:
  Signal returned when acquiring the lock failed, e.g. due to the record
  being deleted while waiting for the lock.

ACC_ABORTCONF:
  Signal returned after aborting a scan. It is sent as an asynchronous
  message to ensure that all asynchronous messages sent since the scan
  state was set to aborted have been delivered.

NEXT_SCANREQ:
  This signal is used with several different meanings:
  ZSCAN_NEXT:
    Get the next row reference to read, returned in NEXT_SCANCONF signal.
  ZSCAN_NEXT_COMMIT:
    Get the next row reference to read AND unlock the specified row.
    Returned in NEXT_SCANCONF signal.
  ZSCAN_COMMIT:
    Unlock the specified row. The return signal is simply delivered upon
    returning from the call to execNEXT_SCANREQ.
  ZSCAN_CLOSE:
    Close the scan in DBACC/DBTUP/DBTUX.

  When sent as ZSCAN_COMMIT and ZSCAN_CLOSE, it is always sent as a
  direct signal. Otherwise it is sent as a direct or an asynchronous
  signal depending on the value of the scan_direct_count variable in the
  DBLQH scan record. The scan_direct_count variable ensures that the
  number of direct signals sent in a chain stays bounded.
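The bounding of direct-signal chains could look roughly like this. It is a simplified sketch; MAX_DIRECT and the function name are assumptions, not the actual DBLQH code:

```cpp
#include <cassert>

// Sketch of how a counter like scan_direct_count can bound a chain of
// direct signals: after MAX_DIRECT consecutive direct NEXT_SCANREQ
// signals the next one is sent asynchronously, so the thread returns
// to the scheduler and other jobs get a chance to run.
const int MAX_DIRECT = 16;  // assumed bound, not the real NDB constant

struct ScanRecord {
  int scan_direct_count = 0;
};

// Returns true if the next NEXT_SCANREQ should be a direct signal.
bool send_next_scan_direct(ScanRecord& scan) {
  if (scan.scan_direct_count >= MAX_DIRECT) {
    scan.scan_direct_count = 0;  // break the chain: go asynchronous
    return false;
  }
  scan.scan_direct_count++;
  return true;
}
```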

NEXT_SCANCONF:
  Return signal to NEXT_SCANREQ containing row reference to read or
  indication of close completed. Always sent as a direct signal.

TUPKEYREQ:
  This signal does the actual read of the row and sends the read row data
  directly to the API using the TRANSID_AI signal. This signal is always
  sent as a direct signal.

ACC_CHECK_SCAN:
  Continue scanning from a specified place. Used by DBACC/DBTUP/DBTUX as
  an internal signal as part of the scan. This signal can be sent both
  as an asynchronous signal and as a direct signal.

SCAN_FRAGCONF:
  Return signal sent to DBTC/DBSPJ after completing a part of the scan;
  the signal carries a set of references to rows sent to the API. After
  sending this signal, DBLQH stops and waits for a SCAN_NEXTREQ signal
  asking DBLQH to continue the scan of the partition. The number of rows
  scanned before sending SCAN_FRAGCONF depends on both configuration
  parameters and information in the SCAN_FRAGREQ signal.

  This signal is also sent when the scan is fully completed.
  This signal is normally a distributed signal, so it is always sent as
  an asynchronous signal.
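The batching decision can be modelled as a simple counter. This is a sketch; in real DBLQH the batch size is derived from several configuration parameters and SCAN_FRAGREQ fields:

```cpp
#include <cassert>

// Simplified batching model: DBLQH accumulates row references and
// reports to DBTC/DBSPJ with SCAN_FRAGCONF once the batch is full,
// then pauses until SCAN_NEXTREQ arrives.
struct ScanBatch {
  int batch_size;
  int rows_in_batch;
};

// Called after each row is sent to the API; returns true when it is
// time to send SCAN_FRAGCONF and wait for SCAN_NEXTREQ.
bool row_sent(ScanBatch& b) {
  if (++b.rows_in_batch >= b.batch_size) {
    b.rows_in_batch = 0;  // batch reported, counter restarts
    return true;
  }
  return false;
}
```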

SCAN_NEXTREQ:
  Request from DBTC/DBSPJ to continue scanning, as requested by the API.
  This signal is normally a distributed signal, so it is always sent as
  an asynchronous signal.

 Below follows an example signal diagram of a scan of one partition.

 DBLQH          ACC      TUP    ACC/TUP/TUX    API      DBTC
   |ACC_SCANREQ
   |----------------------------------------- >|
   |< -----------------------------------------|
   | STORED_PROCREQ
   |------------------------- >|
   |< -------------------------|
   | NEXT_SCANREQ (ZSCAN_NEXT)
   |-----------------------------------------  >|
   |                          prepareTUPKEYREQ
   |                          |< --------------|
   |                          |-------------- >|
   | NEXT_SCANCONF
   |< -----------------------------------------|
   | TUPKEYREQ
   |------------------------- >|  TRANSID_AI
   |                          |--------------------------------------- >|
   |< -------------------------|
   | NEXT_SCANREQ (ZSCAN_NEXT_COMMIT)
   |----------------------------------------- >|
   |                          prepareTUPKEYREQ
   |                          |< --------------|
   |                          |-------------- >|
   | NEXT_SCANCONF
   |< -----------------------------------------|
   | TUPKEYREQ
   |------------------------- >|  TRANSID_AI
   |                          |--------------------------------------- >|
   |< -------------------------|
   Repeat the above for as many rows as required before returning to the
   API. The TRANSID_AI above isn't always sent; the interpreted program
   performs the selection and may decide not to send a specific row
   because it doesn't match the condition checked by the program.
   |
   | SCAN_FRAGCONF
   |---------------------------------------------------------------- >|
   .... Some time for API and DBTC to process things.
   | SCAN_NEXTREQ
   |< ----------------------------------------------------------------|
   | NEXT_SCANREQ (ZSCAN_NEXT_COMMIT)
   |----------------------------------------- >|
   |                          prepareTUPKEYREQ
   |                          |< --------------|
   |                          |-------------- >|
   | NEXT_SCANCONF
   |< -----------------------------------------|
   | TUPKEYREQ
   |------------------------- >|  TRANSID_AI
   |                          |-------------------------- >|
   |< -------------------------|
   Repeat the above until it is time for the next SCAN_FRAGCONF to be
   sent. When NEXT_SCANCONF indicates that there are no more tuples to
   fetch, we start to close the scan.

   |
   | NEXT_SCANREQ (ZSCAN_NEXT_COMMIT)
   |----------------------------------------- >|
   | NEXT_SCANCONF(no more tuples)
   |< -----------------------------------------|
   | NEXT_SCANREQ (ZSCAN_CLOSE)
   |----------------------------------------- >|
   | NEXT_SCANCONF
   |< -----------------------------------------|
   | STORED_PROCREQ (delete interpreted program)
   |------------------------- >|
   |< -------------------------|
   | SCAN_FRAGCONF (close flag set)
   |---------------------------------------------------------------- >|
   Now the scan is completed.

   Now follow a number of variations on the above signal diagram.
   Scan with locking:
   In this case we use the flag ZSCAN_NEXT all the time and never
   ZSCAN_NEXT_COMMIT; instead we handle things a bit differently when
   receiving SCAN_NEXTREQ, where we perform a signal diagram like this:

   | NEXT_SCANREQ (ZSCAN_COMMIT)
   |----------------------------------------- >|
   |< -----------------------------------------|
   This is repeated for each row sent to the API in the previous
   SCAN_FRAGCONF signal.
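That per-row unlock pass could be sketched like this. It is illustrative only; the real code walks operation records rather than a plain vector:

```cpp
#include <cassert>
#include <vector>

// Sketch of the locked-scan variant: on SCAN_NEXTREQ, DBLQH first
// unlocks every row it reported in the previous SCAN_FRAGCONF, one
// NEXT_SCANREQ(ZSCAN_COMMIT) per row, before fetching new rows with
// ZSCAN_NEXT.
int commit_previous_batch(std::vector<int>& locked_rows) {
  int commits = 0;
  while (!locked_rows.empty()) {
    locked_rows.pop_back();  // NEXT_SCANREQ(ZSCAN_COMMIT) for this row
    ++commits;
  }
  return commits;
}
```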

   If the application wants the row locked for a longer time, it has had
   the chance to perform a key lookup operation that took over the lock,
   such that even when we unlock the scan lock, the transaction still
   retains a lock on the row.

   After each row scanned we check whether we've reached a scan
   heartbeat timeout. If we have, we send a SCAN_HBREP signal to
   DBTC/DBSPJ to inform them that we're still actively scanning even
   though no result rows have been sent. Remember that a scan in DBLQH
   can potentially scan billions of rows while returning only very few
   to the API. Thus we can scan for an extended time without returning
   to the API. This is handled by the method check_send_scan_hb_rep.
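A minimal sketch of such a heartbeat check, assuming a millisecond clock; the struct and names are hypothetical, not the actual check_send_scan_hb_rep:

```cpp
#include <cassert>

// If no report has been sent to DBTC/DBSPJ for longer than the
// timeout, send SCAN_HBREP so the long-running scan isn't mistaken
// for a dead one.
struct HeartbeatState {
  long last_report_ms;  // time of last SCAN_FRAGCONF or SCAN_HBREP
  long timeout_ms;
};

// Returns true when a SCAN_HBREP should be sent at time now_ms.
bool need_heartbeat(HeartbeatState& hb, long now_ms) {
  if (now_ms - hb.last_report_ms >= hb.timeout_ms) {
    hb.last_report_ms = now_ms;  // SCAN_HBREP sent, restart the clock
    return true;
  }
  return false;
}
```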

   Already when returning from ACC_SCANREQ we can discover that the
   partition (== fragment) is empty and go immediately to the close-down
   code.
   For index scans we will send TUX_BOUND_INFO after ACC_SCANREQ and
   before sending STORED_PROCREQ to DBTUX. This provides one range to
   DBTUX for scanning. If multiple ranges are to be scanned, we start up
   a new scan as if a new SCAN_FRAGREQ had been received, but we don't
   need to send STORED_PROCREQ since the same interpreted program will
   be used. We will, however, send ACC_SCANREQ and TUX_BOUND_INFO also
   for this new range.

  There are various reasons for temporarily stopping a scan: this could
  be lack of operation records or holding too many row locks; one could
  also end up in this situation after waiting for a row lock.

  To restart the scan again after any type of temporary stop, one sends
  the signal ACC_CHECK_SCAN, either as a direct or an asynchronous signal,
  to DBACC/DBTUP/DBTUX. This signal is sent from many different places in
  DBLQH, DBACC, DBTUP and DBTUX. It is always sent as part of NEXT_SCANREQ
  processing.
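The stop/restart cycle could be modelled like this. It is a sketch; the enum values mirror the reasons listed above, but all names are hypothetical:

```cpp
#include <cassert>

// A scan pauses for one of the reasons above, and a later
// ACC_CHECK_SCAN resumes it from where it stopped.
enum StopReason {
  NONE,
  NO_OPERATION_RECORDS,
  TOO_MANY_ROW_LOCKS,
  WAITED_FOR_ROW_LOCK
};

struct ScanState {
  StopReason stopped = NONE;
};

void temporary_stop(ScanState& s, StopReason why) { s.stopped = why; }

// Models execACC_CHECK_SCAN: clears the stop so the scan continues.
// Returns false if the scan wasn't stopped in the first place.
bool resume_scan(ScanState& s) {
  if (s.stopped == NONE) return false;
  s.stopped = NONE;
  return true;
}
```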

  When executing ACC_CHECK_SCAN, one can flag to DBACC/DBTUP/DBTUX that
  it should check for a 10 ms delay using the flag ZCHECK_LCP_STOP. In
  previous versions this was also related to local checkpoints; this is
  no longer the case. Now it's only related to situations where an extra
  wait is required so that resources become available again.

  DBTUP and DBTUX send the signal CHECK_LCP_STOP to DBLQH in a number of
  situations, among other things when a locked key has been encountered.
  Even when the ACCKEYCONF signal is then received, indicating that the
  lock has been acquired, DBLQH will still wait for the CHECK_LCP_STOP
  handling to return after its 10 ms delay. Fixing this is on the TODO
  list, to ensure that we can proceed with these locked rows immediately
  after lock delivery. As it is now, we can get up to a 10 ms delay each
  time we encounter a locked row.
