Friday, March 06, 2015

The design of the SCAN algorithm within one LDM instance

As part of optimising scan execution in MySQL Cluster 7.4, where we
improved scan performance by about 50%, we also made an effort to write
down a description of the scan protocol as used locally within the LDM
threads. This is documented in the source code of MySQL Cluster 7.4,
and here is an excerpt from that source code.

DBLQH controls the execution of scans on tables on behalf of DBTC and
DBSPJ. Here follows a signal overview of how a scan is performed within
one LDM instance. For a description of the global scan protocol
see DbtcMain.cpp as a comment before execSCAN_TABREQ.

DBLQH only controls execution of a scan towards one partition of a
table. DBTC/DBSPJ is responsible for execution of scans toward the
entire table and ensuring that the API sees a consistent view of the
table.

There are currently four types of scans implemented in one LDM
instance:

Full table scan using the hash index. This is implemented in DBACC.
Full table scan row by row. This is implemented in DBTUP.
Full table scan row by row in disk order. This is implemented in DBTUP.
Index scan using one or several ranges. This is implemented in DBTUX.

DBLQH controls execution of one partition scan. Depending on the scan
type, DBACC/DBTUP/DBTUX is responsible for getting the row references to
the tuples scanned. DBTUP is responsible for reading those rows, and
finally DBACC is responsible for any locking of rows required as part of
the scan.
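As an illustration, the division of labour between the blocks could be sketched as a dispatch on scan type. The enum and function names below are hypothetical, not taken from the actual NDB source:

```cpp
#include <cassert>
#include <string>

// Hypothetical enum mirroring the four scan types described above.
enum ScanType {
  FULL_SCAN_HASH,    // full table scan using the hash index (DBACC)
  FULL_SCAN_ROW,     // full table scan row by row (DBTUP)
  FULL_SCAN_DISK,    // full table scan row by row in disk order (DBTUP)
  INDEX_RANGE_SCAN   // index scan over one or several ranges (DBTUX)
};

// Returns the block that produces row references for a given scan type.
std::string row_source_block(ScanType t) {
  switch (t) {
    case FULL_SCAN_HASH:   return "DBACC";
    case FULL_SCAN_ROW:
    case FULL_SCAN_DISK:   return "DBTUP";
    case INDEX_RANGE_SCAN: return "DBTUX";
  }
  return "";
}
```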

Each scan is controlled by an interpreted program created by the API
and transported down to DBTUP. This program is sent as part of the
SCAN_FRAGREQ signal and passed to DBTUP in the STORED_PROCREQ signal.
This program is applied to each row reference passed to DBTUP through
execution of the execTUPKEYREQ signal.
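The role of the interpreted program as a per-row filter can be sketched as follows. This is a simplified model, not the real DBTUP code; the function and names are illustrative:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch: the stored interpreted program acts as a per-row filter.
// For every row reference handed to DBTUP via execTUPKEYREQ, the
// program is applied; only matching rows are sent on to the API
// (as TRANSID_AI signals).
int apply_stored_program(const std::vector<int>& row_refs,
                         const std::function<bool(int)>& program) {
  int rows_sent = 0;
  for (int row : row_refs) {
    if (program(row)) {
      ++rows_sent;  // row matched: TRANSID_AI would be sent to the API
    }
  }
  return rows_sent;
}
```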

For index scans, one or more ranges are sent in the keyinfo part of the
SCAN_FRAGREQ signal. This range information is sent to DBTUX one range
at a time. With multiple ranges, DBLQH will actually treat each range as
a separate scan towards the other blocks, so a scan will be started and
closed towards DBACC/DBTUP/DBTUX for each range involved.
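The per-range behaviour could be modelled roughly like this, under the assumption that each range is scanned and closed in turn; all names here are hypothetical:

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-range driver: DBLQH opens a scan toward
// DBACC/DBTUP/DBTUX for each range, scans it, and closes it before
// moving to the next range. STORED_PROCREQ is sent only once, since
// the same interpreted program is reused for every range.
struct Range { int low, high; };

int scan_all_ranges(const std::vector<Range>& ranges) {
  int rows = 0;
  for (const Range& r : ranges) {
    // ACC_SCANREQ: initialise an operation record for this range.
    for (int key = r.low; key <= r.high; ++key) {
      ++rows;  // one NEXT_SCANREQ / NEXT_SCANCONF pair per row reference
    }
    // NEXT_SCANREQ(ZSCAN_CLOSE): close the scan for this range.
  }
  return rows;
}
```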

As an optimisation, all signals local to one LDM instance have been
converted to direct signals.
The following signals are used as part of the scan of one partition.
ACC_SCANREQ:
  This signal initialises an operation record in DBACC/DBTUP/DBTUX for a
  scan of one range or a full partition. Always sent as a direct signal
  and returned immediately through the signal object on return.

STORED_PROCREQ:
  This signal stores the interpreted program used to read each tuple as
  part of the scan. The same signal is also used to deallocate the
  interpreted program when the entire scan of all ranges has been
  completed. Always sent as a direct signal and returned immediately
  through the signal object on return.

ACC_LOCKREQ:
  Certain scans require a lock on the row before the row is read; this
  signal acquires such a lock. Always sent as a direct signal. The
  return signal is not always sent immediately.

ACCKEYCONF:
  Signal returned when the lock has been acquired. The signal is
  normally sent directly when the row is not locked, but for a locked
  row the signal can be delayed by a second or more. When sent, it is
  sent as a direct signal.

ACCKEYREF:
  Signal returned when acquiring the lock failed, e.g. due to the record
  being deleted while waiting for the lock.

ACC_ABORTCONF:
  Signal returned after aborting a scan. It is sent as an asynchronous
  message to ensure that all asynchronous messages sent since the scan
  state was set to aborted have been delivered.

NEXT_SCANREQ:
  This signal is used with several different meanings:
  ZSCAN_NEXT:
    Get the next row reference to read, returned in NEXT_SCANCONF signal.
  ZSCAN_NEXT_COMMIT:
    Get the next row reference to read AND unlock the specified row.
    Returned in NEXT_SCANCONF signal.
  ZSCAN_COMMIT:
    Unlock the specified row. The return signal is simply delivered upon
    returning from the call to execNEXT_SCANREQ.
  ZSCAN_CLOSE:
    Close the scan in DBACC/DBTUP/DBTUX.

  When sent as ZSCAN_COMMIT and ZSCAN_CLOSE, it is always sent as a
  direct signal. Otherwise it is sent as a direct or an asynchronous
  signal depending on the value of the scan_direct_count variable in the
  DBLQH scan record. The scan_direct_count variable ensures that the
  number of direct signals sent in a chain stays bounded.
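The bounding of direct-signal chains could look roughly like this. It is a simplified sketch; MAX_DIRECT and the function name are assumptions, not the actual DBLQH code:

```cpp
#include <cassert>

// Sketch of how a counter like scan_direct_count can bound a chain of
// direct signals: after MAX_DIRECT consecutive direct NEXT_SCANREQ
// signals the next one is sent asynchronously, so the thread returns
// to the scheduler and other jobs get a chance to run.
const int MAX_DIRECT = 16;  // assumed bound, not the real NDB constant

struct ScanRecord {
  int scan_direct_count = 0;
};

// Returns true if the next NEXT_SCANREQ should be a direct signal.
bool send_next_scan_direct(ScanRecord& scan) {
  if (scan.scan_direct_count >= MAX_DIRECT) {
    scan.scan_direct_count = 0;  // break the chain: go asynchronous
    return false;
  }
  scan.scan_direct_count++;
  return true;
}
```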

NEXT_SCANCONF:
  Return signal to NEXT_SCANREQ containing row reference to read or
  indication of close completed. Always sent as a direct signal.

TUPKEYREQ:
  This signal does the actual read of the row and sends the read row data
  directly to the API using the TRANSID_AI signal. This signal is always
  sent as a direct signal.

ACC_CHECK_SCAN:
  Continue scanning from a specified place. Used by DBACC/DBTUP/DBTUX as
  an internal signal as part of the scan. This signal can be sent both
  as an asynchronous signal and as a direct signal.

SCAN_FRAGCONF:
  Return signal sent to DBTC/DBSPJ after completing a part of the scan;
  the signal carries a set of references to rows sent to the API. After
  sending this signal, DBLQH stops and waits for a SCAN_NEXTREQ signal
  asking DBLQH to continue the scan of the partition. The number of rows
  scanned before sending SCAN_FRAGCONF depends on both configuration
  parameters and information in the SCAN_FRAGREQ signal.

  This signal is also sent when the scan is fully completed.
  This signal is normally a distributed signal, so it is always sent as
  an asynchronous signal.
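The batching decision can be modelled as a simple counter. This is a sketch; in real DBLQH the batch size is derived from several configuration parameters and SCAN_FRAGREQ fields:

```cpp
#include <cassert>

// Simplified batching model: DBLQH accumulates row references and
// reports to DBTC/DBSPJ with SCAN_FRAGCONF once the batch is full,
// then pauses until SCAN_NEXTREQ arrives.
struct ScanBatch {
  int batch_size;
  int rows_in_batch;
};

// Called after each row is sent to the API; returns true when it is
// time to send SCAN_FRAGCONF and wait for SCAN_NEXTREQ.
bool row_sent(ScanBatch& b) {
  if (++b.rows_in_batch >= b.batch_size) {
    b.rows_in_batch = 0;  // batch reported, counter restarts
    return true;
  }
  return false;
}
```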

SCAN_NEXTREQ:
  Request from DBTC/DBSPJ to continue scanning, as requested by the API.
  This signal is normally a distributed signal, so it is always sent as
  an asynchronous signal.

 Below follows an example signal diagram of a scan of one partition.

 DBLQH          ACC      TUP    ACC/TUP/TUX    API      DBTC
   |ACC_SCANREQ
   |----------------------------------------- >|
   |< -----------------------------------------|
   | STORED_PROCREQ
   |------------------------- >|
   |< -------------------------|
   | NEXT_SCANREQ (ZSCAN_NEXT)
   |-----------------------------------------  >|
   |                          prepareTUPKEYREQ
   |                          |< --------------|
   |                          |-------------- >|
   | NEXT_SCANCONF
   |< -----------------------------------------|
   | TUPKEYREQ
   |------------------------- >|  TRANSID_AI
   |                          |--------------------------------------- >|
   |< -------------------------|
   | NEXT_SCANREQ (ZSCAN_NEXT_COMMIT)
   |----------------------------------------- >|
   |                          prepareTUPKEYREQ
   |                          |< --------------|
   |                          |-------------- >|
   | NEXT_SCANCONF
   |< -----------------------------------------|
   | TUPKEYREQ
   |------------------------- >|  TRANSID_AI
   |                          |--------------------------------------- >|
   |< -------------------------|
   Repeat the above for as many rows as required before returning to the
   API. The TRANSID_AI above isn't always sent; the interpreted program
   performs the selection and may decide not to send a specific row
   because it doesn't match the condition checked by the program.
   |
   | SCAN_FRAGCONF
   |---------------------------------------------------------------- >|
   .... Some time for API and DBTC to process things.
   | SCAN_NEXTREQ
   |< ----------------------------------------------------------------|
   | NEXT_SCANREQ (ZSCAN_NEXT_COMMIT)
   |----------------------------------------- >|
   |                          prepareTUPKEYREQ
   |                          |< --------------|
   |                          |-------------- >|
   | NEXT_SCANCONF
   |< -----------------------------------------|
   | TUPKEYREQ
   |------------------------- >|  TRANSID_AI
   |                          |-------------------------- >|
   |< -------------------------|
   Repeat the above until it is time for the next SCAN_FRAGCONF to be
   sent. When NEXT_SCANCONF indicates that there are no more tuples to
   fetch, we start to close the scan.

   |
   | NEXT_SCANREQ (ZSCAN_NEXT_COMMIT)
   |----------------------------------------- >|
   | NEXT_SCANCONF(no more tuples)
   |< -----------------------------------------|
   | NEXT_SCANREQ (ZSCAN_CLOSE)
   |----------------------------------------- >|
   | NEXT_SCANCONF
   |< -----------------------------------------|
   | STORED_PROCREQ (delete interpreted program)
   |------------------------- >|
   |< -------------------------|
   | SCAN_FRAGCONF (close flag set)
   |---------------------------------------------------------------- >|
   Now the scan is completed.

   Now follow a number of variations on the above signal diagram.
   Scan with locking:
   In this case we use the flag ZSCAN_NEXT all the time and never
   ZSCAN_NEXT_COMMIT; instead we handle things a bit differently when
   receiving SCAN_NEXTREQ, where we perform a signal diagram like this:

   | NEXT_SCANREQ (ZSCAN_COMMIT)
   |----------------------------------------- >|
   |< -----------------------------------------|
   This is repeated for each row sent to the API in the previous
   SCAN_FRAGCONF signal.
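That per-row unlock pass could be sketched like this. It is illustrative only; the real code walks operation records rather than a plain vector:

```cpp
#include <cassert>
#include <vector>

// Sketch of the locked-scan variant: on SCAN_NEXTREQ, DBLQH first
// unlocks every row it reported in the previous SCAN_FRAGCONF, one
// NEXT_SCANREQ(ZSCAN_COMMIT) per row, before fetching new rows with
// ZSCAN_NEXT.
int commit_previous_batch(std::vector<int>& locked_rows) {
  int commits = 0;
  while (!locked_rows.empty()) {
    locked_rows.pop_back();  // NEXT_SCANREQ(ZSCAN_COMMIT) for this row
    ++commits;
  }
  return commits;
}
```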

   If the application wants the row locked for a longer time, it has had
   the chance to perform a key lookup operation that took over the lock,
   such that even when we unlock the scan lock, the transaction still
   retains a lock on the row.

   After each row scanned we check whether we've reached a scan
   heartbeat timeout. If we have, we send a SCAN_HBREP signal to
   DBTC/DBSPJ to inform them that we're still actively scanning even
   though no result rows have been sent. Remember that a scan in DBLQH
   can potentially scan billions of rows while returning only very few
   to the API. Thus we can scan for an extended time without returning
   to the API. This is handled by the method check_send_scan_hb_rep.
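A minimal sketch of such a heartbeat check, assuming a millisecond clock; the struct and names are hypothetical, not the actual check_send_scan_hb_rep:

```cpp
#include <cassert>

// If no report has been sent to DBTC/DBSPJ for longer than the
// timeout, send SCAN_HBREP so the long-running scan isn't mistaken
// for a dead one.
struct HeartbeatState {
  long last_report_ms;  // time of last SCAN_FRAGCONF or SCAN_HBREP
  long timeout_ms;
};

// Returns true when a SCAN_HBREP should be sent at time now_ms.
bool need_heartbeat(HeartbeatState& hb, long now_ms) {
  if (now_ms - hb.last_report_ms >= hb.timeout_ms) {
    hb.last_report_ms = now_ms;  // SCAN_HBREP sent, restart the clock
    return true;
  }
  return false;
}
```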

   Already when returning from ACC_SCANREQ we can discover that the
   partition (== fragment) is empty and go immediately to the close-down
   code.
   For index scans we will send TUX_BOUND_INFO after ACC_SCANREQ and
   before sending STORED_PROCREQ to DBTUX. This provides one range to
   DBTUX for scanning. If multiple ranges are to be scanned, we start up
   a new scan as if a new SCAN_FRAGREQ had been received, but we don't
   need to send STORED_PROCREQ since the same interpreted program will
   be used. We will, however, send ACC_SCANREQ and TUX_BOUND_INFO also
   for this new range.

  There are various reasons for temporarily stopping a scan: this could
  be lack of operation records or holding too many row locks; one could
  also end up in this situation after waiting for a row lock.

  To restart the scan again after any type of temporary stop, one sends
  the signal ACC_CHECK_SCAN, either as a direct or an asynchronous signal,
  to DBACC/DBTUP/DBTUX. This signal is sent from many different places in
  DBLQH, DBACC, DBTUP and DBTUX. It is always sent as part of NEXT_SCANREQ
  processing.
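The stop/restart cycle could be modelled like this. It is a sketch; the enum values mirror the reasons listed above, but all names are hypothetical:

```cpp
#include <cassert>

// A scan pauses for one of the reasons above, and a later
// ACC_CHECK_SCAN resumes it from where it stopped.
enum StopReason {
  NONE,
  NO_OPERATION_RECORDS,
  TOO_MANY_ROW_LOCKS,
  WAITED_FOR_ROW_LOCK
};

struct ScanState {
  StopReason stopped = NONE;
};

void temporary_stop(ScanState& s, StopReason why) { s.stopped = why; }

// Models execACC_CHECK_SCAN: clears the stop so the scan continues.
// Returns false if the scan wasn't stopped in the first place.
bool resume_scan(ScanState& s) {
  if (s.stopped == NONE) return false;
  s.stopped = NONE;
  return true;
}
```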

  When executing ACC_CHECK_SCAN, one can flag to DBACC/DBTUP/DBTUX that
  it should check for a 10 ms delay using the flag ZCHECK_LCP_STOP. In
  previous versions this was also related to local checkpoints; this is
  no longer the case. Now it's only related to situations where an extra
  wait is required so that resources become available again.

  DBTUP and DBTUX send the signal CHECK_LCP_STOP to DBLQH in a number of
  situations, among other things when a locked key has been encountered.
  Even when the ACCKEYCONF signal is then received, indicating that the
  lock has been acquired, DBLQH will still wait for the CHECK_LCP_STOP
  handling to return after its 10 ms delay. Fixing this is on the TODO
  list, to ensure that we can proceed with these locked rows immediately
  after lock delivery. As it is now, we can get up to a 10 ms delay each
  time we encounter a locked row.
