The cluster MD is a shared-device RAID for a cluster.


1. On-disk format

A separate write-intent bitmap is used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
| bm bits [3, contd]  |                     |                     |
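
As a concrete illustration of the diagram, the byte offset of each
node's bitmap superblock can be computed, assuming (as drawn) that each
node's bitmap occupies two 4k pages; the macro and function names here
are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the layout drawn above. The two-pages-per-node figure is an
 * assumption taken from the diagram; real sizes depend on the bitmap
 * chunk size and the device size. */
#define PAGE_4K           4096ULL
#define BITMAPS_OFF       (2 * PAGE_4K)   /* 8k: first bitmap superblock */
#define BM_PAGES_PER_NODE 2               /* super + bits, bits contd    */

/* Byte offset of the bitmap superblock for bitmap slot 'slot'. */
static uint64_t bm_super_offset(int slot)
{
	return BITMAPS_OFF + (uint64_t)slot * BM_PAGES_PER_NODE * PAGE_4K;
}
```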

During "normal" functioning we assume the filesystem ensures that only
one node writes to any given block at a time, so a write request will:

 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

Reads are handled normally. It is up to the filesystem to ensure that
one node does not read from a location where another node (or the same
node) is writing.
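
The three steps above can be sketched with a toy in-memory bitmap; the
names (wi_bitmap, bit_for_sector) and the chunk size are illustrative
only, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SECTORS 2048ULL  /* assumed bitmap chunk: 1MiB in 512-byte sectors */
#define BITMAP_BITS   64       /* toy bitmap: one 64-bit word */

static uint64_t wi_bitmap;     /* this node's write-intent bits */

static int bit_for_sector(uint64_t sector)
{
	return (int)((sector / CHUNK_SECTORS) % BITMAP_BITS);
}

static void write_request(uint64_t sector)
{
	wi_bitmap |= 1ULL << bit_for_sector(sector); /* 1. set the bit    */
	/* 2. commit the write to all mirrors here                         */
	/* 3. a timer would clear the bit once the region has been idle    */
}
```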


2. DLM Locks for management

There are two locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)

The bm_lockres protects the individual node bitmaps. They are named in
the form bitmap001 for node 1, bitmap002 for node 2, and so on. When a
node joins the cluster, it acquires the lock in PW mode and holds it
for as long as the node is part of the cluster. The lock resource
number is based on the slot number returned by the DLM subsystem. Since
DLM starts node count from one and bitmap slots start from zero, one is
subtracted from the DLM slot number to arrive at the bitmap slot number.
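
The naming and slot arithmetic described above can be sketched as
follows (the helper names are illustrative):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The lock resource name is derived from the DLM slot number, while the
 * bitmap slot is the DLM slot number minus one. */
static void bm_lockres_name(char *buf, size_t len, int dlm_slot)
{
	snprintf(buf, len, "bitmap%03d", dlm_slot);
}

static int bitmap_slot(int dlm_slot)
{
	return dlm_slot - 1;   /* DLM counts from 1, bitmap slots from 0 */
}
```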

3. Communication

Each node has to communicate with the other nodes when starting or
ending a resync, and when the metadata superblock is updated.

3.1 Message Types

There are three types of messages which are passed:

3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
updated, and the node must re-read the md superblock. This is performed
synchronously.

3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
so that each node may suspend or resume the region.

3.1.3 NEWDISK: informs other nodes that a device is being added to the
array, carrying the uuid and slot number of the new device (see
section 5).
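
The message types above can be summarized as an enum; the numeric
values here are purely illustrative:

```c
#include <assert.h>

enum cluster_msg_type {
	METADATA_UPDATED = 0,  /* re-read the md superblock (synchronous) */
	RESYNC,                /* a resync started or ended               */
	NEWDISK,               /* a device is being added (section 5)     */
};
```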

3.2 Communication mechanism

The DLM LVB is used to communicate among the nodes of the cluster.
Three lock resources are used for this purpose:

3.2.1 Token: The resource which protects the entire communication
system. Only the node holding the token resource is allowed to
communicate.

3.2.2 Message: The lock resource which carries the data to
communicate.

3.2.3 Ack: The resource whose acquisition indicates that the message
has been acknowledged by all nodes in the cluster. The BAST of this
resource is used to inform the receiving node that a node wants to
communicate.

The algorithm is:

1. receive status

   sender                  receiver                receiver
   ACK:CR                  ACK:CR                  ACK:CR

2. sender get EX of TOKEN
   sender get EX of MESSAGE

   sender                  receiver                receiver
   TOKEN:EX                ACK:CR                  ACK:CR
   MESSAGE:EX
   ACK:CR

   Sender checks that it still needs to send a message. Messages
   received or other events that happened while waiting for the TOKEN
   may have made this message inappropriate or redundant.

3. sender write LVB
   sender down-convert MESSAGE from EX to CR
   sender try to get EX of ACK
   [ wait until all receivers have *processed* the MESSAGE ]

                           [ triggered by bast of ACK ]
                           receiver get CR of MESSAGE
                           receiver read LVB
                           receiver processes the message
                           [ wait finish ]
                           receiver release ACK

   sender                  receiver                receiver
   TOKEN:EX                MESSAGE:CR              MESSAGE:CR
   MESSAGE:CR
   ACK:EX

4. triggered by grant of EX on ACK (indicating all receivers have
   processed the message)

   sender down-convert ACK from EX to CR
   sender release MESSAGE
   sender release TOKEN
                           receiver upconvert to EX of MESSAGE
                           receiver get CR of ACK
                           receiver release MESSAGE

   sender                  receiver                receiver
   ACK:CR                  ACK:CR                  ACK:CR
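
The sequence above, for one sender and one receiver, can be walked
through in a toy model that tracks only the lock modes; the DLM grant
and BAST machinery itself is not modeled, so this is a sketch of the
ordering, not an implementation:

```c
#include <assert.h>

enum mode { NL, CR, PW, EX };   /* NL = lock not held */

struct node { enum mode token, message, ack; };

static void send_message(struct node *s, struct node *r)
{
	s->token = EX; s->message = EX; /* step 2: TOKEN then MESSAGE in EX */
	s->message = CR;                /* step 3: LVB written, down-convert */
	r->message = CR; r->ack = NL;   /* receiver processes, releases ACK */
	s->ack = EX;                    /* sender's EX on ACK is granted    */
	s->ack = CR;                    /* step 4: back to the idle state   */
	s->message = NL; s->token = NL;
	r->message = NL; r->ack = CR;
}
```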


4. Handling Failures

4.1 Node Failure
When a node fails, the DLM informs the cluster with the slot number of
the failed node. The node starts a cluster recovery thread. The
cluster recovery thread:
 - acquires the bitmap<number> lock of the failed node
 - opens the bitmap
 - reads the bitmap of the failed node
 - copies the set bits to the local node's bitmap
 - cleans the bitmap of the failed node
 - releases the bitmap<number> lock of the failed node
 - initiates a resync of the bitmap on the current node
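
The copy-and-clean steps above amount to OR-ing the failed node's bits
into the local bitmap; in this sketch a single 64-bit word stands in
for a whole bitmap:

```c
#include <assert.h>
#include <stdint.h>

static void recover_bitmap(uint64_t *local, uint64_t *failed)
{
	*local |= *failed;   /* copy the set bits to the local node */
	*failed = 0;         /* clean the failed node's bitmap      */
}
```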

The resync process is the regular md resync. However, in a clustered
environment, when a resync is performed the node needs to tell the
other nodes about the areas which are suspended. Before a resync
starts, the node sends out RESYNC_START with the (lo,hi) range of the
area which needs to be suspended. Each node maintains a suspend_list,
which contains the list of ranges which are currently suspended. On
receiving RESYNC_START, the node adds the range to the suspend_list.
Similarly, when the node performing the resync finishes, it sends
RESYNC_FINISHED to the other nodes, and they remove the corresponding
entry from the suspend_list.

A helper function, should_suspend(), can be used to check whether a
particular I/O range should be suspended.
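
A minimal sketch of the suspend_list and should_suspend(): a fixed
array stands in for the kernel's linked list, and the range-overlap
test is an assumption about the intended semantics:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct suspend_range { uint64_t lo, hi; };

static struct suspend_range suspend_list[8];
static int nr_suspended;

static void on_resync_start(uint64_t lo, uint64_t hi) /* RESYNC_START */
{
	suspend_list[nr_suspended++] = (struct suspend_range){ lo, hi };
}

static bool should_suspend(uint64_t lo, uint64_t hi)
{
	for (int i = 0; i < nr_suspended; i++)
		if (lo <= suspend_list[i].hi && hi >= suspend_list[i].lo)
			return true;   /* overlaps a suspended range */
	return false;
}
```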

4.2 Device Failure
Device failures are handled and communicated through the metadata
update routine.

5. Adding a new Device
For adding a new device, it is necessary that all nodes "see" the new
device to be added. For this, the following algorithm is used:

1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY, which issues
   ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends NEWDISK with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
   (steps 4 and 5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps using
   blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether the
   disk was found:
   ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
         disc.number set to slot number)
   ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop their CR lock on no-new-devs if the device is found
7. Node 1 attempts an EX lock on no-new-devs
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking
   the disk as SpareLocal
9. If node 1 cannot get the no-new-devs lock, it fails the operation
   and sends METADATA_UPDATED
10. Other nodes learn whether the disk was added or not from the
    following METADATA_UPDATED message.