dm-raid
=======

The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
It allows the MD RAID drivers to be accessed using a device-mapper
interface.


Mapping Table Interface
-----------------------
The target is named "raid" and it accepts the following parameters:

  <raid_type> <#raid_params> <raid_params> \
    <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]

<raid_type>:
  raid1         RAID1 mirroring
  raid4         RAID4 dedicated parity disk
  raid5_la      RAID5 left asymmetric
                - rotating parity 0 with data continuation
  raid5_ra      RAID5 right asymmetric
                - rotating parity N with data continuation
  raid5_ls      RAID5 left symmetric
                - rotating parity 0 with data restart
  raid5_rs      RAID5 right symmetric
                - rotating parity N with data restart
  raid6_zr      RAID6 zero restart
                - rotating parity zero (left-to-right) with data restart
  raid6_nr      RAID6 N restart
                - rotating parity N (right-to-left) with data restart
  raid6_nc      RAID6 N continue
                - rotating parity N (right-to-left) with data continuation
  raid10        Various RAID10 inspired algorithms chosen by additional params
                - RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
                - RAID1E: Integrated Adjacent Stripe Mirroring
                - RAID1E: Integrated Offset Stripe Mirroring
                - and other similar RAID10 variants

  Reference: Chapter 4 of
  http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf

<#raid_params>: The number of parameters that follow.

<raid_params> consists of
    Mandatory parameters:
        <chunk_size>: Chunk size in sectors.  This parameter is often known as
                      "stripe size".  It is the only mandatory parameter and
                      is placed first.

    followed by optional parameters (in any order):
        [sync|nosync]   Force or prevent RAID initialization.

        [rebuild <idx>] Rebuild drive number 'idx' (first drive is 0).

        [daemon_sleep <ms>]
                Interval between runs of the bitmap daemon that
                clears bits.  A longer interval means less bitmap I/O but
                resyncing after a failure is likely to take longer.

        [min_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
        [max_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
        [write_mostly <idx>]               Mark drive index 'idx' write-mostly.
        [max_write_behind <sectors>]       See '--write-behind=' (man mdadm)
        [stripe_cache <sectors>]           Stripe cache size (RAID 4/5/6 only)
        [region_size <sectors>]
                The region_size multiplied by the number of regions is the
                logical size of the array.  The bitmap records the device
                synchronisation state for each region.

        [raid10_copies   <# copies>]
        [raid10_format   <near|far|offset>]
                These two options are used to alter the default layout of
                a RAID10 configuration.  The number of copies can be
                specified, but the default is 2.  There are also three
                variations to how the copies are laid down - the default
                is "near".  Near copies are what most people think of with
                respect to mirroring.  If these options are left unspecified,
                or 'raid10_copies 2' and/or 'raid10_format near' are given,
                then the layouts for 2, 3 and 4 devices are:
                2 drives   3 drives      4 drives
                --------   ----------    --------------
                A1  A1     A1  A1  A2    A1  A1  A2  A2
                A2  A2     A2  A3  A3    A3  A3  A4  A4
                A3  A3     A4  A4  A5    A5  A5  A6  A6
                A4  A4     A5  A6  A6    A7  A7  A8  A8
                ..  ..     ..  ..  ..    ..  ..  ..  ..
                The 2-device layout is equivalent to 2-way RAID1.  The
                4-device layout is what a traditional RAID10 would look
                like.  The 3-device layout is what might be called a
                'RAID1E - Integrated Adjacent Stripe Mirroring'.

                If 'raid10_copies 2' and 'raid10_format far' are given, then
                the layouts for 2, 3 and 4 devices are:
                2 drives   3 drives         4 drives
                --------   --------------   --------------------
                A1  A2     A1   A2   A3     A1   A2   A3   A4
                A3  A4     A4   A5   A6     A5   A6   A7   A8
                A5  A6     A7   A8   A9     A9   A10  A11  A12
                ..  ..     ..   ..   ..     ..   ..   ..   ..
                A2  A1     A3   A1   A2     A2   A1   A4   A3
                A4  A3     A6   A4   A5     A6   A5   A8   A7
                A6  A5     A9   A7   A8     A10  A9   A12  A11
                ..  ..     ..   ..   ..     ..   ..   ..   ..

                If 'raid10_copies 2' and 'raid10_format offset' are given,
                then the layouts for 2, 3 and 4 devices are:
                2 drives   3 drives         4 drives
                --------   ------------     -----------------
                A1  A2     A1   A2   A3     A1   A2   A3   A4
                A2  A1     A3   A1   A2     A2   A1   A4   A3
                A3  A4     A4   A5   A6     A5   A6   A7   A8
                A4  A3     A6   A4   A5     A6   A5   A8   A7
                A5  A6     A7   A8   A9     A9   A10  A11  A12
                A6  A5     A9   A7   A8     A10  A9   A12  A11
                ..  ..     ..   ..   ..     ..   ..   ..   ..
                Here we see layouts closely akin to 'RAID1E - Integrated
                Offset Stripe Mirroring'.

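The three layout tables above can be reproduced with a short sketch. This
is illustrative only: the function names are hypothetical, only
'raid10_copies 2' is handled for "far" and "offset", and the rule used for
the mirrored rows (devices grouped in pairs, with any odd device joining
the last group, each group rotated by one) is inferred from the tables in
this document rather than taken from the MD driver.

```python
import math


def _stripe(index, devices):
    """Plain striped row number 'index' (0-based): A1 A2 ... across devices."""
    return [f"A{index * devices + i + 1}" for i in range(devices)]


def _mirror_row(row, copies=2):
    """Second copy of a row, as inferred from the tables in this document.

    Devices are grouped in sets of 'copies'; leftover devices join the last
    set.  Within each set the chunks are rotated by one device.
    """
    n = len(row)
    last_start = (max(n // copies, 1) - 1) * copies
    out = []
    for dev in range(n):
        start = min((dev // copies) * copies, last_start)
        size = n - start if start == last_start else copies
        out.append(row[start + (dev - start - 1) % size])
    return out


def near_layout(devices, rows, copies=2):
    """'near': each chunk is written 'copies' times in adjacent positions."""
    chunks = math.ceil(devices * rows / copies)
    cells = [f"A{k}" for k in range(1, chunks + 1) for _ in range(copies)]
    return [cells[r * devices:(r + 1) * devices] for r in range(rows)]


def offset_layout(devices, stripes):
    """'offset': every stripe is followed directly by its mirrored row."""
    out = []
    for s in range(stripes):
        row = _stripe(s, devices)
        out += [row, _mirror_row(row)]
    return out


def far_layout(devices, stripes):
    """'far': all stripes first, then all mirrored rows further down."""
    first = [_stripe(s, devices) for s in range(stripes)]
    return first + [_mirror_row(row) for row in first]


if __name__ == "__main__":
    for row in far_layout(4, 3):
        print("  ".join(f"{cell:<3}" for cell in row))
```

Running the module prints the 4-drive "far" table; near_layout(3, 4) and
offset_layout(2, 3) give the 3-drive "near" and 2-drive "offset" tables.
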
<#raid_devs>: The number of devices composing the array.
        Each device consists of two entries.  The first is the device
        containing the metadata (if any); the second is the one containing the
        data.

        If a drive has failed or is missing at creation time, a '-' can be
        given for both the metadata and data drives for a given position.


Example Tables
--------------
# RAID4 - 4 data drives, 1 parity (no metadata devices)
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)

0 1960893648 raid \
        raid4 1 2048 \
        5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

# RAID4 - 4 data drives, 1 parity (with metadata devices)
# Chunk size of 1MiB, force RAID initialization,
#       min recovery rate at 20 kiB/sec/disk

0 1960893648 raid \
        raid4 4 2048 sync min_recovery_rate 20 \
        5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82


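As a sketch of how the pieces of a table line fit together, the helper
below assembles the two example tables from their parts. The function
names are hypothetical, and the chunk size conversion assumes the usual
device-mapper sector size of 512 bytes.

```python
SECTOR_BYTES = 512  # assumption: device-mapper sectors are 512 bytes


def chunk_sectors(chunk_bytes):
    """Convert a chunk size in bytes to sectors (1 MiB -> 2048)."""
    return chunk_bytes // SECTOR_BYTES


def raid_table(start, length, raid_type, raid_params, devices):
    """Build a one-line "raid" target table.

    raid_params: chunk size first, then any optional parameters.
    devices:     list of (metadata_dev, data_dev) pairs; None becomes '-'.
    """
    params = [str(p) for p in raid_params]
    devs = []
    for metadata_dev, data_dev in devices:
        devs += [metadata_dev or "-", data_dev or "-"]
    fields = [start, length, "raid", raid_type,
              len(params), *params,      # <#raid_params> <raid_params>
              len(devices), *devs]       # <#raid_devs> <metadata_dev> <dev>
    return " ".join(str(f) for f in fields)


if __name__ == "__main__":
    chunk = chunk_sectors(1024 * 1024)

    # First example: no metadata devices.
    print(raid_table(0, 1960893648, "raid4", [chunk],
                     [(None, f"8:{minor}") for minor in (17, 33, 49, 65, 81)]))

    # Second example: metadata devices, forced sync, min recovery rate.
    print(raid_table(0, 1960893648, "raid4",
                     [chunk, "sync", "min_recovery_rate", 20],
                     [(f"8:{m}", f"8:{m + 1}") for m in (17, 33, 49, 65, 81)]))
```

Note that <#raid_params> counts every word after the raid type up to the
device count, so "2048 sync min_recovery_rate 20" counts as 4.
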
Status Output
-------------
'dmsetup table' displays the table used to construct the mapping.
The optional parameters are always printed in the order listed
above with "sync" or "nosync" always output ahead of the other
arguments, regardless of the order used when originally loading the table.
Arguments that can be repeated are ordered by value.


'dmsetup status' yields information on the state and health of the array.
The output is as follows (normally a single line, but expanded here for
clarity):
1: <s> <l> raid \
2:      <raid_type> <#devices> <health_chars> \
3:      <sync_ratio> <sync_action> <mismatch_cnt>

Line 1 is the standard output produced by device-mapper.
Lines 2 and 3 are produced by the raid target and are best explained by
example:
        0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0
Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with its initial
recovery.  Here is a fuller description of the individual fields:
        <raid_type>     Same as the <raid_type> used to create the array.
        <health_chars>  One char for each device, indicating: 'A' = alive and
                        in-sync, 'a' = alive but not in-sync, 'D' = dead/failed.
        <sync_ratio>    The ratio indicating how much of the array has undergone
                        the process described by 'sync_action'.  If the
                        'sync_action' is "check" or "repair", then the process
                        of "resync" or "recover" can be considered complete.
        <sync_action>   One of the following possible states:
                        idle    - No synchronization action is being performed.
                        frozen  - The current action has been halted.
                        resync  - Array is undergoing its initial synchronization
                                  or is resynchronizing after an unclean shutdown
                                  (possibly aided by a bitmap).
                        recover - A device in the array is being rebuilt or
                                  replaced.
                        check   - A user-initiated full check of the array is
                                  being performed.  All blocks are read and
                                  checked for consistency.  The number of
                                  discrepancies found is recorded in
                                  <mismatch_cnt>.  No changes are made to the
                                  array by this action.
                        repair  - The same as "check", but discrepancies are
                                  corrected.
                        reshape - The array is undergoing a reshape.
        <mismatch_cnt>  The number of discrepancies found between mirror copies
                        in RAID1/10 or wrong parity values found in RAID4/5/6.
                        This value is valid only after a "check" of the array
                        is performed.  A healthy array has a 'mismatch_cnt' of 0.

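A minimal sketch of splitting a 'dmsetup status' line into the fields
described above. The type and function names are hypothetical, and the
parser only understands the nine-field form shown in this section.

```python
from typing import NamedTuple


class RaidStatus(NamedTuple):
    start: int          # <s>
    length: int         # <l>
    raid_type: str
    devices: int
    health_chars: str
    synced: int         # numerator of <sync_ratio>
    total: int          # denominator of <sync_ratio>
    sync_action: str
    mismatch_cnt: int


def parse_raid_status(line):
    """Parse one status line of a "raid" target into a RaidStatus."""
    fields = line.split()
    if len(fields) < 9 or fields[2] != "raid":
        raise ValueError(f"not a raid status line: {line!r}")
    synced, total = (int(part) for part in fields[6].split("/"))
    return RaidStatus(int(fields[0]), int(fields[1]), fields[3],
                      int(fields[4]), fields[5], synced, total,
                      fields[7], int(fields[8]))


def failed_devices(status):
    """Indexes of devices whose health char is 'D' (dead/failed)."""
    return [i for i, char in enumerate(status.health_chars) if char == "D"]


if __name__ == "__main__":
    status = parse_raid_status(
        "0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0")
    print(status.raid_type, status.devices, failed_devices(status))
    print(f"{100.0 * status.synced / status.total:.6f}% done")
```

Keeping the ratio as two integers avoids losing precision on large
arrays; a percentage can be derived when needed.
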
Message Interface
-----------------
The dm-raid target will accept certain actions through the 'message' interface.
('man dmsetup' for more information on the message interface.)  These actions
include:
        "idle"   - Halt the current sync action.
        "frozen" - Freeze the current sync action.
        "resync" - Initiate/continue a resync.
        "recover"- Initiate/continue a recover process.
        "check"  - Initiate a check (i.e. a "scrub") of the array.
        "repair" - Initiate a repair of the array.
        "reshape"- Currently unsupported (-EINVAL).


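The sketch below builds the 'dmsetup message' command line for one of the
actions listed above. It assumes the general form
'dmsetup message <device_name> <sector> <message>' from the dmsetup man
page, uses sector 0, and uses a hypothetical device name "my_raid". It only
builds the argument list; actually running it needs root and a real array.

```python
# Actions listed above; "reshape" is omitted because the target
# currently rejects it with -EINVAL.
SUPPORTED_ACTIONS = ("idle", "frozen", "resync", "recover", "check", "repair")


def message_command(device_name, action, sector=0):
    """Return the argv list that sends 'action' to a raid device."""
    if action not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported sync action: {action!r}")
    return ["dmsetup", "message", device_name, str(sector), action]


if __name__ == "__main__":
    # "my_raid" is a placeholder name, not a device from this document.
    print(" ".join(message_command("my_raid", "check")))
    # To execute for real (as root), the list could be passed to
    # subprocess.run(..., check=True).
```

After a "check" completes, the mismatch count reported by
'dmsetup status' shows how many discrepancies were found.
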
Discard Support
---------------
The implementation of discard support among hardware vendors varies.
When a block is discarded, some storage devices will return zeroes when
the block is read.  These devices set the 'discard_zeroes_data'
attribute.  Other devices will return random data.  Confusingly, some
devices that advertise 'discard_zeroes_data' will not reliably return
zeroes when discarded blocks are read!  Since RAID 4/5/6 uses blocks
from a number of devices to calculate parity blocks and (for performance
reasons) relies on 'discard_zeroes_data' being reliable, it is important
that the devices be consistent.  Blocks may be discarded in the middle
of a RAID 4/5/6 stripe and if subsequent read results are not
consistent, the parity blocks may be calculated differently at any time,
making the parity blocks useless for redundancy.  It is important to
understand how your hardware behaves with discards if you are going to
enable discards with RAID 4/5/6.

Since the behavior of storage devices is unreliable in this respect,
even when reporting 'discard_zeroes_data', by default RAID 4/5/6
discard support is disabled -- this ensures data integrity at the
expense of losing some performance.

Storage devices that properly support 'discard_zeroes_data' are
increasingly whitelisted in the kernel and can thus be trusted.

For trusted devices, the following dm-raid module parameter can be set
to safely enable discard support for RAID 4/5/6:
    'devices_handle_discard_safely'


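A small sketch for checking the current value of the module parameter,
spelled 'devices_handle_discard_safely' as in the version history of this
document. The sysfs path is an assumption based on the usual
/sys/module/<module>/parameters/ convention (with "dm-raid" appearing as
"dm_raid"), and the function name is hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumed location of the parameter; not stated in this document.
PARAM = Path("/sys/module/dm_raid/parameters/devices_handle_discard_safely")


def discard_safely_enabled(param: Path = PARAM) -> Optional[bool]:
    """Return True/False for the parameter, or None if it cannot be read
    (for example when the dm-raid module is not loaded)."""
    try:
        text = param.read_text().strip()
    except OSError:
        return None
    # Boolean module parameters are normally reported as "Y" or "N".
    return text in ("Y", "y", "1")


if __name__ == "__main__":
    state = discard_safely_enabled()
    if state is None:
        print("dm-raid parameter not available")
    else:
        print("RAID 4/5/6 discards enabled:", state)
```
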
Version History
---------------
1.0.0   Initial version.  Support for RAID 4/5/6
1.1.0   Added support for RAID 1
1.2.0   Handle creation of arrays that contain failed devices.
1.3.0   Added support for RAID 10
1.3.1   Allow device replacement/rebuild for RAID 10
1.3.2   Fix/improve redundancy checking for RAID10
1.4.0   Non-functional change.  Removes arg from mapping function.
1.4.1   RAID10 fix redundancy validation checks (commit 55ebbb5).
1.4.2   Add RAID10 "far" and "offset" algorithm support.
1.5.0   Add message interface to allow manipulation of the sync_action.
        New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
1.5.1   Add ability to restore transiently failed devices on resume.
1.5.2   'mismatch_cnt' is zero unless [last_]sync_action is "check".
1.6.0   Add discard support (and devices_handle_discard_safely module param).
1.7.0   Add support for MD RAID0 mappings.