Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | 1 | Tools that manage md devices can be found at |
| 2 | http://www.<country>.kernel.org/pub/linux/utils/raid/.... |
| 3 | |
| 4 | |
| 5 | Boot time assembly of RAID arrays |
| 6 | --------------------------------- |
| 7 | |
| 8 | You can assemble your md device at boot with the following kernel command |
| 9 | lines: |
| 10 | |
| 11 | for old raid arrays without persistent superblocks: |
| 12 | md=<md device no.>,<raid level>,<chunk size factor>,<fault level>,dev0,dev1,...,devn |
| 13 | |
| 14 | for raid arrays with persistent superblocks |
| 15 | md=<md device no.>,dev0,dev1,...,devn |
| 16 | or, to assemble a partitionable array: |
| 17 | md=d<md device no.>,dev0,dev1,...,devn |
| 18 | |
| 19 | md device no. = the number of the md device ... |
| 20 | 0 means md0, |
| 21 | 1 md1, |
| 22 | 2 md2, |
| 23 | 3 md3, |
| 24 | 4 md4 |
| 25 | |
| 26 | raid level = -1 linear mode |
| 27 | 0 striped mode |
| 28 | other modes are only supported with persistent super blocks |
| 29 | |
| 30 | chunk size factor = (raid-0 and raid-1 only) |
| 31 | Set the chunk size as 4k << n. |
| 32 | |
| 33 | fault level = totally ignored |
| 34 | |
| 35 | dev0-devn: e.g. /dev/hda1,/dev/hdc1,/dev/sda1,/dev/sdb1 |
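As a worked example of the chunk size factor, the shift "4k << n" can be evaluated directly. This is only a sketch: the function name chunk_size_bytes is made up for illustration, and it assumes that "4k" means 4096 bytes.

```python
def chunk_size_bytes(factor: int) -> int:
    """Chunk size in bytes for chunk size factor n, computed as 4k << n."""
    if factor < 0:
        raise ValueError("chunk size factor must be non-negative")
    # Assumption: "4k" in the text means 4096 bytes.
    return 4096 << factor


print(chunk_size_bytes(0))  # 4096
print(chunk_size_bytes(4))  # 65536
```

Under that assumption a factor of 4 corresponds to a 64k chunk.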
| 36 | |
| 37 | A possible loadlin line (Harald Hoyer <HarryH@Royal.Net>) looks like this: |
| 38 | |
| 39 | e:\loadlin\loadlin e:\zimage root=/dev/md0 md=0,0,4,0,/dev/hdb2,/dev/hdc3 ro |
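To make the field order of the md= parameter concrete, the following sketch splits a parameter value into its parts. It is not the kernel's parser. The helper name parse_md_param is hypothetical, and telling the two forms apart by checking whether the second field starts with "/" is an assumption made for this illustration.

```python
def parse_md_param(value: str) -> dict:
    """Split the value of an md= kernel parameter into named fields."""
    fields = value.split(",")
    unit = fields[0]
    # A leading "d" requests a partitionable array.
    partitionable = unit.startswith("d")
    if partitionable:
        unit = unit[1:]
    result = {"md_device": int(unit), "partitionable": partitionable}
    rest = fields[1:]
    # Assumption: device names start with "/", so anything else in the
    # second field means the old form without persistent superblocks.
    if rest and not rest[0].startswith("/"):
        level, chunk_factor, _fault_level = (int(x) for x in rest[:3])
        result["raid_level"] = level
        result["chunk_size"] = 4096 << chunk_factor  # assumes 4k == 4096
        rest = rest[3:]
    result["devices"] = rest
    return result


info = parse_md_param("0,0,4,0,/dev/hdb2,/dev/hdc3")
print(info["md_device"], info["raid_level"], info["devices"])
```

Applied to the loadlin line above, this yields md0 in striped mode built from /dev/hdb2 and /dev/hdc3.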
| 40 | |
| 41 | |
| 42 | Boot time autodetection of RAID arrays |
| 43 | -------------------------------------- |
| 44 | |
| 45 | When md is compiled into the kernel (not as module), partitions of |
| 46 | type 0xfd are scanned and automatically assembled into RAID arrays. |
| 47 | This autodetection may be suppressed with the kernel parameter |
| 48 | "raid=noautodetect". As of kernel 2.6.9, only drives with a type 0 |
| 49 | superblock can be autodetected and run at boot time. |
| 50 | |
| 51 | The kernel parameter "raid=partitionable" (or "raid=part") means |
| 52 | that all auto-detected arrays are assembled as partitionable. |
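The effect of the raid= parameter can be sketched as a small command line scan. The function name raid_options is hypothetical, and accepting several options joined by commas is an assumption. The result reflects only the command line, not whether md is compiled into the kernel.

```python
def raid_options(cmdline: str) -> dict:
    """Report which raid= options appear on a kernel command line."""
    autodetect = True
    partitionable = False
    for word in cmdline.split():
        if not word.startswith("raid="):
            continue
        # Assumption: options may be combined with commas.
        for opt in word[len("raid="):].split(","):
            if opt == "noautodetect":
                autodetect = False
            elif opt in ("partitionable", "part"):
                partitionable = True
    return {"autodetect": autodetect, "partitionable": partitionable}


print(raid_options("root=/dev/md0 raid=noautodetect ro"))
```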
| 53 | |
| 54 | |
| 55 | Superblock formats |
| 56 | ------------------ |
| 57 | |
| 58 | The md driver can support a variety of different superblock formats. |
| 59 | Currently, it supports superblock formats "0.90.0" and the "md-1" format |
| 60 | introduced in the 2.5 development series. |
| 61 | |
| 62 | The kernel will autodetect which format superblock is being used. |
| 63 | |
| 64 | Superblock format '0' is treated differently to others for legacy |
| 65 | reasons - it is the original superblock format. |
| 66 | |
| 67 | |
| 68 | General Rules - apply for all superblock formats |
| 69 | ------------------------------------------------ |
| 70 | |
| 71 | An array is 'created' by writing appropriate superblocks to all |
| 72 | devices. |
| 73 | |
| 74 | It is 'assembled' by associating each of these devices with a |
| 75 | particular md virtual device. Once it is completely assembled, it can |
| 76 | be accessed. |
| 77 | |
| 78 | An array should be created by a user-space tool. This will write |
| 79 | superblocks to all devices. It will usually mark the array as |
| 80 | 'unclean', or with some devices missing so that the kernel md driver |
| 81 | can create appropriate redundancy (copying in raid1, parity |
| 82 | calculation in raid4/5). |
| 83 | |
| 84 | When an array is assembled, it is first initialized with the |
| 85 | SET_ARRAY_INFO ioctl. This contains, in particular, a major and minor |
| 86 | version number. The major version number selects which superblock |
| 87 | format is to be used. The minor number might be used to tune handling |
| 88 | of the format, such as suggesting where on each device to look for the |
| 89 | superblock. |
| 90 | |
| 91 | Then each device is added using the ADD_NEW_DISK ioctl. This |
| 92 | provides, in particular, a major and minor number identifying the |
| 93 | device to add. |
| 94 | |
| 95 | The array is started with the RUN_ARRAY ioctl. |
| 96 | |
| 97 | Once started, new devices can be added. They should have an |
| 98 | appropriate superblock written to them, and then be passed in with |
| 99 | ADD_NEW_DISK. |
| 100 | |
| 101 | Devices that have failed or are not yet active can be detached from an |
| 102 | array using HOT_REMOVE_DISK. |
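The order of the steps above can be illustrated with a toy model. It never issues an ioctl or talks to the kernel; the class ArrayModel, its method names and the device state strings are all invented for this sketch, and the device numbers in the usage lines are only illustrative.

```python
class ArrayModel:
    """Toy model of the assembly sequence; not a kernel interface."""

    def __init__(self):
        self.version = None   # (major, minor) given to SET_ARRAY_INFO
        self.devices = {}     # (major, minor) device number -> state
        self.running = False

    def set_array_info(self, major_version, minor_version):
        self.version = (major_version, minor_version)

    def add_new_disk(self, major, minor):
        if self.version is None:
            raise RuntimeError("SET_ARRAY_INFO must come first")
        # Devices are modelled as not active until the array is run.
        self.devices[(major, minor)] = "not active"

    def run_array(self):
        for dev in self.devices:
            self.devices[dev] = "active"
        self.running = True

    def hot_remove_disk(self, major, minor):
        # Only failed or not yet active devices can be detached.
        if self.devices[(major, minor)] == "active":
            raise RuntimeError("device is active")
        del self.devices[(major, minor)]


array = ArrayModel()
array.set_array_info(0, 90)
array.add_new_disk(3, 65)
array.add_new_disk(22, 1)
array.run_array()
array.add_new_disk(8, 1)      # added after start, not yet active
array.hot_remove_disk(8, 1)   # allowed because it is not active
print(sorted(array.devices))
```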
| 103 | |
| 104 | |
| 105 | Specific Rules that apply to format-0 super block arrays, and |
| 106 | arrays with no superblock (non-persistent). |
| 107 | ------------------------------------------------------------- |
| 108 | |
| 109 | An array can be 'created' by describing the array (level, chunksize |
| 110 | etc) in a SET_ARRAY_INFO ioctl. This must have major_version==0 and |
| 111 | raid_disks != 0. |
| 112 | |
| 113 | Then uninitialized devices can be added with ADD_NEW_DISK. The |
| 114 | structure passed to ADD_NEW_DISK must specify the state of the device |
| 115 | and its role in the array. |
| 116 | |
| 117 | Once started with RUN_ARRAY, uninitialized spares can be added with |
| 118 | HOT_ADD_DISK. |
NeilBrown | bb63654 | 2005-11-08 21:39:45 -0800 | 119 | |
| 120 | |
| 121 | |
| 122 | MD devices in sysfs |
| 123 | ------------------- |
| 124 | md devices appear in sysfs (/sys) as regular block devices, |
| 125 | e.g. |
| 126 | /sys/block/md0 |
| 127 | |
| 128 | Each 'md' device will contain a subdirectory called 'md' which |
| 129 | contains further md-specific information about the device. |
| 130 | |
| 131 | All md devices contain: |
| 132 | level |
| 133 | a text file indicating the 'raid level'. This may be a standard |
| 134 | numerical level prefixed by "RAID-" - e.g. "RAID-5", or some |
| 135 | other name such as "linear" or "multipath". |
| 136 | If no raid level has been set yet (array is still being |
| 137 | assembled), this file will be empty. |
| 138 | |
| 139 | raid_disks |
| 140 | a text file with a simple number indicating the number of devices |
| 141 | in a fully functional array. If this is not yet known, the file |
| 142 | will be empty. If an array is being resized (not currently |
| 143 | possible) this will contain the larger of the old and new sizes. |
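Reading these two files might look like the sketch below. The helper names are hypothetical, and the demonstration uses a temporary directory in place of a real /sys/block/md0/md so that it runs on any machine. An empty file is reported as None, matching the "not yet known" case described above.

```python
import os
import tempfile


def read_md_attr(md_dir, name):
    """Return the stripped contents of an attribute file, or None if empty."""
    with open(os.path.join(md_dir, name)) as f:
        text = f.read().strip()
    return text or None


def array_summary(md_dir):
    """Collect 'level' and 'raid_disks' from an md directory."""
    raid_disks = read_md_attr(md_dir, "raid_disks")
    return {
        "level": read_md_attr(md_dir, "level"),
        "raid_disks": int(raid_disks) if raid_disks else None,
    }


# Stand-in for /sys/block/md0/md, filled with example values.
demo_dir = tempfile.mkdtemp()
for name, content in (("level", "RAID-5\n"), ("raid_disks", "3\n")):
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write(content)

summary = array_summary(demo_dir)
print(summary)
```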
| 144 | |
| 145 | As component devices are added to an md array, they appear in the 'md' |
| 146 | directory as new directories named |
| 147 | dev-XXX |
| 148 | where XXX is a name that the kernel knows for the device, e.g. hdb1. |
| 149 | Each directory contains: |
| 150 | |
| 151 | block |
| 152 | a symlink to the block device in /sys/block, e.g. |
| 153 | /sys/block/md0/md/dev-hdb1/block -> ../../../../block/hdb/hdb1 |
| 154 | |
| 155 | super |
| 156 | A file containing an image of the superblock read from, or |
| 157 | written to, that device. |
| 158 | |
| 159 | state |
| 160 | A file recording the current state of the device in the array |
| 161 | which can be a comma separated list of |
| 162 | faulty - device has been kicked from active use due to |
| 163 | a detected fault |
| 164 | in_sync - device is a fully in-sync member of the array |
| 165 | spare - device is working, but not a full member. |
| 166 | This includes spares that are in the process |
| 167 | of being recovered to |
| 168 | This list may grow in future. |
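Because the list of states may grow, a reader of the 'state' file should tolerate words it does not know. A minimal sketch, with the hypothetical helper parse_state, separates the documented flags from anything else:

```python
KNOWN_STATES = {"faulty", "in_sync", "spare"}


def parse_state(text):
    """Split a comma separated state string into (known, unknown) flag sets."""
    flags = {flag.strip() for flag in text.split(",") if flag.strip()}
    return flags & KNOWN_STATES, flags - KNOWN_STATES


known, unknown = parse_state("in_sync\n")
print(known, unknown)
```

The second set collects any flag not listed above, so it can be logged rather than treated as an error.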
| 169 | |
| 170 | |
| 171 | An active md device will also contain an entry for each active device |
| 172 | in the array. These are named |
| 173 | |
| 174 | rdNN |
| 175 | |
| 176 | where 'NN' is the position in the array, starting from 0. |
| 177 | So for a 3 drive array there will be rd0, rd1, rd2. |
| 178 | These are symbolic links to the appropriate 'dev-XXX' entry. |
| 179 | Thus, for example, |
| 180 | cat /sys/block/md*/md/rd*/state |
| 181 | will show 'in_sync' on every line. |
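The same check can be scripted. The sketch below walks the rdNN links and lists those whose state does not include 'in_sync'. The function name slots_not_in_sync is hypothetical, and the demonstration builds a small stand-in tree in a temporary directory (assuming a platform where symbolic links can be created) instead of reading the real /sys/block.

```python
import glob
import os
import tempfile


def slots_not_in_sync(sys_block):
    """Return the rdNN paths whose state does not include 'in_sync'."""
    bad = []
    pattern = os.path.join(sys_block, "md*", "md", "rd*", "state")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            flags = f.read().strip().split(",")
        if "in_sync" not in flags:
            bad.append(os.path.dirname(path))
    return bad


# Stand-in tree: md0 with two members, each rdNN linking to dev-XXX.
root = tempfile.mkdtemp()
md = os.path.join(root, "md0", "md")
for slot, dev in enumerate(("hdb1", "hdc1")):
    dev_dir = os.path.join(md, "dev-" + dev)
    os.makedirs(dev_dir)
    with open(os.path.join(dev_dir, "state"), "w") as f:
        f.write("in_sync\n")
    os.symlink("dev-" + dev, os.path.join(md, "rd%d" % slot))

result = slots_not_in_sync(root)
print(result)
```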
| 182 | |
| 183 | |
| 184 | |
| 185 | Active md devices for levels that support data redundancy (1,4,5,6) |
| 186 | also have |
| 187 | |
| 188 | sync_action |
| 189 | a text file that can be used to monitor and control the rebuild |
| 190 | process. It contains one word which can be one of: |
| 191 | resync - redundancy is being recalculated after unclean |
| 192 | shutdown or creation |
| 193 | recover - a hot spare is being built to replace a |
| 194 | failed/missing device |
| 195 | idle - nothing is happening |
| 196 | check - A full check of redundancy was requested and is |
| 197 | happening. This reads all blocks and checks |
| 198 | them. A repair may also happen for some raid |
| 199 | levels. |
| 200 | repair - A full check and repair is happening. This is |
| 201 | similar to 'resync', but was requested by the |
| 202 | user, and the write-intent bitmap is NOT used to |
| 203 | optimise the process. |
| 204 | |
| 205 | This file is writable, and each of the strings that could be |
| 206 | read is meaningful for writing. |
| 207 | |
| 208 | 'idle' will stop an active resync/recovery etc. There is no |
| 209 | guarantee that another resync/recovery may not be automatically |
| 210 | started again, though some event will be needed to trigger |
| 211 | this. |
| 212 | 'resync' or 'recover' can be used to restart the |
| 213 | corresponding operation if it was stopped with 'idle'. |
| 214 | 'check' and 'repair' will start the appropriate process |
| 215 | providing the current state is 'idle'. |
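A user-space helper that respects the "only when idle" rule for 'check' and 'repair' could be sketched as follows. The function name request_sync_action is hypothetical, and the demonstration writes to a temporary file standing in for sync_action. On a real array the state may change between the read and the write, which this sketch does not handle.

```python
import os
import tempfile


def request_sync_action(path, action):
    """Write 'check' or 'repair' to sync_action if the current state is idle."""
    if action not in ("check", "repair"):
        raise ValueError("only 'check' and 'repair' are handled here")
    with open(path) as f:
        current = f.read().strip()
    if current != "idle":
        return False
    with open(path, "w") as f:
        f.write(action + "\n")
    return True


# Stand-in for /sys/block/md0/md/sync_action.
demo_path = os.path.join(tempfile.mkdtemp(), "sync_action")
with open(demo_path, "w") as f:
    f.write("idle\n")

started = request_sync_action(demo_path, "check")
print(started)
```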
| 216 | |
| 217 | mismatch_cnt |
| 218 | When performing 'check' and 'repair', and possibly when |
| 219 | performing 'resync', md will count the number of errors that are |
| 220 | found. The count in 'mismatch_cnt' is the number of sectors |
| 221 | that were re-written, or (for 'check') would have been |
| 222 | re-written. As most raid levels work in units of pages rather |
| 223 | than sectors, this may be larger than the number of actual errors |
| 224 | by a factor of the number of sectors in a page. |
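The factor mentioned above can be worked through with concrete numbers. The sketch assumes 512-byte sectors and 4096-byte pages, so 8 sectors per page, and assumes every mismatch is counted as one whole page. Both the constants and the function name error_bounds are assumptions for illustration.

```python
SECTOR_BYTES = 512   # assumption
PAGE_BYTES = 4096    # assumption


def error_bounds(mismatch_cnt):
    """Return (fewest, most) bad sectors consistent with a sector count."""
    sectors_per_page = PAGE_BYTES // SECTOR_BYTES
    # Fewest: one bad sector per re-written page (rounded up).
    fewest = -(-mismatch_cnt // sectors_per_page)
    # Most: every counted sector was actually bad.
    return fewest, mismatch_cnt


print(error_bounds(128))  # (16, 128)
```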
| 225 | |
| 226 | Each active md device may also have attributes specific to the |
| 227 | personality module that manages it. |
| 228 | These are specific to the implementation of the module and could |
| 229 | change substantially if the implementation changes. |
| 230 | |
| 231 | These currently include |
| 232 | |
| 233 | stripe_cache_size (currently raid5 only) |
| 234 | number of entries in the stripe cache. This is writable, but |
| 235 | there are upper and lower limits (32768, 16). Default is 128. |
| 236 | stripe_cache_active (currently raid5 only) |
| 237 | number of active entries in the stripe cache |
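Before writing a new stripe_cache_size, a tool could check the value against the limits given above. In this sketch the function name is hypothetical, and treating both limits as inclusive is an assumption.

```python
STRIPE_CACHE_MIN = 16
STRIPE_CACHE_MAX = 32768
STRIPE_CACHE_DEFAULT = 128


def stripe_cache_value(entries=STRIPE_CACHE_DEFAULT):
    """Return the text to write to stripe_cache_size, or raise if out of range."""
    # Assumption: both limits are inclusive.
    if not STRIPE_CACHE_MIN <= entries <= STRIPE_CACHE_MAX:
        raise ValueError("stripe_cache_size must be between 16 and 32768")
    return "%d\n" % entries


print(repr(stripe_cache_value(256)))
```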