dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths. The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there are a
large number of fixed-size address regions but no simple pattern (such
as the striping used by dm-stripe) that would allow for a compact
representation of the mapping.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture. In this architecture, the storage group
consists of a number of distinct storage arrays ("members"), each
having independent controllers, disk storage and network adapters.
When a LUN is created it is spread across multiple members. The
details of the spreading are hidden from initiators connected to this
storage system. The storage group exposes a single target discovery
portal, no matter how many members are being used. When iSCSI sessions
are created, each session is connected to an Ethernet port on a single
member. Data to a LUN can be sent on any iSCSI session, and if the
blocks being accessed are stored on another member the I/O will be
forwarded as required. This forwarding is invisible to the initiator.
The storage layout is also dynamic, and the blocks stored on disk may
be moved from member to member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators. In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth. An initiator could use a simple
round-robin algorithm to send I/O across all paths and let the storage
array members forward it as necessary, but there is a performance
advantage to sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets. However, in this architecture the LUN
is spread with an address region size on the order of tens of
megabytes, which means the resulting table could have more than a
million entries and consume far too much memory.

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier - Determine which array member the I/O should be sent to.
    Lower Tier - Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths. We also build a
non-preferred priority group containing paths to other array members for
failover reasons.

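As an illustrative sketch of one such lower-tier device (the
major:minor numbers, device size and repeat counts below are
hypothetical), a multipath table with a preferred round-robin group of
two direct paths and a non-preferred failover group of one indirect
path might be loaded like this:

    # 0 features, 0 hw handler args, 2 priority groups, start with group 1.
    # Group 1: round-robin over the two direct paths (8:16, 8:32).
    # Group 2: a path that reaches the member indirectly (8:48).
    dmsetup create member0 --table "0 2097152 multipath 0 0 2 1 \
        round-robin 0 2 1 8:16 1000 8:32 1000 \
        round-robin 0 1 1 8:48 1000"
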
The upper tier consists of a single dm-switch device. This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O. By using a bitmap we are able to
use 4 bits for each address range in a 16 member group (which is very
large for us). This is a much denser representation than the dm table
b-tree can achieve.

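To illustrate the density (the LUN and region sizes here are chosen
only for the arithmetic): with 16 members, each region needs just 4
bits, so a 1 TiB LUN spread in 1 MiB regions has

    1 TiB / 1 MiB = 1048576 regions
    1048576 regions * 4 bits = 512 KiB of bitmap

whereas a device-mapper table would need over a million entries.
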
Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...]
    [<dev_path> <offset>]+

<num_paths>
    The number of paths across which to distribute the I/O.

<region_size>
    The number of 512-byte sectors in a region. Each region can be redirected
    to any of the available paths.

<num_optional_args>
    The number of optional arguments. Currently, no optional arguments
    are supported and so this must be zero.

<dev_path>
    The block device that represents a specific path to the device.

<offset>
    The offset of the start of data on the specific <dev_path> (in units
    of 512-byte sectors). This number is added to the sector number when
    forwarding the request to the specific path. Typically it is zero.

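For instance, the following table line (the device names are
hypothetical) distributes a 1 GiB device across two paths in 1 MiB
regions:

    0 2097152 switch 2 2048 0 /dev/mapper/mpatha 0 /dev/mapper/mpathb 0

Here 2 is <num_paths>, 2048 is <region_size> (2048 * 512 bytes = 1 MiB),
0 is <num_optional_args>, and each <dev_path> is followed by its
<offset>.
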
Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in constructor parameters).
    If index is omitted, the next region (previous index + 1) is used.
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

R<n>,<m>
    This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
    are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
    slots.

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0, vg1/switch1 and vg1/switch2,
all of the same size.

Create a switch device with a 64kB (128-sector) region size:
    dmsetup create switch --table "0 `blockdev --getsize /dev/vg1/switch0`
        switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1:
    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1

Set a repetitive mapping. This command:
    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
is equivalent to:
    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
        :1 :2 :1 :2 :1 :2 :1 :2

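As a usage note, the standard dmsetup commands can be used to inspect
the device after it is loaded:

    # Should echo the constructor parameters back.
    dmsetup table switch
    # Prints only the standard start/length/target fields, since this
    # target reports no status line.
    dmsetup status switch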