Vishal Verma | 5212e11 | 2015-06-25 04:20:32 -0400 | [diff] [blame] | 1 | BTT - Block Translation Table |
| 2 | ============================= |
| 3 | |
| 4 | |
| 5 | 1. Introduction |
| 6 | --------------- |
| 7 | |
| 8 | Persistent memory based storage is able to perform IO at byte (or more |
| 9 | accurately, cache line) granularity. However, we often want to expose such |
| 10 | storage as traditional block devices. The block drivers for persistent memory |
| 11 | will do exactly this. However, they do not provide any atomicity guarantees. |
| 12 | Traditional SSDs typically provide protection against torn sectors in hardware, |
| 13 | using stored energy in capacitors to complete in-flight block writes, or perhaps |
| 14 | in firmware. We don't have this luxury with persistent memory - if a write is in |
| 15 | progress, and we experience a power failure, the block will contain a mix of old |
| 16 | and new data. Applications may not be prepared to handle such a scenario. |
| 17 | |
| 18 | The Block Translation Table (BTT) provides atomic sector update semantics for |
| 19 | persistent memory devices, so that applications that rely on sector writes not |
| 20 | being torn can continue to do so. The BTT manifests itself as a stacked block |
| 21 | device, and reserves a portion of the underlying storage for its metadata. At |
| 22 | the heart of it, is an indirection table that re-maps all the blocks on the |
| 23 | volume. It can be thought of as an extremely simple file system that only |
| 24 | provides atomic sector updates. |
| 25 | |
| 26 | |
| 27 | 2. Static Layout |
| 28 | ---------------- |
| 29 | |
| 30 | The underlying storage on which a BTT can be laid out is not limited in any way. |
| 31 | The BTT, however, splits the available space into chunks of up to 512 GiB, |
| 32 | called "Arenas". |
| 33 | |
| 34 | Each arena follows the same layout for its metadata, and all references in an |
| 35 | arena are internal to it (with the exception of one field that points to the |
| 36 | next arena). The following depicts the "On-disk" metadata layout: |
| 37 | |
| 38 | |
| 39 | Backing Store +-------> Arena |
| 40 | +---------------+ | +------------------+ |
| 41 | | | | | Arena info block | |
| 42 | | Arena 0 +---+ | 4K | |
| 43 | | 512G | +------------------+ |
| 44 | | | | | |
| 45 | +---------------+ | | |
| 46 | | | | | |
| 47 | | Arena 1 | | Data Blocks | |
| 48 | | 512G | | | |
| 49 | | | | | |
| 50 | +---------------+ | | |
| 51 | | . | | | |
| 52 | | . | | | |
| 53 | | . | | | |
| 54 | | | | | |
| 55 | | | | | |
| 56 | +---------------+ +------------------+ |
| 57 | | | |
| 58 | | BTT Map | |
| 59 | | | |
| 60 | | | |
| 61 | +------------------+ |
| 62 | | | |
| 63 | | BTT Flog | |
| 64 | | | |
| 65 | +------------------+ |
| 66 | | Info block copy | |
| 67 | | 4K | |
| 68 | +------------------+ |
| 69 | |
| 70 | |
| 71 | 3. Theory of Operation |
| 72 | ---------------------- |
| 73 | |
| 74 | |
| 75 | a. The BTT Map |
| 76 | -------------- |
| 77 | |
| 78 | The map is a simple lookup/indirection table that maps an LBA to an internal |
| 79 | block. Each map entry is 32 bits. The two most significant bits are special |
| 80 | flags, and the remaining form the internal block number. |
| 81 | |
| 82 | Bit Description |
Dan Williams | bc30196 | 2015-06-25 04:48:19 -0400 | [diff] [blame^] | 83 | 31 - 30 : Error and Zero flags - Used in the following way: |
| 84 | Bit Description |
| 85 | 31 30 |
| 86 | ----------------------------------------------------------------------- |
| 87 | 00 Initial state. Reads return zeroes; Premap = Postmap |
| 88 | 01 Zero state: Reads return zeroes |
| 89 | 10 Error state: Reads fail; Writes clear 'E' bit |
| 90 | 11 Normal Block – has valid postmap |
| 91 | |
| 92 | |
| 93 | 29 - 0 : Mappings to internal 'postmap' blocks |
Vishal Verma | 5212e11 | 2015-06-25 04:20:32 -0400 | [diff] [blame] | 94 | |
| 95 | |
| 96 | Some of the terminology that will be subsequently used: |
| 97 | |
| 98 | External LBA : LBA as made visible to upper layers. |
| 99 | ABA : Arena Block Address - Block offset/number within an arena |
| 100 | Premap ABA : The block offset into an arena, which was decided upon by range |
| 101 | checking the External LBA |
| 102 | Postmap ABA : The block number in the "Data Blocks" area obtained after |
| 103 | indirection from the map |
| 104 | nfree : The number of free blocks that are maintained at any given time. |
| 105 | This is the number of concurrent writes that can happen to the |
| 106 | arena. |
| 107 | |
| 108 | |
| 109 | For example, after adding a BTT, we surface a disk of 1024G. We get a read for |
| 110 | the external LBA at 768G. This falls into the second arena, and of the 512G |
| 111 | worth of blocks that this arena contributes, this block is at 256G. Thus, the |
| 112 | premap ABA is 256G. We now refer to the map, and find out the mapping for block |
| 113 | 'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64. |
| 114 | |
| 115 | |
| 116 | b. The BTT Flog |
| 117 | --------------- |
| 118 | |
| 119 | The BTT provides sector atomicity by making every write an "allocating write", |
| 120 | i.e. Every write goes to a "free" block. A running list of free blocks is |
| 121 | maintained in the form of the BTT flog. 'Flog' is a combination of the words |
| 122 | "free list" and "log". The flog contains 'nfree' entries, and an entry contains: |
| 123 | |
| 124 | lba : The premap ABA that is being written to |
| 125 | old_map : The old postmap ABA - after 'this' write completes, this will be a |
| 126 | free block. |
| 127 | new_map : The new postmap ABA. The map will up updated to reflect this |
| 128 | lba->postmap_aba mapping, but we log it here in case we have to |
| 129 | recover. |
| 130 | seq : Sequence number to mark which of the 2 sections of this flog entry is |
| 131 | valid/newest. It cycles between 01->10->11->01 (binary) under normal |
| 132 | operation, with 00 indicating an uninitialized state. |
| 133 | lba' : alternate lba entry |
| 134 | old_map': alternate old postmap entry |
| 135 | new_map': alternate new postmap entry |
| 136 | seq' : alternate sequence number. |
| 137 | |
Dan Williams | bc30196 | 2015-06-25 04:48:19 -0400 | [diff] [blame^] | 138 | Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also |
| 139 | padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are |
Vishal Verma | 5212e11 | 2015-06-25 04:20:32 -0400 | [diff] [blame] | 140 | done such that for any entry being written, it: |
| 141 | a. overwrites the 'old' section in the entry based on sequence numbers |
Dan Williams | bc30196 | 2015-06-25 04:48:19 -0400 | [diff] [blame^] | 142 | b. writes the 'new' section such that the sequence number is written last. |
Vishal Verma | 5212e11 | 2015-06-25 04:20:32 -0400 | [diff] [blame] | 143 | |
| 144 | |
| 145 | c. The concept of lanes |
| 146 | ----------------------- |
| 147 | |
| 148 | While 'nfree' describes the number of concurrent IOs an arena can process |
| 149 | concurrently, 'nlanes' is the number of IOs the BTT device as a whole can |
| 150 | process. |
| 151 | nlanes = min(nfree, num_cpus) |
| 152 | A lane number is obtained at the start of any IO, and is used for indexing into |
Dan Williams | bc30196 | 2015-06-25 04:48:19 -0400 | [diff] [blame^] | 153 | all the on-disk and in-memory data structures for the duration of the IO. If |
| 154 | there are more CPUs than the max number of available lanes, than lanes are |
| 155 | protected by spinlocks. |
Vishal Verma | 5212e11 | 2015-06-25 04:20:32 -0400 | [diff] [blame] | 156 | |
| 157 | |
| 158 | d. In-memory data structure: Read Tracking Table (RTT) |
| 159 | ------------------------------------------------------ |
| 160 | |
| 161 | Consider a case where we have two threads, one doing reads and the other, |
| 162 | writes. We can hit a condition where the writer thread grabs a free block to do |
| 163 | a new IO, but the (slow) reader thread is still reading from it. In other words, |
| 164 | the reader consulted a map entry, and started reading the corresponding block. A |
| 165 | writer started writing to the same external LBA, and finished the write updating |
| 166 | the map for that external LBA to point to its new postmap ABA. At this point the |
| 167 | internal, postmap block that the reader is (still) reading has been inserted |
| 168 | into the list of free blocks. If another write comes in for the same LBA, it can |
| 169 | grab this free block, and start writing to it, causing the reader to read |
| 170 | incorrect data. To prevent this, we introduce the RTT. |
| 171 | |
| 172 | The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts |
| 173 | into rtt[lane_number], the postmap ABA it is reading, and clears it after the |
| 174 | read is complete. Every writer thread, after grabbing a free block, checks the |
| 175 | RTT for its presence. If the postmap free block is in the RTT, it waits till the |
| 176 | reader clears the RTT entry, and only then starts writing to it. |
| 177 | |
| 178 | |
| 179 | e. In-memory data structure: map locks |
| 180 | -------------------------------------- |
| 181 | |
| 182 | Consider a case where two writer threads are writing to the same LBA. There can |
| 183 | be a race in the following sequence of steps: |
| 184 | |
| 185 | free[lane] = map[premap_aba] |
| 186 | map[premap_aba] = postmap_aba |
| 187 | |
| 188 | Both threads can update their respective free[lane] with the same old, freed |
| 189 | postmap_aba. This has made the layout inconsistent by losing a free entry, and |
| 190 | at the same time, duplicating another free entry for two lanes. |
| 191 | |
| 192 | To solve this, we could have a single map lock (per arena) that has to be taken |
| 193 | before performing the above sequence, but we feel that could be too contentious. |
| 194 | Instead we use an array of (nfree) map_locks that is indexed by |
| 195 | (premap_aba modulo nfree). |
| 196 | |
| 197 | |
| 198 | f. Reconstruction from the Flog |
| 199 | ------------------------------- |
| 200 | |
| 201 | On startup, we analyze the BTT flog to create our list of free blocks. We walk |
| 202 | through all the entries, and for each lane, of the set of two possible |
| 203 | 'sections', we always look at the most recent one only (based on the sequence |
| 204 | number). The reconstruction rules/steps are simple: |
| 205 | - Read map[log_entry.lba]. |
| 206 | - If log_entry.new matches the map entry, then log_entry.old is free. |
| 207 | - If log_entry.new does not match the map entry, then log_entry.new is free. |
| 208 | (This case can only be caused by power-fails/unsafe shutdowns) |
| 209 | |
| 210 | |
| 211 | g. Summarizing - Read and Write flows |
| 212 | ------------------------------------- |
| 213 | |
| 214 | Read: |
| 215 | |
| 216 | 1. Convert external LBA to arena number + pre-map ABA |
| 217 | 2. Get a lane (and take lane_lock) |
| 218 | 3. Read map to get the entry for this pre-map ABA |
| 219 | 4. Enter post-map ABA into RTT[lane] |
| 220 | 5. If TRIM flag set in map, return zeroes, and end IO (go to step 8) |
| 221 | 6. If ERROR flag set in map, end IO with EIO (go to step 8) |
| 222 | 7. Read data from this block |
| 223 | 8. Remove post-map ABA entry from RTT[lane] |
| 224 | 9. Release lane (and lane_lock) |
| 225 | |
| 226 | Write: |
| 227 | |
| 228 | 1. Convert external LBA to Arena number + pre-map ABA |
| 229 | 2. Get a lane (and take lane_lock) |
| 230 | 3. Use lane to index into in-memory free list and obtain a new block, next flog |
| 231 | index, next sequence number |
| 232 | 4. Scan the RTT to check if free block is present, and spin/wait if it is. |
| 233 | 5. Write data to this free block |
| 234 | 6. Read map to get the existing post-map ABA entry for this pre-map ABA |
| 235 | 7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num] |
| 236 | 8. Write new post-map ABA into map. |
| 237 | 9. Write old post-map entry into the free list |
| 238 | 10. Calculate next sequence number and write into the free list entry |
| 239 | 11. Release lane (and lane_lock) |
| 240 | |
| 241 | |
| 242 | 4. Error Handling |
| 243 | ================= |
| 244 | |
| 245 | An arena would be in an error state if any of the metadata is corrupted |
| 246 | irrecoverably, either due to a bug or a media error. The following conditions |
| 247 | indicate an error: |
| 248 | - Info block checksum does not match (and recovering from the copy also fails) |
| 249 | - All internal available blocks are not uniquely and entirely addressed by the |
| 250 | sum of mapped blocks and free blocks (from the BTT flog). |
| 251 | - Rebuilding free list from the flog reveals missing/duplicate/impossible |
| 252 | entries |
| 253 | - A map entry is out of bounds |
| 254 | |
| 255 | If any of these error conditions are encountered, the arena is put into a read |
| 256 | only state using a flag in the info block. |
| 257 | |
| 258 | |
| 259 | 5. In-kernel usage |
| 260 | ================== |
| 261 | |
| 262 | Any block driver that supports byte granularity IO to the storage may register |
| 263 | with the BTT. It will have to provide the rw_bytes interface in its |
| 264 | block_device_operations struct: |
| 265 | |
| 266 | int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw); |
| 267 | |
| 268 | It may register with the BTT after it adds its own gendisk, using btt_init: |
| 269 | |
| 270 | struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize, |
| 271 | u32 lbasize, u8 uuid[], int maxlane); |
| 272 | |
| 273 | note that maxlane is the maximum amount of concurrency the driver wishes to |
| 274 | allow the BTT to use. |
| 275 | |
| 276 | The BTT 'disk' appears as a stacked block device that grabs the underlying block |
| 277 | device in the O_EXCL mode. |
| 278 | |
| 279 | When the driver wishes to remove the backing disk, it should similarly call |
| 280 | btt_fini using the same struct btt* handle that was provided to it by btt_init. |
| 281 | |
| 282 | void btt_fini(struct btt *btt); |
| 283 | |