Immutable biovecs and biovec iterators:
=======================================

Kent Overstreet <kmo@daterainc.com>

As of 3.13, biovecs should never be modified after a bio has been submitted.
Instead, we have a new struct bvec_iter which represents a range of a biovec -
the iterator will be modified as the bio is completed, not the biovec.

More specifically, old code that needed to partially complete a bio would
update bi_sector and bi_size, and advance bi_idx to the next biovec. If it
ended up partway through a biovec, it would increment bv_offset and decrement
bv_len by the number of bytes completed in that biovec.

In the new scheme of things, everything that must be mutated in order to
partially complete a bio is segregated into struct bvec_iter: bi_sector,
bi_size and bi_idx have been moved there; and instead of modifying bv_offset
and bv_len, struct bvec_iter has bi_bvec_done, which represents the number of
bytes completed in the current bvec.

There are a bunch of new helper macros for hiding the gory details - in
particular, presenting the illusion of partially completed biovecs so that
normal code doesn't have to deal with bi_bvec_done.
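
To make the bookkeeping concrete, here is a rough userspace model of the
scheme - not kernel code. The field names (bv_len, bv_offset, bi_sector,
bi_size, bi_idx, bi_bvec_done) come from the text above; the model_ prefix,
the simplified struct layouts (no page pointer) and the 512 byte sector size
are assumptions made for the sketch.

```c
#include <assert.h>

/* Simplified stand-in for struct bio_vec; the page pointer is left out. */
struct model_bio_vec {
    unsigned int bv_len;     /* bytes in this segment */
    unsigned int bv_offset;  /* offset of the data within its page */
};

/* Everything that changes while a bio is partially completed. */
struct model_bvec_iter {
    unsigned long long bi_sector;    /* current sector, 512 byte units assumed */
    unsigned int       bi_size;      /* bytes remaining */
    unsigned int       bi_idx;       /* index of the current biovec */
    unsigned int       bi_bvec_done; /* bytes completed in the current biovec */
};

/*
 * Advance the iterator by 'bytes'. The biovec array is only read, never
 * written: partial progress through a segment is recorded in bi_bvec_done
 * instead of being folded into bv_offset and bv_len.
 */
static void model_iter_advance(const struct model_bio_vec *bv,
                               struct model_bvec_iter *iter,
                               unsigned int bytes)
{
    assert(bytes <= iter->bi_size);

    iter->bi_sector += bytes >> 9;
    iter->bi_size -= bytes;

    /* Count from the start of the current segment, then skip whole ones. */
    bytes += iter->bi_bvec_done;
    while (bytes && bytes >= bv[iter->bi_idx].bv_len) {
        bytes -= bv[iter->bi_idx].bv_len;
        iter->bi_idx++;
    }
    iter->bi_bvec_done = bytes;
}
```

With two 4096 byte segments, advancing by 1024 bytes leaves bi_idx at 0 and
sets bi_bvec_done to 1024; the array itself is untouched, which is the whole
point.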

 * Driver code should no longer refer to biovecs directly; we now have
   bio_iovec() and bio_iter_iovec() macros that return literal struct bio_vecs,
   constructed from the raw biovecs but taking into account bi_bvec_done and
   bi_size.

   bio_for_each_segment() has been updated to take a bvec_iter argument
   instead of an integer (that corresponded to bi_idx); for a lot of code the
   conversion just required changing the types of the arguments to
   bio_for_each_segment().

 * Advancing a bvec_iter is done with bio_advance_iter(); bio_advance() is a
   wrapper around bio_advance_iter() that operates on bio->bi_iter, and also
   advances the bio integrity's iter if present.

   There is a lower level advance function - bvec_iter_advance() - which takes
   a pointer to a biovec, not a bio; this is used by the bio integrity code.
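
The "illusion" the helpers present can be sketched in userspace as well. The
code below is a model under assumptions, not the kernel macros themselves:
model_iter_iovec() plays the role of bio_iter_iovec(), and
model_count_segments() walks segments the way a bio_for_each_segment() loop
would. All model_ names and the cut-down structs are made up for the
example, and every segment is assumed to have a nonzero length.

```c
#include <assert.h>

struct model_bio_vec {
    unsigned int bv_len;
    unsigned int bv_offset;
};

struct model_bvec_iter {
    unsigned int bi_size;      /* bytes remaining */
    unsigned int bi_idx;       /* index of the current biovec */
    unsigned int bi_bvec_done; /* bytes completed in the current biovec */
};

/*
 * Build the current segment by value from the raw biovec, taking
 * bi_bvec_done and bi_size into account. Only valid while bi_size != 0.
 */
static struct model_bio_vec model_iter_iovec(const struct model_bio_vec *bv,
                                             struct model_bvec_iter iter)
{
    struct model_bio_vec cur;
    unsigned int left = bv[iter.bi_idx].bv_len - iter.bi_bvec_done;

    cur.bv_len = left < iter.bi_size ? left : iter.bi_size;
    cur.bv_offset = bv[iter.bi_idx].bv_offset + iter.bi_bvec_done;
    return cur;
}

static void model_iter_advance(const struct model_bio_vec *bv,
                               struct model_bvec_iter *iter,
                               unsigned int bytes)
{
    assert(bytes <= iter->bi_size);
    iter->bi_size -= bytes;
    bytes += iter->bi_bvec_done;
    while (bytes && bytes >= bv[iter->bi_idx].bv_len) {
        bytes -= bv[iter->bi_idx].bv_len;
        iter->bi_idx++;
    }
    iter->bi_bvec_done = bytes;
}

/*
 * Walk every remaining segment. The iterator is passed by value, so the
 * caller's copy is left alone. Termination depends on bi_size only.
 */
static unsigned int model_count_segments(const struct model_bio_vec *bv,
                                         struct model_bvec_iter iter,
                                         unsigned int *total_bytes)
{
    unsigned int segs = 0;

    *total_bytes = 0;
    while (iter.bi_size) {
        struct model_bio_vec cur = model_iter_iovec(bv, iter);

        *total_bytes += cur.bv_len;
        segs++;
        model_iter_advance(bv, &iter, cur.bv_len);
    }
    return segs;
}
```

The caller never looks at bi_bvec_done: it only sees segments whose offset
and length already reflect how much has been completed.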

What's all this get us?
=======================

Having a real iterator, and making biovecs immutable, has a number of
advantages:

 * Before, iterating over bios was very awkward when you weren't processing
   exactly one bvec at a time - for example, bio_copy_data() in fs/bio.c,
   which copies the contents of one bio into another. Because the biovecs
   wouldn't necessarily be the same size, the old code was tricky and
   convoluted - it had to walk two different bios at the same time, keeping
   both bi_idx and an offset into the current biovec for each.

   The new code is much more straightforward - have a look. This sort of
   pattern comes up in a lot of places; a lot of drivers were essentially open
   coding bvec iterators before, and having a common implementation
   considerably simplifies a lot of code.

 * Before, any code that might need to use the biovec after the bio had been
   completed (perhaps to copy the data somewhere else, or perhaps to resubmit
   it somewhere else if there was an error) had to save the entire bvec array
   - again, this was being done in a fair number of places. Now that the
   biovec is never modified, saving a copy of the bvec_iter is enough.

 * Biovecs can be shared between multiple bios - a bvec iter can represent an
   arbitrary range of an existing biovec, both starting and ending midway
   through biovecs. This is what enables efficient splitting of arbitrary
   bios. Note that this means we _only_ use bi_size to determine when we've
   reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes
   bi_size into account when constructing biovecs.

 * Splitting bios is now much simpler. The old bio_split() didn't even work on
   bios with more than a single bvec! Now, we can efficiently split arbitrary
   size bios - because the new bio can share the old bio's biovec.

   Care must be taken, though, to ensure the biovec isn't freed while the
   split bio is still using it, in case the original bio completes first.
   Using bio_chain() when splitting bios helps with this.

 * Submitting partially completed bios is now perfectly fine - this comes up
   occasionally in stacking block drivers, and various code (e.g. md and
   bcache) had some ugly workarounds for this.

   It used to be the case that submitting a partially completed bio would work
   fine to _most_ devices, but since accessing the raw bvec array was the
   norm, not all drivers would respect bi_idx and those would break. Now,
   since all drivers _must_ go through the bvec iterator - and have been
   audited to make sure they do - submitting partially completed bios is
   perfectly fine.
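
As an illustration of the first point above (copying between two bios whose
segments don't line up), here is a userspace sketch. It is not the real
bio_copy_data(); it only shows how two independent iterators remove the need
to juggle an index and an offset per bio by hand. The model_ names are made
up, and bv_base is an assumption standing in for the page and offset that a
real biovec carries.

```c
#include <assert.h>
#include <string.h>

struct model_bio_vec {
    unsigned char *bv_base;  /* assumed stand-in for page + bv_offset */
    unsigned int   bv_len;
};

struct model_bvec_iter {
    unsigned int bi_size;      /* bytes remaining */
    unsigned int bi_idx;       /* index of the current biovec */
    unsigned int bi_bvec_done; /* bytes completed in the current biovec */
};

static void model_iter_advance(const struct model_bio_vec *bv,
                               struct model_bvec_iter *iter,
                               unsigned int bytes)
{
    assert(bytes <= iter->bi_size);
    iter->bi_size -= bytes;
    bytes += iter->bi_bvec_done;
    while (bytes && bytes >= bv[iter->bi_idx].bv_len) {
        bytes -= bv[iter->bi_idx].bv_len;
        iter->bi_idx++;
    }
    iter->bi_bvec_done = bytes;
}

/*
 * Copy from src to dst until either side runs out; returns bytes copied.
 * Both iterators are taken by value, so the callers' copies are unchanged.
 */
static unsigned int model_copy_data(const struct model_bio_vec *dst_bv,
                                    struct model_bvec_iter dst,
                                    const struct model_bio_vec *src_bv,
                                    struct model_bvec_iter src)
{
    unsigned int copied = 0;

    while (src.bi_size && dst.bi_size) {
        const struct model_bio_vec *s = &src_bv[src.bi_idx];
        const struct model_bio_vec *d = &dst_bv[dst.bi_idx];
        unsigned int s_left = s->bv_len - src.bi_bvec_done;
        unsigned int d_left = d->bv_len - dst.bi_bvec_done;
        unsigned int n = s_left < d_left ? s_left : d_left;

        /* Never run past the end of either range. */
        if (n > src.bi_size)
            n = src.bi_size;
        if (n > dst.bi_size)
            n = dst.bi_size;

        memcpy(d->bv_base + dst.bi_bvec_done,
               s->bv_base + src.bi_bvec_done, n);

        model_iter_advance(src_bv, &src, n);
        model_iter_advance(dst_bv, &dst, n);
        copied += n;
    }
    return copied;
}
```

Each pass copies as much as fits in both current segments, then lets each
iterator work out for itself whether it has moved on to its next biovec.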

Other implications:
===================

 * Almost all usage of bi_idx is now incorrect and has been removed; instead,
   where previously you would have used bi_idx you'd now use a bvec_iter,
   probably passing it to one of the helper macros.

   I.e. instead of using bio_iovec_idx() (or bio->bi_io_vec[bio->bi_idx]), you
   now use bio_iter_iovec(), which takes a bvec_iter and returns a
   literal struct bio_vec - constructed on the fly from the raw biovec but
   taking into account bi_bvec_done (and bi_size).

 * bi_vcnt can't be trusted or relied upon by driver code - i.e. anything that
   doesn't actually own the bio. The reason is twofold: firstly, it's not
   actually needed for iterating over the bio anymore - we only use bi_size.
   Secondly, when cloning a bio and reusing (a portion of) the original bio's
   biovec, in order to calculate bi_vcnt for the new bio we'd have to iterate
   over all the biovecs in the new bio - which is silly as it's not needed.

   So, don't use bi_vcnt anymore.

 * The current interface allows the block layer to split bios as needed, so we
   could eliminate a lot of complexity, particularly in stacked drivers. Code
   that creates bios can then create whatever size bios are convenient, and
   more importantly stacked drivers don't have to deal with both their own bio
   size limitations and the limitations of the underlying devices. Thus
   there's no need to define ->merge_bvec_fn() callbacks for individual block
   drivers.
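
Why splitting is cheap once iterators exist can be shown with one more
userspace sketch. Again this is a model under assumptions, not the kernel's
bio_split(): model_iter_split() is a made-up name, the split point is
assumed to be a multiple of a 512 byte sector, and the lifetime issue
mentioned earlier (keeping the shared biovec alive, e.g. via bio_chain()) is
not modelled at all.

```c
#include <assert.h>

struct model_bio_vec {
    unsigned int bv_len;
    unsigned int bv_offset;
};

struct model_bvec_iter {
    unsigned long long bi_sector;    /* current sector, 512 byte units assumed */
    unsigned int       bi_size;      /* bytes remaining */
    unsigned int       bi_idx;       /* index of the current biovec */
    unsigned int       bi_bvec_done; /* bytes completed in the current biovec */
};

static void model_iter_advance(const struct model_bio_vec *bv,
                               struct model_bvec_iter *iter,
                               unsigned int bytes)
{
    assert(bytes <= iter->bi_size);
    iter->bi_sector += bytes >> 9;
    iter->bi_size -= bytes;
    bytes += iter->bi_bvec_done;
    while (bytes && bytes >= bv[iter->bi_idx].bv_len) {
        bytes -= bv[iter->bi_idx].bv_len;
        iter->bi_idx++;
    }
    iter->bi_bvec_done = bytes;
}

/*
 * Split the range described by *iter after 'bytes' bytes. The returned
 * iterator covers the front part; *iter is advanced to cover the rest.
 * Both keep referring to the same, unmodified biovec array, and neither
 * needs a segment count: each range ends when its bi_size reaches zero.
 */
static struct model_bvec_iter model_iter_split(const struct model_bio_vec *bv,
                                               struct model_bvec_iter *iter,
                                               unsigned int bytes)
{
    struct model_bvec_iter front = *iter;

    assert(bytes > 0 && bytes < iter->bi_size);
    front.bi_size = bytes;
    model_iter_advance(bv, iter, bytes);
    return front;
}
```

Splitting three 4096 byte segments at 6144 bytes gives a front range that
ends halfway through the second biovec and a remainder that starts there
(bi_idx 1, bi_bvec_done 2048); no biovec is copied or edited.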