dm-log-writes
=============

This target takes 2 devices, one to pass all IO to normally, and one to log all
of the write operations to. This is intended for file system developers wishing
to verify the integrity of metadata or data as the file system is written to.
There is a log_write_entry written for every WRITE request and the target is
able to take arbitrary data from userspace to insert into the log. The data
that is in the WRITE requests is copied into the log to make the replay happen
exactly as it happened originally.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache. This means that normal WRITE requests are not actually logged until the
next REQ_FLUSH request. This lets userspace replay the log in a way that
correlates to what is on disk rather than what is in cache, which makes it
easier to detect improper waiting/flushing.
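
As a rough sketch of this behavior (assuming the log-writes device is already
set up and mounted at /mnt/btrfs-test as in the examples below), a buffered
write only shows up in the log once a later flush completes:

  dd if=/dev/zero of=/mnt/btrfs-test/file bs=4K count=1  # WRITEs not logged yet
  sync   # issues a flush; the WRITEs above are logged once it completes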

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_FLUSH request we splice this list onto the request, and once
the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
WRITEs that have completed by the time the REQ_FLUSH is issued are added, in
order to simulate the worst case scenario with regard to power failures.
Consider the following example (W means write, C means complete):

  W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

  W3,W2,flush,W1....

Again, this is to simulate what is actually on disk; it allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, as those requests will obviously bypass the device cache.

Any REQ_DISCARD requests are treated like WRITE requests. Otherwise we would
log all of the DISCARD requests first, then the WRITE requests, and then the
FLUSH request. Consider the following example:

  WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

  DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   dev_path     : Device that all of the IO will go to normally.
   log_dev_path : Device where the log entries are written to.
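
   A minimal sketch of loading such a table with dmsetup, assuming (as in the
   examples below) that /dev/sdb is the device IO should pass through to and
   /dev/sdc holds the log:

     TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
     dmsetup create log --table "$TABLE"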

ii) Status

   <#logged entries> <highest allocated sector>

   #logged entries          : Number of logged entries
   highest allocated sector : Highest allocated sector
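
   Reading this back with dmsetup status might look something like the
   following (the numbers here are purely illustrative):

     dmsetup status log
     0 976773168 log-writes 1337 104888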

iii) Messages

   mark <description>

     You can use a dmsetup message to set an arbitrary mark in a log.
     For example, say you want to fsck a file system after every
     write, but first you need to replay up to the mkfs mark to make
     sure you're fsck'ing something reasonable. You would do something
     like this:

       mkfs.btrfs -f /dev/mapper/log
       dmsetup message log 0 mark mkfs
       <run test>

     This would allow you to replay the log up to the mkfs mark and
     then replay from that point on, running the fsck check at whatever
     interval you want.
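
     As a concrete sketch (using the replay-log tool described in the
     "Userspace component" section below, with the same /dev/sdb and /dev/sdc
     devices as in the later examples), replaying up to that mark would look
     like:

       replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs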

     Every log has a mark at the end labeled "dm-log-writes-end".

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system. You would do something like
this:

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify md5sum's are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation. You could do this with:

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
    --fsck "btrfsck /dev/sdb" --check fua

This will replay the log until it sees a FUA request, then run the fsck
command; if the fsck passes it will continue replaying to the next FUA, and so
on until the replay completes or the fsck command exits abnormally.