dm-log-writes
=============

This target takes two devices: one that all IO is passed to normally, and one
that all of the write operations are logged to. It is intended for file system
developers who want to verify the integrity of metadata or data as the file
system is written to. A log_write_entry is written for every WRITE request, and
the target can also take arbitrary data from userspace and insert it into the
log. The data in each WRITE request is copied into the log so that the replay
happens exactly as it happened originally.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache. This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request. This makes it easier for userspace to replay the log
in a way that correlates to what is on disk and not what is in cache, which in
turn makes it easier to detect improper waiting/flushing.

This works by attaching each WRITE request to a list once the write completes.
When we see a REQ_PREFLUSH request we splice this list onto the request, and
once the FLUSH request completes we log all of the WRITEs and then the FLUSH.
Only WRITEs that have completed at the time the REQ_PREFLUSH is issued are
added, in order to simulate the worst case scenario with regard to power
failures. Consider the following example (W means write, C means complete):

W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

W3,W2,flush,W1....

Again, this is to simulate what is actually on disk. It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.
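
The ordering rule above can be tried out with a small shell sketch. This is a
hypothetical simulator, not part of the target or of any tool: the function
name simulate_log and the event syntax (taken from the example above) are
assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical simulator of the logging order described above; it does not
# talk to the kernel. Events are comma separated:
#   W<n>   write n submitted      C<n>   write n completed
#   Wflush flush submitted        Cflush flush completed
simulate_log() {
    pending=""    # completed writes waiting for the next flush
    attached=""   # writes spliced onto the in-flight flush
    log=""
    for ev in $(printf '%s' "$1" | tr ',' ' '); do
        case "$ev" in
            Wflush) attached=$pending; pending="" ;;
            Cflush) log="${log:+$log,}${attached:+$attached,}flush"
                    attached="" ;;
            C*)     pending="${pending:+$pending,}W${ev#C}" ;;
            W*)     ;;   # a submitted write is not logged until it completes
        esac
    done
    printf '%s\n' "$log"
}

simulate_log "W1,W2,W3,C3,C2,Wflush,C1,Cflush"   # prints W3,W2,flush
```

W1 completes after the flush was issued, so it stays pending and would only be
logged together with a later flush, which is what the trailing W1.... in the
example stands for.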

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, since those requests bypass the device cache.

Any REQ_DISCARD requests are treated like WRITE requests. Otherwise the log
would contain all the DISCARD requests first, then the WRITE requests, and then
the FLUSH request. Consider the following example:

WRITE block 1, DISCARD block 1, FLUSH

If we logged the DISCARD when it completed, the replay would look like this:

DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

    log-writes <dev_path> <log_dev_path>

    dev_path     : Device that all of the IO will go to normally.
    log_dev_path : Device where the log entries are written to.

ii) Status

    <#logged entries> <highest allocated sector>

    #logged entries          : Number of logged entries
    highest allocated sector : Highest allocated sector

iii) Messages

    mark <description>

        You can use a dmsetup message to set an arbitrary mark in a log.
        For example, say you want to fsck a file system after every
        write, but first you need to replay up to the mkfs to make sure
        you're fsck'ing something reasonable. You would do something
        like this:

          mkfs.btrfs -f /dev/mapper/log
          dmsetup message log 0 mark mkfs
          <run test>

        This would allow you to replay the log up to the mkfs mark and
        then replay from that point on, doing the fsck check at the
        interval that you want.

        Every log has a mark at the end labeled "dm-log-writes-end".
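
The constructor and status formats above can be sketched as two small shell
helpers. The function names, the sector count and the sample status line are
hypothetical. The sketch also assumes that `dmsetup status <name>` prefixes the
target's status fields with the usual start sector, length and target name.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of dmsetup.

# Build a device-mapper table line for a log-writes target.
#   $1: device size in 512-byte sectors (e.g. from `blockdev --getsz`)
#   $2: dev_path   $3: log_dev_path
log_writes_table() {
    printf '0 %s log-writes %s %s\n' "$1" "$2" "$3"
}

# Extract the two status fields from one line of `dmsetup status <name>`,
# assumed to look like: <start> <length> log-writes <entries> <sector>
log_writes_status() {
    set -- $1                      # split the line on whitespace
    [ "$3" = "log-writes" ] || return 1
    printf 'entries=%s highest_sector=%s\n' "$4" "$5"
}

log_writes_table 2097152 /dev/sdb /dev/sdc
log_writes_status "0 2097152 log-writes 12 40960"   # sample line, made up
```

The first call prints a line that could be passed to `dmsetup create log
--table`, and the second prints the two counters in a form that is easy to
compare between test runs.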

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system. You would do something like
this:

TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs

mount /dev/mapper/log /mnt/btrfs-test
<some test that does fsync at the end>
dmsetup message log 0 mark fsync
md5sum /mnt/btrfs-test/foo
umount /mnt/btrfs-test

dmsetup remove log
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
mount /dev/sdb /mnt/btrfs-test
md5sum /mnt/btrfs-test/foo
<verify that the md5sums match>
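
The final verification step can be scripted by saving the digest before the
unmount and comparing it after the replay. The sketch below is only one
possible way to do it: the helper name digest_of is hypothetical, and a
temporary file stands in for /mnt/btrfs-test/foo so that the snippet runs
without any device setup.

```shell
#!/bin/sh
# Hypothetical helper: print only the MD5 digest of a file.
digest_of() {
    md5sum < "$1" | cut -d' ' -f1
}

# Demonstration on a temporary file standing in for /mnt/btrfs-test/foo.
tmp=$(mktemp)
printf 'hello\n' > "$tmp"
before=$(digest_of "$tmp")     # taken before umount in the real workflow
after=$(digest_of "$tmp")      # taken after replay and remount
if [ "$before" = "$after" ]; then
    echo "md5sums match"
else
    echo "md5sums differ" >&2
fi
rm -f "$tmp"
```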

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation. You could do this with:

TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs

mount /dev/mapper/log /mnt/btrfs-test
<fsstress to dirty the fs>
btrfs filesystem balance /mnt/btrfs-test
umount /mnt/btrfs-test
dmsetup remove log

replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
btrfsck /dev/sdb
replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
        --fsck "btrfsck /dev/sdb" --check fua

This replays the log until it sees a FUA request and then runs the fsck
command. If the fsck passes, it replays to the next FUA request, and so on
until the log is exhausted or the fsck command exits abnormally.