dm-log-writes
=============

This target takes two devices: one to pass all IO to normally, and one to log
all of the write operations to. This is intended for file system developers
wishing to verify the integrity of metadata or data as the file system is
written to. There is a log_write_entry written for every WRITE request, and the
target is able to take arbitrary data from userspace to insert into the log.
The data in the WRITE requests is copied into the log so that the replay
happens exactly as the original IO did.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache. This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request. This makes it easier for userspace to replay the log
in a way that correlates to what is on disk and not what is in cache, and so
makes it easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request, and
once the FLUSH request completes we log all of the WRITEs and then the FLUSH.
Only WRITEs that have completed at the time the REQ_PREFLUSH is issued are
added, in order to simulate the worst case scenario with regard to power
failures. Consider the following example (W means write, C means complete):

    W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

    W3,W2,flush,W1....

Again, this is to simulate what is actually on disk. It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.

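The ordering rule above can be tried out with a small toy model. This is only
a sketch of the rule as described in this section, not the kernel
implementation. The function name simulate_log and the event spelling are
made up for illustration, and appending the still-pending writes at the end is
an assumption that stands in for "logged at some later flush".

```shell
#!/bin/sh
# Toy model of the log ordering rule (hypothetical helper, not kernel code).
# Events: W<n> = write n submitted, C<n> = write n completed,
#         Wflush = flush submitted, Cflush = flush completed.
# The body runs in a subshell so variables and IFS stay local.
simulate_log() (
    pending="" attached="" log=""
    for ev in "$@"; do
        case "$ev" in
            Wflush)
                # Splice the writes completed so far onto the flush.
                attached=$pending
                pending=""
                ;;
            Cflush)
                # Flush completed: log the attached writes, then the flush.
                log="$log $attached flush"
                attached=""
                ;;
            C*)
                # A write completed: queue it until the next flush is issued.
                pending="$pending W${ev#C}"
                ;;
            W*)
                # Submitting a write logs nothing by itself.
                ;;
        esac
    done
    # Assumption: writes still pending are shown where a later flush
    # would place them, after everything already logged.
    set -- $log $pending
    IFS=,
    echo "$*"
)

simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush
```

For the event sequence from the example this prints W3,W2,flush,W1, matching
the log shown above: W1 completed after the flush was issued, so it is not
covered by that flush.
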
Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, as those requests bypass the device cache.

Any REQ_DISCARD requests are treated like WRITE requests. Otherwise we would
log all the DISCARD requests, then the WRITE requests, and then the FLUSH
request. Consider the following example:

    WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

    DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

    log-writes <dev_path> <log_dev_path>

    dev_path     : Device that all of the IO will go to normally.
    log_dev_path : Device where the log entries are written to.

ii) Status

    <#logged entries> <highest allocated sector>

    #logged entries          : Number of logged entries
    highest allocated sector : Highest allocated sector

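As a rough illustration of reading the status fields from a script, the sketch
below splits one status line into its parts. It assumes the usual dmsetup
status layout, where the target-specific fields are preceded by the start
sector, the length and the target type. The helper name
parse_log_writes_status and the sample numbers are invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: split one `dmsetup status` line for a log-writes target.
parse_log_writes_status() {
    # Word-split the line into: <start> <length> <target type>
    # <#logged entries> <highest allocated sector>
    set -- $1
    if [ "$#" -ne 5 ] || [ "$3" != "log-writes" ]; then
        echo "not a log-writes status line" >&2
        return 1
    fi
    echo "entries=$4 highest_sector=$5"
}

# Sample line with made-up numbers. On a live system the argument would be
# something like "$(dmsetup status log)".
parse_log_writes_status "0 2097152 log-writes 42 1048576"
```

With the sample line this prints entries=42 highest_sector=1048576.
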
iii) Messages

    mark <description>

        You can use a dmsetup message to set an arbitrary mark in a log.
        For example, say you want to fsck a file system after every
        write, but first you need to replay up to the mkfs to make sure
        you are fsck'ing something reasonable. You would do something
        like this:

          mkfs.btrfs -f /dev/mapper/log
          dmsetup message log 0 mark mkfs
          <run test>

        This would allow you to replay the log up to the mkfs mark and
        then replay from that point on, doing the fsck check at the
        interval that you want.

        Every log has a mark at the end labeled "dm-log-writes-end".

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system. You would do something like
this:

    TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
    dmsetup create log --table "$TABLE"
    mkfs.btrfs -f /dev/mapper/log
    dmsetup message log 0 mark mkfs

    mount /dev/mapper/log /mnt/btrfs-test
    <some test that does fsync at the end>
    dmsetup message log 0 mark fsync
    md5sum /mnt/btrfs-test/foo
    umount /mnt/btrfs-test

    dmsetup remove log
    replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
    mount /dev/sdb /mnt/btrfs-test
    md5sum /mnt/btrfs-test/foo
    <verify md5sums are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation. You could do this with:

    TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
    dmsetup create log --table "$TABLE"
    mkfs.btrfs -f /dev/mapper/log
    dmsetup message log 0 mark mkfs

    mount /dev/mapper/log /mnt/btrfs-test
    <fsstress to dirty the fs>
    btrfs filesystem balance /mnt/btrfs-test
    umount /mnt/btrfs-test
    dmsetup remove log

    replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
    btrfsck /dev/sdb
    replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
        --fsck "btrfsck /dev/sdb" --check fua

That will replay the log until it sees a FUA request and run the fsck command.
If the fsck passes, it will replay to the next FUA, and so on until the replay
is completed or the fsck command exits abnormally.