dm-log-writes
=============

This target takes two devices: one that all IO is passed to normally, and one
that all write operations are logged to.  It is intended for file system
developers who want to verify the integrity of metadata or data as the file
system is written to.  A log_write_entry is written for every WRITE request, and
the target can also take arbitrary data from userspace and insert it into the
log.  The data in each WRITE request is copied into the log so that a replay
reproduces the writes exactly as they originally happened.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache.  This means that normal WRITE requests are not actually logged until the
next REQ_FLUSH request.  This makes it easier for userspace to replay the log in
a way that correlates to what is on disk rather than what is in cache, which in
turn makes it easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_FLUSH request we splice this list onto the request and once
the FLUSH request completes we log all of the WRITEs and then the FLUSH.  Only
completed WRITEs, at the time the REQ_FLUSH is issued, are added in order to
simulate the worst case scenario with regard to power failures.  Consider the
following example (W means write, C means complete):

W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

W3,W2,flush,W1....

Again, this is to simulate what is actually on disk.  It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.
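
The ordering rule above can be modelled in a few lines of shell.  This is only
a toy sketch of the rule as described in this section, not the kernel
implementation; the function name simulate_log and the event encoding (Wn,
Cn, Wflush, Cflush) are assumptions made for illustration.  Writes that
complete after the flush was issued are printed at the end, standing in for
"logged at the next flush".

```shell
#!/bin/sh
# Toy model of the log ordering rule; not the kernel implementation.
# Input: comma-separated events.  Wn = write n submitted, Cn = write n
# completed, Wflush = flush submitted, Cflush = flush completed.
simulate_log() {
    pending=""    # writes completed but not yet covered by a flush
    attached=""   # writes spliced onto the in-flight flush
    log=""        # resulting log order

    old_ifs=$IFS
    IFS=,
    set -- $1     # split the event string on commas
    IFS=$old_ifs

    for ev in "$@"; do
        case $ev in
        Wflush)
            # Splice the writes completed so far onto the flush request.
            attached=$pending
            pending=""
            ;;
        Cflush)
            # Flush done: log its writes first, then the flush itself.
            log="$log$attached flush"
            attached=""
            ;;
        C*)
            # A write completed: queue it until the next flush.
            pending="$pending W${ev#C}"
            ;;
        W*)
            # Submission alone is never logged.
            ;;
        esac
    done

    # Writes still pending would only be logged by a later flush.
    log="$log$pending"
    # Word splitting drops the leading space; join with commas.
    echo $log | tr ' ' ','
}

simulate_log "W1,W2,W3,C3,C2,Wflush,C1,Cflush"   # prints W3,W2,flush,W1
```

W1 lands after the flush because it had not completed when the flush was
issued, so a replay that stops at the flush never sees it.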

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, because those requests bypass the device cache.

Any REQ_DISCARD requests are treated like WRITE requests, that is, they are
also held back until the next FLUSH.  Otherwise the log would contain all the
DISCARD requests first, then the WRITE requests, and then the FLUSH request.
Consider the following example:

WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

DISCARD 1, WRITE 1, FLUSH

which is not what actually happened, and the discrepancy would not be caught
during the log replay.
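
The difference between the two policies can be sketched with a small shell
model.  This is an illustration only, not the kernel code: the function name
replay_order, the OP:block encoding, and the "immediate" policy (the
hypothetical behaviour the target deliberately avoids) are assumptions.

```shell
#!/bin/sh
# Toy model, not the kernel code: compare two policies for logging DISCARDs.
# $1   = "deferred"  (DISCARD treated like WRITE, as the target does) or
#        "immediate" (hypothetical: DISCARD logged as soon as it completes)
# rest = requests in completion order, e.g. WRITE:1 DISCARD:1 FLUSH
replay_order() {
    policy=$1
    shift
    pending=""   # entries waiting for the next FLUSH
    log=""       # resulting log order
    for req in "$@"; do
        op=${req%%:*}
        entry=$(echo "$req" | tr ':' ' ')
        case $op in
        FLUSH)
            # A flush logs everything that was held back, then itself.
            log="$log$pending, FLUSH"
            pending=""
            ;;
        DISCARD)
            if [ "$policy" = immediate ]; then
                log="$log, $entry"
            else
                pending="$pending, $entry"
            fi
            ;;
        *)
            # WRITEs are always held until the next FLUSH.
            pending="$pending, $entry"
            ;;
        esac
    done
    log="$log$pending"
    echo "${log#, }"
}

replay_order deferred  WRITE:1 DISCARD:1 FLUSH   # WRITE 1, DISCARD 1, FLUSH
replay_order immediate WRITE:1 DISCARD:1 FLUSH   # DISCARD 1, WRITE 1, FLUSH
```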

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   dev_path	: Device that all of the IO will go to normally.
   log_dev_path : Device where the log entries are written to.
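
A device-mapper table line places the constructor after the start sector and
length of the mapping.  As a minimal sketch, the helper below builds such a
line; log_writes_table is a hypothetical name, and the device paths and the
2048-sector size are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: build a table line for the log-writes target.
# $1 = size of the mapping in 512-byte sectors
# $2 = dev_path, $3 = log_dev_path
log_writes_table() {
    sectors=$1
    dev=$2
    log_dev=$3
    printf '0 %s log-writes %s %s\n' "$sectors" "$dev" "$log_dev"
}

log_writes_table 2048 /dev/sdb /dev/sdc
# prints: 0 2048 log-writes /dev/sdb /dev/sdc

# On a real system the size would come from the device itself, e.g.:
#   dmsetup create log --table \
#       "$(log_writes_table "$(blockdev --getsz /dev/sdb)" /dev/sdb /dev/sdc)"
```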

ii) Status

    <#logged entries> <highest allocated sector>

    #logged entries	       : Number of logged entries
    highest allocated sector   : Highest allocated sector
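
When read through "dmsetup status", the two fields above follow the start
sector, length and target type that dmsetup prints for every target.  The
sketch below splits such a line; parse_log_writes_status is a hypothetical
helper and the sample values are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: pick the log-writes fields out of one status line.
# Expected layout:
#   <start> <length> log-writes <#logged entries> <highest allocated sector>
parse_log_writes_status() {
    set -- $1                               # split the line on whitespace
    [ "$3" = "log-writes" ] || return 1     # not a log-writes target
    echo "entries=$4 highest_sector=$5"
}

# Sample line with made-up numbers; on a real system it would come from
# "dmsetup status log".
parse_log_writes_status "0 2048 log-writes 5 1234"
# prints: entries=5 highest_sector=1234
```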

iii) Messages

    mark <description>

	You can use a dmsetup message to set an arbitrary mark in a log.
	For example, say you want to fsck a file system after every
	write, but first need to replay up to the mkfs so that you are
	fsck'ing something reasonable.  You would do something like
	this:

	  mkfs.btrfs -f /dev/mapper/log
	  dmsetup message log 0 mark mkfs
	  <run test>

	This would allow you to replay the log up to the mkfs mark and
	then replay from that point on, doing the fsck check at the
	interval that you want.

	Every log has a mark at the end labeled "dm-log-writes-end".

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system.  You would do something like
this:

TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs

mount /dev/mapper/log /mnt/btrfs-test
<some test that does fsync at the end>
dmsetup message log 0 mark fsync
md5sum /mnt/btrfs-test/foo
umount /mnt/btrfs-test

dmsetup remove log
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
mount /dev/sdb /mnt/btrfs-test
md5sum /mnt/btrfs-test/foo
<verify that the md5sums match>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.  You could do this with:

TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs

mount /dev/mapper/log /mnt/btrfs-test
<fsstress to dirty the fs>
btrfs filesystem balance /mnt/btrfs-test
umount /mnt/btrfs-test
dmsetup remove log

replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
btrfsck /dev/sdb
replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
	--fsck "btrfsck /dev/sdb" --check fua

That will replay the log until it sees a FUA request, run the fsck command and,
if the fsck passes, replay to the next FUA.  This repeats until the log is
fully replayed or the fsck command exits abnormally.