Commit | Line | Data |
---|---|---|
0e9cebe7 JB |
1 | dm-log-writes |
2 | ============= | |
3 | ||
4 | This target takes 2 devices, one to pass all IO to normally, and one to log all | |
5 | of the write operations to. This is intended for file system developers wishing | |
6 | to verify the integrity of metadata or data as the file system is written to. | |
7 | There is a log_write_entry written for every WRITE request and the target is | |
8 | able to take arbitrary data from userspace to insert into the log. The data | |
9 | that is in the WRITE requests is copied into the log to make the replay happen | |
10 | exactly as it happened originally. | |
11 | ||
12 | Log Ordering | |
13 | ============ | |
14 | ||
15 | We log things in order of completion once we are sure the write is no longer in | |
16 | cache. This means that normal WRITE requests are not actually logged until the | |
17 | next REQ_FLUSH request. This is to make it easier for userspace to replay the | |
18 | log in a way that correlates to what is on disk and not what is in cache, to | |
19 | make it easier to detect improper waiting/flushing. | |
20 | ||
21 | This works by attaching all WRITE requests to a list once the write completes. | |
22 | Once we see a REQ_FLUSH request we splice this list onto the request and once | |
23 | the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only | |
24 | WRITEs that have completed by the time the REQ_FLUSH is issued are added, to | |
25 | simulate the worst case scenario with regard to power failures. Consider the | |
26 | following example (W means write, C means complete): | |
27 | ||
28 | W1,W2,W3,C3,C2,Wflush,C1,Cflush | |
29 | ||
30 | The log would show the following: | |
31 | ||
32 | W3,W2,flush,W1.... | |
33 | ||
34 | Again, this is to simulate what is actually on disk. It allows us to detect | |
35 | cases where a power failure at a particular point in time would create an | |
36 | inconsistent file system. | |
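
The ordering rule above can be sketched as a toy model in shell. This is an illustration of the described behaviour only, not the kernel implementation; the function name simulate_log and the event token format (W<n>, C<n>, Wflush, Cflush) are assumptions made for this example, and it assumes only one flush is in flight at a time.

```shell
# Toy model of the log ordering rules; not the kernel code.
# Events: W<n>   = write n submitted    C<n>   = write n completed
#         Wflush = flush submitted      Cflush = flush completed
# The body runs in a subshell so its variables do not leak.
simulate_log() (
    completed=""   # writes completed since the last flush was issued
    batch=""       # writes spliced onto the in-flight flush
    log=""         # what ends up in the log, in order
    for ev in "$@"; do
        case "$ev" in
            Wflush) batch=$completed; completed="" ;;      # splice list onto flush
            Cflush) log="$log$batch flush"; batch="" ;;    # log WRITEs, then FLUSH
            C*)     completed="$completed W${ev#C}" ;;     # queue completed write
            W*)     ;;                                     # submission is not logged
        esac
    done
    echo "${log# }"
)

simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush
# prints: W3 W2 flush
simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush Wflush Cflush
# prints: W3 W2 flush W1 flush
```

In the first call W1 has completed but is still waiting for a later flush, so it does not appear in the log yet; the second call adds that flush and W1 is logged after the first one.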
37 | ||
38 | Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as | |
39 | they complete, because those requests bypass the device cache. | |
40 | ||
41 | Any REQ_DISCARD requests are treated like WRITE requests. Otherwise the log | |
42 | would contain all the DISCARD requests first, then the WRITE requests, and | |
43 | then the FLUSH request. Consider the following example: | |
44 | ||
45 | WRITE block 1, DISCARD block 1, FLUSH | |
46 | ||
47 | If we logged DISCARD when it completed, the replay would look like this: | |
48 | ||
49 | DISCARD 1, WRITE 1, FLUSH | |
50 | ||
51 | which isn't quite what happened and wouldn't be caught during the log replay. | |
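
To see why the order matters, here is a minimal single-block model in shell. The function name final_state and the state labels are hypothetical, chosen for this sketch; it only tracks what one block would contain after replaying a sequence of operations.

```shell
# Track the contents of a single block across a replayed op sequence.
# Runs in a subshell so "state" and "op" stay private to the function.
final_state() (
    state=empty
    for op in "$@"; do
        case "$op" in
            WRITE)   state=data ;;        # block now holds written data
            DISCARD) state=discarded ;;   # block contents are thrown away
            FLUSH)   ;;                   # does not change block contents
        esac
    done
    echo "$state"
)

final_state WRITE DISCARD FLUSH    # what really happened: prints "discarded"
final_state DISCARD WRITE FLUSH    # misordered log:       prints "data"
```

The two orders leave the block in different states, so a log that recorded the DISCARD early would replay to a disk image that never existed.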
52 | ||
53 | Target interface | |
54 | ================ | |
55 | ||
56 | i) Constructor | |
57 | ||
58 | log-writes <dev_path> <log_dev_path> | |
59 | ||
60 | dev_path : Device that all of the IO will go to normally. | |
61 | log_dev_path : Device where the log entries are written to. | |
62 | ||
63 | ii) Status | |
64 | ||
65 | <#logged entries> <highest allocated sector> | |
66 | ||
67 | #logged entries : Number of logged entries | |
68 | highest allocated sector : Highest allocated sector | |
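
As a sketch of how a script might consume these fields: `dmsetup status <name>` prints the start sector, length and target type before the target-specific fields, so the two values above would be the fourth and fifth words of the line. The helper name parse_log_writes_status and the sample numbers are made up for illustration.

```shell
# Split one `dmsetup status` line for a log-writes target into its fields.
# Expected layout: <start> <length> log-writes <#logged entries> <highest sector>
parse_log_writes_status() {
    # Intentional unquoted expansion: split the line on whitespace.
    set -- $1
    [ "$3" = "log-writes" ] || return 1   # not a log-writes status line
    echo "entries=$4 highest_sector=$5"
}

# Sample line with made-up values:
parse_log_writes_status "0 2097152 log-writes 42 8192"
# prints: entries=42 highest_sector=8192
```

On a live system the argument could come from "$(dmsetup status log)", assuming the device is named log as in the examples in this document.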
69 | ||
70 | iii) Messages | |
71 | ||
72 | mark <description> | |
73 | ||
74 | You can use a dmsetup message to set an arbitrary mark in a log. | |
75 | For example, say you want to fsck a file system after every | |
76 | write, but first you need to replay up to the mkfs to make sure | |
77 | you are fsck'ing something reasonable. You would do something | |
78 | like this: | |
79 | ||
80 | mkfs.btrfs -f /dev/mapper/log | |
81 | dmsetup message log 0 mark mkfs | |
82 | <run test> | |
83 | ||
84 | This would allow you to replay the log up to the mkfs mark and | |
85 | then replay from that point on doing the fsck check in the | |
86 | interval that you want. | |
87 | ||
88 | Every log has a mark at the end labeled "dm-log-writes-end". | |
89 | ||
90 | Userspace component | |
91 | =================== | |
92 | ||
93 | There is a userspace tool that will replay the log for you in various ways. | |
94 | It can be found here: https://github.com/josefbacik/log-writes | |
95 | ||
96 | Example usage | |
97 | ============= | |
98 | ||
99 | Say you want to test fsync on your file system. You would do something like | |
100 | this: | |
101 | ||
102 | TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" | |
103 | dmsetup create log --table "$TABLE" | |
104 | mkfs.btrfs -f /dev/mapper/log | |
105 | dmsetup message log 0 mark mkfs | |
106 | ||
107 | mount /dev/mapper/log /mnt/btrfs-test | |
108 | <some test that does fsync at the end> | |
109 | dmsetup message log 0 mark fsync | |
110 | md5sum /mnt/btrfs-test/foo | |
111 | umount /mnt/btrfs-test | |
112 | ||
113 | dmsetup remove log | |
114 | replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync | |
115 | mount /dev/sdb /mnt/btrfs-test | |
116 | md5sum /mnt/btrfs-test/foo | |
117 | <verify the md5sums match> | |
118 | ||
119 | Another option is to do a complicated file system operation and verify the file | |
120 | system is consistent during the entire operation. You could do this with: | |
121 | ||
122 | TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" | |
123 | dmsetup create log --table "$TABLE" | |
124 | mkfs.btrfs -f /dev/mapper/log | |
125 | dmsetup message log 0 mark mkfs | |
126 | ||
127 | mount /dev/mapper/log /mnt/btrfs-test | |
128 | <fsstress to dirty the fs> | |
129 | btrfs filesystem balance /mnt/btrfs-test | |
130 | umount /mnt/btrfs-test | |
131 | dmsetup remove log | |
132 | ||
133 | replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs | |
134 | btrfsck /dev/sdb | |
135 | replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \ | |
136 | --fsck "btrfsck /dev/sdb" --check fua | |
137 | ||
138 | That will replay the log until it sees a FUA request, then run the fsck | |
139 | command. If the fsck passes, it replays to the next FUA, and so on until the | |
140 | log is fully replayed or the fsck command exits abnormally. |