Commit | Line | Data |
---|---|---|
d28a7932 MA |
1 | |
2 | Hypervisor-Assisted Dump | |
3 | ------------------------ | |
4 | November 2007 | |
5 | ||
6 | The goal of hypervisor-assisted dump is to enable the dump of | |
7 | a crashed system, and to do so from a fully-reset system, and | |
8 | to minimize the total elapsed time until the system is back | |
9 | in production use. | |
10 | ||
11 | As compared to kdump or other strategies, hypervisor-assisted | |
12 | dump offers several strong, practical advantages: | |
13 | ||
14 | -- Unlike kdump, the system has been reset, and loaded | |
15 | with a fresh copy of the kernel. In particular, | |
16 | PCI and I/O devices have been reinitialized and are | |
17 | in a clean, consistent state. | |
18 | -- As the dump is performed, the dumped memory becomes | |
19 | immediately available to the system for normal use. | |
20 | -- After the dump is completed, no further reboots are | |
21 | required; the system will be fully usable, and running | |
22 | in its normal production mode on its normal kernel. | |
23 | ||
24 | The above can only be accomplished by coordination with, | |
25 | and assistance from, the hypervisor. The procedure is | |
26 | as follows: | |
27 | ||
28 | -- When a system crashes, the hypervisor will save | |
29 | the low 256MB of RAM to a previously registered | |
30 | save region. It will also save system state, system | |
31 | registers, and hardware PTEs. | |
32 | ||
33 | -- After the low 256MB area has been saved, the | |
34 | hypervisor will reset PCI and other hardware state. | |
35 | It will *not* clear RAM. It will then launch the | |
36 | bootloader, as normal. | |
37 | ||
38 | -- The freshly booted kernel will notice that there | |
39 | is a new node (ibm,dump-kernel) in the device tree, | |
40 | indicating that there is crash data available from | |
41 | a previous boot. It will boot into only 256MB of RAM, | |
42 | reserving the rest of system memory. | |
43 | ||
44 | -- Userspace tools will parse /sys/kernel/release_region | |
45 | and read /proc/vmcore to obtain the contents of memory, | |
46 | which holds the previously crashed kernel. The userspace | |
47 | tools may copy this info to disk, network, NAS, SAN, | |
48 | iSCSI, etc. as desired. | |
49 | ||
50 | For example, the values in /sys/kernel/release_region | |
51 | would look something like this (address-range pairs): | |
52 | CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: / | |
53 | DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A | |
54 | ||
55 | -- As the userspace tools complete saving a portion of | |
56 | the dump, they echo an offset and size to | |
57 | /sys/kernel/release_region to release the reserved | |
58 | memory back to general use. | |
59 | ||
60 | An example of this is: | |
61 | "echo 0x40000000 0x10000000 > /sys/kernel/release_region" | |
62 | which will release 256MB at the 1GB boundary. | |
63 | ||
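As a rough illustration only, the example values and the release step
above could be handled from a script along these lines. The sketch assumes
that the file layout matches the example line exactly ("LABEL:" followed by
one or more "0x<start>-0x<size>" pairs) and that the second value of each
pair is a size; neither is confirmed beyond the example. The names
parse_release_region and release_command are hypothetical helpers, not part
of any existing tool.

```python
import re

# Assumed layout, taken only from the example line in this document:
# "LABEL:" followed by one or more "0x<start>-0x<size>" pairs.
TOKEN = re.compile(r"([A-Za-z]+):|0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)")


def parse_release_region(text):
    """Return {label: [(start, size), ...]} for the example format."""
    regions = {}
    label = None
    for m in TOKEN.finditer(text):
        if m.group(1):
            # A new "LABEL:" token starts a new group of pairs.
            label = m.group(1)
            regions.setdefault(label, [])
        elif label is not None:
            # A "0x<start>-0x<size>" pair belongs to the current label.
            regions[label].append((int(m.group(2), 16), int(m.group(3), 16)))
    return regions


def release_command(start, size):
    """Format the string that would be echoed to release one range."""
    return "0x%x 0x%x" % (start, size)


# The example values from this document, joined onto one line.
EXAMPLE = ("CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: "
           "DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A")
```

Here release_command(0x40000000, 0x10000000) yields the string
"0x40000000 0x10000000" used in the echo example: 0x40000000 is 1GB
(1 << 30) and 0x10000000 is 256MB (256 << 20).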
64 | Please note that the hypervisor-assisted dump feature | |
65 | is only available on Power6-based systems with recent | |
66 | firmware versions. | |
67 | ||
68 | Implementation details: | |
69 | ---------------------- | |
70 | ||
71 | During boot, a check is made to see if firmware supports | |
72 | this feature on this particular machine. If it does, then | |
73 | we check to see if an active dump is waiting for us. If so, | |
74 | then everything but 256MB of RAM is reserved during early | |
75 | boot. This area is released once userland scripts have | |
76 | collected the dump. If there is dump data, then | |
77 | the /sys/kernel/release_region file is created, and | |
78 | the reserved memory is held. | |
79 | ||
80 | If there is no waiting dump data, then only the highest | |
81 | 256MB of RAM is reserved as a scratch area. This area | |
82 | is *not* released: this region will be kept permanently | |
83 | reserved, so that it can act as a receptacle for a copy | |
84 | of the low 256MB in the case a crash does occur. See, | |
85 | however, "open issues" below, as to whether | |
86 | such a reserved region is really needed. | |
87 | ||
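The two reservation cases described above can be summarized in a small
sketch. This is not the kernel code: it assumes RAM is one contiguous
range starting at address 0, and reserved_range is a hypothetical name
used only for illustration.

```python
MB = 1 << 20
RESERVE_SIZE = 256 * MB  # the 256MB figure used throughout this document


def reserved_range(ram_size, dump_waiting):
    """Return (start, size) of the memory held back during early boot.

    Assumes RAM is a single contiguous range starting at address 0.
    """
    if dump_waiting:
        # Boot in the low 256MB; everything above it still holds the
        # crashed kernel's data and stays reserved until released.
        return (RESERVE_SIZE, ram_size - RESERVE_SIZE)
    # No dump waiting: keep the highest 256MB as a permanent scratch
    # area that can receive a copy of low memory after a crash.
    return (ram_size - RESERVE_SIZE, RESERVE_SIZE)
```

For a 4GB system, the first case reserves 3840MB starting at 256MB, and
the second reserves 256MB starting at 3840MB.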
88 | Currently the dump will be copied from /proc/vmcore to | |
89 | a new file upon user intervention. The starting address | |
90 | to be read and the range for each data point are provided | |
91 | in /sys/kernel/release_region. | |
92 | ||
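A minimal sketch of the copy step follows. It assumes that a start value
read from /sys/kernel/release_region can be used directly as a byte offset
into /proc/vmcore, which this document does not confirm. copy_range is a
hypothetical helper, not an existing tool.

```python
def copy_range(src_path, dst_path, offset, size, chunk=1 << 20):
    """Copy `size` bytes starting at `offset` from src_path to dst_path.

    Returns the number of bytes actually copied, which is smaller than
    `size` if the source ends early.
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(offset)
        while copied < size:
            # Read in bounded chunks so that a large range is never
            # held in memory all at once.
            buf = src.read(min(chunk, size - copied))
            if not buf:
                break  # source ended before `size` bytes were read
            dst.write(buf)
            copied += len(buf)
    return copied
```

A caller would pass "/proc/vmcore" as src_path and, once the copy has
succeeded, release the corresponding range through
/sys/kernel/release_region.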
93 | The tools to examine the dump will be the same as the ones | |
94 | used for kdump. | |
95 | ||
96 | General notes: | |
97 | -------------- | |
98 | Security: please note that there are potential security issues | |
99 | with any sort of dump mechanism. In particular, plaintext | |
100 | (unencrypted) data, and possibly passwords, may be present in | |
101 | the dump data. Userspace tools must take adequate precautions to | |
102 | preserve security. | |
103 | ||
104 | Open issues/ToDo: | |
105 | ----------------- | |
106 | o The various code paths that tell the hypervisor that a crash | |
107 | occurred, vs. it simply being a normal reboot, should be | |
108 | reviewed, and possibly clarified/fixed. | |
109 | ||
110 | o Instead of using /sys/kernel, should there be a /sys/dump | |
111 | instead? There is a dump_subsys being created by the s390 code, | |
112 | perhaps the pseries code should use a similar layout as well. | |
113 | ||
114 | o Is reserving a 256MB region really required? The goal of | |
115 | reserving a 256MB scratch area is to make sure that no | |
116 | important crash data is clobbered when the hypervisor | |
117 | saves low memory to the scratch area. But, if one could assure | |
118 | that nothing important is located in some 256MB area, then | |
119 | it would not need to be reserved. This is something that | |
120 | can be improved in subsequent versions. | |
121 | ||
122 | o Still working with the kdump team to integrate this with | |
123 | kdump; some work remains, but this would not affect the | |
124 | current patches. | |
125 | ||
126 | o Still need to write a shell script to copy the dump away. | |
127 | Currently I am parsing it manually. |