Commit | Line | Data |
---|---|---|
a98f0dd3 AK |
1 | |
2 | Configurable sysfs parameters for the x86-64 machine check code. | |
3 | ||
4 | Machine checks report internal hardware error conditions detected | |
5 | by the CPU. Uncorrected errors typically cause a machine check | |
6 | (often with panic); corrected ones cause a machine check log entry. | |
7 | ||
8 | Machine checks are organized in banks (normally associated with | |
9 | a hardware subsystem) and subevents in a bank. The exact meaning | |
10 | of the banks and subevents is CPU specific. | |
11 | ||
12 | mcelog knows how to decode them. | |
13 | ||
14 | When you see the "Machine check errors logged" message in the system | |
15 | log then mcelog should run to collect and decode machine check entries | |
16 | from /dev/mcelog. Normally mcelog should be run regularly from a cronjob. | |
17 | ||
18 | Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN | |
19 | (N = CPU number) | |
20 | ||
21 | The directory contains some configurable entries: | |
22 | ||
23 | Entries: | |
24 | ||
25 | bankNctl | |
26 | (N bank number) | |
27 | 64-bit hex bitmask enabling/disabling specific subevents for bank N. | |
28 | When a bit in the bitmask is zero then the respective | |
29 | subevent will not be reported. | |
30 | By default all events are enabled. | |
31 | Note that the BIOS maintains another mask to disable specific events | |
32 | per bank. That mask is not visible here. | |
33 | ||
34 | The following entries appear for each CPU, but they are actually shared | |
35 | by all CPUs. | |
36 | ||
37 | check_interval | |
38 | How often to poll for corrected machine check errors, in seconds | |
8a336b0a TH |
39 | (note that the output is hexadecimal). Default 5 minutes. When the poller |
40 | finds MCEs it triggers an exponential speedup (poll more often) on | |
41 | the polling interval. When the poller stops finding MCEs, it | |
42 | triggers an exponential backoff (poll less often) on the polling | |
43 | interval. The check_interval variable is both the initial and | |
8780e8e0 AK |
44 | maximum polling interval. 0 means no polling for corrected machine |
45 | check errors (but some corrected errors might still be reported | |
46 | in other ways). | |
a98f0dd3 AK |
47 | |
48 | tolerant | |
49 | Tolerance level. When a machine check exception occurs for an | |
50 | uncorrected machine check, the kernel can take different actions. | |
51 | Since machine check exceptions can happen at any time, it is sometimes | |
52 | risky for the kernel to kill a process because doing so defies | |
53 | normal kernel locking rules. The tolerance level configures | |
bd78432c TH |
54 | how hard the kernel tries to recover even at some risk of |
55 | deadlock. Higher tolerant values trade potentially better uptime | |
56 | against the risk of a crash or even corruption (for tolerant >= 3). | |
57 | ||
58 | 0: always panic on uncorrected errors, log corrected errors | |
59 | 1: panic or SIGBUS on uncorrected errors, log corrected errors | |
60 | 2: SIGBUS or log uncorrected errors, log corrected errors | |
61 | 3: never panic or SIGBUS, log all errors (for testing only) | |
a98f0dd3 AK |
62 | |
63 | Default: 1 | |
64 | ||
65 | Note this only makes a difference if the CPU allows recovery | |
66 | from a machine check exception. Current x86 CPUs generally do not. | |
67 | ||
68 | trigger | |
69 | Program to run when a machine check event is detected. | |
70 | This is an alternative to running mcelog regularly from cron | |
71 | and allows events to be detected faster. | |
3c079792 AK |
72 | monarch_timeout |
73 | How long to wait for the other CPUs to machine check too on an | |
74 | exception. 0 to disable waiting for other CPUs. | |
75 | Unit: us | |
a98f0dd3 AK |
76 | |
77 | TBD: document entries for AMD threshold interrupt configuration | |
78 | ||
79 | For more details about the x86 machine check architecture | |
80 | see the Intel and AMD architecture manuals from their developer websites. | |
81 | ||
82 | For more details about the architecture | |
83 | see http://one.firstfloor.org/~andi/mce.pdf |
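
The bankNctl bitmask and the check_interval polling behaviour described above can be illustrated with a small Python sketch. It is not the kernel's implementation: the function names are hypothetical, the functions work on strings and numbers rather than reading sysfs, and the factor of two and the lower bound on the interval are assumptions (the text only says the speedup and backoff are exponential and that check_interval is the initial and maximum interval).

```python
def enabled_subevents(mask_hex):
    """Return the bit positions that are set in a 64-bit hex bankNctl mask.

    A set bit means the corresponding subevent is reported; a zero bit
    means it is not. What each bit means is CPU specific.
    """
    mask = int(mask_hex.strip(), 16) & 0xFFFFFFFFFFFFFFFF
    return [bit for bit in range(64) if (mask >> bit) & 1]


def next_poll_interval(current, found_mce, check_interval, floor=0.01):
    """Sketch of the adaptive polling interval, in seconds.

    Assumptions (not stated in the text): the interval is halved when
    machine check events were found and doubled when none were found,
    and it never drops below `floor` seconds.
    """
    if check_interval == 0:
        # 0 means no polling for corrected machine check errors.
        return 0
    if found_mce:
        # Speed up: poll more often, but not faster than the floor.
        return max(current / 2, floor)
    # Back off: poll less often, capped at check_interval.
    return min(current * 2, check_interval)


if __name__ == "__main__":
    # Bits 0 and 2 set: only those two subevents would be reported.
    print(enabled_subevents("0x5"))

    # Start at the default of 5 minutes; two polls find events, two do not.
    interval = 300
    for found in (True, True, False, False):
        interval = next_poll_interval(interval, found, check_interval=300)
        print(found, interval)
```

With the default check_interval of 300 seconds, the interval in this sketch shrinks while events keep arriving and grows back to 300 once they stop, and never exceeds it.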