.. SPDX-License-Identifier: GPL-2.0

=================================================
Recoverable Hardware Error Tracking in vmcoreinfo
=================================================

Overview
--------

This feature provides a generic infrastructure within the Linux kernel to track
and log recoverable hardware errors. These are hardware errors that the kernel
can see and recover from: they might not cause an immediate panic, but they may
affect system health, mainly because they cause new code paths to be executed
in the kernel.

By recording counts and timestamps of recoverable errors into the vmcoreinfo
crash dump notes, this infrastructure helps post-mortem crash analysis tools
correlate hardware events with kernel failures. This enables faster triage
and a better understanding of root causes, especially in large-scale cloud
environments where hardware issues are common.

Benefits
--------

- Facilitates correlation of recoverable hardware errors with kernel panics or
  unusual code paths that lead to system crashes.
- Provides operators and cloud providers with quick insights, improving
  reliability and reducing troubleshooting time.
- Complements existing full hardware diagnostics without replacing them.

Data Exposure and Consumption
-----------------------------

- The tracked error data consists of a per-error-type count and the timestamp
  of the last occurrence.
- This data is stored in the ``hwerror_data`` array, categorized by error
  source type: CPU, memory, PCI, CXL, and others.
- It is exposed via vmcoreinfo crash dump notes and can be read using tools
  like ``crash``, ``drgn``, or other kernel crash analysis utilities.
- This data can only be read from crash dumps.
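
The bookkeeping described in the list above can be modelled in a few lines of
Python. This is only an illustrative sketch, not the kernel implementation:
the ``HwErrorInfo`` class, the ``record_error()`` helper, and the area names
are hypothetical stand-ins for the per-type ``count`` and ``timestamp``
values.

.. code-block:: python

    import time
    from dataclasses import dataclass

    # Hypothetical area names mirroring the categories listed above.
    AREAS = ("cpu", "memory", "pci", "cxl", "others")

    @dataclass
    class HwErrorInfo:
        count: int = 0      # recoverable errors seen so far for this area
        timestamp: int = 0  # time of the last occurrence, in seconds

    def new_table():
        """Create one zeroed entry per area."""
        return {area: HwErrorInfo() for area in AREAS}

    def record_error(table, area, now=None):
        """Bump the counter for `area` and remember when it last happened."""
        entry = table[area]
        entry.count += 1
        entry.timestamp = int(time.time()) if now is None else now
        return entry

    table = new_table()
    record_error(table, "memory", now=1752852000)
    record_error(table, "memory", now=1752852018)

Only the most recent timestamp is kept for each area, so the table stays a
fixed size no matter how many errors occur.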

Typical usage example (in drgn REPL):

.. code-block:: python

    >>> prog['hwerror_data']
    (struct hwerror_info[HWERR_RECOV_MAX]){
        {
            .count = (int)844,
            .timestamp = (time64_t)1752852018,
        },
        ...
    }
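
Once the array has been read, the timestamps can be compared with the time of
the crash. The following sketch assumes the entries have already been copied
out of the dump into plain ``(count, timestamp)`` tuples and that the
timestamp is in seconds since the Unix epoch, as the sample value suggests.
The ``recent_errors()`` helper, the area labels and their order, the crash
time, and every sample value other than the first entry are made up for
illustration.

.. code-block:: python

    from datetime import datetime, timezone

    # Assumed labels and order; check the kernel's enum before relying on it.
    AREAS = ("cpu", "memory", "pci", "cxl", "others")

    def recent_errors(entries, crash_time, window=3600):
        """Return (area, count, age) for errors seen within `window` seconds
        before `crash_time`; `age` is seconds between last error and crash."""
        hits = []
        for area, (count, timestamp) in zip(AREAS, entries):
            if count == 0:
                continue  # this area never reported an error
            age = crash_time - timestamp
            if 0 <= age <= window:
                hits.append((area, count, age))
        return hits

    # First entry taken from the example above; the rest is hypothetical.
    entries = [(844, 1752852018), (0, 0), (3, 1752700000), (0, 0), (0, 0)]
    crash_time = 1752852318  # hypothetical: 300 s after the last first-area error

    for area, count, age in recent_errors(entries, crash_time):
        last = datetime.fromtimestamp(crash_time - age, tz=timezone.utc)
        print(f"{area}: {count} errors, last at {last.isoformat()} "
              f"({age} s before the crash)")

The one-hour window is arbitrary. A narrower window makes a reported error a
stronger hint that it is related to the crash.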

Enabling
--------

- This feature is enabled when ``CONFIG_VMCORE_INFO`` is set.
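
Whether a given kernel was built with this option can be checked against its
configuration file. The sketch below parses configuration text in the usual
``CONFIG_FOO=y`` form. The ``config_enabled()`` helper and the sample text are
illustrative only, and the location of the configuration file depends on the
distribution.

.. code-block:: python

    def config_enabled(config_text, option):
        """Return True if `option` is set to ``y`` in kernel config text."""
        for line in config_text.splitlines():
            line = line.strip()
            # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
            if line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name == option:
                return value == "y"
        return False

    # Hypothetical excerpt of a kernel configuration file.
    sample = """\
    CONFIG_CRASH_DUMP=y
    # CONFIG_FOO is not set
    CONFIG_VMCORE_INFO=y
    """

    print(config_enabled(sample, "CONFIG_VMCORE_INFO"))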