.. SPDX-License-Identifier: GPL-2.0

=================================================
Recoverable Hardware Error Tracking in vmcoreinfo
=================================================

Overview
--------

This feature provides a generic infrastructure within the Linux kernel to track
and log recoverable hardware errors. These are errors that the hardware reports
and recovers from. They might not cause an immediate panic, but they may still
affect system health, mainly because handling them makes the kernel execute
additional code paths.

By recording counts and timestamps of recoverable errors in the vmcoreinfo
crash dump notes, this infrastructure helps post-mortem crash analysis tools
correlate hardware events with kernel failures. This enables faster triage
and a better understanding of root causes, especially in large-scale cloud
environments where hardware issues are common.

Benefits
--------

- Facilitates correlation of recoverable hardware errors with kernel panics or
  unusual code paths that lead to system crashes.
- Gives operators and cloud providers quick insight, improving reliability
  and reducing troubleshooting time.
- Complements existing full hardware diagnostics without replacing them.

Data Exposure and Consumption
-----------------------------

- The tracked data consists of a per-error-type count and the timestamp of the
  last occurrence.
- The data is stored in the ``hwerror_data`` array, with one entry per error
  source: CPU, memory, PCI, CXL, and others.
- It is exposed through the vmcoreinfo crash dump notes and can be read with
  tools such as ``crash``, ``drgn``, or other kernel crash analysis utilities.
- Crash dumps are the only way to read this data.


Typical usage example (in drgn REPL):

.. code-block:: python

    >>> prog['hwerror_data']
    (struct hwerror_info[HWERR_RECOV_MAX]){
            {
                    .count = (int)844,
                    .timestamp = (time64_t)1752852018,
            },
            ...
    }

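
The raw ``timestamp`` field is a number of seconds since the Unix epoch, which is
hard to read at a glance. The following is a minimal sketch, not part of the
kernel or of drgn, of how the values could be summarized once they have been
copied out of the dump. The ``summarize`` helper and the plain
``(count, timestamp)`` pair layout are assumptions made for illustration, as is
the choice to treat a zero count as "never occurred".

```python
from datetime import datetime, timezone


def summarize(entries):
    """Turn (count, timestamp) pairs into readable lines.

    `entries` is a hypothetical plain-Python copy of hwerror_data:
    one (count, seconds-since-epoch) pair per error type, in array order.
    """
    lines = []
    for index, (count, timestamp) in enumerate(entries):
        if count == 0:
            # Assumption: a zero count means this error type never occurred.
            continue
        when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        lines.append(
            f"type {index}: {count} errors, last at {when:%Y-%m-%d %H:%M:%S} UTC"
        )
    return lines


# The first pair uses the values from the drgn output above; the second
# pair is made up to show that empty slots are skipped.
sample = [(844, 1752852018), (0, 0)]
for line in summarize(sample):
    print(line)
# prints: type 0: 844 errors, last at 2025-07-18 15:20:18 UTC
```

In a drgn session, the input list could probably be built with something like
``[(int(e.count), int(e.timestamp)) for e in prog['hwerror_data']]``, but that
conversion is untested here and should be checked against the drgn version in
use.
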

Enabling
--------

- This feature is enabled when ``CONFIG_VMCORE_INFO`` is set.