.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism targets real-time alerting, so that the user
knows when something bad has happened to a PCI device. Its goals are to:

* Provide alert debug information.
* Enable self-healing.
* Provide a way to gather all needed debugging information when a problem
  needs vendor support.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

* Recovery procedures
* Diagnostics procedures
* Object dump procedures
* Out-of-box initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.

Actions
=======

Once an error is reported, devlink health performs the following actions:

* A log is sent to the kernel trace events buffer.
* Health status and statistics are updated for the reporter instance.
* An object dump is taken and saved at the reporter instance (as long as
  auto-dump is set and no other dump is already stored).
* An auto recovery attempt is made, depending on:

  - Auto-recovery configuration
  - Grace period (and burst period) vs. time passed since the last recovery

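
The sequence of actions above can be sketched as a small user-space model.
This is only an illustration under stated assumptions, not the kernel
implementation: the ``Reporter`` class, the ``report()`` function and all
field names are hypothetical, time is passed in explicitly as milliseconds,
and the burst period is ignored.

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import Callable, Optional


    @dataclass
    class Reporter:
        """Hypothetical stand-in for one health reporter instance."""
        name: str
        recover: Callable[[], bool]      # driver recovery callback
        dump: Callable[[], dict]         # driver object-dump callback
        auto_recover: bool = True
        auto_dump: bool = True
        grace_period_ms: int = 0
        healthy: bool = True
        error_count: int = 0
        recovery_count: int = 0
        last_recovery_ms: Optional[int] = None
        stored_dump: Optional[dict] = None
        trace: list = field(default_factory=list)


    def report(r: Reporter, msg: str, now_ms: int) -> str:
        # 1. Log the event (stands in for the kernel trace events buffer).
        r.trace.append((now_ms, r.name, msg))
        # 2. Update health status and statistics.
        r.error_count += 1
        r.healthy = False
        # 3. Take a dump only if auto-dump is set and none is stored yet.
        if r.auto_dump and r.stored_dump is None:
            r.stored_dump = r.dump()
        # 4. Attempt auto recovery, subject to configuration and grace period.
        if not r.auto_recover:
            return "recovery-disabled"
        if (r.last_recovery_ms is not None
                and now_ms - r.last_recovery_ms < r.grace_period_ms):
            return "within-grace-period"
        if r.recover():
            r.healthy = True
            r.recovery_count += 1
            r.last_recovery_ms = now_ms
            return "recovered"
        return "recovery-failed"

In this model, a second error reported before the grace period has elapsed
is counted and logged, but no recovery is attempted and the reporter stays
in the error state.
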
Devlink formatted message
=========================

To handle devlink health diagnose and health dump requests, devlink creates a
formatted message structure ``devlink_fmsg`` and sends it to the driver's
callback, which fills in the data using the devlink fmsg API.

Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
a JSON-like format. The API allows the driver to add nested attributes such as
object, object pair and value array, in addition to attributes such as name and
value.

The driver should use this API to fill the fmsg context in a format which
devlink later translates into the netlink message. When devlink needs to send
the data using SKBs to the netlink layer, it fragments the data between
different SKBs. To make this fragmentation possible, it uses virtual nest
attributes instead of actual nesting, which cannot be divided between
different SKBs.

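
The idea of virtual nesting can be illustrated with a user-space model in
which the message is a flat list of tokens, with explicit start and end
markers in place of real nesting. A flat list can be cut at any token
boundary and reassembled later. The ``Fmsg`` class and the ``fragment()`` and
``parse()`` helpers below are hypothetical names chosen for this sketch; they
are not the kernel fmsg API and say nothing about the netlink wire format.

.. code-block:: python

    from typing import Any, Iterator, List, Tuple

    Token = Tuple[str, Any]


    class Fmsg:
        """Hypothetical model of a formatted message as a flat token list."""

        def __init__(self) -> None:
            self.tokens: List[Token] = []

        def obj_start(self) -> None:
            self.tokens.append(("obj_start", None))

        def obj_end(self) -> None:
            self.tokens.append(("obj_end", None))

        def pair(self, name: str, value: Any) -> None:
            self.tokens.append(("name", name))
            self.tokens.append(("value", value))

        def arr_pair_start(self, name: str) -> None:
            self.tokens.append(("name", name))
            self.tokens.append(("arr_start", None))

        def arr_end(self) -> None:
            self.tokens.append(("arr_end", None))

        def value(self, value: Any) -> None:
            self.tokens.append(("value", value))


    def fragment(tokens: List[Token], max_tokens: int) -> Iterator[List[Token]]:
        """Split the flat list into chunks, as if each were one buffer."""
        for i in range(0, len(tokens), max_tokens):
            yield tokens[i:i + max_tokens]


    def parse(tokens: List[Token], pos: int = 0) -> Tuple[Any, int]:
        """Rebuild one value starting at pos; return it and the next position."""
        kind, val = tokens[pos]
        if kind == "value":
            return val, pos + 1
        if kind == "obj_start":
            obj, pos = {}, pos + 1
            while tokens[pos][0] != "obj_end":
                name = tokens[pos][1]            # a "name" token
                obj[name], pos = parse(tokens, pos + 1)
            return obj, pos + 1
        if kind == "arr_start":
            arr, pos = [], pos + 1
            while tokens[pos][0] != "arr_end":
                item, pos = parse(tokens, pos)
                arr.append(item)
            return arr, pos + 1
        raise ValueError(f"unexpected token {kind!r}")


    # Build a small message, split it, then reassemble it on the other side.
    msg = Fmsg()
    msg.obj_start()
    msg.pair("sqn", 7)
    msg.arr_pair_start("cqs")
    msg.value(1)
    msg.value(2)
    msg.arr_end()
    msg.obj_end()

    chunks = list(fragment(msg.tokens, 4))
    flat = [tok for chunk in chunks for tok in chunk]
    decoded, _ = parse(flat)

Because a chunk boundary may fall in the middle of an object or array, the
receiver can only rebuild the structure after concatenating the chunks in
order, which is what the end markers make possible.
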
User Interface
==============

The user can access/change each reporter's parameters and invoke
driver-specific callbacks via ``devlink``, e.g. per error type (per health
reporter):

* Configure reporter's generic parameters (like: disable/enable auto recovery)
* Invoke recovery procedure
* Run diagnostics
* Object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should follow closely that of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.

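
The single-dump behaviour described for ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
and ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be modelled in a few lines.
This is a sketch for illustration only; ``DumpStore`` and its methods are
hypothetical names, and ``take_dump`` stands in for the reporter's dump
callback.

.. code-block:: python

    from typing import Callable, Optional


    class DumpStore:
        """Hypothetical model of the single dump kept per reporter."""

        def __init__(self, take_dump: Callable[[], dict]) -> None:
            self._take_dump = take_dump
            self._stored: Optional[dict] = None

        def dump_get(self) -> dict:
            # Return the stored dump, generating one only if none exists.
            if self._stored is None:
                self._stored = self._take_dump()
            return self._stored

        def dump_clear(self) -> None:
            # Drop the stored dump so that the next get generates a new one.
            self._stored = None

Repeated ``dump_get()`` calls in this model return the same stored dump until
``dump_clear()`` is called.
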
The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+