- .. SPDX-License-Identifier: GPL-2.0
- ==============
- Devlink Health
- ==============
- Background
- ==========
- The ``devlink`` health mechanism is intended for real-time alerting, so that
- the user knows when something bad has happened to a PCI device. Its goals are to:
- * Provide alert debug information.
- * Enable self-healing.
- * If a problem needs vendor support, provide a way to gather all needed
- debugging information.
- Overview
- ========
- The main idea is to unify and centralize driver health reports in the
- generic ``devlink`` instance and allow the user to set different
- attributes of the health reporting and recovery procedures.
- The ``devlink`` health reporter:
- A device driver creates a "health reporter" for each error/health type.
- An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx error)
- or unknown (driver specific).
- For each registered health reporter, a driver can issue error/health reports
- asynchronously. All health report handling is done by ``devlink``.
- A device driver can provide specific callbacks for each "health reporter", e.g.:
- * Recovery procedures
- * Diagnostics procedures
- * Object dump procedures
- * Out Of Box initial parameters
- Different parts of the driver can register different types of health reporters
- with different handlers.
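The relationship between a reporter and its optional callbacks can be pictured with a small user-space model. This is only a sketch: ``reporter_ops_model``, ``tx_reporter_ops`` and ``model_recover`` are hypothetical names invented for illustration and are not the kernel's devlink API. It assumes that a missing callback is represented by a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of per-reporter callbacks; a NULL pointer means the
 * driver does not provide that procedure. */
struct reporter_ops_model {
	const char *name;
	int (*recover)(void *priv);
	int (*diagnose)(void *priv);
	int (*dump)(void *priv);
};

/* Illustrative "tx" reporter: priv points to an int error flag. */
static int tx_recover(void *priv)
{
	*(int *)priv = 0;	/* pretend the queue was reset */
	return 0;
}

static int tx_diagnose(void *priv)
{
	return *(int *)priv;	/* report the current error flag */
}

/* Recovery and diagnostics are provided; .dump is left NULL. */
const struct reporter_ops_model tx_reporter_ops = {
	.name = "tx",
	.recover = tx_recover,
	.diagnose = tx_diagnose,
};

/* Core side of the model: call the recovery callback only if it exists. */
int model_recover(const struct reporter_ops_model *ops, void *priv)
{
	assert(ops != NULL);
	if (ops->recover == NULL)
		return -1;	/* recovery not supported by this reporter */
	return ops->recover(priv);
}
```

In this model, another part of the driver could define a second ops structure (for example for rx) with a different set of handlers.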
- Actions
- =======
- Once an error is reported, devlink health will perform the following actions:
- * A log is sent to the kernel trace events buffer
- * Health status and statistics are updated for the reporter instance
- * An object dump is taken and saved at the reporter instance (as long as
- auto-dump is set and no other dump is already stored)
- * An auto-recovery attempt is made, depending on:
- - Auto-recovery configuration
- - Grace period (and burst period) vs. time passed since the last recovery
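The auto-recovery condition above can be sketched as a small decision function. This is a simplified user-space model, not the kernel implementation: ``reporter_model`` and ``should_auto_recover`` are hypothetical names, timestamps are plain milliseconds, and the burst period is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified recovery state of one reporter. */
struct reporter_model {
	bool auto_recover;		/* auto-recovery configuration */
	uint64_t grace_period_ms;	/* 0 means no grace period */
	uint64_t last_recovery_ms;	/* time of the last recovery */
	uint32_t recovery_count;	/* recoveries performed so far */
};

/* Return true if an automatic recovery attempt should be made at now_ms.
 * Assumes now_ms >= r->last_recovery_ms. The burst period is not modeled. */
bool should_auto_recover(const struct reporter_model *r, uint64_t now_ms)
{
	assert(r != NULL);

	if (!r->auto_recover)
		return false;
	/* No earlier recovery, so there is no grace period to respect. */
	if (r->recovery_count == 0)
		return true;
	/* Still inside the grace period that follows the last recovery. */
	if (now_ms - r->last_recovery_ms < r->grace_period_ms)
		return false;
	return true;
}
```

With a 500 ms grace period and a last recovery at t = 1000 ms, this model declines a recovery at t = 1200 ms and allows one at t = 1500 ms.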
- Devlink formatted message
- =========================
- To handle devlink health diagnose and health dump requests, devlink creates a
- formatted message structure ``devlink_fmsg`` and sends it to the driver's
- callback, which fills in the data using the devlink fmsg API.
- Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
- a JSON-like format. The API allows the driver to add nested attributes such as
- object, object pair and value array, in addition to attributes such as name and
- value.
- The driver should use this API to fill the fmsg context in a format which
- devlink later translates into the netlink message. When devlink needs to send
- the data to the netlink layer using SKBs, it fragments the data between
- different SKBs. To do this fragmentation, it uses virtual nest attributes,
- avoiding actual nesting, which cannot be divided between different SKBs.
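The nesting described above (object, name/value pair, value array) can be illustrated with a user-space mock that renders the structure as JSON-like text. This is a sketch under stated assumptions: ``fmsg_model`` and every ``fmsg_*`` helper below are hypothetical and are not the kernel's fmsg functions, and the mock writes into a fixed text buffer rather than netlink attributes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a formatted message: a bounded text buffer. */
struct fmsg_model {
	char buf[256];
	size_t len;
	int need_comma;	/* nonzero if the next item needs a ", " separator */
};

/* Append text; in this sketch, text that does not fit is dropped. */
static void fmsg_append(struct fmsg_model *m, const char *s)
{
	size_t n = strlen(s);

	assert(m != NULL);
	if (m->len + n < sizeof(m->buf)) {
		memcpy(m->buf + m->len, s, n + 1);
		m->len += n;
	}
}

/* Emit one item, preceded by a separator if a sibling came before it. */
static void fmsg_item(struct fmsg_model *m, const char *s)
{
	if (m->need_comma)
		fmsg_append(m, ", ");
	fmsg_append(m, s);
	m->need_comma = 1;
}

/* Open a nest: whatever comes first inside it needs no separator. */
static void fmsg_open(struct fmsg_model *m, const char *s)
{
	fmsg_item(m, s);
	m->need_comma = 0;
}

/* Close a nest: anything that follows it is a sibling. */
static void fmsg_close(struct fmsg_model *m, const char *s)
{
	fmsg_append(m, s);
	m->need_comma = 1;
}

/* Emit the name part of a pair; the value follows without a separator. */
static void fmsg_name(struct fmsg_model *m, const char *name)
{
	char tmp[64];

	snprintf(tmp, sizeof(tmp), "\"%s\": ", name);
	fmsg_open(m, tmp);
}

static void fmsg_u32(struct fmsg_model *m, unsigned int v)
{
	char tmp[16];

	snprintf(tmp, sizeof(tmp), "%u", v);
	fmsg_item(m, tmp);
}

/* Illustrative callback body: one object, one pair, one value array. */
void example_diagnose(struct fmsg_model *m)
{
	fmsg_open(m, "{");
	fmsg_name(m, "queue");
	fmsg_u32(m, 3);
	fmsg_name(m, "errors");
	fmsg_open(m, "[");
	fmsg_u32(m, 7);
	fmsg_u32(m, 9);
	fmsg_close(m, "]");
	fmsg_close(m, "}");
}
```

Called on a zero-initialized ``struct fmsg_model``, ``example_diagnose()`` leaves ``{"queue": 3, "errors": [7, 9]}`` in ``buf``.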
- User Interface
- ==============
- The user can access/change each reporter's parameters and driver-specific
- callbacks via ``devlink``, e.g. per error type (per health reporter):
- * Configure reporter's generic parameters (like: disable/enable auto recovery)
- * Invoke recovery procedure
- * Run diagnostics
- * Object dump
- .. list-table:: List of devlink health interfaces
- :widths: 10 90
- * - Name
- - Description
- * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
- - Retrieves status and configuration info per device and reporter.
- * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
- - Allows reporter-related configuration setting.
- * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
- - Triggers reporter's recovery procedure.
- * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
- - Triggers a fake health event on the reporter. The effects of the test
- event in terms of recovery flow should follow closely that of a real
- event.
- * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
- - Retrieves current device state related to the reporter.
- * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
- - Retrieves the last stored dump. Devlink health
- saves a single dump. If a dump is not already stored by devlink
- for this reporter, devlink generates a new dump.
- Dump output is defined by the reporter.
- * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
- - Clears the last saved dump file for the specified reporter.
- The following diagram provides a general overview of ``devlink-health``::
-                                                    netlink
-                                           +--------------------------+
-                                           |                          |
-                                           |            +             |
-                                           |            |             |
-                                           +--------------------------+
-                                                        |request for ops
-                                                        |(diagnose,
-       driver                               devlink     |recover,
-                                                        |dump)
-     +--------+                            +--------------------------+
-     |        |                            |    reporter|             |
-     |        |                            |  +---------v----------+  |
-     |        |   ops execution            |  |                    |  |
-     |     <----------------------------------+                    |  |
-     |        |                            |  |                    |  |
-     |        |                            |  + ^------------------+  |
-     |        |                            |    | request for ops     |
-     |        |                            |    | (recover, dump)     |
-     |        |                            |    |                     |
-     |        |                            |  +-+------------------+  |
-     |        |     health report          |  | health handler     |  |
-     |        +------------------------------->                    |  |
-     |        |                            |  +--------------------+  |
-     |        |     health reporter create |                          |
-     |        +---------------------------->                          |
-     +--------+                            +--------------------------+