memory_repair.rst 6.1 KB

.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later

==========================
EDAC Memory Repair Control
==========================

Copyright (c) 2024-2025 HiSilicon Limited.

:Author:   Shiju Jose <shiju.jose@huawei.com>
:License:  The GNU Free Documentation License, Version 1.2 without
           Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
           (dual licensed under the GPL v2)
:Original Reviewers:

- Written for: 6.15

Introduction
------------

Some memory devices support repair operations to address issues in their
memory media. Post Package Repair (PPR) and memory sparing are examples of
such features.

Post Package Repair (PPR)
~~~~~~~~~~~~~~~~~~~~~~~~~

Post Package Repair is a maintenance operation that requests the memory
device to perform a repair operation on its media. It is a memory self-healing
feature that fixes a failing memory location by replacing it with a spare row
in a DRAM device.

For example, a CXL memory device with DRAM components that support PPR
features implements maintenance operations. DRAM components support the
following types of PPR functions:

- hard PPR, for a permanent row repair, and
- soft PPR, for a temporary row repair.

Soft PPR is much faster than hard PPR, but the repair is lost after a power
cycle.

The data may not be retained and memory requests may not be correctly
processed during a repair operation. In such cases, the repair operation
should not be executed at runtime.

For example, for CXL memory devices, see CXL spec rev 3.1 [1]_ sections
8.2.9.7.1.1 PPR Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance Operation
and 8.2.9.7.1.3 hPPR Maintenance Operation for more details.

Memory Sparing
~~~~~~~~~~~~~~

Memory sparing is a repair function that replaces a portion of memory with
a portion of functional memory at a particular granularity. Memory sparing
supports cacheline, row, bank and rank granularities. For example, in
rank memory-sparing mode, one memory rank serves as a spare for other ranks on
the same channel in case they fail.

The spare rank is held in reserve and is not used as active memory until
a failure is indicated. The reserved capacity is subtracted from the total
memory available in the system.

After an error threshold is surpassed in a system protected by memory sparing,
the content of a failing rank of DIMMs is copied to the spare rank. The
failing rank is then taken offline and the spare rank is placed online for use
as active memory in place of the failed rank.
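
The failover sequence above can be pictured with a small toy model. This is
only an illustrative sketch, not how any device or driver implements sparing:
the ``Channel`` class, its fields and the per-rank error counter are all
hypothetical names introduced here, and the copy of the rank contents is
reduced to a comment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Channel:
    """Toy model of one memory channel protected by rank sparing."""

    active: List[str]        # ranks currently used as active memory
    spare: Optional[str]     # rank held in reserve; None once consumed
    threshold: int           # error count that must be surpassed to fail over
    errors: Dict[str, int] = field(default_factory=dict)

    def report_error(self, rank: str) -> Optional[str]:
        """Count one error on `rank`.

        Returns the name of the rank that replaced it if a failover
        happened, otherwise None.
        """
        if rank not in self.active:
            raise ValueError(f"{rank} is not an active rank")
        self.errors[rank] = self.errors.get(rank, 0) + 1
        if self.errors[rank] <= self.threshold or self.spare is None:
            return None
        # A real device would copy the failing rank's contents to the
        # spare here. The failing rank then goes offline and the spare
        # takes its place as active memory.
        replacement, self.spare = self.spare, None
        self.active[self.active.index(rank)] = replacement
        del self.errors[rank]
        return replacement
```

With a threshold of 2, the third error reported against the same rank is the
one that triggers the swap; after that no spare remains on the channel.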

For example, CXL memory devices can support various subclasses of sparing
operation, which vary in the scope of the sparing being performed.
The cacheline sparing subclass refers to a sparing action that can replace a
full cacheline. Row sparing is provided as an alternative to PPR sparing
functions and its scope is that of a single DDR row. Bank sparing allows an
entire bank to be replaced. Rank sparing is defined as an operation in which
an entire DDR rank is replaced.

See CXL spec 3.1 [1]_ section 8.2.9.7.1.4 Memory Sparing Maintenance
Operations for more details.

.. [1] https://computeexpresslink.org/cxl-specification/

Use cases of generic memory repair features control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. The soft PPR, hard PPR and memory-sparing features share similar control
   attributes. Therefore, there is a need for a standardized, generic sysfs
   repair control that is exposed to userspace and used by administrators,
   scripts and tools.

2. When a CXL device detects an error in a memory component, it informs the
   host of the need for a repair maintenance operation by using an event
   record where the "maintenance needed" flag is set. The event record
   specifies the device physical address (DPA) and attributes of the memory
   that requires repair. The kernel reports the corresponding CXL general
   media or DRAM trace event to userspace, and userspace tools (e.g.
   rasdaemon) initiate a repair maintenance operation in response to the
   device request using the sysfs repair control.

3. Userspace tools, such as rasdaemon, request a repair operation on a memory
   region when the "maintenance needed" flag is set, when an uncorrected
   memory error is reported, when the number of corrected memory errors
   exceeds a threshold value, or when the "corrected errors threshold
   exceeded" flag is set for that memory.

4. Multiple PPR/sparing instances may be present per memory device.

5. Drivers should enforce that live repair is safe. In systems where memory
   mapping functions can change between boots, one approach is to log the
   memory errors seen on this boot and check live memory repair requests
   against that log.
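
The trigger conditions in item 3 and the boot-time error log in item 5 can be
sketched as follows. This is a minimal illustration under stated assumptions:
``ErrorReport``, ``should_request_repair`` and ``BootErrorLog`` are
hypothetical names, not rasdaemon or kernel interfaces, and the fields only
mirror the conditions named in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorReport:
    """Hypothetical summary of what was reported for one memory location."""

    dpa: int                          # device physical address
    maintenance_needed: bool = False  # "maintenance needed" flag in the event
    uncorrected: bool = False         # an uncorrected memory error was seen
    corrected_count: int = 0          # corrected errors counted so far
    threshold_exceeded: bool = False  # device-side threshold flag


def should_request_repair(report: ErrorReport, corrected_threshold: int) -> bool:
    """Return True if any of the conditions from item 3 holds."""
    return (
        report.maintenance_needed
        or report.uncorrected
        or report.corrected_count > corrected_threshold
        or report.threshold_exceeded
    )


class BootErrorLog:
    """Addresses with errors seen since boot (the approach in item 5)."""

    def __init__(self) -> None:
        self._seen = set()

    def record(self, dpa: int) -> None:
        self._seen.add(dpa)

    def allows_live_repair(self, dpa: int) -> bool:
        # Only addresses that reported an error on this boot are accepted,
        # because the address mapping may differ from earlier boots.
        return dpa in self._seen
```

The two pieces are independent: the predicate models the userspace decision,
while the log models a check that a driver might apply before accepting a
live repair request.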

The File System
---------------

The control attributes of a registered memory repair instance can be
accessed under /sys/bus/edac/devices/<dev-name>/mem_repairX/.
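
As a rough sketch of how a tool might discover the repair instances of one
device, the snippet below lists the ``mem_repairX`` directories under the
path given above. The function name ``list_repair_instances`` and the
``root`` parameter (useful for testing against a fake tree) are assumptions
made for this example only.

```python
import re
from pathlib import Path

EDAC_DEVICES = Path("/sys/bus/edac/devices")
_INSTANCE = re.compile(r"mem_repair(\d+)")


def list_repair_instances(dev_name, root=EDAC_DEVICES):
    """Return the mem_repairX directories of `dev_name`, ordered by X."""
    dev_dir = Path(root) / dev_name
    if not dev_dir.is_dir():
        return []
    found = []
    for entry in dev_dir.iterdir():
        match = _INSTANCE.fullmatch(entry.name)
        if match and entry.is_dir():
            found.append((int(match.group(1)), entry))
    # Sort numerically so that mem_repair10 comes after mem_repair2.
    found.sort(key=lambda item: item[0])
    return [path for _, path in found]
```

An empty list means either that the device does not exist or that it has
registered no memory repair instance.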

sysfs
-----

Sysfs files are documented in
`Documentation/ABI/testing/sysfs-edac-memory-repair`.

Examples
--------

Memory repair usage takes the form shown in these examples:

1. CXL memory sparing

Memory sparing is defined as a repair function that replaces a portion of
memory with a portion of functional memory at that same DPA. The subclasses
of this operation (cacheline, row, bank and rank sparing) vary in the scope
of the sparing being performed.

Memory sparing maintenance operations may be supported by CXL devices that
implement the CXL.mem protocol. A sparing maintenance operation requests the
CXL device to perform a repair operation on its media. For example, a CXL
device with DRAM components that support memory sparing features may
implement sparing maintenance operations.

2. CXL memory Soft Post Package Repair (sPPR)

Post Package Repair (PPR) maintenance operations may be supported by CXL
devices that implement the CXL.mem protocol. A PPR maintenance operation
requests the CXL device to perform a repair operation on its media.
For example, a CXL device with DRAM components that support PPR features
may implement PPR maintenance operations. Soft PPR (sPPR) is a temporary
row repair. Soft PPR may be faster, but the repair is lost with a power
cycle.

Sysfs files for memory repair are documented in
`Documentation/ABI/testing/sysfs-edac-memory-repair`.
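
A repair request through such an instance might be driven from a script along
the following lines. This is a sketch under stated assumptions, not the
documented procedure: the helper ``request_repair`` is hypothetical, and the
attribute names used in the usage note (``dpa``, ``nibble_mask``, ``repair``)
and the convention of writing ``1`` to trigger the operation are assumptions
that should be checked against the ABI document referenced above.

```python
from pathlib import Path


def request_repair(instance_dir, attrs, trigger="repair"):
    """Write repair attributes, then start the repair operation.

    `attrs` maps attribute file names to values. Integers are written in
    hexadecimal, anything else as its string form. The trigger attribute
    name and the value "1" are assumptions of this sketch.
    """
    instance = Path(instance_dir)
    # Check everything first so that nothing is written on a partial match.
    for name in list(attrs) + [trigger]:
        if not (instance / name).is_file():
            raise FileNotFoundError(f"{name} is not exposed by {instance}")
    for name, value in attrs.items():
        text = hex(value) if isinstance(value, int) else str(value)
        (instance / name).write_text(text)
    # The trigger is written last, after all attributes are in place.
    (instance / trigger).write_text("1")
```

For example, ``request_repair(path, {"dpa": 0x300000, "nibble_mask": 0xf})``
would write both attributes and then the trigger, where ``path`` is one
``mem_repairX`` directory. A device exposes only the attributes it supports,
which is why the helper refuses to proceed when a file is missing.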