.. SPDX-License-Identifier: GPL-2.0

=================================
dm-pcache — Persistent Cache
=================================

*Author: Dongsheng Yang <dongsheng.yang@linux.dev>*

This document describes *dm-pcache*, a Device-Mapper target that lets a
byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
high-performance, crash-persistent cache in front of a slower block
device. The code lives in ``drivers/md/dm-pcache/``.

Quick feature summary
=====================

* *Write-back* caching (only mode currently supported).
* *16 MiB segments* allocated on the pmem device.
* *Data CRC32* verification (optional, per cache).
* Crash-safe: every metadata structure is duplicated
  (``PCACHE_META_INDEX_MAX == 2``) and protected with CRC and sequence
  numbers.
* *Multi-tree indexing* (indexing trees sharded by logical address) for
  high pmem parallelism.
* Pure *DAX path* I/O – no extra BIO round-trips.
* *Log-structured write-back* that preserves backend crash-consistency.

Constructor
===========

::

   pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>]

=========================  ====================================================
``cache_dev``              Any DAX-capable block device (``/dev/pmem0``…).
                           All metadata *and* cached blocks are stored here.

``backing_dev``            The slow block device to be cached.

``cache_mode``             Optional. Only ``writeback`` is accepted at the
                           moment.

``data_crc``               Optional. Defaults to ``false``.

                           * ``true``  – store a CRC32 for every cached entry
                             and verify it on reads
                           * ``false`` – skip CRC (faster)
=========================  ====================================================

Example
-------

.. code-block:: shell

   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

The first time a pmem device is used, dm-pcache formats it automatically
(super-block, cache_info, etc.).

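
Judging from the example above, the optional-argument count is the number of
words that follow it, so two key/value pairs give ``4``. The sketch below
builds a table line and derives that count automatically. ``pcache_table`` is
a hypothetical helper name, not part of dm-pcache or dmsetup, and it only
prints the table string.

.. code-block:: shell

   # Hypothetical helper (not shipped with dm-pcache): print a pcache table line.
   # usage: pcache_table <sectors> <cache_dev> <backing_dev> [<key> <value>]...
   pcache_table() {
       table="0 $1 pcache $2 $3"
       shift 3
       # Every remaining word (each key and each value) counts as one
       # optional argument.
       if [ "$#" -gt 0 ]; then
           table="$table $# $*"
       fi
       printf '%s\n' "$table"
   }

   pcache_table 2048 /dev/pmem0 /dev/sdb cache_mode writeback data_crc true

The printed string can be passed to ``dmsetup create <name> --table``; in
practice the sector count would come from ``blockdev --getsz`` rather than a
literal.
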
Status line
===========

``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:

::

   <sb_flags> <seg_total> <cache_segs> <segs_used> \
   <gc_percent> <cache_flags> \
   <key_head_seg>:<key_head_off> \
   <dirty_tail_seg>:<dirty_tail_off> \
   <key_tail_seg>:<key_tail_off>

Field meanings
--------------

===============================  =============================================
``sb_flags``                     Super-block flags (e.g. endian marker).

``seg_total``                    Number of physical *pmem* segments.

``cache_segs``                   Number of segments used for cache.

``segs_used``                    Segments currently allocated (bitmap weight).

``gc_percent``                   Current GC high-water mark (0-90).

``cache_flags``                  * Bit 0 – DATA_CRC enabled
                                 * Bit 1 – INIT_DONE (cache initialised)
                                 * Bits 2-5 – cache mode (0 == WB)

``key_head``                     Where new key-sets are being written.

``dirty_tail``                   First dirty key-set that still needs
                                 write-back to the backing device.

``key_tail``                     First key-set that may be reclaimed by GC.
===============================  =============================================

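
As a rough illustration of the fields above, the following sketch splits a
status line into named values and decodes ``cache_flags``. The helper names
and the sample line are hypothetical. The sketch assumes the usual
``<start> <length> <target>`` prefix that ``dmsetup status`` prints before the
target-specific fields, and that ``cache_flags`` is given as a decimal or
``0x``-prefixed number; the radix actually printed by the driver is not stated
in this document.

.. code-block:: shell

   # Sample line only: every value here is made up for illustration.
   # "0 2097152 pcache" is the generic start/length/target prefix.
   sample='0 2097152 pcache 0 64 60 12 70 3 5:4096 3:0 2:8192'

   # Hypothetical helper: print the pcache status fields as key=value lines.
   parse_pcache_status() {
       # Intentionally unquoted: split the line into positional parameters.
       set -- $1
       echo "sb_flags=$4"
       echo "seg_total=$5"
       echo "cache_segs=$6"
       echo "segs_used=$7"
       echo "gc_percent=$8"
       echo "cache_flags=$9"
       echo "key_head_seg=${10%%:*}"
       echo "key_head_off=${10#*:}"
       echo "dirty_tail_seg=${11%%:*}"
       echo "dirty_tail_off=${11#*:}"
       echo "key_tail_seg=${12%%:*}"
       echo "key_tail_off=${12#*:}"
   }

   # Hypothetical helper: decode cache_flags using the bit layout in the table.
   decode_cache_flags() {
       flags=$(( $1 ))
       crc=$(( flags & 1 ))            # bit 0: DATA_CRC
       init=$(( (flags >> 1) & 1 ))    # bit 1: INIT_DONE
       mode=$(( (flags >> 2) & 15 ))   # bits 2-5: cache mode, 0 == write-back
       echo "data_crc=$crc init_done=$init cache_mode=$mode"
   }

   parse_pcache_status "$sample"
   decode_cache_flags 3

On a live system the sample line would be replaced by the output of
``dmsetup status <device>``.
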
Messages
========

*Change GC trigger*

::

   dmsetup message <dev> 0 gc_percent <0-90>

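
Since the message only accepts values in the range 0-90, a script may want to
check the value before sending it. The wrapper below is a minimal sketch;
``set_gc_percent`` and the ``DRY_RUN`` variable are hypothetical names
introduced here, not part of dmsetup or dm-pcache.

.. code-block:: shell

   # Hypothetical wrapper: validate gc_percent before sending the message.
   # With DRY_RUN set to a non-empty value the command is printed, not run.
   set_gc_percent() {
       dev=$1
       pct=$2
       case $pct in
           ''|*[!0-9]*)
               echo "gc_percent must be a non-negative integer" >&2
               return 1 ;;
       esac
       if [ "$pct" -gt 90 ]; then
           echo "gc_percent must be in the range 0-90" >&2
           return 1
       fi
       ${DRY_RUN:+echo} dmsetup message "$dev" 0 gc_percent "$pct"
   }

   DRY_RUN=1
   set_gc_percent pcache_sdb 80
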
Theory of operation
===================

Sub-devices
-----------

====================  =========================================================
``backing_dev``       Any block device (SSD/HDD/loop/LVM, etc.).
``cache_dev``         DAX device; must expose direct-access memory.
====================  =========================================================

Segments and key-sets
---------------------

* The pmem space is divided into *16 MiB segments*.
* Each write allocates space from a per-CPU *data_head* inside a segment.
* A *cache-key* records a logical range on the origin (the backing device)
  and where it lives inside pmem (segment + offset + generation).
* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
  and are themselves crash-safe (CRC).
* The pair *(key_tail, dirty_tail)* delimits clean/dirty and live/dead ksets.

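
The segment size above gives a quick way to estimate how many segments a pmem
region can hold. This is only an upper bound under the assumption that the
whole region is carved into segments; any space reserved for on-media metadata
is ignored, so the ``seg_total`` reported by the driver may be smaller.
``max_segments`` is a hypothetical helper name.

.. code-block:: shell

   # Sketch only: upper bound on the number of 16 MiB segments in a region.
   SEG_SIZE=$(( 16 * 1024 * 1024 ))

   # usage: max_segments <region_size_in_bytes>
   max_segments() {
       echo $(( $1 / SEG_SIZE ))
   }

   max_segments $(( 1024 * 1024 * 1024 ))   # 1 GiB region

For a 1 GiB region this prints ``64``.
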
Write-back
----------

Dirty keys are queued into a tree; a background worker copies data
back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the
upper layers forces an immediate metadata commit.

Garbage collection
------------------

GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks
from *key_tail*, frees segments whose every key has been invalidated, and
advances *key_tail*.

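
The trigger condition can be checked by hand from the values in the status
line. The sketch below mirrors the formula as written above;
``gc_should_start`` is a hypothetical helper, and the use of integer division
is an assumption of this sketch rather than a statement about the driver.

.. code-block:: shell

   # Hypothetical helper: succeed when the documented GC condition holds,
   #   segs_used >= seg_total * gc_percent / 100
   # usage: gc_should_start <seg_total> <segs_used> <gc_percent>
   gc_should_start() {
       seg_total=$1
       segs_used=$2
       gc_percent=$3
       # Integer division is an assumption of this sketch.
       threshold=$(( seg_total * gc_percent / 100 ))
       [ "$segs_used" -ge "$threshold" ]
   }

   # 64 segments at gc_percent 70 give a threshold of 44 segments.
   if gc_should_start 64 44 70; then echo "GC would start"; fi
   if ! gc_should_start 64 43 70; then echo "GC would not start"; fi
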
CRC verification
----------------

If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached data
range when it is inserted and stores it in the on-media key. Reads
validate the CRC before copying to the caller.

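
The store-on-insert, verify-on-read pattern can be illustrated in user space.
This is a toy sketch, not the driver's code path: it uses the POSIX ``cksum``
utility, whose CRC is only a stand-in for whatever CRC32 variant the driver
uses, and ``crc_of`` and ``verified_read`` are hypothetical names.

.. code-block:: shell

   # Illustration only: cksum's CRC stands in for the driver's CRC32.
   crc_of() {
       printf '%s' "$1" | cksum | cut -d ' ' -f 1
   }

   # "Insert": compute the checksum that would be stored next to the data.
   data='cached block contents'
   stored_crc=$(crc_of "$data")

   # "Read": recompute and compare before handing the data to the caller.
   verified_read() {
       if [ "$(crc_of "$1")" = "$2" ]; then
           printf '%s\n' "$1"
       else
           echo "CRC mismatch" >&2
           return 1
       fi
   }

   verified_read "$data" "$stored_crc"
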
Failure handling
================

* *pmem media errors* – all metadata copies are read with
  ``copy_mc_to_kernel``; an uncorrectable error is logged and aborts
  initialisation.
* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
  dm-pcache retries internally (request deferral).
* *System crash* – on attach, the driver replays ksets from *key_tail* to
  rebuild the in-core trees; every segment’s generation guards against
  use-after-free keys.

Limitations & TODO
==================

* Only *write-back* mode is supported; other modes are planned.
* Only FIFO cache invalidation is supported; other policies (LRU, ARC, ...)
  are planned.
* Table reload is not currently supported.
* Discard support is planned.

Example workflow
================

.. code-block:: shell

   # 1. Create devices
   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

   # 2. Put a filesystem on top
   mkfs.ext4 /dev/mapper/pcache_sdb
   mount /dev/mapper/pcache_sdb /mnt

   # 3. Tune GC threshold to 80 %
   dmsetup message pcache_sdb 0 gc_percent 80

   # 4. Observe status
   watch -n1 'dmsetup status pcache_sdb'

   # 5. Shutdown
   umount /mnt
   dmsetup remove pcache_sdb

``dm-pcache`` is under active development; feedback, bug reports and patches
are very welcome!