.. SPDX-License-Identifier: GPL-2.0

=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
Build the kernel with the following configurations.

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

All set!

Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.

Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.

====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
       batches, when MMU sets it (e.g., on x86). This behavior can
       theoretically worsen lock contention (mmap_lock). If it is
       disabled, the multi-gen LRU will suffer a minor performance
       degradation for workloads that contiguously map hot pages,
       whose accessed bits can be otherwise cleared by fewer larger
       batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
       well, when MMU sets it (e.g., on x86). This behavior was not
       verified on x86 varieties other than Intel and AMD. If it is
       disabled, the multi-gen LRU will suffer a negligible
       performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================

E.g.,
::

    echo y >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0007
    echo 5 >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0005
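
The value read back from ``enabled`` is a bitmask of the components in
the table above. As a minimal sketch, a script could decode it as
follows. The component labels and the function name are illustrative
assumptions, not kernel identifiers; only the bit values come from the
table.

```python
# Bit values are taken from the table above. The labels are made up for
# this sketch and do not correspond to kernel symbol names.
COMPONENTS = {
    0x0001: "main switch",
    0x0002: "batched clearing of leaf accessed bits",
    0x0004: "clearing of non-leaf accessed bits",
}


def decode_enabled(text):
    """Return the labels of the components set in a value read from ``enabled``."""
    # int() with base 16 accepts the "0x" prefix printed by the kernel.
    value = int(text.strip(), 16)
    return [label for bit, label in COMPONENTS.items() if value & bit]
```

For instance, ``decode_enabled("0x0005")`` reports the main switch and
the non-leaf component, matching the second example above.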

Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.

Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.

The default value ``0`` means disabled.
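
A minimal sketch of setting this option from a script, assuming
``min_ttl_ms`` lives under ``/sys/kernel/mm/lru_gen/`` like the other
runtime options and that the caller has permission to write it. The
helper name and its ``base`` parameter are hypothetical; ``base`` exists
only so the helper can be exercised against a scratch directory.

```python
from pathlib import Path

# Directory named in the "Runtime options" section.
LRU_GEN_DIR = Path("/sys/kernel/mm/lru_gen")


def set_min_ttl_ms(n_ms, base=LRU_GEN_DIR):
    """Write N (in milliseconds) to min_ttl_ms; 0 disables the protection."""
    if n_ms < 0:
        raise ValueError("min_ttl_ms must be non-negative")
    (Path(base) / "min_ttl_ms").write_text(f"{n_ms}\n")


# Typical call on a live system (needs root):
#     set_min_ttl_ms(1000)
```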

Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, as is
concatenation with the delimiters ``,`` and ``;``.

``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilization.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.

When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::

    memcg  memcg_id  memcg_path
       node  node_id
           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
           ...
           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages

Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.
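
The layout above is line-oriented, so it can be parsed with plain string
splitting. The following is a minimal sketch that assumes the three line
shapes shown above and nothing else; it does not attempt to handle the
extra lines in ``lru_gen_full``. The function name and the numbers in
``SAMPLE`` are made up for illustration.

```python
def parse_lru_gen(text):
    """Map (memcg_id, node_id) to a list of (gen, age_ms, anon, file) tuples."""
    histograms = {}
    memcg_id = None
    key = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "memcg":
            # "memcg memcg_id memcg_path": only the id is kept here.
            memcg_id = int(fields[1])
        elif fields[0] == "node":
            key = (memcg_id, int(fields[1]))
            histograms[key] = []
        else:
            # "gen_nr age_in_ms nr_anon_pages nr_file_pages"
            gen, age_ms, anon, file_pages = (int(f) for f in fields[:4])
            histograms[key].append((gen, age_ms, anon, file_pages))
    return histograms


# Invented sample in the documented layout, not real kernel output.
SAMPLE = """\
memcg 1 /
 node 0
 7 60000 100 400
 8 30000 50 200
 9 10000 20 80
 10 500 10 40
"""

histograms = parse_lru_gen(SAMPLE)
```

Because the histograms are noncumulative, the page counts of several
bins can be summed directly to cover a longer interval.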

Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:

    ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``

``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.
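
A small helper can keep the optional arguments in the right order. This
is a sketch only: the function names are hypothetical, and the strings
it returns still have to be written to ``/sys/kernel/debug/lru_gen`` by
the caller. ``batch`` relies on the ``;`` delimiter mentioned under
"Experimental features".

```python
def aging_command(memcg_id, node_id, max_gen_nr, can_swap=None, force_scan=None):
    """Build "+ memcg_id node_id max_gen_nr [can_swap [force_scan]]"."""
    # force_scan is positional in the command, so it needs can_swap before it.
    if force_scan is not None and can_swap is None:
        raise ValueError("force_scan requires can_swap to be given as well")
    parts = ["+", memcg_id, node_id, max_gen_nr]
    if can_swap is not None:
        parts.append(int(bool(can_swap)))
    if force_scan is not None:
        parts.append(int(bool(force_scan)))
    return " ".join(str(p) for p in parts)


def batch(commands):
    """Concatenate several commands into one write using the ';' delimiter."""
    return ";".join(commands)
```

For example, ``aging_command(1, 0, 10)`` yields ``+ 1 0 10``, leaving
both optional arguments at their defaults.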

A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the amount of cold pages, as defined by
this time interval, on each of them.
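
The ranking step can be sketched as below, assuming each server's
histogram has already been collected as a list of
``(gen_nr, age_in_ms, nr_anon_pages, nr_file_pages)`` tuples. The
function names, server names and page counts are invented for
illustration, and treating "cold" as "at least ``min_age_ms`` old" is
one possible policy rather than something the kernel prescribes.

```python
def cold_pages(generations, min_age_ms):
    """Sum anon and file pages in generations at least min_age_ms old."""
    # Bins are noncumulative, so they can simply be added up.
    return sum(
        anon + file_pages
        for _gen, age_ms, anon, file_pages in generations
        if age_ms >= min_age_ms
    )


def rank_servers(histograms, min_age_ms):
    """Return server names ordered from most to fewest cold pages."""
    return sorted(
        histograms,
        key=lambda name: cold_pages(histograms[name], min_age_ms),
        reverse=True,
    )


# Invented data: two servers with two generations each.
example = {
    "server-a": [(7, 60000, 100, 400), (8, 500, 10, 40)],
    "server-b": [(3, 90000, 1000, 2000), (4, 100, 5, 5)],
}
ranking = rank_servers(example, min_age_ms=30000)
```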

Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.

Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``.

    ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``

``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``. Valid
values are ``0`` through ``200`` and ``max``, where ``max`` reclaims
anonymous memory exclusively. ``nr_to_reclaim`` limits the number of
pages to evict.
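
The constraints above can be checked before the command is written. The
following sketch uses a hypothetical helper name and takes
``max_gen_nr`` as an argument, assuming the caller has just read it from
``lru_gen``; it only builds the command string and does not write it.

```python
def eviction_command(memcg_id, node_id, min_gen_nr, max_gen_nr,
                     swappiness=None, nr_to_reclaim=None):
    """Build "- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]"."""
    # The two youngest generations are not fully aged and cannot be evicted.
    if min_gen_nr >= max_gen_nr - 1:
        raise ValueError("min_gen_nr must be less than max_gen_nr-1")
    parts = ["-", memcg_id, node_id, min_gen_nr]
    if swappiness is not None:
        is_valid_int = isinstance(swappiness, int) and 0 <= swappiness <= 200
        if swappiness != "max" and not is_valid_int:
            raise ValueError("swappiness must be 0-200 or 'max'")
        parts.append(swappiness)
    elif nr_to_reclaim is not None:
        # nr_to_reclaim is positional and can only follow swappiness.
        raise ValueError("nr_to_reclaim requires swappiness to be given")
    if nr_to_reclaim is not None:
        parts.append(nr_to_reclaim)
    return " ".join(str(p) for p in parts)
```

For example, with ``max_gen_nr`` equal to 10, generation 7 can be
evicted, but generation 9 is rejected because it is ``max_gen_nr-1``.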
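
A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of overestimation, it retries on the next
server according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impact on the
existing jobs.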
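
The retry loop described above might look like the following sketch.
``reclaim`` is a hypothetical callback supplied by the scheduler: it is
assumed to issue the eviction command on the given server and return the
number of pages actually freed. Nothing here is a kernel interface.

```python
def land_job(ranked_servers, reclaim, pages_needed):
    """Try servers in ranking order; return the first that frees enough pages."""
    for server in ranked_servers:
        # reclaim() is assumed to return how many pages were really evicted,
        # which may be fewer than estimated if the working set was misjudged.
        if reclaim(server, pages_needed) >= pages_needed:
            return server
    # No server could free enough cold pages.
    return None
```

Moving on to the next server, rather than reclaiming harder on the
current one, is what keeps the impact on existing jobs limited.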
  133. server according to the ranking result obtained from the working set
  134. estimation step. This less forceful approach limits the impacts on the
  135. existing jobs.