  1. .. SPDX-License-Identifier: GPL-2.0
  2. ===============
  3. Physical Memory
  4. ===============
  5. Linux is available for a wide range of architectures so there is a need for an
  6. architecture-independent abstraction to represent the physical memory. This
  7. chapter describes the structures used to manage physical memory in a running
  8. system.
  9. The first principal concept prevalent in the memory management is
  10. `Non-Uniform Memory Access (NUMA)
  11. <https://en.wikipedia.org/wiki/Non-uniform_memory_access>`_.
  12. With multi-core and multi-socket machines, memory may be arranged into banks
  13. that incur a different cost to access depending on the “distance” from the
  14. processor. For example, there might be a bank of memory assigned to each CPU or
  15. a bank of memory very suitable for DMA near peripheral devices.
  16. Each bank is called a node and the concept is represented under Linux by a
  17. ``struct pglist_data`` even if the architecture is UMA. This structure is
  18. always referenced by its typedef ``pg_data_t``. A ``pg_data_t`` structure
  19. for a particular node can be referenced by ``NODE_DATA(nid)`` macro where
  20. ``nid`` is the ID of that node.
For NUMA architectures, the node structures are allocated by the architecture
specific code early during boot. Usually, these structures are allocated
locally on the memory bank they represent. For UMA architectures, only one
static ``pg_data_t`` structure called ``contig_page_data`` is used. Nodes will
be discussed further in Section :ref:`Nodes <nodes>`.
  26. The entire physical address space is partitioned into one or more blocks
  27. called zones which represent ranges within memory. These ranges are usually
  28. determined by architectural constraints for accessing the physical memory.
  29. The memory range within a node that corresponds to a particular zone is
  30. described by a ``struct zone``. Each zone has
  31. one of the types described below.
* ``ZONE_DMA`` and ``ZONE_DMA32`` historically represented memory suitable for
  DMA by peripheral devices that cannot access all of the addressable
  memory. For many years there have been better and more robust interfaces to
  get memory with DMA specific requirements
  (Documentation/core-api/dma-api.rst), but ``ZONE_DMA`` and ``ZONE_DMA32``
  still represent memory ranges that have restrictions on how they can be
  accessed.
  Depending on the architecture, either or both of these zone types can be
  disabled at build time using the ``CONFIG_ZONE_DMA`` and
  ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
  both zones as they support peripherals with different DMA addressing
  limitations.
  43. * ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
  44. the time. DMA operations can be performed on pages in this zone if the DMA
  45. devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
  46. always enabled.
  47. * ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
  48. permanent mapping in the kernel page tables. The memory in this zone is only
  49. accessible to the kernel using temporary mappings. This zone is available
  50. only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.
* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
  The difference is that the contents of most pages in ``ZONE_MOVABLE`` are
  movable. That means that while virtual addresses of these pages do not
  change, their content may move between different physical pages. Often
  ``ZONE_MOVABLE`` is populated during memory hotplug, but it may
  also be populated on boot using one of the ``kernelcore``, ``movablecore``
  and ``movable_node`` kernel command line parameters. See
  Documentation/mm/page_migration.rst and
  Documentation/admin-guide/mm/memory-hotplug.rst for additional details.
  60. * ``ZONE_DEVICE`` represents memory residing on devices such as PMEM and GPU.
  61. It has different characteristics than RAM zone types and it exists to provide
  62. :ref:`struct page <Pages>` and memory map services for device driver
  63. identified physical address ranges. ``ZONE_DEVICE`` is enabled with
  64. configuration option ``CONFIG_ZONE_DEVICE``.
  65. It is important to note that many kernel operations can only take place using
  66. ``ZONE_NORMAL`` so it is the most performance critical zone. Zones are
  67. discussed further in Section :ref:`Zones <zones>`.
  68. The relation between node and zone extents is determined by the physical memory
  69. map reported by the firmware, architectural constraints for memory addressing
  70. and certain parameters in the kernel command line.
  71. For example, with 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM the
  72. entire memory will be on node 0 and there will be three zones: ``ZONE_DMA``,
  73. ``ZONE_NORMAL`` and ``ZONE_HIGHMEM``::

  0                                                            2G
  +-------------------------------------------------------------+
  |                            node 0                           |
  +-------------------------------------------------------------+

  0         16M                    896M                        2G
  +----------+-----------------------+--------------------------+
  | ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
  +----------+-----------------------+--------------------------+

  82. With a kernel built with ``ZONE_DMA`` disabled and ``ZONE_DMA32`` enabled and
  83. booted with ``movablecore=80%`` parameter on an arm64 machine with 16 Gbytes of
  84. RAM equally split between two nodes, there will be ``ZONE_DMA32``,
  85. ``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 0, and ``ZONE_NORMAL`` and
  86. ``ZONE_MOVABLE`` on node 1::

  1G                                9G                        17G
  +--------------------------------+ +--------------------------+
  |             node 0             | |          node 1          |
  +--------------------------------+ +--------------------------+

  1G       4G        4200M          9G          9320M         17G
  +---------+----------+-----------+ +------------+-------------+
  |  DMA32  |  NORMAL  |  MOVABLE  | |   NORMAL   |   MOVABLE   |
  +---------+----------+-----------+ +------------+-------------+

Memory banks may belong to interleaving nodes. In the example below an x86
machine has 16 Gbytes of RAM in 4 memory banks; even banks belong to node 0
and odd banks belong to node 1::

  0             4G              8G              12G           16G
  +-------------+ +-------------+ +-------------+ +-------------+
  |   node 0    | |   node 1    | |   node 0    | |   node 1    |
  +-------------+ +-------------+ +-------------+ +-------------+

  0    16M      4G
  +-----+-------+ +-------------+ +-------------+ +-------------+
  | DMA | DMA32 | |   NORMAL    | |   NORMAL    | |   NORMAL    |
  +-----+-------+ +-------------+ +-------------+ +-------------+

  106. In this case node 0 will span from 0 to 12 Gbytes and node 1 will span from
  107. 4 to 16 Gbytes.
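
The span arithmetic in the interleaved example can be checked with a small userspace sketch. It is not kernel code: ``struct bank``, ``struct span`` and ``node_span()`` are hypothetical names introduced here, and sizes are kept in Gbytes for readability. The point it illustrates is that a node's span runs from its lowest bank to its highest bank, so it also covers ranges that belong to other nodes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one memory bank; units are Gbytes. */
struct bank {
	unsigned long start;	/* first Gbyte covered by the bank */
	unsigned long end;	/* one past the last Gbyte covered */
	int nid;		/* node the bank belongs to */
};

struct span {
	unsigned long start;
	unsigned long end;
};

/*
 * Compute the span of node @nid: from the lowest start to the highest end
 * of its banks. Ranges owned by other nodes in between are part of the span.
 * Returns {0, 0} if the node has no banks.
 */
static struct span node_span(const struct bank *banks, size_t n, int nid)
{
	struct span s = { 0, 0 };
	int found = 0;

	assert(banks != NULL);
	for (size_t i = 0; i < n; i++) {
		if (banks[i].nid != nid)
			continue;
		if (!found || banks[i].start < s.start)
			s.start = banks[i].start;
		if (!found || banks[i].end > s.end)
			s.end = banks[i].end;
		found = 1;
	}
	return s;
}

/* The four interleaved banks from the example above. */
static const struct bank example_banks[] = {
	{ 0, 4, 0 }, { 4, 8, 1 }, { 8, 12, 0 }, { 12, 16, 1 },
};
```

Calling ``node_span(example_banks, 4, 0)`` yields the 0 to 12 Gbytes span of node 0, and passing node ID 1 yields 4 to 16 Gbytes.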
  108. .. _nodes:
  109. Nodes
  110. =====
  111. As we have mentioned, each node in memory is described by a ``pg_data_t`` which
  112. is a typedef for a ``struct pglist_data``. When allocating a page, by default
  113. Linux uses a node-local allocation policy to allocate memory from the node
  114. closest to the running CPU. As processes tend to run on the same CPU, it is
  115. likely the memory from the current node will be used. The allocation policy can
  116. be controlled by users as described in
  117. Documentation/admin-guide/mm/numa_memory_policy.rst.
  118. Most NUMA architectures maintain an array of pointers to the node
  119. structures. The actual structures are allocated early during boot when
  120. architecture specific code parses the physical memory map reported by the
  121. firmware. The bulk of the node initialization happens slightly later in the
  122. boot process by free_area_init() function, described later in Section
  123. :ref:`Initialization <initialization>`.
  124. Along with the node structures, kernel maintains an array of ``nodemask_t``
  125. bitmasks called ``node_states``. Each bitmask in this array represents a set of
  126. nodes with particular properties as defined by ``enum node_states``:
``N_POSSIBLE``
  The node could become online at some point.
``N_ONLINE``
  The node is online.
``N_NORMAL_MEMORY``
  The node has regular memory.
``N_HIGH_MEMORY``
  The node has regular or high memory. When ``CONFIG_HIGHMEM`` is disabled,
  it is aliased to ``N_NORMAL_MEMORY``.
``N_MEMORY``
  The node has memory (regular, high, movable).
``N_CPU``
  The node has one or more CPUs.
``N_GENERIC_INITIATOR``
  The node has one or more Generic Initiators.
  142. For each node that has a property described above, the bit corresponding to the
  143. node ID in the ``node_states[<property>]`` bitmask is set.
  144. For example, for node 2 with normal memory and CPUs, bit 2 will be set in ::

  node_states[N_POSSIBLE]
  node_states[N_ONLINE]
  node_states[N_NORMAL_MEMORY]
  node_states[N_HIGH_MEMORY]
  node_states[N_MEMORY]
  node_states[N_CPU]

  151. For various operations possible with nodemasks please refer to
  152. ``include/linux/nodemask.h``.
  153. Among other things, nodemasks are used to provide macros for node traversal,
  154. namely ``for_each_node()`` and ``for_each_online_node()``.
  155. For instance, to call a function foo() for each online node::

  for_each_online_node(nid) {
      pg_data_t *pgdat = NODE_DATA(nid);

      foo(pgdat);
  }

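
To make the bitmask idea concrete, here is a minimal userspace model of the ``node_states`` array. It is a sketch under simplifying assumptions, not the kernel implementation: the real ``nodemask_t`` is a bitmap sized by the configured maximum number of nodes, whereas this model assumes at most 64 nodes and uses one 64-bit word per state. The names ``states``, ``state_set()``, ``state_test()`` and ``state_count()`` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 64	/* assumption: all nodes fit in one 64-bit word */

/* Subset of the node states listed above; names are local to this sketch. */
enum state { S_POSSIBLE, S_ONLINE, S_NORMAL_MEMORY, S_MEMORY, S_CPU, NR_STATES };

/* One bitmask per state; bit N describes node N. */
static unsigned long long states[NR_STATES];

/* Mark node @nid as having property @s. */
static void state_set(int nid, enum state s)
{
	assert(nid >= 0 && nid < MAX_NODES);
	states[s] |= 1ULL << nid;
}

/* Test whether node @nid has property @s. */
static bool state_test(int nid, enum state s)
{
	assert(nid >= 0 && nid < MAX_NODES);
	return (states[s] >> nid) & 1ULL;
}

/* Count the nodes with property @s by testing every bit, lowest first. */
static int state_count(enum state s)
{
	int count = 0;

	for (int nid = 0; nid < MAX_NODES; nid++)
		if (state_test(nid, s))
			count++;
	return count;
}
```

Traversal macros such as ``for_each_online_node()`` follow the same pattern as the loop in ``state_count()``: they visit the node IDs whose bit is set in the relevant mask.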
  160. Node structure
  161. --------------
The node structure ``struct pglist_data`` is declared in
``include/linux/mmzone.h``. Here we briefly describe the fields of this
structure:
  165. General
  166. ~~~~~~~
``node_zones``
  The zones for this node. Not all of the zones may be populated, but it is
  the full list. It is referenced by this node's ``node_zonelists`` as well as
  other nodes' ``node_zonelists``.
  171. ``node_zonelists``
  172. The list of all zones in all nodes. This list defines the order of zones
  173. that allocations are preferred from. The ``node_zonelists`` is set up by
  174. ``build_zonelists()`` in ``mm/page_alloc.c`` during the initialization of
  175. core memory management structures.
  176. ``nr_zones``
  177. Number of populated zones in this node.
``node_mem_map``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_mem_map`` is an array of struct pages representing each physical
  frame.
``node_page_ext``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_page_ext`` is an array of extensions of struct pages. Available only
  in kernels built with ``CONFIG_PAGE_EXTENSION`` enabled.
  185. ``node_start_pfn``
  186. The page frame number of the starting page frame in this node.
  187. ``node_present_pages``
  188. Total number of physical pages present in this node.
  189. ``node_spanned_pages``
  190. Total size of physical page range, including holes.
``node_size_lock``
  A lock that protects the fields defining the node extents. Only defined when
  at least one of the ``CONFIG_MEMORY_HOTPLUG`` or
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` configuration options is enabled.
  ``pgdat_resize_lock()`` and ``pgdat_resize_unlock()`` are provided to
  manipulate ``node_size_lock`` without checking for ``CONFIG_MEMORY_HOTPLUG``
  or ``CONFIG_DEFERRED_STRUCT_PAGE_INIT``.
  198. ``node_id``
  199. The Node ID (NID) of the node, starts at 0.
  200. ``totalreserve_pages``
  201. This is a per-node reserve of pages that are not available to userspace
  202. allocations.
``first_deferred_pfn``
  If memory initialization on large machines is deferred then this is the first
  PFN that needs to be initialized. Defined only when
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` is enabled.
``deferred_split_queue``
  Per-node queue of huge pages whose split was deferred. Defined only when
  ``CONFIG_TRANSPARENT_HUGEPAGE`` is enabled.
  209. ``__lruvec``
  210. Per-node lruvec holding LRU lists and related parameters. Used only when
  211. memory cgroups are disabled. It should not be accessed directly, use
  212. ``mem_cgroup_lruvec()`` to look up lruvecs instead.
  213. Reclaim control
  214. ~~~~~~~~~~~~~~~
  215. See also Documentation/mm/page_reclaim.rst.
  216. ``kswapd``
  217. Per-node instance of kswapd kernel thread.
``kswapd_wait``, ``pfmemalloc_wait``, ``reclaim_wait``
  Wait queues used to synchronize memory reclaim tasks.
  220. ``nr_writeback_throttled``
  221. Number of tasks that are throttled waiting on dirty pages to clean.
  222. ``nr_reclaim_start``
  223. Number of pages written while reclaim is throttled waiting for writeback.
``kswapd_order``
  Controls the order kswapd tries to reclaim.
``kswapd_highest_zoneidx``
  The highest zone index to be reclaimed by kswapd.
``kswapd_failures``
  Number of runs kswapd was unable to reclaim any pages.
  230. ``min_unmapped_pages``
  231. Minimal number of unmapped file backed pages that cannot be reclaimed.
  232. Determined by ``vm.min_unmapped_ratio`` sysctl. Only defined when
  233. ``CONFIG_NUMA`` is enabled.
``min_slab_pages``
  Minimal number of SLAB pages that cannot be reclaimed. Determined by
  ``vm.min_slab_ratio`` sysctl. Only defined when ``CONFIG_NUMA`` is enabled.
  237. ``flags``
  238. Flags controlling reclaim behavior.
  239. Compaction control
  240. ~~~~~~~~~~~~~~~~~~
  241. ``kcompactd_max_order``
  242. Page order that kcompactd should try to achieve.
  243. ``kcompactd_highest_zoneidx``
  244. The highest zone index to be compacted by kcompactd.
``kcompactd_wait``
  Wait queue used to synchronize memory compaction tasks.
  247. ``kcompactd``
  248. Per-node instance of kcompactd kernel thread.
  249. ``proactive_compact_trigger``
  250. Determines if proactive compaction is enabled. Controlled by
  251. ``vm.compaction_proactiveness`` sysctl.
  252. Statistics
  253. ~~~~~~~~~~
``per_cpu_nodestats``
  Per-CPU VM statistics for the node.
  256. ``vm_stat``
  257. VM statistics for the node.
  258. .. _zones:
  259. Zones
  260. =====
  261. As we have mentioned, each zone in memory is described by a ``struct zone``
  262. which is an element of the ``node_zones`` array of the node it belongs to.
  263. ``struct zone`` is the core data structure of the page allocator. A zone
  264. represents a range of physical memory and may have holes.
The page allocator uses the GFP flags (see :ref:`mm-api-gfp-flags`) specified by
a memory allocation to determine the highest zone in a node from which the
allocation can be satisfied. The page allocator first tries that zone. If it
cannot allocate the requested amount of memory there, it falls back to the
next lower zone in the node, and the process continues down to and including
the lowest zone. For example, if a node contains ``ZONE_DMA32``,
``ZONE_NORMAL`` and ``ZONE_MOVABLE`` and the highest zone of a memory
allocation is ``ZONE_MOVABLE``, the order of the zones from which the page
allocator allocates memory is ``ZONE_MOVABLE`` > ``ZONE_NORMAL`` >
``ZONE_DMA32``.
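
The fallback order described above can be sketched in a few lines of userspace C. This is an illustration only: the kernel walks prebuilt zonelists rather than computing the order on each allocation, and the ``enum zone_idx`` values and ``fallback_order()`` helper below are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Zone indices in ascending order; a local stand-in for the kernel's enum. */
enum zone_idx { Z_DMA, Z_DMA32, Z_NORMAL, Z_MOVABLE, NR_ZONES };

/*
 * Fill @order with the populated zones an allocation may try, starting at
 * @highest and walking down to the lowest zone. @populated is a bitmask of
 * the zones present in the node. Returns the number of entries written.
 */
static size_t fallback_order(unsigned int populated, enum zone_idx highest,
			     enum zone_idx order[NR_ZONES])
{
	size_t n = 0;

	assert(highest < NR_ZONES);
	for (int z = (int)highest; z >= 0; z--)
		if (populated & (1u << z))
			order[n++] = (enum zone_idx)z;
	return n;
}
```

For a node with ``ZONE_DMA32``, ``ZONE_NORMAL`` and ``ZONE_MOVABLE`` and a highest zone of ``Z_MOVABLE``, the helper writes the three zones in the order given in the example above. Lowering the highest zone to ``Z_NORMAL`` drops ``Z_MOVABLE`` from the result.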
  275. At runtime, free pages in a zone are in the Per-CPU Pagesets (PCP) or free areas
  276. of the zone. The Per-CPU Pagesets are a vital mechanism in the kernel's memory
  277. management system. By handling most frequent allocations and frees locally on
  278. each CPU, the Per-CPU Pagesets improve performance and scalability, especially
  279. on systems with many cores. The page allocator in the kernel employs a two-step
  280. strategy for memory allocation, starting with the Per-CPU Pagesets before
  281. falling back to the buddy allocator. Pages are transferred between the Per-CPU
  282. Pagesets and the global free areas (managed by the buddy allocator) in batches.
  283. This minimizes the overhead of frequent interactions with the global buddy
  284. allocator.
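
The benefit of moving pages in batches can be seen in a toy model. The sketch below is a single-CPU userspace simulation with hypothetical names (``struct pcp_model``, ``pcp_alloc()``, ``pcp_free()``) and arbitrary sizes. It ignores locking, page orders and migrate types, and only counts how often the global free areas are touched.

```c
#include <assert.h>

/* Toy model of one CPU's pageset; all sizes are illustrative assumptions. */
struct pcp_model {
	long global_free;	/* pages in the zone free areas (buddy allocator) */
	long pcp_count;		/* pages cached on this CPU's list */
	long batch;		/* pages moved per refill or drain */
	long high;		/* drain once the CPU list grows beyond this */
	long global_trips;	/* how often the global free areas were touched */
};

/* Allocate one page: use the CPU list, refilling it in a batch when empty. */
static int pcp_alloc(struct pcp_model *m)
{
	assert(m->batch > 0);
	if (m->pcp_count == 0) {
		long take = m->batch < m->global_free ? m->batch : m->global_free;

		if (take == 0)
			return -1;	/* nothing left anywhere */
		m->global_free -= take;
		m->pcp_count += take;
		m->global_trips++;
	}
	m->pcp_count--;
	return 0;
}

/* Free one page: cache it locally and drain a batch once above @high. */
static void pcp_free(struct pcp_model *m)
{
	assert(m->batch > 0);
	m->pcp_count++;
	if (m->pcp_count > m->high) {
		long give = m->batch < m->pcp_count ? m->batch : m->pcp_count;

		m->pcp_count -= give;
		m->global_free += give;
		m->global_trips++;
	}
}
```

With a batch of 8, eight consecutive allocations touch the global free areas once instead of eight times, which is the effect the paragraph above describes.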
Architecture specific code calls free_area_init() to initialize zones.
  286. Zone structure
  287. --------------
The zone structure ``struct zone`` is defined in ``include/linux/mmzone.h``.
Here we briefly describe the fields of this structure:
  290. General
  291. ~~~~~~~
``_watermark``
  The watermarks for this zone. When the amount of free pages in a zone is below
  the min watermark, boosting is ignored and an allocation may trigger direct
  reclaim and direct compaction. The min watermark is also used to throttle
  direct reclaim. When the amount of free pages in a zone is below the low
  watermark, kswapd is woken up. When the amount of free pages in a zone is
  above the high watermark, kswapd stops reclaiming (the zone is balanced),
  provided the ``NUMA_BALANCING_MEMORY_TIERING`` bit of
  ``sysctl_numa_balancing_mode`` is not set. The promo watermark is used for
  memory tiering and NUMA balancing. When that bit is set, kswapd instead stops
  reclaiming once the amount of free pages in a zone is above the promo
  watermark. The watermarks are set by ``__setup_per_zone_wmarks()``. The min
  watermark is calculated according to the ``vm.min_free_kbytes`` sysctl. The
  other three watermarks are each placed a fixed distance above the previous
  one; the distance is calculated taking the ``vm.watermark_scale_factor``
  sysctl into account.
``watermark_boost``
  The number of pages by which the watermarks are boosted. Boosting increases
  reclaim pressure to reduce the likelihood of future fallbacks, and wakes
  kswapd immediately, because the node may be balanced overall and kswapd
  would not wake naturally.
  312. ``nr_reserved_highatomic``
  313. The number of pages which are reserved for high-order atomic allocations.
``nr_free_highatomic``
  The number of free pages in reserved highatomic pageblocks.
  316. ``lowmem_reserve``
  317. The array of the amounts of the memory reserved in this zone for memory
  318. allocations. For example, if the highest zone a memory allocation can
  319. allocate memory from is ``ZONE_MOVABLE``, the amount of memory reserved in
  320. this zone for this allocation is ``lowmem_reserve[ZONE_MOVABLE]`` when
  321. attempting to allocate memory from this zone. This is a mechanism the page
  322. allocator uses to prevent allocations which could use ``highmem`` from using
  323. too much ``lowmem``. For some specialised workloads on ``highmem`` machines,
  324. it is dangerous for the kernel to allow process memory to be allocated from
  325. the ``lowmem`` zone. This is because that memory could then be pinned via the
  326. ``mlock()`` system call, or by unavailability of swapspace.
  327. ``vm.lowmem_reserve_ratio`` sysctl determines how aggressive the kernel is in
  328. defending these lower zones. This array is recalculated by
  329. ``setup_per_zone_lowmem_reserve()`` at runtime if ``vm.lowmem_reserve_ratio``
  330. sysctl changes.
``node``
  The index of the node this zone belongs to. Available only when
  ``CONFIG_NUMA`` is enabled because there is only one node in a UMA system.
  334. ``zone_pgdat``
  335. Pointer to the ``struct pglist_data`` of the node this zone belongs to.
  336. ``per_cpu_pageset``
  337. Pointer to the Per-CPU Pagesets (PCP) allocated and initialized by
  338. ``setup_zone_pageset()``. By handling most frequent allocations and frees
  339. locally on each CPU, PCP improves performance and scalability on systems with
  340. many cores.
  341. ``pageset_high_min``
  342. Copied to the ``high_min`` of the Per-CPU Pagesets for faster access.
  343. ``pageset_high_max``
  344. Copied to the ``high_max`` of the Per-CPU Pagesets for faster access.
  345. ``pageset_batch``
  346. Copied to the ``batch`` of the Per-CPU Pagesets for faster access. The
  347. ``batch``, ``high_min`` and ``high_max`` of the Per-CPU Pagesets are used to
  348. calculate the number of elements the Per-CPU Pagesets obtain from the buddy
  349. allocator under a single hold of the lock for efficiency. They are also used
  350. to decide if the Per-CPU Pagesets return pages to the buddy allocator in page
  351. free process.
``pageblock_flags``
  The pointer to the flags for the pageblocks in the zone (see
  ``include/linux/pageblock-flags.h`` for the list of flags). The memory is
  allocated in ``setup_usemap()``. Each pageblock occupies
  ``NR_PAGEBLOCK_BITS`` bits. Defined only when ``CONFIG_FLATMEM`` is enabled.
  The flags are stored in ``mem_section`` when ``CONFIG_SPARSEMEM`` is enabled.
  358. ``zone_start_pfn``
  359. The start pfn of the zone. It is initialized by
  360. ``calculate_node_totalpages()``.
``managed_pages``
  The present pages managed by the buddy system, which is calculated as:
  ``managed_pages`` = ``present_pages`` - ``reserved_pages``, where
  ``reserved_pages`` includes pages allocated by the memblock allocator. It
  should be used by the page allocator and the vm scanner to calculate all
  kinds of watermarks and thresholds. It is accessed using
  ``atomic_long_xxx()`` functions. It is initialized in
  ``free_area_init_core()`` and then reinitialized when the memblock allocator
  frees pages into the buddy system.
  369. ``spanned_pages``
  370. The total pages spanned by the zone, including holes, which is calculated as:
  371. ``spanned_pages`` = ``zone_end_pfn`` - ``zone_start_pfn``. It is initialized
  372. by ``calculate_node_totalpages()``.
``present_pages``
  The physical pages existing within the zone, which is calculated as:
  ``present_pages`` = ``spanned_pages`` - ``absent_pages`` (pages in holes). It
  may be used by memory hotplug or memory power management logic to figure out
  unmanaged pages by checking (``present_pages`` - ``managed_pages``). Write
  access to ``present_pages`` at runtime should be protected by
  ``mem_hotplug_begin/done()``. Any reader that cannot tolerate drift of
  ``present_pages`` should use ``get_online_mems()`` to get a stable value. It
  is initialized by ``calculate_node_totalpages()``.
  382. ``present_early_pages``
  383. The present pages existing within the zone located on memory available since
  384. early boot, excluding hotplugged memory. Defined only when
  385. ``CONFIG_MEMORY_HOTPLUG`` is enabled and initialized by
  386. ``calculate_node_totalpages()``.
  387. ``cma_pages``
  388. The pages reserved for CMA use. These pages behave like ``ZONE_MOVABLE`` when
  389. they are not used for CMA. Defined only when ``CONFIG_CMA`` is enabled.
  390. ``name``
  391. The name of the zone. It is a pointer to the corresponding element of
  392. the ``zone_names`` array.
``nr_isolate_pageblock``
  Number of isolated pageblocks. It is used to solve the incorrect free page
  counting problem caused by racy retrieval of the migratetype of a pageblock.
  Protected by ``zone->lock``. Defined only when ``CONFIG_MEMORY_ISOLATION`` is
  enabled.
  397. ``span_seqlock``
  398. The seqlock to protect ``zone_start_pfn`` and ``spanned_pages``. It is a
  399. seqlock because it has to be read outside of ``zone->lock``, and it is done in
  400. the main allocator path. However, the seqlock is written quite infrequently.
  401. Defined only when ``CONFIG_MEMORY_HOTPLUG`` is enabled.
  402. ``initialized``
  403. The flag indicating if the zone is initialized. Set by
  404. ``init_currently_empty_zone()`` during boot.
``free_area``
  The array of free areas, where each element corresponds to a specific order
  which is a power of two. The buddy allocator uses this structure to manage
  free memory efficiently. When allocating, it tries to find the smallest
  sufficient block. If that block is larger than the requested size, it is
  recursively split into the next smaller blocks until the required size is
  reached. When a page is freed, it may be merged with its buddy to form a
  larger block. It is initialized by ``zone_init_free_lists()``.
  414. ``unaccepted_pages``
  415. The list of pages to be accepted. All pages on the list are ``MAX_PAGE_ORDER``.
  416. Defined only when ``CONFIG_UNACCEPTED_MEMORY`` is enabled.
``flags``
  The zone flags. The three least significant bits are used and defined by
  ``enum zone_flags``. ``ZONE_BOOSTED_WATERMARK`` (bit 0): the zone recently
  boosted watermarks. Cleared when kswapd is woken. ``ZONE_RECLAIM_ACTIVE``
  (bit 1): kswapd may be scanning the zone. ``ZONE_BELOW_HIGH`` (bit 2): the
  zone is below the high watermark.
  423. ``lock``
  424. The main lock that protects the internal data structures of the page allocator
  425. specific to the zone, especially protects ``free_area``.
  426. ``percpu_drift_mark``
  427. When free pages are below this point, additional steps are taken when reading
  428. the number of free pages to avoid per-cpu counter drift allowing watermarks
  429. to be breached. It is updated in ``refresh_zone_stat_thresholds()``.
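
The relationship between the four watermarks described under ``_watermark`` can be sketched as follows. This is a userspace approximation, not the body of ``__setup_per_zone_wmarks()``. It assumes that the distance between neighbouring watermarks is the larger of a quarter of the min watermark and ``managed_pages * scale_factor / 10000``; the exact calculation may differ between kernel versions. The helper names are hypothetical, and boosting and ``lowmem_reserve`` are left out.

```c
#include <assert.h>
#include <stddef.h>

enum wmark { WMARK_MIN, WMARK_LOW, WMARK_HIGH, WMARK_PROMO, NR_WMARK };

/*
 * Derive low, high and promo from min with one fixed distance between
 * neighbours. Assumption: the distance is the larger of min / 4 and
 * managed_pages * scale_factor / 10000 (split up to avoid overflow).
 */
static void setup_wmarks(unsigned long wmark[NR_WMARK], unsigned long min_pages,
			 unsigned long managed_pages, unsigned long scale_factor)
{
	unsigned long by_min = min_pages / 4;
	unsigned long by_scale = managed_pages / 10000 * scale_factor +
				 managed_pages % 10000 * scale_factor / 10000;
	unsigned long dist = by_min > by_scale ? by_min : by_scale;

	assert(wmark != NULL);
	wmark[WMARK_MIN] = min_pages;
	wmark[WMARK_LOW] = min_pages + dist;
	wmark[WMARK_HIGH] = min_pages + 2 * dist;
	wmark[WMARK_PROMO] = min_pages + 3 * dist;
}

/* kswapd is woken once free pages drop below the low watermark. */
static int should_wake_kswapd(const unsigned long wmark[NR_WMARK],
			      unsigned long free_pages)
{
	return free_pages < wmark[WMARK_LOW];
}

/*
 * kswapd stops above the high watermark, or above the promo watermark when
 * @tiering models the NUMA_BALANCING_MEMORY_TIERING bit being set.
 */
static int kswapd_may_stop(const unsigned long wmark[NR_WMARK],
			   unsigned long free_pages, int tiering)
{
	return free_pages > wmark[tiering ? WMARK_PROMO : WMARK_HIGH];
}
```

Under these assumptions, a zone with 1000000 managed pages, a min watermark of 1000 pages and a scale factor of 10 gets a distance of 1000 pages between watermarks.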
  430. Compaction control
  431. ~~~~~~~~~~~~~~~~~~
  432. ``compact_cached_free_pfn``
  433. The PFN where compaction free scanner should start in the next scan.
  434. ``compact_cached_migrate_pfn``
  435. The PFNs where compaction migration scanner should start in the next scan.
  436. This array has two elements: the first one is used in ``MIGRATE_ASYNC`` mode,
  437. and the other one is used in ``MIGRATE_SYNC`` mode.
  438. ``compact_init_migrate_pfn``
  439. The initial migration PFN which is initialized to 0 at boot time, and to the
  440. first pageblock with migratable pages in the zone after a full compaction
  441. finishes. It is used to check if a scan is a whole zone scan or not.
  442. ``compact_init_free_pfn``
  443. The initial free PFN which is initialized to 0 at boot time and to the last
  444. pageblock with free ``MIGRATE_MOVABLE`` pages in the zone. It is used to check
  445. if it is the start of a scan.
``compact_considered``
  The number of compactions attempted since last failure. It is reset in
  ``defer_compaction()`` when a compaction fails to result in a page allocation
  success. It is increased by 1 in ``compaction_deferred()`` when a compaction
  should be skipped. ``compaction_deferred()`` is called before
  ``compact_zone()`` is called. ``compaction_defer_reset()`` is called when
  ``compact_zone()`` returns ``COMPACT_SUCCESS``, and ``defer_compaction()`` is
  called when ``compact_zone()`` returns ``COMPACT_PARTIAL_SKIPPED`` or
  ``COMPACT_COMPLETE``.
  455. ``compact_defer_shift``
  456. The number of compactions skipped before trying again is
  457. ``1<<compact_defer_shift``. It is increased by 1 in ``defer_compaction()``.
  458. It is reset in ``compaction_defer_reset()`` when a direct compaction results
  459. in a page allocation success. Its maximum value is ``COMPACT_MAX_DEFER_SHIFT``.
  460. ``compact_order_failed``
  461. The minimum compaction failed order. It is set in ``compaction_defer_reset()``
  462. when a compaction succeeds and in ``defer_compaction()`` when a compaction
  463. fails to result in a page allocation success.
  464. ``compact_blockskip_flush``
  465. Set to true when compaction migration scanner and free scanner meet, which
  466. means the ``PB_compact_skip`` bits should be cleared.
  467. ``contiguous``
  468. Set to true when the zone is contiguous (in other words, no hole).
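
The interplay of ``compact_considered``, ``compact_defer_shift`` and ``compact_order_failed`` amounts to an exponential back-off. The following userspace model is a sketch loosely based on the behaviour described above, not a copy of the kernel functions. ``struct zone_model`` and the ``model_*()`` helpers are hypothetical names, and the value 6 used for the maximum shift is an assumption standing in for ``COMPACT_MAX_DEFER_SHIFT``.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEFER_SHIFT 6	/* assumption: stand-in for COMPACT_MAX_DEFER_SHIFT */

struct zone_model {
	unsigned int considered;	/* models compact_considered */
	unsigned int defer_shift;	/* models compact_defer_shift */
	int order_failed;		/* models compact_order_failed */
};

/* Compaction failed: restart the count and back off exponentially. */
static void model_defer(struct zone_model *z, int order)
{
	assert(order >= 0);
	z->considered = 0;
	if (z->defer_shift < MAX_DEFER_SHIFT)
		z->defer_shift++;
	if (order < z->order_failed)
		z->order_failed = order;
}

/* Returns true when this compaction attempt should be skipped. */
static bool model_deferred(struct zone_model *z, int order)
{
	unsigned int limit = 1u << z->defer_shift;

	if (order < z->order_failed)
		return false;
	if (++z->considered >= limit) {
		z->considered = limit;	/* clamp so the counter cannot overflow */
		return false;
	}
	return true;
}

/* Compaction worked: forget the back-off if the allocation succeeded. */
static void model_defer_reset(struct zone_model *z, int order, bool alloc_success)
{
	if (alloc_success) {
		z->considered = 0;
		z->defer_shift = 0;
	}
	if (order >= z->order_failed)
		z->order_failed = order + 1;
}
```

Each failure increases the shift and therefore lengthens the run of skipped attempts, up to the cap. A successful allocation clears the back-off state.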
  469. Statistics
  470. ~~~~~~~~~~
  471. ``vm_stat``
  472. VM statistics for the zone. The items tracked are defined by
  473. ``enum zone_stat_item``.
  474. ``vm_numa_event``
  475. VM NUMA event statistics for the zone. The items tracked are defined by
  476. ``enum numa_stat_item``.
  477. ``per_cpu_zonestats``
  478. Per-CPU VM statistics for the zone. It records VM statistics and VM NUMA event
  479. statistics on a per-CPU basis. It reduces updates to the global ``vm_stat``
  480. and ``vm_numa_event`` fields of the zone to improve performance.
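
How per-CPU statistics reduce updates to the shared counters can be illustrated with a small model. The sketch below is a userspace simplification with hypothetical names and an arbitrary threshold. Each CPU accumulates a private delta and folds it into the shared counter only when the delta exceeds the threshold. The pending deltas are also the source of the counter drift that ``percpu_drift_mark`` guards against.

```c
#include <assert.h>

#define NR_CPUS_MODEL 4		/* assumption: small fixed CPU count */
#define STAT_THRESHOLD 32	/* assumption: illustrative fold threshold */

struct stat_model {
	long global;			/* shared counter, like the zone vm_stat */
	int diff[NR_CPUS_MODEL];	/* per-CPU deltas not yet folded in */
	long global_updates;		/* how often the shared counter was written */
};

/* Add @delta on @cpu; touch the shared counter only past the threshold. */
static void stat_mod(struct stat_model *s, int cpu, int delta)
{
	int x;

	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	x = s->diff[cpu] + delta;
	if (x > STAT_THRESHOLD || x < -STAT_THRESHOLD) {
		s->global += x;
		s->global_updates++;
		x = 0;
	}
	s->diff[cpu] = x;
}

/* Exact value: the shared counter plus every pending per-CPU delta. */
static long stat_read_exact(const struct stat_model *s)
{
	long sum = s->global;

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		sum += s->diff[cpu];
	return sum;
}
```

In this model, 100 single-page increments spread over two CPUs write the shared counter only twice. Reading ``global`` alone gives a cheap but possibly stale value, while ``stat_read_exact()`` pays for a walk over all CPUs to remove the drift.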
  481. .. _pages:
  482. Pages
  483. =====
  484. .. admonition:: Stub
  485. This section is incomplete. Please list and describe the appropriate fields.
  486. .. _folios:
  487. Folios
  488. ======
  489. .. admonition:: Stub
  490. This section is incomplete. Please list and describe the appropriate fields.
  491. .. _initialization:
  492. Initialization
  493. ==============
  494. .. admonition:: Stub
  495. This section is incomplete. Please list and describe the appropriate fields.