.. SPDX-License-Identifier: GPL-2.0

=================
Process Addresses
=================

.. toctree::
   :maxdepth: 3


Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
'VMA's of type :c:struct:`!struct vm_area_struct`.

Each VMA describes a virtually contiguous memory range with identical
attributes, each described by a :c:struct:`!struct vm_area_struct`
object. Userland access outside of VMAs is invalid except in the case where an
adjacent stack VMA could be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.

Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

.. note:: An exception to this is the 'gate' VMA which is provided by
          architectures which use :c:struct:`!vsyscall` and is a global static
          object which does not belong to any specific mm.

-------
Locking
-------

The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata** so a complicated set of locks is required to ensure memory
corruption does not occur.

.. note:: Locking VMAs for their metadata does not have any impact on the memory
          they describe nor the page tables that map them.

Terminology
-----------

* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at a process address space granularity which can be acquired via
  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
  as a read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` or
  :c:func:`!vma_start_write_killable` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be
  stabilised via :c:func:`!anon_vma_[try]lock_read` or
  :c:func:`!anon_vma_[try]lock_write` for anonymous memory and
  :c:func:`!i_mmap_[try]lock_read` or :c:func:`!i_mmap_[try]lock_write` for
  file-backed memory. We refer to these locks as the reverse mapping locks, or
  'rmap locks' for brevity.

We discuss page table locks separately in the dedicated section below.

The first thing **any** of these locks achieves is to **stabilise** the VMA
within the MM tree. That is, guaranteeing that the VMA object will not be
deleted from under you nor modified (except for some specific fields
described below).

Stabilising a VMA also keeps the address space described by it around.

Lock usage
----------

If you want to **read** VMA metadata fields or just keep the VMA stable, you
must do one of the following:

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
  you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
  acquire the lock atomically so might fail, in which case fall-back logic is
  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`,
  *or*
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.

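The "try the VMA lock, fall back to the mmap lock" pattern can be sketched in
userspace. The following is only a model under stated assumptions: it uses
POSIX read/write locks in place of the kernel primitives, a plain array in place
of the maple tree, and every ``model_*`` name is hypothetical. It is not how
:c:func:`!lock_vma_under_rcu` is implemented.

.. code-block:: c

   #include <assert.h>
   #include <pthread.h>
   #include <stddef.h>

   /* Hypothetical userspace stand-ins for the VMA and mm objects. */
   struct model_vma {
       unsigned long start, end;   /* covers [start, end) */
       pthread_rwlock_t lock;      /* models the per-VMA lock */
   };

   struct model_mm {
       pthread_rwlock_t mmap_lock; /* models mm->mmap_lock */
       struct model_vma *vmas;     /* fixed array instead of a maple tree */
       size_t nr_vmas;
   };

   enum model_held { HELD_NONE, HELD_VMA_READ, HELD_MMAP_READ };

   static struct model_vma *model_find_vma(struct model_mm *mm,
                                           unsigned long addr)
   {
       for (size_t i = 0; i < mm->nr_vmas; i++)
           if (addr >= mm->vmas[i].start && addr < mm->vmas[i].end)
               return &mm->vmas[i];
       return NULL;
   }

   /*
    * Try the fine-grained read lock first. If it cannot be taken without
    * blocking, fall back to the address-space-wide read lock and look the
    * VMA up again under it.
    */
   static struct model_vma *model_lock_vma_read(struct model_mm *mm,
                                                unsigned long addr,
                                                enum model_held *held)
   {
       struct model_vma *vma = model_find_vma(mm, addr);

       assert(held != NULL);
       *held = HELD_NONE;
       if (!vma)
           return NULL;

       if (pthread_rwlock_tryrdlock(&vma->lock) == 0) {
           *held = HELD_VMA_READ;
           return vma;
       }

       /* Fast path failed: take the coarse lock instead. */
       pthread_rwlock_rdlock(&mm->mmap_lock);
       vma = model_find_vma(mm, addr);
       if (!vma) {
           pthread_rwlock_unlock(&mm->mmap_lock);
           return NULL;
       }
       *held = HELD_MMAP_READ;
       return vma;
   }

   /* Release whichever lock model_lock_vma_read() ended up taking. */
   static void model_unlock_vma_read(struct model_mm *mm,
                                     struct model_vma *vma,
                                     enum model_held held)
   {
       if (held == HELD_VMA_READ)
           pthread_rwlock_unlock(&vma->lock);
       else if (held == HELD_MMAP_READ)
           pthread_rwlock_unlock(&mm->mmap_lock);
   }

The caller has to remember which lock it ended up with, which is why the model
returns the lock kind alongside the VMA.
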
If you want to **write** VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must:

* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
  you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
  called.
* If you want to be able to write to **any** field, you must also hide the VMA
  from the reverse mapping by obtaining an **rmap write lock**.

VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to look up the VMA for you).

This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.

.. note:: The primary users of VMA read locks are page fault handlers, which
          means that without a VMA write lock, page faults will run concurrently
          with whatever you are doing.

Examining all valid lock states:

.. table::

   ========= ======== ========= ======= ===== =========== ==========
   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
   ========= ======== ========= ======= ===== =========== ==========
   \-        \-       \-        N       N     N           N
   \-        R        \-        Y       Y     N           N
   \-        \-       R/W       Y       Y     N           N
   R/W       \-/R     \-/R/W    Y       Y     N           N
   W         W        \-/R      Y       Y     Y           N
   W         W        W         Y       Y     Y           Y
   ========= ======== ========= ======= ===== =========== ==========

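The table of lock states can be restated as a small decision function. This is
only an illustration of the table, not kernel code: the ``lock_mode`` enum, the
``vma_access`` structure and ``vma_access_for()`` are hypothetical names.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>

   enum lock_mode { LOCK_NONE, LOCK_READ, LOCK_WRITE };

   struct vma_access {
       bool valid;      /* is this combination of locks permitted? */
       bool stable;     /* VMA cannot disappear from under us */
       bool read;       /* metadata may be read */
       bool write_most; /* most fields may be written */
       bool write_all;  /* every field may be written */
   };

   static struct vma_access vma_access_for(enum lock_mode mmap_lock,
                                           enum lock_mode vma_lock,
                                           enum lock_mode rmap_lock)
   {
       struct vma_access a = { .valid = true };

       /* A VMA write lock requires the mmap write lock to be held first. */
       if (vma_lock == LOCK_WRITE && mmap_lock != LOCK_WRITE) {
           a.valid = false;
           return a;
       }

       /* Holding any one of the locks stabilises the VMA for reading. */
       a.stable = mmap_lock != LOCK_NONE || vma_lock != LOCK_NONE ||
                  rmap_lock != LOCK_NONE;
       a.read = a.stable;

       /* Writing most fields needs both the mmap and VMA write locks. */
       a.write_most = mmap_lock == LOCK_WRITE && vma_lock == LOCK_WRITE;

       /* Writing every field additionally needs the rmap write lock. */
       a.write_all = a.write_most && rmap_lock == LOCK_WRITE;

       assert(!a.write_all || a.write_most);
       return a;
   }

For example, ``vma_access_for(LOCK_WRITE, LOCK_WRITE, LOCK_READ)`` reports that
most fields are writable but not all of them, matching the fifth row.
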
.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
             attempting to do the reverse is invalid as it can result in deadlock - if
             another task already holds an mmap write lock and attempts to acquire a VMA
             write lock, that attempt will deadlock on the VMA read lock.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.

.. note:: Generally speaking, a read/write semaphore is a class of lock which
          permits concurrent readers. However a write lock can only be obtained
          once all readers have left the critical region (and pending readers
          are made to wait).

          This renders read locks on a read/write semaphore concurrent with other
          readers and write locks exclusive against all others holding the semaphore.

VMA fields
^^^^^^^^^^

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
          are in effect an internal implementation detail.

.. table:: Virtual layout fields

   ===================== ======================================== ===========
   Field                 Description                              Write lock
   ===================== ======================================== ===========
   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
                         the original page offset within the      VMA write,
                         virtual address space (prior to any      rmap write.
                         :c:func:`!mremap`), or PFN if a PFN map
                         and the architecture does not support
                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
   ===================== ======================================== ===========

These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.

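As a rough illustration of how the layout fields combine, the sketch below
models a VMA as a half-open range and derives the page offset of an address
within it. The ``vma_range`` structure and helper names are hypothetical, and
4 KiB pages are assumed; the offset arithmetic mirrors the usual "page offset of
the VMA plus pages into the VMA" calculation for a file-backed mapping.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>

   #define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

   /* Hypothetical stand-in carrying only the virtual layout fields. */
   struct vma_range {
       unsigned long vm_start; /* inclusive */
       unsigned long vm_end;   /* exclusive */
       unsigned long vm_pgoff; /* page offset of vm_start into the file */
   };

   static bool vma_contains(const struct vma_range *vma, unsigned long addr)
   {
       return addr >= vma->vm_start && addr < vma->vm_end;
   }

   static unsigned long vma_nr_pages(const struct vma_range *vma)
   {
       return (vma->vm_end - vma->vm_start) >> MODEL_PAGE_SHIFT;
   }

   /* Page offset into the backing object for an address inside the VMA. */
   static unsigned long vma_addr_to_pgoff(const struct vma_range *vma,
                                          unsigned long addr)
   {
       assert(vma_contains(vma, addr));
       return vma->vm_pgoff + ((addr - vma->vm_start) >> MODEL_PAGE_SHIFT);
   }

Because the reverse mapping locates VMAs through exactly this kind of
arithmetic, changing any of the three fields while the VMA is still visible
there would make lookups inconsistent.
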
.. table:: Core fields

   ============================ ======================================== =========================
   Field                        Description                              Write lock
   ============================ ======================================== =========================
   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
                                                                         initial map.
   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
                                protection bits determined from VMA
                                flags.
   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
                                attributes of the VMA, in union with
                                private writable
                                :c:member:`!__vm_flags`.
   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
                                field, updated by
                                :c:func:`!vm_flags_*` functions.
   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
                                struct file object describing the        initial map.
                                underlying file, if anonymous then
                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - Written once on
                                the driver or file-system provides a     initial map by
                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
                                object describing callbacks to be
                                invoked on VMA lifetime events.
   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
                                driver-specific metadata.
   ============================ ======================================== =========================

These are the core fields which describe the MM the VMA belongs to and its attributes.

.. table:: Config-specific fields

   ================================= ===================== ======================================== ===============
   Field                             Configuration option  Description                              Write lock
   ================================= ===================== ======================================== ===============
   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
                                                           :c:struct:`!struct anon_vma_name`        VMA write.
                                                           object providing a name for anonymous
                                                           mappings, or :c:macro:`!NULL` if none
                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs
                                                           for scalability.
   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
                                                           to perform readahead. This field is      swap-specific
                                                           accessed atomically.                     lock.
   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
                                                           describes the NUMA behaviour of the      VMA write.
                                                           VMA. The underlying object is reference
                                                           counted.
   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
                                                           describes the current state of           numab-specific
                                                           NUMA balancing in relation to this VMA.  lock.
                                                           Updated under mmap read lock by
                                                           :c:func:`!task_numa_work`.
   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
                                                           either of zero size if userfaultfd is
                                                           disabled, or containing a pointer
                                                           to an underlying
                                                           :c:type:`!userfaultfd_ctx` object which
                                                           describes userfaultfd metadata.
   ================================= ===================== ======================================== ===============

These fields are present or not depending on whether the relevant kernel
configuration option is set.

.. table:: Reverse mapping fields

   =================================== ========================================= ============================
   Field                               Description                               Write lock
   =================================== ========================================= ============================
   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
                                       mapping is file-backed, to place the VMA  i_mmap write.
                                       in the
                                       :c:member:`!struct address_space->i_mmap`
                                       red/black interval tree.
   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
                                       interval tree if the VMA is file-backed.  i_mmap write.
   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW’d     mmap read, anon_vma write.
                                       :c:type:`!anon_vma` objects and
                                       :c:member:`!vma->anon_vma` if it is
                                       non-:c:macro:`!NULL`.
   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`NULL` and
                                       anonymous folios mapped exclusively to    setting non-:c:macro:`NULL`:
                                       this VMA. Initially set by                mmap read, page_table_lock.
                                       :c:func:`!anon_vma_prepare` serialised
                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`NULL` and
                                       is set as soon as any page is faulted in. setting :c:macro:`NULL`:
                                                                                 mmap write, VMA write,
                                                                                 anon_vma write.
   =================================== ========================================= ============================

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should
reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
          then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
          trees at the same time, so all of these fields might be utilised at
          once.

Page tables
-----------

We won't speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contains entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

.. note:: In instances where the architecture supports fewer than five page
          table levels the kernel cleverly 'folds' page table levels, that is
          stubbing out functions related to the skipped levels. This allows us
          to conceptually act as if there were always five levels, even if the
          compiler might, in practice, eliminate any code relating to missing
          ones.

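To make "offsets are provided by the virtual address itself" concrete, the
sketch below splits an address into one index per level. The shift values are an
assumption chosen to match x86-64 with five-level paging and 4 KiB pages (nine
index bits per level); other architectures and configurations differ, and the
``MODEL_*`` names and ``split_address()`` are hypothetical.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   /* Assumed layout: x86-64, five-level paging, 4 KiB pages. */
   #define MODEL_PAGE_SHIFT     12
   #define MODEL_PMD_SHIFT      21
   #define MODEL_PUD_SHIFT      30
   #define MODEL_P4D_SHIFT      39
   #define MODEL_PGDIR_SHIFT    48
   #define MODEL_PTRS_PER_TABLE 512 /* 9 index bits per level */

   struct pt_indices {
       unsigned int pgd, p4d, pud, pmd, pte;
       uint64_t offset; /* byte offset within the final data page */
   };

   /* Extract the table index that the given level reads from the address. */
   static unsigned int table_index(uint64_t addr, unsigned int shift)
   {
       return (unsigned int)((addr >> shift) & (MODEL_PTRS_PER_TABLE - 1));
   }

   static struct pt_indices split_address(uint64_t addr)
   {
       struct pt_indices idx = {
           .pgd = table_index(addr, MODEL_PGDIR_SHIFT),
           .p4d = table_index(addr, MODEL_P4D_SHIFT),
           .pud = table_index(addr, MODEL_PUD_SHIFT),
           .pmd = table_index(addr, MODEL_PMD_SHIFT),
           .pte = table_index(addr, MODEL_PAGE_SHIFT),
           .offset = addr & ((UINT64_C(1) << MODEL_PAGE_SHIFT) - 1),
       };

       assert(idx.pgd < MODEL_PTRS_PER_TABLE);
       return idx;
   }

A walk then uses ``pgd`` to pick an entry in the top-level table, follows it to
the next table, indexes that with ``p4d``, and so on down to the PTE level.
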
There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal (there are also lockless variants
   which eliminate even this requirement, such as :c:func:`!gup_fast`). There is
   also a special case of page table traversal for non-VMA regions which we
   consider separately below.
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This is a very common operation in the kernel performed on
   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
   :c:func:`!madvise`, and others. This is performed by a number of functions
   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`.
   The VMA need only be kept stable for this operation.
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries (it assumes the
   caller has both zapped the range and prevented any further faults or
   modifications within it).

.. note:: Modifying mappings for reclaim or migration is performed under rmap
          lock as it, like zapping, does not fundamentally modify the identity
          of what is being mapped.

**Traversing** and **zapping** ranges can be performed holding any one of the
locks described in the terminology section above - that is the mmap lock, the
VMA lock or either of the reverse mapping locks.

That is - as long as you keep the relevant VMA **stable** - you are good to go
ahead and perform these operations on page tables (though internally, kernel
operations that perform writes also acquire internal page table locks to
serialise - see the page table implementation detail section for more details).

.. note:: We free empty PTE tables on zap under the RCU lock - this does not
          change the aforementioned locking requirements around zapping.

When **installing** page table entries, the mmap or VMA lock must be held to
keep the VMA stable. We explore why this is in the page table locking details
section below.

**Freeing** page tables is an entirely internal memory management operation and
has special requirements (see the page freeing section below for more details).

.. warning:: When **freeing** page tables, it must not be possible for VMAs
             containing the ranges those page tables map to be accessible via
             the reverse mapping.

             The :c:func:`!free_pgtables` function removes the relevant VMAs
             from the reverse mappings, but no other VMAs can be permitted to be
             accessible and span the specified range.

Traversing non-VMA page tables
------------------------------

We've focused above on traversal of page tables belonging to VMAs. It is also
possible to traverse page tables which are not represented by VMAs.

Kernel page table mappings themselves are generally managed by whatever part of
the kernel established them and the aforementioned locking rules do not apply -
for instance vmalloc has its own set of locks which are utilised for
establishing and tearing down its page tables.

However, for convenience we provide the :c:func:`!walk_kernel_page_table_range`
function which is synchronised via the mmap lock on the :c:macro:`!init_mm`
kernel instantiation of the :c:struct:`!struct mm_struct` metadata object.

If an operation requires exclusive access, a write lock is used, but if not, a
read lock suffices - we assert only that at least a read lock has been acquired.

Since, aside from vmalloc and memory hot plug, kernel page tables are not torn
down all that often - this usually suffices, however any caller of this
functionality must ensure that any additionally required locks are acquired in
advance.

We also permit a truly unusual case: the traversal of non-VMA ranges within the
**userland** address space, as provided for by :c:func:`!walk_page_range_debug`.

This has only one user - the general page table dumping logic (implemented in
:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug purposes
even if they are highly unusual (possibly architecture-specific) and are not
backed by a VMA.

We must take great care in this case, as the :c:func:`!munmap` implementation
detaches VMAs under an mmap write lock before tearing down page tables under a
downgraded mmap read lock.

This means such a traversal could race with that page table teardown, and thus
an mmap **write** lock is required.

Lock ordering
-------------

As we have multiple locks across the kernel which may or may not be taken at the
same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
the **order** in which locks are acquired and released becomes very important.

.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
          but in doing so inadvertently cause a mutual deadlock.

          For example, consider thread 1 which holds lock A and tries to acquire lock B,
          while thread 2 holds lock B and tries to acquire lock A.

          Both threads are now deadlocked on each other. However, had they attempted to
          acquire locks in the same order, one would have waited for the other to
          complete its work and no deadlock would have occurred.

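The two-thread scenario in the note can be reproduced, and avoided, in a few
lines of userspace C. This is a generic sketch using POSIX mutexes rather than
any kernel lock; ordering by address is just one simple way to impose a single
global order, and all names here are hypothetical.

.. code-block:: c

   #include <assert.h>
   #include <pthread.h>
   #include <stdint.h>

   static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
   static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
   static long shared_counter;

   /* Always lock the lower-addressed mutex first, whatever order is asked. */
   static void lock_both(pthread_mutex_t *x, pthread_mutex_t *y)
   {
       if ((uintptr_t)x > (uintptr_t)y) {
           pthread_mutex_t *tmp = x;
           x = y;
           y = tmp;
       }
       pthread_mutex_lock(x);
       pthread_mutex_lock(y);
   }

   static void unlock_both(pthread_mutex_t *x, pthread_mutex_t *y)
   {
       pthread_mutex_unlock(x);
       pthread_mutex_unlock(y);
   }

   struct worker_args {
       pthread_mutex_t *first, *second; /* order the caller "wants" */
       int iterations;
   };

   static void *worker(void *p)
   {
       struct worker_args *args = p;

       for (int i = 0; i < args->iterations; i++) {
           lock_both(args->first, args->second);
           shared_counter++;
           unlock_both(args->first, args->second);
       }
       return NULL;
   }

   /* Thread 1 asks for A then B, thread 2 for B then A; neither deadlocks. */
   static long run_two_workers(int iterations)
   {
       struct worker_args t1 = { &lock_a, &lock_b, iterations };
       struct worker_args t2 = { &lock_b, &lock_a, iterations };
       pthread_t th1, th2;
       int rc;

       shared_counter = 0;
       rc = pthread_create(&th1, NULL, worker, &t1);
       assert(rc == 0);
       rc = pthread_create(&th2, NULL, worker, &t2);
       assert(rc == 0);
       (void)rc;
       pthread_join(th1, NULL);
       pthread_join(th2, NULL);
       return shared_counter;
   }

Replacing ``lock_both()`` with two direct ``pthread_mutex_lock()`` calls in the
order each thread requested would reintroduce the inversion described above.
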
The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
ordering of locks within memory management code:

.. code-block::

   inode->i_rwsem (while writing or truncating, not reading or faulting)
     mm->mmap_lock
       mapping->invalidate_lock (in filemap_fault)
         folio_lock
           hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
             vma_start_write
               mapping->i_mmap_rwsem
                 anon_vma->rwsem
                   mm->page_table_lock or pte_lock
                     swap_lock (in swap_duplicate, swap_info_get)
                       mmlist_lock (in mmput, drain_mmlist and others)
                       mapping->private_lock (in block_dirty_folio)
                         i_pages lock (widely used)
                           lruvec->lru_lock (in folio_lruvec_lock_irq)
                       inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                       bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                         sb_lock (within inode_lock in fs/fs-writeback.c)
                         i_pages lock (widely used, in set_page_dirty,
                                   in arch-dependent flush_dcache_mmap_lock,
                                   within bdi.wb->list_lock in __sync_single_inode)

There is also a file-system specific lock ordering comment located at the top of
:c:macro:`!mm/filemap.c`:

.. code-block::

   ->i_mmap_rwsem                        (truncate_pagecache)
     ->private_lock                      (__free_pte->block_dirty_folio)
       ->swap_lock                       (exclusive_swap_page, others)
         ->i_pages lock

   ->i_rwsem
     ->invalidate_lock                   (acquired by fs in truncate path)
       ->i_mmap_rwsem                    (truncate->unmap_mapping_range)

   ->mmap_lock
     ->i_mmap_rwsem
       ->page_table_lock or pte_lock     (various, mainly in memory.c)
         ->i_pages lock                  (arch-dependent flush_dcache_mmap_lock)

   ->mmap_lock
     ->invalidate_lock                   (filemap_fault)
       ->lock_page                       (filemap_fault, access_process_vm)

   ->i_rwsem                             (generic_perform_write)
     ->mmap_lock                         (fault_in_readable->do_page_fault)

   bdi->wb.list_lock
     sb_lock                             (fs/fs-writeback.c)
     ->i_pages lock                      (__sync_single_inode)

   ->i_mmap_rwsem
     ->anon_vma.lock                     (vma_merge)

   ->anon_vma.lock
     ->page_table_lock or pte_lock       (anon_vma_prepare and various)

   ->page_table_lock or pte_lock
     ->swap_lock                         (try_to_unmap_one)
     ->private_lock                      (try_to_unmap_one)
     ->i_pages lock                      (try_to_unmap_one)
     ->lruvec->lru_lock                  (follow_page_mask->mark_page_accessed)
     ->lruvec->lru_lock                  (check_pte_range->folio_isolate_lru)
     ->private_lock                      (folio_remove_rmap_pte->set_page_dirty)
     ->i_pages lock                      (folio_remove_rmap_pte->set_page_dirty)
     bdi.wb->list_lock                   (folio_remove_rmap_pte->set_page_dirty)
     ->inode->i_lock                     (folio_remove_rmap_pte->set_page_dirty)
     bdi.wb->list_lock                   (zap_pte_range->set_page_dirty)
     ->inode->i_lock                     (zap_pte_range->set_page_dirty)
     ->private_lock                      (zap_pte_range->block_dirty_folio)

Please check the current state of these comments which may have changed since
the time of writing of this document.

------------------------------
Locking Implementation Details
------------------------------

.. warning:: Locking rules for PTE-level page tables are very different from
             locking rules for page tables at other levels.

Page table locking details
--------------------------

.. note:: This section explores page table locking requirements for page tables
          encompassed by a VMA. See the above section on non-VMA page table
          traversal for details on how we handle that case.

In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:

* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
  and PUD each make use of the process address space granularity
  :c:member:`!mm->page_table_lock` lock when modified.

* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
  either kept within the folios describing the page tables or allocated
  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
  mapped into higher memory (if a 32-bit system) and carefully locked via
  :c:func:`!pte_offset_map_lock`.

These locks represent the minimum required to interact with each page table
level, but there are further requirements.

Importantly, note that on a **traversal** of page tables, sometimes no such
locks are taken. However, at the PTE level, at least concurrent page table
deletion must be prevented (using RCU) and the page table must be mapped into
high memory, see below.

Whether care is taken on reading the page table entries depends on the
architecture, see the section on atomicity below.

Locking rules
^^^^^^^^^^^^^

We establish basic locking rules when interacting with page tables:

* When changing a page table entry the page table lock for that page table
  **must** be held, except if you can safely assume nobody can access the page
  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
* Reads from and writes to page table entries must be *appropriately*
  atomic. See the section on atomicity below for details.
* Populating previously empty entries requires that the mmap or VMA locks are
  held (read or write), doing so with only rmap locks would be dangerous (see
  the warning below).
* As mentioned previously, zapping can be performed while simply keeping the VMA
  stable, that is holding any one of the mmap, VMA or rmap locks.

.. warning:: Populating previously empty entries is dangerous as, when unmapping
             VMAs, :c:func:`!vms_clear_ptes` has a window of time between
             zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
             :c:func:`!free_pgtables`), where the VMA is still visible in the
             rmap tree. :c:func:`!free_pgtables` assumes that the zap has
             already been performed and removes PTEs unconditionally (along with
             all other page tables in the freed range), so installing new PTE
             entries could leak memory and also cause other unexpected and
             dangerous behaviour.

There are additional rules applicable when moving page tables, which we discuss
in the section on this topic below.

PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

* On 32-bit architectures, they may be in high memory (meaning they need to be
  mapped into kernel memory to be accessible).
* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
  rmap lock for reading in combination with the PTE and PMD page table locks.
  In particular, this happens in :c:func:`!retract_page_tables` when handling
  :c:macro:`!MADV_COLLAPSE`.
  So accessing PTE-level page tables requires at least holding an RCU read lock;
  but that only suffices for readers that can tolerate racing with concurrent
  page table updates such that an empty PTE is observed (in a page table that
  has actually already been detached and marked for RCU freeing) while another
  new page table has been installed in the same location and filled with
  entries. Writers normally need to take the PTE lock and revalidate that the
  PMD entry still refers to the same PTE-level page table.
  If the writer does not care whether it is the same PTE-level page table, it
  can take the PMD lock and revalidate that the contents of the PMD entry still
  meet the requirements. In particular, this also happens in
  :c:func:`!retract_page_tables` when handling :c:macro:`!MADV_COLLAPSE`.

To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
:c:func:`!pte_offset_map` can be used depending on stability requirements.
These map the page table into kernel memory if required, take the RCU lock, and
depending on variant, may also look up or acquire the PTE lock.
See the comment on :c:func:`!pte_offset_map_lock`.


Atomicity
^^^^^^^^^

Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal operations may run in parallel (though holding the VMA stable),
and functionality like GUP-fast locklessly traverses (that is, reads) page
tables without even keeping the VMA stable at all.

When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).

If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
:c:func:`!__pud_install`), special care must always be taken. In these cases we
can never assume that page table locks give us entirely exclusive access, and
must retrieve page table entries once and only once.

If we are reading page table entries, then we need only ensure that the compiler
does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.

Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.

However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use a hardware atomic
operation as, for example, in :c:func:`!ptep_get_and_clear`.
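
The difference between a single read and an atomic read-and-clear can be
sketched in userspace with C11 atomics. This is only an illustrative model, not
kernel code: every ``model_`` name and the bit layout are hypothetical, a
relaxed atomic load stands in (roughly) for :c:func:`!READ_ONCE`, and an atomic
exchange stands in for the hardware atomic operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout for this model: bit 0 = present, bit 1 = dirty. */
#define MODEL_PTE_PRESENT 0x1u
#define MODEL_PTE_DIRTY   0x2u

typedef _Atomic uint64_t model_pte_t;

/* Rough analogue of ptep_get(): exactly one load of the entry. */
uint64_t model_ptep_get(model_pte_t *ptep)
{
	return atomic_load_explicit(ptep, memory_order_relaxed);
}

/* Stand-in for the MMU setting the dirty bit concurrently with software. */
void model_hw_set_dirty(model_pte_t *ptep)
{
	atomic_fetch_or_explicit(ptep, MODEL_PTE_DIRTY, memory_order_relaxed);
}

/*
 * Rough analogue of ptep_get_and_clear(): the old value is fetched and the
 * entry is zeroed in one atomic step, so a dirty bit set just before the
 * clear is returned to the caller rather than silently lost.
 */
uint64_t model_ptep_get_and_clear(model_pte_t *ptep)
{
	assert(ptep != NULL);
	return atomic_exchange(ptep, 0);
}
```

Had the entry been read and then cleared with two separate operations, a dirty
bit set in between would be dropped; the single exchange closes that window.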

Equally, operations that do not rely on the VMA being held stable, such as
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
entries, using functions such as :c:func:`!ptep_get_lockless` and its
equivalents for higher page table levels.

Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.

Equally, functions which clear page table entries must be appropriately atomic,
as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.

Page table installation
^^^^^^^^^^^^^^^^^^^^^^^

Page table installation is performed with the VMA held stable explicitly by an
mmap or VMA lock in read or write mode (see the warning in the locking rules
section for details as to why).

When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
:c:func:`!__pmd_alloc` respectively.

.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
   references the :c:member:`!mm->page_table_lock`.

Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
:c:func:`!__pte_alloc`.

Finally, modifying the contents of the PTE requires special treatment, as the
PTE page table lock must be acquired whenever we want stable and exclusive
access to entries contained within a PTE, especially when we wish to modify
them.

This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
ensure that the PTE hasn't changed from under us, ultimately invoking
:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
must be released via :c:func:`!pte_unmap_unlock`.

.. note:: There are some variants on this, such as
   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
   for brevity we do not explore this. See the comment for
   :c:func:`!pte_offset_map_lock` for more details.

When modifying data in ranges we typically only wish to allocate higher page
tables as necessary, using these locks to avoid races or overwriting anything,
and set/clear data at the PTE level as required (for instance when page faulting
or zapping).

A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the table
above is empty, if so, only then acquiring the page table lock and checking
again to see if it was allocated underneath us.

This allows for a traversal with page table locks only being taken when
required. An example of this is :c:func:`!__pud_alloc`.
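
The optimistic check followed by a locked re-check can be sketched in userspace
as below. This is a simplified model under stated assumptions rather than the
kernel's implementation: all ``model_`` names are hypothetical, a C11
``atomic_flag`` spin lock plays the role of :c:member:`!mm->page_table_lock`,
and the lockless check and the locked install are folded into one function for
brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a page table and the upper entry pointing at it. */
struct model_table {
	unsigned long entries[512];
};

struct model_upper_entry {
	_Atomic(struct model_table *) table;	/* NULL means "none" */
};

/* Plays the role of mm->page_table_lock in this sketch. */
static atomic_flag model_page_table_lock = ATOMIC_FLAG_INIT;

/* Counts how many tables were actually installed (for illustration only). */
int model_tables_installed;

struct model_table *model_table_alloc(struct model_upper_entry *upper)
{
	struct model_table *table, *new_table;

	assert(upper != NULL);

	/* Optimistic, lockless check: nothing to do if already populated. */
	table = atomic_load_explicit(&upper->table, memory_order_acquire);
	if (table)
		return table;

	/* Allocate before taking the lock to keep the critical section short. */
	new_table = calloc(1, sizeof(*new_table));
	if (!new_table)
		return NULL;

	while (atomic_flag_test_and_set_explicit(&model_page_table_lock,
						 memory_order_acquire))
		;	/* spin until the lock is ours */

	/* Re-check under the lock: another task may have installed a table. */
	table = atomic_load_explicit(&upper->table, memory_order_relaxed);
	if (!table) {
		atomic_store_explicit(&upper->table, new_table,
				      memory_order_release);
		model_tables_installed++;
		table = new_table;
		new_table = NULL;
	}

	atomic_flag_clear_explicit(&model_page_table_lock, memory_order_release);

	free(new_table);	/* no-op unless we lost the race */
	return table;
}
```

Calling the function twice on the same upper entry returns the same table and
installs only once; the second call never touches the lock.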

At the leaf page table, that is the PTE, we can't entirely rely on this pattern
as we have separate PMD and PTE locks and a THP collapse for instance might have
eliminated the PMD entry as well as the PTE from under us.

This is why :c:func:`!pte_offset_map_lock` locklessly retrieves the PMD entry
for the PTE, carefully checking it is as expected, before acquiring the
PTE-specific lock, and then *again* checking that the PMD entry is as expected.

If a THP collapse (or similar) were to occur then the lock on both pages would
be acquired, so we can ensure this is prevented while the PTE lock is held.

Installing entries this way ensures mutual exclusion on write.
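
The read, lock, re-check sequence described above can be modelled in userspace
as follows. This is a hedged sketch only: the ``model_`` types, the
``model_race_hook`` test hook and ``model_retract_hook`` are all hypothetical,
an ``atomic_flag`` stands in for the PTE spin lock, and the model never frees
the table, whereas the kernel relies on RCU to keep a detached table's memory
valid across this window.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct model_pte_table {
	atomic_flag lock;		/* stands in for the PTE spin lock */
	unsigned long entries[512];
};

struct model_pmd {
	_Atomic(struct model_pte_table *) table;	/* NULL: no PTE table */
};

/* Hypothetical hook, called inside the race window so a test can interfere. */
void (*model_race_hook)(struct model_pmd *pmd);

/* Example hook: detach the PTE table, as a collapse might, then disarm. */
void model_retract_hook(struct model_pmd *pmd)
{
	atomic_store(&pmd->table, NULL);
	model_race_hook = NULL;
}

/* Returns the PTE table with its lock held, or NULL if there is none. */
struct model_pte_table *model_pte_lock(struct model_pmd *pmd)
{
	struct model_pte_table *table;

	for (;;) {
		/* 1. Lockless read of the PMD entry. */
		table = atomic_load(&pmd->table);
		if (!table)
			return NULL;

		if (model_race_hook)
			model_race_hook(pmd);	/* a racing task may act here */

		/* 2. Take the PTE-level lock. */
		while (atomic_flag_test_and_set_explicit(&table->lock,
							 memory_order_acquire))
			;

		/* 3. Re-check that the PMD entry still names this table. */
		if (atomic_load(&pmd->table) == table)
			return table;

		/* Changed under us: drop the lock and start again. */
		atomic_flag_clear_explicit(&table->lock, memory_order_release);
	}
}

void model_pte_unlock(struct model_pte_table *table)
{
	assert(table != NULL);
	atomic_flag_clear_explicit(&table->lock, memory_order_release);
}
```

With the hook armed, the lookup takes the lock on the stale table, notices the
PMD entry no longer matches, releases the lock and reports that no PTE table is
present.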

Page table freeing
^^^^^^^^^^^^^^^^^^

Tearing down page tables themselves is something that requires significant
care. There must be no way that page tables designated for removal can be
traversed or referenced by concurrent tasks.

It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
address_space->i_mmap` interval trees) can have its page tables torn down.

The operation is typically performed via :c:func:`!free_pgtables`, which assumes
either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.

It carefully removes the VMA from all reverse mappings, however it's important
that no new ones overlap these or any route remain to permit access to addresses
within the range whose page tables are being torn down.

Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed between
the zap and the invocation of :c:func:`!free_pgtables`.

Since it is assumed that all such steps have been taken, page table entries are
cleared without page table locks (in the :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions).

.. note:: It is possible for leaf page tables to be torn down independently of
   the page tables above them, as is done by
   :c:func:`!retract_page_tables`, which is performed under the i_mmap
   read lock, PMD, and PTE page table locks, without this level of care.

Page table moving
^^^^^^^^^^^^^^^^^

Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
page tables). Most notable of these is :c:func:`!mremap`, which is capable of
moving higher level page tables.

In these instances, it is required that **all** locks are taken, that is
the mmap lock, the VMA lock and the relevant rmap locks.

You can observe this in the :c:func:`!mremap` implementation in the functions
:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.

VMA lock internals
------------------

Overview
^^^^^^^^

VMA read locking is entirely optimistic - if the lock is contended or a competing
write has started, then we do not obtain a read lock.

A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.

In cases when the user already holds the mmap read lock,
:c:func:`!vma_start_read_locked` and :c:func:`!vma_start_read_locked_nested` can
be used. These functions do not fail due to lock contention but the caller
should still check their return values in case they fail for other reasons.

VMA read locks increment the :c:member:`!vma.vm_refcnt` reference counter for
their duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
:c:func:`!vma_end_read`.

VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances
where a VMA is about to be modified. Unlike :c:func:`!vma_start_read`, the lock
is always acquired. An mmap write lock **must** be held for the duration of the
VMA write lock; releasing or downgrading the mmap write lock also releases the
VMA write lock, so there is no :c:func:`!vma_end_write` function.

Note that when write-locking a VMA, the :c:member:`!vma.vm_refcnt` is
temporarily modified so that readers can detect the presence of a writer. The
reference counter is restored once the VMA sequence number used for
serialisation is updated.

This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.

Implementation details
^^^^^^^^^^^^^^^^^^^^^^

The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
reference counter and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.

Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.

Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
been called first, establishing that we are in an RCU critical section upon VMA
read lock acquisition. Once acquired, the RCU lock can be released as it is only
required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu` which
is the interface a user should use.

Writing requires the mmap to be write-locked and the VMA lock to be acquired via
:c:func:`!vma_start_write`; however, the write lock is released by the
termination or downgrade of the mmap write lock so no :c:func:`!vma_end_write`
is required.

All this is achieved by the use of per-mm and per-VMA sequence counts, which are
used in order to reduce complexity, especially for operations which write-lock
multiple VMAs at once.

If the mm sequence count, :c:member:`!mm->mm_lock_seq`, is equal to the VMA
sequence count, :c:member:`!vma->vm_lock_seq`, then the VMA is write-locked. If
they differ, then it is not.

Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
also increments :c:member:`!mm->mm_lock_seq` via
:c:func:`!mm_lock_seqcount_end`.

This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated and that when we release an mmap write lock we
efficiently release **all** VMA write locks contained within the mmap at the
same time.

Since the mmap write lock is exclusive against others who hold it, the automatic
release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.

Each time a VMA read lock is acquired, we increment the
:c:member:`!vma.vm_refcnt` reference counter and check that the sequence count
of the VMA does not match that of the mm.

If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
If it does not, we keep the reference counter raised, excluding writers, but
permitting other readers, who can also obtain this lock under RCU.

Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.

On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
modified by readers and wait for all readers to drop their reference count.
Once there are no readers, the VMA's sequence number is set to match that of
the mm. During this entire operation the mmap write lock is held.

This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.

After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
indicating a writer is cleared. From this point on, the VMA's sequence number
will indicate the VMA's write-locked state until the mmap write lock is dropped
or downgraded.

This clever combination of a reference counter and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.
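
The interplay of the two sequence counts and the reference counter can be
illustrated with a deliberately simplified, single-threaded userspace model.
Every ``model_`` name here is hypothetical, and the sketch makes assumptions
the kernel does not: there are no atomics, no RCU and no writer bit, and where
:c:func:`!vma_start_write` would sleep waiting for readers, the model simply
returns false.

```c
#include <assert.h>
#include <stdbool.h>

struct model_mm {
	unsigned int mm_lock_seq;
	bool mmap_write_locked;
};

struct model_vma {
	struct model_mm *mm;
	unsigned int vm_lock_seq;
	int vm_refcnt;		/* number of readers in this model */
};

/* A VMA is write-locked exactly when its sequence count matches the mm's. */
bool model_vma_is_write_locked(const struct model_vma *vma)
{
	return vma->vm_lock_seq == vma->mm->mm_lock_seq;
}

/* Optimistic read lock: raise the count, then back off if write-locked. */
bool model_vma_start_read(struct model_vma *vma)
{
	vma->vm_refcnt++;
	if (model_vma_is_write_locked(vma)) {
		vma->vm_refcnt--;
		return false;
	}
	return true;
}

void model_vma_end_read(struct model_vma *vma)
{
	assert(vma->vm_refcnt > 0);
	vma->vm_refcnt--;
}

void model_mmap_write_lock(struct model_mm *mm)
{
	mm->mmap_write_locked = true;
}

/* Needs the mmap write lock; fails (instead of sleeping) while readers exist. */
bool model_vma_start_write(struct model_vma *vma)
{
	if (!vma->mm->mmap_write_locked || vma->vm_refcnt > 0)
		return false;
	vma->vm_lock_seq = vma->mm->mm_lock_seq;
	return true;
}

/* Bumping the mm count releases every VMA write lock in one step. */
void model_mmap_write_unlock(struct model_mm *mm)
{
	mm->mm_lock_seq++;
	mm->mmap_write_locked = false;
}
```

Because releasing the mmap write lock only increments one counter, the cost of
dropping all VMA write locks does not depend on how many VMAs were locked.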

mmap write lock downgrading
---------------------------

When an mmap write lock is held one has exclusive access to resources within the
mmap (with the usual caveats about requiring VMA write locks to avoid races with
tasks holding VMA read locks).

It is then possible to **downgrade** from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.

An interesting consequence of this is that downgraded locks are exclusive
against any other task possessing a downgraded lock (since a racing task would
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released).

For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another showing which locks exclude the others:

.. list-table:: Lock exclusivity
   :widths: 5 5 5 5
   :header-rows: 1
   :stub-columns: 1

   * -
     - R
     - D
     - W
   * - R
     - N
     - N
     - Y
   * - D
     - N
     - Y
     - Y
   * - W
     - Y
     - Y
     - Y

Here a Y indicates the locks in the matching row/column are mutually exclusive,
and N indicates that they are not.
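
The table reduces to a simple rule: two locks exclude one another if either is
a write lock, or if both are downgraded write locks. A small sketch of that
rule, using hypothetical names that do not exist in the kernel, might look like
this:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three lock states in the table. */
enum model_mmap_lock {
	MODEL_LOCK_R,	/* read */
	MODEL_LOCK_D,	/* downgraded write */
	MODEL_LOCK_W,	/* write */
};

/* True if a holder of lock 'a' excludes a holder of lock 'b'. */
bool model_locks_exclude(enum model_mmap_lock a, enum model_mmap_lock b)
{
	assert(a <= MODEL_LOCK_W && b <= MODEL_LOCK_W);

	/* A write lock excludes everything, including other writers. */
	if (a == MODEL_LOCK_W || b == MODEL_LOCK_W)
		return true;

	/* Only one task can hold a downgraded lock at a time. */
	return a == MODEL_LOCK_D && b == MODEL_LOCK_D;
}
```

The rule is symmetric, matching the symmetry of the table about its diagonal.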

Stack expansion
---------------

Stack expansion throws up additional complexities in that we cannot permit there
to be racing page faults. As a result, we invoke :c:func:`!vma_start_write` to
prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.

------------------------
Functions and structures
------------------------

.. kernel-doc:: include/linux/mmap_lock.h