.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

.. _napi:

====
NAPI
====

NAPI is the event handling mechanism used by the Linux networking stack.
The name NAPI no longer stands for anything in particular [#]_.

In basic operation the device notifies the host about new events
via an interrupt.
The host then schedules a NAPI instance to process the events.
The device may also be polled for events via NAPI without receiving
interrupts first (:ref:`busy polling<poll>`).

NAPI processing usually happens in the software interrupt context,
but there is an option to use :ref:`separate kernel threads<threaded>`
for NAPI processing.

All in all, NAPI abstracts the context and configuration of event
(packet Rx and Tx) processing away from the drivers.

Driver API
==========

The two most important elements of NAPI are the struct napi_struct
and the associated poll method. struct napi_struct holds the state
of the NAPI instance while the method is the driver-specific event
handler. The method will typically free Tx packets that have been
transmitted and process newly received packets.

.. _drv_ctrl:

Control API
-----------

netif_napi_add() and netif_napi_del() add/remove a NAPI instance
from the system. The instances are attached to the netdevice passed
as argument (and will be deleted automatically when the netdevice is
unregistered). Instances are added in a disabled state.

napi_enable() and napi_disable() manage the disabled state.
A disabled NAPI can't be scheduled and its poll method is guaranteed
to not be invoked. napi_disable() waits for ownership of the NAPI
instance to be released.

The control APIs are not idempotent. Control API calls are safe against
concurrent use of datapath APIs but an incorrect sequence of control API
calls may result in crashes, deadlocks, or race conditions. For example,
calling napi_disable() multiple times in a row will deadlock.

Datapath API
------------

napi_schedule() is the basic method of scheduling a NAPI poll.
Drivers should call this function in their interrupt handler
(see :ref:`drv_sched` for more info). A successful call to napi_schedule()
will take ownership of the NAPI instance.

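
The ownership rule can be pictured with a small userspace model. This is
only a sketch under stated assumptions: ``struct napi_model``,
``model_schedule()`` and ``model_complete()`` are hypothetical names, not
kernel APIs, and the kernel tracks more state than the single flag used here.

.. code-block:: c

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Hypothetical model of a NAPI instance: one ownership flag only. */
  struct napi_model {
      atomic_bool owned;
  };

  /* Models napi_schedule(): returns true only if ownership was taken. */
  static bool model_schedule(struct napi_model *n)
  {
      /* atomic_exchange() returns the previous value of the flag. */
      return !atomic_exchange(&n->owned, true);
  }

  /* Models napi_complete_done(): releases ownership of the instance. */
  static void model_complete(struct napi_model *n)
  {
      atomic_store(&n->owned, false);
  }

In this model a second ``model_schedule()`` call fails until
``model_complete()`` has run, which mirrors why only a successful scheduling
call owns the instance.
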
Later, after NAPI is scheduled, the driver's poll method will be
called to process the events/packets. The method takes a ``budget``
argument: drivers can process completions for any number of Tx
packets but should only process up to ``budget`` number of
Rx packets. Rx processing is usually much more expensive.

In other words, for Rx processing the ``budget`` argument limits how many
packets the driver can process in a single poll. Rx specific APIs like page
pool or XDP cannot be used at all when ``budget`` is 0.
skb Tx processing should happen regardless of the ``budget``, but if
the argument is 0 the driver cannot call any XDP (or page pool) APIs.

.. warning::

   The ``budget`` argument may be 0 if the core tries to only process
   skb Tx completions and no Rx or XDP packets.

The poll method returns the amount of work done. If the driver still
has outstanding work to do (e.g. ``budget`` was exhausted)
the poll method should return exactly ``budget``. In that case,
the NAPI instance will be serviced/polled again (without the
need to be scheduled).

If event processing has been completed (all outstanding packets
processed) the poll method should call napi_complete_done()
before returning. napi_complete_done() releases the ownership
of the instance.

.. warning::

   The case of finishing all events and using exactly ``budget``
   must be handled carefully. There is no way to report this
   (rare) condition to the stack, so the driver must either
   not call napi_complete_done() and wait to be called again,
   or return ``budget - 1``.

   If the ``budget`` is 0 napi_complete_done() should never be called.

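
The return-value rules above can be sketched as a userspace model. The names
``struct mydrv_ring_model`` and ``mydrv_poll_model()`` are hypothetical; the
model only counts packets and records, through ``completed``, whether a real
poll method would have called napi_complete_done(). For the "exactly
``budget``" case it takes the first option from the warning (do not complete,
wait to be polled again).

.. code-block:: c

  #include <stdbool.h>

  /* Hypothetical stand-in for an Rx ring: only counts pending packets. */
  struct mydrv_ring_model {
      int rx_pending;
  };

  /*
   * Returns the value a poll method would return. *completed is set when
   * the model would have called napi_complete_done().
   */
  static int mydrv_poll_model(struct mydrv_ring_model *ring, int budget,
                              bool *completed)
  {
      int work_done = 0;

      *completed = false;

      /* Tx completions would be reaped here regardless of budget. */

      /* budget == 0: no Rx work and no completion call. */
      if (budget <= 0)
          return 0;

      while (work_done < budget && ring->rx_pending > 0) {
          ring->rx_pending--;
          work_done++;
      }

      /* Budget used up: return exactly budget and stay scheduled. */
      if (work_done == budget)
          return budget;

      /* All events handled with budget to spare: complete. */
      *completed = true;
      return work_done;
  }

With 4 packets pending and a budget of 4 the model returns 4 without
completing; the following poll finds nothing, completes, and returns 0.
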
Call sequence
-------------

Drivers should not make assumptions about the exact sequencing
of calls. The poll method may be called without the driver scheduling
the instance (unless the instance is disabled). Similarly,
it's not guaranteed that the poll method will be called, even
if napi_schedule() succeeded (e.g. if the instance gets disabled).

As mentioned in the :ref:`drv_ctrl` section, napi_disable() and subsequent
calls to the poll method only wait for the ownership of the instance
to be released, not for the poll method to exit. This means that
drivers should avoid accessing any data structures after calling
napi_complete_done().

.. _drv_sched:

Scheduling and IRQ masking
--------------------------

Drivers should keep the interrupts masked after scheduling
the NAPI instance - until NAPI polling finishes any further
interrupts are unnecessary.

Drivers which have to mask the interrupts explicitly (as opposed
to IRQ being auto-masked by the device) should use the napi_schedule_prep()
and __napi_schedule() calls:

.. code-block:: c

  if (napi_schedule_prep(&v->napi)) {
      mydrv_mask_rxtx_irq(v->idx);
      /* schedule after masking to avoid races */
      __napi_schedule(&v->napi);
  }

IRQ should only be unmasked after a successful call to napi_complete_done():

.. code-block:: c

  if (budget && napi_complete_done(&v->napi, work_done)) {
      mydrv_unmask_rxtx_irq(v->idx);
      return min(work_done, budget - 1);
  }

napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
of guarantees given by being invoked in IRQ context (no need to
mask interrupts). napi_schedule_irqoff() will fall back to napi_schedule() if
IRQs are threaded (such as if ``PREEMPT_RT`` is enabled).

Instance to queue mapping
-------------------------

Modern devices have multiple NAPI instances (struct napi_struct) per
interface. There is no strong requirement on how the instances are
mapped to queues and interrupts. NAPI is primarily a polling/processing
abstraction without specific user-facing semantics. That said, most networking
devices end up using NAPI in fairly similar ways.

NAPI instances most often correspond 1:1:1 to interrupts and queue pairs
(queue pair is a set of a single Rx and single Tx queue).

In less common cases a NAPI instance may be used for multiple queues
or Rx and Tx queues can be serviced by separate NAPI instances on a single
core. Regardless of the queue assignment, however, there is usually still
a 1:1 mapping between NAPI instances and interrupts.

It's worth noting that the ethtool API uses a "channel" terminology where
each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear
what constitutes a channel; the recommended interpretation is to understand
a channel as an IRQ/NAPI which services queues of a given type. For example,
a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected
to utilize 3 interrupts, 2 Rx and 2 Tx queues.

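
The arithmetic behind that interpretation can be written down as a short
sketch. The types and the ``count_channel_usage()`` helper are hypothetical
and not part of the ethtool API; they only encode the recommended reading
that every channel is one IRQ/NAPI and that a ``combined`` channel carries
one Rx and one Tx queue.

.. code-block:: c

  /* Hypothetical helper types; not part of the ethtool API. */
  struct channel_cfg {
      unsigned int rx, tx, combined;
  };

  struct channel_usage {
      unsigned int irqs, rx_queues, tx_queues;
  };

  /* One IRQ/NAPI per channel; combined channels add one queue of each type. */
  static struct channel_usage count_channel_usage(struct channel_cfg c)
  {
      struct channel_usage u = {
          .irqs = c.rx + c.tx + c.combined,
          .rx_queues = c.rx + c.combined,
          .tx_queues = c.tx + c.combined,
      };

      return u;
  }

For 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel this yields 3 interrupts,
2 Rx queues and 2 Tx queues, matching the example above.
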
Persistent NAPI config
----------------------

Drivers often allocate and free NAPI instances dynamically. This leads to loss
of NAPI-related user configuration each time NAPI instances are reallocated.
The netif_napi_add_config() API prevents this loss of configuration by
associating each NAPI instance with a persistent NAPI configuration based on
a driver defined index value, like a queue number.

Using this API allows for persistent NAPI IDs (among other settings), which can
be beneficial to userspace programs using ``SO_INCOMING_NAPI_ID``. See the
sections below for other NAPI configuration settings.

Drivers should try to use netif_napi_add_config() whenever possible.

User API
========

User interactions with NAPI depend on NAPI instance ID. The instance IDs
are only visible to the user through the ``SO_INCOMING_NAPI_ID`` socket option.

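
A minimal userspace sketch of reading that socket option follows. The helper
name ``get_incoming_napi_id()`` is hypothetical, and the fallback value ``56``
is the asm-generic number, which is an assumption that does not hold on every
architecture. An ID of 0 is assumed to mean that no NAPI ID has been recorded
for the socket yet.

.. code-block:: c

  #include <sys/socket.h>

  #ifndef SO_INCOMING_NAPI_ID
  #define SO_INCOMING_NAPI_ID 56 /* assumption: asm-generic value */
  #endif

  /* Returns 0 and stores the NAPI ID in *napi_id, or -1 with errno set. */
  static int get_incoming_napi_id(int fd, unsigned int *napi_id)
  {
      socklen_t len = sizeof(*napi_id);

      return getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, napi_id, &len);
  }
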
Users can query NAPI IDs for a device or device queue using netlink. This can
be done programmatically in a user application or by using a script included in
the kernel source tree: ``tools/net/ynl/pyynl/cli.py``.

For example, using the script to dump all of the queues for a device (which
will reveal each queue's NAPI ID):

.. code-block:: bash

   $ kernel-source/tools/net/ynl/pyynl/cli.py \
             --spec Documentation/netlink/specs/netdev.yaml \
             --dump queue-get \
             --json='{"ifindex": 2}'

See ``Documentation/netlink/specs/netdev.yaml`` for more details on
available operations and attributes.

Software IRQ coalescing
-----------------------

NAPI does not perform any explicit event coalescing by default.
In most scenarios batching happens due to IRQ coalescing which is done
by the device. There are cases where software coalescing is helpful.

NAPI can be configured to arm a repoll timer instead of unmasking
the hardware interrupts as soon as all packets are processed.
The ``gro_flush_timeout`` sysfs configuration of the netdevice
is reused to control the delay of the timer, while
``napi_defer_hard_irqs`` controls the number of consecutive empty polls
before NAPI gives up and goes back to using hardware IRQs.

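
As a rough illustration, the two sysfs attributes can be written from a
program as well as from a shell. The sketch below assumes the attributes live
under ``/sys/class/net/<ifname>/``; the helper names are hypothetical.

.. code-block:: c

  #include <stdio.h>

  /* Builds /sys/class/net/<ifname>/<attr>; returns -1 if it does not fit. */
  static int netdev_attr_path(char *buf, size_t len, const char *ifname,
                              const char *attr)
  {
      int n = snprintf(buf, len, "/sys/class/net/%s/%s", ifname, attr);

      return (n < 0 || (size_t)n >= len) ? -1 : 0;
  }

  /* Writes an unsigned value to a netdevice sysfs attribute. */
  static int netdev_attr_write(const char *ifname, const char *attr,
                               unsigned long value)
  {
      char path[256];
      FILE *f;
      int ok;

      if (netdev_attr_path(path, sizeof(path), ifname, attr) < 0)
          return -1;

      f = fopen(path, "w");
      if (!f)
          return -1;

      ok = fprintf(f, "%lu\n", value) > 0;
      /* fclose() flushes the write, so a rejected value shows up here. */
      if (fclose(f) != 0)
          ok = 0;

      return ok ? 0 : -1;
  }

A caller might then use ``netdev_attr_write("eth0", "gro_flush_timeout",
20000)`` followed by ``netdev_attr_write("eth0", "napi_defer_hard_irqs", 2)``,
where ``eth0`` and both values are placeholders and the writes typically
require root privileges.
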
The above parameters can also be set on a per-NAPI basis using netlink via
netdev-genl. When used with netlink and configured on a per-NAPI basis, the
parameters mentioned above use hyphens instead of underscores:
``gro-flush-timeout`` and ``napi-defer-hard-irqs``.

Per-NAPI configuration can be done programmatically in a user application
or by using a script included in the kernel source tree:
``tools/net/ynl/pyynl/cli.py``.

For example, using the script:

.. code-block:: bash

  $ kernel-source/tools/net/ynl/pyynl/cli.py \
            --spec Documentation/netlink/specs/netdev.yaml \
            --do napi-set \
            --json='{"id": 345,
                     "defer-hard-irqs": 111,
                     "gro-flush-timeout": 11111}'

Similarly, the parameter ``irq-suspend-timeout`` can be set using netlink
via netdev-genl. There is no global sysfs parameter for this value.

``irq-suspend-timeout`` is used to determine how long an application can
completely suspend IRQs. It is used in combination with ``SO_PREFER_BUSY_POLL``,
which can be set on a per-epoll context basis with the ``EPIOCSPARAMS`` ioctl.

.. _poll:

Busy polling
------------

Busy polling allows a user process to check for incoming packets before
the device interrupt fires. As is the case with any busy polling it trades
off CPU cycles for lower latency (production uses of NAPI busy polling
are not well known).

Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
selected sockets or using the global ``net.core.busy_poll`` and
``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
also exists. Threaded polling of NAPI also has a mode to busy poll for
packets (:ref:`threaded busy polling<threaded_busy_poll>`) using the NAPI
processing kthread.

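
The per-socket variant can be sketched as follows. ``enable_busy_poll()`` is
a hypothetical helper name, the fallback value ``46`` is the asm-generic
number (an assumption for old headers), and depending on the kernel version
the call may need ``CAP_NET_ADMIN``.

.. code-block:: c

  #include <sys/socket.h>

  #ifndef SO_BUSY_POLL
  #define SO_BUSY_POLL 46 /* assumption: asm-generic value */
  #endif

  /* Asks the kernel to busy poll for up to usecs microseconds on fd. */
  static int enable_busy_poll(int fd, int usecs)
  {
      return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
  }
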
epoll-based busy polling
------------------------

It is possible to trigger packet processing directly from calls to
``epoll_wait``. In order to use this feature, a user application must ensure
all file descriptors which are added to an epoll context have the same NAPI ID.

If the application uses a dedicated acceptor thread, the application can obtain
the NAPI ID of the incoming connection using ``SO_INCOMING_NAPI_ID`` and then
distribute that file descriptor to a worker thread. The worker thread would add
the file descriptor to its epoll context. This would ensure each worker thread
has an epoll context with FDs that have the same NAPI ID.

Alternatively, if the application uses ``SO_REUSEPORT``, a bpf or ebpf program can
be inserted to distribute incoming connections to threads such that each thread
is only given incoming connections with the same NAPI ID. Care must be taken to
handle cases where a system may have multiple NICs.

In order to enable busy polling, there are two choices:

1. ``/proc/sys/net/core/busy_poll`` can be set with a time in microseconds to
   busy loop waiting for events. This is a system-wide setting and will cause
   all epoll-based applications to busy poll when they call epoll_wait. This
   may not be desirable as many applications may not have the need to busy
   poll.

2. Applications using recent kernels can issue an ioctl on the epoll context
   file descriptor to set (``EPIOCSPARAMS``) or get (``EPIOCGPARAMS``)
   ``struct epoll_params``, which user programs can define as follows:

.. code-block:: c

  struct epoll_params {
      uint32_t busy_poll_usecs;
      uint16_t busy_poll_budget;
      uint8_t prefer_busy_poll;

      /* pad the struct to a multiple of 64bits */
      uint8_t __pad;
  };

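
Issuing the ioctl with the structure above might look like the sketch below.
``epoll_set_busy_poll()`` is a hypothetical helper. The fallback definitions
are only used when the installed headers lack ``EPIOCSPARAMS``; the ioctl
numbers in that fallback are assumptions to be checked against
``linux/eventpoll.h``.

.. code-block:: c

  #include <stdint.h>
  #include <string.h>
  #include <sys/epoll.h>
  #include <sys/ioctl.h>

  #ifndef EPIOCSPARAMS
  /* Fallback for older headers; mirrors the structure shown above. */
  struct epoll_params {
      uint32_t busy_poll_usecs;
      uint16_t busy_poll_budget;
      uint8_t prefer_busy_poll;
      uint8_t __pad;
  };
  #define EPOLL_IOC_TYPE 0x8A /* assumption */
  #define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
  #define EPIOCGPARAMS _IOR(EPOLL_IOC_TYPE, 0x02, struct epoll_params)
  #endif

  /* Sets busy poll parameters on an epoll context; -1 with errno on error. */
  static int epoll_set_busy_poll(int epfd, uint32_t usecs, uint16_t budget,
                                 int prefer)
  {
      struct epoll_params p;

      memset(&p, 0, sizeof(p)); /* keeps the padding byte zeroed */
      p.busy_poll_usecs = usecs;
      p.busy_poll_budget = budget;
      p.prefer_busy_poll = prefer ? 1 : 0;

      return ioctl(epfd, EPIOCSPARAMS, &p);
  }

On kernels without this ioctl the call is expected to fail, so callers should
be prepared to fall back to the sysctl or to interrupt-driven operation.
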
IRQ mitigation
--------------

While busy polling is supposed to be used by low latency applications,
a similar mechanism can be used for IRQ mitigation.

Very high request-per-second applications (especially routing/forwarding
applications and especially applications using AF_XDP sockets) may not
want to be interrupted until they finish processing a request or a batch
of packets.

Such applications can pledge to the kernel that they will perform a busy
polling operation periodically, and the driver should keep the device IRQs
permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
socket option. To avoid system misbehavior the pledge is revoked
if ``gro_flush_timeout`` passes without any busy poll call. For epoll-based
busy polling applications, the ``prefer_busy_poll`` field of ``struct
epoll_params`` can be set to 1 and the ``EPIOCSPARAMS`` ioctl can be issued to
enable this mode. See the above section for more details.

The NAPI budget for busy polling is lower than the default (which makes
sense given the low latency intention of normal busy polling). This is
not the case with IRQ mitigation, however, so the budget can be adjusted
with the ``SO_BUSY_POLL_BUDGET`` socket option. For epoll-based busy polling
applications, the ``busy_poll_budget`` field can be adjusted to the desired value
in ``struct epoll_params`` and set on a specific epoll context using the ``EPIOCSPARAMS``
ioctl. See the above section for more details.

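
For plain sockets, the pledge and the budget could be set together as in the
following sketch. ``pledge_busy_poll()`` is a hypothetical helper, the
fallback option numbers are the asm-generic values (an assumption for old
headers), and either call may be refused without ``CAP_NET_ADMIN``.

.. code-block:: c

  #include <sys/socket.h>

  #ifndef SO_PREFER_BUSY_POLL
  #define SO_PREFER_BUSY_POLL 69 /* assumption: asm-generic value */
  #endif
  #ifndef SO_BUSY_POLL_BUDGET
  #define SO_BUSY_POLL_BUDGET 70 /* assumption: asm-generic value */
  #endif

  /* Enables the busy poll pledge on fd and sets the per-poll budget. */
  static int pledge_busy_poll(int fd, int budget)
  {
      int one = 1;

      if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
                     &one, sizeof(one)) < 0)
          return -1;

      return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
                        &budget, sizeof(budget));
  }
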
It is important to note that choosing a large value for ``gro_flush_timeout``
will defer IRQs to allow for better batch processing, but will induce latency
when the system is not fully loaded. Choosing a small value for
``gro_flush_timeout`` can cause interference of the user application which is
attempting to busy poll by device IRQs and softirq processing. This value
should be chosen carefully with these tradeoffs in mind. epoll-based busy
polling applications may be able to mitigate how much user processing happens
by choosing an appropriate value for ``maxevents``.

Users may want to consider an alternate approach, IRQ suspension, to help deal
with these tradeoffs.

IRQ suspension
--------------

IRQ suspension is a mechanism wherein device IRQs are masked while epoll
triggers NAPI packet processing.

While application calls to epoll_wait successfully retrieve events, the kernel will
defer the IRQ suspension timer. If the kernel does not retrieve any events
while busy polling (for example, because network traffic levels subsided), IRQ
suspension is disabled and the IRQ mitigation strategies described above are
engaged.

This allows users to balance CPU consumption with network processing
efficiency.

To use this mechanism:

  1. The per-NAPI config parameter ``irq-suspend-timeout`` should be set to the
     maximum time (in nanoseconds) the application can have its IRQs
     suspended. This is done using netlink, as described above. This timeout
     serves as a safety mechanism to restart IRQ driver interrupt processing if
     the application has stalled. This value should be chosen so that it covers
     the amount of time the user application needs to process data from its
     call to epoll_wait, noting that applications can control how much data
     they retrieve by setting ``maxevents`` when calling epoll_wait.

  2. The sysfs parameter or per-NAPI config parameters ``gro_flush_timeout``
     and ``napi_defer_hard_irqs`` can be set to low values. They will be used
     to defer IRQs after busy poll has found no data.

  3. The ``prefer_busy_poll`` flag must be set to true. This can be done using
     the ``EPIOCSPARAMS`` ioctl as described above.

  4. The application uses epoll as described above to trigger NAPI packet
     processing.

As mentioned above, as long as subsequent calls to epoll_wait return events to
userland, the ``irq-suspend-timeout`` is deferred and IRQs are disabled. This
allows the application to process data without interference.

Once a call to epoll_wait results in no events being found, IRQ suspension is
automatically disabled and the ``gro_flush_timeout`` and
``napi_defer_hard_irqs`` mitigation mechanisms take over.

It is expected that ``irq-suspend-timeout`` will be set to a value much larger
than ``gro_flush_timeout`` as ``irq-suspend-timeout`` should suspend IRQs for
the duration of one userland processing cycle.

While it is not strictly necessary to use ``napi_defer_hard_irqs`` and
``gro_flush_timeout`` to use IRQ suspension, their use is strongly
recommended.

IRQ suspension causes the system to alternate between polling mode and
irq-driven packet delivery. During busy periods, ``irq-suspend-timeout``
overrides ``gro_flush_timeout`` and keeps the system busy polling, but when
epoll finds no events, the setting of ``gro_flush_timeout`` and
``napi_defer_hard_irqs`` determine the next step.

There are essentially three possible loops for network processing and
packet delivery:

1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping

Loop 2 can take control from Loop 1, if ``gro_flush_timeout`` and
``napi_defer_hard_irqs`` are set.

If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are set, Loops 2
and 3 "wrestle" with each other for control.

During busy periods, ``irq-suspend-timeout`` is used as timer in Loop 2,
which essentially tilts network processing in favour of Loop 3.

If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are not set, Loop 3
cannot take control from Loop 1.

Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
the recommended usage, because otherwise setting ``irq-suspend-timeout``
might not have any discernible effect.

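
The interplay of the three settings can be summarised in a deliberately
simplified model. It is not kernel code: ``struct napi_cfg_model`` and
``model_next_timer_ns()`` are hypothetical names, and the function only
restates the prose above, namely which timeout would drive Loop 2 after a
busy poll.

.. code-block:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical snapshot of the per-NAPI settings discussed above. */
  struct napi_cfg_model {
      uint64_t irq_suspend_timeout_ns;
      uint64_t gro_flush_timeout_ns;
      uint32_t defer_hard_irqs;
  };

  /*
   * Returns the timer delay (in ns) the model would arm after a busy poll,
   * or 0 when it would go back to hardware IRQs (Loop 1).
   */
  static uint64_t model_next_timer_ns(const struct napi_cfg_model *c,
                                      bool epoll_found_events)
  {
      /* Busy period: the suspend timeout overrides gro_flush_timeout. */
      if (epoll_found_events && c->irq_suspend_timeout_ns)
          return c->irq_suspend_timeout_ns;

      /* No events: fall back to deferred IRQ processing if configured. */
      if (c->defer_hard_irqs && c->gro_flush_timeout_ns)
          return c->gro_flush_timeout_ns;

      return 0;
  }
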
.. _threaded_busy_poll:

Threaded NAPI busy polling
--------------------------

Threaded NAPI busy polling extends threaded NAPI and adds support to do
continuous busy polling of the NAPI. This can be useful for forwarding or
AF_XDP applications.

Threaded NAPI busy polling can be enabled on a per-NIC-queue basis using
netlink.

For example, using the following command:

.. code-block:: bash

  $ ynl --family netdev --do napi-set \
            --json='{"id": 66, "threaded": "busy-poll"}'

The kernel will create a kthread that busy polls on this NAPI.

The user may elect to set the CPU affinity of this kthread to an unused CPU
core to improve how often the NAPI is polled at the expense of wasted CPU
cycles. Note that this will keep the CPU core busy with 100% usage.

Once threaded busy polling is enabled for a NAPI, the PID of the kthread can
be retrieved using netlink so the affinity of the kthread can be set up.

For example, the following command can be used to fetch the PID:

.. code-block:: bash

  $ ynl --family netdev --do napi-get --json='{"id": 66}'

This will output something like the following; ``258`` is the PID of the
kthread that is polling this NAPI.

.. code-block:: bash

  {'defer-hard-irqs': 0,
   'gro-flush-timeout': 0,
   'id': 66,
   'ifindex': 2,
   'irq-suspend-timeout': 0,
   'pid': 258,
   'threaded': 'busy-poll'}

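
Once the PID is known, the affinity can be set with ``taskset`` or from a
program. The sketch below uses ``sched_setaffinity()``; the helper name
``pin_thread_to_cpu()`` is hypothetical, and changing the affinity of a
kernel thread normally requires root privileges.

.. code-block:: c

  #define _GNU_SOURCE
  #include <sched.h>
  #include <sys/types.h>

  /* Restricts the thread identified by pid to a single CPU. */
  static int pin_thread_to_cpu(pid_t pid, int cpu)
  {
      cpu_set_t set;

      CPU_ZERO(&set);
      CPU_SET(cpu, &set);

      return sched_setaffinity(pid, sizeof(set), &set);
  }

With the output above, ``pin_thread_to_cpu(258, 3)`` would pin the polling
kthread to CPU 3, where the CPU number is only a placeholder.
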
.. _threaded:

Threaded NAPI
-------------

Threaded NAPI is an operating mode that uses dedicated kernel
threads rather than software IRQ context for NAPI processing.
Each threaded NAPI instance will spawn a separate thread
(called ``napi/${ifc-name}-${napi-id}``).

It is recommended to pin each kernel thread to a single CPU, the same
CPU as the CPU which services the interrupt. Note that the mapping
between IRQs and NAPI instances may not be trivial (and is driver
dependent). The NAPI instance IDs will be assigned in the opposite
order to the process IDs of the kernel threads.

Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
netdev's sysfs directory. It can also be enabled for a specific NAPI using
the netlink interface.

For example, using the script:

.. code-block:: bash

  $ ynl --family netdev --do napi-set --json='{"id": 66, "threaded": 1}'

.. rubric:: Footnotes

.. [#] NAPI was originally referred to as New API in Linux 2.4.