- .. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
- .. _napi:
- ====
- NAPI
- ====
- NAPI is the event handling mechanism used by the Linux networking stack.
- The name NAPI no longer stands for anything in particular [#]_.
- In basic operation the device notifies the host about new events
- via an interrupt.
- The host then schedules a NAPI instance to process the events.
- The device may also be polled for events via NAPI without receiving
- interrupts first (:ref:`busy polling<poll>`).
- NAPI processing usually happens in the software interrupt context,
- but there is an option to use :ref:`separate kernel threads<threaded>`
- for NAPI processing.
- All in all, NAPI abstracts away from the drivers the context and configuration
- of event (packet Rx and Tx) processing.
- Driver API
- ==========
- The two most important elements of NAPI are the struct napi_struct
- and the associated poll method. struct napi_struct holds the state
- of the NAPI instance while the method is the driver-specific event
- handler. The method will typically free Tx packets that have been
- transmitted and process newly received packets.
- .. _drv_ctrl:
- Control API
- -----------
- netif_napi_add() and netif_napi_del() add/remove a NAPI instance
- from the system. The instances are attached to the netdevice passed
- as argument (and will be deleted automatically when the netdevice is
- unregistered). Instances are added in a disabled state.
- napi_enable() and napi_disable() manage the disabled state.
- A disabled NAPI can't be scheduled and its poll method is guaranteed
- to not be invoked. napi_disable() waits for ownership of the NAPI
- instance to be released.
- The control APIs are not idempotent. Control API calls are safe against
- concurrent use of datapath APIs but an incorrect sequence of control API
- calls may result in crashes, deadlocks, or race conditions. For example,
- calling napi_disable() multiple times in a row will deadlock.
- Datapath API
- ------------
- napi_schedule() is the basic method of scheduling a NAPI poll.
- Drivers should call this function in their interrupt handler
- (see :ref:`drv_sched` for more info). A successful call to napi_schedule()
- will take ownership of the NAPI instance.
- Later, after NAPI is scheduled, the driver's poll method will be
- called to process the events/packets. The method takes a ``budget``
- argument - drivers can process completions for any number of Tx
- packets but should only process up to ``budget`` number of
- Rx packets. Rx processing is usually much more expensive.
- In other words, for Rx processing the ``budget`` argument limits how many
- packets the driver can process in a single poll. Rx-specific APIs like page
- pool or XDP cannot be used at all when ``budget`` is 0.
- skb Tx processing should happen regardless of the ``budget``, but if
- the argument is 0 the driver cannot call any XDP (or page pool) APIs.
- .. warning::
- The ``budget`` argument may be 0 if the core tries to only process
- skb Tx completions and no Rx or XDP packets.
- The poll method returns the amount of work done. If the driver still
- has outstanding work to do (e.g. ``budget`` was exhausted)
- the poll method should return exactly ``budget``. In that case,
- the NAPI instance will be serviced/polled again (without the
- need to be scheduled).
- If event processing has been completed (all outstanding packets
- processed) the poll method should call napi_complete_done()
- before returning. napi_complete_done() releases the ownership
- of the instance.
- .. warning::
- The case of finishing all events and using exactly ``budget``
- must be handled carefully. There is no way to report this
- (rare) condition to the stack, so the driver must either
- not call napi_complete_done() and wait to be called again,
- or return ``budget - 1``.
- If the ``budget`` is 0, napi_complete_done() should never be called.
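The return-value rules above can be sketched as a small userspace model. This is not kernel code: ``struct model_queue`` and ``model_poll()`` are hypothetical names, and the ``completed`` flag merely stands in for a call to napi_complete_done(). The sketch takes the first of the two options from the warning, that is, it returns ``budget`` without completing when the work happens to use the budget exactly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; none of these names are kernel APIs. */
struct model_queue {
	int rx_pending;		/* received packets waiting to be processed */
	int tx_pending;		/* Tx completions waiting to be freed */
	bool completed;		/* stands in for a napi_complete_done() call */
};

/* Returns the amount of Rx work done, following the rules described above. */
static int model_poll(struct model_queue *q, int budget)
{
	int work_done = 0;

	assert(q != NULL && budget >= 0);

	/* Tx completions are processed regardless of budget, even when it is 0. */
	q->tx_pending = 0;

	/* Rx processing is limited to budget packets (none when budget is 0). */
	while (work_done < budget && q->rx_pending > 0) {
		q->rx_pending--;
		work_done++;
	}

	/*
	 * Budget exhausted (this also covers budget == 0): report exactly
	 * budget and do not complete, so the instance is polled again.
	 */
	if (work_done == budget)
		return budget;

	/* All events handled with budget to spare: complete, report the work. */
	q->completed = true;
	return work_done;
}
```

With 10 pending Rx packets, a call with ``budget`` 0 only cleans Tx, a call with ``budget`` 4 returns 4 and leaves 6 packets, and the instance is only marked completed by a later call that finds less work than its budget.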
- Call sequence
- -------------
- Drivers should not make assumptions about the exact sequencing
- of calls. The poll method may be called without the driver scheduling
- the instance (unless the instance is disabled). Similarly,
- it's not guaranteed that the poll method will be called, even
- if napi_schedule() succeeded (e.g. if the instance gets disabled).
- As mentioned in the :ref:`drv_ctrl` section - napi_disable() and subsequent
- calls to the poll method only wait for the ownership of the instance
- to be released, not for the poll method to exit. This means that
- drivers should avoid accessing any data structures after calling
- napi_complete_done().
- .. _drv_sched:
- Scheduling and IRQ masking
- --------------------------
- Drivers should keep the interrupts masked after scheduling
- the NAPI instance - until NAPI polling finishes any further
- interrupts are unnecessary.
- Drivers which have to mask the interrupts explicitly (as opposed
- to IRQ being auto-masked by the device) should use the napi_schedule_prep()
- and __napi_schedule() calls:
- .. code-block:: c
- if (napi_schedule_prep(&v->napi)) {
- mydrv_mask_rxtx_irq(v->idx);
- /* schedule after masking to avoid races */
- __napi_schedule(&v->napi);
- }
- IRQ should only be unmasked after a successful call to napi_complete_done():
- .. code-block:: c
- if (budget && napi_complete_done(&v->napi, work_done)) {
- mydrv_unmask_rxtx_irq(v->idx);
- return min(work_done, budget - 1);
- }
- napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
- of guarantees given by being invoked in IRQ context (no need to
- mask interrupts). napi_schedule_irqoff() will fall back to napi_schedule() if
- IRQs are threaded (such as if ``PREEMPT_RT`` is enabled).
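The ownership and masking order described in this section can be illustrated with a userspace model built on a C11 ``atomic_flag``. All names below (``struct model_napi``, ``model_schedule_prep()``, ``model_irq_handler()``, ``model_poll_done()``) are hypothetical, and the real napi_schedule_prep() and napi_complete_done() do considerably more; the sketch only shows that a single caller wins ownership, that the IRQ is masked by that caller, and that it is unmasked only after ownership is released.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; these names are not kernel APIs. */
struct model_napi {
	atomic_flag owned;	/* set while the instance is scheduled/owned */
	bool irq_masked;	/* state of the pretend device interrupt */
};

/* Models napi_schedule_prep(): true only for the caller that takes ownership. */
static bool model_schedule_prep(struct model_napi *n)
{
	assert(n != NULL);
	return !atomic_flag_test_and_set(&n->owned);
}

/* Models the interrupt handler: take ownership, mask, then schedule. */
static bool model_irq_handler(struct model_napi *n)
{
	if (!model_schedule_prep(n))
		return false;		/* already scheduled, nothing to do */
	n->irq_masked = true;		/* mask before the poll can run */
	return true;			/* the poll would be scheduled here */
}

/* Models the end of a poll: release ownership, then unmask the IRQ. */
static void model_poll_done(struct model_napi *n)
{
	assert(n != NULL);
	atomic_flag_clear(&n->owned);
	n->irq_masked = false;
}
```

An instance would be initialized with ``{ .owned = ATOMIC_FLAG_INIT, .irq_masked = false }``. A second ``model_irq_handler()`` call before ``model_poll_done()`` returns false, which mirrors a failed napi_schedule_prep().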
- Instance to queue mapping
- -------------------------
- Modern devices have multiple NAPI instances (struct napi_struct) per
- interface. There is no strong requirement on how the instances are
- mapped to queues and interrupts. NAPI is primarily a polling/processing
- abstraction without specific user-facing semantics. That said, most networking
- devices end up using NAPI in fairly similar ways.
- NAPI instances most often correspond 1:1:1 to interrupts and queue pairs
- (queue pair is a set of a single Rx and single Tx queue).
- In less common cases a NAPI instance may be used for multiple queues
- or Rx and Tx queues can be serviced by separate NAPI instances on a single
- core. Regardless of the queue assignment, however, there is usually still
- a 1:1 mapping between NAPI instances and interrupts.
- It's worth noting that the ethtool API uses a "channel" terminology where
- each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear
- what constitutes a channel; the recommended interpretation is to understand
- a channel as an IRQ/NAPI which services queues of a given type. For example,
- a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected
- to utilize 3 interrupts, 2 Rx and 2 Tx queues.
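The arithmetic behind that example can be written down as a short sketch, assuming the recommended interpretation in which every channel is one IRQ/NAPI and a ``combined`` channel services one Rx and one Tx queue. The types and the function name are hypothetical and are not part of the ethtool API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not part of the ethtool API. */
struct channel_counts {
	unsigned int rx;
	unsigned int tx;
	unsigned int combined;
};

struct channel_usage {
	unsigned int irqs;
	unsigned int rx_queues;
	unsigned int tx_queues;
};

/* Applies the "one channel == one IRQ/NAPI" reading of the channel counts. */
static void interpret_channels(const struct channel_counts *c,
			       struct channel_usage *u)
{
	assert(c != NULL && u != NULL);

	/* Every channel, whatever its type, is expected to use one interrupt. */
	u->irqs = c->rx + c->tx + c->combined;

	/* A combined channel contributes one queue in each direction. */
	u->rx_queues = c->rx + c->combined;
	u->tx_queues = c->tx + c->combined;
}
```

For 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel this yields 3 interrupts, 2 Rx queues and 2 Tx queues, matching the example in the text.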
- Persistent NAPI config
- ----------------------
- Drivers often allocate and free NAPI instances dynamically. This leads to loss
- of NAPI-related user configuration each time NAPI instances are reallocated.
- The netif_napi_add_config() API prevents this loss of configuration by
- associating each NAPI instance with a persistent NAPI configuration based on
- a driver defined index value, like a queue number.
- Using this API allows for persistent NAPI IDs (among other settings), which can
- be beneficial to userspace programs using ``SO_INCOMING_NAPI_ID``. See the
- sections below for other NAPI configuration settings.
- Drivers should try to use netif_napi_add_config() whenever possible.
- User API
- ========
- User interactions with NAPI depend on NAPI instance ID. The instance IDs
- are only visible to the user through the ``SO_INCOMING_NAPI_ID`` socket option.
- Users can query NAPI IDs for a device or device queue using netlink. This can
- be done programmatically in a user application or by using a script included in
- the kernel source tree: ``tools/net/ynl/pyynl/cli.py``.
- For example, using the script to dump all of the queues for a device (which
- will reveal each queue's NAPI ID):
- .. code-block:: bash
- $ kernel-source/tools/net/ynl/pyynl/cli.py \
- --spec Documentation/netlink/specs/netdev.yaml \
- --dump queue-get \
- --json='{"ifindex": 2}'
- See ``Documentation/netlink/specs/netdev.yaml`` for more details on
- available operations and attributes.
- Software IRQ coalescing
- -----------------------
- NAPI does not perform any explicit event coalescing by default.
- In most scenarios batching happens due to IRQ coalescing which is done
- by the device. There are cases where software coalescing is helpful.
- NAPI can be configured to arm a repoll timer instead of unmasking
- the hardware interrupts as soon as all packets are processed.
- The ``gro_flush_timeout`` sysfs configuration of the netdevice
- is reused to control the delay of the timer, while
- ``napi_defer_hard_irqs`` controls the number of consecutive empty polls
- before NAPI gives up and goes back to using hardware IRQs.
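As a rough illustration, the two netdevice-wide knobs can be written from a C program. The sketch assumes the attributes live under ``/sys/class/net/<ifname>/``; the helper names are hypothetical, and writing normally requires root privileges.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Builds "/sys/class/net/<ifname>/<attr>" in buf.
 * Returns the path length, or -1 if it does not fit.
 */
static int netdev_sysfs_path(char *buf, size_t len,
			     const char *ifname, const char *attr)
{
	int n;

	assert(buf != NULL && ifname != NULL && attr != NULL);
	n = snprintf(buf, len, "/sys/class/net/%s/%s", ifname, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Writes a decimal value to a netdevice sysfs attribute; 0 on success. */
static int netdev_sysfs_write(const char *ifname, const char *attr,
			      unsigned long value)
{
	char path[256];
	FILE *f;
	int ok;

	if (netdev_sysfs_path(path, sizeof(path), ifname, attr) < 0)
		return -1;
	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	ok = fprintf(f, "%lu\n", value) > 0;
	/* fclose() flushes the buffer, so write errors surface here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller might use ``netdev_sysfs_write("eth0", "gro_flush_timeout", 20000)`` followed by ``netdev_sysfs_write("eth0", "napi_defer_hard_irqs", 2)``; the interface name and both values are purely illustrative.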
- The above parameters can also be set on a per-NAPI basis using netlink via
- netdev-genl. When used with netlink and configured on a per-NAPI basis, the
- parameters mentioned above use hyphens instead of underscores, and
- ``napi_defer_hard_irqs`` drops its ``napi_`` prefix: ``gro-flush-timeout``
- and ``defer-hard-irqs``.
- Per-NAPI configuration can be done programmatically in a user application
- or by using a script included in the kernel source tree:
- ``tools/net/ynl/pyynl/cli.py``.
- For example, using the script:
- .. code-block:: bash
- $ kernel-source/tools/net/ynl/pyynl/cli.py \
- --spec Documentation/netlink/specs/netdev.yaml \
- --do napi-set \
- --json='{"id": 345,
- "defer-hard-irqs": 111,
- "gro-flush-timeout": 11111}'
- Similarly, the parameter ``irq-suspend-timeout`` can be set using netlink
- via netdev-genl. There is no global sysfs parameter for this value.
- ``irq-suspend-timeout`` is used to determine how long an application can
- completely suspend IRQs. It is used in combination with ``SO_PREFER_BUSY_POLL``,
- which can be set on a per-epoll context basis with the ``EPIOCSPARAMS`` ioctl.
- .. _poll:
- Busy polling
- ------------
- Busy polling allows a user process to check for incoming packets before
- the device interrupt fires. As is the case with any busy polling it trades
- off CPU cycles for lower latency (production uses of NAPI busy polling
- are not well known).
- Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
- selected sockets or using the global ``net.core.busy_poll`` and
- ``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
- also exists. Threaded polling of NAPI also has a mode to busy poll for
- packets (:ref:`threaded busy polling<threaded_busy_poll>`) using the NAPI
- processing kthread.
- epoll-based busy polling
- ------------------------
- It is possible to trigger packet processing directly from calls to
- ``epoll_wait``. In order to use this feature, a user application must ensure
- all file descriptors which are added to an epoll context have the same NAPI ID.
- If the application uses a dedicated acceptor thread, the application can obtain
- the NAPI ID of the incoming connection using ``SO_INCOMING_NAPI_ID`` and then
- distribute that file descriptor to a worker thread. The worker thread would add
- the file descriptor to its epoll context. This would ensure each worker thread
- has an epoll context with FDs that have the same NAPI ID.
- Alternatively, if the application uses ``SO_REUSEPORT``, a bpf or ebpf program can
- be inserted to distribute incoming connections to threads such that each thread
- is only given incoming connections with the same NAPI ID. Care must be taken to
- handle cases where a system may have multiple NICs.
- In order to enable busy polling, there are two choices:
- 1. ``/proc/sys/net/core/busy_poll`` can be set with a time in microseconds to busy
- loop waiting for events. This is a system-wide setting and will cause all
- epoll-based applications to busy poll when they call epoll_wait. This may
- not be desirable as many applications may not have the need to busy poll.
- 2. Applications using recent kernels can issue an ioctl on the epoll context
- file descriptor to set (``EPIOCSPARAMS``) or get (``EPIOCGPARAMS``) ``struct
- epoll_params``, which user programs can define as follows:
- .. code-block:: c
- struct epoll_params {
- uint32_t busy_poll_usecs;
- uint16_t busy_poll_budget;
- uint8_t prefer_busy_poll;
- /* pad the struct to a multiple of 64bits */
- uint8_t __pad;
- };
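Setting the parameters on an epoll context could then look like the sketch below. The helper name is hypothetical. The fallback ioctl definitions are an assumption meant to mirror the kernel's ``linux/eventpoll.h`` uapi header and should be checked against the installed headers; they are only used when the C library does not already provide ``EPIOCSPARAMS``.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

/* Newer libc headers already provide these; define them only when missing. */
#ifndef EPIOCSPARAMS
struct epoll_params {
	uint32_t busy_poll_usecs;
	uint16_t busy_poll_budget;
	uint8_t prefer_busy_poll;
	/* pad the struct to a multiple of 64bits */
	uint8_t __pad;
};
/* Assumed to match the kernel uapi header; verify before relying on them. */
#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#define EPIOCGPARAMS _IOR(EPOLL_IOC_TYPE, 0x02, struct epoll_params)
#endif

/*
 * Applies busy-poll settings to an epoll context.
 * Returns 0 on success, -1 with errno set by ioctl() on failure.
 */
static int epoll_set_busy_poll(int epfd, uint32_t usecs, uint16_t budget,
			       int prefer)
{
	/* Fields not named here, including __pad, are zero-initialized. */
	struct epoll_params params = {
		.busy_poll_usecs = usecs,
		.busy_poll_budget = budget,
		.prefer_busy_poll = prefer ? 1 : 0,
	};

	assert(prefer == 0 || prefer == 1);
	return ioctl(epfd, EPIOCSPARAMS, &params);
}
```

For example, ``epoll_set_busy_poll(epfd, 200, 64, 1)`` requests 200 microseconds of busy polling with a budget of 64 and preferred busy polling; the numbers are illustrative. Kernels without this ioctl will return an error.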
- IRQ mitigation
- ---------------
- While busy polling is supposed to be used by low latency applications,
- a similar mechanism can be used for IRQ mitigation.
- Very high request-per-second applications (especially routing/forwarding
- applications and especially applications using AF_XDP sockets) may not
- want to be interrupted until they finish processing a request or a batch
- of packets.
- Such applications can pledge to the kernel that they will perform a busy
- polling operation periodically, and the driver should keep the device IRQs
- permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
- socket option. To avoid system misbehavior the pledge is revoked
- if ``gro_flush_timeout`` passes without any busy poll call. For epoll-based
- busy polling applications, the ``prefer_busy_poll`` field of ``struct
- epoll_params`` can be set to 1 and the ``EPIOCSPARAMS`` ioctl can be issued to
- enable this mode. See the above section for more details.
- The NAPI budget for busy polling is lower than the default (which makes
- sense given the low latency intention of normal busy polling). This is
- not the case with IRQ mitigation, however, so the budget can be adjusted
- with the ``SO_BUSY_POLL_BUDGET`` socket option. For epoll-based busy polling
- applications, the ``busy_poll_budget`` field can be adjusted to the desired value
- in ``struct epoll_params`` and set on a specific epoll context using the ``EPIOCSPARAMS``
- ioctl. See the above section for more details.
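For applications that work on sockets directly, the two options can be set as in the sketch below. The helper name is hypothetical. The fallback values 69 and 70 are the asm-generic numbers and are an assumption; some architectures use different values, so the definitions from the system headers are preferred whenever they exist. Raising the budget may also require additional privileges.

```c
#include <assert.h>
#include <sys/socket.h>

/*
 * Fallbacks for older headers. 69 and 70 are the asm-generic values and
 * are an assumption here; some architectures use different numbers.
 */
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70
#endif

/*
 * Pledges busy polling on a socket and sets its busy-poll budget.
 * Returns 0 on success, -1 with errno set by setsockopt() on failure.
 */
static int socket_prefer_busy_poll(int fd, int budget)
{
	int one = 1;

	assert(budget > 0);
	if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
		       &one, sizeof(one)) != 0)
		return -1;
	return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
			  &budget, sizeof(budget));
}
```

A call such as ``socket_prefer_busy_poll(fd, 64)`` only has the intended effect when busy polling itself is enabled for the socket and ``gro_flush_timeout`` is configured as described above.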
- It is important to note that choosing a large value for ``gro_flush_timeout``
- will defer IRQs to allow for better batch processing, but will induce latency
- when the system is not fully loaded. Choosing a small value for
- ``gro_flush_timeout`` can cause interference of the user application which is
- attempting to busy poll by device IRQs and softirq processing. This value
- should be chosen carefully with these tradeoffs in mind. epoll-based busy
- polling applications may be able to mitigate how much user processing happens
- by choosing an appropriate value for ``maxevents``.
- Users may want to consider an alternate approach, IRQ suspension, to help deal
- with these tradeoffs.
- IRQ suspension
- --------------
- IRQ suspension is a mechanism wherein device IRQs are masked while epoll
- triggers NAPI packet processing.
- While application calls to epoll_wait successfully retrieve events, the kernel will
- defer the IRQ suspension timer. If the kernel does not retrieve any events
- while busy polling (for example, because network traffic levels subsided), IRQ
- suspension is disabled and the IRQ mitigation strategies described above are
- engaged.
- This allows users to balance CPU consumption with network processing
- efficiency.
- To use this mechanism:
- 1. The per-NAPI config parameter ``irq-suspend-timeout`` should be set to the
- maximum time (in nanoseconds) the application can have its IRQs
- suspended. This is done using netlink, as described above. This timeout
- serves as a safety mechanism to restart driver IRQ processing if
- the application has stalled. This value should be chosen so that it covers
- the amount of time the user application needs to process data from its
- call to epoll_wait, noting that applications can control how much data
- they retrieve by setting ``maxevents`` when calling epoll_wait.
- 2. The sysfs parameter or per-NAPI config parameters ``gro_flush_timeout``
- and ``napi_defer_hard_irqs`` can be set to low values. They will be used
- to defer IRQs after busy poll has found no data.
- 3. The ``prefer_busy_poll`` flag must be set to true. This can be done using
- the ``EPIOCSPARAMS`` ioctl as described above.
- 4. The application uses epoll as described above to trigger NAPI packet
- processing.
- As mentioned above, as long as subsequent calls to epoll_wait return events to
- userland, the ``irq-suspend-timeout`` is deferred and IRQs are disabled. This
- allows the application to process data without interference.
- Once a call to epoll_wait results in no events being found, IRQ suspension is
- automatically disabled and the ``gro_flush_timeout`` and
- ``napi_defer_hard_irqs`` mitigation mechanisms take over.
- It is expected that ``irq-suspend-timeout`` will be set to a value much larger
- than ``gro_flush_timeout`` as ``irq-suspend-timeout`` should suspend IRQs for
- the duration of one userland processing cycle.
- While it is not strictly necessary to use ``napi_defer_hard_irqs`` and
- ``gro_flush_timeout`` to use IRQ suspension, their use is strongly
- recommended.
- IRQ suspension causes the system to alternate between polling mode and
- IRQ-driven packet delivery. During busy periods, ``irq-suspend-timeout``
- overrides ``gro_flush_timeout`` and keeps the system busy polling, but when
- epoll finds no events, the settings of ``gro_flush_timeout`` and
- ``napi_defer_hard_irqs`` determine the next step.
- There are essentially three possible loops for network processing and
- packet delivery:
- 1) hardirq -> softirq -> napi poll; basic interrupt delivery
- 2) timer -> softirq -> napi poll; deferred irq processing
- 3) epoll -> busy-poll -> napi poll; busy looping
- Loop 2 can take control from Loop 1, if ``gro_flush_timeout`` and
- ``napi_defer_hard_irqs`` are set.
- If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are set, Loops 2
- and 3 "wrestle" with each other for control.
- During busy periods, ``irq-suspend-timeout`` is used as the timer in Loop 2,
- which essentially tilts network processing in favour of Loop 3.
- If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are not set, Loop 3
- cannot take control from Loop 1.
- Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
- the recommended usage, because otherwise setting ``irq-suspend-timeout``
- might not have any discernible effect.
- .. _threaded_busy_poll:
- Threaded NAPI busy polling
- --------------------------
- Threaded NAPI busy polling extends threaded NAPI and adds support to do
- continuous busy polling of the NAPI. This can be useful for forwarding or
- AF_XDP applications.
- Threaded NAPI busy polling can be enabled on a per-NIC-queue basis using netlink.
- For example, using the following script:
- .. code-block:: bash
- $ ynl --family netdev --do napi-set \
- --json='{"id": 66, "threaded": "busy-poll"}'
- The kernel will create a kthread that busy polls on this NAPI.
- The user may elect to set the CPU affinity of this kthread to an unused CPU
- core to improve how often the NAPI is polled at the expense of wasted CPU
- cycles. Note that this will keep the CPU core busy with 100% usage.
- Once threaded busy polling is enabled for a NAPI, the PID of the kthread can be
- retrieved using netlink so the affinity of the kthread can be set up.
- For example, the following script can be used to fetch the PID:
- .. code-block:: bash
- $ ynl --family netdev --do napi-get --json='{"id": 66}'
- This will output something like the following, where ``258`` is the PID of the
- kthread that is polling this NAPI.
- .. code-block:: bash
- {'defer-hard-irqs': 0,
- 'gro-flush-timeout': 0,
- 'id': 66,
- 'ifindex': 2,
- 'irq-suspend-timeout': 0,
- 'pid': 258,
- 'threaded': 'busy-poll'}
- .. _threaded:
- Threaded NAPI
- -------------
- Threaded NAPI is an operating mode that uses dedicated kernel
- threads rather than software IRQ context for NAPI processing.
- Each threaded NAPI instance will spawn a separate thread
- (called ``napi/${ifc-name}-${napi-id}``).
- It is recommended to pin each kernel thread to a single CPU, the same
- CPU as the CPU which services the interrupt. Note that the mapping
- between IRQs and NAPI instances may not be trivial (and is driver
- dependent). The NAPI instance IDs will be assigned in the opposite
- order to the process IDs of the kernel threads.
- Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
- netdev's sysfs directory. It can also be enabled for a specific NAPI using
- the netlink interface.
- For example, using the script:
- .. code-block:: bash
- $ ynl --family netdev --do napi-set --json='{"id": 66, "threaded": 1}'
- .. rubric:: Footnotes
- .. [#] NAPI was originally referred to as New API in Linux 2.4.