.. SPDX-License-Identifier: GPL-2.0
.. _representors:

=============================
Network Function Representors
=============================

This document describes the semantics and usage of representor netdevices, as
used to control internal switching on SmartNICs. For the closely related port
representors on physical (multi-port) switches, see
:ref:`Documentation/networking/switchdev.rst <switchdev>`.

Motivation
----------

Since the mid-2010s, network cards have started offering more complex
virtualisation capabilities than the legacy SR-IOV approach (with its simple
MAC/VLAN-based switching model) can support. This led to a desire to offload
software-defined networks (such as Open vSwitch) to these NICs to specify the
network connectivity of each function. The resulting designs are variously
called SmartNICs or DPUs.

Network function representors bring the standard Linux networking stack to
virtual switches and IOV devices. Just as each physical port of a
Linux-controlled switch has a separate netdev, so does each virtual port of a
virtual switch.
When the system boots, and before any offload is configured, all packets from
the virtual functions appear in the networking stack of the PF via the
representors. The PF can thus always communicate freely with the virtual
functions.
The PF can configure standard Linux forwarding between representors, the uplink
or any other netdev (routing, bridging, TC classifiers).

Thus, a representor is both a control plane object (representing the function in
administrative commands) and a data plane object (one end of a virtual pipe).
As a virtual link endpoint, the representor can be configured like any other
netdevice; in some cases (e.g. link state) the representee will follow the
representor's configuration, while in others there are separate APIs to
configure the representee.

Definitions
-----------

This document uses the term "switchdev function" to refer to the PCIe function
which has administrative control over the virtual switch on the device.
Typically, this will be a PF, but conceivably a NIC could be configured to grant
these administrative privileges instead to a VF or SF (subfunction).
Depending on NIC design, a multi-port NIC might have a single switchdev function
for the whole device or might have a separate virtual switch, and hence
switchdev function, for each physical network port.
If the NIC supports nested switching, there might be separate switchdev
functions for each nested switch, in which case each switchdev function should
only create representors for the ports on the (sub-)switch it directly
administers.

A "representee" is the object that a representor represents. So for example in
the case of a VF representor, the representee is the corresponding VF.

What does a representor do?
---------------------------

A representor has three main roles.

1. It is used to configure the network connection the representee sees, e.g.
   link up/down, MTU, etc. For instance, bringing the representor
   administratively UP should cause the representee to see a link up / carrier
   on event.
2. It provides the slow path for traffic which does not hit any offloaded
   fast-path rules in the virtual switch. Packets transmitted on the
   representor netdevice should be delivered to the representee; packets
   transmitted by the representee which fail to match any switching rule should
   be received on the representor netdevice. (That is, there is a virtual pipe
   connecting the representor to the representee, similar in concept to a veth
   pair.)

   This allows software switch implementations (such as Open vSwitch or a Linux
   bridge) to forward packets between representees and the rest of the network.
3. It acts as a handle by which switching rules (such as TC filters) can refer
   to the representee, allowing these rules to be offloaded.

The combination of 2) and 3) means that the behaviour (apart from performance)
should be the same whether a TC filter is offloaded or not. E.g. a TC rule
on a VF representor applies in software to packets received on that representor
netdevice, while in hardware offload it would apply to packets transmitted by
the representee VF. Conversely, a mirred egress redirect to a VF representor
corresponds in hardware to delivery directly to the representee VF.
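
As an illustration of the slow-path role (role 2 above), the following sketch
attaches a port netdevice and some representors to a Linux bridge, so that
traffic which matches no offloaded rule is switched in software. The function
name ``bridge_representors`` and the device names in the usage line are
hypothetical; only standard ``ip link`` commands are used.

```shell
#!/bin/sh
# Sketch only: bridge_representors BRIDGE DEV...
# Creates BRIDGE and attaches every DEV (representor or port netdev) to it.
bridge_representors() {
    bridge=$1
    shift
    ip link add name "$bridge" type bridge || return 1
    for dev in "$@"; do
        # Attach the netdev to the bridge, then bring it administratively UP.
        # Per role 1, the representee should then see carrier on.
        ip link set dev "$dev" master "$bridge" || return 1
        ip link set dev "$dev" up || return 1
    done
    ip link set dev "$bridge" up
}

# Example (hypothetical names): bridge_representors br0 eth4 eth4pf0vf0rep
```

Whether a given driver then offloads the bridge's forwarding decisions to
hardware is device-specific; the sketch only sets up the software path.
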

What functions should have a representor?
-----------------------------------------

Essentially, for each virtual port on the device's internal switch, there
should be a representor.
Some vendors have chosen to omit representors for the uplink and the physical
network port, which can simplify usage (the uplink netdev becomes in effect the
physical port's representor) but does not generalise to devices with multiple
ports or uplinks.

Thus, the following should all have representors:

- VFs belonging to the switchdev function.
- Other PFs on the local PCIe controller, and any VFs belonging to them.
- PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
  System-on-Chip within the SmartNIC).
- PFs and VFs with other personalities, including network block devices (such
  as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
  if) their network access is implemented through a virtual switch port. [#]_
  Note that such functions can require a representor despite the representee
  not having a netdev.
- Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
  their own port on the switch (as opposed to using their parent PF's port).
- Any accelerators or plugins on the device whose interface to the network is
  through a virtual switch port, even if they do not have a corresponding PCIe
  PF or VF.

This allows the entire switching behaviour of the NIC to be controlled through
representor TC rules.

It is a common misunderstanding to conflate virtual ports with PCIe virtual
functions or their netdevs. While in simple cases there will be a 1:1
correspondence between VF netdevices and VF representors, more advanced device
configurations may not follow this.
A PCIe function which does not have network access through the internal switch
(not even indirectly through the hardware implementation of whatever services
the function provides) should *not* have a representor (even if it has a
netdev).
Such a function has no switch virtual port for the representor to configure or
to be the other end of the virtual pipe.
The representor represents the virtual port, not the PCIe function nor the 'end
user' netdevice.

.. [#] The concept here is that a hardware IP stack in the device performs the
   translation between block DMA requests and network packets, so that only
   network packets pass through the virtual port onto the switch. The network
   access that the IP stack "sees" would then be configurable through tc rules;
   e.g. its traffic might all be wrapped in a specific VLAN or VxLAN. However,
   any needed configuration of the block device *qua* block device, not being a
   networking entity, would not be appropriate for the representor and would
   thus use some other channel such as devlink.
   Contrast this with the case of a virtio-blk implementation which forwards the
   DMA requests unchanged to another PF whose driver then initiates and
   terminates IP traffic in software; in that case the DMA traffic would *not*
   run over the virtual switch and the virtio-blk PF should thus *not* have a
   representor.

How are representors created?
-----------------------------

The driver instance attached to the switchdev function should, for each virtual
port on the switch, create a pure-software netdevice which has some form of
in-kernel reference to the switchdev function's own netdevice or driver private
data (``netdev_priv()``).
This may be by enumerating ports at probe time, reacting dynamically to the
creation and destruction of ports at run time, or a combination of the two.

The operations of the representor netdevice will generally involve acting
through the switchdev function. For example, ``ndo_start_xmit()`` might send
the packet through a hardware TX queue attached to the switchdev function, with
either packet metadata or queue configuration marking it for delivery to the
representee.

How are representors identified?
--------------------------------

The representor netdevice should *not* directly refer to a PCIe device (e.g.
through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
representee or of the switchdev function.
Instead, the driver should use the ``SET_NETDEV_DEVLINK_PORT`` macro to
assign a devlink port instance to the netdevice before registering the
netdevice; the kernel uses the devlink port to provide the ``phys_switch_id``
and ``phys_port_name`` sysfs nodes.
(Some legacy drivers implement ``ndo_get_port_parent_id()`` and
``ndo_get_phys_port_name()`` directly, but this is deprecated.) See
:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
details of this API.

It is expected that userland will use this information (e.g. through udev rules)
to construct an appropriately informative name or alias for the netdevice. For
instance if the switchdev function is ``eth4`` then a representor with a
``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.

There are as yet no established conventions for naming representors which do not
correspond to PCIe functions (e.g. accelerators and plugins).
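
The renaming example above can be sketched as a small helper of the kind a
udev rule might invoke. The helper name ``rep_alias`` and the exact naming
scheme (drop a leading ``p<N>`` physical-port prefix, append ``rep``) are
assumptions chosen to reproduce the ``eth4pf1vf2rep`` example; they are not an
established convention.

```shell
#!/bin/sh
# Sketch only: rep_alias PARENT PHYS_PORT_NAME
# Prints a candidate representor name built from the switchdev function's
# netdev name and the representor's phys_port_name.
rep_alias() {
    parent=$1
    port_name=$2
    # Drop a leading physical-port prefix such as "p0". A name like "pf1vf2"
    # is left untouched because the "p" must be followed by a digit.
    suffix=$(printf '%s\n' "$port_name" | sed 's/^p[0-9][0-9]*//')
    printf '%s%srep\n' "$parent" "$suffix"
}

# Example, reading the sysfs node mentioned above for a representor $REP_DEV:
#   rep_alias eth4 "$(cat /sys/class/net/$REP_DEV/phys_port_name)"
```

Interface names are limited to 15 characters, so a real rule would need to
fall back to an alias or a shorter scheme when the constructed name is too
long.
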

How do representors interact with TC rules?
-------------------------------------------

Any TC rule on a representor applies (in software TC) to packets received by
that representor netdevice. Thus, if the delivery part of the rule corresponds
to another port on the virtual switch, the driver may choose to offload it to
hardware, applying it to packets transmitted by the representee.

Similarly, since a TC mirred egress action targeting the representor would (in
software) send the packet through the representor (and thus indirectly deliver
it to the representee), hardware offload should interpret this as delivery to
the representee.

As a simple example, if ``PORT_DEV`` is the physical port representor and
``REP_DEV`` is a VF representor, the following rules::

    tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
        action mirred egress redirect dev $PORT_DEV
    tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
        action mirred egress mirror dev $REP_DEV

would mean that all IPv4 packets from the VF are sent out the physical port, and
all IPv4 packets received on the physical port are delivered to the VF in
addition to ``PORT_DEV``. (Note that without ``skip_sw`` on the second rule,
the VF would get two copies, as the packet reception on ``PORT_DEV`` would
trigger the TC rule again and mirror the packet to ``REP_DEV``.)

On devices without separate port and uplink representors, ``PORT_DEV`` would
instead be the switchdev function's own uplink netdevice.
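
The rules above attach to ``parent ffff:``, which is the handle of the ingress
qdisc, so that qdisc has to exist on each device first. A minimal setup sketch
follows; the function name ``prepare_tc_offload`` is hypothetical, and whether
the ``hw-tc-offload`` feature must be enabled on representors as well as on the
port netdevice is assumed here and may vary by driver.

```shell
#!/bin/sh
# Sketch only: prepare_tc_offload DEV...
# Adds an ingress qdisc to each DEV and enables TC hardware offload on it.
prepare_tc_offload() {
    for dev in "$@"; do
        # "parent ffff:" in the filter rules refers to this ingress qdisc.
        tc qdisc add dev "$dev" ingress || return 1
        # Allow the driver to accept offloaded TC rules on this netdev.
        ethtool -K "$dev" hw-tc-offload on || return 1
    done
}

# Example: prepare_tc_offload "$PORT_DEV" "$REP_DEV"
```
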

Of course the rules can (if supported by the NIC) include packet-modifying
actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
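
As a sketch of such packet-modifying actions, the following pair of rules
places a VF on an access VLAN: traffic from the VF is tagged on its way to the
physical port, and tagged traffic from the port is untagged before delivery.
The function name ``offload_vlan_access`` and its arguments are hypothetical;
the ``vlan`` and ``mirred`` actions are standard TC actions, but whether a
particular NIC can offload them is device-specific.

```shell
#!/bin/sh
# Sketch only: offload_vlan_access REP_DEV PORT_DEV VID
offload_vlan_access() {
    rep=$1
    port=$2
    vid=$3
    # VF -> wire: push the VLAN tag, then send out of the physical port.
    tc filter add dev "$rep" parent ffff: protocol ipv4 flower \
        action vlan push id "$vid" \
        action mirred egress redirect dev "$port" || return 1
    # Wire -> VF: match the tag, pop it, then deliver to the representee.
    tc filter add dev "$port" parent ffff: protocol 802.1Q flower \
        vlan_id "$vid" \
        action vlan pop \
        action mirred egress redirect dev "$rep"
}

# Example: offload_vlan_access "$REP_DEV" "$PORT_DEV" 100
```
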

Tunnel encapsulation and decapsulation are rather more complicated, as they
involve a third netdevice (a tunnel netdev operating in metadata mode, such as
a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
require an IP address to be bound to the underlay device (e.g. switchdev
function uplink netdev or port representor). TC rules such as::

    tc filter add dev $REP_DEV parent ffff: flower \
        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
            dst_port 4789 \
        action mirred egress redirect dev vxlan0
    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
        action tunnel_key unset action mirred egress redirect dev $REP_DEV

where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
another IP address on the same subnet, mean that packets sent by the VF should
be VxLAN encapsulated and sent out the physical port (the driver has to deduce
this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
perform an ARP/neighbour table lookup to find the MAC addresses to use in the
outer Ethernet frame), while UDP packets received on the physical port with UDP
port 4789 should be parsed as VxLAN and, if their VNI matches ``$VNI``,
decapsulated and forwarded to the VF.

If this all seems complicated, just remember the 'golden rule' of TC offload:
the hardware should ensure the same final results as if the packets were
processed through the slow path, traversed software TC (except ignoring any
``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
received through the representor netdevices.

Configuring the representee's MAC
---------------------------------

The representee's link state is controlled through the representor. Setting the
representor administratively UP or DOWN should cause carrier ON or OFF at the
representee.

Setting an MTU on the representor should cause that same MTU to be reported to
the representee.
(On hardware that allows configuring separate and distinct MTU and MRU values,
the representor MTU should correspond to the representee's MRU and vice-versa.)
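
A minimal sketch of the link-state and MTU configuration described above,
using standard ``ip link`` commands on the representor. The function name
``set_rep_link`` is hypothetical.

```shell
#!/bin/sh
# Sketch only: set_rep_link REP_DEV up|down [MTU]
# Configures the representor; the representee is expected to follow.
set_rep_link() {
    rep=$1
    state=$2
    mtu=${3:-}
    if [ -n "$mtu" ]; then
        # The same MTU should then be reported to the representee.
        ip link set dev "$rep" mtu "$mtu" || return 1
    fi
    # Administrative UP/DOWN should produce carrier ON/OFF at the representee.
    ip link set dev "$rep" "$state"
}

# Example: set_rep_link "$REP_DEV" up 9000
```
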

Currently there is no way to use the representor to set the station permanent
MAC address of the representee; other methods available to do this include:

- legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
- devlink port function (see **devlink-port(8)** and
  :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
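
The two methods listed above can be sketched as thin wrappers around the
corresponding commands. The wrapper names are hypothetical, and the devlink
port handle in the usage line (``pci/0000:03:00.0/1``) is an illustrative
placeholder rather than a value taken from any real system.

```shell
#!/bin/sh
# Sketch only: set_vf_mac_legacy PF_DEV VF_NUM LLADDR
# Legacy SR-IOV interface, issued against the PF netdevice.
set_vf_mac_legacy() {
    ip link set dev "$1" vf "$2" mac "$3"
}

# Sketch only: set_port_function_mac DEVLINK_PORT LLADDR
# devlink port function interface, issued against the devlink port.
set_port_function_mac() {
    devlink port function set "$1" hw_addr "$2"
}

# Examples (placeholder values):
#   set_vf_mac_legacy eth4 2 00:11:22:33:44:55
#   set_port_function_mac pci/0000:03:00.0/1 00:11:22:33:44:55
```
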