.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour, along with the port attributes,
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have a different type based on its link layer, as
described below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when the link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when the link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when the driver should detect the
       port type automatically.

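A minimal sketch of how a user might request a port type from the shell,
assuming the iproute2 ``devlink`` tool is installed and the driver allows the
type to be changed. The helper name ``set_port_type`` is hypothetical; it only
wraps ``devlink port set``::

    # Hypothetical helper: set the link layer type of one devlink port.
    # $1 is the port handle (for example pci/0000:06:00.0/1),
    # $2 is one of: eth, ib, auto.
    set_port_type() {
        case "$2" in
            eth | ib | auto)
                devlink port set "$1" type "$2"
                ;;
            *)
                echo "unsupported port type: $2" >&2
                return 1
                ;;
        esac
    }

For example, ``set_port_type pci/0000:06:00.0/1 auto`` asks the driver to
detect the port type automatically.
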
PCI controllers
---------------

In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes, or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
An eswitch on the PCI device supports ports of multiple controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller
number = 1) doesn't have the eswitch. The local controller (identified by
controller number = 0) has the eswitch. The devlink instance on the local
controller has eswitch devlink ports for both controllers.

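As an illustration, the eswitch ports that belong to one controller could be
picked out of ``devlink port show`` with a small filter. This is only a
sketch: it assumes that each eswitch port line carries a ``controller <number>``
token, which the shortened sample outputs in this document do not show, and
the helper name ``ports_of_controller`` is hypothetical::

    # Hypothetical helper: read `devlink port show` output on stdin and
    # print the handles of the ports whose controller number equals $1.
    ports_of_controller() {
        awk -v want="$1" '{
            for (i = 1; i < NF; i++)
                if ($i == "controller" && $(i + 1) == want) {
                    sub(/:$/, "", $1)   # drop the trailing colon of the handle
                    print $1
                }
        }'
    }

A possible invocation is ``devlink port show | ports_of_controller 1``, which
would list the ports of the external controller in the example above.
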
Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means that the user should configure the function
attributes before a bus-specific device for the function is created. However,
when SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, function attributes should be configured before binding the virtual
function device to the driver. For subfunctions, this means the user should
configure the port function attributes before activating the port function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
``devlink port function set roce`` command.

Users may also set the function as migratable using the
``devlink port function set migratable`` command.

Users may also set the IPsec crypto capability of the function using the
``devlink port function set ipsec_crypto`` command.

Users may also set the IPsec packet capability of the function using the
``devlink port function set ipsec_packet`` command.

Users may also set the maximum IO event queues of the function
using the ``devlink port function set max_io_eqs`` command.

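The ordering described above for a VF can be sketched as follows. This is an
assumption-laden example rather than a prescribed procedure: it assumes the VF
is not yet bound to a driver (for instance because ``sriov_drivers_autoprobe``
was disabled on the PF before the VFs were created), and both the helper name
``configure_vf_then_probe`` and the ``SYSFS_PCI`` override are hypothetical::

    # Hypothetical override so the sketch can be dry-run against a scratch
    # directory; by default it points at the real PCI bus directory in sysfs.
    SYSFS_PCI=${SYSFS_PCI:-/sys/bus/pci}

    # $1: devlink port handle of the VF representor (e.g. pci/0000:06:00.0/2)
    # $2: MAC address to assign to the function
    # $3: PCI address of the VF device itself (e.g. 0000:06:00.3)
    configure_vf_then_probe() {
        port=$1 mac=$2 vf_bdf=$3

        # Configure the function attributes first ...
        devlink port function set "$port" hw_addr "$mac" || return 1
        devlink port function set "$port" roce disable || return 1

        # ... and only then ask the PCI core to bind a driver to the VF.
        echo "$vf_bdf" > "$SYSFS_PCI/drivers_probe"
    }

The point of the sketch is the order of the steps: every
``devlink port function set`` call completes before the VF device is probed.
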
Function attributes
===================

MAC address setup
-----------------

The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
        function:
            hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
        function:
            hw_addr 00:00:00:00:88:88

RoCE capability setup
---------------------

Not all PCI VFs/SFs require RoCE capability.

Disabling RoCE capability saves system memory per PCI VF/SF.

When the user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF, and the RoCE GID table
for this PCI VF/SF will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce disable

Migratable capability setup
---------------------------

Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When the user enables the migratable capability for a VF, and the hypervisor
(HV) binds the VF to a VFIO driver with migration support, the user can
migrate the VM with this VF from one HV to a different one.

However, when the migratable capability is enabled, the device disables
features which cannot be migrated. The migratable capability can therefore
impose limitations on a VF, so the decision is left to the user.

Example of live migration (LM) with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

- Attach the VF to the VM.
- Start the VM.
- Perform live migration.

IPsec crypto capability setup
-----------------------------

When the user enables IPsec crypto capability for a VF, user applications can
offload XFRM state crypto operations (encrypt/decrypt) to this VF.

When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
processed in software by the kernel.

- Get IPsec crypto capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto disabled

- Set IPsec crypto capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto enabled

IPsec packet capability setup
-----------------------------

When the user enables IPsec packet capability for a VF, user applications can
offload XFRM state and policy crypto operations (encrypt/decrypt) to this VF,
as well as IPsec encapsulation.

When IPsec packet capability is disabled (default) for a VF, the XFRM state and
policy are processed in software by the kernel.

- Get IPsec packet capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled

- Set IPsec packet capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet enabled

Maximum IO event queues setup
-----------------------------

When the user sets the maximum number of IO event queues for an SF or
a VF, the function driver is limited to consuming at most the enforced
number of IO event queues.

IO event queues deliver events related to IO queues, including network
device transmit and receive queues (txq and rxq) and RDMA Queue Pairs (QPs).
For example, the number of netdevice channels and RDMA device completion
vectors are derived from the function's IO event queues. Usually, the number
of interrupt vectors consumed by the driver is limited by the number of IO
event queues per device, as each of the IO event queues is connected to an
interrupt vector.

- Get maximum IO event queues of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10

- Set maximum IO event queues of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. Subfunctions are created and deployed one at a time.
Unlike SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

(1) Create
----------

A subfunction is created using the devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and the
representor netdevice.

(2) Configure
-------------

A subfunction devlink port is created but it is not active yet. That means the
entities are created on the devlink side and the e-switch port representor is
created, but the subfunction device itself is not created. A user might use
the e-switch port representor to apply settings, such as putting it into a
bridge or adding TC rules. A user might also configure the hardware address
(such as the MAC address) of the subfunction while the subfunction is inactive.

(3) Deploy
----------

Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.

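The three steps above can be sketched as one shell helper. This is a sketch
under stated assumptions, not a definitive procedure: the helper name
``sf_deploy`` is hypothetical, and it assumes that ``devlink port add`` prints
the handle of the newly created port as the first field of its first output
line::

    # Hypothetical helper: create, configure and activate one subfunction.
    # $1: devlink device (e.g. pci/0000:06:00.0)   $2: pfnum
    # $3: sfnum                                    $4: MAC address
    sf_deploy() {
        dev=$1 pfnum=$2 sfnum=$3 mac=$4

        # (1) Create: add a devlink port of subfunction flavour and take
        # the new port handle from the first line of the output (assumed).
        port=$(devlink port add "$dev" flavour pcisf pfnum "$pfnum" sfnum "$sfnum" |
            awk 'NR == 1 { sub(/:$/, "", $1); print $1 }')
        [ -n "$port" ] || return 1

        # (2) Configure: set the hardware address while the function is
        # still inactive.
        devlink port function set "$port" hw_addr "$mac" || return 1

        # (3) Deploy: activation instantiates the subfunction device.
        devlink port function set "$port" state active || return 1

        echo "$port"
    }

A call such as ``sf_deploy pci/0000:06:00.0 0 88 00:00:00:00:88:88`` would
print the port handle of the new subfunction on success.
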
Rate object management
======================

Devlink provides an API to manage the TX rates of a single devlink port or a
group. This is done through rate objects, which can be one of two types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1:1 mapping to its devlink port, in user space it is referred to
  as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from the userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier, except a decimal number, to avoid
  collisions with leafs.

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or among the
  rate objects that are part of the same parent group, if it belongs to a
  group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of Weighted Fair Queuing arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with the
  strict priority. As a node is configured with a higher rate it gets more
  BW relative to its siblings. Values are relative, like percentage points:
  they tell how much BW a node should take relative to its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.

``tc_bw``
  Allows users to set the bandwidth allocation per traffic class on rate
  objects. This enables fine-grained QoS configurations by assigning a relative
  share value to each traffic class. The bandwidth is distributed in proportion
  to the share value for each class, relative to the sum of all shares.
  When applied to a non-leaf node, ``tc_bw`` determines how bandwidth is shared
  among its child elements.

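To make the proportional distribution concrete, the following sketch computes
which percentage of the bandwidth each share value would receive, relative to
the sum of all shares. The helper name ``share_percentages`` is hypothetical
and the arithmetic is only illustrative (integer division, so results are
rounded down); it does not model any particular device::

    # Hypothetical helper: print, one per line, the percentage of the total
    # that each share value given as an argument represents.
    share_percentages() {
        total=0
        for s in "$@"; do
            total=$((total + s))
        done
        if [ "$total" -le 0 ]; then
            echo "sum of shares must be positive" >&2
            return 1
        fi
        for s in "$@"; do
            echo $((s * 100 / total))
        done
    }

For instance, ``share_percentages 1 1 2`` prints ``25``, ``25`` and ``50``:
the third class gets half of the bandwidth because its share is half of the
sum of all shares.
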
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.

Driver implementations are allowed to support either or both rate object types
and setting methods of their parameters. Additionally, a driver implementation
may export nodes/leafs and their child-parent relationships.

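As a sketch of how rate objects might be managed from user space, the helper
below creates a ``node`` with a ``tx_max`` limit and attaches existing leafs
to it through the ``parent`` parameter. It assumes the iproute2
``devlink port function rate`` commands and a driver that supports both rate
object types; the helper name ``rate_group_setup`` and the example values are
hypothetical::

    # Hypothetical helper: create a rate node and make leafs its children.
    # $1: devlink device (e.g. pci/0000:06:00.0)
    # $2: node name      $3: tx_max of the node (e.g. 10gbit)
    # remaining arguments: port indexes of the leafs to attach
    rate_group_setup() {
        dev=$1 node=$2 node_max=$3
        shift 3

        case "$node" in
            '')
                echo "node name must not be empty" >&2
                return 1
                ;;
            *[!0-9]*)
                # Contains a non-digit, so it cannot collide with a leaf.
                ;;
            *)
                echo "node name must not be a decimal number" >&2
                return 1
                ;;
        esac

        devlink port function rate add "$dev/$node" tx_max "$node_max" || return 1
        for leaf in "$@"; do
            devlink port function rate set "$dev/$leaf" parent "$node" || return 1
        done
    }

A call such as ``rate_group_setup pci/0000:06:00.0 group1 10gbit 1 2`` would
place the leafs of ports 1 and 2 under ``group1``, so that the ``tx_max`` of
the node acts as an upper limit for both children.
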
Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device that has one or more PCI buses and consists of
       one or more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.
  376. devices. In most cases it is same as subfunction management driver. When
  377. subfunction is used on external controller, subfunction management and
  378. host drivers are different.