.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===============
Multi-PF Netdev
===============

Contents
========

- `Background`_
- `Overview`_
- `mlx5 implementation`_
- `Channels distribution`_
- `Observability`_
- `Steering`_
- `Mutually exclusive features`_

Background
==========

The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to
the network, each through its own dedicated PCIe interface. This is done either through a connection
harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a single card.
This eliminates network traffic traversing the internal bus between the sockets, significantly
reducing overhead and latency, in addition to reducing CPU utilization and increasing network
throughput.

Overview
========

The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
one netdev instance. It is implemented in the netdev layer. Lower-layer instances like the PCI
function, sysfs entry, and devlink are kept separate.

Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
traffic and allows apps running on the same netdev from different NUMA nodes to stay close to the
device and achieve improved performance.

mlx5 implementation
===================

Multi-PF, or Socket-direct, in mlx5 is achieved by grouping together PFs that belong to the same
NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev
to represent all of them. Symmetrically, we destroy the netdev whenever any of the PFs is removed.

The netdev network channels are distributed between all devices. A proper configuration would
utilize the closest NUMA node when working on a certain app/CPU.

We pick one PF to be the primary (leader), and it fills a special role. The other devices
(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
mode, no south <-> north traffic flows directly through a secondary PF. It needs the assistance of
the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
to/from the secondaries.

Currently, we limit the support to PFs only, and up to two PFs (sockets).
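
The netdev lifecycle described above can be modelled with a small toy class. This is only an
illustrative sketch, not mlx5 driver code: the class name ``MultiPfGroup`` and its methods are
hypothetical, and the PCI addresses in the usage note are borrowed from the Observability example.

```python
class MultiPfGroup:
    """Toy model of a socket-direct PF group (hypothetical, not driver code)."""

    MAX_PFS = 2  # the text limits support to two PFs (sockets)

    def __init__(self, expected_pfs):
        if not 1 <= expected_pfs <= self.MAX_PFS:
            raise ValueError("unsupported number of PFs")
        self.expected_pfs = expected_pfs
        self.probed = set()      # PFs that have been probed so far
        self.netdev_up = False   # whether the single shared netdev exists

    def probe(self, pf):
        """Register a PF; the netdev appears only once all PFs are present."""
        self.probed.add(pf)
        if len(self.probed) == self.expected_pfs:
            self.netdev_up = True

    def remove(self, pf):
        """Remove a PF; losing any member destroys the shared netdev."""
        self.probed.discard(pf)
        self.netdev_up = False
```

With two expected PFs, probing ``0000:08:00.0`` alone leaves ``netdev_up`` false, probing
``0000:09:00.0`` as well sets it, and removing either PF clears it again.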

Channels distribution
=====================

We distribute the channels between the different PFs to achieve local NUMA node performance
on multiple NUMA nodes.

Each combined channel works against one specific PF, creating all its datapath queues against it. We
distribute channels to PFs in a round-robin policy.

::

        Example for 2 PFs and 5 channels:
        +--------+--------+
        | ch idx | PF idx |
        +--------+--------+
        |    0   |    0   |
        |    1   |    1   |
        |    2   |    0   |
        |    3   |    1   |
        |    4   |    0   |
        +--------+--------+

The reason we prefer round-robin is that it is less influenced by changes in the number of channels.
The mapping between a channel index and a PF is fixed, no matter how many channels the user
configures. As the channel stats are persistent across a channel's closure, changing the mapping
every single time would make the accumulated stats less representative of the channel's history.

This is achieved by using the correct core device instance (mdev) in each channel, instead of them
all using the same instance under "priv->mdev".
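
The round-robin policy and its stability can be sketched in a few lines. The function names
below are hypothetical, and the contiguous "block split" policy is not something the driver
uses; it is included only as an assumed alternative to show a mapping that shifts when the
channel count changes.

```python
def round_robin_pf(ch_idx, num_pfs):
    """PF index for a channel under round-robin: depends only on ch_idx."""
    return ch_idx % num_pfs


def block_split_pf(ch_idx, num_pfs, num_channels):
    """Hypothetical alternative: contiguous blocks of channels per PF."""
    return ch_idx * num_pfs // num_channels


def distribution(num_channels, num_pfs):
    """Map every channel index to its PF index under round-robin."""
    return {ch: round_robin_pf(ch, num_pfs) for ch in range(num_channels)}


def moved_channels(old_count, new_count, num_pfs):
    """Channels whose PF changes under block split when the count changes."""
    common = range(min(old_count, new_count))
    return [ch for ch in common
            if block_split_pf(ch, num_pfs, old_count)
            != block_split_pf(ch, num_pfs, new_count)]
```

``distribution(5, 2)`` reproduces the table above. Growing from 5 to 8 channels leaves the
round-robin mapping of channels 0 to 4 untouched, whereas the block split would move channel 3
from PF 1 to PF 0, so its accumulated stats would mix the history of two devices.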

Observability
=============

The relation between PF, irq, napi, and queue can be observed via netlink spec::

  $ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
  [{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
   {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]

  $ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
  [{'id': 543, 'ifindex': 13, 'irq': 42},
   {'id': 542, 'ifindex': 13, 'irq': 41},
   {'id': 541, 'ifindex': 13, 'irq': 40},
   {'id': 540, 'ifindex': 13, 'irq': 39},
   {'id': 539, 'ifindex': 13, 'irq': 36}]

Here you can clearly observe our channels distribution policy::

  $ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
  /proc/irq/36/mlx5_comp0@pci:0000:08:00.0
  /proc/irq/39/mlx5_comp0@pci:0000:09:00.0
  /proc/irq/40/mlx5_comp1@pci:0000:08:00.0
  /proc/irq/41/mlx5_comp1@pci:0000:09:00.0
  /proc/irq/42/mlx5_comp2@pci:0000:08:00.0
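
The three outputs above can be joined by hand or with a short script. The sketch below is
illustrative only: it hard-codes the Rx rows of the sample data shown above rather than
querying netlink or ``/proc``, and the helper name ``queue_to_pci`` is hypothetical.

```python
# Sample data copied from the outputs above (Rx queues only, for brevity).
QUEUES = [
    {"id": 0, "napi-id": 539, "type": "rx"},
    {"id": 1, "napi-id": 540, "type": "rx"},
    {"id": 2, "napi-id": 541, "type": "rx"},
    {"id": 3, "napi-id": 542, "type": "rx"},
    {"id": 4, "napi-id": 543, "type": "rx"},
]
NAPIS = [
    {"id": 543, "irq": 42},
    {"id": 542, "irq": 41},
    {"id": 541, "irq": 40},
    {"id": 540, "irq": 39},
    {"id": 539, "irq": 36},
]
IRQ_NAMES = {
    36: "mlx5_comp0@pci:0000:08:00.0",
    39: "mlx5_comp0@pci:0000:09:00.0",
    40: "mlx5_comp1@pci:0000:08:00.0",
    41: "mlx5_comp1@pci:0000:09:00.0",
    42: "mlx5_comp2@pci:0000:08:00.0",
}


def queue_to_pci(queues, napis, irq_names):
    """Follow queue -> napi -> irq -> PCI address of the owning PF."""
    napi_to_irq = {n["id"]: n["irq"] for n in napis}
    result = {}
    for q in queues:
        irq = napi_to_irq[q["napi-id"]]
        # The irq name ends with "@pci:<address>"; keep only the address.
        pci = irq_names[irq].split("@pci:", 1)[1]
        result[(q["type"], q["id"])] = pci
    return result
```

For the sample data, even-numbered queues resolve to ``0000:08:00.0`` and odd-numbered queues to
``0000:09:00.0``, matching the round-robin distribution.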

Steering
========

Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.

In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
traffic to other PFs, via cross-vhca steering capabilities. We still maintain a single default RSS
table, which is capable of pointing to the receive queues of a different PF.

In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
go out to the network through it.

In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to
the PF on the same node as the CPU.

XPS default config example::

  NUMA node(s):          2
  NUMA node0 CPU(s):     0-11
  NUMA node1 CPU(s):     12-23

PF0 on node0, PF1 on node1.

- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
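
The masks listed above follow a simple pattern, sketched below. The rule is inferred from the
example values only and is not taken from the driver source. It assumes that PF ``i`` sits on
NUMA node ``i``, that queues map to PFs round-robin, and that each queue is given exactly one
CPU of its PF's node. The function name ``default_xps_masks`` is hypothetical, and the
six-digit hex width simply mirrors the listing.

```python
def default_xps_masks(node_cpus, num_queues):
    """Return one hex CPU mask per Tx queue for the example layout.

    node_cpus[i] lists the CPUs of the NUMA node where PF i resides
    (assumption: PF index equals node index).
    """
    num_pfs = len(node_cpus)
    masks = []
    for q in range(num_queues):
        pf = q % num_pfs                   # round-robin queue -> PF
        cpu = node_cpus[pf][q // num_pfs]  # next unused CPU on that PF's node
        masks.append(format(1 << cpu, "06x"))
    return masks


# Layout from the example: node0 has CPUs 0-11, node1 has CPUs 12-23.
NODE_CPUS = [list(range(0, 12)), list(range(12, 24))]
```

Calling ``default_xps_masks(NODE_CPUS, 24)`` yields the 24 values in the list above, for example
``000001`` for tx-0 (CPU 0 on node0) and ``001000`` for tx-1 (CPU 12 on node1).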

Mutually exclusive features
===========================

The nature of Multi-PF, where different channels work with different PFs, conflicts with
stateful features where the state is maintained in one of the PFs.
For example, in the TLS device-offload feature, special context objects are created per connection
and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
we disable this combination for now.