.. SPDX-License-Identifier: GPL-2.0

===================
ice devlink support
===================

This document describes the devlink features implemented by the ``ice``
device driver.

Parameters
==========

.. list-table:: Generic parameters implemented
   :widths: 5 5 90

   * - Name
     - Mode
     - Notes
   * - ``enable_roce``
     - runtime
     - mutually exclusive with ``enable_iwarp``
   * - ``enable_iwarp``
     - runtime
     - mutually exclusive with ``enable_roce``
   * - ``tx_scheduling_layers``
     - permanent
     - The ice hardware uses hierarchical scheduling for Tx with a fixed
       number of layers in the scheduling tree. Each layer is a decision
       point. The root node represents a port, while the leaves represent
       the queues. This way of configuring the Tx scheduler allows features
       like DCB or devlink-rate (documented below) to configure how much
       bandwidth is given to any given queue or group of queues, enabling
       fine-grained control because scheduling parameters can be configured
       at any given layer of the tree.

       The default 9-layer tree topology was deemed best for most workloads,
       as it gives an optimal ratio of performance to configurability. However,
       for some specific cases, this 9-layer topology might not be desired.
       One example is sending traffic to a number of queues that is not a
       multiple of 8. Because the maximum radix is limited to 8 in the 9-layer
       topology, the 9th queue has a different parent than the rest, and it is
       given more bandwidth credits. This causes a problem when the system is
       sending traffic to 9 queues:

       | tx_queue_0_packets: 24163396
       | tx_queue_1_packets: 24164623
       | tx_queue_2_packets: 24163188
       | tx_queue_3_packets: 24163701
       | tx_queue_4_packets: 24163683
       | tx_queue_5_packets: 24164668
       | tx_queue_6_packets: 23327200
       | tx_queue_7_packets: 24163853
       | tx_queue_8_packets: 91101417 < Too much traffic is sent from the 9th queue

       To address this, you can switch to a 5-layer topology, which
       changes the maximum topology radix to 512. With this change,
       all queues perform equally because they can all be assigned
       to the same parent in the tree. The obvious drawback of this solution
       is a lower configuration depth of the tree.

       Use the ``tx_scheduling_layers`` parameter with the devlink command
       to change the transmit scheduler topology. To use the 5-layer topology,
       use a value of 5. For example:

       .. code:: shell

           $ devlink dev param set pci/0000:16:00.0 name tx_scheduling_layers value 5 cmode permanent

       Use a value of 9 to set it back to the default value.

       You must power cycle the PCI slot for the selected topology to take
       effect.

       To verify that the value has been set:

       .. code:: shell

           $ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers
   * - ``msix_vec_per_pf_max``
     - driverinit
     - Sets the maximum number of MSI-X vectors that can be used by the PF;
       the rest can be used for SR-IOV. The range is from the minimum value
       set in ``msix_vec_per_pf_min`` to 2k/number of ports.
   * - ``msix_vec_per_pf_min``
     - driverinit
     - Sets the minimum number of MSI-X vectors that will be used by the PF.
       This value determines how many MSI-X vectors are allocated statically.
       The range is from 2 to the value set in ``msix_vec_per_pf_max``.

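
The imbalance shown for ``tx_scheduling_layers`` can be spotted from per-queue
counters. The following is a minimal sketch, not part of the driver: it assumes
counters in the ``tx_queue_N_packets: VALUE`` form shown above, and both the
helper name ``find_hot_queues`` and the "more than twice the mean" threshold
are illustrative choices.

.. code:: shell

    # Hypothetical helper: read "tx_queue_N_packets: VALUE" lines on stdin and
    # print every queue whose packet count exceeds twice the mean of all queues.
    find_hot_queues() {
        awk -F': *' '
            /tx_queue_[0-9]+_packets/ {
                n++
                name[n] = $1
                sub(/^[ |]+/, "", name[n])   # drop leading blanks or "|" markers
                count[n] = $2 + 0
                sum += count[n]
            }
            END {
                if (n == 0) exit
                mean = sum / n
                for (i = 1; i <= n; i++)
                    if (count[i] > 2 * mean) print name[i], count[i]
            }'
    }

    # Sample input: three evenly loaded queues and one overloaded queue.
    printf '%s\n' \
        'tx_queue_0_packets: 24163396' \
        'tx_queue_1_packets: 24164623' \
        'tx_queue_2_packets: 24163188' \
        'tx_queue_8_packets: 91101417' | find_hot_queues

On a live system the input could instead come from ``ethtool -S`` on the
relevant interface, provided the counters are reported under these names.
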
.. list-table:: Driver specific parameters implemented
   :widths: 5 5 90

   * - Name
     - Mode
     - Description
   * - ``local_forwarding``
     - runtime
     - Controls loopback behavior by tuning scheduler bandwidth.
       It impacts all kinds of functions: physical, virtual and
       subfunctions.
       Supported values are:

       ``enabled`` - loopback traffic is allowed on this port

       ``disabled`` - loopback traffic is not allowed on this port

       ``prioritized`` - loopback traffic is prioritized on this port

       The default value of the ``local_forwarding`` parameter is ``enabled``.
       ``prioritized`` makes it possible to adjust the loopback traffic rate
       to increase the capacity of one port at the cost of another. The user
       needs to disable local forwarding on one of the ports in order to have
       increased capacity on the ``prioritized`` port.

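
As an illustration of the ``prioritized`` mode, the commands below disable
local forwarding on one port and prioritize it on another. This is a sketch
only: the addresses ``pci/0000:16:00.0`` and ``pci/0000:16:00.1`` are assumed
to be two ports of the same adapter, and the syntax follows the generic
``devlink dev param set`` form.

.. code:: shell

    # Give up loopback capacity on the first port (assumed address).
    $ devlink dev param set pci/0000:16:00.0 name local_forwarding value disabled cmode runtime
    # Prioritize loopback traffic on the second port (assumed address).
    $ devlink dev param set pci/0000:16:00.1 name local_forwarding value prioritized cmode runtime
    # Check the resulting setting.
    $ devlink dev param show pci/0000:16:00.1 name local_forwarding
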
Info versions
=============

The ``ice`` driver reports the following versions:

.. list-table:: devlink info versions implemented
    :widths: 5 5 5 90

    * - Name
      - Type
      - Example
      - Description
    * - ``board.id``
      - fixed
      - K65390-000
      - The Product Board Assembly (PBA) identifier of the board.
    * - ``cgu.id``
      - fixed
      - 36
      - The Clock Generation Unit (CGU) hardware revision identifier.
    * - ``fw.mgmt``
      - running
      - 2.1.7
      - 3-digit version number of the management firmware running on the
        Embedded Management Processor of the device. It controls the PHY,
        link, access to device resources, etc. Intel documentation refers to
        this as the EMP firmware.
    * - ``fw.mgmt.api``
      - running
      - 1.5.1
      - 3-digit version number (major.minor.patch) of the API exported over
        the AdminQ by the management firmware. Used by the driver to
        identify what commands are supported. Historical versions of the
        kernel only displayed a 2-digit version number (major.minor).
    * - ``fw.mgmt.build``
      - running
      - 0x305d955f
      - Unique identifier of the source for the management firmware.
    * - ``fw.undi``
      - running
      - 1.2581.0
      - Version of the Option ROM containing the UEFI driver. The version is
        reported in ``major.minor.patch`` format. The major version is
        incremented whenever a major breaking change occurs, or when the
        minor version would overflow. The minor version is incremented for
        non-breaking changes and reset to 1 when the major version is
        incremented. The patch version is normally 0 but is incremented when
        a fix is delivered as a patch against an older base Option ROM.
    * - ``fw.psid.api``
      - running
      - 0.80
      - Version defining the format of the flash contents.
    * - ``fw.bundle_id``
      - running
      - 0x80002ec0
      - Unique identifier of the firmware image file that was loaded onto
        the device. Also referred to as the EETRACK identifier of the NVM.
    * - ``fw.app.name``
      - running
      - ICE OS Default Package
      - The name of the DDP package that is active in the device. The DDP
        package is loaded by the driver during initialization. Each
        variation of the DDP package has a unique name.
    * - ``fw.app``
      - running
      - 1.3.1.0
      - The version of the DDP package that is active in the device. Note
        that both the name (as reported by ``fw.app.name``) and version are
        required to uniquely identify the package.
    * - ``fw.app.bundle_id``
      - running
      - 0xc0000001
      - Unique identifier for the DDP package loaded in the device. Also
        referred to as the DDP Track ID. Can be used to uniquely identify
        the specific DDP package.
    * - ``fw.netlist``
      - running
      - 1.1.2000-6.7.0
      - The version of the netlist module. This module defines the device's
        Ethernet capabilities and default settings, and is used by the
        management firmware as part of managing link and device
        connectivity.
    * - ``fw.netlist.build``
      - running
      - 0xee16ced7
      - The first 4 bytes of the hash of the netlist module contents.
    * - ``fw.cgu``
      - running
      - 8032.16973825.6021
      - The version of Clock Generation Unit (CGU). Format:
        <CGU type>.<configuration version>.<firmware version>.

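
The versions in the table can be read with ``devlink dev info``. The sketch
below pulls a single version out of that output. It assumes the usual
``name value`` layout of the text output; the helper name ``get_version`` is
hypothetical and the sample values are taken from the Example column above.

.. code:: shell

    # Hypothetical helper: print the value reported for one version name.
    # Usage: devlink dev info pci/0000:01:00.0 | get_version fw.mgmt
    get_version() {
        awk -v key="$1" '
            $1 == key {
                sub(/^[ \t]*[^ \t]+[ \t]+/, "")   # strip the name, keep the value
                print
                exit
            }'
    }

    # Sample input mimicking part of the "devlink dev info" output.
    printf '%s\n' \
        '      running:' \
        '        fw.mgmt 2.1.7' \
        '        fw.mgmt.api 1.5.1' \
        '        fw.app.name ICE OS Default Package' | get_version fw.app.name

Matching on the whole first field keeps ``fw.mgmt`` from also matching
``fw.mgmt.api``, and stripping only the name preserves values that contain
spaces.
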
Flash Update
============

The ``ice`` driver implements support for flash update using the
``devlink-flash`` interface. It supports updating the device flash using a
combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
``fw.netlist`` components.

.. list-table:: List of supported overwrite modes
   :widths: 5 95

   * - Bits
     - Behavior
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
     - Do not preserve settings stored in the flash components being
       updated. This includes overwriting the port configuration that
       determines the number of physical functions the device will
       initialize with.
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
     - Do not preserve either settings or identifiers. Overwrite everything
       in the flash with the contents from the provided image, without
       performing any preservation. This includes overwriting device
       identifying fields such as the MAC address, VPD area, and device
       serial number. It is expected that this combination be used with an
       image customized for the specific device.

The ice hardware does not support overwriting only identifiers while
preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
own will be rejected. If no overwrite mask is provided, the firmware will be
instructed to preserve all settings and identifying fields when updating.

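
For illustration, the overwrite modes map onto the ``overwrite`` keywords of
``devlink dev flash`` as sketched below. The device address and the image file
name ``ice-update.bin`` are placeholders, not values taken from this document.

.. code:: shell

    # Default: preserve all settings and identifying fields.
    $ devlink dev flash pci/0000:01:00.0 file ice-update.bin
    # Overwrite settings only.
    $ devlink dev flash pci/0000:01:00.0 file ice-update.bin overwrite settings
    # Overwrite both settings and identifiers.
    $ devlink dev flash pci/0000:01:00.0 file ice-update.bin overwrite settings overwrite identifiers
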
Reload
======

The ``ice`` driver supports activating new firmware after a flash update
using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
action.

.. code:: shell

    $ devlink dev reload pci/0000:01:00.0 reload action fw_activate

The new firmware is activated by issuing a device specific Embedded
Management Processor reset which requests the device to reset and reload the
EMP firmware image.

The driver does not currently support reloading the driver via
``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.

Port split
==========

The ``ice`` driver supports port splitting only for port 0, as the FW has
a predefined set of available port split options for the whole device.

A system reboot is required for port split to be applied.

The following command will select the port split option with 4 ports:

.. code:: shell

    $ devlink port split pci/0000:16:00.0/0 count 4

The list of all available port options will be printed to dynamic debug after
each ``split`` and ``unsplit`` command. The first option is the default.

.. code:: shell

    ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
    ice 0000:16:00.0: Status  Split      Quad 0          Quad 1
    ice 0000:16:00.0:         count  L0  L1  L2  L3  L4  L5  L6  L7
    ice 0000:16:00.0: Active  2     100   -   -   - 100   -   -   -
    ice 0000:16:00.0:         2      50   -  50   -   -   -   -   -
    ice 0000:16:00.0: Pending 4      25  25  25  25   -   -   -   -
    ice 0000:16:00.0:         4      25  25   -   -  25  25   -   -
    ice 0000:16:00.0:         8      10  10  10  10  10  10  10  10
    ice 0000:16:00.0:         1     100   -   -   -   -   -   -   -

There could be multiple FW port options with the same port split count. When
the same port split count request is issued again, the next FW port option with
the same port split count will be selected.

``devlink port unsplit`` will select the option with a split count of 1. If
there is no FW option available with split count 1, you will receive an error.

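
Because the option list is emitted through dynamic debug, it only shows up in
the kernel log when debug messages for the driver are enabled. A possible
sequence is sketched below, assuming debugfs is mounted at
``/sys/kernel/debug`` and the kernel was built with dynamic debug support:

.. code:: shell

    # Enable dynamic debug messages for the ice module (run as root).
    $ echo 'module ice +p' > /sys/kernel/debug/dynamic_debug/control
    # Return to the single-port option; fails if no such FW option exists.
    $ devlink port unsplit pci/0000:16:00.0/0
    # Inspect the options printed by the driver for this device.
    $ dmesg | grep 'ice 0000:16:00.0'
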
Regions
=======

The ``ice`` driver implements the following regions for accessing internal
device data.

.. list-table:: regions implemented
    :widths: 15 85

    * - Name
      - Description
    * - ``nvm-flash``
      - The contents of the entire flash chip, sometimes referred to as
        the device's Non Volatile Memory.
    * - ``shadow-ram``
      - The contents of the Shadow RAM, which is loaded from the beginning
        of the flash. Although the contents are primarily from the flash,
        this area also contains data generated during device boot which is
        not stored in flash.
    * - ``device-caps``
      - The contents of the device firmware's capabilities buffer. Useful to
        determine the current state and configuration of the device.

Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
snapshot. The ``device-caps`` region requires a snapshot as the contents are
sent by firmware and can't be split into separate reads.

Users can request an immediate capture of a snapshot for all three regions
via the ``DEVLINK_CMD_REGION_NEW`` command.

.. code:: shell

    $ devlink region show
    pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
    pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10

    $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
    0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
    0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
    0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5

    $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30

    $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
    $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
    0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
    0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
    0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
    0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
    00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
    00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
    0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
    0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
    00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1

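
Because ``nvm-flash`` and ``shadow-ram`` can be accessed without a snapshot, a
direct read is also possible. The following sketch assumes a devlink utility
recent enough to accept ``region read`` without the ``snapshot`` argument:

.. code:: shell

    # Read the first 16 bytes of the Shadow RAM directly, without a snapshot.
    $ devlink region read pci/0000:01:00.0/shadow-ram address 0 length 16
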
Devlink Rate
============

The ``ice`` driver implements the devlink-rate API, which allows offloading
hierarchical QoS to the hardware. It lets the user group Virtual Functions
in a tree structure and assign the supported parameters (``tx_share``,
``tx_max``, ``tx_priority`` and ``tx_weight``) to each node in the tree. The
user can thereby control how much bandwidth is allocated to each VF group,
and the hardware then enforces that allocation.

It is assumed that this feature is mutually exclusive with DCB performed
in FW and ADQ, or any driver feature that would trigger changes in QoS,
for example creation of a new traffic class. The driver will prevent DCB
or ADQ configuration once the user has started making any changes to the
nodes using the devlink-rate API. To configure those features, a driver
reload is necessary. Conversely, if ADQ or DCB is configured, the driver
won't export the hierarchy at all, or will remove the untouched hierarchy if
those features are enabled after the hierarchy is exported but before any
changes are made.

This feature also depends on switchdev being enabled in the system,
because devlink-rate requires devlink-port objects to be present, and those
objects are only created in switchdev mode.

If the driver is set to switchdev mode, it will export the internal
hierarchy the moment VFs are created. The root of the tree is always
represented by ``node_0``. This node can't be deleted by the user. Leaf
nodes and nodes with children also can't be deleted.

.. list-table:: Attributes supported
    :widths: 15 85

    * - Name
      - Description
    * - ``tx_max``
      - Maximum bandwidth to be consumed by the tree node. The rate limit is
        an absolute number specifying the maximum number of bytes a node may
        consume during the course of one second. The rate limit guarantees
        that a link will not oversaturate the receiver on the remote end
        and also enforces an SLA between the subscriber and network
        provider.
    * - ``tx_share``
      - Minimum bandwidth allocated to a tree node when it is not blocked,
        specified as an absolute bandwidth. While ``tx_max`` defines the
        maximum bandwidth the node may consume, ``tx_share`` marks the
        committed bandwidth for the node.
    * - ``tx_priority``
      - Allows use of a strict priority arbiter among siblings. This
        arbitration scheme attempts to schedule nodes based on their
        priority as long as the nodes remain within their bandwidth limit.
        Range 0-7. Nodes with priority 7 have the highest priority and are
        selected first, while nodes with priority 0 have the lowest
        priority. Nodes that have the same priority are treated equally.
    * - ``tx_weight``
      - Allows use of the Weighted Fair Queuing arbitration scheme among
        siblings. This arbitration scheme can be used simultaneously with
        strict priority. Range 1-200. Only relative values matter for
        arbitration.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

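
To see why only relative weights matter, consider siblings of equal priority
that all have traffic to send. Under the usual Weighted Fair Queuing model
(an assumption here, not a statement about the exact hardware behavior), each
sibling receives a share proportional to its weight. The helper name
``wfq_shares`` and the node names are hypothetical.

.. code:: shell

    # Hypothetical helper: read "name weight" pairs on stdin and print each
    # sibling's share of the parent's bandwidth under ideal WFQ.
    wfq_shares() {
        awk '
            { name[NR] = $1; weight[NR] = $2; sum += $2 }
            END {
                for (i = 1; i <= NR; i++)
                    printf "%s %.1f%%\n", name[i], 100 * weight[i] / sum
            }'
    }

    # Weights 5 and 15 give the same split as weights 50 and 150.
    printf '%s\n' 'vf1 5' 'vf2 15' | wfq_shares
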
.. code:: shell

    # enable switchdev
    $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev

    # at this point driver should export internal hierarchy
    $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs

    $ devlink port function rate show
    pci/0000:4b:00.0/node_25: type node parent node_24
    pci/0000:4b:00.0/node_24: type node parent node_0
    pci/0000:4b:00.0/node_32: type node parent node_31
    pci/0000:4b:00.0/node_31: type node parent node_30
    pci/0000:4b:00.0/node_30: type node parent node_16
    pci/0000:4b:00.0/node_19: type node parent node_18
    pci/0000:4b:00.0/node_18: type node parent node_17
    pci/0000:4b:00.0/node_17: type node parent node_16
    pci/0000:4b:00.0/node_14: type node parent node_5
    pci/0000:4b:00.0/node_5: type node parent node_3
    pci/0000:4b:00.0/node_13: type node parent node_4
    pci/0000:4b:00.0/node_12: type node parent node_4
    pci/0000:4b:00.0/node_11: type node parent node_4
    pci/0000:4b:00.0/node_10: type node parent node_4
    pci/0000:4b:00.0/node_9: type node parent node_4
    pci/0000:4b:00.0/node_8: type node parent node_4
    pci/0000:4b:00.0/node_7: type node parent node_4
    pci/0000:4b:00.0/node_6: type node parent node_4
    pci/0000:4b:00.0/node_4: type node parent node_3
    pci/0000:4b:00.0/node_3: type node parent node_16
    pci/0000:4b:00.0/node_16: type node parent node_15
    pci/0000:4b:00.0/node_15: type node parent node_0
    pci/0000:4b:00.0/node_2: type node parent node_1
    pci/0000:4b:00.0/node_1: type node parent node_0
    pci/0000:4b:00.0/node_0: type node
    pci/0000:4b:00.0/1: type leaf parent node_25
    pci/0000:4b:00.0/2: type leaf parent node_25

    # let's create some custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0

    # second custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom

    # reassign second VF to newly created branch
    $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1

    # assign tx_weight to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5

    # assign tx_share to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps

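
The remaining attributes can be set in the same way. The values below are
arbitrary examples that reuse the device and port from the listing above, and
they assume a devlink utility that already understands ``tx_priority``:

.. code:: shell

    # cap the VF at an example rate
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_max 1000Mbps

    # give the VF the highest strict priority among its siblings
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_priority 7

    # review the resulting hierarchy
    $ devlink port function rate show
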