.. SPDX-License-Identifier: GPL-2.0

=================================
NVMe PCI Endpoint Function Target
=================================

:Author: Damien Le Moal <dlemoal@kernel.org>

The NVMe PCI endpoint function target driver implements an NVMe PCIe controller
using an NVMe fabrics target controller configured with the PCI transport type.

Overview
========

The NVMe PCI endpoint function target driver allows exposing an NVMe target
controller over a PCIe link, thus implementing an NVMe PCIe device similar to a
regular M.2 SSD. The target controller is created in the same manner as when
using NVMe over fabrics: the controller represents the interface to an NVMe
subsystem using a port. The port transport type must be configured to be
"pci". The subsystem can be configured to have namespaces backed by regular
files or block devices, or can use NVMe passthrough to expose to the PCI host an
existing physical NVMe device or an NVMe fabrics host controller (e.g. an NVMe
TCP host controller).

The NVMe PCI endpoint function target driver relies as much as possible on the
NVMe target core code to parse and execute NVMe commands submitted by the PCIe
host. However, using the PCI endpoint framework API and DMA API, the driver is
also responsible for managing all data transfers over the PCIe link. This
implies that the NVMe PCI endpoint function target driver implements the
management of several NVMe data structures and some NVMe command parsing.

1) The driver manages retrieval of NVMe commands in submission queues using DMA
   if supported, or MMIO otherwise. Each command retrieved is then executed
   using a work item to maximize performance with the parallel execution of
   multiple commands on different CPUs. The driver uses a work item to
   constantly poll the doorbell of all submission queues to detect command
   submissions from the PCIe host.

2) The driver transfers completion queue entries of completed commands to the
   PCIe host using MMIO copy of the entries in the host completion queue.
   After posting completion entries in a completion queue, the driver uses the
   PCI endpoint framework API to raise an interrupt to the host to signal the
   command completions.

3) For any command that has a data buffer, the NVMe PCI endpoint target driver
   parses the command PRP or SGL lists to create a list of PCI address
   segments representing the mapping of the command data buffer on the host.
   The command data buffer is transferred over the PCIe link using this list of
   PCI address segments using DMA, if supported. If DMA is not supported, MMIO
   is used, which results in poor performance. For write commands, the command
   data buffer is transferred from the host into a local memory buffer before
   executing the command using the target core code. For read commands, a local
   memory buffer is allocated to execute the command and the content of that
   buffer is transferred to the host once the command completes.

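To illustrate the segment list mentioned in item 3, the sketch below counts how
many PRP entries, and therefore at most how many PCI address segments, describe
a data buffer. It relies only on the NVMe PRP layout (the first entry may start
at an offset within a memory page, all following entries are page aligned). The
helper name ``prp_entry_count`` and the 4 KiB default page size are assumptions
made for illustration and are not taken from the driver code.

```shell
#!/bin/sh
# Hypothetical helper, not driver code: number of PRP entries needed to
# describe a data buffer.
#   $1: PCI address of the first PRP entry (may be offset within a page)
#   $2: buffer length in bytes
#   $3: memory page size in bytes (assumed to be 4096 if omitted)
prp_entry_count() {
    addr=$1
    len=$2
    page_size=${3:-4096}

    # Only the first entry may start at a non-zero offset within a page.
    offset=$(($addr % $page_size))

    # Round (offset + length) up to a whole number of pages.
    echo $((($offset + $len + $page_size - 1) / $page_size))
}

# An 8 KiB buffer starting 512 bytes into a page spans 3 pages.
prp_entry_count 0x1000200 8192
```

Each entry beyond the first points to a different host page, so a buffer that
is not page aligned costs one extra segment compared to an aligned one.
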
Controller Capabilities
-----------------------

The NVMe capabilities exposed to the PCIe host through the BAR 0 registers
are almost identical to the capabilities of the NVMe target controller
implemented by the target core code. There are some exceptions.

1) The NVMe PCI endpoint target driver always sets the controller capability
   CQR bit to request "Contiguous Queues Required". This is to facilitate the
   mapping of a queue PCI address range to the local CPU address space.

2) The doorbell stride (DSTRD) is always set to be 4 bytes.

3) Since the PCI endpoint framework does not provide a way to handle PCI level
   resets, the controller capability NSSR bit (NVM Subsystem Reset Supported)
   is always cleared.

4) The Boot Partition Support (BPS), Persistent Memory Region Supported (PMRS)
   and Controller Memory Buffer Supported (CMBS) capabilities are never
   reported.

Supported Features
------------------

The NVMe PCI endpoint target driver implements support for both PRPs and SGLs.
The driver also implements IRQ vector coalescing and submission queue
arbitration burst.

The maximum number of queues and the maximum data transfer size (MDTS) are
configurable through configfs before starting the controller. To avoid issues
with excessive local memory usage for executing commands, MDTS defaults to 512
KB and is limited to a maximum of 2 MB (arbitrary limit).

Minimum Number of PCI Address Mapping Windows Required
------------------------------------------------------

Most PCI endpoint controllers provide a limited number of mapping windows for
mapping a PCI address range to local CPU memory addresses. The NVMe PCI
endpoint target controller uses mapping windows for the following:

1) One memory window for raising MSI or MSI-X interrupts
2) One memory window for MMIO transfers
3) One memory window for each completion queue

Given the highly asynchronous nature of the NVMe PCI endpoint target driver
operation, the memory windows described above are generally not all used
simultaneously, but that may happen. So a safe maximum number of completion
queues that can be supported is equal to the total number of memory mapping
windows of the PCI endpoint controller minus two. E.g. for an endpoint PCI
controller with 32 outbound memory windows available, up to 30 completion
queues can be safely operated without any risk of getting PCI address mapping
errors due to the lack of memory windows.

Maximum Number of Queue Pairs
-----------------------------

Upon binding of the NVMe PCI endpoint target driver to the PCI endpoint
controller, BAR 0 is allocated with enough space to accommodate the admin queue
and multiple I/O queues. The maximum number of I/O queue pairs that can be
supported is limited by several factors.

1) The NVMe target core code limits the maximum number of I/O queues to the
   number of online CPUs.
2) The total number of queue pairs, including the admin queue, cannot exceed
   the number of MSI-X or MSI vectors available.
3) The total number of completion queues must not exceed the total number of
   PCI mapping windows minus 2 (see above).

The NVMe endpoint function driver allows configuring the maximum number of
queue pairs through configfs.

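The three limits above can be combined into a single upper bound. The following
is a minimal sketch, not part of any kernel tool: the helper name
``max_io_queue_pairs`` is hypothetical, and it assumes that the admin queue pair
consumes one interrupt vector and that the admin completion queue counts against
the mapping window limit.

```shell
#!/bin/sh
# Hypothetical helper: upper bound on the number of I/O queue pairs.
#   $1: number of online CPUs
#   $2: number of MSI-X or MSI vectors available
#   $3: number of PCI mapping windows of the endpoint controller
# Assumption: the admin queue pair uses one vector, and the admin
# completion queue counts against the mapping window limit.
max_io_queue_pairs() {
    cpus=$1
    vectors=$2
    windows=$3

    by_vectors=$((vectors - 1))      # queue pairs, admin included, <= vectors
    by_windows=$((windows - 2 - 1))  # completion queues <= windows - 2

    max=$cpus
    if [ "$by_vectors" -lt "$max" ]; then max=$by_vectors; fi
    if [ "$by_windows" -lt "$max" ]; then max=$by_windows; fi
    if [ "$max" -lt 0 ]; then max=0; fi
    echo "$max"
}

# 8 online CPUs, 32 vectors, 32 mapping windows: the CPU count is the limit.
max_io_queue_pairs 8 32 32
```

On the endpoint, the first argument could be taken from ``nproc``. The other two
values depend on the endpoint controller hardware.
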
Limitations and NVMe Specification Non-Compliance
-------------------------------------------------

Similar to the NVMe target core code, the NVMe PCI endpoint target driver does
not support multiple submission queues using the same completion queue. All
submission queues must specify a unique completion queue.

User Guide
==========

This section describes the hardware requirements and how to set up an NVMe PCI
endpoint target device.

Kernel Requirements
-------------------

The kernel must be compiled with the configuration options CONFIG_PCI_ENDPOINT,
CONFIG_PCI_ENDPOINT_CONFIGFS, and CONFIG_NVME_TARGET_PCI_EPF enabled.
CONFIG_PCI, CONFIG_BLK_DEV_NVME and CONFIG_NVME_TARGET must also be enabled.

In addition to this, at least one PCI endpoint controller driver should be
available for the endpoint hardware used.

To facilitate testing, enabling the null-blk driver (CONFIG_BLK_DEV_NULL_BLK)
is also recommended. With this, a simple setup using a null_blk block device
as a subsystem namespace can be used.

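As a quick way to verify the options listed above, the sketch below scans a
kernel configuration file and reports the required options that are not
enabled. The helper name ``check_nvmet_epf_config`` is hypothetical, and the
location of the configuration file (for example ``/boot/config-$(uname -r)``) is
an assumption that depends on how the endpoint kernel was built and installed.

```shell
#!/bin/sh
# Hypothetical helper: report which required options are missing from a
# kernel configuration file. An option counts as enabled when it is set
# to "y" (built-in) or "m" (module).
#   $1: path to the kernel configuration file
check_nvmet_epf_config() {
    config_file=$1
    missing=0
    for opt in CONFIG_PCI CONFIG_PCI_ENDPOINT CONFIG_PCI_ENDPOINT_CONFIGFS \
               CONFIG_BLK_DEV_NVME CONFIG_NVME_TARGET \
               CONFIG_NVME_TARGET_PCI_EPF; do
        if ! grep -Eq "^${opt}=(y|m)\$" "$config_file"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    # Return 0 only if every required option was found.
    return "$missing"
}

# Example (the path is an assumption):
#   check_nvmet_epf_config "/boot/config-$(uname -r)"
```
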
Hardware Requirements
---------------------

To use the NVMe PCI endpoint target driver, at least one endpoint controller
device is required.

To find the list of endpoint controller devices in the system::

        # ls /sys/class/pci_epc/
        a40000000.pcie-ep

If CONFIG_PCI_ENDPOINT_CONFIGFS is enabled::

        # ls /sys/kernel/config/pci_ep/controllers
        a40000000.pcie-ep

The endpoint board must also be connected to a host with a PCI cable with the
RX-TX signals swapped. If the host PCI slot used does not have
plug-and-play capabilities, the host should be powered off when the NVMe PCI
endpoint device is configured.

NVMe Endpoint Device
--------------------

Creating an NVMe endpoint device is a two-step process. First, an NVMe target
subsystem and port must be defined. Second, the NVMe PCI endpoint device must
be set up and bound to the subsystem and port created.

Creating an NVMe Subsystem and Port
-----------------------------------

Details about how to configure an NVMe target subsystem and port are outside the
scope of this document. The following only provides a simple example of a port
and subsystem with a single namespace backed by a null_blk device.

First, make sure that configfs is mounted::

        # mount -t configfs none /sys/kernel/config

Next, create a null_blk device (default settings give a 250 GB device without
memory backing). The block device created will be /dev/nullb0 by default::

        # modprobe null_blk
        # ls /dev/nullb0
        /dev/nullb0

The NVMe PCI endpoint function target driver must be loaded::

        # modprobe nvmet_pci_epf
        # lsmod | grep nvmet
        nvmet_pci_epf          32768  0
        nvmet                 118784  1 nvmet_pci_epf
        nvme_core             131072  2 nvmet_pci_epf,nvmet

Now, create a subsystem and a port that we will use to create a PCI target
controller when setting up the NVMe PCI endpoint target device. In this
example, the subsystem is created with a maximum of 4 I/O queue pairs::

        # cd /sys/kernel/config/nvmet/subsystems
        # mkdir nvmepf.0.nqn
        # echo -n "Linux-pci-epf" > nvmepf.0.nqn/attr_model
        # echo "0x1b96" > nvmepf.0.nqn/attr_vendor_id
        # echo "0x1b96" > nvmepf.0.nqn/attr_subsys_vendor_id
        # echo 1 > nvmepf.0.nqn/attr_allow_any_host
        # echo 4 > nvmepf.0.nqn/attr_qid_max

Next, create and enable the subsystem namespace using the null_blk block
device::

        # mkdir nvmepf.0.nqn/namespaces/1
        # echo -n "/dev/nullb0" > nvmepf.0.nqn/namespaces/1/device_path
        # echo 1 > nvmepf.0.nqn/namespaces/1/enable

Finally, create the target port and link it to the subsystem::

        # cd /sys/kernel/config/nvmet/ports
        # mkdir 1
        # echo -n "pci" > 1/addr_trtype
        # ln -s /sys/kernel/config/nvmet/subsystems/nvmepf.0.nqn \
                /sys/kernel/config/nvmet/ports/1/subsystems/nvmepf.0.nqn

Creating an NVMe PCI Endpoint Device
------------------------------------

With the NVMe target subsystem and port ready for use, the NVMe PCI endpoint
device can now be created and enabled. The NVMe PCI endpoint target driver
should already be loaded (that is done automatically when the port is created)::

        # ls /sys/kernel/config/pci_ep/functions
        nvmet_pci_epf

Next, create function 0::

        # cd /sys/kernel/config/pci_ep/functions/nvmet_pci_epf
        # mkdir nvmepf.0
        # ls nvmepf.0/
        baseclass_code    msix_interrupts   secondary
        cache_line_size   nvme              subclass_code
        deviceid          primary           subsys_id
        interrupt_pin     progif_code       subsys_vendor_id
        msi_interrupts    revid             vendorid

Configure the function using any device ID (the vendor ID for the device will
be automatically set to the same value as the NVMe target subsystem vendor
ID)::

        # cd /sys/kernel/config/pci_ep/functions/nvmet_pci_epf
        # echo 0xBEEF > nvmepf.0/deviceid
        # echo 32 > nvmepf.0/msix_interrupts

If the PCI endpoint controller used does not support MSI-X, MSI can be
configured instead::

        # echo 32 > nvmepf.0/msi_interrupts

Next, bind the endpoint device to the target subsystem and port created
earlier::

        # echo 1 > nvmepf.0/nvme/portid
        # echo "nvmepf.0.nqn" > nvmepf.0/nvme/subsysnqn

The endpoint function can then be bound to the endpoint controller and the
controller started::

        # cd /sys/kernel/config/pci_ep
        # ln -s functions/nvmet_pci_epf/nvmepf.0 controllers/a40000000.pcie-ep/
        # echo 1 > controllers/a40000000.pcie-ep/start

On the endpoint machine, kernel messages will show information as the NVMe
target device and endpoint device are created and connected.

.. code-block:: text

        null_blk: disk nullb0 created
        null_blk: module loaded
        nvmet: adding nsid 1 to subsystem nvmepf.0.nqn
        nvmet_pci_epf nvmet_pci_epf.0: PCI endpoint controller supports MSI-X, 32 vectors
        nvmet: Created nvm controller 1 for subsystem nvmepf.0.nqn for NQN nqn.2014-08.org.nvmexpress:uuid:2ab90791-2246-4fbb-961d-4c3d5a5a0176.
        nvmet_pci_epf nvmet_pci_epf.0: New PCI ctrl "nvmepf.0.nqn", 4 I/O queues, mdts 524288 B

PCI Root-Complex Host
---------------------

Booting the PCI host will result in the initialization of the PCIe link (this
may be signaled by the PCI endpoint driver with a kernel message). A kernel
message on the endpoint will also signal when the host NVMe driver enables the
device controller::

        nvmet_pci_epf nvmet_pci_epf.0: Enabling controller

On the host side, the NVMe PCI endpoint function target device is
discoverable as a PCI device, with the vendor ID and device ID as configured::

        # lspci -n
        0000:01:00.0 0108: 1b96:beef

And this device will be recognized as an NVMe device with a single namespace::

        # lsblk
        NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
        nvme0n1     259:0    0   250G  0 disk

The NVMe endpoint block device can then be used as any other regular NVMe
namespace block device. The *nvme* command line utility can be used to get more
detailed information about the endpoint device::

        # nvme id-ctrl /dev/nvme0
        NVME Identify Controller:
        vid       : 0x1b96
        ssvid     : 0x1b96
        sn        : 94993c85650ef7bcd625
        mn        : Linux-pci-epf
        fr        : 6.13.0-r
        rab       : 6
        ieee      : 000000
        cmic      : 0xb
        mdts      : 7
        cntlid    : 0x1
        ver       : 0x20100
        ...

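The ``mdts : 7`` value reported on the host and the ``mdts 524288 B`` message
logged on the endpoint describe the same limit: NVMe reports MDTS as a power of
two, in units of the controller minimum memory page size. The sketch below shows
the conversion. The helper name ``mdts_to_bytes`` is hypothetical, and the 4 KiB
minimum page size is an assumption, chosen because it makes the two values in
this example agree.

```shell
#!/bin/sh
# Hypothetical helper: convert the MDTS field of Identify Controller
# into bytes.
#   $1: mdts value (power-of-two exponent, 0 means no limit is reported)
#   $2: minimum memory page size in bytes (assumed to be 4096 if omitted)
mdts_to_bytes() {
    mdts=$1
    page_size=${2:-4096}
    if [ "$mdts" -eq 0 ]; then
        echo "unlimited"
    else
        # page_size * 2^mdts
        echo $((page_size << mdts))
    fi
}

# 4096 * 2^7 = 524288 bytes, i.e. the 512 KiB default.
mdts_to_bytes 7
```
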
Endpoint Bindings
=================

The NVMe PCI endpoint target driver uses the PCI endpoint configfs device
attributes as follows.

================  ===========================================================
vendorid          Ignored (the vendor id of the NVMe target subsystem is used)
deviceid          Anything is OK (e.g. PCI_ANY_ID)
revid             Do not care
progif_code       Must be 0x02 (NVM Express)
baseclass_code    Must be 0x01 (PCI_BASE_CLASS_STORAGE)
subclass_code     Must be 0x08 (Non-Volatile Memory controller)
cache_line_size   Do not care
subsys_vendor_id  Ignored (the subsystem vendor id of the NVMe target subsystem
                  is used)
subsys_id         Anything is OK (e.g. PCI_ANY_ID)
msi_interrupts    At least equal to the number of queue pairs desired
msix_interrupts   At least equal to the number of queue pairs desired
interrupt_pin     Interrupt PIN to use if MSI and MSI-X are not supported
================  ===========================================================

The NVMe PCI endpoint target function also has some specific configurable
fields defined in the *nvme* subdirectory of the function directory. These
fields are as follows.

================  ===========================================================
mdts_kb           Maximum data transfer size in KiB (default: 512)
portid            The ID of the target port to use
subsysnqn         The NQN of the target subsystem to use
================  ===========================================================

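As an illustration of the ``mdts_kb`` field, the sketch below writes a value to
a function directory after checking it against the 2 MB (2048 KiB) maximum
stated in the Supported Features section. The helper name ``set_mdts_kb`` is
hypothetical, and the lower bound of 1 is an assumption; the driver may apply
additional constraints that this sketch does not model. As noted earlier, the
value has to be set before the controller is started.

```shell
#!/bin/sh
# Hypothetical helper: set mdts_kb for an endpoint function directory,
# refusing values above the 2048 KiB limit stated in this document.
#   $1: function directory, e.g.
#       /sys/kernel/config/pci_ep/functions/nvmet_pci_epf/nvmepf.0
#   $2: maximum data transfer size in KiB
set_mdts_kb() {
    func_dir=$1
    kb=$2
    # The lower bound of 1 is an assumption made for this sketch.
    if [ "$kb" -le 0 ] || [ "$kb" -gt 2048 ]; then
        echo "mdts_kb must be between 1 and 2048" >&2
        return 1
    fi
    echo "$kb" > "$func_dir/nvme/mdts_kb"
}

# Example, using the function directory created in the user guide:
#   set_mdts_kb /sys/kernel/config/pci_ep/functions/nvmet_pci_epf/nvmepf.0 1024
```
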