.. SPDX-License-Identifier: GPL-2.0

=================
Device Memory TCP
=================


Intro
=====

Device memory TCP (devmem TCP) enables receiving data directly into device
memory (dmabuf). The feature is currently implemented for TCP sockets.


Opportunity
-----------

A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the prevalence of such
transfers. Some examples include:

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data.

- Distributed raw block storage applications transfer large amounts of data with
  remote SSDs. Much of this data does not require host processing.

Typically, device-to-device data transfers over the network are implemented as
the following low-level operations: Device-to-Host copy, Host-to-Host network
transfer, and Host-to-Device copy.

The flow involving host copies is suboptimal, especially for bulk data transfers,
and can put significant strain on system resources such as host memory
bandwidth and PCIe bandwidth.

Devmem TCP optimizes this use case by implementing socket APIs that enable
the user to receive incoming network packets directly into device memory.

Packet payloads go directly from the NIC to device memory.

Packet headers go to host memory and are processed by the TCP/IP stack
normally. The NIC must support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
  level of the PCIe tree, compared to the traditional path which sends data
  through the root complex.


More Info
---------

  slides, video
    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html

  patchset
    [PATCH net-next v24 00/13] Device Memory TCP
    https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/


RX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of
setting up the RX path of this API.


NIC Setup
---------

Header split, flow steering, & RSS are required features for devmem TCP.

Header split is used to split incoming packets into a header buffer in host
memory, and a payload buffer in device memory.

Flow steering & RSS are used to ensure that only flows targeting devmem land on
an RX queue bound to devmem.

Enable header split & flow steering::

        # enable header split
        ethtool -G eth1 tcp-data-split on

        # enable flow steering
        ethtool -K eth1 ntuple on

Configure RSS to steer all traffic away from the target RX queue (queue 15 in
this example)::

        ethtool --set-rxfh-indir eth1 equal 15
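
To see why ``equal 15`` keeps RSS-distributed traffic off queue 15, the sketch
below fills an indirection table with an even spread over the first N queues,
so that entry i maps to queue i % N. With N = 15 only queues 0 to 14 appear in
the table. The helper names and the table size of 128 used in the usage note
are assumptions for illustration; the real table size is device-specific, and
the table is programmed through ethtool and the driver, not by application
code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: spread table entries evenly over queues
 * 0..n_queues-1. Does nothing if n_queues is zero.
 */
static void rss_fill_equal(uint32_t *indir, size_t size, uint32_t n_queues)
{
        if (n_queues == 0)
                return;

        for (size_t i = 0; i < size; i++)
                indir[i] = (uint32_t)(i % n_queues);
}

/* Returns 1 if any table entry points at the given queue, 0 otherwise. */
static int rss_table_uses_queue(const uint32_t *indir, size_t size,
                                uint32_t queue)
{
        for (size_t i = 0; i < size; i++)
                if (indir[i] == queue)
                        return 1;
        return 0;
}
```

Filling a 128-entry table with ``rss_fill_equal(indir, 128, 15)`` leaves
``rss_table_uses_queue(indir, 128, 15)`` false, so only explicitly steered
flows can reach queue 15.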

The user must bind a dmabuf to any number of RX queues on a given NIC using
the netlink API::

        /* Bind dmabuf to NIC RX queue 15 */
        struct netdev_queue *queues;
        queues = malloc(sizeof(*queues) * 1);

        queues[0]._present.type = 1;
        queues[0]._present.idx = 1;
        queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
        queues[0].idx = 15;

        *ys = ynl_sock_create(&ynl_netdev_family, &yerr);

        req = netdev_bind_rx_req_alloc();
        netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
        netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
        /* n_queue_index is the number of entries in queues[] (1 here) */
        __netdev_bind_rx_req_set_queues(req, queues, n_queue_index);

        rsp = netdev_bind_rx(*ys, req);

        dmabuf_id = rsp->dmabuf_id;

The netlink API returns a dmabuf_id: a unique ID that refers to the dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. This ensures that the binding is released
automatically even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.


Socket Setup
------------

The socket must be flow steered to the dmabuf bound RX queue::

        ethtool -N eth1 flow-type tcp4 ... queue 15


Receiving data
--------------

The user application must signal to the kernel that it is capable of receiving
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::

        ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);

Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
on devmem data.

Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs::

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                if (cm->cmsg_level != SOL_SOCKET ||
                    (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
                     cm->cmsg_type != SCM_DEVMEM_LINEAR))
                        continue;

                dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);

                if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
                        /* Frag landed in dmabuf.
                         *
                         * dmabuf_cmsg->dmabuf_id is the dmabuf the
                         * frag landed on.
                         *
                         * dmabuf_cmsg->frag_offset is the offset into
                         * the dmabuf where the frag starts.
                         *
                         * dmabuf_cmsg->frag_size is the size of the
                         * frag.
                         *
                         * dmabuf_cmsg->frag_token is a token used to
                         * refer to this frag for later freeing.
                         */

                        struct dmabuf_token token;
                        token.token_start = dmabuf_cmsg->frag_token;
                        token.token_count = 1;
                        continue;
                }

                if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
                        /* Frag landed in linear buffer.
                         *
                         * dmabuf_cmsg->frag_size is the size of the
                         * frag.
                         */
                        continue;

        }

Applications may receive 2 cmsgs:

- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
  by dmabuf_id.

- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
  This typically happens when the NIC is unable to split the packet at the
  header boundary, such that part (or all) of the payload landed in host
  memory.

Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
regular TCP data that landed on an RX queue not bound to a dmabuf.


Freeing frags
-------------

Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
processes the frag. The user must return the frag to the kernel via
SO_DEVMEM_DONTNEED::

        ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
                         sizeof(token));

The user must ensure the tokens are returned to the kernel in a timely manner.
Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
and will lead to packet drops.

The user must pass no more than 128 tokens, with no more than 1024 total frags
among the token->token_count across all the tokens. If the user provides more
than 1024 frags, the kernel will free up to 1024 frags and return early.

The kernel returns the number of actual frags freed. The number of frags freed
can be less than the number of frags the user asked to free in case of:

(a) an internal kernel leak bug.
(b) the user passed more than 1024 frags.
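
The two limits above suggest batching tokens before each SO_DEVMEM_DONTNEED
call. The following is a minimal sketch of one way to do that, not the
approach ncdevmem takes: it merges consecutive frag_token values into ranges
and stops once either 128 ranges or 1024 frags have been collected. The helper
name, the ``struct frag_range`` type (a local stand-in assumed to mirror the
token_start/token_count pair shown earlier) and the macro names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEVMEM_MAX_TOKENS 128   /* per-call token limit stated in the text */
#define DEVMEM_MAX_FRAGS  1024  /* per-call frag limit stated in the text */

/* Local stand-in for the two-field token used in the text (assumption). */
struct frag_range {
        uint32_t token_start;
        uint32_t token_count;
};

/*
 * Coalesce frag tokens into ranges for a single SO_DEVMEM_DONTNEED call.
 * Consecutive token values are merged into one range. 'out' must have room
 * for DEVMEM_MAX_TOKENS entries. Returns the number of input tokens
 * consumed; *n_out receives the number of ranges written.
 */
static size_t batch_frag_tokens(const uint32_t *tokens, size_t n,
                                struct frag_range *out, size_t *n_out)
{
        size_t used = 0, ranges = 0;

        while (used < n && used < DEVMEM_MAX_FRAGS) {
                struct frag_range *last = ranges ? &out[ranges - 1] : NULL;

                if (last &&
                    tokens[used] == last->token_start + last->token_count) {
                        /* Extends the current run of consecutive tokens. */
                        last->token_count++;
                } else {
                        if (ranges == DEVMEM_MAX_TOKENS)
                                break;
                        out[ranges].token_start = tokens[used];
                        out[ranges].token_count = 1;
                        ranges++;
                }
                used++;
        }

        *n_out = ranges;
        return used;
}
```

A caller would invoke the helper repeatedly, advancing by the returned count
and issuing one setsockopt per batch, until every token has been handed back.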


TX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of
setting up the TX path of this API.


NIC Setup
---------

The user must bind a TX dmabuf to a given NIC using the netlink API::

        struct netdev_bind_tx_req *req = NULL;
        struct netdev_bind_tx_rsp *rsp = NULL;
        struct ynl_error yerr;

        *ys = ynl_sock_create(&ynl_netdev_family, &yerr);

        req = netdev_bind_tx_req_alloc();
        netdev_bind_tx_req_set_ifindex(req, ifindex);
        netdev_bind_tx_req_set_fd(req, dmabuf_fd);

        rsp = netdev_bind_tx(*ys, req);

        tx_dmabuf_id = rsp->id;

The netlink API returns a dmabuf_id: a unique ID that refers to the dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. This ensures that the binding is released
automatically even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.


Socket Setup
------------

The user application must use the MSG_ZEROCOPY flag when sending devmem TCP.
Devmem cannot be copied by the kernel, so the semantics of the devmem TX are
similar to the semantics of MSG_ZEROCOPY::

        setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));

It is also recommended that the user binds the TX socket to the same interface
the dma-buf has been bound to via SO_BINDTODEVICE::

        setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname) + 1);


Sending data
------------

Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg.

The user should create a msghdr where:

* iov_base is set to the offset into the dmabuf to start sending from
* iov_len is set to the number of bytes to be sent from the dmabuf

The user passes the dma-buf id to send from via the dmabuf_tx_cmsg.dmabuf_id.

The example below sends 1024 bytes from offset 100 into the dmabuf, and 2048
bytes from offset 2000 into the dmabuf. The dmabuf to send from is
tx_dmabuf_id::

        char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))];
        struct dmabuf_tx_cmsg ddmabuf;
        struct msghdr msg = {};
        struct cmsghdr *cmsg;
        struct iovec iov[2];

        iov[0].iov_base = (void*)100;
        iov[0].iov_len = 1024;
        iov[1].iov_base = (void*)2000;
        iov[1].iov_len = 2048;

        msg.msg_iov = iov;
        msg.msg_iovlen = 2;

        msg.msg_control = ctrl_data;
        msg.msg_controllen = sizeof(ctrl_data);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
        cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg));

        ddmabuf.dmabuf_id = tx_dmabuf_id;

        *((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf;

        sendmsg(socket_fd, &msg, MSG_ZEROCOPY);
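
Because iov_base carries an offset rather than a pointer, a mistake in the
offset arithmetic is not caught by the compiler or by a local dereference. A
minimal sketch of a sanity check an application might run before calling
sendmsg is shown below. It assumes the application knows the size of the
dmabuf it bound; ``dmabuf_size`` and the helper name are hypothetical and not
part of the devmem TCP API.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Returns 1 if every iovec, read as an (offset, length) pair into a dmabuf
 * of dmabuf_size bytes, stays inside the buffer; 0 otherwise. The
 * comparison is written to avoid overflow in offset + length.
 */
static int devmem_iov_in_bounds(const struct iovec *iov, size_t iovcnt,
                                size_t dmabuf_size)
{
        for (size_t i = 0; i < iovcnt; i++) {
                size_t off = (size_t)(uintptr_t)iov[i].iov_base;

                if (off > dmabuf_size || iov[i].iov_len > dmabuf_size - off)
                        return 0;
        }
        return 1;
}
```

For the two ranges used in the example, the check passes for a 4096-byte
dmabuf and fails for a 4000-byte one, since the second range ends at byte 4048.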


Reusing TX dmabufs
------------------

Similar to MSG_ZEROCOPY with regular memory, the user should not modify the
contents of the dma-buf while a send operation is in progress. This is because
the kernel does not keep a copy of the dmabuf contents. Instead, the kernel
will pin and send data from the buffer available to the userspace.

Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send completions
using MSG_ERRQUEUE::

        int64_t tstop = gettimeofday_ms() + waittime_ms;
        char control[CMSG_SPACE(100)] = {};
        struct sock_extended_err *serr;
        struct msghdr msg = {};
        struct cmsghdr *cm;
        int retries = 10;
        __u32 hi, lo;

        while (gettimeofday_ms() < tstop) {
                if (!do_poll(fd)) continue;

                /* recvmsg() updates msg_controllen, so reset it every pass */
                msg.msg_control = control;
                msg.msg_controllen = sizeof(control);

                ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
                if (ret < 0)
                        continue;

                for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                        serr = (void *)CMSG_DATA(cm);

                        hi = serr->ee_data;
                        lo = serr->ee_info;

                        fprintf(stdout, "tx complete [%u,%u]\n", lo, hi);
                }
        }

After the associated sendmsg has been completed, the dmabuf can be reused by
the userspace.
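
To decide which regions of the dmabuf may be reused, an application has to map
each completion range back to the sends it issued. The sketch below assumes
devmem TX completions follow the regular MSG_ZEROCOPY numbering, where each
zerocopy send on a socket gets the next value of a 32-bit counter and a
notification reports an inclusive range [lo, hi] of those values. The helper
names are hypothetical.

```c
#include <stdint.h>

/*
 * Number of send calls covered by an inclusive [lo, hi] completion range.
 * The IDs are 32-bit counters, so unsigned arithmetic handles wraparound.
 */
static uint32_t zc_completed_count(uint32_t lo, uint32_t hi)
{
        return hi - lo + 1;
}

/* Returns 1 if send number 'id' falls inside the inclusive range [lo, hi]. */
static int zc_range_contains(uint32_t lo, uint32_t hi, uint32_t id)
{
        return (uint32_t)(id - lo) <= (uint32_t)(hi - lo);
}
```

An application that records the send number alongside each (offset, length)
region it queued can call ``zc_range_contains`` on every notification and mark
the matching regions as free for reuse.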


Implementation & Caveats
========================

Unreadable skbs
---------------

Devmem payloads are inaccessible to the kernel processing the packets. This
results in a few quirks for payloads of devmem skbs:

- Loopback is not functional. Loopback relies on copying the payload, which is
  not possible with devmem skbs.

- Software checksum calculation fails.

- tcpdump and BPF can't access devmem packet payloads.


Testing
=======

More realistic example code can be found in the kernel source under
``tools/testing/selftests/drivers/net/hw/ncdevmem.c``

ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
receives data directly into a udmabuf.

To run ncdevmem, run it as a server on the machine under test, and run
netcat on a peer to provide the TX data.

ncdevmem also has a validation mode that expects a repeating pattern of
incoming data and validates it as such. For example, you can launch
ncdevmem on the server with::

        ncdevmem -s <server IP> -c <client IP> -f <ifname> -l -p 5201 -v 7

On the client side, use regular netcat to send TX data to the ncdevmem process
on the server::

        yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
                tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
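
The client pipeline above emits the seven bytes 01 02 03 04 05 06 00 over and
over, which is the pattern a period of 7 refers to. A minimal sketch of how
such a stream could be checked chunk by chunk follows. It is an illustration
under that assumption, not ncdevmem's actual code, and it only applies to
buffers the CPU can read, such as the udmabuf mentioned above. The helper name
is hypothetical.

```c
#include <stddef.h>

/*
 * Check that buf continues the repeating byte pattern 1, 2, ..., period-1, 0.
 * *seed is the byte expected next and is carried across calls, so a stream
 * can be validated one chunk at a time (start with *seed = 1). period must
 * be non-zero. Returns the index of the first mismatch, or len if the whole
 * chunk matches.
 */
static size_t validate_pattern(const unsigned char *buf, size_t len,
                               unsigned char *seed, unsigned char period)
{
        for (size_t i = 0; i < len; i++) {
                if (buf[i] != *seed)
                        return i;
                *seed = (unsigned char)((*seed + 1) % period);
        }
        return len;
}
```

Carrying the seed across calls matters because recvmsg can return frags that
end anywhere inside the seven-byte period.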