.. SPDX-License-Identifier: GPL-2.0

VMBus
=====

VMBus is a software construct provided by Hyper-V to guest VMs. It
consists of a control path and common facilities used by synthetic
devices that Hyper-V presents to guest VMs. The control path is
used to offer synthetic devices to the guest VM and, in some cases,
to rescind those devices. The common facilities include software
channels for communicating between the device driver in the guest VM
and the synthetic device implementation that is part of Hyper-V, and
signaling primitives to allow Hyper-V and the guest to interrupt
each other.

VMBus is modeled in Linux as a bus, with the expected /sys/bus/vmbus
entry in a running Linux guest. The VMBus driver (drivers/hv/vmbus_drv.c)
establishes the VMBus control path with the Hyper-V host, then
registers itself as a Linux bus driver. It implements the standard
bus functions for adding and removing devices to/from the bus.

Most synthetic devices offered by Hyper-V have a corresponding Linux
device driver. These devices include:

* SCSI controller
* NIC
* Graphics frame buffer
* Keyboard
* Mouse
* PCI device pass-thru
* Heartbeat
* Time Sync
* Shutdown
* Memory balloon
* Key/Value Pair (KVP) exchange with Hyper-V
* Hyper-V online backup (a.k.a. VSS)

Guest VMs may have multiple instances of the synthetic SCSI
controller, synthetic NIC, and PCI pass-thru devices. Other
synthetic devices are limited to a single instance per VM. Not
listed above are a small number of synthetic devices offered by
Hyper-V that are used only by Windows guests and for which Linux
does not have a driver.

Hyper-V uses the terms "VSP" and "VSC" in describing synthetic
devices. "VSP" refers to the Hyper-V code that implements a
particular synthetic device, while "VSC" refers to the driver for
the device in the guest VM. For example, the Linux driver for the
synthetic NIC is referred to as "netvsc" and the Linux driver for
the synthetic SCSI controller is "storvsc". These drivers contain
functions with names like "storvsc_connect_to_vsp".

VMBus channels
--------------

An instance of a synthetic device uses VMBus channels to communicate
between the VSP and the VSC. Channels are bi-directional and used
for passing messages. Most synthetic devices use a single channel,
but the synthetic SCSI controller and synthetic NIC may use multiple
channels to achieve higher performance and greater parallelism.

Each channel consists of two ring buffers. These are classic ring
buffers from a university data structures textbook. If the read
and write pointers are equal, the ring buffer is considered to be
empty, so a full ring buffer always has at least one byte unused.
The "in" ring buffer is for messages from the Hyper-V host to the
guest, and the "out" ring buffer is for messages from the guest to
the Hyper-V host. In Linux, the "in" and "out" designations are as
viewed by the guest side. The ring buffers are memory that is
shared between the guest and the host, and they follow the standard
paradigm where the memory is allocated by the guest, with the list
of GPAs that make up the ring buffer communicated to the host. Each
ring buffer consists of a header page (4 Kbytes) with the read and
write indices and some control flags, followed by the memory for the
actual ring. The size of the ring is determined by the VSC in the
guest and is specific to each synthetic device. The list of GPAs
making up the ring is communicated to the Hyper-V host over the
VMBus control path as a GPA Descriptor List (GPADL). See function
vmbus_establish_gpadl().

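
The empty/full convention described above can be illustrated with a
small user-space sketch. It is not the kernel's code: the structure
and function names are hypothetical, and only the index arithmetic
is shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified ring state; not the kernel's layout. */
struct ring_sketch {
	uint32_t read_index;	/* next byte the consumer will read */
	uint32_t write_index;	/* next byte the producer will write */
	uint32_t size;		/* bytes in the ring data area */
};

/* Bytes currently queued. Equal indices mean the ring is empty. */
static uint32_t ring_bytes_used(const struct ring_sketch *r)
{
	assert(r->size > 0);
	return (r->write_index + r->size - r->read_index) % r->size;
}

/*
 * Bytes that may still be written. One byte is always held back so
 * that a full ring can never look like an empty one.
 */
static uint32_t ring_bytes_free(const struct ring_sketch *r)
{
	return r->size - 1 - ring_bytes_used(r);
}
```

With a 4096-byte ring and equal indices, this sketch reports 0 bytes
used and 4095 bytes free, which is the "at least one byte unused"
property.
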
Each ring buffer is mapped into contiguous Linux kernel virtual
space in three parts: 1) the 4 Kbyte header page, 2) the memory
that makes up the ring itself, and 3) a second mapping of the memory
that makes up the ring itself. Because (2) and (3) are contiguous
in kernel virtual space, the code that copies data to and from the
ring buffer need not be concerned with ring buffer wrap-around.
Once a copy operation has completed, the read or write index may
need to be reset to point back into the first mapping, but the
actual data copy does not need to be broken into two parts. This
approach also allows complex data structures to be easily accessed
directly in the ring without handling wrap-around.

On arm64 with page sizes > 4 Kbytes, the header page must still be
passed to Hyper-V as a 4 Kbyte area. But the memory for the actual
ring must be aligned to PAGE_SIZE and have a size that is a multiple
of PAGE_SIZE so that the duplicate mapping trick can be done. Hence
a portion of the header page is unused and not communicated to
Hyper-V. This case is handled by vmbus_establish_gpadl().

Hyper-V enforces a limit on the aggregate amount of guest memory
that can be shared with the host via GPADLs. This limit ensures
that a rogue guest can't force the consumption of excessive host
resources. For Windows Server 2019 and later, this limit is
approximately 1280 Mbytes. For versions prior to Windows Server
2019, the limit is approximately 384 Mbytes.

VMBus channel messages
----------------------

All messages sent in a VMBus channel have a standard header that includes
the message length, the offset of the message payload, some flags, and a
transactionID. The portion of the message after the header is
unique to each VSP/VSC pair.

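
As an illustration of such a header, the sketch below is modeled on
the kernel's struct vmpacket_descriptor in include/linux/hyperv.h.
The field names, the leading type field, and the use of 8-byte units
for the offset and length are assumptions to check against that
header; the helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a channel message header (assumed layout, see above). */
struct pkt_header_sketch {
	uint16_t type;
	uint16_t offset8;	/* payload offset, in 8-byte units */
	uint16_t len8;		/* total message length, in 8-byte units */
	uint16_t flags;
	uint64_t trans_id;	/* the transactionID */
};

/* Start of the VSP/VSC-specific payload that follows the header. */
static const void *pkt_payload(const struct pkt_header_sketch *h)
{
	return (const uint8_t *)h + ((size_t)h->offset8 << 3);
}

/* Payload size in bytes: total length minus the payload offset. */
static size_t pkt_payload_len(const struct pkt_header_sketch *h)
{
	assert(h->len8 >= h->offset8);
	return ((size_t)h->len8 - h->offset8) << 3;
}
```

A receiver would typically validate offset8 and len8 against the
number of bytes actually read before using either helper.
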
Messages follow one of two patterns:

* Unidirectional: Either side sends a message and does not
  expect a response message
* Request/response: One side (usually the guest) sends a message
  and expects a response

The transactionID (a.k.a. "requestID") is for matching requests and
responses. Some synthetic devices allow multiple requests to be in
flight simultaneously, so the guest specifies a transactionID when
sending a request. Hyper-V sends back the same transactionID in the
matching response.

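
One way a driver could match responses to outstanding requests is a
small table indexed by transactionID. The sketch below is a
hypothetical user-space illustration; the table size and all names
are invented for the example and do not reflect the kernel's own
request tracking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 8	/* hypothetical limit for this sketch */

struct pending_table {
	void *slot[MAX_INFLIGHT];	/* NULL means the slot is free */
};

/*
 * Record an outgoing request. Returns the transactionID to place in
 * the message header, or UINT64_MAX if too many are in flight.
 */
static uint64_t pending_add(struct pending_table *t, void *request)
{
	assert(request != NULL);
	for (uint64_t id = 0; id < MAX_INFLIGHT; id++) {
		if (!t->slot[id]) {
			t->slot[id] = request;
			return id;
		}
	}
	return UINT64_MAX;
}

/*
 * Look up and release the request named by a response's
 * transactionID. Returns NULL for an unknown or stale ID.
 */
static void *pending_complete(struct pending_table *t, uint64_t trans_id)
{
	void *request;

	if (trans_id >= MAX_INFLIGHT)
		return NULL;
	request = t->slot[trans_id];
	t->slot[trans_id] = NULL;
	return request;
}
```

Rejecting out-of-range and already-completed IDs matters because the
transactionID in a response comes from the host side.
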
Messages passed between the VSP and VSC are control messages. For
example, a message sent from the storvsc driver might be "execute
this SCSI command". If a message also implies some data transfer
between the guest and the Hyper-V host, the actual data to be
transferred may be embedded with the control message, or it may be
specified as a separate data buffer that the Hyper-V host will
access as a DMA operation. The former case is used when the size of
the data is small and the cost of copying the data to and from the
ring buffer is minimal. For example, time sync messages from the
Hyper-V host to the guest contain the actual time value. When the
data is larger, a separate data buffer is used. In this case, the
control message contains a list of GPAs that describe the data
buffer. For example, the storvsc driver uses this approach to
specify the data buffers to/from which disk I/O is done.

Three functions exist to send VMBus channel messages:

1. vmbus_sendpacket(): Control-only messages and messages with
   embedded data -- no GPAs
2. vmbus_sendpacket_pagebuffer(): Message with list of GPAs
   identifying data to transfer. An offset and length are
   associated with each GPA so that multiple discontinuous areas
   of guest memory can be targeted.
3. vmbus_sendpacket_mpb_desc(): Message with list of GPAs
   identifying data to transfer. A single offset and length are
   associated with a list of GPAs. The GPAs must describe a
   single logical area of guest memory to be targeted.

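
The difference between the two GPA-list styles above can be sketched
as follows. The structures are simplified, hypothetical stand-ins for
the kernel's descriptor types, and a 4 Kbyte page size is assumed.
The helper expands one logical area (a single offset and length over
a list of pages) into per-page entries that each carry their own
offset and length.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */
#define SKETCH_MAX_PAGES 8	/* hypothetical limit */

/* Per-page style: every entry has its own offset and length. */
struct page_range_sketch {
	uint32_t len;
	uint32_t offset;
	uint64_t pfn;
};

/* Single-area style: one offset and length for the whole list. */
struct multi_page_sketch {
	uint32_t offset;	/* offset into the first page */
	uint32_t len;		/* total bytes in the area */
	uint32_t pfn_count;
	uint64_t pfn[SKETCH_MAX_PAGES];
};

/* Expand a single area into per-page entries; returns entry count. */
static uint32_t split_per_page(const struct multi_page_sketch *d,
			       struct page_range_sketch *out)
{
	uint32_t offset = d->offset, remaining = d->len, n = 0;

	assert(offset < SKETCH_PAGE_SIZE);
	while (remaining > 0 && n < d->pfn_count) {
		uint32_t chunk = SKETCH_PAGE_SIZE - offset;

		if (chunk > remaining)
			chunk = remaining;
		out[n].pfn = d->pfn[n];
		out[n].offset = offset;
		out[n].len = chunk;
		remaining -= chunk;
		offset = 0;	/* only the first page starts mid-page */
		n++;
	}
	return n;
}
```

The reverse conversion is not always possible, since per-page entries
may describe unrelated, discontinuous areas.
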
Historically, Linux guests have trusted Hyper-V to send well-formed
and valid messages, and Linux drivers for synthetic devices did not
fully validate messages. With the introduction of processor
technologies that fully encrypt guest memory and that allow the
guest to not trust the hypervisor (AMD SEV-SNP, Intel TDX), trusting
the Hyper-V host is no longer a valid assumption. The drivers for
VMBus synthetic devices are being updated to fully validate any
values read from memory that is shared with Hyper-V, which includes
messages from VMBus devices. To facilitate such validation,
messages read by the guest from the "in" ring buffer are copied to a
temporary buffer that is not shared with Hyper-V. Validation is
performed in this temporary buffer without the risk of Hyper-V
maliciously modifying the message after it is validated but before
it is used.

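
The copy-then-validate pattern can be sketched in user-space C as
shown below. The message layout, the size cap, and the function name
are hypothetical; the point is only that every check reads the
private copy and never the shared memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MSG_MAX 256	/* hypothetical cap on message size */

/* Hypothetical message: a sender-supplied length, then payload. */
struct msg_sketch {
	uint32_t payload_len;
	uint8_t payload[MSG_MAX - sizeof(uint32_t)];
};

/*
 * Copy `avail` bytes from shared memory into a private buffer, then
 * validate the private copy. Returns 0 if the message is usable,
 * -1 otherwise.
 */
static int receive_msg(const void *shared, size_t avail,
		       struct msg_sketch *priv)
{
	assert(shared != NULL && priv != NULL);
	if (avail < sizeof(uint32_t) || avail > sizeof(*priv))
		return -1;

	/* Single copy: later changes to `shared` cannot affect us. */
	memcpy(priv, shared, avail);

	/* Validate fields in the copy, never in shared memory. */
	if (priv->payload_len > avail - sizeof(uint32_t))
		return -1;
	return 0;
}
```

Checking payload_len in shared memory and then reading it again for
use would reopen the window that the temporary buffer is meant to
close.
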
Synthetic Interrupt Controller (synic)
--------------------------------------

Hyper-V provides each guest CPU with a synthetic interrupt controller
that is used by VMBus for host-guest communication. While each synic
defines 16 synthetic interrupts (SINT), Linux uses only one of the 16
(VMBUS_MESSAGE_SINT). All interrupts related to communication between
the Hyper-V host and a guest CPU use that SINT.

The SINT is mapped to a single per-CPU architectural interrupt (i.e.,
an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
each CPU in the guest has a synic and may receive VMBus interrupts,
they are best modeled in Linux as per-CPU interrupts. This model works
well on arm64 where a single per-CPU Linux IRQ is allocated for
VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
"Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
across all CPUs and explicitly coded to call vmbus_isr(). In this case,
there's no Linux IRQ, and the interrupts are visible in aggregate in
/proc/interrupts on the "HYP" line.

The synic provides the means to demultiplex the architectural interrupt into
one or more logical interrupts and route the logical interrupt to the proper
VMBus handler in Linux. This demultiplexing is done by vmbus_isr() and
related functions that access synic data structures.

The synic is not modeled in Linux as an irq chip or irq domain,
and the demultiplexed logical interrupts are not Linux IRQs. As such,
they don't appear in /proc/interrupts or /proc/irq. The CPU
affinity for one of these logical interrupts is controlled via an
entry under /sys/bus/vmbus as described below.

VMBus interrupts
----------------

VMBus provides a mechanism for the guest to interrupt the host when
the guest has queued new messages in a ring buffer. The host
expects that the guest will send an interrupt only when an "out"
ring buffer transitions from empty to non-empty. If the guest sends
interrupts at other times, the host deems such interrupts to be
unnecessary. If a guest sends an excessive number of unnecessary
interrupts, the host may throttle that guest by suspending its
execution for a few seconds to prevent a denial-of-service attack.

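
The "signal only on empty to non-empty" rule can be sketched as a
write routine that reports whether the ring was empty before the
write. This is a simplified, single-threaded user-space illustration
with hypothetical names. It omits the memory barriers and the
interrupt-masking flags that a real implementation needs, and it
copies byte by byte rather than relying on a duplicate mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "out" ring as seen by the guest (the producer). */
struct out_ring_sketch {
	uint32_t read_index;	/* advanced by the host */
	uint32_t write_index;	/* advanced by the guest */
	uint32_t size;
	uint8_t *data;
};

/*
 * Queue `len` bytes. Returns false if there is not enough room.
 * On success, *signal is true only if the ring was empty before
 * this write, i.e. only then should the host be interrupted.
 */
static bool out_ring_write(struct out_ring_sketch *r, const uint8_t *msg,
			   uint32_t len, bool *signal)
{
	uint32_t old_write = r->write_index;
	uint32_t used;

	assert(r->size > 0);
	used = (old_write + r->size - r->read_index) % r->size;
	if (len > r->size - 1 - used)
		return false;

	for (uint32_t i = 0; i < len; i++)
		r->data[(old_write + i) % r->size] = msg[i];
	r->write_index = (old_write + len) % r->size;

	/* Empty before the write means read and write were equal. */
	*signal = len > 0 && old_write == r->read_index;
	return true;
}
```

A second write made before the host has consumed the first leaves
*signal false, which avoids an interrupt the host would consider
unnecessary.
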
Similarly, the host will interrupt the guest via the synic when
it sends a new message on the VMBus control path, or when a VMBus
channel "in" ring buffer transitions from empty to non-empty due to
the host inserting a new VMBus channel message. The control message stream
and each VMBus channel "in" ring buffer are separate logical interrupts
that are demultiplexed by vmbus_isr(). It demultiplexes by first checking
for channel interrupts by calling vmbus_chan_sched(), which looks at a synic
bitmap to determine which channels have pending interrupts on this CPU.
If multiple channels have pending interrupts for this CPU, they are
processed sequentially. When all channel interrupts have been processed,
vmbus_isr() checks for and processes any messages received on the VMBus
control path.

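
The bitmap scan at the heart of this demultiplexing can be sketched
as follows. This is a user-space illustration, not the code in
vmbus_chan_sched(). It assumes that each bit position identifies one
channel, and it clears bits with plain stores where real code would
use an atomic test-and-clear because the host sets bits concurrently.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Find set bits in `bitmap`, clear them, and store their positions
 * in `out` in ascending order. Returns the number of positions
 * stored (at most `max_out`).
 */
static size_t collect_pending(unsigned long *bitmap, size_t nbits,
			      unsigned int *out, size_t max_out)
{
	size_t n = 0;

	assert(bitmap != NULL && out != NULL);
	for (size_t bit = 0; bit < nbits && n < max_out; bit++) {
		unsigned long mask = 1UL << (bit % BITS_PER_WORD);
		unsigned long *word = &bitmap[bit / BITS_PER_WORD];

		if (*word & mask) {
			*word &= ~mask;	/* not atomic in this sketch */
			out[n++] = (unsigned int)bit;
		}
	}
	return n;
}
```

Each returned position would then be mapped to a channel and that
channel's handler invoked, one after another.
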
The guest CPU that a VMBus channel will interrupt is selected by the
guest when the channel is created, and the host is informed of that
selection. VMBus devices are broadly grouped into two categories:

1. "Slow" devices that need only one VMBus channel. The devices
   (such as keyboard, mouse, heartbeat, and timesync) generate
   relatively few interrupts. Their VMBus channels are all
   assigned to interrupt the VMBUS_CONNECT_CPU, which is always
   CPU 0.
2. "High speed" devices that may use multiple VMBus channels for
   higher parallelism and performance. These devices include the
   synthetic SCSI controller and synthetic NIC. Their VMBus
   channel interrupts are assigned to CPUs that are spread out
   among the available CPUs in the VM so that interrupts on
   multiple channels can be processed in parallel.

The assignment of VMBus channel interrupts to CPUs is done in the
function init_vp_index(). This assignment is done outside of the
normal Linux interrupt affinity mechanism, so the interrupts are
neither "unmanaged" nor "managed" interrupts.

The CPU that a VMBus channel will interrupt can be seen in
/sys/bus/vmbus/devices/<deviceGUID>/channels/<channelRelID>/cpu.
When running on later versions of Hyper-V, the CPU can be changed
by writing a new value to this sysfs entry. Because VMBus channel
interrupts are not Linux IRQs, there are no entries in /proc/interrupts
or /proc/irq corresponding to individual VMBus channel interrupts.

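
A user-space program could read this sysfs entry along the following
lines. The helper names are hypothetical, and the sketch assumes that
the channel directory is named with the decimal channel relID and
that the file holds a single decimal CPU number.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path for a channel's "cpu" entry. Returns 0 on
 * success, or -1 if the buffer is too small.
 */
static int channel_cpu_path(char *buf, size_t len,
			    const char *device_guid, unsigned int rel_id)
{
	int n;

	assert(buf != NULL && device_guid != NULL);
	n = snprintf(buf, len, "/sys/bus/vmbus/devices/%s/channels/%u/cpu",
		     device_guid, rel_id);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the CPU number from the entry; returns -1 on any failure. */
static int read_channel_cpu(const char *path)
{
	FILE *f = fopen(path, "r");
	int cpu = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &cpu) != 1)
		cpu = -1;
	fclose(f);
	return cpu;
}
```

Changing the target CPU would be the corresponding write of a decimal
CPU number to the same path, subject to the Hyper-V version
restriction noted above.
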
An online CPU in a Linux guest may not be taken offline if it has
VMBus channel interrupts assigned to it. Starting in kernel v6.15,
any such interrupts are automatically reassigned to some other CPU
at the time of offlining. The "other" CPU is chosen by the
implementation and is not load balanced or otherwise intelligently
determined. If the CPU is onlined again, channel interrupts previously
assigned to it are not moved back. As a result, after multiple CPUs
have been offlined, and perhaps onlined again, the interrupt-to-CPU
mapping may be scrambled and non-optimal. In such a case, optimal
assignments must be re-established manually. For kernels v6.14 and
earlier, any conflicting channel interrupts must first be manually
reassigned to another CPU as described above. Then when no channel
interrupts are assigned to the CPU, it can be taken offline.

The VMBus channel interrupt handling code is designed to work
correctly even if an interrupt is received on a CPU other than the
CPU assigned to the channel. Specifically, the code does not use
CPU-based exclusion for correctness. In normal operation, Hyper-V
will interrupt the assigned CPU. But when the CPU assigned to a
channel is being changed via sysfs, the guest doesn't know exactly
when Hyper-V will make the transition. The code must work correctly
even if there is a time lag before Hyper-V starts interrupting the
new CPU. See comments in target_cpu_store().

VMBus device creation/deletion
------------------------------

Hyper-V and the Linux guest have a separate message-passing path
that is used for synthetic device creation and deletion. This
path does not use a VMBus channel. See vmbus_post_msg() and
vmbus_on_msg_dpc().

The first step is for the guest to connect to the generic
Hyper-V VMBus mechanism. As part of establishing this connection,
the guest and Hyper-V agree on a VMBus protocol version they will
use. This negotiation allows newer Linux kernels to run on older
Hyper-V versions, and vice versa.

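
The text above does not spell out how the version is agreed upon. One
plausible shape, sketched below purely as an assumption, is for the
guest to propose the versions it supports from newest to oldest until
the host accepts one. All names and version numbers are hypothetical,
and the callback stands in for sending a proposal on the control path
and waiting for the reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for "propose this version to the host and get a reply". */
typedef bool (*propose_fn)(uint32_t version, void *ctx);

/*
 * Try each version in order (caller lists newest first). Returns
 * the first accepted version, or 0 if the host accepts none.
 */
static uint32_t negotiate_version(const uint32_t *versions, size_t count,
				  propose_fn propose, void *ctx)
{
	assert(versions != NULL && propose != NULL);
	for (size_t i = 0; i < count; i++) {
		if (propose(versions[i], ctx))
			return versions[i];
	}
	return 0;
}

/* Example host model: accepts any version up to the limit in ctx. */
static bool host_accepts_up_to(uint32_t version, void *ctx)
{
	return version <= *(const uint32_t *)ctx;
}
```

With a guest list of 5, 4, 3 and a host limit of 4, the sketch settles
on 4, which models a newer guest running on an older host.
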
The guest then tells Hyper-V to "send offers". Hyper-V sends an
offer message to the guest for each synthetic device that the VM
is configured to have. Each VMBus device type has a fixed GUID
known as the "class ID", and each VMBus device instance is also
identified by a GUID. The offer message from Hyper-V contains
both GUIDs to uniquely (within the VM) identify the device.
There is one offer message for each device instance, so a VM with
two synthetic NICs will get two offer messages with the NIC
class ID. The ordering of offer messages can vary from boot-to-boot
and must not be assumed to be consistent in Linux code. Offer
messages may also arrive long after Linux has initially booted
because Hyper-V supports adding devices, such as synthetic NICs,
to running VMs. A new offer message is processed by
vmbus_process_offer(), which indirectly invokes vmbus_add_channel_work().

Upon receipt of an offer message, the guest identifies the device
type based on the class ID, and invokes the correct driver to set up
the device. Driver/device matching is performed using the standard
Linux mechanism.

The device driver probe function opens the primary VMBus channel to
the corresponding VSP. It allocates guest memory for the channel
ring buffers and shares the ring buffer with the Hyper-V host by
giving the host a list of GPAs for the ring buffer memory. See
vmbus_establish_gpadl().

Once the ring buffer is set up, the device driver and VSP exchange
setup messages via the primary channel. These messages may include
negotiating the device protocol version to be used between the Linux
VSC and the VSP on the Hyper-V host. The setup messages may also
include creating additional VMBus channels, which are somewhat
mis-named as "sub-channels" since they are functionally
equivalent to the primary channel once they are created.

Finally, the device driver may create entries in /dev as with
any device driver.

The Hyper-V host can send a "rescind" message to the guest to
remove a device that was previously offered. Linux drivers must
handle such a rescind message at any time. Rescinding a device
invokes the device driver "remove" function to cleanly shut
down the device and remove it. Once a synthetic device is
rescinded, neither Hyper-V nor Linux retains any state about
its previous existence. Such a device might be re-added later,
in which case it is treated as an entirely new device. See
vmbus_onoffer_rescind().

For some devices, such as the KVP device, Hyper-V automatically
sends a rescind message when the primary channel is closed,
likely as a result of unbinding the device from its driver.
The rescind causes Linux to remove the device. But then Hyper-V
immediately reoffers the device to the guest, causing a new
instance of the device to be created in Linux. For other
devices, such as the synthetic SCSI and NIC devices, closing the
primary channel does *not* result in Hyper-V sending a rescind
message. The device continues to exist in Linux on the VMBus,
but with no driver bound to it. The same driver or a new driver
can subsequently be bound to the existing instance of the device.
  303. message. The device continues to exist in Linux on the VMBus,
  304. but with no driver bound to it. The same driver or a new driver
  305. can subsequently be bound to the existing instance of the device.