  1. .. SPDX-License-Identifier: GPL-2.0
  2. Confidential Computing VMs
  3. ==========================
  4. Hyper-V can create and run Linux guests that are Confidential Computing
  5. (CoCo) VMs. Such VMs cooperate with the physical processor to better protect
  6. the confidentiality and integrity of data in the VM's memory, even in the
  7. face of a hypervisor/VMM that has been compromised and may behave maliciously.
  8. CoCo VMs on Hyper-V share the generic CoCo VM threat model and security
  9. objectives described in Documentation/security/snp-tdx-threat-model.rst. Note
  10. that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or
  11. "isolation VMs".
  12. A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the
  13. following:
  14. * Physical hardware with a processor that supports CoCo VMs
  15. * The hardware runs a version of Windows/Hyper-V with support for CoCo VMs
  16. * The VM runs a version of Linux that supports being a CoCo VM
  17. The physical hardware requirements are as follows:
  18. * AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME,
  19. SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo
  20. VM on Hyper-V.
  21. * Intel processor with TDX
  22. To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V
  23. when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM,
  24. or vice versa, after it is created.
  25. Operational Modes
  26. -----------------
  27. Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is
  28. created and cannot be changed during the life of the VM.
  29. * Fully-enlightened mode. In this mode, the guest operating system is
  30. enlightened to understand and manage all aspects of running as a CoCo VM.
  31. * Paravisor mode. In this mode, a paravisor layer between the guest and the
  32. host provides some operations needed to run as a CoCo VM. The guest operating
  33. system can have fewer CoCo enlightenments than are required in the
  34. fully-enlightened case.
  35. Conceptually, fully-enlightened mode and paravisor mode may be treated as
  36. points on a spectrum spanning the degree of guest enlightenment needed to run
  37. as a CoCo VM. Fully-enlightened mode is one end of the spectrum. A full
  38. implementation of paravisor mode is the other end of the spectrum, where all
  39. aspects of running as a CoCo VM are handled by the paravisor, and a normal
  40. guest OS with no knowledge of memory encryption or other aspects of CoCo VMs
  41. can run successfully. However, the Hyper-V implementation of paravisor mode
  42. does not go this far, and is somewhere in the middle of the spectrum. Some
  43. aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS
  44. must be enlightened for other aspects. Unfortunately, there is no
  45. standardized enumeration of features/functions that might be provided in the
  46. paravisor, and there is no standardized mechanism for a guest OS to query the
  47. paravisor for the features/functions it provides. The understanding of what
  48. the paravisor provides is hard-coded in the guest OS.
  49. Paravisor mode has similarities to the `Coconut project`_, which aims to provide
  50. a limited paravisor to provide services to the guest such as a virtual TPM.
  51. However, the Hyper-V paravisor generally handles more aspects of CoCo VMs
  52. than is currently envisioned for Coconut, and so is further toward the "no
  53. guest enlightenments required" end of the spectrum.
  54. .. _Coconut project: https://github.com/coconut-svsm/svsm
  55. In the CoCo VM threat model, the paravisor is in the guest security domain
  56. and must be trusted by the guest OS. By implication, the hypervisor/VMM must
  57. protect itself against a potentially malicious paravisor just like it
  58. protects against a potentially malicious guest.
  59. The hardware architectural approach to fully-enlightened vs. paravisor mode
  60. varies depending on the underlying processor.
  61. * With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in
  62. VMPL 0 and has full control of the guest context. In paravisor mode, the
  63. guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor
  64. running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have.
  65. Certain operations require the guest to invoke the paravisor. Furthermore, in
  66. paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode
  67. as defined by the SEV-SNP architecture. This mode simplifies guest management
  68. of memory encryption when a paravisor is used.
  69. * With Intel TDX processors, in fully-enlightened mode the guest OS runs in an
  70. L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the
  71. L1 VM, and the guest OS runs in a nested L2 VM.
  72. Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This
  73. MSR indicates if the underlying processor uses AMD SEV-SNP or Intel TDX, and
  74. whether a paravisor is being used. It is straightforward to build a single
  75. kernel image that can boot and run properly on either architecture, and in
  76. either mode.
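As a rough illustration of how a single kernel image might branch on this information, the sketch below decodes a raw value into an isolation type and a paravisor flag. The bit layout, the `coco_sketch_*` names, and the numeric type codes are assumptions made for this example only; they are not the Hyper-V definition of the value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Isolation type as seen by the guest. */
enum coco_isolation {
	COCO_ISOLATION_NONE,
	COCO_ISOLATION_SNP,
	COCO_ISOLATION_TDX,
};

struct coco_mode {
	enum coco_isolation isolation;
	bool paravisor_present;
};

/* ASSUMED layout for illustration: bits 3:0 = type, bit 4 = paravisor. */
#define COCO_SKETCH_TYPE_MASK	0xfu
#define COCO_SKETCH_TYPE_SNP	2u
#define COCO_SKETCH_TYPE_TDX	3u
#define COCO_SKETCH_PARAVISOR	(1u << 4)

static_assert((COCO_SKETCH_PARAVISOR & COCO_SKETCH_TYPE_MASK) == 0,
	      "paravisor bit must not overlap the type field");

/* Decode the raw value once at boot; later code tests the decoded struct. */
static struct coco_mode coco_sketch_decode(uint64_t raw)
{
	struct coco_mode mode = { COCO_ISOLATION_NONE, false };

	switch (raw & COCO_SKETCH_TYPE_MASK) {
	case COCO_SKETCH_TYPE_SNP:
		mode.isolation = COCO_ISOLATION_SNP;
		break;
	case COCO_SKETCH_TYPE_TDX:
		mode.isolation = COCO_ISOLATION_TDX;
		break;
	default:
		break;
	}
	mode.paravisor_present = (raw & COCO_SKETCH_PARAVISOR) != 0;
	return mode;
}
```

Decoding into a small struct keeps the architecture and paravisor checks independent, which mirrors the four combinations (SEV-SNP or TDX, with or without a paravisor) described above.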
  77. Paravisor Effects
  78. -----------------
  79. Running in paravisor mode affects the following areas of generic Linux kernel
  80. CoCo VM functionality:
  81. * Initial guest memory setup. When a new VM is created in paravisor mode, the
  82. paravisor runs first and sets up the guest physical memory as encrypted. The
  83. guest Linux does normal memory initialization, except for explicitly marking
  84. appropriate ranges as decrypted (shared). In paravisor mode, Linux does not
  85. perform the early boot memory setup steps that are particularly tricky with
  86. AMD SEV-SNP in fully-enlightened mode.
  87. * #VC/#VE exception handling. In paravisor mode, Hyper-V configures the guest
  88. CoCo VM to route #VC and #VE exceptions to VMPL 0 and the L1 VM,
  89. respectively, and not the guest Linux. Consequently, these exception handlers
  90. do not run in the guest Linux and are not a required enlightenment for a
  91. Linux guest in paravisor mode.
  92. * CPUID flags. Both AMD SEV-SNP and Intel TDX provide a CPUID flag in the
  93. guest indicating that the VM is operating with the respective hardware
  94. support. While these CPUID flags are visible in fully-enlightened CoCo VMs,
  95. the paravisor filters out these flags and the guest Linux does not see them.
  96. Throughout the Linux kernel, explicitly testing these flags has mostly been
  97. eliminated in favor of the cc_platform_has() function, with the goal of
  98. abstracting the differences between SEV-SNP and TDX. But the
  99. cc_platform_has() abstraction also allows the Hyper-V paravisor configuration
  100. to selectively enable aspects of CoCo VM functionality even when the CPUID
  101. flags are not set. The exception is early boot memory setup on SEV-SNP, which
  102. tests the CPUID SEV-SNP flag. But not having the flag in a Hyper-V paravisor
  103. mode VM achieves the desired effect of not running SEV-SNP specific early
  104. boot memory setup.
  105. * Device emulation. In paravisor mode, the Hyper-V paravisor provides
  106. emulation of devices such as the IO-APIC and TPM. Because the emulation
  107. happens in the paravisor in the guest context (instead of the hypervisor/VMM
  108. context), MMIO accesses to these devices must be encrypted references instead
  109. of the decrypted references that would be used in a fully-enlightened CoCo
  110. VM. The __ioremap_caller() function has been enhanced to make a callback to
  111. check whether a particular address range should be treated as encrypted
  112. (private). See the "is_private_mmio" callback.
  113. * Encrypt/decrypt memory transitions. In a CoCo VM, transitioning guest
  114. memory between encrypted and decrypted requires coordinating with the
  115. hypervisor/VMM. This is done via callbacks invoked from
  116. __set_memory_enc_pgtable(). In fully-enlightened mode, the normal SEV-SNP and
  117. TDX implementations of these callbacks are used. In paravisor mode, a Hyper-V
  118. specific set of callbacks is used. These callbacks invoke the paravisor so
  119. that the paravisor can coordinate the transitions and inform the hypervisor
  120. as necessary. See hv_vtom_init() where these callbacks are set up.
  121. * Interrupt injection. In fully-enlightened mode, a malicious hypervisor
  122. could inject interrupts into the guest OS at times that violate x86/x64
  123. architectural rules. For full protection, the guest OS should include
  124. enlightenments that use the interrupt injection management features provided
  125. by CoCo-capable processors. In paravisor mode, the paravisor mediates
  126. interrupt injection into the guest OS, and ensures that the guest OS only
  127. sees interrupts that are "legal". The paravisor uses the interrupt injection
  128. management features provided by the CoCo-capable physical processor, thereby
  129. masking these complexities from the guest OS.
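The device emulation item above relies on a predicate that says whether an MMIO address belongs to a paravisor-emulated device and must therefore be mapped encrypted. A minimal user-space sketch of such a predicate follows. The base addresses, the one-page range size, and the `sketch_` names are assumptions chosen for illustration, not values taken from this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	0x1000ull
#define SKETCH_IOAPIC_BASE	0xfec00000ull	/* assumed IO-APIC MMIO base */
#define SKETCH_VTPM_BASE	0xfed40000ull	/* assumed virtual TPM MMIO base */

static_assert((SKETCH_IOAPIC_BASE % SKETCH_PAGE_SIZE) == 0 &&
	      (SKETCH_VTPM_BASE % SKETCH_PAGE_SIZE) == 0,
	      "emulated device ranges are assumed to be page aligned");

/* True if addr falls within the single page starting at base. */
static bool sketch_in_page(uint64_t addr, uint64_t base)
{
	return addr >= base && addr < base + SKETCH_PAGE_SIZE;
}

/*
 * Devices emulated by the paravisor live in the guest context, so their
 * MMIO ranges are treated as private (encrypted). Everything else is
 * assumed to be emulated by the host and mapped decrypted.
 */
static bool sketch_is_private_mmio(uint64_t addr)
{
	return sketch_in_page(addr, SKETCH_IOAPIC_BASE) ||
	       sketch_in_page(addr, SKETCH_VTPM_BASE);
}
```

A mapping routine would consult this predicate before choosing the encryption attribute for the new mapping.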
  130. Hyper-V Hypercalls
  131. ------------------
  132. When in fully-enlightened mode, hypercalls made by the Linux guest are routed
  133. directly to the hypervisor, just as in a non-CoCo VM. But in paravisor mode,
  134. normal hypercalls trap to the paravisor first, which may in turn invoke the
  135. hypervisor. However, the paravisor is idiosyncratic in this regard, and a few
  136. hypercalls made by the Linux guest must always be routed directly to the
  137. hypervisor. These hypercall sites test for a paravisor being present, and use
  138. a special invocation sequence. See hv_post_message(), for example.
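The routing rules in this section can be summarized as a small decision function. This is a sketch only: the enum, the function name, and the `must_reach_hypervisor` parameter are hypothetical, and the actual special invocation sequence is architecture specific and not shown.

```c
#include <stdbool.h>

enum sketch_route {
	SKETCH_ROUTE_DIRECT,		/* normal hypercall, reaches the hypervisor */
	SKETCH_ROUTE_VIA_PARAVISOR,	/* normal hypercall, traps to the paravisor first */
	SKETCH_ROUTE_SPECIAL_DIRECT,	/* special sequence that reaches the hypervisor */
};

/*
 * Pick the path for one hypercall site. must_reach_hypervisor marks the
 * few call sites that may not be handled by the paravisor.
 */
static enum sketch_route sketch_route_hypercall(bool paravisor_present,
						bool must_reach_hypervisor)
{
	if (!paravisor_present)
		return SKETCH_ROUTE_DIRECT;
	if (must_reach_hypervisor)
		return SKETCH_ROUTE_SPECIAL_DIRECT;
	return SKETCH_ROUTE_VIA_PARAVISOR;
}
```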
  139. Guest communication with Hyper-V
  140. --------------------------------
  141. Separate from the generic Linux kernel handling of memory encryption in Linux
  142. CoCo VMs, Hyper-V has VMBus and VMBus devices that communicate using memory
  143. shared between the Linux guest and the host. This shared memory must be
  144. marked decrypted to enable communication. Furthermore, since the threat model
  145. includes a compromised and potentially malicious host, the guest must guard
  146. against leaking any unintended data to the host through this shared memory.
  147. These Hyper-V and VMBus memory pages are marked as decrypted:
  148. * VMBus monitor pages
  149. * Synthetic interrupt controller (SynIC) related pages (unless supplied by
  150. the paravisor)
  151. * Per-cpu hypercall input and output pages (unless running with a paravisor)
  152. * VMBus ring buffers. The direct mapping is marked decrypted in
  153. __vmbus_establish_gpadl(). The secondary mapping created in
  154. hv_ringbuffer_init() must also include the "decrypted" attribute.
  155. When the guest writes data to memory that is shared with the host, it must
  156. ensure that only the intended data is written. Padding or unused fields must
  157. be initialized to zeros before copying into the shared memory so that random
  158. kernel data is not inadvertently given to the host.
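A minimal sketch of the zero-before-copy pattern is shown below. The message layout and the `sketch_` names are made up for this example. On common ABIs the struct has three padding bytes after `type`, which would otherwise carry whatever was on the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layout; padding follows 'type' on common ABIs. */
struct sketch_msg {
	uint8_t  type;
	uint32_t length;
	uint8_t  payload[8];
};

/*
 * Build the message in private memory, zeroing it first so that padding
 * and unused payload bytes are zero, then copy the finished message into
 * the host-visible buffer in one step.
 */
static void sketch_post(uint8_t *shared, uint8_t type,
			const uint8_t *data, uint32_t len)
{
	struct sketch_msg msg;

	assert(shared != NULL && data != NULL);
	memset(&msg, 0, sizeof(msg));
	if (len > sizeof(msg.payload))
		len = sizeof(msg.payload);
	msg.type = type;
	msg.length = len;
	memcpy(msg.payload, data, len);
	memcpy(shared, &msg, sizeof(msg));
}
```

Zeroing the whole struct up front is simpler to audit than clearing individual fields, because it stays correct when fields are added later.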
  159. Similarly, when the guest reads memory that is shared with the host, it must
  160. validate the data before acting on it so that a malicious host cannot induce
  161. the guest to expose unintended data. Doing such validation can be tricky
  162. because the host can modify the shared memory areas even while or after
  163. validation is performed. For messages passed from the host to the guest in a
  164. VMBus ring buffer, the length of the message is validated, and the message is
  165. copied into a temporary (encrypted) buffer for further validation and
  166. processing. The copying adds a small amount of overhead, but is the only way
  167. to protect against a malicious host. See hv_pkt_iter_first().
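The copy-then-validate approach can be sketched as follows. The packet layout, the size limit, and the function name are assumptions for illustration. The essential points are that the length is fetched from shared memory exactly once, and that all later processing uses the private copy, so host modifications after the copy have no effect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PKT	64u

/* Host-writable memory; volatile forces each access to really happen. */
struct sketch_shared_pkt {
	volatile uint32_t len;
	volatile uint8_t data[SKETCH_MAX_PKT];
};

/*
 * Copy one packet into private memory. Returns false if the length
 * supplied by the host is out of range.
 */
static bool sketch_recv(const struct sketch_shared_pkt *shared,
			uint8_t *priv, uint32_t *priv_len)
{
	uint32_t len;

	assert(shared != NULL && priv != NULL && priv_len != NULL);
	len = shared->len;		/* single fetch of the length */
	if (len == 0 || len > SKETCH_MAX_PKT)
		return false;
	for (uint32_t i = 0; i < len; i++)
		priv[i] = shared->data[i];
	*priv_len = len;		/* validated value, never re-read */
	return true;
}
```

Reading `shared->len` a second time after the range check would reopen the window in which the host can change it, which is why the validated local copy is the only value used.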
  168. Many drivers for VMBus devices have been "hardened" by adding code to fully
  169. validate messages received over VMBus, instead of assuming that Hyper-V is
  170. acting cooperatively. Such drivers are marked as "allowed_in_isolated" in the
  171. vmbus_devs[] table. Other drivers for VMBus devices that are not needed in a
  172. CoCo VM have not been hardened, and they are not allowed to load in a CoCo
  173. VM. See vmbus_is_valid_offer() where such devices are excluded.
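The allow-list check can be pictured with the sketch below. The table entries, device names, and the `sketch_` identifiers are placeholders invented for this example; the treatment of devices missing from the table (rejected in an isolated VM, accepted otherwise) is likewise an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct sketch_dev {
	const char *name;
	bool allowed_in_isolated;	/* driver has been hardened */
};

/* Placeholder entries, not the contents of the real table. */
static const struct sketch_dev sketch_devs[] = {
	{ "storage",     true  },
	{ "network",     true  },
	{ "framebuffer", false },
};

/* Decide whether a device offer should be accepted. */
static bool sketch_is_valid_offer(const char *name, bool isolated)
{
	size_t i;

	assert(name != NULL);
	if (!isolated)
		return true;	/* a normal VM accepts every offer */

	for (i = 0; i < sizeof(sketch_devs) / sizeof(sketch_devs[0]); i++) {
		if (strcmp(sketch_devs[i].name, name) == 0)
			return sketch_devs[i].allowed_in_isolated;
	}
	return false;		/* unknown device: reject in an isolated VM */
}
```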
  174. Two VMBus devices depend on the Hyper-V host to do DMA data transfers:
  175. storvsc for disk I/O and netvsc for network I/O. storvsc uses the normal
  176. Linux kernel DMA APIs, and so bounce buffering through decrypted swiotlb
  177. memory is done implicitly. netvsc has two modes for data transfers. The first
  178. mode goes through send and receive buffer space that is explicitly allocated
  179. by the netvsc driver, and is used for most smaller packets. These send and
  180. receive buffers are marked decrypted by __vmbus_establish_gpadl(). Because
  181. the netvsc driver explicitly copies packets to/from these buffers, the
  182. equivalent of bounce buffering between encrypted and decrypted memory is
  183. already part of the data path. The second mode uses the normal Linux kernel
  184. DMA APIs, and is bounce buffered through swiotlb memory implicitly like in
  185. storvsc.
  186. Finally, the VMBus virtual PCI driver needs special handling in a CoCo VM.
  187. Linux PCI device drivers access PCI config space using standard APIs provided
  188. by the Linux PCI subsystem. On Hyper-V, these functions directly access MMIO
  189. space, and the access traps to Hyper-V for emulation. But in CoCo VMs, memory
  190. encryption prevents Hyper-V from reading the guest instruction stream to
  191. emulate the access. So in a CoCo VM, these functions must make a hypercall
  192. with arguments explicitly describing the access. See
  193. _hv_pcifront_read_config() and _hv_pcifront_write_config() and the
  194. "use_calls" flag indicating to use hypercalls.
  195. Confidential VMBus
  196. ------------------
  197. Confidential VMBus lets a confidential guest avoid interacting with
  198. the untrusted host partition and the untrusted hypervisor. Instead, the guest
  199. relies on the trusted paravisor to communicate with the devices processing
  200. sensitive data. The hardware (SNP or TDX) encrypts the guest memory and the
  201. register state while measuring the paravisor image using the platform security
  202. processor to ensure trusted and confidential computing.
  203. Confidential VMBus provides a secure communication channel between the guest
  204. and the paravisor, ensuring that sensitive data is protected from hypervisor-
  205. level access through memory encryption and register state isolation.
  206. Confidential VMBus is an extension of Confidential Computing (CoCo) VMs
  207. (a.k.a. "Isolated" VMs in Hyper-V terminology). Without Confidential VMBus,
  208. guest VMBus device drivers (the "VSC"s in VMBus terminology) communicate
  209. with VMBus servers (the VSPs) running on the Hyper-V host. The
  210. communication must be through memory that has been decrypted so the
  211. host can access it. With Confidential VMBus, one or more of the VSPs reside
  212. in the trusted paravisor layer in the guest VM. Since the paravisor layer also
  213. operates in encrypted memory, the memory used for communication with
  214. such VSPs does not need to be decrypted and thereby exposed to the
  215. Hyper-V host. The paravisor is responsible for communicating securely
  216. with the Hyper-V host as necessary.
  217. The data is transferred directly between the VM and a vPCI device (a.k.a.
  218. a PCI pass-thru device, see :doc:`vpci`) that is directly assigned to VTL2
  219. and that supports encrypted memory. In such a case, neither the host partition
  220. nor the hypervisor has any access to the data. The guest needs to establish
  221. a VMBus connection only with the paravisor for the channels that process
  222. sensitive data, and the paravisor abstracts the details of communicating
  223. with the specific devices away, providing the guest with the well-established
  224. VSP (Virtual Service Provider) interface that has had support in the Hyper-V
  225. drivers for a decade.
  226. If the device does not support encrypted memory, the paravisor
  227. provides bounce-buffering, and although the data is not encrypted, the backing
  228. pages aren't mapped into the host partition through SLAT. While not impossible,
  229. it becomes much more difficult for the host partition to exfiltrate the data
  230. than it would be with a conventional VMBus connection where the host partition
  231. has direct access to the memory used for communication.
  232. Here is the data flow for a conventional VMBus connection (`C` stands for the
  233. client or VSC, `S` for the server or VSP, and `DEVICE` is a physical device,
  234. possibly with multiple virtual functions)::
  235.   +---- GUEST ----+       +----- DEVICE ----+        +----- HOST -----+
  236.   |               |       |                 |        |                |
  237.   |               |       |                 |        |                |
  238.   |               |       |                 ==========                |
  239.   |               |       |                 |        |                |
  240.   |               |       |                 |        |                |
  241.   |               |       |                 |        |                |
  242.   +----- C -------+       +-----------------+        +------- S ------+
  243.          ||                                                   ||
  244.          ||                                                   ||
  245.   +------||------------------ VMBus --------------------------||------+
  246.   |                     Interrupts, MMIO                              |
  247.   +-------------------------------------------------------------------+
  248. and the Confidential VMBus connection::
  249.   +---- GUEST --------------- VTL0 ------+               +-- DEVICE --+
  250.   |                                      |               |            |
  251.   | +- PARAVISOR --------- VTL2 -----+   |               |            |
  252.   | |     +-- VMBus Relay ------+    ====+================            |
  253.   | |     |   Interrupts, MMIO  |    |   |               |            |
  254.   | |     +-------- S ----------+    |   |               +------------+
  255.   | |               ||               |   |
  256.   | +---------+     ||               |   |
  257.   | |  Linux  |     ||    OpenHCL    |   |
  258.   | |  kernel |     ||               |   |
  259.   | +---- C --+-----||---------------+   |
  260.   |       ||        ||                   |
  261.   +-------++------- C -------------------+               +------------+
  262.           ||                                             |    HOST    |
  263.           ||                                             +---- S -----+
  264.   +-------||----------------- VMBus ---------------------------||-----+
  265.   |                     Interrupts, MMIO                              |
  266.   +-------------------------------------------------------------------+
  267. An implementation of the VMBus relay that offers the Confidential VMBus
  268. channels is available in the OpenVMM project as a part of the OpenHCL
  269. paravisor. Please refer to
  270. * https://openvmm.dev/, and
  271. * https://github.com/microsoft/openvmm
  272. for more information about the OpenHCL paravisor.
  273. A guest that is running with a paravisor must determine at runtime if
  274. Confidential VMBus is supported by the current paravisor. The x86_64-specific
  275. approach relies on the CPUID Virtualization Stack leaf; the ARM64 implementation
  276. is expected to support the Confidential VMBus unconditionally when running
  277. ARM CCA guests.
  278. Confidential VMBus is a characteristic of the VMBus connection as a whole,
  279. and of each VMBus channel that is created. When a Confidential VMBus
  280. connection is established, the paravisor provides the guest the message-passing
  281. path that is used for VMBus device creation and deletion, and it provides a
  282. per-CPU synthetic interrupt controller (SynIC) just like the SynIC that is
  283. offered by the Hyper-V host. Each VMBus device that is offered to the guest
  284. indicates the degree to which it participates in Confidential VMBus. The offer
  285. indicates if the device uses encrypted ring buffers, and if the device uses
  286. encrypted memory for DMA that is done outside the ring buffer. These settings
  287. may be different for different devices using the same Confidential VMBus
  288. connection.
  289. Although these settings are separate, in practice it'll always be encrypted
  290. ring buffer only, or both encrypted ring buffer and external data. If a channel
  291. is offered by the paravisor with confidential VMBus, the ring buffer can always
  292. be encrypted since it's strictly for communication between the VTL2 paravisor
  293. and the VTL0 guest. However, other memory regions are often used for e.g. DMA,
  294. so they need to be accessible by the underlying hardware, and must be
  295. unencrypted (unless the device supports encrypted memory). Currently, no
  296. VSPs in OpenHCL support encrypted external memory, but future
  297. versions are expected to enable this capability.
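The combinations of the two per-channel settings can be sketched as a small classifier. The flag bits and names below are hypothetical and do not reflect the real offer encoding. Treating "encrypted external memory without an encrypted ring buffer" as unexpected is an assumption derived from the statement that in practice only the other combinations occur.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL flag bits, for illustration only. */
#define SKETCH_OFFER_ENC_RING		(1u << 0)
#define SKETCH_OFFER_ENC_EXTERNAL	(1u << 1)

static_assert((SKETCH_OFFER_ENC_RING & SKETCH_OFFER_ENC_EXTERNAL) == 0,
	      "flag bits must be distinct");

enum sketch_chan_kind {
	SKETCH_CHAN_DECRYPTED,	/* ring buffer and external data are host visible */
	SKETCH_CHAN_ENC_RING,	/* encrypted ring buffer, decrypted external data */
	SKETCH_CHAN_ENC_ALL,	/* ring buffer and external data both encrypted */
	SKETCH_CHAN_UNEXPECTED,	/* encrypted external data only: not expected */
};

/* Map the two offer settings onto the combinations described in the text. */
static enum sketch_chan_kind sketch_classify_offer(uint32_t flags)
{
	bool ring = (flags & SKETCH_OFFER_ENC_RING) != 0;
	bool external = (flags & SKETCH_OFFER_ENC_EXTERNAL) != 0;

	if (ring && external)
		return SKETCH_CHAN_ENC_ALL;
	if (ring)
		return SKETCH_CHAN_ENC_RING;
	if (external)
		return SKETCH_CHAN_UNEXPECTED;
	return SKETCH_CHAN_DECRYPTED;
}
```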
  298. Because some devices on a Confidential VMBus may require decrypted ring buffers
  299. and DMA transfers, the guest must interact with two SynICs -- the one provided
  300. by the paravisor and the one provided by the Hyper-V host when Confidential
  301. VMBus is not offered. Interrupts are always signaled by the paravisor SynIC,
  302. but the guest must check for messages and for channel interrupts on both SynICs.
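Because the interrupt does not identify which SynIC has work pending, the handler has to look at both. The sketch below models each SynIC as a single message slot, which is a simplification made for this example; the struct, constants, and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MSG_NONE	0u

/* One message slot per SynIC; a simplification for this sketch. */
struct sketch_synic {
	uint32_t msg_type;	/* SKETCH_MSG_NONE when the slot is empty */
};

/* Take the pending message, if any, and mark the slot free. */
static uint32_t sketch_take(struct sketch_synic *synic)
{
	uint32_t type = synic->msg_type;

	synic->msg_type = SKETCH_MSG_NONE;
	return type;
}

/*
 * Called from the single interrupt path. Checks the paravisor SynIC and
 * then the host SynIC, storing the message types found in handled[].
 * Returns the number of messages found (0, 1, or 2).
 */
static size_t sketch_poll_both(struct sketch_synic *paravisor,
			       struct sketch_synic *host,
			       uint32_t handled[2])
{
	size_t n = 0;
	uint32_t type;

	assert(paravisor != NULL && host != NULL && handled != NULL);

	type = sketch_take(paravisor);
	if (type != SKETCH_MSG_NONE)
		handled[n++] = type;

	type = sketch_take(host);
	if (type != SKETCH_MSG_NONE)
		handled[n++] = type;

	return n;
}
```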
  303. In the case of a confidential VMBus, regular SynIC access by the guest is
  304. intercepted by the paravisor (this includes various MSRs such as the SIMP and
  305. SIEFP, as well as hypercalls like HvPostMessage and HvSignalEvent). If the
  306. guest actually wants to communicate with the hypervisor, it has to use special
  307. mechanisms (GHCB page on SNP, or tdcall on TDX). Messages can be of either
  308. kind: with confidential VMBus, messages use the paravisor SynIC, and if the
  309. guest chose to communicate directly to the hypervisor, they use the hypervisor
  310. SynIC. For interrupt signaling, some channels may be running on the host
  311. (non-confidential, using the VMBus relay) and use the hypervisor SynIC, and
  312. some on the paravisor and use its SynIC. The RelIDs are coordinated by the
  313. OpenHCL VMBus server and are guaranteed to be unique regardless of whether
  314. the channel originated on the host or the paravisor.
  315. load_unaligned_zeropad()
  316. ------------------------
  317. When transitioning memory between encrypted and decrypted, the caller of
  318. set_memory_encrypted() or set_memory_decrypted() is responsible for ensuring
  319. the memory isn't in use and isn't referenced while the transition is in
  320. progress. The transition has multiple steps, and includes interaction with
  321. the Hyper-V host. The memory is in an inconsistent state until all steps are
  322. complete. A reference while the state is inconsistent could result in an
  323. exception that can't be cleanly fixed up.
  324. However, the kernel load_unaligned_zeropad() mechanism may make stray
  325. references that can't be prevented by the caller of set_memory_encrypted() or
  326. set_memory_decrypted(), so there's specific code in the #VC or #VE exception
  327. handler to fix up this case. But a CoCo VM running on Hyper-V may be
  328. configured to run with a paravisor, with the #VC or #VE exception routed to
  329. the paravisor. There's no architectural way to forward the exceptions back to
  330. the guest kernel, and in such a case, the load_unaligned_zeropad() fixup code
  331. in the #VC/#VE handlers doesn't run.
  332. To avoid this problem, the Hyper-V specific functions for notifying the
  333. hypervisor of the transition mark pages as "not present" while a transition
  334. is in progress. If load_unaligned_zeropad() causes a stray reference, a
  335. normal page fault is generated instead of #VC or #VE, and the page-fault-
  336. based handlers for load_unaligned_zeropad() fix up the reference. When the
  337. encrypted/decrypted transition is complete, the pages are marked as "present"
  338. again. See hv_vtom_clear_present() and hv_vtom_set_host_visibility().
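The ordering described above can be modeled with a toy state machine. This is a simulation under simplifying assumptions, not kernel code: the page struct, the fault enum, and the `stray` parameter (which stands in for a load_unaligned_zeropad() reference arriving mid-transition) are all invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_fault {
	SKETCH_FAULT_NONE,	/* access succeeds */
	SKETCH_FAULT_PAGE,	/* ordinary page fault, fixable in the guest */
	SKETCH_FAULT_VC_VE,	/* #VC/#VE, routed to the paravisor */
};

struct sketch_page {
	bool present;
	bool encrypted;
	bool in_transition;	/* encryption state is inconsistent */
};

/* What a reference to the page would raise in its current state. */
static enum sketch_fault sketch_access(const struct sketch_page *pg)
{
	if (!pg->present)
		return SKETCH_FAULT_PAGE;	/* checked before anything else */
	if (pg->in_transition)
		return SKETCH_FAULT_VC_VE;
	return SKETCH_FAULT_NONE;
}

/*
 * Change the encryption state. The page is marked not present for the
 * whole transition; *stray records what a reference arriving in the
 * middle of the transition would have raised.
 */
static void sketch_transition(struct sketch_page *pg, bool encrypt,
			      enum sketch_fault *stray)
{
	assert(pg != NULL && stray != NULL);

	pg->present = false;
	pg->in_transition = true;
	*stray = sketch_access(pg);	/* simulated stray reference */
	pg->encrypted = encrypt;
	pg->in_transition = false;
	pg->present = true;
}
```

In this model a stray reference during the transition always produces the page fault, which the guest can fix up itself, and never the exception that would be delivered to the paravisor.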