- .. SPDX-License-Identifier: GPL-2.0
- Hibernating Guest VMs
- =====================
- Background
- ----------
- Linux supports the ability to hibernate itself in order to save power.
- Hibernation is sometimes called suspend-to-disk, as it writes a memory
- image to disk and puts the hardware into the lowest possible power
- state. Upon resume from hibernation, the hardware is restarted and the
- memory image is restored from disk so that it can resume execution
- where it left off. See the "Hibernation" section of
- Documentation/admin-guide/pm/sleep-states.rst.
- Hibernation is usually done on devices with a single user, such as a
- personal laptop. For example, the laptop goes into hibernation when
- the cover is closed, and resumes when the cover is opened again.
- Hibernation and resume happen on the same hardware, and Linux kernel
- code orchestrating the hibernation steps assumes that the hardware
- configuration is not changed while in the hibernated state.
- Hibernation can be initiated within Linux by writing "disk" to
- /sys/power/state or by invoking the reboot system call with the
- appropriate arguments. This functionality may be wrapped by user space
- commands such as "systemctl hibernate" that are run directly from a
- command line or in response to events such as the laptop lid closing.
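As an illustration of the sysfs method described above, the following is a minimal Python sketch. The function name ``request_hibernation`` and its ``state_file`` parameter are hypothetical and exist only so the logic can be tried against an ordinary file; on a real system the write requires root privileges and actually hibernates the machine.

```python
from pathlib import Path


def request_hibernation(state_file="/sys/power/state"):
    """Ask the kernel to hibernate by writing "disk" to the state file.

    Reading the file first lists the sleep states the kernel offers,
    for example "freeze mem disk".
    """
    path = Path(state_file)
    supported = path.read_text().split()
    if "disk" not in supported:
        raise RuntimeError("hibernation ('disk') is not offered by this kernel")
    path.write_text("disk")
```

Passing the path of a scratch file that contains a line such as ``freeze mem disk`` exercises the check without touching the real power state.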
- Considerations for Guest VM Hibernation
- ---------------------------------------
- Linux guests on Hyper-V can also be hibernated, in which case the
- hardware is the virtual hardware provided by Hyper-V to the guest VM.
- Only the targeted guest VM is hibernated, while other guest VMs and
- the underlying Hyper-V host continue to run normally. While the
- underlying Windows Hyper-V and physical hardware on which it is
- running might also be hibernated using hibernation functionality in
- the Windows host, host hibernation and its impact on guest VMs are not
- in scope for this documentation.
- Resuming a hibernated guest VM can be more challenging than with
- physical hardware because VMs make it very easy to change the hardware
- configuration between the hibernation and resume. Even when the resume
- is done on the same VM that hibernated, the memory size might be
- changed, or virtual NICs or SCSI controllers might be added or
- removed. Virtual PCI devices assigned to the VM might be added or
- removed. Most such changes cause the resume steps to fail, though
- adding a new virtual NIC, SCSI controller, or vPCI device should work.
- Additional complexity can ensue because the disks of the hibernated VM
- can be moved to another newly created VM that otherwise has the same
- virtual hardware configuration. While it is desirable for resume from
- hibernation to succeed after such a move, there are challenges. See
- details on this scenario and its limitations in the "Resuming on a
- Different VM" section below.
- Hyper-V also provides ways to move a VM from one Hyper-V host to
- another. Hyper-V tries to ensure processor model and Hyper-V version
- compatibility using VM Configuration Versions, and prevents moves to
- a host that isn't compatible. Linux adapts to host and processor
- differences by detecting them at boot time, but such detection is not
- done when resuming execution in the hibernation image. If a VM is
- hibernated on one host, then resumed on a host with a different processor
- model or Hyper-V version, settings recorded in the hibernation image
- may not match the new host. Because Linux does not detect such
- mismatches when resuming the hibernation image, undefined behavior
- and failures could result.
- Enabling Guest VM Hibernation
- -----------------------------
- Hibernation of a Hyper-V guest VM is disabled by default because
- hibernation is incompatible with memory hot-add, as provided by the
- Hyper-V balloon driver. If hot-add is used and the VM hibernates, it
- hibernates with more memory than it started with. But when the VM
- resumes from hibernation, Hyper-V gives the VM only the originally
- assigned memory, and the memory size mismatch causes resume to fail.
- To enable a Hyper-V VM for hibernation, the Hyper-V administrator must
- enable the ACPI virtual S4 sleep state in the ACPI configuration that
- Hyper-V provides to the guest VM. Such enablement is accomplished by
- modifying a WMI property of the VM, the steps for which are outside
- the scope of this documentation but are available on the web.
- Enablement is treated as the indicator that the administrator
- prioritizes Linux hibernation in the VM over hot-add, so the Hyper-V
- balloon driver in Linux disables hot-add. Enablement is indicated if
- the contents of /sys/power/disk contain "platform" as an option. The
- enablement is also visible in /sys/bus/vmbus/hibernation. See function
- hv_is_hibernation_supported().
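A user space check for enablement could look like the sketch below. It relies on the /sys/power/disk convention that the available modes are space-separated and the currently selected mode is wrapped in square brackets. The helper names ``parse_disk_modes`` and ``s4_enabled`` are hypothetical, and the functions take the file contents as a string so they can be tried without a Hyper-V guest.

```python
def parse_disk_modes(text):
    """Return (all_modes, selected_mode) from the contents of /sys/power/disk."""
    modes = []
    selected = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]      # strip the brackets marking the selection
            selected = token
        modes.append(token)
    return modes, selected


def s4_enabled(text):
    """True if "platform" is listed as an available hibernation mode."""
    modes, _ = parse_disk_modes(text)
    return "platform" in modes
```

For example, ``s4_enabled(open("/sys/power/disk").read())`` would report whether "platform" is offered in the running guest.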
- Linux supports ACPI sleep states on x86, but not on arm64. So Linux
- guest VM hibernation is not available on Hyper-V for arm64.
- Initiating Guest VM Hibernation
- -------------------------------
- Guest VMs can self-initiate hibernation using the standard Linux
- methods of writing "disk" to /sys/power/state or the reboot system
- call. As an additional layer, Linux guests on Hyper-V support the
- "Shutdown" integration service, via which a Hyper-V administrator can
- tell a Linux VM to hibernate using a command outside the VM. The
- command generates a request to the Hyper-V shutdown driver in Linux,
- which sends the uevent "EVENT=hibernate". See kernel functions
- shutdown_onchannelcallback() and send_hibernate_uevent(). A udev rule
- must be provided in the VM that handles this event and initiates
- hibernation.
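The event handling is normally expressed as a udev rule, but the decision it encodes can be sketched in Python. The sketch assumes the usual kernel uevent layout of NUL-separated ``KEY=VALUE`` fields following an ``action@devpath`` header; the function names and the sample payload shown afterwards are hypothetical and are not taken from a real VM.

```python
def parse_uevent(payload):
    """Split a raw uevent message (bytes) into a dict of KEY=VALUE fields."""
    fields = {}
    for part in payload.split(b"\0"):
        key, sep, value = part.decode(errors="replace").partition("=")
        if sep:                      # skips the header and empty trailing parts
            fields[key] = value
    return fields


def is_hibernate_request(payload):
    """True if the uevent carries EVENT=hibernate."""
    return parse_uevent(payload).get("EVENT") == "hibernate"
```

A handler that sees ``is_hibernate_request(...)`` return true for a payload such as ``b"change@/devices/example\0SUBSYSTEM=vmbus\0EVENT=hibernate\0"`` would then initiate hibernation, for instance by running "systemctl hibernate".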
- Handling VMBus Devices During Hibernation & Resume
- --------------------------------------------------
- The VMBus bus driver, and the individual VMBus device drivers,
- implement suspend and resume functions that are called as part of the
- Linux orchestration of hibernation and of resuming from hibernation.
- The overall approach is to leave in place the data structures for the
- primary VMBus channels and their associated Linux devices, such as
- SCSI controllers and others, so that they are captured in the
- hibernation image. This approach allows any state associated with the
- device to be persisted across the hibernation/resume. When the VM
- resumes, the devices are re-offered by Hyper-V and are connected to
- the data structures that already exist in the resumed hibernation
- image.
- VMBus devices are identified by class and instance GUID. (See section
- "VMBus device creation/deletion" in
- Documentation/virt/hyperv/vmbus.rst.) Upon resume from hibernation,
- the resume functions expect that the devices offered by Hyper-V have
- the same class/instance GUIDs as the devices present at the time of
- hibernation. Having the same class/instance GUIDs allows the offered
- devices to be matched to the primary VMBus channel data structures in
- the memory of the now resumed hibernation image. If any devices are
- offered that don't match primary VMBus channel data structures that
- already exist, they are processed normally as newly added devices. If
- primary VMBus channels that exist in the resumed hibernation image are
- not matched with a device offered in the resumed VM, the resume
- sequence waits for 10 seconds, then proceeds. But the unmatched device
- is likely to cause errors in the resumed VM.
- When resuming existing primary VMBus channels, the newly offered
- relids might be different because relids can change on each VM boot,
- even if the VM configuration hasn't changed. The VMBus bus driver
- resume function matches the class/instance GUIDs, and updates the
- relids in case they have changed.
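The matching rules above can be modeled in a few lines of Python. This is a simplified illustration, not the kernel implementation: the ``Channel`` type and ``match_offers`` function are hypothetical, GUIDs are plain strings, and the 10 second wait for missing offers is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    class_guid: str
    instance_guid: str
    relid: int


def match_offers(saved, offers):
    """Match re-offered devices to channels saved in the hibernation image.

    Returns (new_devices, unmatched): offers with no saved channel, and
    saved channels for which no offer arrived. Matched channels get the
    relid from the new offer.
    """
    by_identity = {(c.class_guid, c.instance_guid): c for c in saved}
    matched = set()
    new_devices = []
    for offer in offers:
        key = (offer.class_guid, offer.instance_guid)
        channel = by_identity.get(key)
        if channel is None:
            new_devices.append(offer)        # handled as a newly added device
        else:
            channel.relid = offer.relid      # relids may change on every boot
            matched.add(key)
    unmatched = [c for key, c in by_identity.items() if key not in matched]
    return new_devices, unmatched
```

In this model, a non-empty ``unmatched`` list corresponds to the case where the resumed VM is likely to see errors.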
- VMBus sub-channels are not persisted in the hibernation image. Each
- VMBus device driver's suspend function must close any sub-channels
- prior to hibernation. Closing a sub-channel causes Hyper-V to send a
- RESCIND_CHANNELOFFER message, which Linux processes by freeing the
- channel data structures so that all vestiges of the sub-channel are
- removed. By contrast, primary channels are marked closed and their
- ring buffers are freed, but Hyper-V does not send a rescind message,
- so the channel data structure continues to exist. Upon resume, the
- device driver's resume function re-allocates the ring buffer and
- re-opens the existing channel. It then communicates with Hyper-V to
- re-open sub-channels from scratch.
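The different treatment of primary channels and sub-channels can be pictured with the toy model below. The ``Device`` class, its attribute names, and the ring buffer size are hypothetical and only mirror the prose; real drivers do this through the VMBus channel APIs.

```python
class Device:
    """Toy model of a VMBus device with one primary channel and sub-channels."""

    RING_SIZE = 16  # arbitrary size for illustration

    def __init__(self, name, sub_channel_count):
        self.name = name
        self.sub_channel_count = sub_channel_count
        self.primary_open = True
        self.ring_buffer = bytearray(self.RING_SIZE)
        self.sub_channels = self._new_sub_channels()

    def _new_sub_channels(self):
        return [f"{self.name}-sub{i}" for i in range(self.sub_channel_count)]

    def suspend(self):
        # Sub-channels are rescinded by Hyper-V and disappear entirely.
        self.sub_channels = []
        # The primary channel is only marked closed; its ring buffer is
        # freed, but the Device object (the channel structure) survives.
        self.ring_buffer = None
        self.primary_open = False

    def resume(self):
        # Re-allocate the ring buffer and re-open the existing primary channel.
        self.ring_buffer = bytearray(self.RING_SIZE)
        self.primary_open = True
        # Sub-channels are negotiated again from scratch.
        self.sub_channels = self._new_sub_channels()
```

Between ``suspend()`` and ``resume()`` the object still exists with its name and configuration intact, which is the state captured in the hibernation image.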
- The Linux ends of Hyper-V sockets are forced closed at the time of
- hibernation. The guest can't force closing the host end of the socket,
- but any host-side actions on the host end will produce an error.
- VMBus devices use the same suspend function for the "freeze" and the
- "poweroff" phases, and the same resume function for the "thaw" and
- "restore" phases. See the "Entering Hibernation" section of
- Documentation/driver-api/pm/devices.rst for the sequencing of the
- phases.
- Detailed Hibernation Sequence
- -----------------------------
- 1. The Linux power management (PM) subsystem prepares for
- hibernation by freezing user space processes and allocating
- memory to hold the hibernation image.
- 2. As part of the "freeze" phase, Linux PM calls the "suspend"
- function for each VMBus device in turn. As described above, this
- function removes sub-channels, and leaves the primary channel in
- a closed state.
- 3. Linux PM calls the "suspend" function for the VMBus bus, which
- closes any Hyper-V socket channels and unloads the top-level
- VMBus connection with the Hyper-V host.
- 4. Linux PM disables non-boot CPUs, creates the hibernation image in
- the previously allocated memory, then re-enables non-boot CPUs.
- The hibernation image contains the memory data structures for the
- closed primary channels, but no sub-channels.
- 5. As part of the "thaw" phase, Linux PM calls the "resume" function
- for the VMBus bus, which re-establishes the top-level VMBus
- connection and requests that Hyper-V re-offer the VMBus devices.
- As offers are received for the primary channels, the relids are
- updated as previously described.
- 6. Linux PM calls the "resume" function for each VMBus device. Each
- device re-opens its primary channel, and communicates with Hyper-V
- to re-establish sub-channels if appropriate. The sub-channels
- are re-created as new channels since they were previously removed
- entirely in Step 2.
- 7. With VMBus devices now working again, Linux PM writes the
- hibernation image from memory to disk.
- 8. Linux PM repeats Steps 2 and 3 above as part of the "poweroff"
- phase. VMBus channels are closed and the top-level VMBus
- connection is unloaded.
- 9. Linux PM disables non-boot CPUs, and then enters ACPI sleep state
- S4. Hibernation is now complete.
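The ordering of the steps above can be summarized as a trace of callbacks. The Python sketch below is only a restatement of the list. The function name ``hibernation_trace`` and the tuple layout are hypothetical, and the device names passed in are placeholders.

```python
def hibernation_trace(devices):
    """Return the ordered (phase, action, target) tuples for one hibernation."""
    trace = []

    def quiesce(phase):
        # Device suspend functions run first, then the VMBus bus suspend.
        for dev in devices:
            trace.append((phase, "suspend", dev))
        trace.append((phase, "suspend", "vmbus"))

    def restart(phase):
        # The bus resumes first so that devices can be re-offered.
        trace.append((phase, "resume", "vmbus"))
        for dev in devices:
            trace.append((phase, "resume", dev))

    quiesce("freeze")                              # Steps 2 and 3
    trace.append(("image", "create", "memory"))    # Step 4
    restart("thaw")                                # Steps 5 and 6
    trace.append(("image", "write", "disk"))       # Step 7
    quiesce("poweroff")                            # Step 8
    trace.append(("acpi", "enter", "S4"))          # Step 9
    return trace
```

The trace makes it visible that each suspend function runs twice during one hibernation, once in the "freeze" phase and once in the "poweroff" phase.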
- Detailed Resume Sequence
- ------------------------
- 1. The guest VM boots into a fresh Linux OS instance. During boot,
- the top-level VMBus connection is established, and synthetic
- devices are enabled. This happens via the normal paths that don't
- involve hibernation.
- 2. Linux PM hibernation code reads swap space to find and read
- the hibernation image into memory. If there is no hibernation
- image, then this boot becomes a normal boot.
- 3. If this is a resume from hibernation, the "freeze" phase is used
- to shut down VMBus devices and unload the top-level VMBus
- connection in the running fresh OS instance, just like Steps 2
- and 3 in the hibernation sequence.
- 4. Linux PM disables non-boot CPUs, and transfers control to the
- read-in hibernation image. In the now-running hibernation image,
- non-boot CPUs are restarted.
- 5. As part of the "restore" phase, Linux PM repeats Steps 5 and 6
- from the hibernation sequence. The top-level VMBus connection is
- re-established, and offers are received and matched to primary
- channels in the image. Relids are updated. VMBus device resume
- functions re-open primary channels and re-create sub-channels.
- 6. Linux PM exits the hibernation resume sequence and the VM is now
- running normally from the hibernation image.
- Key-Value Pair (KVP) Pseudo-Device Anomalies
- --------------------------------------------
- The VMBus KVP device behaves differently from other pseudo-devices
- offered by Hyper-V. When the KVP primary channel is closed, Hyper-V
- sends a rescind message, which causes all vestiges of the device to be
- removed. But Hyper-V then re-offers the device, causing it to be newly
- re-created. The removal and re-creation occur during the "freeze"
- phase of hibernation, so the hibernation image contains the re-created
- KVP device. Similar behavior occurs during the "freeze" phase of the
- resume sequence while still in the fresh OS instance. But in both
- cases, the top-level VMBus connection is subsequently unloaded, which
- causes the device to be discarded on the Hyper-V side. So no harm is
- done and everything still works.
- Virtual PCI devices
- -------------------
- Virtual PCI devices are physical PCI devices that are mapped directly
- into the VM's physical address space so the VM can interact directly
- with the hardware. vPCI devices include those accessed via what Hyper-V
- calls "Discrete Device Assignment" (DDA), as well as SR-IOV NIC
- Virtual Function (VF) devices. See Documentation/virt/hyperv/vpci.rst.
- Hyper-V DDA devices are offered to guest VMs after the top-level VMBus
- connection is established, just like VMBus synthetic devices. They are
- statically assigned to the VM, and their instance GUIDs don't change
- unless the Hyper-V administrator makes changes to the configuration.
- DDA devices are represented in Linux as virtual PCI devices that have
- a VMBus identity as well as a PCI identity. Consequently, Linux guest
- hibernation first handles DDA devices as VMBus devices in order to
- manage the VMBus channel. But then they are also handled as PCI
- devices using the hibernation functions implemented by their native
- PCI driver.
- SR-IOV NIC VFs also have a VMBus identity as well as a PCI
- identity, and overall are processed similarly to DDA devices. A
- difference is that VFs are not offered to the VM during initial boot
- of the VM. Instead, the VMBus synthetic NIC driver first starts
- operating and communicates to Hyper-V that it is prepared to accept a
- VF, and then the VF offer is made. However, the VMBus connection
- might later be unloaded and then re-established without the VM being
- rebooted, as happens in Steps 3 and 5 in the Detailed Hibernation
- Sequence above and in the Detailed Resume Sequence. In such a case,
- the VFs likely became part of the VM during initial boot, so when the
- VMBus connection is re-established, the VFs are offered on the
- re-established connection without intervention by the synthetic NIC driver.
- UIO Devices
- -----------
- A VMBus device can be exposed to user space using the Hyper-V UIO
- driver (uio_hv_generic.c) so that a user space driver can control and
- operate the device. However, the VMBus UIO driver does not support the
- suspend and resume operations needed for hibernation. If a VMBus
- device is configured to use the UIO driver, hibernating the VM fails
- and Linux continues to run normally without hibernating. The most common use of the Hyper-V
- UIO driver is for DPDK networking, but there are other uses as well.
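Before attempting hibernation, a script could check whether any VMBus device is bound to the UIO driver. The sketch below assumes the standard sysfs layout in which each device directory has a ``driver`` symlink whose final path component is the driver name, and that the driver registers under the name "uio_hv_generic". The function name and the ``devices_dir`` parameter are hypothetical.

```python
import os


def devices_bound_to_uio(devices_dir="/sys/bus/vmbus/devices"):
    """Return the names of VMBus devices whose driver is uio_hv_generic."""
    bound = []
    for name in sorted(os.listdir(devices_dir)):
        link = os.path.join(devices_dir, name, "driver")
        # Devices with no bound driver have no "driver" symlink at all.
        if os.path.islink(link):
            driver = os.path.basename(os.readlink(link))
            if driver == "uio_hv_generic":
                bound.append(name)
    return bound
```

A non-empty result indicates that a hibernation request is expected to fail on this VM.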
- Resuming on a Different VM
- --------------------------
- This scenario occurs in the Azure public cloud in that a hibernated
- customer VM only exists as saved configuration and disks -- the VM no
- longer exists on any Hyper-V host. When the customer VM is resumed, a
- new Hyper-V VM with identical configuration is created, likely on a
- different Hyper-V host. That new Hyper-V VM becomes the resumed
- customer VM, and the steps the Linux kernel takes to resume from the
- hibernation image must work in that new VM.
- While the disks and their contents are preserved from the original VM,
- the Hyper-V-provided VMBus instance GUIDs of the disk controllers and
- other synthetic devices would typically be different. The difference
- would cause the resume from hibernation to fail, so several things are
- done to solve this problem:
- * For VMBus synthetic devices that support only a single instance,
- Hyper-V always assigns the same instance GUIDs. For example, the
- Hyper-V mouse, the shutdown pseudo-device, the time sync
- pseudo-device, etc., always have the same instance GUID, both for local
- Hyper-V installs as well as in the Azure cloud.
- * VMBus synthetic SCSI controllers may have multiple instances in a
- VM, and in the general case instance GUIDs vary from VM to VM.
- However, Azure VMs always have exactly two synthetic SCSI
- controllers, and Azure code overrides the normal Hyper-V behavior
- so these controllers are always assigned the same two instance
- GUIDs. Consequently, when a customer VM is resumed on a newly
- created VM, the instance GUIDs match. But this guarantee does not
- hold for local Hyper-V installs.
- * Similarly, VMBus synthetic NICs may have multiple instances in a
- VM, and the instance GUIDs vary from VM to VM. Again, Azure code
- overrides the normal Hyper-V behavior so that the instance GUID
- of a synthetic NIC in a customer VM does not change, even if the
- customer VM is deallocated or hibernated, and then re-constituted
- on a newly created VM. As with SCSI controllers, this behavior
- does not hold for local Hyper-V installs.
- * vPCI devices do not have the same instance GUIDs when resuming
- from hibernation on a newly created VM. Consequently, Azure does
- not support hibernation for VMs that have DDA devices such as
- NVMe controllers or GPUs. For SR-IOV NIC VFs, Azure removes the
- VF from the VM before it hibernates so that the hibernation image
- does not contain a VF device. When the VM is resumed it
- instantiates a new VF, rather than trying to match against a VF
- that is present in the hibernation image. Because Azure must
- remove any VFs before initiating hibernation, Azure VM
- hibernation must be initiated externally from the Azure Portal or
- Azure CLI, which in turn uses the Shutdown integration service to
- tell Linux to do the hibernation. If hibernation is self-initiated
- within the Azure VM, VFs remain in the hibernation image, and are
- not resumed properly.
- In summary, Azure takes special actions to remove VFs and to ensure
- that VMBus device instance GUIDs match on a new/different VM, allowing
- hibernation to work for most general-purpose Azure VM sizes. While
- similar special actions could be taken when resuming on a different VM
- on a local Hyper-V install, orchestrating such actions is not provided
- out-of-the-box by local Hyper-V and so requires custom scripting.