Bug 1928838

Summary: Don't use VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE for virDomainSnapshotCreateXML when filesystems are already frozen by virDomainFSFreeze
Product: Red Hat OpenStack
Component: openstack-nova
Version: 16.2 (Train)
Status: CLOSED EOL
Severity: low
Priority: low
Reporter: Peter Krempa <pkrempa>
Assignee: OSP DFG:Compute <osp-dfg-compute>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
CC: alifshit, dasmith, eglynn, jhakimra, kchamart, sbauza, sgordon, vromanso
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2025-01-17 15:32:03 UTC

Description Peter Krempa 2021-02-15 15:59:37 UTC
Description of problem:
Observed in logs from https://bugzilla.redhat.com/show_bug.cgi?id=1927136

Nova issues the following libvirt API calls (irrelevant libvirt output between them has been trimmed):

2021-02-10 07:02:22.164+0000: 166314: debug : virDomainFSFreeze:11329 : dom=0x7f695000ce40, (VM: name=instance-00000018, uuid=0129f9e7-3016-496b-baa9-2cfc3d57414f), mountpoints=(nil), nmountpoints=0, flags=0x0

2021-02-10 07:02:24.097+0000: 166311: debug : virDomainSnapshotCreateXML:221 : dom=0x7f690c009da0, (VM: name=instance-00000018, uuid=0129f9e7-3016-496b-baa9-2cfc3d57414f), xmlDesc=<domainsnapshot>
  <disks>
    <disk name="/var/lib/nova/mnt/805af70202ed20867b0f31abdf6acba4/volume-880e38be-1905-470b-86c0-7a98783e8a67.6296bdcb-2cd4-4be5-921e-362930b2bcea" snapshot="external" type="file">
      <source file="/var/lib/nova/mnt/805af70202ed20867b0f31abdf6acba4/volume-880e38be-1905-470b-86c0-7a98783e8a67.feb74d31-e3ac-4a14-b077-a4253df148c6"/>
    </disk>
  </disks>
</domainsnapshot>
, flags=0x74

2021-02-10 07:02:24.133+0000: 166314: debug : virDomainSnapshotCreateXML:221 : dom=0x7f695000ce40, (VM: name=instance-00000018, uuid=0129f9e7-3016-496b-baa9-2cfc3d57414f), xmlDesc=<domainsnapshot>
  <disks>
    <disk name="/var/lib/nova/mnt/805af70202ed20867b0f31abdf6acba4/volume-880e38be-1905-470b-86c0-7a98783e8a67.6296bdcb-2cd4-4be5-921e-362930b2bcea" snapshot="external" type="file">
      <source file="/var/lib/nova/mnt/805af70202ed20867b0f31abdf6acba4/volume-880e38be-1905-470b-86c0-7a98783e8a67.feb74d31-e3ac-4a14-b077-a4253df148c6"/>
    </disk>
  </disks>
</domainsnapshot>
, flags=0x34

2021-02-10 07:02:25.752+0000: 166315: debug : virDomainFSThaw:11371 : dom=0x7f694c00cf30, (VM: name=instance-00000018, uuid=0129f9e7-3016-496b-baa9-2cfc3d57414f), flags=0x0

The flags for virDomainSnapshotCreateXML have the following meanings:

VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE 	= 	1 (0x1; 1 << 0) Restore or alter metadata
VIR_DOMAIN_SNAPSHOT_CREATE_CURRENT 	= 	2 (0x2; 1 << 1) With redefine, make snapshot current
VIR_DOMAIN_SNAPSHOT_CREATE_NO_METADATA 	= 	4 (0x4; 1 << 2) Make snapshot without remembering it
VIR_DOMAIN_SNAPSHOT_CREATE_HALT 	= 	8 (0x8; 1 << 3) Stop running guest after snapshot
VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY 	= 	16 (0x10; 1 << 4) disk snapshot, not full system
VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT 	= 	32 (0x20; 1 << 5) reuse any existing external files
VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE 	= 	64 (0x40; 1 << 6) use guest agent to quiesce all mounted file systems within the domain
VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC 	= 	128 (0x80; 1 << 7) atomically avoid partial changes
VIR_DOMAIN_SNAPSHOT_CREATE_LIVE 	= 	256 (0x100; 1 << 8) create the snapshot while the guest is running
VIR_DOMAIN_SNAPSHOT_CREATE_VALIDATE 	= 	512 (0x200; 1 << 9) validate the XML against the schema
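
The flag values in the two logged calls (0x74 and 0x34) can be decoded against this table. A minimal Python sketch follows; the enum only mirrors the values listed above and is not the libvirt Python binding, and the name decode() is made up for illustration:

```python
from enum import IntFlag
from typing import List


class SnapshotCreateFlags(IntFlag):
    """Mirror of the VIR_DOMAIN_SNAPSHOT_CREATE_* values listed above."""
    REDEFINE = 1 << 0
    CURRENT = 1 << 1
    NO_METADATA = 1 << 2
    HALT = 1 << 3
    DISK_ONLY = 1 << 4
    REUSE_EXT = 1 << 5
    QUIESCE = 1 << 6
    ATOMIC = 1 << 7
    LIVE = 1 << 8
    VALIDATE = 1 << 9


def decode(flags: int) -> List[str]:
    """Return the names of all flags set in the bitmask, lowest bit first."""
    return [f.name for f in SnapshotCreateFlags if flags & f]


print(decode(0x74))  # ['NO_METADATA', 'DISK_ONLY', 'REUSE_EXT', 'QUIESCE']
print(decode(0x34))  # ['NO_METADATA', 'DISK_ONLY', 'REUSE_EXT']
```

The only difference between the two calls is the QUIESCE bit (0x40).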

First comes a virDomainFSFreeze call, then a virDomainSnapshotCreateXML call with flags=0x74, which includes VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE (0x40). This call always fails because the filesystems are already frozen. virDomainSnapshotCreateXML is then issued again with flags=0x34, i.e. without VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE, and succeeds. Finally, the filesystems are unfrozen via virDomainFSThaw.

This sequence doesn't make sense: if an explicit virDomainFSFreeze is used, there is no point in also passing VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE, which will always fail because the guest agent doesn't allow a double freeze.
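
One way to express the intended behaviour is to leave the QUIESCE bit out whenever the caller has already frozen the filesystems itself. A rough sketch, assuming the flag values from the table above; snapshot_flags() and its parameters are hypothetical and do not exist in Nova or libvirt:

```python
# Values copied from the flag table in this report.
NO_METADATA = 1 << 2
DISK_ONLY = 1 << 4
REUSE_EXT = 1 << 5
QUIESCE = 1 << 6


def snapshot_flags(want_quiesce: bool, explicitly_frozen: bool) -> int:
    """Hypothetical helper: build the bitmask, skipping QUIESCE after an explicit freeze."""
    flags = NO_METADATA | DISK_ONLY | REUSE_EXT
    if want_quiesce and not explicitly_frozen:
        # Only ask the snapshot API to quiesce when nobody froze the guest already.
        flags |= QUIESCE
    return flags


print(hex(snapshot_flags(True, explicitly_frozen=True)))   # 0x34
print(hex(snapshot_flags(True, explicitly_frozen=False)))  # 0x74
```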

(Note that there's also a bug in libvirt where filesystems are thawed in the failed virDomainSnapshotCreateXML invocation; see https://bugzilla.redhat.com/show_bug.cgi?id=1928819 )


Version-Release number of selected component (if applicable):
libvirt-daemon-7.0.0-3.module+el8.4.0+9709+a99efd61.x86_64
qemu-kvm-5.2.0-4.module+el8.4.0+9676+589043b9.x86_64
kernel: 4.18.0-282.el8.x86_64
OSP16.2: 
openstack-nova-compute-20.4.2-2.20201224134938.81a3f4b.el8ost.1.noarch

How reproducible:


Steps to Reproduce:
1. See steps in https://bugzilla.redhat.com/show_bug.cgi?id=1927136

Comment 1 Kashyap Chamarthy 2021-02-16 16:43:41 UTC
Looking at the nova-compute exception fragment[1], this comes from the
_volume_snapshot_create() method in Nova's libvirt driver[2], which
appears to implement the following logic.

Before taking a snapshot, the _volume_snapshot_create() method checks if 
we can quiesce the guest:

  - if the guest is capable of quiescing, then it tries guest.snapshot()
    with "quiesce=True"

        [...] # if the user requested quiescing as part of the snapshot
        (via a parameter on the template image from which the guest
        boots) and Nova can't honour that, then an error is raised

  - but if the guest is _not_ capable of quiescing, then the
    guest.snapshot() call is retried with "quiesce=False"

This logic was introduced in Nova commit[3] to fix a bug where Nova
attempted to quiesce during a volume (i.e. a detachable block device)
snapshot without checking whether the guest is _capable_ of quiescing.
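
The fallback described above can be sketched roughly as follows. This is a simplified illustration of the description in this comment, not Nova's actual code; SnapshotError, create_volume_snapshot() and the take_snapshot callable are hypothetical names:

```python
class SnapshotError(Exception):
    """Hypothetical stand-in for the error a failed snapshot raises."""


def create_volume_snapshot(take_snapshot, can_quiesce, quiesce_required=False):
    """Try a quiesced snapshot first, then fall back to an unquiesced one.

    take_snapshot is a hypothetical callable accepting a `quiesce` keyword.
    Returns True if the snapshot that succeeded was quiesced.
    """
    if can_quiesce:
        try:
            take_snapshot(quiesce=True)
            return True
        except SnapshotError:
            if quiesce_required:
                # The user asked for a quiesced snapshot and it cannot be honoured.
                raise
    elif quiesce_required:
        raise SnapshotError("quiesce required but guest cannot quiesce")
    take_snapshot(quiesce=False)
    return False


calls = []


def frozen_guest_snapshot(quiesce):
    """Pretend guest whose filesystems are already frozen: quiescing fails."""
    calls.append(quiesce)
    if quiesce:
        raise SnapshotError("guest-fsfreeze-freeze has been disabled")


quiesced = create_volume_snapshot(frozen_guest_snapshot, can_quiesce=True)
print(quiesced, calls)  # False [True, False]
```

With an already-frozen guest, the first attempt is wasted every time and only the retry succeeds, which matches the two virDomainSnapshotCreateXML calls in the libvirt log.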



[1] exception fragment from nova-compute.log
---------------------------------------------------------
[...]
2021-02-10 06:00:55.038 7 ERROR nova.virt.libvirt.driver [instance: 72786c63-160a-44d0-941e-3ce056afebe2]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 2897, in _volume_snapshot_create
2021-02-10 06:00:55.038 7 ERROR nova.virt.libvirt.driver [instance: 72786c63-160a-44d0-941e-3ce056afebe2]     reuse_ext=True, quiesce=True)
[...]
2021-02-10 05:59:29.293 7 ERROR nova.virt.libvirt.driver [instance: 72786c63-160a-44d0-941e-3ce056afebe2]   File "/usr/lib64/python3.6/site-packages/libvirt.py", line 2814, in snapshotCreateXML
2021-02-10 05:59:29.293 7 ERROR nova.virt.libvirt.driver [instance: 72786c63-160a-44d0-941e-3ce056afebe2]     if ret is None:raise libvirtError('virDomainSnapshotCreateXML() failed', dom=self)
2021-02-10 05:59:29.293 7 ERROR nova.virt.libvirt.driver [instance: 72786c63-160a-44d0-941e-3ce056afebe2] libvirt.libvirtError: internal error: unable to execute QEMU agent command 'guest-fsfreeze-freeze': The command guest-fsfreeze-freeze has been disabled for this instance
[...]
---------------------------------------------------------

[2] https://github.com/openstack/nova/blob/308c6007dcbced/nova/virt/libvirt/driver.py#L2791,#L2820

[3] https://opendev.org/openstack/nova/commit/e659a6e7cbb30 (libvirt: 
    check if we can quiesce before volume-backed snapshot; 2016-09-30)

Comment 2 Peter Krempa 2021-02-18 16:42:07 UTC
The problem isn't that 'guest.snapshot(quiesce=True)' is followed by 'guest.snapshot(quiesce=False)' if the former fails. That is a reasonable algorithm when the quiescing is done as an integral part of the libvirt snapshot API.

The problem lies with an explicit quiesce done via 'virDomainFSFreeze' (https://github.com/openstack/nova/blob/308c6007dcbced8f4e97b1712ade66b27949b712/nova/virt/libvirt/guest.py#L546) followed by a snapshot with quiesce=True; in libvirt terms, virDomainFSFreeze followed by virDomainSnapshotCreateXML(..., VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE). The qemu guest agent doesn't allow freezing filesystems that are already frozen, so a snapshot with quiescing enabled will always fail if the filesystems are already quiesced.
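
A toy model makes the failure mode concrete. It assumes only what is stated above (a freeze is rejected while the filesystems are already frozen); ToyGuestAgent and toy_snapshot() are made-up names, and the model deliberately ignores the separate libvirt thaw bug mentioned in the description:

```python
class ToyGuestAgent:
    """Toy stand-in for the guest agent: a second freeze is rejected."""

    def __init__(self):
        self.frozen = False

    def freeze(self):
        if self.frozen:
            raise RuntimeError("guest-fsfreeze-freeze has been disabled")
        self.frozen = True

    def thaw(self):
        self.frozen = False


def toy_snapshot(agent, quiesce):
    """Model a disk snapshot; with quiesce=True it freezes and thaws itself."""
    if quiesce:
        agent.freeze()  # raises if an explicit freeze is already active
        agent.thaw()
    return "ok"


agent = ToyGuestAgent()
agent.freeze()  # explicit freeze, as virDomainFSFreeze would do
results = []
for quiesce in (True, False):
    try:
        results.append(toy_snapshot(agent, quiesce))
    except RuntimeError:
        results.append("failed")
agent.thaw()  # explicit thaw, as virDomainFSThaw would do
print(results)  # ['failed', 'ok']
```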

I've also updated the libvirt docs (https://gitlab.com/libvirt/libvirt/-/commit/ec86b8fa29fa97b51382eb19ca2355c87dfcc38f) to promote the use of explicit quiescing.

Comment 3 Artom Lifshitz 2025-01-17 15:32:03 UTC
While the reported behavior isn't great, as far as I understand it the impact is minimal and not user-visible. Being realistic, we'll never get around to fixing this (and it's been 4 years since the bug report anyways). Closing.