Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1984209

Summary: VDSM reports failed snapshot to engine, but it succeeded. Then engine deletes the volume and causes data corruption.
Product: Red Hat Enterprise Virtualization Manager Reporter: Germano Veit Michel <gveitmic>
Component: vdsm    Assignee: Liran Rotenberg <lrotenbe>
Status: CLOSED ERRATA QA Contact: Tamir <tamir>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.4.6    CC: ahadas, bcholler, dfodor, emarcus, eshenitz, lsurette, mzamazal, pagranat, srevivo, ycui
Target Milestone: ovirt-4.4.8   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: vdsm-4.40.80.4 Doc Type: Bug Fix
Doc Text:
Previously, when a snapshot failed to execute and was re-executed later, the second try would fail because it used data from the previous execution. In this release, this data is used only when needed, in recovery mode.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-09-08 14:11:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2021-07-21 00:29:49 UTC
Description of problem:

I think this is the same issue as BZ1916122, which wasn't very clear at the time.

It seems some leftover state from a previous snapshot job on the VM (timed out, completed, aborted...) incorrectly causes a subsequent snapshot to fail immediately. But that second snapshot actually completes on libvirt/qemu later on.

1) First snapshot fails 

NOTE: this failure is not the subject of this bug - here the storage is simply slow and the engine cleans up correctly - but I think it leaves some leftover state in VDSM that causes the bug later.

2021-07-17 10:01:32,927+0300 INFO  (jsonrpc/7) [api.virt] START snapshot(snapDrives=[{'imageID': 'df8bcb00-3bc2-4876-ba37-a617d3969e08', 'baseVolumeID': '83195d30-9232-4f92-a37c-9c9fd2a867db', 'volumeID': 'ee4604e0-7ef8-4afd-9307-01faae5d2fe5', 'domainID': '94ef2326-eb1f-45e2-9727-6bbd9e02e6a9'}], snapMemory=None, frozen=False, jobUUID='d868f0a6-1563-4806-8e8d-758e6f9944a2', timeout=30) from=::ffff:192.168.213.20,33846, flow_id=8ed1a5a2-0b70-4c20-b4d0-5afb1298c85d, vmId=36afee56-fd82-47f9-9de0-621b282011a4 (api:48)


2021-07-17 10:01:32,928+0300 INFO  (virt/d868f0a6) [root] Running job 'd868f0a6-1563-4806-8e8d-758e6f9944a2'... (jobs:185)
2021-07-17 10:01:32,929+0300 INFO  (snap_abort/d868f0a6) [virt.vm] (vmId='36afee56-fd82-47f9-9de0-621b282011a4') Starting snapshot abort job, with check interval 60 (snapshot:616)

Within 1 minute, the job fails because it could not complete a simple prepareImage call in time:

2021-07-17 10:02:33,975+0300 INFO  (virt/d868f0a6) [vdsm.api] FINISH prepareImage error=Connection timed out from=internal, task_id=ecb74abc-afcf-4521-914c-8f276e204c66 (api:52)
2021-07-17 10:02:33,976+0300 ERROR (virt/d868f0a6) [storage.TaskManager.Task] (Task='ecb74abc-afcf-4521-914c-8f276e204c66') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-169>", line 2, in prepareImage
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3267, in prepareImage
    allVols = dom.getAllVolumes()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 872, in getAllVolumes
    return self._manifest.getAllVolumes()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 291, in getAllVolumes
    volMetaPaths = self.oop.glob.glob(volMetaPattern)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 106, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 592, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 473, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
ioprocess.Timeout: Connection timed out

2021-07-17 10:02:33,977+0300 ERROR (virt/d868f0a6) [virt.vm] (vmId='36afee56-fd82-47f9-9de0-621b282011a4') unable to prepare the volume path for disk vda (snapshot:386)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/jobs/snapshot.py", line 383, in snapshot
    self._vm.cif.prepareVolumePath(new_drives[vm_dev_name])
  File "/usr/lib/python3.6/site-packages/vdsm/clientIF.py", line 429, in prepareVolumePath
    raise vm.VolumeError(drive)
vdsm.virt.vm.VolumeError: Bad volume specification {'device': 'disk', 'domainID': '94ef2326-eb1f-45e2-9727-6bbd9e02e6a9', 'imageID': 'df8bcb00-3bc2-4876-ba37-a617d3969e08', 'volumeID': 'ee4604e0-7ef8-4afd-9307-01faae5d2fe5', 'type': 'disk', 'diskType': 'file', 'poolID': 'aeba4874-3b9f-4a58-bd68-4277473c1a05', 'name': 'vda', 'format': 'cow'}

For this one the engine cleans up correctly: the VM is never snapshotted and the volume is deleted. All good.

2) Now here comes the problem: we try to snapshot again, and this time prepareImage completes and VDSM goes for the actual snapshot. But there are some red flags in the logs.

2021-07-17 10:11:01,797+0300 INFO  (jsonrpc/2) [api.virt] START snapshot(snapDrives=[{'imageID': 'df8bcb00-3bc2-4876-ba37-a617d3969e08', 'baseVolumeID': '83195d30-9232-4f92-a37c-9c9fd2a867db', 'volumeID': '3b1f90bd-19a9-4b69-94d3-6510e9f81078', 'domainID': '94ef2326-eb1f-45e2-9727-6bbd9e02e6a9'}], snapMemory=None, frozen=False, jobUUID='8b9a245b-2729-4dee-8620-d5767b658587', timeout=30) from=::ffff:192.168.213.20,33846, flow_id=719bea3b-8f53-411c-8de3-1334c481bc0e, vmId=36afee56-fd82-47f9-9de0-621b282011a4 (api:48)

This does not make much sense; it should have slept at least 1s here:

    def _job_running_init(self):
        while not self._job_running() and not self._completed.is_set() and \
                self._timeout_not_reached() and not self._abort.is_set():
            time.sleep(1)
        if self._job_running():
            self._vm.log.debug("The snapshot job is running")
            return True
        return False

So it returned False: it is called from here, which then logged "Snapshot job didn't start on the domain".

    def run(self):
        monitoring_interval = min(60, self._timeout // 10)
        self._vm.log.info("Starting snapshot abort job, "
                          "with check interval %s",
                          monitoring_interval)
        # Waiting for the job to run on libvirt
        if self._job_running_init():
            while self._timeout_not_reached() and self._job_running() and not \
                    self._completed.is_set():
                self._vm.log.info("Time passed: %s, out of %s",
                                  self._running_time(), self._timeout)
                time.sleep(monitoring_interval)
            if not self._job_completed():
                self._abort_job()
        elif not self._job_completed():
            self._vm.log.error("Snapshot job didn't start on the domain")

  
Logs:

2021-07-17 10:11:01,799+0300 INFO  (snap_abort/8b9a245b) [virt.vm] (vmId='36afee56-fd82-47f9-9de0-621b282011a4') Starting snapshot abort job, with check interval 60 (snapshot:616)
2021-07-17 10:11:01,801+0300 ERROR (snap_abort/8b9a245b) [virt.vm] (vmId='36afee56-fd82-47f9-9de0-621b282011a4') Snapshot job didn't start on the domain (snapshot:627)
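The immediate failure can be illustrated with a standalone sketch (a hypothetical class, not the real VDSM code): if a leftover abort (or completed) flag is already set when the job object is created, every condition of the wait loop above fails on the very first check, so _job_running_init() returns False without a single 1s sleep.

```python
import threading

# Hypothetical stand-in for the snapshot job's wait loop (not the real
# VDSM class): shows that a pre-set abort flag makes _job_running_init()
# return False without waiting, while a fresh job polls normally.
class JobWaiter:
    def __init__(self, abort_set=False):
        self._abort = threading.Event()
        self._completed = threading.Event()
        if abort_set:
            self._abort.set()       # simulate leftover state from a prior job
        self.sleeps = 0             # counts iterations standing in for time.sleep(1)
        self._checks = 0

    def _job_running(self):
        # Pretend libvirt reports the job as running after the third poll.
        self._checks += 1
        return self._checks > 3

    def _timeout_not_reached(self):
        return True

    def _job_running_init(self):
        while (not self._job_running() and not self._completed.is_set()
               and self._timeout_not_reached() and not self._abort.is_set()):
            self.sleeps += 1        # real code: time.sleep(1)
        return self._job_running()

fresh = JobWaiter()
assert fresh._job_running_init() is True    # waited, then saw the job
assert fresh.sleeps == 3

stale = JobWaiter(abort_set=True)
assert stale._job_running_init() is False   # leftover abort flag: no wait at all
assert stale.sleeps == 0
```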

So I assume one of the following may have already been set: completed, abort, or timeout? It looks like these may be loaded from the snapshot recovery metadata?

Even then VDSM goes for the actual snapshot, and here we have a hint that maybe the one already set was the timeout - see:

2021-07-17 10:11:08,205+0300 INFO  (virt/8b9a245b) [virt.vm] (vmId='36afee56-fd82-47f9-9de0-621b282011a4') Taking a live snapshot (drives=vda,memory=True) (snapshot:460)
2021-07-17 10:11:08,350+0300 INFO  (virt/8b9a245b) [virt.vm] (vmId='36afee56-fd82-47f9-9de0-621b282011a4') Snapshot timeout reached, operation aborted (snapshot:80)
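One way a timeout could fire only ~7 seconds after the API call is if the elapsed-time check ran against a startTime persisted by the earlier failed attempt. A minimal sketch of such a check (hypothetical function name; assuming a monotonic-clock startTime like the one in the metadata blob quoted in comment 4):

```python
import time

# Hypothetical sketch of an elapsed-time check (not the actual VDSM code):
# if start_time was reloaded from a *previous* attempt's metadata, the new
# job looks expired the moment it starts.
def timeout_reached(start_time, timeout, now=None):
    """Return True if the job's running time exceeds its timeout budget."""
    if now is None:
        now = time.monotonic()
    return (now - start_time) > timeout

# Fresh job: started 5s ago with a 30s budget -> not expired.
assert timeout_reached(start_time=1000.0, timeout=30, now=1005.0) is False

# Stale metadata: startTime recorded ~10 minutes earlier by the failed
# first attempt, so the 30s budget is already blown at t=0 of attempt 2.
assert timeout_reached(start_time=1000.0, timeout=30, now=1570.0) is True
```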

The engine sees the failed job from VDSM, dumps the domain XML too early (before the libvirt/qemu snapshot completes) and rolls back, deleting the volume:


2021-07-17 10:11:11,183+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DumpXmlsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-65) [719bea3b-8f53-411c-8de3-1334c481bc0e] FINISH, DumpXmlsVDSCommand, return: {36afee56-fd82-47f9-9de0-621b282011a4=<domain type='kvm' id='20' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>ITIP_769_PDNS1</name>
  <uuid>36afee56-fd82-47f9-9de0
  ...
    <disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads' iothread='1'/>
      <source file='/rhev/data-center/mnt/glusterSD/rhgs1-hdd1:_rhv__hdd__vol01/94ef2326-eb1f-45e2-9727-6bbd9e02e6a9/images/df8bcb00-3bc2-4876-ba37-a617d3969e08/83195d30-9232-4f92-a37c-9c9fd2a867db' index='1'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <serial>df8bcb00-3bc2-4876-ba37-a617d3969e08</serial>
      <boot order='1'/>
      <alias name='ua-df8bcb00-3bc2-4876-ba37-a617d3969e08'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>

2021-07-17 10:11:11,237+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-65) [719bea3b-8f53-411c-8de3-1334c481bc0e] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='aeba4874-3b9f-4a58-bd68-4277473c1a05', ignoreFailoverLimit='false', storageDomainId='94ef2326-eb1f-45e2-9727-6bbd9e02e6a9', imageGroupId='df8bcb00-3bc2-4876-ba37-a617d3969e08', imageId='00000000-0000-0000-0000-000000000000', imageList='[3b1f90bd-19a9-4b69-94d3-6510e9f81078]', postZero='false', force='false'}), log id: 437303c


The VM pauses:
2021-07-17 10:12:08,463+0300 INFO  (libvirt/events) [virt.vm] (vmId='36afee56-fd82-47f9-9de0-621b282011a4') abnormal vm stop device ua-df8bcb00-3bc2-4876-ba37-a617d3969e08 error  (vm:4910)
2021-07-17 10:12:08,463+0300 INFO  (libvirt/events) [virt.vm] (vmId='36afee56-fd82-47f9-9de0-621b282011a4') CPU stopped: onIOError (vm:5742)
2021-07-17 10:12:08,465+0300 INFO  (libvirt/events) [virt.vm] (vmId='36afee56-fd82-47f9-9de0-621b282011a4') CPU stopped: onSuspend (vm:5742)
2021-07-17 10:12:08,503+0300 WARN  (libvirt/events) [virt.vm] (vmId='36afee56-fd82-47f9-9de0-621b282011a4') device vda reported I/O error (vm:3877)

The domain XML of the paused VM in the sosreport shows it was using the volume deleted by the engine, so it seems the snapshot failed on the VDSM side with that timeout but actually completed in libvirt and qemu.

    <disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads' iothread='1'/>
      <source file='/rhev/data-center/mnt/glusterSD/rhgs1-hdd1:_rhv__hdd__vol01/94ef2326-eb1f-45e2-9727-6bbd9e02e6a9/images/df8bcb00-3bc2-4876-ba37-a617d3969e08/3b1f90bd-19a9-4b69-94d3-6510e9f81078' index='3'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore type='file' index='1'>
        <format type='qcow2'/>
        <source file='/rhev/data-center/mnt/glusterSD/rhgs1-hdd1:_rhv__hdd__vol01/94ef2326-eb1f-45e2-9727-6bbd9e02e6a9/images/df8bcb00-3bc2-4876-ba37-a617d3969e08/83195d30-9232-4f92-a37c-9c9fd2a867db'>
          <seclabel model='dac' relabel='no'/>
        </source>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <serial>df8bcb00-3bc2-4876-ba37-a617d3969e08</serial>
      <boot order='1'/>
      <alias name='ua-df8bcb00-3bc2-4876-ba37-a617d3969e08'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>

Version-Release number of selected component (if applicable):
vdsm-4.40.50.10-1.el8ev.x86_64
rhvm-4.4.6.8-0.1.el8ev.noarch

How reproducible:
- Unknown
- I'll try later this week to add a fake exception to the code that is hit only on the first VM snapshot, and will report here if it reproduces.

Comment 3 Arik 2021-07-28 06:13:27 UTC
If we were looking for justification to make changes in this mechanism (bz 1985973), then here we have one.
We can try to reproduce it if needed by modifying the VDSM code to fail preparing the volume(s) on the first attempt and seeing what happens on the next attempt.
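The repro idea above can be sketched as a tiny fault-injection pattern (hypothetical names, not VDSM code): fail the prepare step on the first attempt only, persist state the way the real job does, and check what the second attempt inherits.

```python
# Hypothetical fault-injection sketch of the repro idea (not VDSM code):
# the first attempt fails during volume preparation but leaves metadata
# behind; the second attempt runs with that leftover state loaded.
class FakeSnapshotJob:
    fail_first_prepare = True   # injected fault, cleared after one use
    persisted_md = None         # simulates the on-VM metadata blob

    def run(self):
        # Load leftover metadata the way the buggy flow did.
        self.loaded_md = FakeSnapshotJob.persisted_md
        if FakeSnapshotJob.fail_first_prepare:
            FakeSnapshotJob.fail_first_prepare = False
            # Persist state, as the real job does, then fail.
            FakeSnapshotJob.persisted_md = {"abort": True}
            raise RuntimeError("injected prepareImage failure")
        return "snapshot-ok"

job1 = FakeSnapshotJob()
try:
    job1.run()
except RuntimeError:
    pass                        # first attempt fails as injected

job2 = FakeSnapshotJob()
result = job2.run()
# The second attempt succeeds, but it ran with the first attempt's
# leftover metadata loaded - the condition this bug describes.
assert result == "snapshot-ok"
assert job2.loaded_md == {"abort": True}
```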

Comment 4 Liran Rotenberg 2021-07-28 08:17:42 UTC
I see a few problems:
The main one is that on failure, the metadata stays.
I added a raise at the same place as the 1st snapshot failure, and I got this metadata on the VM:
    <ovirt-vm:snapshot_job>{"startTime": "1599532.231465163", "timeout": "1800", "abort": true, "completed": false, "jobUUID": "c249b960-0c47-4c79-857e-f33b4e3b79ab", "frozen": false, "memoryParams": {}}</ovirt-vm:snapshot_job>
When we start a snapshot, we try to read it in the initialization phase:

        self._snapshot_job = read_snapshot_md(self._vm, self._lock)
        self._load_metadata()

It makes sense to read it only on recovery.

The second thing is where we check the abort flag; currently it's checked only in later phases of the snapshot operation. We may want to catch it earlier.

Combining the two: the abort flag is true, we go ahead with the 2nd snapshot and call libvirt (the pivot happens), but then we fail the job while it is running and succeeding.
We improved some of the handling not long ago (4.4.7): https://gerrit.ovirt.org/#/c/ovirt-engine/+/113756/ but I'm not sure it's enough; we must at least clear the metadata, or not read it when we aren't in recovery.
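A minimal sketch of the fix direction described above (hypothetical helper names, not the actual vdsm-4.40.80.4 patch): consult the persisted snapshot-job metadata only when the job is constructed in recovery mode, so a fresh snapshot attempt starts with clean state.

```python
# Hypothetical sketch of the fix direction (not the actual VDSM patch):
# persisted snapshot-job metadata is read only in recovery mode, so a
# brand-new snapshot attempt never inherits a stale abort flag,
# startTime, or completed flag from an earlier failed attempt.

def read_snapshot_md(persisted):
    # Stand-in for reading the <ovirt-vm:snapshot_job> metadata blob.
    return dict(persisted) if persisted else None

class SnapshotJob:
    def __init__(self, persisted_md, recovery=False):
        if recovery:
            # Only a recovering job should resume from persisted state.
            self._snapshot_job = read_snapshot_md(persisted_md)
        else:
            self._snapshot_job = None   # fresh job: clean state

    @property
    def abort_preset(self):
        return bool(self._snapshot_job and self._snapshot_job.get("abort"))

stale_md = {"abort": True, "completed": False, "startTime": "1599532.23"}

# Second snapshot attempt (not a recovery) ignores the leftover blob:
assert SnapshotJob(stale_md, recovery=False).abort_preset is False

# A genuine recovery after a VDSM restart still resumes from it:
assert SnapshotJob(stale_md, recovery=True).abort_preset is True
```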

Comment 8 Polina 2021-08-23 11:27:06 UTC
Taking https://bugzilla.redhat.com/show_bug.cgi?id=1984209#c7 into consideration, QE can't reproduce the issue.
We are relying on the engineers' verification by changing the code.

Comment 13 errata-xmlrpc 2021-09-08 14:11:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Virtualization Host security and bug fix update [ovirt-4.4.8]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3459