Bug 1916122
Summary: | Disk corruption caused by removing in use snapshot LV | |
---|---|---|---
Product: | [oVirt] ovirt-engine | Reporter: | Jean-Louis Dupond <jean-louis>
Component: | BLL.Storage | Assignee: | Liran Rotenberg <lrotenbe>
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | meital avital <mavital>
Severity: | high | Docs Contact: |
Priority: | unspecified | CC: | ahadas, bugs, bzlotnik, eshenitz, jean-louis, lrotenbe, tnisan
Version: | 4.4.4.5 | Target Milestone: | ---
Hardware: | x86_64 | Target Release: | ---
OS: | Unspecified | Whiteboard: |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-03-14 08:03:20 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Jean-Louis Dupond
2021-01-14 09:26:48 UTC
The documentation text flag should only be set after the 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Arik, this looks like the FS freeze bug you guys fixed, can you please share the configuration needed in order to prevent this issue?

So we're talking about a live-snapshot that doesn't include memory, right? It also seems that VDSM initiated the freeze, so I suspect the config value LiveSnapshotPerformFreezeInEngine is set to 'false'. But even if LiveSnapshotPerformFreezeInEngine is set to 'true', the freeze would fail, as it appears the guest agent is inaccessible. Is there qemu-guest-agent running in the guest?

^ That would explain the failure though, just why the FS wasn't frozen

That *wouldn't* explain the failure though, just why the FS wasn't frozen

This is indeed a live-snapshot without memory (it's used in our backup flow at the moment).

# engine-config -g LiveSnapshotPerformFreezeInEngine
LiveSnapshotPerformFreezeInEngine: false version: general

(which is the default — are there advantages to enabling this?)

The VM is already removed... but there is a very good chance it did not have an agent installed (because some of the OSes are quite old :(). That the freeze failed isn't such an issue here though :)

Arik, anything else that we could do here?

I think we need to check the error handling of create-snapshot, since it's not the first time we're told that when it fails, bad things happen on the storage.

Arik, from the logs it seems that we should recognize and handle the case of failure in creating the snapshot on the VDSM side. Please have a look.

(In reply to Eyal Shenitzky from comment #9)
> Arik, from the logs it seems that we should recognize and handle the case of
> failure in creating the snapshot on the VDSM side.
> Please have a look.

It makes a lot of sense to improve the error handling on the VDSM side, and if you've investigated the logs, could you please share your findings to save us time and help us determine whether it's virt or storage?

From the logs in the description, it seems like the operation failed in the live-snapshot creation process (snapshot.py). So it is probably around the creation of it or in the recovery mechanism. Do you have other thoughts on it?

I think there is an issue in the error handling in vdsm, since we see this log [1]:

2021-01-13 22:12:26,832+0100 INFO (virt/cf577e3d) [virt.vm] (vmId='fc00512f-7d52-42d9-81b5-4c3fbc2eb0a3') Snapshot timeout reached, operation aborted (snapshot:80)

But the operation started at:

2021-01-13 22:12:26,003+0100 INFO (virt/cf577e3d) [root] Running job 'cf577e3d-c6e5-4b16-9ce3-34532b8ad110'... (jobs:185)

Not even a second earlier, while the timeout is 30 minutes. So this abort is treated like it was initiated by vdsm, but it is actually done by libvirt. What is unclear to me is how this snapshot ended up being usable by the VM given how fast this failed.

Regarding the stale volume removal: it is not an actual lvremove, it just removes an LV uuid from vdsm's internal cache after seeing it's not present in the latest reload (invocation of the `lvs` command). The LV was likely removed before it was used by the VM, because LVM would have prevented the removal of an in-use LV. Full logs (including engine) would help better understand the chain of events.

[1] https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L464
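For readers unfamiliar with the cache behaviour described in the previous comment, here is a minimal sketch of the idea: the LV list is reloaded from the `lvs` command, and any cached LV that is no longer reported is dropped from the in-memory cache only, with no lvremove involved. This is not vdsm's actual implementation; the function names and output handling are simplified assumptions for illustration.

```python
# Simplified sketch of "stale LV removal": prune an in-memory cache based on
# the latest `lvs` reload. No LV is removed from storage here.
import subprocess


def reload_lvs(vg_name):
    """Return {lv_name: lv_uuid} as currently reported by LVM."""
    out = subprocess.run(
        ["lvs", "--noheadings", "--separator", "|",
         "-o", "lv_name,lv_uuid", vg_name],
        capture_output=True, text=True, check=True,
    ).stdout
    lvs = {}
    for line in out.splitlines():
        if not line.strip():
            continue
        name, uuid = [field.strip() for field in line.split("|", 1)]
        lvs[name] = uuid
    return lvs


def prune_stale(cache, vg_name):
    """Drop cached LVs that the latest reload no longer reports."""
    current = reload_lvs(vg_name)
    for name in list(cache):
        if name not in current:
            # No lvremove happens here -- the entry simply disappears from
            # the cache because LVM no longer reports the LV.
            del cache[name]
    cache.update(current)
    return cache
```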
(In reply to Benny Zlotnik from comment #12)
> So this abort is treated like it was initiated by vdsm, but it is actually
> done by libvirt. What is unclear to me is how this snapshot ended up being
> usable by the VM given how fast this failed.

Most likely something in libvirt failed because we tried incremental backup stuff on the VM. After restarting libvirt, the VM was restarted and the issue was resolved (so no libvirt debug logs).

> Regarding the stale volume removal: it is not an actual lvremove, it just
> removes an LV uuid from vdsm's internal cache after seeing it's not present
> in the latest reload (invocation of the `lvs` command). The LV was likely
> removed before it was used by the VM, because LVM would have prevented the
> removal of an in-use LV.

Is that the case? Because isn't the 'lvremove' executed on the SPM? And if the VM is running on another node, it will not know it's in use?

The logs of the delete:

2021-01-13T22:12:17.362+01:00 xxxx [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3', ignoreFailoverLimit='false', storageDomainId='cc29e364-6bf2-4a52-8213-3c83649fc067', imageGroupId='e811105b-987a-4511-9378-ecc167413466', imageSizeInBytes='26843545600', volumeFormat='COW', newImageId='efb6948e-4e27-4224-8715-f555cbf3c01a', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='cb6cee06-1b8a-4e34-88a8-c96257a9a2f3', sourceImageGroupId='e811105b-987a-4511-9378-ecc167413466', shouldAddBitmaps='false'}), log id: 37f7989a

2021-01-13T22:12:17.412+01:00 xxxx [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] FINISH, CreateVolumeVDSCommand, return: efb6948e-4e27-4224-8715-f555cbf3c01a, log id: 37f7989a

2021-01-13T22:12:31.302+01:00 xxxx [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3', ignoreFailoverLimit='false', storageDomainId='cc29e364-6bf2-4a52-8213-3c83649fc067', imageGroupId='e811105b-987a-4511-9378-ecc167413466', imageId='00000000-0000-0000-0000-000000000000', imageList='[efb6948e-4e27-4224-8715-f555cbf3c01a]', postZero='false', force='false'}), log id: 6f053449

> Full logs (including engine) would help better understand the chain of
> events.

I can send you the logs if wanted. But I'd prefer via email then, as they might contain data that should not be public :)

> [1]
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L464

(In reply to Benny Zlotnik from comment #12)
> I think there is an issue in the error handling in vdsm, since we see this
> log [1]:
> 2021-01-13 22:12:26,832+0100 INFO (virt/cf577e3d) [virt.vm]
> (vmId='fc00512f-7d52-42d9-81b5-4c3fbc2eb0a3') Snapshot timeout reached,
> operation aborted (snapshot:80)
>
> But the operation started at:
> 2021-01-13 22:12:26,003+0100 INFO (virt/cf577e3d) [root] Running job
> 'cf577e3d-c6e5-4b16-9ce3-34532b8ad110'... (jobs:185)
> Not even a second earlier, while the timeout is 30 minutes.
>
> So this abort is treated like it was initiated by vdsm, but it is actually
> done by libvirt. What is unclear to me is how this snapshot ended up being
> usable by the VM given how fast this failed.
>
> [1]
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L464

Yes, the VDSM logs might be confusing: in line 464, as you indicated, we say it's a timeout. This is because in [2] we are calling libvirt to abort the job. But in this case it seems we got the abort from libvirt itself.

I opened BZ1933669.

[2] https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L640
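To make the two abort paths discussed in the last few comments concrete, here is a simplified, hypothetical sketch; it is not vdsm's actual snapshot.py code. A timer arms a caller-initiated abortJob() after a timeout, while libvirt may also abort the snapshot operation on its own, and both paths surface as the same VIR_ERR_OPERATION_ABORTED error unless extra state is kept.

```python
# Hypothetical illustration of why a libvirt-initiated abort can be mistaken
# for a timeout-initiated one. The timeout value, function names and error
# messages are placeholders.
import threading
import libvirt

TIMEOUT = 30 * 60  # seconds


def create_disk_snapshot(dom, snapshot_xml):
    timed_out = threading.Event()

    def _abort():
        # Caller-initiated abort, armed only when the timeout expires.
        timed_out.set()
        dom.abortJob()

    timer = threading.Timer(TIMEOUT, _abort)
    timer.start()
    try:
        dom.snapshotCreateXML(
            snapshot_xml,
            libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_OPERATION_ABORTED:
            # Without the timed_out flag the two cases are indistinguishable:
            # the same error is raised whether we aborted the job ourselves
            # or libvirt aborted it for its own reasons.
            if timed_out.is_set():
                raise RuntimeError("snapshot timed out and was aborted")
            raise RuntimeError("snapshot aborted by libvirt")
        raise
    finally:
        timer.cancel()
```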
(In reply to Jean-Louis Dupond from comment #13)
> Is that the case? Because isn't the 'lvremove' executed on the SPM? And if
> the VM is running on another node, it will not know it's in use?

You're right, for some reason I assumed there was just one host :)
Regardless, I do think the image was not reported at the time of the removal, as we have a check in the engine to see whether an image is used by a VM.

> The logs of the delete:
> [CreateVolumeVDSCommand / DestroyImageVDSCommand log lines quoted above]
>
> > Full logs (including engine) would help better understand the chain of
> > events.
>
> I can send you the logs if wanted. But I'd prefer via email then, as they
> might contain data that should not be public :)

You can send it in the email.

> [1]
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L464
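The guard mentioned in the previous comment (checking whether an image is used by a VM before removing it) reduces to the rule at the heart of this bug: an orphaned volume is recoverable, an in-use volume that gets removed is not. The following is only a conceptual sketch in Python; the real check lives in the Java engine code and looks different, and all names here are hypothetical.

```python
# Conceptual sketch: only destroy a newly created volume during rollback if
# no running VM reports it in its active volume chain.
def safe_to_destroy(volume_id, vm_volume_chains):
    """vm_volume_chains: hypothetical mapping of VM id -> list of volume ids
    reported by the host running that VM."""
    return all(volume_id not in chain for chain in vm_volume_chains.values())


def rollback_snapshot(volume_id, vm_volume_chains, destroy_image):
    if safe_to_destroy(volume_id, vm_volume_chains):
        # destroy_image stands in for the engine-side image removal call.
        destroy_image(volume_id)
    # Otherwise leave the volume in place: cleaning up a leftover volume
    # later is safe, removing an in-use volume corrupts the disk.
```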
(In reply to Liran Rotenberg from comment #14)
> Yes, the VDSM logs might be confusing: in line 464, as you indicated, we say
> it's a timeout. This is because in [2] we are calling libvirt to abort the
> job. But in this case it seems we got the abort from libvirt itself.
>
> I opened BZ1933669.
>
> [2]
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L640

It's more than the logging; in this case it seems that we can't actually trust libvirt's abort, as the image ended up being used by the VM. So it seems we may need a distinct error to be sent to the engine to indicate that it should not try to roll back.

(In reply to Benny Zlotnik from comment #16)
> It's more than the logging; in this case it seems that we can't actually
> trust libvirt's abort, as the image ended up being used by the VM. So it
> seems we may need a distinct error to be sent to the engine to indicate that
> it should not try to roll back.

Yes, I see your point. But we have never encountered that: when we get to the part of updating drives in VDSM, we know rollback is too late (we didn't reach that point here). libvirt reports the domain XML once it has changed, and in that case we should get the new chain, including the new volume in use. We also have the check to prevent this rollback in case it's there (your fix). Unless there is some new behavior in libvirt, it's surprising to get a new volume in use that doesn't appear in the chain, making us delete it. The question is how it happened and why. I don't think it should happen from libvirt's point of view, and in that case returning a special error to the engine in order to prevent the volume removal (always, in such a case?) is wrong.

After a discussion with Benny, we might try to prevent rolling back in such a situation. This means the VM will have the new volume and use it, while it won't exist in the domain XML volume chain. It should first be tested to make sure there is no big regression because of it.

(In reply to Liran Rotenberg from comment #18)
> It should first be tested to make sure there is no big regression because of
> it.

It seems we already choose not to remove a volume when we are not able to retrieve the volume chain [1], which is similar in a way.

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/CreateSnapshotCommand.java#L236-L239
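As a practical illustration of the "new chain" discussed above: the active volume chain of a running VM can be read from the domain XML that libvirt reports, where each disk carries a <source> element and nested <backingStore> elements. This is a hedged, standalone sketch, not oVirt/vdsm code; the connection URI and domain name are placeholders.

```python
# Read each disk's active chain (top volume plus backing volumes) from the
# domain XML reported by libvirt.
import xml.etree.ElementTree as ET
import libvirt


def active_chain(dom):
    """Return {target_dev: [source paths, from the top volume downwards]}."""
    root = ET.fromstring(dom.XMLDesc(0))
    chains = {}
    for disk in root.findall("./devices/disk"):
        target = disk.find("target").get("dev")
        chain = []
        node = disk
        while node is not None:
            src = node.find("source")
            if src is not None:
                # block-based oVirt volumes use 'dev', file-based use 'file'
                chain.append(src.get("dev") or src.get("file"))
            node = node.find("backingStore")
        chains[target] = chain
    return chains


if __name__ == "__main__":
    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("example-vm")  # placeholder domain name
    for dev, chain in active_chain(dom).items():
        print(dev, "->", chain)
```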
Hi, we tried to figure out what could be wrong, but these are just speculations; without proper logs it will be hard to tell what went wrong.

A volume in use shouldn't be removed, of course. I tried to see whether the engine got the volume in the chain like you pointed out in comment #0, but didn't find anything. Also, if it had been in the chain, the code referenced in comment #19 would have saved us from removing it.

Given that, if you can provide full logs (hopefully including libvirt), we can look into it further. If that is not possible, I recommend closing this bug until someone is able to provide them.

From an offline discussion with Jean-Louis Dupond:

Thanks for having an in-depth look. I don't have more to share at this moment, so I guess we'll have to close the bug then. If it occurs again I'll try to gather more info (and directly archive the engine and vdsm logs).

Therefore, I'm closing the bug. If it comes up again with more info, please re-open this bug or file a new one.