Created attachment 1908453 [details]
engine.log

Description of problem:
Sometimes snapshot removal fails, leaving the VM with disk(s) in an illegal state; subsequent attempts to remove the snapshot fail again.

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.2.4-1.el8.noarch, oVirt Node 4.5.2 (vdsm-4.50.2.2-1.el8)

How reproducible:
Not always; the issue happens now and then when the backup system (Storware vProtect, interacting with oVirt via API) tries to delete snapshots.

Actual results:
Snapshot removal fails, and subsequent attempts fail again.

Expected results:
The snapshot is removed correctly.

Additional info:
The issue doesn't happen every time. The affected VMs are almost always the same (5 or 6 out of about 40), with no apparent pattern (Linux or Windows guests, big or small disks, all hypervisors, ...).
Created attachment 1908454 [details] vdsm log (from host running VM)
To help you dig into the logs: the VM with illegal disks is named "minerva-ad". In the logs you can find the originating event (vProtect removing a snapshot, leaving disks in an illegal state) and a subsequent removal attempt, performed via the oVirt UI.
Note: this issue is not solved by the patch in https://bugzilla.redhat.com/show_bug.cgi?id=1948599
The documentation text flag should only be set after the 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.
It seems the failure is coming from libvirt:

2022-08-30 08:18:09,125+0000 ERROR (libvirt/events) [virt.vm] (vmId='53adff44-8506-41e7-86d1-5a6ca760721e') Block job e66bb7a4-b7a3-4f6c-825b-5a6c1935f5ec type COMMIT for drive sda has failed (vm:5972)

Do you have libvirtd/qemu logs?
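A quick way to pull such block-job failure lines for a given VM out of a vdsm log is a two-stage grep. This is just a sketch: the sample file below stands in for the real /var/log/vdsm/vdsm.log on the hypervisor, and the single log line in it is the one quoted above.

```shell
# Filter block-job failure events for one VM out of a vdsm log.
# /tmp/vdsm_sample.log stands in for /var/log/vdsm/vdsm.log.
vmid='53adff44-8506-41e7-86d1-5a6ca760721e'
cat > /tmp/vdsm_sample.log <<'EOF'
2022-08-30 08:18:09,125+0000 ERROR (libvirt/events) [virt.vm] (vmId='53adff44-8506-41e7-86d1-5a6ca760721e') Block job e66bb7a4-b7a3-4f6c-825b-5a6c1935f5ec type COMMIT for drive sda has failed (vm:5972)
EOF
# First grep narrows to the VM, second to failed block jobs.
grep "vmId='$vmid'" /tmp/vdsm_sample.log | grep 'has failed'
```

On a real host, point the grep at /var/log/vdsm/vdsm.log* (the rotated files too) to catch older occurrences.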
Created attachment 1908628 [details]
libvirt/qemu log for hypervisor

This is the libvirt/qemu log file from the hypervisor running the involved VM, but there's nothing relevant in it; the last entry is a dozen days old.
(In reply to Giulio Casella from comment #6)
> Created attachment 1908628 [details]
> libvirt/qemu log for hypervisor
>
> This is libvirt/qemu log file from hypervisor running involved VM, but
> there's nothing relevant in it, last entry is a dozen days old

What about:

$ journalctl -u libvirtd
Created attachment 1908629 [details]
journalctl output

journalctl -u libvirtd | grep "Aug 30"

I can't see anything relevant in it.
I would expect the error to appear in the libvirt logs regardless of the log level, but could you please try making them more verbose and then reproduce? See: https://libvirt.org/kbase/debuglogs.html
Created attachment 1909002 [details]
libvirt/qemu log for hypervisor

I raised the log level on all hypervisors (i.e. added log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util" to /etc/libvirt/libvirtd.conf; is that right?).

This is the file /var/log/libvirt/qemu/mailserver.log ("mailserver" is the VM that now has an illegal, undeletable snapshot). I can't see anything relevant in it.
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align relevant severity, flags and keywords to raise PM_Score, or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).
Thanks Giulio, I also don't see anything interesting in that log. Did you raise the log level and then reproduce the failure to remove the snapshot? We need more information on the issue, or on how to reproduce it.

And while I realize this issue is important for you, please don't change the severity of this bug; failure to remove a snapshot does not meet the criteria of an urgent issue.
Hi Arik, yes, I did:
- added log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util" to /etc/libvirt/libvirtd.conf on every hypervisor
- ran systemctl restart libvirtd on every hypervisor
- reproduced the error
- sent (attached) the libvirt/qemu log

I suspect it's hard to reproduce: it doesn't happen to all VMs, and not at every snapshot removal. We can arrange a Zoom call to reproduce it in my datacenter if you need.
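For anyone else wanting to reproduce this setup, the steps above amount to roughly the following. This is a sketch, not an official procedure: it assumes a stock libvirt install, must run as root on each hypervisor, and the log_outputs line is an addition taken from the libvirt debug-logs page linked earlier (the thread itself only mentions log_filters).

```shell
# Enable the debug filters used in this thread, then restart libvirtd.
# Not idempotent: rerunning appends duplicate lines, so check the file first.
conf=/etc/libvirt/libvirtd.conf
echo 'log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"' >> "$conf"
echo 'log_outputs="1:file:/var/log/libvirt/libvirtd.log"' >> "$conf"
systemctl restart libvirtd
```

Remember to revert these settings afterwards: at this verbosity libvirtd.log grows very quickly.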
Hello,

We are in the same boat as Giulio: a vProtect customer trying to back up oVirt machines. The fix in bug 1948599 didn't work for us.

We have around ~50 VMs in our two-node oVirt cluster, stored on a shared IBM SAN. Every now and then a snapshot removal fails, leaving a VM with disks in an invalid state; for us it's around 2-3 VMs per day, random VMs, random sizes, no visible pattern. Recovering from this hurts a lot, as we have to clone the disks into a new VM, which leads to a completely new backup chain for the "new" VM.

Today I set our debug level as Giulio did; next time a VM goes into an invalid state I'll check the logs for visible issues and report any findings here.

Best regards,
Sven
Hi Sven, thanks for the report.

@Giulio, the logs that you have attached cover a manual snapshot removal attempt. Can you please provide logs that cover the initial removal attempt made by vProtect? Maybe it will shed some more light on this issue.

@sjansen Sven, your logs might also help.

Thanks,
Mark
Created attachment 1909631 [details]
libvirt log

Find attached the libvirtd log (VERY verbose, level debug). The VM with the failed snapshot removal (performed by vProtect) is named "mailserver". Hope this helps.

I can see a strange error (reported inside a JSON object): "No space left on device". There's plenty of free space out there, both on the hypervisor and in the guest VM.
I just grepped for this error and I can find it on several VMs; all of them are/were affected by the snapshot issue. The full error is "qcow2_free_clusters failed: No space left on device".
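For anyone wanting to run the same check on their own hosts, the grep is straightforward. A sketch follows; a temporary directory with sample files stands in for the real /var/log/libvirt/qemu/ so the commands can be demonstrated standalone, and the file names are hypothetical.

```shell
# List per-VM qemu logs that contain the qcow2 ENOSPC error.
# On a real host, replace $logdir with /var/log/libvirt/qemu.
logdir=$(mktemp -d)
printf '%s\n' 'qcow2_free_clusters failed: No space left on device' \
    > "$logdir/mailserver.log"
printf '%s\n' 'clean shutdown' > "$logdir/other-vm.log"
# -l prints only the names of matching files, one per affected VM.
grep -l 'qcow2_free_clusters failed' "$logdir"/*.log
```

The list of matching files should line up with the VMs whose disks went illegal, which is exactly the correlation described above.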
Sven, can you please upload the qemu logs with that "qcow2_free_clusters failed: No space left on device" error?
(In reply to Giulio Casella from comment #17)
> Created attachment 1909631 [details]
> libvirt log
>
> Find attached libvirtd log (VERY verbose, level debug). The VM that has
> failed snapshot removal (performed by vProtect) is named "mailserver". Hope
> this helps.
> I can see a strange error (reported inside a json object) "No space left on
> device". It's plenty of free space out there, either in hypervisor and guest
> VM.

Can you add the engine and vdsm logs from the same time (VM and SPM) so we can correlate them?
Created attachment 1909736 [details]
engine log

Here is the engine log.
Created attachment 1909737 [details]
vdsm log

and this is the vdsm log.
I'm not 100% sure, but I believe that yesterday the SPM host was the same host running the failed VM.
Created attachment 1912314 [details] vdsm.log of vm (c6b8d703-8644-46b3-a64f-26a2ef81abf6) with snapshot issue
Created attachment 1912315 [details] libvirt.log of vm (c6b8d703-8644-46b3-a64f-26a2ef81abf6) with snapshot issue
Created attachment 1912316 [details] engine.log of vm (c6b8d703-8644-46b3-a64f-26a2ef81abf6) with snapshot issue
Sorry for the late response, I was away for a few days.

I had no snapshot issues for a few days, but yesterday a new VM disk went into an illegal state after vProtect tried to remove a snapshot from that VM. The logs are awful to read, so I merged all the vdsm files in the affected timespan. This time I can't find anything in the libvirt.log of the affected VM, just start/stop events, even with debug enabled.

All my logs are prefixed with sj_

The affected VM is: c6b8d703-8644-46b3-a64f-26a2ef81abf6
Disk went into an illegal state when oVirt tried to remove this snapshot: 07ad3dd6-053c-4e62-a18d-57f5367cabae
Snapshot created on 14.09. at 19:49
Unsuccessful snapshot removal on 15.09. between 23:20 and 23:35

I hope these logs give you some insight.
We see that remove-snapshot fails on live merge: the live merge operation gets triggered, but the volume remains in the volume chain that is reported back to ovirt-engine, and we've noticed "No space left on device" errors on the affected VMs. oVirt seems to react properly to these failures (by marking the snapshots as illegal).

The bug was filed on ovirt-engine, but ovirt-engine can't do much here. I'm afraid VDSM can't do much either. I'd suggest filing an issue on qemu to see whether this was addressed in CentOS Stream 9, or whether something can be configured to investigate this further.
Arik, as I said in comment 17, there's plenty of free space out there, both on the hypervisor and in the guest VM (and in the storage domain too); I can't figure out where that message comes from.
(In reply to Giulio Casella from comment #29)
> Arik, as I said in Comment 17 "It's plenty of free space out there, either
> in hypervisor and guest VM" (and in storage domain too), I can't figure out
> where that message comes from.

Yeah, I've noticed that; that's why I suggest reaching out to the qemu devs so they can help figure out the root cause.
We've managed to make some progress on this: we suspect it might be a regression related to https://github.com/oVirt/vdsm/pull/189. Moving this bug to GitHub (https://github.com/oVirt/vdsm/issues/352), as we now do for oVirt bugs; bz 2141371 will track the (possible) fix for RHV.