Bug 2122525 - Live merge sometimes fails, "No space left on device" appears in the log
Summary: Live merge sometimes fails, "No space left on device" appears in the log
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.5.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Mark Kemel
QA Contact: Shir Fishbain
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-30 08:37 UTC by Giulio Casella
Modified: 2022-11-16 08:10 UTC
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-30 22:15:13 UTC
oVirt Team: Storage
Embargoed:


Attachments (Terms of Use)
engine.log (73.04 KB, application/x-bzip)
2022-08-30 08:37 UTC, Giulio Casella
no flags Details
vdsm log (from host running VM) (880.65 KB, application/x-bzip)
2022-08-30 08:38 UTC, Giulio Casella
no flags Details
libvirt/qemu log for hypervisor (9.22 KB, application/x-bzip)
2022-08-31 08:01 UTC, Giulio Casella
no flags Details
journalctl output (40.20 KB, application/x-bzip)
2022-08-31 08:14 UTC, Giulio Casella
no flags Details
libvirt/qemu log for hypervisor (86.77 KB, application/x-bzip)
2022-09-01 16:04 UTC, Giulio Casella
no flags Details
libvirt log (4.03 MB, application/x-bzip)
2022-09-05 12:58 UTC, Giulio Casella
no flags Details
engine log (2.27 MB, application/gzip)
2022-09-06 07:58 UTC, Giulio Casella
no flags Details
vdsm log (1.20 MB, application/x-bzip)
2022-09-06 07:59 UTC, Giulio Casella
no flags Details
vdsm.log of vm (c6b8d703-8644-46b3-a64f-26a2ef81abf6) with snapshot issue (18.86 MB, application/x-bzip)
2022-09-16 11:58 UTC, sjansen
no flags Details
libvirt.log of vm (c6b8d703-8644-46b3-a64f-26a2ef81abf6) with snapshot issue (30.93 KB, application/x-bzip)
2022-09-16 11:59 UTC, sjansen
no flags Details
engine.log of vm (c6b8d703-8644-46b3-a64f-26a2ef81abf6) with snapshot issue (778.26 KB, application/x-bzip)
2022-09-16 11:59 UTC, sjansen
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github oVirt vdsm issues 352 0 None open Live merge sometimes fails, "No space left on device" appear in the log 2022-11-16 08:08:34 UTC
Red Hat Issue Tracker RHV-47864 0 None None None 2022-08-30 08:39:20 UTC

Description Giulio Casella 2022-08-30 08:37:18 UTC
Created attachment 1908453 [details]
engine.log

Description of problem: sometimes snapshot removal fails, leaving the VM with disk(s) in an illegal state; subsequent attempts to remove the snapshot fail again


Version-Release number of selected component (if applicable): ovirt-engine-4.5.2.4-1.el8.noarch, oVirt Node 4.5.2 (vdsm-4.50.2.2-1.el8)


How reproducible: not always; the issue happens now and then when the backup system (Storware vProtect, interacting with oVirt via the API) tries to delete snapshots


Actual results: snapshot removal fails, and subsequent attempts fail again


Expected results: snapshot is correctly removed


Additional info: the issue doesn't happen every time; the affected VMs are almost always the same (5 or 6 out of about 40), with no apparent pattern (Linux or Windows guests, big or small disks, all hypervisors, ...)

Comment 1 Giulio Casella 2022-08-30 08:38:16 UTC
Created attachment 1908454 [details]
vdsm log (from host running VM)

Comment 2 Giulio Casella 2022-08-30 08:41:45 UTC
To help you dive into the logs: the VM with illegal disks is named "minerva-ad". In the logs you can find the originating event (vProtect removing a snapshot, leaving the disks in an illegal state) and a subsequent removal attempt performed via the oVirt UI

Comment 3 Giulio Casella 2022-08-30 08:48:27 UTC
Note: this issue is not solved by the patch in https://bugzilla.redhat.com/show_bug.cgi?id=1948599

Comment 4 RHEL Program Management 2022-08-30 10:25:49 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 5 Benny Zlotnik 2022-08-30 15:18:03 UTC
It seems the failure is coming from libvirt:
2022-08-30 08:18:09,125+0000 ERROR (libvirt/events) [virt.vm] (vmId='53adff44-8506-41e7-86d1-5a6ca760721e') Block job e66bb7a4-b7a3-4f6c-825b-5a6c1935f5ec type COMMIT for drive sda has failed (vm:5972)

do you have libvirtd/qemu logs?
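
For reference, one way to inspect the block job state directly on the affected host would be the sketch below. It assumes virsh access on the hypervisor; the VM name "minerva-ad" and the sda target are taken from this report, and the output will only show a job while one is still active.

$ virsh domblklist minerva-ad           # list the VM's disk targets (e.g. sda)
$ virsh blockjob minerva-ad sda --info  # show the state of any block job (such as the commit) on sda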

Comment 6 Giulio Casella 2022-08-31 08:01:53 UTC
Created attachment 1908628 [details]
libvirt/qemu log for hypervisor

This is the libvirt/qemu log file from the hypervisor running the involved VM, but there's nothing relevant in it; the last entry is a dozen days old

Comment 7 Benny Zlotnik 2022-08-31 08:06:13 UTC
(In reply to Giulio Casella from comment #6)
> Created attachment 1908628 [details]
> libvirt/qemu log for hypervisor
> 
> This is the libvirt/qemu log file from the hypervisor running the involved
> VM, but there's nothing relevant in it; the last entry is a dozen days old

What about 
$ journalctl -u libvirtd

Comment 8 Giulio Casella 2022-08-31 08:14:19 UTC
Created attachment 1908629 [details]
journalctl output

journalctl -u libvirtd | grep "Aug 30"

I can't see anything relevant in it.

Comment 9 Arik 2022-08-31 09:30:25 UTC
I would expect the error to appear in the libvirt logs regardless of the log level, but could you please try to make the logging more verbose and then reproduce the issue? See:
https://libvirt.org/kbase/debuglogs.html
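
For reference, a minimal sketch of the kind of verbose-logging setup that page describes (the exact filter list varies by libvirt version; the log_filters value below is the one used later in this report, and the log_outputs path is an assumption based on the libvirt defaults):

# /etc/libvirt/libvirtd.conf - example debug settings
log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

$ systemctl restart libvirtd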

Comment 10 Giulio Casella 2022-09-01 16:04:00 UTC
Created attachment 1909002 [details]
libvirt/qemu log for hypervisor

I raised the log level on all hypervisors
(e.g. added:
log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
to the file /etc/libvirt/libvirtd.conf; is that right?)

This is the file /var/log/libvirt/qemu/mailserver.log ("mailserver" is the VM that now has an illegal, undeletable snapshot). I can't see anything relevant in it.

Comment 11 Casper (RHV QE bot) 2022-09-01 16:33:24 UTC
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align the relevant severity, flags and keywords to raise PM_Score, or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).

Comment 12 Arik 2022-09-04 08:44:43 UTC
Thanks Giulio, I also don't see anything interesting in that log - did you raise the log level and then reproduce the failure to remove the snapshot? We need more information on the issue or on how to reproduce it. And while I realize this issue is important for you, please don't change the severity of this bug - failure to remove a snapshot does not meet the criteria for an urgent issue

Comment 13 Casper (RHV QE bot) 2022-09-04 09:03:34 UTC
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align the relevant severity, flags and keywords to raise PM_Score, or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).

Comment 14 Giulio Casella 2022-09-05 07:33:48 UTC
Hi Arik,
yes, I did:

- added log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
to the file /etc/libvirt/libvirtd.conf on every hypervisor
- ran systemctl restart libvirtd on every hypervisor
- reproduced the error
- attached the libvirt/qemu log

I suspect it's hard to reproduce; it doesn't happen on all VMs, and not at every snapshot removal.

We can arrange a Zoom call to reproduce it in my datacenter if you need.

Comment 15 sjansen 2022-09-05 12:07:38 UTC
Hello,

We are in the same boat as Giulio: vProtect customers trying to back up oVirt machines. The fix in 1948599 didn't work for us.

We have around 50 VMs in our two-node oVirt cluster, stored on a shared IBM SAN. Every now and then a snapshot removal fails and leaves a VM with disks in an invalid state; for us it's around 2-3 VMs per day, random VMs, random sizes, no visible pattern. Recovering from this hurts a lot, as we have to clone the disks into a new VM, which leads to a completely new backup chain for the "new" VM.

Today I set our debug level as Giulio did; the next time a VM goes into an invalid state I will check the logs for visible issues and report any findings here.

Best regards,

Sven

Comment 16 Mark Kemel 2022-09-05 12:50:42 UTC
Hi Sven, thanks for the report. 

@Giulio, the logs that you have attached cover the manual snapshot removal attempt. Can you please provide logs that cover the initial removal attempt made by vProtect? Maybe that will shed some more light on this issue.
@sjansen Sven, your logs might also help.

Thanks
Mark

Comment 17 Giulio Casella 2022-09-05 12:58:06 UTC
Created attachment 1909631 [details]
libvirt log

Find attached the libvirtd log (VERY verbose, debug level). The VM whose snapshot removal (performed by vProtect) failed is named "mailserver". Hope this helps.
I can see a strange error (reported inside a JSON object): "No space left on device". There is plenty of free space out there, both on the hypervisor and in the guest VM.

Comment 18 sjansen 2022-09-05 14:55:39 UTC
I just grepped for this error and I can find it on several VMs, all of which are or were affected by the snapshot issue. The full error is "qcow2_free_clusters failed: No space left on device".
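
For reference, a quick way to list which VMs on a host logged this error (assuming the default per-VM qemu log location) might be:

$ grep -l "qcow2_free_clusters failed: No space left on device" /var/log/libvirt/qemu/*.log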

Comment 19 Arik 2022-09-05 15:49:37 UTC
Sven, can you please upload the qemu logs with that "qcow2_free_clusters failed: No space left on device" error?

Comment 20 Benny Zlotnik 2022-09-06 07:29:13 UTC
(In reply to Giulio Casella from comment #17)
> Created attachment 1909631 [details]
> libvirt log
> 
> Find attached the libvirtd log (VERY verbose, debug level). The VM whose
> snapshot removal (performed by vProtect) failed is named "mailserver". Hope
> this helps.
> I can see a strange error (reported inside a JSON object): "No space left on
> device". There is plenty of free space out there, both on the hypervisor and
> in the guest VM.

Can you add the engine and vdsm logs from the same time (VM and SPM) so we can correlate them?
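
For reference, one way to line the logs up (assuming the default log locations, and assuming the engine correlation id appears as flow_id in vdsm.log; "<correlation-id>" below is a placeholder) is to grep all the logs for the same id around the failure time:

# on the engine host
$ grep "<correlation-id>" /var/log/ovirt-engine/engine.log
# on the host running the VM and on the SPM host
$ grep "<correlation-id>" /var/log/vdsm/vdsm.log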

Comment 21 Giulio Casella 2022-09-06 07:58:39 UTC
Created attachment 1909736 [details]
engine log

Here is the engine log

Comment 22 Giulio Casella 2022-09-06 07:59:19 UTC
Created attachment 1909737 [details]
vdsm log

and this is the vdsm log

Comment 23 Giulio Casella 2022-09-06 08:01:03 UTC
I'm not 100% sure, but I believe that yesterday the SPM host was the same one running the failed VM

Comment 24 sjansen 2022-09-16 11:58:37 UTC
Created attachment 1912314 [details]
vdsm.log of vm (c6b8d703-8644-46b3-a64f-26a2ef81abf6) with snapshot issue

Comment 25 sjansen 2022-09-16 11:59:22 UTC
Created attachment 1912315 [details]
libvirt.log of vm (c6b8d703-8644-46b3-a64f-26a2ef81abf6) with snapshot issue

Comment 26 sjansen 2022-09-16 11:59:54 UTC
Created attachment 1912316 [details]
engine.log of vm (c6b8d703-8644-46b3-a64f-26a2ef81abf6) with snapshot issue

Comment 27 sjansen 2022-09-16 12:10:00 UTC
Sorry for the late response, I was away for a few days. I had no snapshot issues for a few days, but yesterday a new VM disk went into an illegal state after vProtect tried to remove a snapshot from this VM.

The logs are awful to read, so I merged all the vdsm files in the affected timespan. This time I can't find anything in the libvirt.log of the affected VM, just start/stop events, even though debug is enabled. All my logs are prefixed with sj_

The affected VM is: c6b8d703-8644-46b3-a64f-26a2ef81abf6
The disk went into an illegal state when oVirt tried to remove this snapshot: 07ad3dd6-053c-4e62-a18d-57f5367cabae
Snapshot created on 14.09. at 19:49
Unsuccessful snapshot removal on 15.09. between 23:20 - 23:35

I hope these logs give you some insight.

Comment 28 Arik 2022-10-30 22:15:13 UTC
We see that remove-snapshot fails on live merge: the live merge operation gets triggered, but the volume remains in the volume chain that is reported back to ovirt-engine, and we've noticed "No space left on device" errors on the affected VMs. oVirt seems to react properly to these failures (by marking the snapshots as illegal).

The bug was filed on ovirt-engine, but ovirt-engine can't do much. I'm afraid VDSM can't do much either.
I'd suggest filing an issue on qemu to see if it's something that was addressed in CentOS Stream 9 or whether something can be configured to investigate this further.
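
For reference, a minimal sketch of how the on-disk qcow2 chain could be double-checked on the host after such a failure (the image path below is a placeholder to be taken from the VM's domain XML or the vdsm volume info; -U avoids the image lock while the VM is running):

$ qemu-img info -U --backing-chain /path/to/active/volume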

Comment 29 Giulio Casella 2022-11-02 09:00:58 UTC
Arik, as I said in Comment 17, there is plenty of free space out there, both on the hypervisor and in the guest VM (and in the storage domain too), so I can't figure out where that message comes from.

Comment 30 Arik 2022-11-02 09:19:24 UTC
(In reply to Giulio Casella from comment #29)
> Arik, as I said in Comment 17, there is plenty of free space out there, both
> on the hypervisor and in the guest VM (and in the storage domain too), so I
> can't figure out where that message comes from.

Yeah, I've noticed that - that's why I suggest reaching out to the qemu devs so they can help figure out the root cause

Comment 31 Arik 2022-11-16 08:08:35 UTC
We've managed to make some progress on that - we suspect it might be a regression related to https://github.com/oVirt/vdsm/pull/189. We are moving this bug to GitHub (https://github.com/oVirt/vdsm/issues/352), as we now do for oVirt bugs, and bz 2141371 will track the (possible) fix for RHV

