Description of problem:

1. Storage failure (NFS).

2. Many VMs' monitors become unresponsive (are the VMs already paused?). For example, these two VMs (eb93 and c0c5):

Thread-1510499::WARNING::2016-10-31 22:01:01,354::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`eb93a97c-ac98-4993-b05b-60ffac02b23d`::monitor become unresponsive (command timeout, age=63.0099999998)
Thread-1510499::WARNING::2016-10-31 22:01:01,363::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`c0c5d75b-3913-44bd-9827-7d128f9f8e3c`::monitor become unresponsive (command timeout, age=63.0099999998)

3. The storage domain monitor fails and the SD becomes INVALID:

Thread-27::INFO::2016-10-31 22:01:13,562::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain deb3dbaa-a2d5-412e-b53a-9f2666a7ec12 became INVALID

It also emits a useless(?) contEIO that returns on the first "if" of the function. No problem.

Thread-1510530::DEBUG::2016-10-31 22:01:13,563::misc::804::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Calling registered method `contEIOVms`

4. Only ~10 minutes later does EIO show up in the vdsm logs. Here are both VMs:

libvirtEventLoop::INFO::2016-10-31 22:10:05,559::vm::3795::virt.vm::(onIOError) vmId=`c0c5d75b-3913-44bd-9827-7d128f9f8e3c`::abnormal vm stop device virtio-disk0 error eio
libvirtEventLoop::INFO::2016-10-31 22:10:05,559::vm::5169::virt.vm::(_logGuestCpuStatus) vmId=`c0c5d75b-3913-44bd-9827-7d128f9f8e3c`::CPU stopped: onIOError
libvirtEventLoop::INFO::2016-10-31 22:10:05,643::vm::3795::virt.vm::(onIOError) vmId=`eb93a97c-ac98-4993-b05b-60ffac02b23d`::abnormal vm stop device ide0-0-0 error eio
libvirtEventLoop::INFO::2016-10-31 22:10:05,643::vm::5169::virt.vm::(_logGuestCpuStatus) vmId=`eb93a97c-ac98-4993-b05b-60ffac02b23d`::CPU stopped: onIOError

5. 23 seconds later storage monitoring works again and the domain becomes VALID:

Thread-27::INFO::2016-10-31 22:10:28,591::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain deb3dbaa-a2d5-412e-b53a-9f2666a7ec12 became VALID
Thread-1510949::DEBUG::2016-10-31 22:10:28,592::misc::804::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Calling registered method `contEIOVms`

6. Only one of the VMs runs again:

Thread-1510950::INFO::2016-10-31 22:10:28,593::clientIF::180::vds::(contEIOVms) vmContainerLock acquired
Thread-1510950::INFO::2016-10-31 22:10:28,596::clientIF::187::vds::(contEIOVms) Cont vm eb93a97c-ac98-4993-b05b-60ffac02b23d in EIO

* There were 28 VMs; only 1 was resumed.
* ALL pause events were with EIO.
* I can't find any similar bug that wasn't already closed.
* Looking at the vdsm tree I can see some work done in this area, but nothing specific (to my eyes) to what I see here.
* The storage monitoring caught the INVALID and then the VALID, so it's not the corner case where the outage is too fast.
* Lots of monitors/periodic operations are being discarded; even the VM that "continues" still has its monitor not responding.
* The VMs had to be powered off.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20160711.0.el7ev)
libvirt-1.2.17-13.el7_2.5.x86_64
vdsm-4.17.33
rhevm-3.6.7.5-0.1.el6.noarch

Actual results:
VMs paused until Power Off.

Expected results:
VMs paused with EIO continue.

Additional info:
VDSM logs attached.
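For readers not familiar with the resume path: steps 3-6 go through a callback that vdsm registers on domain state changes (the `contEIOVms` seen in the logs). A minimal sketch of that flow follows; this is not vdsm's actual code, and all names (vm_container, pause_code, disk_domains, cont) are illustrative assumptions about the shape of the objects involved.

    # Hedged sketch (not vdsm's actual code) of the resume-on-VALID callback
    # that the logs show as clientIF.contEIOVms, registered on
    # Storage.DomainMonitor.onDomainStateChange.
    def cont_eio_vms(sd_uuid, is_domain_state_valid, vm_container, vm_container_lock):
        # The INVALID notification at 22:01:13 hits this early return; that is
        # the "useless(?) contEIO" mentioned in step 3.
        if not is_domain_state_valid:
            return
        with vm_container_lock:                    # "vmContainerLock acquired"
            for vm in vm_container.values():
                # Only VMs that vdsm already knows are paused with EIO on this
                # domain get resumed. A VM whose libvirt pause event arrived
                # ~10 minutes late (step 4), or whose monitor is stuck, may not
                # be in that state yet and is skipped.
                if vm.pause_code == 'EIO' and sd_uuid in vm.disk_domains:
                    vm.cont()                      # "Cont vm <id> in EIO"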
There is quite a difference between non-responsive and paused. In general, if a VM stops responding and doesn't come back, then indeed vdsm won't be able to resume it. There are bugs about that all the way up to 3.6.9(?) and 4.0.4/4.0.5. You would see that those domains are either working or paused but vdsm doesn't report them (you'd see * in the vdsClient list table output). Can you check whether you see that? A reproduction scenario would help indeed.
Hi Michal! Thanks for taking this one.

(In reply to Michal Skrivanek from comment #3)
> There is quite a difference between non-responsive and paused. In general,
> if a VM stops responding and doesn't come back, then indeed vdsm won't be
> able to resume it.

Yes, non-responsive is different from paused. Still, from the logs it looks like the monitor goes unresponsive when the VM pauses, and the pause event arrived in vdsm much later than the VM was actually paused. Weird.

Before I try to reproduce: are you saying that if the monitor goes unresponsive before the libvirt pause event comes up, then it is unlikely that the VM will be in the for loop of contEIO that resumes them?

> There are bugs about that all the way up to 3.6.9(?) and
> 4.0.4/4.0.5.

Right, would you like to close this one and work on those other bugs? I just can't find them :(

> You would see that those domains are either working or paused but vdsm
> doesn't report them (you'd see * in the vdsClient list table output).
> Can you check whether you see that? A reproduction scenario would help indeed.

Right, this makes sense to me. So contEIO doesn't resume those VMs because listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_PAUSED) doesn't return them in the first place?

Thanks!
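In case it helps to check that on an affected host, something along these lines would show which domains libvirt itself still reports as paused and why. The libvirt-python calls used (openReadOnly, listAllDomains, state, UUIDString, name) are real API; the script itself is only an illustrative sketch, not something vdsm runs.

    # Cross-check which domains libvirt reports as paused, and with what reason.
    import libvirt

    conn = libvirt.openReadOnly('qemu:///system')
    for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_PAUSED):
        state, reason = dom.state()
        # An EIO pause should show reason == VIR_DOMAIN_PAUSED_IOERROR
        print(dom.UUIDString(), dom.name(),
              'ioerror' if reason == libvirt.VIR_DOMAIN_PAUSED_IOERROR else reason)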
The most likely candidates are bug 1398418, bug 1391506, and bug 1364925.
So, once those bugs are resolved, I believe the current behavior is OK and we should not try to resume VMs paused due to EIO when the host is non-responsive.
Hi, so far I was not able to reproduce the bug; I'm not sure if there's some magic timing required or I'm just missing something.

My env is hosted engine with:
rhevm-3.6.7.5-0.1.el6.noarch
Host:
RHEL 7.3
vdsm-4.17.37-1.el7ev.noarch
libvirt-daemon-2.0.0-10.el7.x86_64

I am using NFS which I set up locally on my laptop, mounted with NFS version 4.0. I created 25 VMs with disks on my local NFS SD and started them all. To create the storage failure I just stopped the nfs-server service. I monitored vdsm.log and could see all the messages mentioned in points 2-4 of Germano's description. All the VMs in the engine were in status 'not responding'. As the NFS failure wasn't organic, I didn't see the storage monitor come back after 23 seconds; I had to start the server again, and it took a few more minutes until I got that message. After the storage monitor was up again, all VMs resumed and got to state 'up'.

Any ideas what we want to do here? I mean, how can I verify on 3.6.10 if I can't see the bug on 3.6.7.5?
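If it would help to pin down the timing during reproduction attempts, here is a small hedged helper (not part of vdsm or the engine) that tails vdsm.log and prints only the lines relevant to steps 2-6; the log path and message substrings are taken from the description above.

    # Tail vdsm.log and print only the lines relevant to this bug, so the gap
    # between the domain going INVALID/VALID and the EIO pause/resume events
    # can be measured during a reproduction attempt.
    import time

    PATTERNS = ('became INVALID', 'became VALID', 'abnormal vm stop',
                'contEIOVms', 'monitor become unresponsive')

    with open('/var/log/vdsm/vdsm.log') as log:
        log.seek(0, 2)                     # start at the end, like tail -f
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.5)
                continue
            if any(p in line for p in PATTERNS):
                print(line, end='')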
AFAIU we don't need doc_text here.
Verified on rhevm-3.6.10.2-0.2.el6. I ran the flow mentioned above twice: once stopping the nfs-server service, and another time using kill -9 to kill the nfsd processes. I got the same results mentioned in comment 11. This doesn't ensure 100% that the problem is solved, but it's near impossible to reproduce the bug, so I think it will have to do.
(In reply to sefi litmanovich from comment #11)
> Hi, so far I was not able to reproduce the bug; I'm not sure if there's some
> magic timing required or I'm just missing something.
> My env is hosted engine with:
> rhevm-3.6.7.5-0.1.el6.noarch
> Host:
> RHEL 7.3
> vdsm-4.17.37-1.el7ev.noarch
> libvirt-daemon-2.0.0-10.el7.x86_64

Hi Sefi! I'm pretty sure this is hard to hit, and a 100% reproducer might simply be impossible. However:

* RHEL 7.3 with RHV 3.6.7 does not seem to be a supported match.
* The main component involved here is vdsm, not the engine.
* The customer hit this on vdsm 4.17.33; you used 4.17.37.
* Michal said there are some recent improvements in this area, so using 4.17.37 you may have even lower chances of hitting it.

Hope this helps.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0109.html