Description of problem:

1. Storage failure (NFS).

2. Many VMs' monitors become unresponsive (are the VMs already paused?). For example, these two VMs (eb93 and c0c5):

Thread-1510499::WARNING::2016-10-31 22:01:01,354::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`eb93a97c-ac98-4993-b05b-60ffac02b23d`::monitor become unresponsive (command timeout, age=63.0099999998)
Thread-1510499::WARNING::2016-10-31 22:01:01,363::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`c0c5d75b-3913-44bd-9827-7d128f9f8e3c`::monitor become unresponsive (command timeout, age=63.0099999998)

3. The storage domain monitor fails and the SD becomes INVALID:

Thread-27::INFO::2016-10-31 22:01:13,562::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain deb3dbaa-a2d5-412e-b53a-9f2666a7ec12 became INVALID

It also emits a useless(?) contEIO that returns on the first "if" of the function. No problem.

Thread-1510530::DEBUG::2016-10-31 22:01:13,563::misc::804::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Calling registered method `contEIOVms`

4. Only ~10 minutes later does EIO show up in the vdsm logs. Here are both VMs:

libvirtEventLoop::INFO::2016-10-31 22:10:05,559::vm::3795::virt.vm::(onIOError) vmId=`c0c5d75b-3913-44bd-9827-7d128f9f8e3c`::abnormal vm stop device virtio-disk0 error eio
libvirtEventLoop::INFO::2016-10-31 22:10:05,559::vm::5169::virt.vm::(_logGuestCpuStatus) vmId=`c0c5d75b-3913-44bd-9827-7d128f9f8e3c`::CPU stopped: onIOError
libvirtEventLoop::INFO::2016-10-31 22:10:05,643::vm::3795::virt.vm::(onIOError) vmId=`eb93a97c-ac98-4993-b05b-60ffac02b23d`::abnormal vm stop device ide0-0-0 error eio
libvirtEventLoop::INFO::2016-10-31 22:10:05,643::vm::5169::virt.vm::(_logGuestCpuStatus) vmId=`eb93a97c-ac98-4993-b05b-60ffac02b23d`::CPU stopped: onIOError

5. 23 seconds later storage monitoring works again and the domain becomes VALID:

Thread-27::INFO::2016-10-31 22:10:28,591::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain deb3dbaa-a2d5-412e-b53a-9f2666a7ec12 became VALID
Thread-1510949::DEBUG::2016-10-31 22:10:28,592::misc::804::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Calling registered method `contEIOVms`

6. Only one of the VMs runs again:

Thread-1510950::INFO::2016-10-31 22:10:28,593::clientIF::180::vds::(contEIOVms) vmContainerLock acquired
Thread-1510950::INFO::2016-10-31 22:10:28,596::clientIF::187::vds::(contEIOVms) Cont vm eb93a97c-ac98-4993-b05b-60ffac02b23d in EIO

* There were 28 VMs; only 1 was resumed.
* ALL pause events were with EIO.
* I can't find any similar bug that wasn't already closed.
* Looking at the vdsm tree I can see some work done in this area, but nothing specific (to my eyes) to what I see here.
* The storage monitoring caught the INVALID and then the VALID, so it's not the corner case where the outage is too fast.
* Lots of monitors/periodic operations are being discarded; even the VM that "continues" still has its monitor not responding.
* The VMs had to be powered off.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20160711.0.el7ev)
libvirt-1.2.17-13.el7_2.5.x86_64
vdsm-4.17.33
rhevm-3.6.7.5-0.1.el6.noarch

Actual results:
VMs paused until Power Off.

Expected results:
VMs paused with EIO continue.

Additional info:
VDSM logs attached.
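For readers not familiar with the resume path: steps 3-6 go through a callback that vdsm registers on domain state changes (the `contEIOVms` seen in the logs). A minimal sketch of that flow follows; this is not vdsm's actual code, and all names (vm_container, pause_code, disk_domains, cont) are illustrative assumptions about the shape of the objects involved.

    # Hedged sketch (not vdsm's actual code) of the resume-on-VALID callback
    # that the logs show as clientIF.contEIOVms, registered on
    # Storage.DomainMonitor.onDomainStateChange.
    def cont_eio_vms(sd_uuid, is_domain_state_valid, vm_container, vm_container_lock):
        # The INVALID notification at 22:01:13 hits this early return; that is
        # the "useless(?) contEIO" mentioned in step 3.
        if not is_domain_state_valid:
            return
        with vm_container_lock:                    # "vmContainerLock acquired"
            for vm in vm_container.values():
                # Only VMs that vdsm already knows are paused with EIO on this
                # domain get resumed. A VM whose libvirt pause event arrived
                # ~10 minutes late (step 4), or whose monitor is stuck, may not
                # be in that state yet and is skipped.
                if vm.pause_code == 'EIO' and sd_uuid in vm.disk_domains:
                    vm.cont()                      # "Cont vm <id> in EIO"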
There is quite a difference between non-responsive and paused. In general, if a VM stops responding and doesn't come back, then indeed vdsm won't be able to resume it. There are bugs about that all the way up to 3.6.9(?) and 4.0.4/4.0.5. You would see that those domains are either working or paused but vdsm doesn't report them (you'd see * in the vdsClient list table output). Can you check whether you see that? A reproduction scenario would help indeed.
Hi Michal! Thanks for taking this one.

(In reply to Michal Skrivanek from comment #3)
> There is quite a difference between non-responsive and paused. In general,
> if a VM stops responding and doesn't come back, then indeed vdsm won't be
> able to resume it.

Yes, non-responsive is different from paused. Still, from the logs it looks like the monitor goes unresponsive when the VM pauses, and the pause event arrived in vdsm much later than the VM was actually paused. Weird.

Before I try to reproduce: are you saying that if the monitor goes unresponsive before the libvirt pause event comes up, then it is unlikely that the VM will be in the for loop of contEIO that resumes them?

> There are bugs about that all the way up to 3.6.9(?) and
> 4.0.4/4.0.5.

Right, would you like to close this one and work on those other bugs? I just can't find them :(

> You would see that those domains are either working or paused but vdsm
> doesn't report them (you'd see * in the vdsClient list table output).
> Can you check whether you see that? A reproduction scenario would help indeed.

Right, this makes sense to me. So contEIO doesn't resume those VMs because listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_PAUSED) doesn't return them in the first place?

Thanks!
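In case it helps to check that on an affected host, something along these lines would show which domains libvirt itself still reports as paused and why. The libvirt-python calls used (openReadOnly, listAllDomains, state, UUIDString, name) are real API; the script itself is only an illustrative sketch, not something vdsm runs.

    # Cross-check which domains libvirt reports as paused, and with what reason.
    import libvirt

    conn = libvirt.openReadOnly('qemu:///system')
    for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_PAUSED):
        state, reason = dom.state()
        # An EIO pause should show reason == VIR_DOMAIN_PAUSED_IOERROR
        print(dom.UUIDString(), dom.name(),
              'ioerror' if reason == libvirt.VIR_DOMAIN_PAUSED_IOERROR else reason)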
The most likely candidates are bug 1398418, bug 1391506, and bug 1364925.
So, once those bugs are resolved, I believe the current behavior is OK and we should not try to resume VMs paused due to EIO when the host is non-responsive.
Hi, so far I was not able to reproduce the bug; I'm not sure if there's some magic timing required or I'm just missing something.

My env is hosted engine with:
rhevm-3.6.7.5-0.1.el6.noarch
Host:
RHEL 7.3
vdsm-4.17.37-1.el7ev.noarch
libvirt-daemon-2.0.0-10.el7.x86_64

I am using NFS which I set up locally on my laptop, mounted with NFS version 4.0. I created 25 VMs with disks on my local NFS SD and started them all. To create the storage failure I just stopped the nfs-server service. I monitored vdsm.log and could see all the messages mentioned in points 2-4 of Germano's description. All the VMs in the engine were in status 'not responding'. As the NFS failure wasn't organic, I didn't see the storage monitor come back after 23 seconds; I had to start the server again, and it took a few more minutes until I got that message. After the storage monitor was up again, all VMs resumed and got to state 'up'.

Any ideas what we want to do here? I mean, how can I verify on 3.6.10 if I can't see the bug on 3.6.7.5?
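If it would help to pin down the timing during reproduction attempts, here is a small hedged helper (not part of vdsm or the engine) that tails vdsm.log and prints only the lines relevant to steps 2-6; the log path and message substrings are taken from the description above.

    # Tail vdsm.log and print only the lines relevant to this bug, so the gap
    # between the domain going INVALID/VALID and the EIO pause/resume events
    # can be measured during a reproduction attempt.
    import time

    PATTERNS = ('became INVALID', 'became VALID', 'abnormal vm stop',
                'contEIOVms', 'monitor become unresponsive')

    with open('/var/log/vdsm/vdsm.log') as log:
        log.seek(0, 2)                     # start at the end, like tail -f
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.5)
                continue
            if any(p in line for p in PATTERNS):
                print(line, end='')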
AFAIU we don't need doc_text here.
Verified on rhevm-3.6.10.2-0.2.el6. I ran the flow mentioned above twice: once stopping the nfs-server service, and another time using kill -9 to kill the nfsd processes. I got the same results mentioned in comment 11. This doesn't ensure 100% that the problem is solved, but it's near impossible to reproduce the bug, so I think it will have to do.
(In reply to sefi litmanovich from comment #11)
> Hi, so far I was not able to reproduce the bug; I'm not sure if there's some
> magic timing required or I'm just missing something.
> My env is hosted engine with:
> rhevm-3.6.7.5-0.1.el6.noarch
> Host:
> RHEL 7.3
> vdsm-4.17.37-1.el7ev.noarch
> libvirt-daemon-2.0.0-10.el7.x86_64

Hi Sefi! I'm pretty sure this is hard to hit, and a 100% reproducer might simply be impossible. However:

* RHEL 7.3 with RHV 3.6.7 does not seem to be a supported match.
* The main component involved here is vdsm, not the engine.
* The customer hit this on vdsm 4.17.33; you used 4.17.37.
* Michal said there are some recent improvements in this area, so using 4.17.37 you may have even lower chances of hitting it.

Hope this helps.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0109.html