852008 – host fails to recover in case _recoverVm function fails (Recovering from crash or Initializing)

Keywords:
Status:	CLOSED DUPLICATE of bug 991091
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.1.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Yaniv Bronhaim
QA Contact:
Docs Contact:
URL:
Whiteboard:	infra
Depends On:	846218
Blocks:

Reported:	2012-08-27 10:19 UTC by Michal Skrivanek
Modified:	2016-02-10 19:20 UTC
CC List:	14 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	846218
Environment:
Last Closed:	2013-08-15 14:54:33 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Description Michal Skrivanek 2012-08-27 10:19:35 UTC

Tracking this problem in vdsm as well. VDSM should be more robust and handle the underlying libvirt error better.
"We should report that the VM is there, but not responsive. Just as we would have if Vdsm was not restarted"
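
A minimal sketch of the reporting behaviour quoted above, assuming a failed monitor query can be caught per VM. `MonitorError`, `report_vms`, `query_vm` and the `responsive` field are hypothetical names for illustration, not vdsm code; `MonitorError` stands in for `libvirt.libvirtError`.

```python
class MonitorError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""


def report_vms(vm_ids, query_vm):
    """Return one status entry per VM, even when a query fails."""
    report = []
    for vm_id in vm_ids:
        try:
            query_vm(vm_id)
        except MonitorError:
            # The VM still exists on the host; it just does not answer.
            report.append({"vmId": vm_id, "responsive": False})
        else:
            report.append({"vmId": vm_id, "responsive": True})
    return report


def query_vm(vm_id):
    # Simulated query: the qemu monitor of vm-b cannot be read.
    if vm_id == "vm-b":
        raise MonitorError("Unable to read from monitor: Connection reset by peer")


status = report_vms(["vm-a", "vm-b"], query_vm)
```

In this sketch the unreachable VM stays in the report and is only flagged as not responsive, so the rest of the host status is still returned.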

+++ This bug was initially created as a clone of Bug #846218 +++

Description of problem:

The host fails to recover (after a service restart) when it tries to recover a VM and libvirt fails to communicate with that VM's monitor socket. This leaves the host in non-operational status (the engine reports the host as 'Recovering from crash or Initializing').

clientIFinit::ERROR::2012-08-06 18:51:46,461::clientIF::279::vds::(_recoverExistingVms) Vm's recovery failed
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 244, in _recoverExistingVms
    vdsmVms = self.getVDSMVms()
  File "/usr/share/vdsm/clientIF.py", line 326, in getVDSMVms
    return [vm for vm in vms if self.isVDSMVm(vm)]
  File "/usr/share/vdsm/clientIF.py", line 286, in isVDSMVm
    vmdom = minidom.parseString(vm.XMLDesc(0))
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 381, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer

Expected results:

If _vmRecovery fails, vdsm should continue the initialization process and skip recovery of that specific VM.
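
A minimal sketch of the requested behaviour, assuming the libvirt error can be caught per VM so that one failure does not abort the whole loop. `MonitorError`, `recover_existing_vms` and `flaky_recover` are hypothetical names, not the vdsm implementation; `MonitorError` stands in for `libvirt.libvirtError`.

```python
class MonitorError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""


def recover_existing_vms(vm_ids, recover_vm):
    """Try to recover every VM; collect failures instead of aborting.

    recover_vm is a callable taking a VM id; it may raise MonitorError.
    Returns (recovered, failed) lists of VM ids.
    """
    recovered, failed = [], []
    for vm_id in vm_ids:
        try:
            recover_vm(vm_id)
        except MonitorError:
            # Skip only this VM; initialization continues with the rest.
            failed.append(vm_id)
        else:
            recovered.append(vm_id)
    return recovered, failed


def flaky_recover(vm_id):
    # Simulates a VM whose qemu monitor cannot be reached.
    if vm_id == "vm-2":
        raise MonitorError("Unable to read from monitor")


recovered, failed = recover_existing_vms(["vm-1", "vm-2", "vm-3"], flaky_recover)
```

The failed list is kept rather than discarded so that the caller can still decide what to do with the VMs that were skipped.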

--- Additional comment from gcheresh on 2012-08-07 04:22:32 EDT ---

Created attachment 602670 [details]
vdsm & libvirt full logs

--- Additional comment from michal.skrivanek on 2012-08-10 03:20:53 EDT ---

There's a gap in the logs: the start of the error occurrence is missing from libvirt.log. Can you reproduce?
Also, I suppose qemu logs are needed as well.

Libvirt should ultimately handle the situation and then vdsm could proceed further. I don't see much value in vdsm skipping over specific VM when there are underlying libvirt issues.
IMHO nothing to be fixed in VDSM, but certainly this should be investigated by libvirt/qemu guys. 

Moving to libvirt, though with current logs there's little to do

--- Additional comment from hateya on 2012-08-12 02:10:39 EDT ---

(In reply to comment #2)
> there's a gap in the logs and the start of the error occurrence in
> libvirt.log is not there. Can you reproduce?
> Also, I suppose qemu logs are needed as well.
> 
> Libvirt should ultimately handle the situation and then vdsm could proceed
> further. I don't see much value in vdsm skipping over specific VM when there
> are underlying libvirt issues.
> IMHO nothing to be fixed in VDSM, but certainly this should be investigated
> by libvirt/qemu guys. 
> 
> Moving to libvirt, though with current logs there's little to do

Not sure why qemu logs are needed, and not sure why you moved it to libvirt. This is bad behavior in vdsm which reveals a deadlock in the VM recovery flow.

In this case, libvirt failed to communicate with the qemu monitor socket and threw an exception (a different issue), but vdsm should behave differently and not get stuck.

We had similar issues in the past which we solved in vdsm. Setting need-info on Dan to share his thoughts.

I would strongly recommend moving it back to vdsm.

--- Additional comment from danken on 2012-08-12 05:20:40 EDT ---

Vdsm must *not* skip a problematic vmRecovery, as this may lead to split brain. A more reasonable choice is to run vmRecoveries in parallel and block a faulty one until things get better.
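
A minimal sketch of that alternative, assuming one thread per VM that keeps retrying until its recovery succeeds, so a faulty VM blocks only its own thread. `MonitorError`, `recover_in_parallel`, `flaky_recover` and the retry delay are hypothetical, not vdsm code.

```python
import threading
import time


class MonitorError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""


def recover_in_parallel(vm_ids, recover_vm, retry_delay=0.01):
    """Start one recovery thread per VM; each retries until it succeeds.

    Returns (threads, recovered); recovered is a set filled in as VMs
    come back. No VM is ever skipped.
    """
    recovered = set()
    lock = threading.Lock()

    def worker(vm_id):
        while True:
            try:
                recover_vm(vm_id)
            except MonitorError:
                time.sleep(retry_delay)  # wait for things to get better
                continue
            with lock:
                recovered.add(vm_id)
            return

    threads = [
        threading.Thread(target=worker, args=(vm_id,), daemon=True)
        for vm_id in vm_ids
    ]
    for thread in threads:
        thread.start()
    return threads, recovered


attempts = {"vm-2": 0}


def flaky_recover(vm_id):
    # vm-2 fails its first three attempts, then its monitor "comes back".
    if vm_id == "vm-2":
        attempts["vm-2"] += 1
        if attempts["vm-2"] <= 3:
            raise MonitorError("Unable to read from monitor")


threads, recovered = recover_in_parallel(["vm-1", "vm-2"], flaky_recover)
for thread in threads:
    thread.join(timeout=5)
```

Here vm-1 is recovered immediately while vm-2 keeps retrying in the background; whether the host should report itself as up while a recovery is still blocked is left open by this sketch.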

However, I agree with Michal that we should understand why there was a monitor communication issue.

--- Additional comment from hateya on 2012-08-12 06:40:11 EDT ---

(In reply to comment #4)
> Vdsm must *not* skip a problematic vmRecovery as this may lead to brain
> split. A more reasonable choice is to have parallel  vmRecoveries, and block
> a faulty one until things get better.
> 
> However, I agree with Michal that we should understand why there was a
> monitor communication issue.

The purpose of this bug is to deal with the deadlock, or to get better handling when vmRecovery fails, not to find out why libvirt failed to communicate with the monitor (the qemu process was in D state, so that failure is reasonable).

--- Additional comment from whuang on 2012-08-16 23:08:25 EDT ---

Hi, GenadiC 

Could you provide detailed steps to reproduce this bug, along with the libvirt, qemu-kvm and vdsm versions? They would help me reproduce it.
Thanks very much.

Wenlong

--- Additional comment from gcheresh on 2012-08-20 08:01:50 EDT ---

(In reply to comment #6)
> Hi, GenadiC 
> 
> Could you provide some detailed steps to reproduce this bug , and libvirt
> qemu-kvm and vdsm version , they are helpful for me to reproduce this bug .
> Thanks very much .
> 
> Wenlong

It happened just once out of the dozens of times we tried, so it will not be easy to reproduce.
You need to play with hotplug/hotunplug on a VM that has the Port mirroring checkbox enabled (it happened after a hot unplug on such a VM).

The versions we used:
LIBVIRT: libvirt-0.9.10-21.el6_3.3.x86_64 
VDSM: vdsm-4.9.6-26.0.el6_3.x86_64 
QEMU-KVM: qemu-img-rhev-0.12.1.2-2.298.el6_3.x86_64.rpm

--- Additional comment from whuang on 2012-08-20 22:49:19 EDT ---

(In reply to comment #7)
> (In reply to comment #6)
> > Hi, GenadiC 
> > 
> > Could you provide some detailed steps to reproduce this bug , and libvirt
> > qemu-kvm and vdsm version , they are helpful for me to reproduce this bug .
> > Thanks very much .
> > 
> > Wenlong
> 
> It happened just once out of dozens time we tried, so it will not be easy to
> reproduce.
> You need to play with hotplug/hotunplug on VM that has Port mirroring
> checkbox enabled (It happened after hot unplug on such VM)
> 
> The version we used:
> LIBVIRT: libvirt-0.9.10-21.el6_3.3.x86_64 
> VDSM: vdsm-4.9.6-26.0.el6_3.x86_64 
> QEMU-KVM: qemu-img-rhev-0.12.1.2-2.298.el6_3.x86_64.rpm

Hi, GenadiC 

Thanks for the information, but I cannot reproduce it. Would you help me verify it once it is fixed?
Thanks very much

Wenlong

--- Additional comment from gcheresh on 2012-08-21 03:54:41 EDT ---

Hi Wenlong, 
As soon as it is fixed, I'll help verify it.

--- Additional comment from whuang on 2012-08-21 04:11:20 EDT ---

(In reply to comment #9)
> Hi Wenlong, 
> The moment it will be fixed, I'll help to verify it

Thanks very much

--- Additional comment from michal.skrivanek on 2012-08-21 04:16:56 EDT ---

(In reply to comment #5)
> (In reply to comment #4)
> > Vdsm must *not* skip a problematic vmRecovery as this may lead to brain
> > split. A more reasonable choice is to have parallel  vmRecoveries, and block
> > a faulty one until things get better.
> > 
> > However, I agree with Michal that we should understand why there was a
> > monitor communication issue.
> 
> the purpose of this bug is deal with the deadlock or better handling in case
> vmRecovery fails, not why libvirt didn't succeed communicating with monitor
> (since qemu process was in D state, so its reasonable).

There's no VDSM deadlock, just a lock. Once communication with libvirt is re-established, it would continue normally.

Comment 1 RHEL Program Management 2012-12-14 08:16:35 UTC

This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 3 Yaniv Bronhaim 2013-08-12 08:00:02 UTC

Bug 991091 leads to the same stuck "recovering" status, and both bugs should be fixed in the same way. If we encounter a libvirt error while recovering existing VMs, we should restart vdsm, as we do for other communication errors. Currently we catch the libvirt exception, do nothing, and leave vdsm in the recovering state. IMHO, as we do with other libvirt exceptions, we should kill the vdsm instance and retry the recovery until it works. Otherwise, vdsm can't run properly anyway.
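
A minimal sketch of the restart-until-recovered idea, assuming the daemon exits on a libvirt error during recovery and a supervisor starts a fresh instance. The restart is simulated here by a loop; `MonitorError`, `run_with_restarts`, `start_instance` and the restart limit are hypothetical, not vdsm code.

```python
class MonitorError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""


def run_with_restarts(start_instance, max_restarts=5):
    """Start the instance again after each fatal libvirt error.

    Returns the number of starts that were needed, or raises if the
    recovery still fails after max_restarts starts.
    """
    for attempt in range(1, max_restarts + 1):
        try:
            start_instance()
        except MonitorError:
            # The instance "died" during recovery; start a fresh one.
            continue
        return attempt
    raise RuntimeError("recovery still failing after %d starts" % max_restarts)


calls = []


def start_instance():
    # Simulated daemon start: recovery fails on the first two starts.
    calls.append("start")
    if len(calls) < 3:
        raise MonitorError("Unable to read from monitor")


attempts_needed = run_with_restarts(start_instance)
```

The restart limit is an addition of this sketch; the comment above asks for retrying until recovery works, which corresponds to an unbounded loop.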

Comment 4 Yaniv Bronhaim 2013-08-15 14:54:33 UTC


*** This bug has been marked as a duplicate of bug 991091 ***
