Bug 846218

Summary: [vdsm] host fails to recover in case _recoverVm function fails (Recovering from crash or Initializing)
Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.3
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Whiteboard: virt, infra
Reporter: GenadiC <gcheresh>
Assignee: Martin Kletzander <mkletzan>
QA Contact: Virtualization Bugs <virt-bugs>
CC: abaron, acathrow, bazulay, dallan, danken, dyasny, dyuan, hateya, iheim, lpeer, michal.skrivanek, mzhan, rwu, whuang, ykaul
Target Milestone: rc
Doc Type: Bug Fix
Clones: 852008 (view as bug list)
Bug Blocks: 852008
Type: Bug
Last Closed: 2012-09-06 14:50:42 UTC

Attachments:
vdsm & libvirt full logs

Description GenadiC 2012-08-07 08:14:30 UTC
Description of problem:

The host fails to recover (after a vdsm service restart) when it tries to recover a VM and libvirt fails to communicate with the VM's monitor socket. This leaves the host non-operational (the engine reports the host as 'Recovering from crash or Initializing').

clientIFinit::ERROR::2012-08-06 18:51:46,461::clientIF::279::vds::(_recoverExistingVms) Vm's recovery failed
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 244, in _recoverExistingVms
    vdsmVms = self.getVDSMVms()
  File "/usr/share/vdsm/clientIF.py", line 326, in getVDSMVms
    return [vm for vm in vms if self.isVDSMVm(vm)]
  File "/usr/share/vdsm/clientIF.py", line 286, in isVDSMVm
    vmdom = minidom.parseString(vm.XMLDesc(0))
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 381, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer

Expected results:

If _vmRecovery fails for a specific VM, vdsm should skip that VM's recovery and continue with the initialization process.
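
For illustration only, a minimal sketch of the kind of per-VM error handling requested here, based on the getVDSMVms/isVDSMVm code path shown in the traceback above (the _listDomains helper and the surrounding class are assumptions, not taken from the actual vdsm source):

# Hypothetical sketch, not the actual vdsm code: wrap the per-VM check so a
# libvirtError raised for one domain does not abort the whole recovery loop.
import logging
import libvirt

def getVDSMVms(self):
    vms = self._listDomains()  # assumed helper returning libvirt domain objects
    vdsmVms = []
    for vm in vms:
        try:
            if self.isVDSMVm(vm):  # calls vm.XMLDesc(0) internally
                vdsmVms.append(vm)
        except libvirt.libvirtError:
            # Skip this VM and keep checking the others.
            logging.exception("Skipping VM %s: libvirt error during recovery",
                              vm.name())
    return vdsmVms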

Comment 1 GenadiC 2012-08-07 08:22:32 UTC
Created attachment 602670 [details]
vdsm & libvirt full logs

Comment 2 Michal Skrivanek 2012-08-10 07:20:53 UTC
there's a gap in the logs and the start of the error occurrence in libvirt.log is not there. Can you reproduce?
Also, I suppose qemu logs are needed as well.

Libvirt should ultimately handle the situation and then vdsm could proceed further. I don't see much value in vdsm skipping over a specific VM when there are underlying libvirt issues.
IMHO there is nothing to be fixed in VDSM, but this should certainly be investigated by the libvirt/qemu guys.

Moving to libvirt, though with the current logs there's little to do.

Comment 3 Haim 2012-08-12 06:10:39 UTC
(In reply to comment #2)
> there's a gap in the logs and the start of the error occurrence in
> libvirt.log is not there. Can you reproduce?
> Also, I suppose qemu logs are needed as well.
> 
> Libvirt should ultimately handle the situation and then vdsm could proceed
> further. I don't see much value in vdsm skipping over a specific VM when
> there are underlying libvirt issues.
> IMHO there is nothing to be fixed in VDSM, but this should certainly be
> investigated by the libvirt/qemu guys.
> 
> Moving to libvirt, though with the current logs there's little to do.

Not sure why qemu logs are needed, and not sure why you moved it to libvirt; this is bad behavior in vdsm which reveals a deadlock in the VM recovery flow.

In this case, libvirt failed to communicate with the qemu monitor socket and threw an exception (a different issue), but vdsm should behave differently and not get stuck.

We had similar issues in the past which we solved in vdsm; setting needinfo on Dan to share his thoughts.

I would strongly recommend moving it back to vdsm.

Comment 4 Dan Kenigsberg 2012-08-12 09:20:40 UTC
Vdsm must *not* skip a problematic vmRecovery as this may lead to split brain. A more reasonable choice is to have parallel vmRecoveries, and block a faulty one until things get better.

However, I agree with Michal that we should understand why there was a monitor communication issue.
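
A rough sketch of the parallel-recovery idea, purely for illustration; the method names mirror the traceback above, but the threading structure is an assumption, not the actual vdsm implementation:

# Illustrative only: run each VM's recovery in its own thread so one stuck
# recovery (e.g. blocked on a libvirt monitor call) does not stall the rest.
import threading

def _recoverExistingVms(self):
    threads = []
    for vm in self.getVDSMVms():
        t = threading.Thread(target=self._recoverVm, args=(vm,),
                             name="recovery/%s" % vm.name())
        t.daemon = True  # a hung recovery must not block vdsm shutdown
        t.start()
        threads.append(t)
    # Wait a bounded time for each recovery; a faulty one simply stays
    # blocked in its own thread until libvirt becomes responsive again.
    for t in threads:
        t.join(60)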

Comment 5 Haim 2012-08-12 10:40:11 UTC
(In reply to comment #4)
> Vdsm must *not* skip a problematic vmRecovery as this may lead to split
> brain. A more reasonable choice is to have parallel vmRecoveries, and block
> a faulty one until things get better.
> 
> However, I agree with Michal that we should understand why there was a
> monitor communication issue.

The purpose of this bug is to deal with the deadlock, or with better handling in case vmRecovery fails, not with why libvirt didn't succeed in communicating with the monitor (the qemu process was in D state, so that is reasonable).

Comment 6 Huang Wenlong 2012-08-17 03:08:25 UTC
Hi, GenadiC 

Could you provide some detailed steps to reproduce this bug, and the libvirt, qemu-kvm, and vdsm versions? They would be helpful for me to reproduce this bug.
Thanks very much.

Wenlong

Comment 7 GenadiC 2012-08-20 12:01:50 UTC
(In reply to comment #6)
> Hi, GenadiC 
> 
> Could you provide some detailed steps to reproduce this bug, and the libvirt,
> qemu-kvm, and vdsm versions? They would be helpful for me to reproduce this bug.
> Thanks very much.
> 
> Wenlong

It happened just once out of dozens of times we tried, so it will not be easy to reproduce.
You need to play with hotplug/hotunplug on a VM that has the Port mirroring checkbox enabled (it happened after a hot unplug on such a VM).

The versions we used:
LIBVIRT: libvirt-0.9.10-21.el6_3.3.x86_64 
VDSM: vdsm-4.9.6-26.0.el6_3.x86_64 
QEMU-KVM: qemu-img-rhev-0.12.1.2-2.298.el6_3.x86_64.rpm

Comment 8 Huang Wenlong 2012-08-21 02:49:19 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > Hi, GenadiC 
> > 
> > Could you provide some detailed steps to reproduce this bug, and the libvirt,
> > qemu-kvm, and vdsm versions? They would be helpful for me to reproduce this bug.
> > Thanks very much.
> > 
> > Wenlong
> 
> It happened just once out of dozens of times we tried, so it will not be easy
> to reproduce.
> You need to play with hotplug/hotunplug on a VM that has the Port mirroring
> checkbox enabled (it happened after a hot unplug on such a VM).
> 
> The versions we used:
> LIBVIRT: libvirt-0.9.10-21.el6_3.3.x86_64 
> VDSM: vdsm-4.9.6-26.0.el6_3.x86_64 
> QEMU-KVM: qemu-img-rhev-0.12.1.2-2.298.el6_3.x86_64.rpm

Hi, GenadiC 

Thanks for the information, but I cannot reproduce it. Would you help me verify it once it is fixed?
Thanks very much.

Wenlong

Comment 9 GenadiC 2012-08-21 07:54:41 UTC
Hi Wenlong, 
The moment it is fixed, I'll help verify it.

Comment 10 Huang Wenlong 2012-08-21 08:11:20 UTC
(In reply to comment #9)
> Hi Wenlong, 
> The moment it is fixed, I'll help verify it.

Thanks very much

Comment 11 Michal Skrivanek 2012-08-21 08:16:56 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > Vdsm must *not* skip a problematic vmRecovery as this may lead to split
> > brain. A more reasonable choice is to have parallel vmRecoveries, and block
> > a faulty one until things get better.
> > 
> > However, I agree with Michal that we should understand why there was a
> > monitor communication issue.
> 
> The purpose of this bug is to deal with the deadlock, or with better handling
> in case vmRecovery fails, not with why libvirt didn't succeed in communicating
> with the monitor (the qemu process was in D state, so that is reasonable).
There's no VDSM deadlock, just a lock. Once libvirt communication is reestablished, it would continue normally.

Comment 12 Martin Kletzander 2012-09-05 17:04:31 UTC
I personally fail to see any problem with libvirt here, except for 3GiB of text saying that the monitor socket is closed (if we do this without any other outside entity, then it is a slight problem), but that's not the problem this bug was created for. So I'm wondering where the error with VM recovery is. What is the root cause of QEMU closing the socket (or crashing entirely), and how does it happen? I'm not that familiar with what vdsm does with libvirt when port mirroring is on, or with what exactly the VM recovery procedure is from libvirt's point of view.

Can you check whether the QEMU process for the machine is still alive in the state where vdsm throws the libvirt error from the description?
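
For reference, one possible way to check this (a hedged sketch; the status-file path and pid attribute reflect libvirt's usual /var/run/libvirt/qemu layout on RHEL 6 and are assumptions here, not something stated in this bug):

# Hedged sketch: look up the QEMU pid from libvirt's per-domain status file
# and check whether that process still exists.
import os
from xml.dom import minidom

def qemu_pid_alive(domain_name):
    status_path = "/var/run/libvirt/qemu/%s.xml" % domain_name
    if not os.path.exists(status_path):
        return False
    pid = int(minidom.parse(status_path).documentElement.getAttribute("pid"))
    return os.path.exists("/proc/%d" % pid)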

Comment 13 GenadiC 2012-09-06 14:50:42 UTC
(In reply to comment #12)
> I personally fail to see any problem with libvirt here, except for 3GiB of
> text saying that the monitor socket is closed (if we do this without any
> other outside entity, then it is a slight problem), but that's not the
> problem this bug was created for. So I'm wondering where the error with VM
> recovery is. What is the root cause of QEMU closing the socket (or crashing
> entirely), and how does it happen? I'm not that familiar with what vdsm does
> with libvirt when port mirroring is on, or with what exactly the VM recovery
> procedure is from libvirt's point of view.
> 
> Can you check whether the QEMU process for the machine is still alive in the
> state where vdsm throws the libvirt error from the description?

I am not able to reproduce this bug anymore, so I can't check the status of QEMU.