Bug 846218
Summary: [vdsm] host fails to recover in case _recoverVm function fails (Recovering from crash or Initializing)

Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.3
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Hardware: x86_64
OS: Linux
Target Milestone: rc
Target Release: ---
Whiteboard: virt, infra
Reporter: GenadiC <gcheresh>
Assignee: Martin Kletzander <mkletzan>
QA Contact: Virtualization Bugs <virt-bugs>
CC: abaron, acathrow, bazulay, dallan, danken, dyasny, dyuan, hateya, iheim, lpeer, michal.skrivanek, mzhan, rwu, whuang, ykaul
Doc Type: Bug Fix
Type: Bug
Cloned as: 852008 (view as bug list)
Bug Blocks: 852008
Last Closed: 2012-09-06 14:50:42 UTC
Description (GenadiC, 2012-08-07 08:14:30 UTC):

Created attachment 602670 [details]
vdsm & libvirt full logs
Comment 2:

There's a gap in the logs, and the start of the error occurrence in libvirt.log is not there. Can you reproduce? Also, I suppose qemu logs are needed as well.

Libvirt should ultimately handle the situation, and then vdsm could proceed further. I don't see much value in vdsm skipping over a specific VM when there are underlying libvirt issues. IMHO there is nothing to be fixed in VDSM, but this should certainly be investigated by the libvirt/qemu developers. Moving to libvirt, though with the current logs there is little to do.

Comment 3:

(In reply to comment #2)

Not sure why qemu logs are needed, and not sure why you moved this to libvirt. This is bad behavior in vdsm which reveals a deadlock in the VM recovery flow. In this case libvirt failed to communicate with the qemu monitor socket and threw an exception (a different issue), but vdsm should behave differently and not get stuck. We had similar issues in the past which we solved in vdsm. Setting need-info on Dan to share his thoughts. I strongly recommend moving it back to vdsm.

Comment 4:

Vdsm must *not* skip a problematic vmRecovery, as this may lead to split brain. A more reasonable choice is to have parallel vmRecoveries, and block a faulty one until things get better.

However, I agree with Michal that we should understand why there was a monitor communication issue.

Comment 5:

(In reply to comment #4)
> Vdsm must *not* skip a problematic vmRecovery, as this may lead to split
> brain. A more reasonable choice is to have parallel vmRecoveries, and block
> a faulty one until things get better.

The purpose of this bug is to deal with the deadlock, or at least with better handling in case vmRecovery fails, not with why libvirt did not succeed in communicating with the monitor (the qemu process was in D state, so that is reasonable).

Comment 6:

Hi GenadiC,

Could you provide detailed steps to reproduce this bug, and the libvirt, qemu-kvm and vdsm versions? They would help me reproduce it. Thanks very much.

Wenlong

Comment 7:

(In reply to comment #6)

It happened just once out of the dozens of times we tried, so it will not be easy to reproduce. You need to play with hotplug/hotunplug on a VM that has the Port mirroring checkbox enabled (it happened after a hot unplug on such a VM).

The versions we used:
LIBVIRT: libvirt-0.9.10-21.el6_3.3.x86_64
VDSM: vdsm-4.9.6-26.0.el6_3.x86_64
QEMU-KVM: qemu-img-rhev-0.12.1.2-2.298.el6_3.x86_64.rpm

Comment 8:

(In reply to comment #7)

Hi GenadiC,

Thanks for the information, but I cannot reproduce it. Would you help me verify it once it is fixed? Thanks very much.

Wenlong

Comment 9:

Hi Wenlong,
The moment it is fixed, I'll help to verify it.

Comment 10:

(In reply to comment #9)

Thanks very much.

Comment 11:

(In reply to comment #5)
> the purpose of this bug is to deal with the deadlock, or at least with
> better handling in case vmRecovery fails

There is no VDSM deadlock, just a lock. Once libvirt communication is reestablished, it would continue normally.

Comment 12:

I personally fail to see any problem with libvirt here, except for 3 GiB of log text saying that the monitor socket is closed (if this happens without any other outside entity involved, then that is a slight problem), but that is not the problem this bug was created for. So I am wondering where the error with VM recovery is. What is the root cause of QEMU closing the socket (or crashing entirely), and how does it happen? I am not that familiar with what vdsm does with libvirt when port mirroring is on, or with what exactly the VM recovery procedure looks like from libvirt's point of view.

Can you check whether the QEMU process for the machine is still alive in the state where vdsm throws out libvirt's error from the description?

Comment 13:

(In reply to comment #12)
> Can you check whether the QEMU process for the machine is still alive in the
> state where vdsm throws out libvirt's error from the description?

I am not able to reproduce this bug anymore, so I can't check the status of QEMU.
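The parallel-vmRecoveries idea suggested in comment #4 can be sketched as follows. This is a hypothetical illustration, not vdsm's actual code: the names `recover_all`, `recover_vm`, and the retry parameters are all invented for the example. The point is that each VM's recovery runs in its own worker thread, so one recovery that blocks or fails on a libvirt error is retried on its own without stalling the recovery of the other VMs.

```python
# Hypothetical sketch of parallel per-VM recovery with per-VM retry.
# Not vdsm code; recover_vm is a caller-supplied callable that may
# block or raise (e.g. on a dead qemu monitor socket).
import threading
import time

def recover_all(vm_ids, recover_vm, retry_interval=1.0, max_attempts=3):
    """Recover every VM in parallel; return a dict of vm_id -> success."""
    results = {}
    lock = threading.Lock()

    def worker(vm_id):
        for attempt in range(max_attempts):
            try:
                recover_vm(vm_id)       # may block or raise on libvirt errors
            except Exception:
                time.sleep(retry_interval)  # back off, then retry this VM only
            else:
                with lock:
                    results[vm_id] = True
                return
        with lock:
            results[vm_id] = False      # this VM stays faulty; others unaffected

    threads = [threading.Thread(target=worker, args=(v,)) for v in vm_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A recovery that keeps failing is reported as faulty rather than skipped silently, which matches the constraint in comment #4 that a problematic vmRecovery must not simply be dropped.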
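The check requested in comment #12 (is the QEMU process still alive, and is it stuck in D state?) can be done on Linux by reading `/proc/<pid>/stat`. The helper below is an assumed sketch; the pid would come from libvirt or from the qemu pidfile, neither of which is shown here.

```python
# Hypothetical helper: report the Linux process state for a pid.
# "D" means uninterruptible sleep (the state mentioned in comment #5),
# "Z" a zombie, None means the process is gone.
def qemu_process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            stat = f.read()
    except IOError:  # /proc entry missing: the process no longer exists
        return None
    # Field 3 is the state. It follows the comm field, which is wrapped
    # in parentheses and may itself contain spaces, so split after the
    # last ')' rather than naively on whitespace.
    return stat.rsplit(")", 1)[1].split()[0]
```

Had the reporter been able to reproduce the issue, this kind of check would distinguish a QEMU that crashed (state `None`) from one hung in uninterruptible I/O (state `"D"`), which is the case comment #5 describes.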