Bug 851837
| Field | Value |
|---|---|
| Summary | vdsm: vdsm is stuck in recovery for almost an hour on NFS storage with running vm's when blocking storage from host |
| Product | Red Hat Enterprise Linux 6 |
| Reporter | Dafna Ron <dron> |
| Component | vdsm |
| Assignee | Yaniv Bronhaim <ybronhei> |
| Status | CLOSED DUPLICATE |
| QA Contact | Haim <hateya> |
| Severity | urgent |
| Priority | high |
| Version | 6.5 |
| CC | abaron, bazulay, iheim, lpeer, michal.skrivanek, michen, pstehlik, yeylon, ykaul |
| Target Milestone | rc |
| Keywords | Regression, Reopened |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | infra |
| Doc Type | Bug Fix |
| Story Points | --- |
| Clones | 881732 (view as bug list) |
| Last Closed | 2012-12-12 09:12:58 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Depends On | 881732, 884650 |
| Bug Blocks | 707622 |
| Attachments | vdsm logs (653616), logs (607053) |
On which of the hosts did this happen? Obviously on gold-vdsd, but which one was it (SPM? and which log file?)

(In reply to comment #2)
> On which of the hosts did this happen? Obviously on gold-vdsd, but which one was it (SPM? and which log file?)

gold-vdsd is the SPM - it's vdsm.log.1.xz.spm

In the log you can see that HSM initialization finished very quickly after vdsm startup, and that the first VM recovery starts 6 minutes later, which means the hang is somewhere in between. From looking at the code, my guess is that getVDSMVms, which calls self._libvirt.listDomainsID(), is hung waiting for libvirt (a known issue which has already been resolved, AFAIK).

After new vdsm and libvirt it no longer reproduces.

*** Bug 851832 has been marked as a duplicate of this bug. ***

Per comment #5, moving this BZ to CLOSED WORKSFORME.

Moving to re-open per Dafna's test, which proved to be a consistent reproducer.

Repro steps:
1) have 3 hosts in the data-center connected to 1 NFS data-domain
2) have VMs running on each host
3) block the connection to the master domain from the SPM and one other host, chosen randomly

Current results: the host fails to initialize for at least 40 minutes while producing the following message:

    Thread-391::DEBUG::2012-11-28 17:46:30,479::BindingXMLRPC::894::vds::(wrapper) client [10.35.97.65]::call getCapabilities with () {} flowID [3f72c997]
    Thread-391::DEBUG::2012-11-28 17:46:30,480::BindingXMLRPC::900::vds::(wrapper) return getCapabilities with {'status': {'message': 'Recovering from crash or Initializing', 'code': 99}}

We attached a debugger (gdb) to the process and caught the following output:

    Thread 2 (Thread 0x7f6b157fb700 (LWP 14051)):
    #10 <built-in function virDomainGetXMLDesc>
    #12 file '/usr/lib64/python2.6/site-packages/libvirt.py', in 'XMLDesc'
    #15 file '/usr/share/vdsm/clientIF.py', in 'isVDSMVm'
    #18 file '/usr/share/vdsm/clientIF.py', in 'getVDSMVms'
    #21 file '/usr/share/vdsm/clientIF.py', in '_recoverExistingVms'
    #26 file '/usr/lib64/python2.6/threading.py', in 'run'
    #29 file '/usr/lib64/python2.6/threading.py', in '__bootstrap_inner'
    #32 file '/usr/lib64/python2.6/threading.py', in '__bootstrap'

This shows us that vdsm is trying to recover existing VMs; apparently, during that process, it tries to prepare their disks, and since storage is blocked, the operation fails. The impact: while vdsm fails to answer getVdsCaps, the engine moves the host to non-responsive and might send a fencing command, which will kill the running VMs.

Created attachment 653616 [details]
vdsm logs
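For illustration only, not vdsm's actual fix: the stack trace above shows the recovery thread blocked inside vm.XMLDesc(). One generic way to keep a single stuck libvirt call from freezing the whole recovery flow is to run it in a worker thread and give up after a timeout. The helper name call_with_timeout below is hypothetical, and the abandoned worker stays blocked in the background; the sketch only illustrates why an unbounded XMLDesc call turns directly into the per-VM delays seen later in the libvirt log.

```python
import threading
import time


def call_with_timeout(func, args=(), timeout=30):
    """Run func(*args) in a worker thread; give up after `timeout` seconds.

    Returns (finished, result). If the call is still blocked (for example a
    libvirt XMLDesc request waiting on a qemu monitor stuck on unreachable
    NFS), the worker thread is abandoned and (False, None) is returned so
    the caller can move on to the next VM instead of hanging.
    """
    outcome = {}

    def worker():
        try:
            outcome['value'] = func(*args)
        except Exception as exc:  # sketch only: report any failure upward
            outcome['error'] = exc

    t = threading.Thread(target=worker)
    t.daemon = True  # never block process shutdown on a stuck call
    t.start()
    t.join(timeout)
    if t.is_alive():
        return False, None
    if 'error' in outcome:
        raise outcome['error']
    return True, outcome.get('value')


if __name__ == '__main__':
    # Stand-in for a hung vm.XMLDesc(0): a call that would block for minutes.
    finished, _ = call_with_timeout(lambda: time.sleep(300), timeout=2)
    print('finished within timeout: %s' % finished)
```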
This issue seems to happen because the VDSM side is unable to recover all running VMs through libvirt; the reason may be that libvirt is blocking on a running qemu process. Haim, does libvirt respond to virsh -r capabilities? And to other domain-related commands?

Reviewing the stack trace again (comment #8), it looks like VDSM is stuck in recovery due to blocking on libvirt in the function isVDSMVm, on the call to vm.XMLDesc (see the function below):

    def isVDSMVm(self, vm):
        """
        Return True if vm seems as if it was created by vdsm.
        """
        try:
            vmdom = minidom.parseString(vm.XMLDesc(0))
            sysinfo = vmdom.getElementsByTagName("sysinfo")[0]
        except libvirt.libvirtError, e:
            if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
                self.log.error("domId: %s is dead", vm.UUIDString())
            else:
                raise
        except IndexError:
            pass  # no sysinfo in xml
        else:
            systype = sysinfo.getAttribute("type")
            if systype == "smbios":
                entries = sysinfo.getElementsByTagName("entry")
                for entry in entries:
                    if entry.getAttribute("name") == "product":
                        prod = entry.firstChild.data
                        if prod in (caps.OSName.RHEL, caps.OSName.OVIRT,
                                    caps.OSName.RHEVH, caps.OSName.FEDORA,
                                    caps.OSName.DEBIAN):
                            return True
        return False

This is probably due to a lack of response from the qemu side because of the underlying NFS issue. Haim, can you please provide the qemu and libvirt versions (+ libvirt logs)?

Thanks Haim! This thread stack trace is very helpful! We should forward this issue to the libvirt guys. It is very clear that the hang happens because libvirt does not return from the XMLDesc request for 5 minutes, and there is one such request for each of the 8 VMs. In the libvirt log we see the same activity from 12:41:57 (exactly the time of the vdsm reset) until 13:32:37 (the same time at which vdsm starts responding to getCaps requests again); this pattern repeats 8 times.

Request received here:

    2012-11-28 13:17:37.838+0000: 14954: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49a8010bb0 refs=3
    2012-11-28 13:17:37.846+0000: 14954: debug : qemuMonitorIOProcess:327 : QEMU_MONITOR_IO_PROCESS: mon=0x7f49a8010bb0 buf={"timestamp": {"seconds": 1354108657, "microseconds": 837141}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-ide0-0-0", "__com.redhat_debug_info": {"message": "Input/output error", "errno": 5}, "__com.redhat_reason": "eio", "operation": "write", "action": "stop"}}^M len=273
    2012-11-28 13:17:37.846+0000: 14954: debug : qemuMonitorEmitIOError:954 : mon=0x7f49a8010bb0
    2012-11-28 13:17:37.846+0000: 14954: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49a8010bb0 refs=4
    2012-11-28 13:17:37.846+0000: 14954: debug : qemuProcessHandleIOError:854 : Transitioned guest NFS1-4 to paused state due to IO error
    2012-11-28 13:17:37.846+0000: 14954: debug : qemuProcessHandleIOError:864 : Preserving lock state '(null)'
    2012-11-28 13:17:37.885+0000: 14954: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49a8010bb0 refs=3
    2012-11-28 13:17:37.885+0000: 14954: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49a8010bb0 refs=2
    2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49a8010bb0 refs=3
    2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorIOProcess:327 : QEMU_MONITOR_IO_PROCESS: mon=0x7f49a8010bb0 buf={"timestamp": {"seconds": 1354108657, "microseconds": 848889}, "event": "STOP"}^M len=81
    2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorEmitStop:921 : mon=0x7f49a8010bb0
    2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49a8010bb0 refs=4
    2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49a8010bb0 refs=3
    2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49a8010bb0 refs=2

Here we don't continue for 5 minutes - until the next request:

    2012-11-28 13:22:37.837+0000: 14954: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49c000d750 refs=4
    ...

And we do one such request per VM: 5 min * 8 VMs - exactly the time it took for vdsm to resume.

libvirt-0.9.10-21.el6_3.6.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.5.x86_64

haim/dafna - per comment [1], can you please check if this reproduces if the VMs don't have the balloon enabled?
[1] https://bugzilla.redhat.com/show_bug.cgi?id=881732#c2

(In reply to comment #15)
> haim/dafna - per comment [1] can you please check if this reproduces if the VMs don't have balloon enabled?
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=881732#c2

How do we remove the balloon configuration - from the XML, or using the engine?

(In reply to comment #16)
> how do we remove the balloon configuration from the xml ? or using engine ?

It is a VM option, not exposed in the UI but available via the REST API.

I don't think it will actually help much. We can (likely) get around the recovery phase with no balloon device, but then we anyway start all the statistics threads polling the disks, and those will get stuck as well. So do we gain anything in the end?

The statistics threads don't affect the recovery process. The effect of this bug is that, without getting information about the VMs' status, vdsm shows them as non-responsive although they are all in the paused state. In this scenario, vdsm doesn't show the VMs' status at all until qemu returns the description XML.
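The balloon question above can be checked directly against libvirt. The following is a minimal sketch (not vdsm code, and not an attachment from this bug) that uses the same python-libvirt and minidom APIs as isVDSMVm to report whether each running domain defines a memory balloon device; the helper name has_balloon is ours. Note that the XMLDesc() call it makes is exactly the request that blocks in this bug, so while storage is unreachable the sketch will hang the same way.

```python
from xml.dom import minidom

import libvirt  # python-libvirt bindings, the same ones vdsm uses


def has_balloon(conn, dom_id):
    """Return True if the domain's XML defines a <memballoon> device
    whose model is not 'none'."""
    dom = conn.lookupByID(dom_id)
    xml = minidom.parseString(dom.XMLDesc(0))
    for node in xml.getElementsByTagName('memballoon'):
        if node.getAttribute('model') != 'none':
            return True
    return False


if __name__ == '__main__':
    conn = libvirt.openReadOnly('qemu:///system')  # read-only, like 'virsh -r'
    for dom_id in conn.listDomainsID():
        print('domain %s: balloon=%s' % (dom_id, has_balloon(conn, dom_id)))
```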
Created attachment 607053 [details]
logs

Description of problem:
After blocking the storage domain on the SPM, on NFS storage with running and writing VMs, vdsm gets "Timeout: Operation stuck on remote handler" at 10:48:42. After the vdsm restart we see a timeout again at 10:49:57, and at 10:54:32 we are still seeing that getCapabilities returns "Recovering from crash or Initializing" (at which point vdsm loses connectivity with libvirt although libvirt is running).

Version-Release number of selected component (if applicable):
si15
vdsm-4.9.6-29.0.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. In a two-host cluster with an NFS storage domain, create and run VMs (make sure the VMs are writing).
2. Block connectivity to the storage domain from both hosts using iptables.

Actual results:
vdsm is stuck in recovery for more than 3 minutes.

Expected results:
vdsm should recover after 3 minutes.

Additional info: logs

    [root@gold-vdsd ~]# vdsClient -s 0 list table
    Recovering from crash or Initializing
    [root@gold-vdsd ~]# date
    Sun Aug 26 10:57:14 IDT 2012
    [root@gold-vdsd ~]# virsh -r list
     Id    Name       State
    ----------------------------------------------------
     45    XP-1       running
     46    XP-10      running
     47    XP-2       running
     48    XP-4       running
     49    XP-5       running
     50    XP-6       running
     51    XP-7       running
     52    XP-8       running
     53    XP-9       running
     55    XP-3       running
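To quantify the "stuck in recovery" window from the steps above, one can poll vdsm the same way the shell session does. This is a small sketch, assuming the vdsClient getVdsCaps verb used elsewhere in this bug is available on the host; the helper and its output format are ours, not part of the attached logs.

```python
import subprocess
import time


def is_recovering():
    """Return True while vdsm still answers requests with the recovery message."""
    proc = subprocess.Popen(
        ['vdsClient', '-s', '0', 'getVdsCaps'],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    )
    out = proc.communicate()[0]
    return b'Recovering from crash or Initializing' in out


if __name__ == '__main__':
    start = time.time()
    while is_recovering():
        time.sleep(10)  # poll every 10 seconds, like repeated getVdsCaps calls
    print('vdsm left recovery after %.0f seconds' % (time.time() - start))
```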