Bug 851837

Summary: vdsm: vdsm is stuck in recovery for almost an hour on NFS storage with running vm's when blocking storage from host
Product: Red Hat Enterprise Linux 6
Reporter: Dafna Ron <dron>
Component: vdsm
Assignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED DUPLICATE
QA Contact: Haim <hateya>
Severity: urgent
Docs Contact:
Priority: high
Version: 6.5
CC: abaron, bazulay, iheim, lpeer, michal.skrivanek, michen, pstehlik, yeylon, ykaul
Target Milestone: rc
Keywords: Regression, Reopened
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 881732 (view as bug list)
Environment:
Last Closed: 2012-12-12 09:12:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 881732, 884650
Bug Blocks: 707622
Attachments:
Description      Flags
logs             none
vdsm logs        none

Description Dafna Ron 2012-08-26 09:19:15 UTC
Created attachment 607053 [details]
logs

Description of problem:

After blocking the storage domain on the SPM (NFS storage with running, writing VMs), vdsm gets "Timeout: Operation stuck on remote handler" at 10:48:42. After the vdsm restart we see a timeout again at 10:49:57, and at 10:54:32 getCapabilities still returns "Recovering from crash or Initializing" (at which point vdsm loses connectivity with libvirt although libvirt is running).

Version-Release number of selected component (if applicable):

si15
vdsm-4.9.6-29.0.el6_3.x86_64

How reproducible:

100%

Steps to Reproduce:
1. In a two-host cluster with an NFS storage domain, create and run VMs (make sure the VMs are writing).
2. Block connectivity to the storage domain from both hosts using iptables.
3.
  
Actual results:

vdsm is stuck in recovery for more than 3 minutes

Expected results:

vdsm should recover within 3 minutes.

Additional info: logs


[root@gold-vdsd ~]# vdsClient -s 0 list table
Recovering from crash or Initializing
[root@gold-vdsd ~]# date
Sun Aug 26 10:57:14 IDT 2012
[root@gold-vdsd ~]# virsh -r list 
 Id    Name                           State
----------------------------------------------------
 45    XP-1                           running
 46    XP-10                          running
 47    XP-2                           running
 48    XP-4                           running
 49    XP-5                           running
 50    XP-6                           running
 51    XP-7                           running
 52    XP-8                           running
 53    XP-9                           running
 55    XP-3                           running

Comment 2 Barak 2012-08-27 06:32:51 UTC
On which of the hosts did this happen? Obviously on gold-vdsd, but which one was it (the SPM? and which log file?)

Comment 3 Dafna Ron 2012-08-27 06:48:35 UTC
(In reply to comment #2)
> On which of the hosts did this happen ? obviously on gold-vdsd, but which
> one was it (spm? & which log file?)

gold-vdsd is the spm - it's vdsm.log.1.xz.spm

Comment 4 Ayal Baron 2012-10-15 23:12:16 UTC
In the log you can see that HSM initialization finished very quickly after vdsm startup and that the first VM recovery starts 6 minutes later, which means the hang is somewhere in between.
From looking at the code, my guess is that getVDSMVms, which calls self._libvirt.listDomainsID(), is hung waiting for libvirt (a known issue which has already been resolved, afaik).
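
For illustration, a minimal sketch of the call chain described above; the names (_recoverExistingVms, getVDSMVms, listDomainsID) come from the vdsm clientIF code referenced in this bug, but the body here is a simplified stand-in, not vdsm's actual implementation. Any synchronous libvirt call on this path blocks the recovery thread, and vdsm keeps answering "Recovering from crash or Initializing" until that thread finishes:

"""
# Hypothetical sketch, not vdsm's actual code: why a blocking libvirt call
# during recovery keeps vdsm in 'Recovering from crash or Initializing'.
import threading
import libvirt

def recover_existing_vms(conn):
    # listDomainsID() and XMLDesc() are synchronous libvirt calls; if the
    # qemu monitor does not answer, each of them can block for minutes.
    for dom_id in conn.listDomainsID():
        dom = conn.lookupByID(dom_id)
        dom.XMLDesc(0)  # blocks until qemu responds; the XML is then parsed
        # ... decide whether the domain was created by vdsm ...

def start_recovery():
    conn = libvirt.openReadOnly('qemu:///system')
    t = threading.Thread(target=recover_existing_vms, args=(conn,))
    t.start()
    # until this thread finishes, getCapabilities keeps returning
    # 'Recovering from crash or Initializing' (code 99)
    return t
"""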

Comment 5 Dafna Ron 2012-10-22 09:50:37 UTC
After upgrading vdsm and libvirt, it no longer reproduces.

Comment 6 Barak 2012-10-22 10:17:20 UTC
*** Bug 851832 has been marked as a duplicate of this bug. ***

Comment 7 Barak 2012-10-22 10:19:52 UTC
per comment #5 moving this BZ to CLOSED WORKSFORME

Comment 8 Haim 2012-11-28 16:54:14 UTC
Moving to re-open per Dafna's test, which proved to be a consistent reproducer.

repro steps:

1) have 3 hosts in the data-center connected to 1 NFS data-domain
2) have VMs running on each host
3) block the connection to the master domain from the SPM and from another, randomly chosen, host

current results:

host fails to initialize for at least 40 minutes while producing the following message:

Thread-391::DEBUG::2012-11-28 17:46:30,479::BindingXMLRPC::894::vds::(wrapper) client [10.35.97.65]::call getCapabilities with () {} flowID [3f72c997]
Thread-391::DEBUG::2012-11-28 17:46:30,480::BindingXMLRPC::900::vds::(wrapper) return getCapabilities with {'status': {'message': 'Recovering from crash or Initializing', 'code': 99}}

we attached a debugger (gdb) to the process and caught the following output:

Thread 2 (Thread 0x7f6b157fb700 (LWP 14051)):
#10 <built-in function virDomainGetXMLDesc>
#12 file '/usr/lib64/python2.6/site-packages/libvirt.py', in 'XMLDesc'
#15 file '/usr/share/vdsm/clientIF.py', in 'isVDSMVm'
#18 file '/usr/share/vdsm/clientIF.py', in 'getVDSMVms'
#21 file '/usr/share/vdsm/clientIF.py', in '_recoverExistingVms'
#26 file '/usr/lib64/python2.6/threading.py', in 'run'
#29 file '/usr/lib64/python2.6/threading.py', in '__bootstrap_inner'
#32 file '/usr/lib64/python2.6/threading.py', in '__bootstrap'

This shows us that vdsm is trying to recover existing VMs; apparently, during that process, it tries to prepare their disks, and since storage is blocked, the operation fails.

Impact: while vdsm fails to answer getVdsCaps, the engine moves the host to non-responsive and might send a fencing command, which will kill the running VMs.
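
For reference, a small polling helper (written for this report, not part of vdsm) that measures how long the host stays in this state; it relies on vdsClient printing "Recovering from crash or Initializing", as shown in the description, until recovery ends:

"""
# Measurement helper written for this report (not part of vdsm): poll the
# host with vdsClient until it stops reporting the recovery state, then
# print how long the recovery took.
import subprocess
import time

RECOVERY_MSG = 'Recovering from crash or Initializing'

def wait_for_recovery(poll_interval=10):
    start = time.time()
    while True:
        p = subprocess.Popen(['vdsClient', '-s', '0', 'list', 'table'],
                             stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT)
        out = p.communicate()[0]
        if RECOVERY_MSG not in out:
            return time.time() - start
        time.sleep(poll_interval)

if __name__ == '__main__':
    print('recovery took %.0f seconds' % wait_for_recovery())
"""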

Comment 9 Haim 2012-11-28 16:57:57 UTC
Created attachment 653616 [details]
vdsm logs

Comment 10 Barak 2012-11-29 11:04:33 UTC
This issue seems to happen because VDSM is unable to recover all the running VMs through libvirt; the reason may be that libvirt is blocking on a running qemu process.

Haim, does libvirt respond to virsh -r capabilities, and to other domain-related commands?

Comment 11 Barak 2012-11-29 11:27:12 UTC
Reviewing the stack trace again (comment #8), it looks like VDSM is stuck in recovery because it blocks on libvirt in the function isVDSMVm, on the call to vm.XMLDesc (see the function below).

"""
 def isVDSMVm(self, vm):
        """
        Return True if vm seems as if it was created by vdsm.
        """
        try:
            vmdom = minidom.parseString(vm.XMLDesc(0))
            sysinfo = vmdom.getElementsByTagName("sysinfo")[0]
        except libvirt.libvirtError, e:
            if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
                self.log.error("domId: %s is dead", vm.UUIDString())
            else:
                raise
        except IndexError:
            pass #no sysinfo in xml
        else:
            systype = sysinfo.getAttribute("type")
            if systype == "smbios":
                entries = sysinfo.getElementsByTagName("entry")
                for entry in entries:
                    if entry.getAttribute("name") == "product":
                        prod = entry.firstChild.data
                        if prod in (caps.OSName.RHEL, caps.OSName.OVIRT,
                                caps.OSName.RHEVH, caps.OSName.FEDORA,
                                caps.OSName.DEBIAN):
                            return True
        return False
"""

This is probably due to a lack of response from the qemu side, caused by the underlying NFS issue.
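
For context only, a generic way to bound such a blocking call (this is not what vdsm does in the version under test, just an illustrative sketch): run the XMLDesc call in a worker thread and join with a timeout, so a single stuck qemu monitor cannot stall the whole recovery pass.

"""
# Illustrative only, not vdsm's code: bound a blocking libvirt call by
# running it in a worker thread and joining with a timeout.
import threading

def xml_desc_with_timeout(vm, timeout=30):
    result = {}

    def worker():
        try:
            result['xml'] = vm.XMLDesc(0)  # may block while qemu is stuck
        except Exception, e:               # Python 2 syntax, as in vdsm
            result['error'] = e

    t = threading.Thread(target=worker)
    t.daemon = True        # don't let a stuck call keep the process alive
    t.start()
    t.join(timeout)
    if t.isAlive():
        return None        # caller treats the VM as unresponsive for now
    if 'error' in result:
        raise result['error']
    return result['xml']
"""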

Comment 12 Barak 2012-11-29 12:28:59 UTC
Haim can you please provide:

qemu & libvirt versions (+ libvirt logs)

Comment 13 Yaniv Bronhaim 2012-11-29 12:43:26 UTC
Thanks Haim! This thread stack trace is very helpful!

We should forward this issue to the libvirt guys.

It is very clear that the hang happens because libvirt does not return from the XMLDesc request for 5 minutes, and we have 8 such requests, one per VM.

In the libvirt log we see the same activity from 12:41:57 (exactly the time of the vdsm restart) until 13:32:37 (the same time at which vdsm starts responding to getCaps requests again).

This pattern repeats 8 times:
Request received here:
2012-11-28 13:17:37.838+0000: 14954: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49a8010bb0 refs=3
2012-11-28 13:17:37.846+0000: 14954: debug : qemuMonitorIOProcess:327 : QEMU_MONITOR_IO_PROCESS: mon=0x7f49a8010bb0 buf={"timestamp": {"seconds": 1354108657, "microseconds": 837141}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-ide0-0-0", "__com.redhat_debug_info": {"message": "Input/output error", "errno": 5}, "__com.redhat_reason": "eio", "operation": "write", "action": "stop"}}^M
 len=273                                                                        
2012-11-28 13:17:37.846+0000: 14954: debug : qemuMonitorEmitIOError:954 : mon=0x7f49a8010bb0
2012-11-28 13:17:37.846+0000: 14954: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49a8010bb0 refs=4
2012-11-28 13:17:37.846+0000: 14954: debug : qemuProcessHandleIOError:854 : Transitioned guest NFS1-4 to paused state due to IO error
2012-11-28 13:17:37.846+0000: 14954: debug : qemuProcessHandleIOError:864 : Preserving lock state '(null)'
2012-11-28 13:17:37.885+0000: 14954: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49a8010bb0 refs=3
2012-11-28 13:17:37.885+0000: 14954: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49a8010bb0 refs=2
2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49a8010bb0 refs=3
2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorIOProcess:327 : QEMU_MONITOR_IO_PROCESS: mon=0x7f49a8010bb0 buf={"timestamp": {"seconds": 1354108657, "microseconds": 848889}, "event": "STOP"}^M
 len=81                                                                         
2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorEmitStop:921 : mon=0x7f49a8010bb0
2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49a8010bb0 refs=4
2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49a8010bb0 refs=3
2012-11-28 13:17:37.886+0000: 14954: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f49a8010bb0 refs=2

Here nothing happens for 5 minutes, until the next request:
2012-11-28 13:22:37.837+0000: 14954: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f49c000d750 refs=4
...

And we make such a request for each VM: 5 min * 8 VMs, exactly the time it took for vdsm to resume.
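
To make the timing concrete, a small helper written for this analysis (not part of vdsm or libvirt) can parse the libvirtd debug timestamps quoted above and report the gaps between consecutive QEMU_MONITOR_REF entries; gaps of roughly 300 seconds per VM, for 8 VMs, account for the long window in which getCaps kept returning code 99:

"""
# Helper written for this analysis (not part of vdsm or libvirt): parse the
# timestamps of libvirtd debug lines and report large gaps between
# consecutive QEMU_MONITOR_REF entries.
from datetime import datetime

def monitor_ref_gaps(log_path):
    fmt = '%Y-%m-%d %H:%M:%S'
    prev = None
    for line in open(log_path):
        if 'QEMU_MONITOR_REF' not in line:
            continue
        # e.g. "2012-11-28 13:17:37.838+0000: 14954: debug : ..."
        cur = datetime.strptime(line.split('.')[0], fmt)
        if prev is not None:
            yield (cur - prev).seconds
        prev = cur

if __name__ == '__main__':
    for gap in monitor_ref_gaps('/var/log/libvirt/libvirtd.log'):
        if gap > 60:
            print('gap of %d seconds between monitor requests' % gap)
"""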

Comment 14 Dafna Ron 2012-11-29 13:35:31 UTC
libvirt-0.9.10-21.el6_3.6.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.5.x86_64

Comment 15 Itamar Heim 2012-11-30 02:33:57 UTC
haim/dafna - per comment [1] can you please check if this reproduces if the VMs don't have balloon enabled?
[1] https://bugzilla.redhat.com/show_bug.cgi?id=881732#c2

Comment 16 Haim 2012-12-02 07:35:22 UTC
(In reply to comment #15)
> haim/dafna - per comment [1] can you please check if this reproduces if the
> VMs don't have balloon enabled?
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=881732#c2

How do we remove the balloon configuration from the XML? Or is it done through the engine?

Comment 17 Itamar Heim 2012-12-02 07:47:21 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > haim/dafna - per comment [1] can you please check if this reproduces if the
> > VMs don't have balloon enabled?
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=881732#c2
> 
> how do we remove the balloon configuration from the xml ? or using engine ?

It is a VM option, not exposed in the UI but available via the REST API.
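
To verify on the host side whether a VM actually runs without a balloon device, a read-only libvirt check can be used (a helper written for this bug, not part of vdsm; the <memballoon> element is standard libvirt domain XML, with model='none' meaning no balloon device). It only reports what libvirt sees and does not change the VM configuration:

"""
# Read-only check written for this bug (not part of vdsm): report whether
# each running domain carries a memory balloon device in its libvirt XML
# (<memballoon model='...'/> under <devices>).
# Meant for a healthy host; with storage blocked, XMLDesc itself may hang.
from xml.dom import minidom
import libvirt

def has_balloon(dom):
    xml = minidom.parseString(dom.XMLDesc(0))
    balloons = xml.getElementsByTagName('memballoon')
    return bool(balloons) and balloons[0].getAttribute('model') != 'none'

if __name__ == '__main__':
    conn = libvirt.openReadOnly('qemu:///system')
    for dom_id in conn.listDomainsID():
        dom = conn.lookupByID(dom_id)
        print('%s: balloon=%s' % (dom.name(), has_balloon(dom)))
"""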

Comment 18 Michal Skrivanek 2012-12-04 08:18:17 UTC
I don't think it will actually help much. We can (likely) get past the recovery phase with no balloon device, but then we anyway start all the statistics threads polling the disks, and those will get stuck as well. So do we gain anything in the end?

Comment 19 Yaniv Bronhaim 2012-12-04 08:35:08 UTC
The statistics threads don't affect the recovery process.

The outcome of this bug is that without getting information about the VMs' status, vdsm shows them as non-responsive, although they are all in the paused state.
In this scenario, vdsm doesn't show the VMs' status at all until qemu returns the description XML.