Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 923227

Summary: [engine-backend] VM stays in up state after VDSM crashed and failed to initialize storage
Product: Red Hat Enterprise Virtualization Manager
Reporter: Elad <ebenahar>
Component: ovirt-engine
Assignee: Michal Skrivanek <michal.skrivanek>
Status: CLOSED WORKSFORME
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.2.0
CC: acathrow, dyasny, hateya, iheim, lpeer, mbetak, Rhev-m-bugs, yeylon, ykaul
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard: virt
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-27 13:27:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: vdsm+rhevm logs (flags: none)

Description Elad 2013-03-19 13:26:15 UTC
Created attachment 712652 [details]
vdsm+rhevm logs

Description of problem:

The VM's status is reported as Up even though its host crashed.
This happened after I blocked the connection between VDSM and the storage domain.


Version-Release number of selected component (if applicable):

rhevm-backend-3.2.0-10.14.beta1.el6ev.noarch
vdsm-4.10.2-11.0.el6ev.x86_64
libvirt-0.10.2-18.el6.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Have one host and one iSCSI storage domain
2. Run a VM with one or more disks
3. Block the connection between the host and the domain using iptables
4. VDSM tries to initialize the connection to the domain and fails; the host then becomes Non Operational
  
Actual results:
The engine reports the live VM as Up although it is down.

Expected results:
The engine should report the VM's status as Unknown.

Additional info:

on VDSM:

Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 395, in _recoverExistingVms
    not self.irs.getConnectedStoragePoolsList()['poollist']:
AttributeError: 'NoneType' object has no attribute 'getConnectedStoragePoolsList'
VM Channels Listener::DEBUG::2013-03-19 14:23:06,655::vmChannels::60::vds::(_handle_timeouts) Timeout on fileno 17.
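The failure mode in the traceback above can be reduced to a minimal sketch (class and method names are simplified stand-ins, not VDSM's actual code): when storage initialization fails, the IRS proxy attribute stays `None`, so any attribute access on it raises `AttributeError`. A `None` guard would turn the crash into a clean "not ready" state:

```python
class ClientIF:
    """Sketch of the failure in clientIF._recoverExistingVms: if storage
    initialization failed, self.irs is None, and calling a method on it
    raises the AttributeError seen in the log above."""

    def __init__(self, irs=None):
        self.irs = irs  # stays None when storage failed to initialize

    def recover_existing_vms(self):
        # Guarding against an uninitialized IRS avoids the AttributeError;
        # the real fix would defer recovery or report the host as not ready.
        if self.irs is None:
            return "storage-not-ready"
        return self.irs.getConnectedStoragePoolsList()["poollist"]


print(ClientIF().recover_existing_vms())  # -> storage-not-ready
```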

on RHEVM:

2013-03-19 15:02:23,533 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-96) [79788000] Command GetCapabilitiesVDS execution failed. Error: VDSRecoveringException: Failed to initialize storage



See the attached VDSM and RHEV-M logs for additional details.

Comment 1 Martin Betak 2013-03-27 11:39:56 UTC
Hi, could you please specify at what moment you blocked the connection to storage? I tried to reproduce it several times, blocking the connection either during system boot or after the system had started (up to the login screen). In both cases the machine appeared in the engine as paused due to storage problems (after a while, because of the engine refresh delay; until then it was "UP"). The machine even resumed and continued working properly after restarting iptables and vdsmd (and the consequent reactivation of the storage domain in the engine).

My VDSM failed with

Thread-22::ERROR::2013-03-27 12:30:28,724::domainMonitor::223::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain a7e5f59c-2877-475b-8afc-f760ba63defb monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 200, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 108, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 481, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 275, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 180, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 146, in _recvAll
    raise Timeout()
Timeout
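The `_recvAll` timeout in that traceback follows a common pattern: read a fixed-length message header, and raise a custom `Timeout` if the peer stays silent past the deadline. A self-contained sketch of that pattern (an assumption based on the traceback, not VDSM's actual implementation):

```python
import select
import socket


class Timeout(Exception):
    """Mirrors the Timeout raised in remoteFileHandler._recvAll."""


def recv_all(sock, length, timeout):
    """Read exactly `length` bytes from `sock`, raising Timeout if no
    data arrives within `timeout` seconds (sketch, not vdsm's code)."""
    data = b""
    while len(data) < length:
        ready, _, _ = select.select([sock], [], [], timeout)
        if not ready:
            raise Timeout()
        chunk = sock.recv(length - len(data))
        if not chunk:
            raise Timeout()  # peer closed mid-message
        data += chunk
    return data


a, b = socket.socketpair()
a.sendall(b"1234")
print(recv_all(b, 4, timeout=1))  # the 4-byte length header arrives
try:
    recv_all(b, 4, timeout=0.1)   # nothing more to read: deadline expires
except Timeout:
    print("timed out")
```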


and shortly after that

MainThread::ERROR::2013-03-27 12:31:25,399::misc::173::Storage.Misc::(panic) Panic: Couldn't connect to supervdsm
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsm.py", line 195, in launch
    utils.retry(self._connect, Exception, timeout=60, tries=3)
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 934, in retry
    return func()
  File "/usr/share/vdsm/supervdsm.py", line 181, in _connect
    self._manager.connect()
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 474, in connect
    conn = Client(self._address, authkey=self._authkey)
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 143, in Client
    c = SocketClient(address)
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 263, in SocketClient
    s.connect(address)
  File "<string>", line 1, in connect
error: [Errno 2] No such file or directory
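The `utils.retry(self._connect, Exception, timeout=60, tries=3)` call in that traceback suggests a simple retry helper. A sketch in the spirit of that call (signature inferred from the traceback; not the real `vdsm.utils.retry`):

```python
import time


def retry(func, expected_exc=Exception, timeout=60, tries=3, sleep=1):
    """Call func up to `tries` times within `timeout` seconds, re-raising
    the last exception once attempts or the deadline are exhausted
    (hypothetical sketch, not vdsm's implementation)."""
    deadline = time.time() + timeout
    for attempt in range(1, tries + 1):
        try:
            return func()
        except expected_exc:
            if attempt == tries or time.time() >= deadline:
                raise
            time.sleep(sleep)


attempts = []

def flaky_connect():
    # Fails twice (socket not there yet), then succeeds on the third try.
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("No such file or directory")
    return "connected"


print(retry(flaky_connect, OSError, timeout=10, tries=5, sleep=0))  # -> connected
```

When the socket never appears, as in the panic above, all tries fail and the final exception propagates to the caller.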

Could you also specify exactly how you blocked the connection to storage? My rule was:

# iptables -A OUTPUT -d 10.34.63.204 -j REJECT

Thank you

Comment 2 Elad 2013-03-27 13:27:31 UTC
Hi Martin, 

I also didn't manage to reproduce it. I'm closing the bug for now.