Bug 990498

Summary: SuperVdsm not restart if deleting /var/run/vdsm/svdsm.sock file
Product: Red Hat Enterprise Virtualization Manager Reporter: vvyazmin <vvyazmin>
Component: vdsmAssignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED NOTABUG QA Contact: vvyazmin <vvyazmin>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.3.0CC: acathrow, bazulay, hateya, iheim, lpeer, pstehlik, yeylon
Target Milestone: ---Keywords: Triaged
Target Release: 3.3.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-08-01 13:25:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm none

Description vvyazmin@redhat.com 2013-07-31 10:07:57 UTC
Created attachment 781035 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Description of problem:
SuperVdsm not restart if deleting /var/run/vdsm/svdsm.sock file

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS7 environment:

RHEVM: rhevm-3.3.0-0.11.master.el6ev.noarch
VDSM: vdsm-4.12.0-rc3.12.git139ec2f.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64
PythonSDK: rhevm-sdk-python-3.3.0.7-1.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
Create iSCSI DC with 2 hosts
Validate that vmds and supervdsm daemon are up and running
Verify when you run command with supervdsm execution all work OK (in my case was forceScsiScan command)
Delete /var/run/vdsm/svdsm.sock file on one of host
Run again same command with supervdsm execution

Actual results:
Action failed.

Expected results:
Supervdsm daemon restart, and execution command succeeds.

Impact on user:
Failed run command which depend on supervdsm execution

Workaround:
Manually restart supervdsm daemon, logout and login from RHEVM, and run again same command with supervdsm execution

Additional info:

UI:
Could not retrieve LUNs, please check your storage.

/var/log/ovirt-engine/engine.log
2013-07-31 12:29:21,548 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-9) START, GetDeviceListVDSCommand(HostName = tigris03.scl.lab.tlv.redhat.com, HostId = c9ce4f60-4d58-453d-9588-e5a7b5beb6fd, storageType=ISCSI), log id: 71461cc0
2013-07-31 12:29:21,617 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-9) Failed in GetDeviceListVDS method
2013-07-31 12:29:21,618 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-9) Error code BlockDeviceActionError and error message VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()
2013-07-31 12:29:21,618 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-9) Command org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand return value 
 
LUNListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=600, mMessage=Error block device action: ()]]

2013-07-31 12:29:21,618 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-9) HostName = tigris03.scl.lab.tlv.redhat.com
2013-07-31 12:29:21,618 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-9) Command GetDeviceListVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: ()
2013-07-31 12:29:21,618 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-9) FINISH, GetDeviceListVDSCommand, log id: 71461cc0
2013-07-31 12:29:21,619 ERROR [org.ovirt.engine.core.bll.storage.GetDeviceListQuery] (ajp-/127.0.0.1:8702-9) Query GetDeviceListQuery failed. Exception message is VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetDeviceListVDS, error = Error block device action: () (Failed with VDSM error BlockDeviceActionError and code 600)
2013-07-31 12:29:23,597 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (ajp-/127.0.0.1:8702-8) START, GetDeviceListVDSCommand(HostName = tigris02.scl.lab.tlv.redhat.com, HostId = 3cca5914-a984-45e5-9e02-1e22f2073049, storageType=ISCSI), log id: 60c24d17


/var/log/vdsm/vdsm.log
Thread-129470::DEBUG::2013-07-31 10:27:02,432::BindingXMLRPC::177::vds::(wrapper) client [10.35.161.29]
Thread-129470::DEBUG::2013-07-31 10:27:02,433::task::579::TaskManager.Task::(_updateState) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::moving from state init -> state preparing
Thread-129470::INFO::2013-07-31 10:27:02,433::logUtils::44::dispatcher::(wrapper) Run and protect: getDeviceList(storageType=3, options={})
Thread-129470::DEBUG::2013-07-31 10:27:02,433::misc::807::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-129470::DEBUG::2013-07-31 10:27:02,434::misc::809::SamplingMethod::(__call__) Got in to sampling method
Thread-129470::DEBUG::2013-07-31 10:27:02,434::misc::807::SamplingMethod::(__call__) Trying to enter sampling method (storage.iscsi.rescan)
Thread-129470::DEBUG::2013-07-31 10:27:02,434::misc::809::SamplingMethod::(__call__) Got in to sampling method
Thread-129470::DEBUG::2013-07-31 10:27:02,435::iscsiadm::91::Storage.Misc.excCmd::(_runCmd) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
Thread-129470::DEBUG::2013-07-31 10:27:02,466::iscsiadm::91::Storage.Misc.excCmd::(_runCmd) SUCCESS: <err> = ''; <rc> = 0
Thread-129470::DEBUG::2013-07-31 10:27:02,467::misc::817::SamplingMethod::(__call__) Returning last result
Thread-129470::ERROR::2013-07-31 10:27:02,467::task::850::TaskManager.Task::(_setError) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1886, in getDeviceList
    devices = self._getDeviceList(storageType)
  File "/usr/share/vdsm/storage/hsm.py", line 1890, in _getDeviceList
    sdCache.refreshStorage()
  File "/usr/share/vdsm/storage/misc.py", line 826, in helper
    return sm(*args, **kwargs)
  File "/usr/share/vdsm/storage/misc.py", line 811, in __call__
    self.__lastResult = self.__func(*args, **kwargs)
  File "/usr/share/vdsm/storage/sdc.py", line 83, in refreshStorage
    multipath.rescan()
  File "/usr/share/vdsm/storage/multipath.py", line 108, in rescan
    supervdsm.getProxy().forceScsiScan()
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in forceScsiScan
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 722, in _callmethod
    self._connect()
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 709, in _connect
    conn = self._Client(self._token.address, authkey=self._authkey)
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 143, in Client
    c = SocketClient(address)
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 263, in SocketClient
    s.connect(address)
  File "<string>", line 1, in connect
error: [Errno 2] No such file or directory
Thread-129470::DEBUG::2013-07-31 10:27:02,470::task::869::TaskManager.Task::(_run) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::Task._run: 73d4d3eb-8487-4979-ba50-171053f6086c (3,) {} failed - stopping task
Thread-129470::DEBUG::2013-07-31 10:27:02,470::task::1194::TaskManager.Task::(stop) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::stopping in state preparing (force False)
Thread-129470::DEBUG::2013-07-31 10:27:02,470::task::974::TaskManager.Task::(_decref) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::ref 1 aborting True
Thread-129470::INFO::2013-07-31 10:27:02,471::task::1151::TaskManager.Task::(prepare) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::aborting: Task is aborted: u'[Errno 2] No such file or directory' - code 100
Thread-129470::DEBUG::2013-07-31 10:27:02,471::task::1156::TaskManager.Task::(prepare) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::Prepare: aborted: [Errno 2] No such file or directory
Thread-129470::DEBUG::2013-07-31 10:27:02,471::task::974::TaskManager.Task::(_decref) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::ref 0 aborting True
Thread-129470::DEBUG::2013-07-31 10:27:02,472::task::909::TaskManager.Task::(_doAbort) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::Task._doAbort: force False
Thread-129470::DEBUG::2013-07-31 10:27:02,472::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-129470::DEBUG::2013-07-31 10:27:02,472::task::579::TaskManager.Task::(_updateState) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::moving from state preparing -> state aborting
Thread-129470::DEBUG::2013-07-31 10:27:02,472::task::534::TaskManager.Task::(__state_aborting) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::_aborting: recover policy none
Thread-129470::DEBUG::2013-07-31 10:27:02,473::task::579::TaskManager.Task::(_updateState) Task=`73d4d3eb-8487-4979-ba50-171053f6086c`::moving from state aborting -> state failed
Thread-129470::DEBUG::2013-07-31 10:27:02,473::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-129470::DEBUG::2013-07-31 10:27:02,473::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-129470::ERROR::2013-07-31 10:27:02,473::dispatcher::70::Storage.Dispatcher.Protect::(run) [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 62, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1159, in prepare
    raise self.error
error: [Errno 2] No such file or directory

Comment 1 Barak 2013-08-01 13:25:42 UTC
This is a problem in python multiprocessing library (it validates the existence of the socket each communication).

Usually in linux deletion of a LDS  does not effect the processes that holds the inode (like in crubRPC).

However this is not an interesting bug,
No one should delete manually the svdsm.sock file under /var/run

It is the same like deleting the vdsm py pyo and pyc files .... vdsm will not work.

Hence moving it to CLOSE NOTABUG