Bug 875486

Summary: 3.2 - SuperVdsm is not functional after proccesing an exception resulted in delete network operation.
Product: Red Hat Enterprise Virtualization Manager Reporter: Meni Yakove <myakove>
Component: vdsmAssignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED CURRENTRELEASE QA Contact: Meni Yakove <myakove>
Severity: high Docs Contact:
Priority: urgent    
Version: 3.2.0CC: bazulay, chetan, cpelland, dornelas, dyasny, hateya, iheim, lpeer, mavital, mburman, ybronhei, ykaul
Target Milestone: ---Keywords: ZStream
Target Release: 3.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: vdsm-4.10.2-1.0.el6 Doc Type: Bug Fix
Doc Text:
SuperVdsm (svdsm) was not functional after processing an exception which caused a network delete operation. Consequently, storage and network actions which required svdsm permissions failed. With this update, when svdsm recognizes that the vdsm pid does not exist, it sends a SIGKILL signal and terminates itself. The exception which causes the svdsm to malfunction does not occur.
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 884722    
Attachments:
Description Flags
vdsm.log none

Description Meni Yakove 2012-11-11 15:40:06 UTC
Description of problem:
This is the error that trigger supervdsm faulty stat:
MainProcess|Thread-205192::ERROR::2012-11-11 13:45:57,309::configNetwork::1373::setupNetworks::(setupNetworks) (23, "delNetwork: ['eth2', 'eth3', 'vnet1'] are not all nics enslaved to bond0")
Traceback (most recent call last):
  File "/usr/share/vdsm/configNetwork.py", line 1315, in setupNetworks
    implicitBonding=False)
  File "/usr/share/vdsm/configNetwork.py", line 1068, in delNetwork
    (nics, bonding))
ConfigNetworkError: (23, "delNetwork: ['eth2', 'eth3', 'vnet1'] are not all nics enslaved to bond0")
MainProcess|Thread-205192::ERROR::2012-11-11 13:45:57,318::supervdsmServer::68::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer.py", line 66, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/vdsm/supervdsmServer.py", line 118, in setupNetworks
    return configNetwork.setupNetworks(networks, bondings, **options)
  File "/usr/share/vdsm/configNetwork.py", line 1315, in setupNetworks
    implicitBonding=False)
  File "/usr/share/vdsm/configNetwork.py", line 1068, in delNetwork
    (nics, bonding))
ConfigNetworkError: (23, "delNetwork: ['eth2', 'eth3', 'vnet1'] are not all nics enslaved to bond0")
Thread-205192::ERROR::2012-11-11 13:45:57,319::API::1138::vds::(setupNetworks) delNetwork: ['eth2', 'eth3', 'vnet1'] are not all nics enslaved to bond0
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1136, in setupNetworks
    supervdsm.getProxy().setupNetworks(networks, bondings, options)
  File "/usr/share/vdsm/supervdsm.py", line 69, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 60, in <lambda>
    **kwargs)
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
ConfigNetworkError: (23, "delNetwork: ['eth2', 'eth3', 'vnet1'] are not all nics enslaved to bond0")

This is the error when vdsm tries to communicate with supervdsm:
Thread-19::ERROR::2012-11-11 17:13:56,921::task::853::TaskManager.Task::(_setError) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2435, in getStorageDomainsList
    sdCache.refreshStorage()
  File "/usr/share/vdsm/storage/misc.py", line 1063, in helper
    return sm(*args, **kwargs)
  File "/usr/share/vdsm/storage/misc.py", line 1048, in __call__
    self.__lastResult = self.__func(*args, **kwargs)
  File "/usr/share/vdsm/storage/sdc.py", line 82, in refreshStorage
    multipath.rescan()
  File "/usr/share/vdsm/storage/multipath.py", line 70, in rescan
    supervdsm.getProxy().forceIScsiScan()
  File "/usr/share/vdsm/supervdsm.py", line 69, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 59, in <lambda>
    getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
AttributeError: 'ProxyCaller' object has no attribute 'forceIScsiScan'
Thread-19::DEBUG::2012-11-11 17:13:56,922::task::872::TaskManager.Task::(_run) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::Task._run: dc1304b1-e013-4a52-b93d-4b649bc23ffe ('00000000-0000-0000-0000-000000000000', None, None, None) {} failed - stopping task
Thread-19::DEBUG::2012-11-11 17:13:56,922::task::1199::TaskManager.Task::(stop) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::stopping in state preparing (force False)
Thread-19::DEBUG::2012-11-11 17:13:56,923::task::978::TaskManager.Task::(_decref) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::ref 1 aborting True
Thread-19::INFO::2012-11-11 17:13:56,923::task::1157::TaskManager.Task::(prepare) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::aborting: Task is aborted: u"'ProxyCaller' object has no attribute 'forceIScsiScan'" - code 100
Thread-19::DEBUG::2012-11-11 17:13:56,923::task::1162::TaskManager.Task::(prepare) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::Prepare: aborted: 'ProxyCaller' object has no attribute 'forceIScsiScan'
Thread-19::DEBUG::2012-11-11 17:13:56,923::task::978::TaskManager.Task::(_decref) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::ref 0 aborting True
Thread-19::DEBUG::2012-11-11 17:13:56,924::task::913::TaskManager.Task::(_doAbort) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::Task._doAbort: force False
Thread-19::DEBUG::2012-11-11 17:13:56,924::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-19::DEBUG::2012-11-11 17:13:56,924::task::588::TaskManager.Task::(_updateState) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::moving from state preparing -> state aborting
Thread-19::DEBUG::2012-11-11 17:13:56,924::task::537::TaskManager.Task::(__state_aborting) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::_aborting: recover policy none
Thread-19::DEBUG::2012-11-11 17:13:56,925::task::588::TaskManager.Task::(_updateState) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::moving from state aborting -> state failed
Thread-19::DEBUG::2012-11-11 17:13:56,925::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-19::DEBUG::2012-11-11 17:13:56,925::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-19::ERROR::2012-11-11 17:13:56,925::dispatcher::69::Storage.Dispatcher.Protect::(run) 'ProxyCaller' object has no attribute 'forceIScsiScan'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1164, in prepare
    raise self.error
AttributeError: 'ProxyCaller' object has no attribute 'forceIScsiScan'



Current state and implications:

Each action that requires sudo (supervdsm) permissions fails. (all storage/network actions).

How to reproduce:

make network operation trow exaction.

mitigation:
kill supervdsm service.(restart vdsm doesn't help!!!)

Comment 2 Meni Yakove 2012-11-11 15:43:55 UTC
Created attachment 642901 [details]
vdsm.log

Comment 3 Yaniv Bronhaim 2012-11-11 17:19:22 UTC
We missed the panic call, it is between the delNetwork error to supervdsm attributeError:

Thread-208741::ERROR::2012-11-11 15:00:04,171::sdc::150::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `46e8e180-16e8-4881-b867-f391d35d5cd3`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 145, in _findDomain
    return mod.findDomain(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1095, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 1069, in findDomainPath
    if _isSD(vg):
  File "/usr/share/vdsm/storage/blockSD.py", line 1092, in _isSD
    return STORAGE_DOMAIN_TAG in vg.tags
  File "/usr/share/vdsm/storage/lvm.py", line 68, in __getattr__
    raise AttributeError("Failed reload: %s" % self.name)
AttributeError: Failed reload: 46e8e180-16e8-4881-b867-f391d35d5cd3
Thread-208741::ERROR::2012-11-11 15:00:04,175::misc::184::Storage.Misc::(panic) Panic: Unrecoverable errors during SPM stop process.
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 397, in stopSpm
    self.masterDomain.releaseClusterLock()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('46e8e180-16e8-4881-b867-f391d35d5cd3',)
MainThread::INFO::2012-11-11 15:00:04,247::vdsm::70::vds::(run) I am the actual vdsm 4.9-41.0

This shows we called misc.panic and somehow old svdsm pid was still exist.
supervdsm should die and delete its pid in such case, and it didn't.

We still need reproduction for that to be sure.

Comment 4 Yaniv Bronhaim 2012-11-12 16:37:12 UTC
patch upstream: http://gerrit.ovirt.org/#/c/9196/

Verifies that svdsm will kill itself when vdsm dies.

Comment 10 Meni Yakove 2012-12-26 08:40:19 UTC
Verified on vdsm-4.10.2-1.0.el6.x86_64

Comment 12 Itamar Heim 2013-06-11 08:49:56 UTC
3.2 has been released

Comment 13 Itamar Heim 2013-06-11 08:50:03 UTC
3.2 has been released

Comment 14 Itamar Heim 2013-06-11 08:50:03 UTC
3.2 has been released

Comment 15 Itamar Heim 2013-06-11 08:54:26 UTC
3.2 has been released

Comment 16 Itamar Heim 2013-06-11 09:24:55 UTC
3.2 has been released