Bug 875486 - 3.2 - SuperVdsm is not functional after proccesing an exception resulted in delete network operation.
Summary: 3.2 - SuperVdsm is not functional after proccesing an exception resulted in d...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
urgent
high
Target Milestone: ---
: 3.2.0
Assignee: Yaniv Bronhaim
QA Contact: Meni Yakove
URL:
Whiteboard: infra
Depends On:
Blocks: 884722
TreeView+ depends on / blocked
 
Reported: 2012-11-11 15:40 UTC by Meni Yakove
Modified: 2018-12-01 15:54 UTC (History)
12 users (show)

Fixed In Version: vdsm-4.10.2-1.0.el6
Doc Type: Bug Fix
Doc Text:
SuperVdsm (svdsm) was not functional after processing an exception which caused a network delete operation. Consequently, storage and network actions which required svdsm permissions failed. With this update, when svdsm recognizes that the vdsm pid does not exist, it sends a SIGKILL signal and terminates itself. The exception which causes the svdsm to malfunction does not occur.
Clone Of:
Environment:
Last Closed:
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
vdsm.log (10.17 MB, text/x-log)
2012-11-11 15:43 UTC, Meni Yakove
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 341473 0 None None None Never

Description Meni Yakove 2012-11-11 15:40:06 UTC
Description of problem:
This is the error that trigger supervdsm faulty stat:
MainProcess|Thread-205192::ERROR::2012-11-11 13:45:57,309::configNetwork::1373::setupNetworks::(setupNetworks) (23, "delNetwork: ['eth2', 'eth3', 'vnet1'] are not all nics enslaved to bond0")
Traceback (most recent call last):
  File "/usr/share/vdsm/configNetwork.py", line 1315, in setupNetworks
    implicitBonding=False)
  File "/usr/share/vdsm/configNetwork.py", line 1068, in delNetwork
    (nics, bonding))
ConfigNetworkError: (23, "delNetwork: ['eth2', 'eth3', 'vnet1'] are not all nics enslaved to bond0")
MainProcess|Thread-205192::ERROR::2012-11-11 13:45:57,318::supervdsmServer::68::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer.py", line 66, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/vdsm/supervdsmServer.py", line 118, in setupNetworks
    return configNetwork.setupNetworks(networks, bondings, **options)
  File "/usr/share/vdsm/configNetwork.py", line 1315, in setupNetworks
    implicitBonding=False)
  File "/usr/share/vdsm/configNetwork.py", line 1068, in delNetwork
    (nics, bonding))
ConfigNetworkError: (23, "delNetwork: ['eth2', 'eth3', 'vnet1'] are not all nics enslaved to bond0")
Thread-205192::ERROR::2012-11-11 13:45:57,319::API::1138::vds::(setupNetworks) delNetwork: ['eth2', 'eth3', 'vnet1'] are not all nics enslaved to bond0
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1136, in setupNetworks
    supervdsm.getProxy().setupNetworks(networks, bondings, options)
  File "/usr/share/vdsm/supervdsm.py", line 69, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 60, in <lambda>
    **kwargs)
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
ConfigNetworkError: (23, "delNetwork: ['eth2', 'eth3', 'vnet1'] are not all nics enslaved to bond0")

This is the error when vdsm tries to communicate with supervdsm:
Thread-19::ERROR::2012-11-11 17:13:56,921::task::853::TaskManager.Task::(_setError) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2435, in getStorageDomainsList
    sdCache.refreshStorage()
  File "/usr/share/vdsm/storage/misc.py", line 1063, in helper
    return sm(*args, **kwargs)
  File "/usr/share/vdsm/storage/misc.py", line 1048, in __call__
    self.__lastResult = self.__func(*args, **kwargs)
  File "/usr/share/vdsm/storage/sdc.py", line 82, in refreshStorage
    multipath.rescan()
  File "/usr/share/vdsm/storage/multipath.py", line 70, in rescan
    supervdsm.getProxy().forceIScsiScan()
  File "/usr/share/vdsm/supervdsm.py", line 69, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 59, in <lambda>
    getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
AttributeError: 'ProxyCaller' object has no attribute 'forceIScsiScan'
Thread-19::DEBUG::2012-11-11 17:13:56,922::task::872::TaskManager.Task::(_run) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::Task._run: dc1304b1-e013-4a52-b93d-4b649bc23ffe ('00000000-0000-0000-0000-000000000000', None, None, None) {} failed - stopping task
Thread-19::DEBUG::2012-11-11 17:13:56,922::task::1199::TaskManager.Task::(stop) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::stopping in state preparing (force False)
Thread-19::DEBUG::2012-11-11 17:13:56,923::task::978::TaskManager.Task::(_decref) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::ref 1 aborting True
Thread-19::INFO::2012-11-11 17:13:56,923::task::1157::TaskManager.Task::(prepare) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::aborting: Task is aborted: u"'ProxyCaller' object has no attribute 'forceIScsiScan'" - code 100
Thread-19::DEBUG::2012-11-11 17:13:56,923::task::1162::TaskManager.Task::(prepare) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::Prepare: aborted: 'ProxyCaller' object has no attribute 'forceIScsiScan'
Thread-19::DEBUG::2012-11-11 17:13:56,923::task::978::TaskManager.Task::(_decref) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::ref 0 aborting True
Thread-19::DEBUG::2012-11-11 17:13:56,924::task::913::TaskManager.Task::(_doAbort) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::Task._doAbort: force False
Thread-19::DEBUG::2012-11-11 17:13:56,924::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-19::DEBUG::2012-11-11 17:13:56,924::task::588::TaskManager.Task::(_updateState) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::moving from state preparing -> state aborting
Thread-19::DEBUG::2012-11-11 17:13:56,924::task::537::TaskManager.Task::(__state_aborting) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::_aborting: recover policy none
Thread-19::DEBUG::2012-11-11 17:13:56,925::task::588::TaskManager.Task::(_updateState) Task=`dc1304b1-e013-4a52-b93d-4b649bc23ffe`::moving from state aborting -> state failed
Thread-19::DEBUG::2012-11-11 17:13:56,925::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-19::DEBUG::2012-11-11 17:13:56,925::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-19::ERROR::2012-11-11 17:13:56,925::dispatcher::69::Storage.Dispatcher.Protect::(run) 'ProxyCaller' object has no attribute 'forceIScsiScan'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1164, in prepare
    raise self.error
AttributeError: 'ProxyCaller' object has no attribute 'forceIScsiScan'



Current state and implications:

Each action that requires sudo (supervdsm) permissions fails. (all storage/network actions).

How to reproduce:

make network operation trow exaction.

mitigation:
kill supervdsm service.(restart vdsm doesn't help!!!)

Comment 2 Meni Yakove 2012-11-11 15:43:55 UTC
Created attachment 642901 [details]
vdsm.log

Comment 3 Yaniv Bronhaim 2012-11-11 17:19:22 UTC
We missed the panic call, it is between the delNetwork error to supervdsm attributeError:

Thread-208741::ERROR::2012-11-11 15:00:04,171::sdc::150::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `46e8e180-16e8-4881-b867-f391d35d5cd3`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 145, in _findDomain
    return mod.findDomain(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1095, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 1069, in findDomainPath
    if _isSD(vg):
  File "/usr/share/vdsm/storage/blockSD.py", line 1092, in _isSD
    return STORAGE_DOMAIN_TAG in vg.tags
  File "/usr/share/vdsm/storage/lvm.py", line 68, in __getattr__
    raise AttributeError("Failed reload: %s" % self.name)
AttributeError: Failed reload: 46e8e180-16e8-4881-b867-f391d35d5cd3
Thread-208741::ERROR::2012-11-11 15:00:04,175::misc::184::Storage.Misc::(panic) Panic: Unrecoverable errors during SPM stop process.
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 397, in stopSpm
    self.masterDomain.releaseClusterLock()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('46e8e180-16e8-4881-b867-f391d35d5cd3',)
MainThread::INFO::2012-11-11 15:00:04,247::vdsm::70::vds::(run) I am the actual vdsm 4.9-41.0

This shows we called misc.panic and somehow old svdsm pid was still exist.
supervdsm should die and delete its pid in such case, and it didn't.

We still need reproduction for that to be sure.

Comment 4 Yaniv Bronhaim 2012-11-12 16:37:12 UTC
patch upstream: http://gerrit.ovirt.org/#/c/9196/

Verifies that svdsm will kill itself when vdsm dies.

Comment 10 Meni Yakove 2012-12-26 08:40:19 UTC
Verified on vdsm-4.10.2-1.0.el6.x86_64

Comment 12 Itamar Heim 2013-06-11 08:49:56 UTC
3.2 has been released

Comment 13 Itamar Heim 2013-06-11 08:50:03 UTC
3.2 has been released

Comment 14 Itamar Heim 2013-06-11 08:50:03 UTC
3.2 has been released

Comment 15 Itamar Heim 2013-06-11 08:54:26 UTC
3.2 has been released

Comment 16 Itamar Heim 2013-06-11 09:24:55 UTC
3.2 has been released


Note You need to log in before you can comment on or make changes to this bug.