Bug 846323

Summary: vdsm: remove storage domain fails on format command with Failed reload error
Product: Red Hat Enterprise Linux 6 Reporter: Dafna Ron <dron>
Component: vdsmAssignee: Eduardo Warszawski <ewarszaw>
Status: CLOSED ERRATA QA Contact: Leonid Natapov <lnatapov>
Severity: high Docs Contact:
Priority: high    
Version: 6.3CC: abaron, bazulay, cpelland, ewarszaw, hateya, iheim, ilvovsky, jbiddle, lpeer, ykaul
Target Milestone: rcKeywords: Regression
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: vdsm-4.9.6-36.0 Doc Type: Bug Fix
Doc Text:
Previously, removing multiple domains with extended luns would occasionally result in a race which would cause one of the removals to fail. This has been corrected so that only one lvm operation is used at a time, preventing the race and allowing domains to be successfully removed.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-12-04 19:04:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs
none
second remove failed logs none

Description Dafna Ron 2012-08-07 13:15:28 UTC
Created attachment 602746 [details]
logs

Description of problem:

I removed two domains (both extended with 2-3 luns) and failed remove on one of the domains during format. 

error in vdsm shows that it failed with reload on one of the luns

seems like there might be a race since the lun we fail to reload is a lun that belonged to the second domain that I already removed (so we are trying to read from cache). 

Version-Release number of selected component (if applicable):

si13

How reproducible:

unknown

Steps to Reproduce:
1. create several storage domain (try to extend 2 also)
2. remove the domains from UI
3.
  
Actual results:

we fail to remove one of the domains on format with reload error. 

Expected results:

we should not fail on remove. 

Additional info: full vdsm and engine logs


UI error: 

Error while executing action Remove Storage Domain: Error in storage domain action

backend error: 

2012-08-07 15:17:28,383 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8009-15) [3beddaa7] Failed in FormatStorageDomainVDS method
2012-08-07 15:17:28,383 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8009-15) [3beddaa7] Error code StorageDomainActionError and error message VDSGenericException: VDSErrorException: Failed to Form
atStorageDomainVDS, error = Error in storage domain action: ('sdUUID=7a627b01-4deb-4e10-aa5a-d8bffadd8a70',)
2012-08-07 15:17:28,383 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8009-15) [3beddaa7] Command org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         350
mMessage                      Error in storage domain action: ('sdUUID=7a627b01-4deb-4e10-aa5a-d8bffadd8a70',)


2012-08-07 15:17:28,383 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8009-15) [3beddaa7] Vds: blond-vdsh
2012-08-07 15:17:28,383 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (ajp-/127.0.0.1:8009-15) [3beddaa7] Command FormatStorageDomainVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Fa
iled to FormatStorageDomainVDS, error = Error in storage domain action: ('sdUUID=7a627b01-4deb-4e10-aa5a-d8bffadd8a70',)
2012-08-07 15:17:28,383 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FormatStorageDomainVDSCommand] (ajp-/127.0.0.1:8009-15) [3beddaa7] FINISH, FormatStorageDomainVDSCommand, log id: a77b96f
2012-08-07 15:17:28,383 ERROR [org.ovirt.engine.core.bll.storage.RemoveStorageDomainCommand] (ajp-/127.0.0.1:8009-15) [3beddaa7] Command org.ovirt.engine.core.bll.storage.RemoveStorageDomainCommand throw Vdc Bll exception. With error me
ssage VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to FormatStorageDomainVDS, error = Error in storage domain action: ('sdUUID=7a627b01-4deb-4e10-aa5a-d8bff
add8a70',)



Thread-81104::ERROR::2012-08-07 15:16:04,169::task::853::TaskManager.Task::(_setError) Task=`d35dd9ad-5458-4f32-872a-a86c6449e133`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2226, in formatStorageDomain
    self._recycle(sd)
  File "/usr/share/vdsm/storage/hsm.py", line 2187, in _recycle
    dom.format(dom.sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 782, in format
    lvm.removeVG(sdUUID)
  File "/usr/share/vdsm/storage/lvm.py", line 825, in removeVG
    pvs = [pv.name for pv in _lvminfo.getAllPvs() if pv.vg_name == vgName]
  File "/usr/share/vdsm/storage/lvm.py", line 68, in __getattr__
    raise AttributeError("Failed reload: %s" % self.name)
AttributeError: Failed reload: /dev/mapper/1Dafna-si13-021344240
Thread-81104::DEBUG::2012-08-07 15:16:04,185::task::872::TaskManager.Task::(_run) Task=`d35dd9ad-5458-4f32-872a-a86c6449e133`::Task._run: d35dd9ad-5458-4f32-872a-a86c6449e133 ('7a627b01-4deb-4e10-aa5a-d8bffadd8a70', False) {} failed - s
topping task

Comment 1 Dafna Ron 2012-08-07 13:27:49 UTC
once the domain failed on format we cannot remove it with the following errors (attaching full logs for that too): 

Thread-83692::ERROR::2012-08-07 16:10:59,475::sdc::145::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `7a627b01-4deb-4e10-aa5a-d8bffadd8a70`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 140, in _findDomain
    return mod.findDomain(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1097, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 290, in __init__
    metadata = selectMetadata(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 259, in selectMetadata
    if len(mdProvider) > 0:
  File "/usr/share/vdsm/storage/persistentDict.py", line 61, in __len__
    return len(self.keys())
  File "/usr/share/vdsm/storage/persistentDict.py", line 105, in keys
    return list(self.__iter__())
  File "/usr/share/vdsm/storage/persistentDict.py", line 102, in __iter__
    return ifilter(lambda k: k in self._validatorDict, self._dict.__iter__())
  File "/usr/share/vdsm/storage/persistentDict.py", line 219, in __iter__
    with self._accessWrapper():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/share/vdsm/storage/persistentDict.py", line 147, in _accessWrapper
    self.refresh()
  File "/usr/share/vdsm/storage/persistentDict.py", line 224, in refresh
    lines = self._metaRW.readlines()
  File "/usr/share/vdsm/storage/blockSD.py", line 224, in readlines
    lvm.activateLVs(self._vgName, self._lvName)
  File "/usr/share/vdsm/storage/lvm.py", line 1027, in activateLVs
    _setLVAvailability(vgName, toActivate, "y")
  File "/usr/share/vdsm/storage/lvm.py", line 719, in _setLVAvailability
    raise error(str(e))
CannotActivateLogicalVolumes: Cannot activate Logical Volumes: ('General Storage Exception: ("5 [] [\'  /dev/mapper/1Dafna-si13-011343925: read failed after 0 of 4096 at 53687025664: Input/output error\', \'  /dev/mapper/1Dafna-si13-011
343925: read failed after 0 of 4096 at 53687083008: Input/output error\', \'  /dev/mapper/1Dafna-si13-011343925: read failed after 0 of 4096 at 0: Input/output error\', \'  WARNING: Error counts reached a limit of 3. Device /dev/mapper/
1Dafna-si13-011343925 was disabled\', \'  /dev/mapper/1Dafna-si13-021343925: read failed after 0 of 4096 at 53687025664: Input/output error\', \'  /dev/mapper/1Dafna-si13-021343925: read failed after 0 of 4096 at 53687083008: Input/outp
ut error\', \'  /dev/mapper/1Dafna-si13-021343925: read failed after 0 of 4096 at 0: Input/output error\', \'  WARNING: Error counts reached a limit of 3. Device /dev/mapper/1Dafna-si13-021343925 was disabled\', \'  One or more specifie
d logical volume(s) not found.\']\\n7a627b01-4deb-4e10-aa5a-d8bffadd8a70/[\'metadata\']",)',)
Thread-83692::ERROR::2012-08-07 16:10:59,477::task::853::TaskManager.Task::(_setError) Task=`b0dcdaab-f7ce-447e-bb56-80acc1b9125e`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2226, in formatStorageDomain
    self._recycle(sd)
  File "/usr/share/vdsm/storage/hsm.py", line 2187, in _recycle
    dom.format(dom.sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 47, in __getattr__
    dom = self.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 123, in _realProduce
    dom = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 147, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('7a627b01-4deb-4e10-aa5a-d8bffadd8a70',)
Thread-83695::INFO::2012-08-07 16:10:59,479::logUtils::39::dispatcher::(wrapper) Run and protect: getStoragePoolInfo, Return response: {'info': {'spm_id': 1, 'master_uuid': 'a93432d7-b235-4e4b-800d-43b3a42ba2ea', 'name': 'iSCSI', 'versi
on': '3', 'domains': 'a93432d7-b235-4e4b-800d-43b3a42ba2ea:Active', 'pool_status': 'connected', 'isoprefix': '', 'type': 'ISCSI', 'master_ver': 2, 'lver': 0}, 'dominfo': {'a93432d7-b235-4e4b-800d-43b3a42ba2ea': {'status': 'Active', 'dis
kfree': '49123688448', 'alerts': [], 'disktotal': '53284438016'}}}
Thread-83692::DEBUG::2012-08-07 16:10:59,479::task::872::TaskManager.Task::(_run) Task=`b0dcdaab-f7ce-447e-bb56-80acc1b9125e`::Task._run: b0dcdaab-f7ce-447e-bb56-80acc1b9125e ('7a627b01-4deb-4e10-aa5a-d8bffadd8a70', False) {} failed - stopping task
Thread-83695::DEBUG::2012-08-07 16:10:59,480::task::1172::TaskManager.Task::(prepare) Task=`74c2df23-2a82-4420-acfc-d6ee4ca591c4`::finished: {'info': {'spm_id': 1, 'master_uuid': 'a93432d7-b235-4e4b-800d-43b3a42ba2ea', 'name': 'iSCSI', 'version': '3', 'domains': 'a93432d7-b235-4e4b-800d-43b3a42ba2ea:Active', 'pool_status': 'connected', 'isoprefix': '', 'type': 'ISCSI', 'master_ver': 2, 'lver': 0}, 'dominfo': {'a93432d7-b235-4e4b-800d-43b3a42ba2ea': {'status': 'Active', 'diskfree': '49123688448', 'alerts': [], 'disktotal': '53284438016'}}}
Thread-83692::DEBUG::2012-08-07 16:10:59,480::task::1199::TaskManager.Task::(stop) Task=`b0dcdaab-f7ce-447e-bb56-80acc1b9125e`::stopping in state preparing (force False)
Thread-83695::DEBUG::2012-08-07 16:10:59,481::task::588::TaskManager.Task::(_updateState) Task=`74c2df23-2a82-4420-acfc-d6ee4ca591c4`::moving from state preparing -> state finished
Thread-83692::DEBUG::2012-08-07 16:10:59,481::task::978::TaskManager.Task::(_decref) Task=`b0dcdaab-f7ce-447e-bb56-80acc1b9125e`::ref 1 aborting True
Thread-83695::DEBUG::2012-08-07 16:10:59,482::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.5acd19f1-c608-41f6-a4e4-4dc90850356e': < ResourceRef 'Storage.5acd19f1-c608-41f6-a4e4-4dc90850356e', isValid: 'True' obj: 'None'>}
Thread-83692::INFO::2012-08-07 16:10:59,482::task::1157::TaskManager.Task::(prepare) Task=`b0dcdaab-f7ce-447e-bb56-80acc1b9125e`::aborting: Task is aborted: 'Storage domain does not exist' - code 358
Thread-83695::DEBUG::2012-08-07 16:10:59,482::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}

Comment 2 Dafna Ron 2012-08-07 13:28:19 UTC
Created attachment 602753 [details]
second remove failed logs

Comment 3 Eduardo Warszawski 2012-08-09 18:08:20 UTC
http://gerrit.ovirt.org/#/c/7066/

Comment 6 Dafna Ron 2012-09-27 12:29:04 UTC
verified on si19

Comment 9 errata-xmlrpc 2012-12-04 19:04:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html