Created attachment 644828 [details]
vdsm log

[Storage][Infra] Although the extend succeeded, there is an exception in vdsm (collecting storage domain monitoring information fails) while extending a storage domain concurrently with another storage operation, such as moving a disk from one storage domain to another. Full vdsm log attached.

Scenario:
1. Set up 2 iSCSI storage domains.
2. Move a disk from one storage domain to the other.
3. While the move is in progress, go to the Storage tab of the UI and edit one of the storage domains.
4. Try to extend the storage domain.

Thread-460::ERROR::2012-11-14 13:08:44,558::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 328fe2ec-4849-4e64-92d8-333f9ff0b09b monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 119, in _realProduce
    self.refreshStorage()
  File "/usr/share/vdsm/storage/misc.py", line 1063, in helper
    return sm(*args, **kwargs)
  File "/usr/share/vdsm/storage/misc.py", line 1048, in __call__
    self.__lastResult = self.__func(*args, **kwargs)
  File "/usr/share/vdsm/storage/sdc.py", line 82, in refreshStorage
    multipath.rescan()
  File "/usr/share/vdsm/storage/multipath.py", line 70, in rescan
    supervdsm.getProxy().forceIScsiScan()
  File "/usr/share/vdsm/supervdsm.py", line 69, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 60, in <lambda>
    **kwargs)
  File "<string>", line 2, in forceIScsiScan
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '1879e10'
---------------------------------------------------------------------------
Thread-460::DEBUG::2012-11-14 13:08:44,581::domainMonitor::216::Storage.DomainMonitorThread::(_monitorDomain) Domain 328fe2ec-4849-4e64-92d8-333f9ff0b09b changed its status to Invalid
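For context on the RemoteError above: a multiprocessing manager proxy calls into a server process that looks the proxied object up by id, and if the server has already dropped its reference while the client still holds the proxy, the lookup raises KeyError server-side and surfaces as RemoteError on the client. A minimal standalone sketch (hypothetical, modern Python - not vdsm/supervdsm code) that forces the same failure mode by deliberately telling the server to drop its reference:

    # Hypothetical reproduction sketch - not vdsm code. It simulates broken
    # refcounting between a client proxy and the manager server, which is
    # roughly the situation the traceback above shows.
    from multiprocessing.managers import BaseManager, RemoteError, dispatch

    class Mgr(BaseManager):
        pass

    Mgr.register('make_list', callable=list)

    if __name__ == '__main__':
        mgr = Mgr()
        mgr.start()
        proxy = mgr.make_list()  # proxy to a list living in the server process

        # Simulate the bug: ask the server to decref (and thus delete) the
        # object while the client still holds a live proxy to it.
        conn = proxy._Client(proxy._token.address, authkey=proxy._authkey)
        dispatch(conn, None, 'decref', (proxy._token.id,))
        conn.close()

        try:
            proxy.append(1)  # server-side id_to_obj[ident] -> KeyError
        except RemoteError as e:
            print(e)         # remote traceback ends with KeyError: '...'

        mgr.shutdown()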
This part of the log is not related to this bug. The bug appears when the extend-VG operation times out. This happens because the extend waits too long for a storage resource that another, still-running operation acquired before the extend request for the same resource was received.

Please attach the right log.
(In reply to comment #2)
> This part of the log is not related to this bug. The bug appears when the
> extend-VG operation times out. This happens because the extend waits too
> long for a storage resource that another, still-running operation acquired
> before the extend request for the same resource was received.
>
> Please attach the right log.

Changed the title - please fix the KeyError in task handling.
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
I don't recall moving the bug to MODIFIED - probably a mistake, sorry about that.

I'm not sure why this exception appears: the refresh failed once and worked 5 minutes later, supervdsm functions in the previous and next calls, and no restart of vdsm takes place in this part of the log.

Is this reproducible? What do you edit in the Storage tab?
Yeela, please take a look at this to see if it's been resolved already.
The failure happens in extendStorageDomain, when trying to extend the SD with a LUN that is already used.

This is an engine issue; vdsm returns the LUN as used in getDeviceList:

Thread-7481::INFO::2012-11-14 14:57:04,066::logUtils::39::dispatcher::(wrapper) Run and protect: getDeviceList, Return response: {'devList': [
{'status': 'used', 'fwrev': '0001', 'vgUUID': 'A3R7gy-7QPB-9UKl-Rx1M-kVqt-B4OH-AfItXd', 'pathlist': [{'connection': '10.35.64.10', 'iqn': 'iqn.1994-05.com.redhat:violet-rhel_612144559425363', 'portal': '1', 'port': '3260', 'initiatorname': 'default'}], 'logicalblocksize': '512', 'devtype': 'iSCSI', 'physicalblocksize': '512', 'serial': 'SIET_VIRTUAL-DISK', 'GUID': '1a95f4b3f-ba92-4250', 'productID': 'VIRTUAL-DISK', 'vendorID': 'IET', 'capacity': '10737418240', 'pathstatus': [{'physdev': 'sdg', 'state': 'active', 'type': 'iSCSI', 'lun': '1'}], 'pvUUID': 'fmS9Mj-bv7d-745o-PJd8-cn7m-CvDQ-DrjbSY'}
]}

and afterwards vdsm gets a request to extend the SD using the used LUN:

Thread-7512::INFO::2012-11-14 14:57:33,488::logUtils::37::dispatcher::(wrapper) Run and protect: extendStorageDomain(sdUUID='328fe2ec-4849-4e64-92d8-333f9ff0b09b', spUUID='e59ad86c-2e41-11e2-98f9-df8494d70a03', devlist=['1a95f4b3f-ba92-4250'], options=None)

and fails:

Thread-7512::DEBUG::2012-11-14 14:58:05,567::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm pvcreate --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"a%1a95f4b3f-ba92-4250|3514f0c561000000b|3514f0c561000000c|3514f0c561000000d|3514f0c561000000e|36000402002ec07a76533518c00000000%\\", \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --metadatasize 128m --metadatacopies 2 --metadataignore y /dev/mapper/1a95f4b3f-ba92-4250' (cwd None)
Thread-7512::DEBUG::2012-11-14 14:58:05,638::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ' Can\'t initialize physical volume "/dev/mapper/1a95f4b3f-ba92-4250" of volume group "d4cb8e1a-85e2-4475-a081-c26c188e7b1c" without -ff\n'; <rc> = 5

Regarding the KeyError issue, it is a duplicate of BZ#908068, which is already fixed and verified.
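For illustration of the engine-side check implied above, a hypothetical guard (the helper names and client handle are invented; the real engine code differs) that would reject LUNs reported as 'used' by getDeviceList before ever issuing extendStorageDomain:

    # Hypothetical sketch - the real engine-side validation lives elsewhere.
    # The dict layout follows the getDeviceList response quoted above.

    def usable_lun_guids(dev_list):
        """GUIDs of devices that vdsm did not report as in use."""
        return {dev['GUID'] for dev in dev_list if dev.get('status') != 'used'}

    def extend_storage_domain_checked(vdsm, sd_uuid, sp_uuid, requested_guids, dev_list):
        usable = usable_lun_guids(dev_list)
        rejected = [guid for guid in requested_guids if guid not in usable]
        if rejected:
            # In the log above, '1a95f4b3f-ba92-4250' had status 'used' and
            # should have been rejected here instead of reaching pvcreate.
            raise ValueError("refusing to extend with used LUNs: %s" % rejected)
        # 'vdsm' is a hypothetical client handle for the verb shown in the log.
        return vdsm.extendStorageDomain(sd_uuid, sp_uuid, requested_guids)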
(In reply to comment #11)
> The failure happens in extendStorageDomain, when trying to extend the SD
> with a LUN that is already used.
>
> This is an engine issue; vdsm returns the LUN as used in getDeviceList:
> [...]

Liron, I doubt this issue still exists, but please verify.
Ayal, I'll try to sum up the possible issues as I understand them from the vdsm log. Also adding needinfo? on Lesik to clarify which issue we are referring to, as it's not clear from the bug and the comments (at least to me) what he meant - I don't see a timeout error in the vdsm log.

1. When running moveImage, a shared lock is acquired in vdsm on the domains, while the extend attempts to acquire an exclusive lock. This means the extend operation on the domain will wait for that lock, and the operation can fail with ResourceTimeout if the copy takes a long time (possibly - it's not in the present log, so I can't verify that). See the sketch after this list.

Thread-8048::INFO::2012-11-14 15:09:06,315::logUtils::37::dispatcher::(wrapper) Run and protect: moveImage(spUUID='e59ad86c-2e41-11e2-98f9-df8494d70a03', srcDomUUID='6093c92d-8817-4199-a2dc-9c04b07344b3', dstDomUUID='328fe2ec-4849-4e64-92d8-333f9ff0b09b', imgUUID='3510cfb6-c455-4a40-b90a-137e68cf7e74', vmUUID='', op=2, postZero='false', force='false')
Thread-8048::DEBUG::2012-11-14 15:09:06,825::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' for lock type 'shared'
Thread-8048::DEBUG::2012-11-14 15:09:06,825::resourceManager::528::ResourceManager::(registerResource) Resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' is free. Now locking as 'shared' (1 active user)
Thread-8048::DEBUG::2012-11-14 15:09:06,826::resourceManager::212::ResourceManager.Request::(grant) ResName=`Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b`ReqID=`51954bef-d40a-4e00-bb4d-34f5a9535724`::Granted request

If we try to extend meanwhile, we have to wait:

Run and protect: extendStorageDomain(sdUUID='328fe2ec-4849-4e64-92d8-333f9ff0b09b', spUUID='e59ad86c-2e41-11e2-98f9-df8494d70a03', devlist=['1paikov7'], options=None)
Thread-8082::DEBUG::2012-11-14 15:09:46,438::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b`ReqID=`07c4d17a-bb72-4e47-b968-92ff2f185c38`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-8082::DEBUG::2012-11-14 15:09:46,438::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' for lock type 'exclusive'
Thread-8082::DEBUG::2012-11-14 15:09:46,439::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' is currently locked, Entering queue (1 in queue)

2. Another issue is that the extend fails (Yeela wrote it up): we allow extending with used LUNs, and the operation can fail on initializing the physical volume. That can always happen and is present in the log - it doesn't seem related to the copy.

3.
The log which was copy-pasted (same traceback as in the description):

Thread-460::ERROR::2012-11-14 14:59:02,069::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 328fe2ec-4849-4e64-92d8-333f9ff0b09b monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 119, in _realProduce
    self.refreshStorage()
  File "/usr/share/vdsm/storage/misc.py", line 1063, in helper
    return sm(*args, **kwargs)
  File "/usr/share/vdsm/storage/misc.py", line 1048, in __call__
    self.__lastResult = self.__func(*args, **kwargs)
  File "/usr/share/vdsm/storage/sdc.py", line 82, in refreshStorage
    multipath.rescan()
  File "/usr/share/vdsm/storage/multipath.py", line 70, in rescan
    supervdsm.getProxy().forceIScsiScan()
  File "/usr/share/vdsm/supervdsm.py", line 69, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 60, in <lambda>
    **kwargs)
  File "<string>", line 2, in forceIScsiScan
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '1879e10'
---------------------------------------------------------------------------
Thread-460::DEBUG::2012-11-14 14:59:02,069::domainMonitor::216::Storage.DomainMonitorThread::(_monitorDomain) Domain 328fe2ec-4849-4e64-92d8-333f9ff0b09b changed its status to Invalid

The monitoring on the domain failed, but the operation itself had already failed before that, so I don't see how it is related or relevant to the operation:

DEBUG::2012-11-14 14:58:05,650::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b'
Thread-7512::DEBUG::2012-11-14 14:58:05,650::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' (0 active users)
Thread-7512::DEBUG::2012-11-14 14:58:05,650::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' is free, finding out if anyone is waiting for it.
Thread-7512::DEBUG::2012-11-14 14:58:05,651::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b', Clearing records.
Thread-7512::ERROR::2012-11-14 14:58:05,651::dispatcher::66::Storage.Dispatcher.Protect::(run) {'status': {'message': 'Failed to initialize physical device: ("[\'/dev/mapper/1a95f4b3f-ba92-4250\']",)', 'code': 601}}
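Going back to item 1 in the previous comment, a rough sketch of the shared/exclusive semantics (hypothetical - vdsm's real resourceManager is considerably more elaborate): an exclusive acquire (the extend) waits behind shared holders (the copy) and fails with a timeout if they hold on too long.

    # Hypothetical sketch, not vdsm's resourceManager.
    import threading
    import time

    class ResourceTimeout(Exception):
        pass

    class SharedExclusiveLock(object):
        def __init__(self):
            self._cond = threading.Condition()
            self._shared = 0           # active shared holders
            self._exclusive = False    # active exclusive holder

        def acquire_shared(self):
            with self._cond:
                while self._exclusive:
                    self._cond.wait()
                self._shared += 1

        def acquire_exclusive(self, timeout):
            deadline = time.monotonic() + timeout
            with self._cond:
                while self._shared or self._exclusive:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        # analogous to the ResourceTimeout the extend can hit
                        raise ResourceTimeout()
                    self._cond.wait(remaining)
                self._exclusive = True

        def release(self):
            with self._cond:
                if self._exclusive:
                    self._exclusive = False
                else:
                    self._shared -= 1
                self._cond.notify_all()

    # moveImage-like flow:            lock.acquire_shared(); copy(); lock.release()
    # extendStorageDomain-like flow:  lock.acquire_exclusive(60); extend(); lock.release()

This simplified version does not queue requests fairly the way the resourceManager log above shows; it only illustrates why the exclusive request waits and can time out.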
Bad wording in the last comment. In the previous comment I summed up the issues following the different logs attached in the comments. Now, regarding the KeyError: as Yeela mentioned, it should have been solved in the link she provided. Regarding the issue in Yeela's log: it can always happen when we fail to extend and doesn't seem related here. The possible timeout error is caused by the locks taken on the domain.

Lesik/Haim - is this bug (as the title reflects, it refers to the KeyError) still relevant?
"Regards the issue in Yeela log - it can always happen when we fail to extend, doesn't seems related here." - i meant that the failure can happen regardless of waiting to acquire the lock, therefore not seem related to the timeout error.
Haim, Lesik - do we have any issue left here? (Haim, you wrote to fix the task error handling; Yeela gave a link to the bug in which it should have been solved.)
*** This bug has been marked as a duplicate of bug 908068 ***