Description of problem:
After blocking the connection from the host to the storage for 10 minutes and then unblocking it, the prepareVolume action fails:

clientIFinit::INFO::2011-07-31 10:53:57,539::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: prepareVolume, args: ( sdUUID=b07eb16c-66e3-4408-9e6e-71afd174edc7 spUUID=28c0b162-6c21-4165-9354-f6c4eb63333a imgUUID=d24ad0a9-edda-424d-9b7a-1bfbd531ba67 volUUID=1962ce6d-cf7c-44c5-bd14-9877d2b7ecb2)
clientIFinit::DEBUG::2011-07-31 10:53:57,539::task::492::TaskManager.Task::(_debug) Task efc04cca-719a-4c4d-b647-413d0d49da4e: moving from state init -> state preparing
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,593::lvm::371::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,595::lvm::463::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,595::persistentDict::204::Storage.PersistentDict::(refresh) read lines (VGTagMetadataRW)=[u'CLASS=Data', u'VERSION=2', u'SDUUID=95d9264e-2578-4c2e-ae17-f760f67a1261', u'TYPE=ISCSI', u'DESCRIPTION=ISCSIDataDomain2', u'PV0=pv:1987653345,uuid:HGlHMt-SOEr-lrJA-LzPv-SUHQ-2rFt-sAUzTK,pestart:0,pecount:398,mapoffset:0', u'LEASERETRIES=3', u'VGUUID=tbFpPV-ihCO-f3N6-6wbv-X6jn-uoaS-3dm1KE', u'LOCKRENEWALINTERVALSEC=5', u'LOCKPOLICY=', u'POOL_UUID=28c0b162-6c21-4165-9354-f6c4eb63333a', u'LEASETIMESEC=60', u'IOOPTIMEOUTSEC=10', u'POOL_SPM_ID=-1', u'MASTER_VERSION=3', u'ROLE=Master', u'POOL_SPM_LVER=-1', u'POOL_DESCRIPTION=TestDataCenter', u'POOL_DOMAINS=b07eb16c-66e3-4408-9e6e-71afd174edc7:Active,95d9264e-2578-4c2e-ae17-f760f67a1261:Active', u'_SHA_CKSUM=1b6efa75d9bae6bd4d2d2f028809536af569b064']
clientIFinit::DEBUG::2011-07-31 10:53:57,596::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.b07eb16c-66e3-4408-9e6e-71afd174edc7`ReqID=`8619a893-81c4-4f19-926b-35ebe32e054c`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '1668' at 'public_prepareVolume'
clientIFinit::DEBUG::2011-07-31 10:53:57,597::resourceManager::467::ResourceManager::(registerResource) Trying to register resource 'Storage.b07eb16c-66e3-4408-9e6e-71afd174edc7' for lock type 'shared'
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,597::lvm::551::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' got the operation mutex
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,597::lvm::553::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' released the operation mutex
clientIFinit::DEBUG::2011-07-31 10:53:57,598::resourceManager::508::ResourceManager::(registerResource) Resource 'Storage.b07eb16c-66e3-4408-9e6e-71afd174edc7' is free. Now locking as 'shared' (1 active user)
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,598::lvm::562::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex
clientIFinit::DEBUG::2011-07-31 10:53:57,598::resourceManager::191::ResourceManager.Request::(grant) ResName=`Storage.b07eb16c-66e3-4408-9e6e-71afd174edc7`ReqID=`8619a893-81c4-4f19-926b-35ebe32e054c`::Granted request
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,599::lvm::574::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex
clientIFinit::DEBUG::2011-07-31 10:53:57,599::task::492::TaskManager.Task::(_debug) Task efc04cca-719a-4c4d-b647-413d0d49da4e: _resourcesAcquired: Storage.b07eb16c-66e3-4408-9e6e-71afd174edc7 (shared)
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,599::persistentDict::163::Storage.PersistentDict::(__init__) Created a persistant dict with LvMetadataRW backend
clientIFinit::DEBUG::2011-07-31 10:53:57,600::task::492::TaskManager.Task::(_debug) Task efc04cca-719a-4c4d-b647-413d0d49da4e: ref 1 aborting False
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,600::blockSD::151::Storage.Misc.excCmd::(readlines) '/bin/dd iflag=direct skip=0 bs=2048 if=/dev/95d9264e-2578-4c2e-ae17-f760f67a1261/metadata count=1' (cwd None)
clientIFinit::DEBUG::2011-07-31 10:53:57,600::resourceManager::154::ResourceManager.Request::(__init__) ResName=`b07eb16c-66e3-4408-9e6e-71afd174edc7_imageNS.d24ad0a9-edda-424d-9b7a-1bfbd531ba67`ReqID=`ccc6755f-983d-4fd5-b688-dd0c385d727f`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '1672' at 'public_prepareVolume'
clientIFinit::DEBUG::2011-07-31 10:53:57,604::resourceManager::467::ResourceManager::(registerResource) Trying to register resource 'b07eb16c-66e3-4408-9e6e-71afd174edc7_imageNS.d24ad0a9-edda-424d-9b7a-1bfbd531ba67' for lock type 'exclusive'
clientIFinit::WARNING::2011-07-31 10:53:57,605::resourceManager::500::ResourceManager::(registerResource) Resource factory failed to create resource 'b07eb16c-66e3-4408-9e6e-71afd174edc7_imageNS.d24ad0a9-edda-424d-9b7a-1bfbd531ba67'. Canceling request.

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/resourceManager.py", line 498, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/share/vdsm/storage/resourceFactories.py", line 163, in createResource
    volResourcesList = self.__getResourceCandidatesList(resourceName, lockType)
  File "/usr/share/vdsm/storage/resourceFactories.py", line 111, in __getResourceCandidatesList
    chain = image.Image(repoPath).getChain(sdUUID=self.sdUUID, imgUUID=resourceName)
  File "/usr/share/vdsm/storage/image.py", line 312, in getChain
    uuidlist = volclass.getImageVolumes(self.repoPath, sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/blockVolume.py", line 502, in getImageVolumes
    return lvm.lvsByTag(sdUUID, "%s%s" % (TAG_PREFIX_IMAGE, imgUUID))
  File "/usr/share/vdsm/storage/lvm.py", line 1168, in lvsByTag
    return [lv.name for lv in getLV(vg) if tag in lv.tags]
  File "/usr/share/vdsm/storage/lvm.py", line 815, in getLV
    raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))

Version-Release number of selected component (if applicable):

Setup:
1) Two iSCSI storage domains on the same server
2) One vdsm host

Steps to Reproduce:
1. Block the connection from the host to the storage (using an iptables command)
2. Wait 10 minutes
3. Unblock the connection
4. Activate the master storage domain

Additional info:
The full log is attached. Almost all of our storage negative tests use iptables, so this issue is a test blocker for our tests.
Created attachment 516003 [details] vdsm log
Verified on vdsm-4.9-95.el6 - the scenario specified above no longer reproduces on the current build. The master domain was properly activated and the datacenter's status was set to Up.
Re-opening, reproducible in vdsm-4.9-96.el6.x86_64.

Scenario:
1. Have two hosts and one data domain.
2. Block storage using iptables on both hosts.
3. After the SPM goes non-operational and the DC goes to maintenance, delete the blocking iptables rules on both hosts.
4. Activate the non-operational host and the master domain.

----

The former HSM cannot see the iSCSI LUN; the iSCSI session is active but the block device is not:

[root@srh-11 ~]# iscsiadm -m session
tcp: [11] 10.34.63.202:3260,1 iqn.2011-01.brq.str-01:jlibosva
[root@srh-11 ~]# multipath -ll
1jlibosva2 dm-3 RH-str01,VIRTUAL-DISK
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 14:0:0:2 sdd 8:48 failed faulty running
1jlibosva1 dm-4 RH-str01,VIRTUAL-DISK
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 14:0:0:1 sdb 8:16 failed faulty running

Note that this faulty host was SPM for a while even while its connection to storage was blocked. 'service vdsmd restart' solves the problem. I am not sure whether this is the same bug or a regression, since there is no prepareVolume task in the logs, but the steps to reproduce seem very similar.
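Step 2 above amounts to dropping iSCSI traffic (TCP port 3260) toward the storage server with iptables, and step 3 to deleting the same rule. The helper below is a hypothetical sketch that only builds the argument lists (the storage IP shown is taken from the iscsiadm output above); it does not run anything:

```python
# Hypothetical sketch of the iptables commands used to block/unblock
# storage in the reproduction steps (command construction only).
ISCSI_PORT = "3260"

def block_storage_cmd(storage_ip):
    # DROP outgoing iSCSI traffic to the storage server (step 2).
    return ["iptables", "-A", "OUTPUT", "-d", storage_ip,
            "-p", "tcp", "--dport", ISCSI_PORT, "-j", "DROP"]

def unblock_storage_cmd(storage_ip):
    # Delete the same rule again (step 3): -A becomes -D.
    cmd = block_storage_cmd(storage_ip)
    cmd[1] = "-D"
    return cmd
```

The lists can be passed to subprocess.run (as root) on each host; they are shown here only to make the reproduction steps concrete.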
Created attachment 521499 [details] vdsm, spm-lock, metadata logs
You can clearly see in the log that LVM has read failures from all drives. This means that even though you turned the connection back on, something along the way didn't pick it up. I suggest you take this up with LVM and then continue down the stack until you find the culprit.

Log excerpt:

Thread-137::DEBUG::2011-09-05 14:50:42,110::lvm::374::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"a%1IET_0005000d|1jlibosva1|1jlibosva2%\\", \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 24bc5ef6-8a37-48f1-a6fc-6b0a4e3140b5' (cwd None)
Thread-137::DEBUG::2011-09-05 14:50:42,240::lvm::374::Storage.Misc.excCmd::(cmd) FAILED: <err> = ' /dev/mapper/1jlibosva2: read failed after 0 of 4096 at 214748299264: Input/output error\n /dev/mapper/1jlibosva2: read failed after 0 of 4096 at 214748356608: Input/output error\n /dev/mapper/1jlibosva2: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/1jlibosva2 was disabled\n /dev/mapper/1jlibosva1: read failed after 0 of 4096 at 214748299264: Input/output error\n /dev/mapper/1jlibosva1: read failed after 0 of 4096 at 214748356608: Input/output error\n /dev/mapper/1jlibosva1: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/1jlibosva1 was disabled\n /dev/mapper/1IET_0005000d: read failed after 0 of 4096 at 16106061824: Input/output error\n /dev/mapper/1IET_0005000d: read failed after 0 of 4096 at 16106119168: Input/output error\n /dev/mapper/1IET_0005000d: read failed after 0 of 4096 at 0: Input/output error\n WARNING: Error counts reached a limit of 3. Device /dev/mapper/1IET_0005000d was disabled\n Volume group "24bc5ef6-8a37-48f1-a6fc-6b0a4e3140b5" not found\n'; <rc> = 5
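The devices LVM gave up on can be pulled out of that stderr mechanically. A small hypothetical helper (not part of vdsm) that matches LVM's "Device ... was disabled" warning as it appears in the excerpt:

```python
import re

# Collect multipath devices that LVM disabled after hitting its
# read-error limit. Hypothetical diagnostic helper; the message
# format matches the log excerpt above.
_DISABLED_RE = re.compile(r"Device (\S+) was disabled")

def disabled_devices(lvm_stderr):
    # Returns device paths in the order LVM reported them.
    return _DISABLED_RE.findall(lvm_stderr)
```

Feeding it the stderr above would yield the three /dev/mapper devices that stayed faulty after the unblock, which is the symptom being reported.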
After a few hours the devices were still faulty - I'd say iscsid doesn't refresh the LUNs. I manually ran 'iscsiadm -m session -R' to rescan the sessions, which solved the problem. If you think this is not vdsm's responsibility but iscsid's, I'm moving this bug back to VERIFIED.
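For reference, the manual workaround can be wrapped in a trivial helper. This is a hypothetical sketch, not vdsm code; the runner is injectable so the call can be exercised without iscsiadm installed, and the only assumption is the exact command quoted above:

```python
import subprocess

def rescan_iscsi_sessions(runner=subprocess.run):
    """Rescan all active iSCSI sessions: 'iscsiadm -m session -R'.

    Hypothetical wrapper around the workaround from the comment above.
    `runner` defaults to subprocess.run but can be replaced for testing.
    """
    return runner(["iscsiadm", "-m", "session", "-R"], check=True)
```

Running this (as root) forces the kernel to re-read the LUNs behind each session, which is what cleared the faulty multipath state here.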
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2011-1782.html