Bug 726949

Summary: VDSM: Prepare volume fails (after connection to iSCSI storage is blocked and unblocked)

Product: Red Hat Enterprise Linux 6
Component: vdsm
Version: 6.1
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: TestBlocker
Target Milestone: rc
Target Release: ---
Fixed In Version: vdsm-4.9-92.el6
Doc Type: Bug Fix
Reporter: Evgeniy German <egerman>
Assignee: Eduardo Warszawski <ewarszaw>
QA Contact: Tomas Dosek <tdosek>
CC: abaron, bazulay, danken, iheim, ilvovsky, jlibosva, smizrahi, tdosek, yeylon, ykaul
Last Closed: 2011-12-06 07:36:03 UTC
Attachments:
  - vdsm log (flags: none)
  - vdsm, spm-lock, metadata logs (flags: none)

Description Evgeniy German 2011-07-31 11:04:22 UTC
Description of problem:
After blocking the connection from the host to the storage for 10 minutes and then unblocking it, the prepareVolume action fails.


clientIFinit::INFO::2011-07-31 10:53:57,539::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: prepareVolume, args: ( sdUUID=b07eb16c-66e3-4408-9e6e-71afd174edc7 spUUID=28c0b162-6c21-4165-9354-f6c4eb63333a imgUUID=d24ad0a9-edda-424d-9b7a-1bfbd531ba67 volUUID=1962ce6d-cf7c-44c5-bd14-9877d2b7ecb2)
clientIFinit::DEBUG::2011-07-31 10:53:57,539::task::492::TaskManager.Task::(_debug) Task efc04cca-719a-4c4d-b647-413d0d49da4e: moving from state init -> state preparing
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,593::lvm::371::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,595::lvm::463::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,595::persistentDict::204::Storage.PersistentDict::(refresh) read lines (VGTagMetadataRW)=[u'CLASS=Data', u'VERSION=2', u'SDUUID=95d9264e-2578-4c2e-ae17-f760f67a1261', u'TYPE=ISCSI', u'DESCRIPTION=ISCSIDataDomain2', u'PV0=pv:1987653345,uuid:HGlHMt-SOEr-lrJA-LzPv-SUHQ-2rFt-sAUzTK,pestart:0,pecount:398,mapoffset:0', u'LEASERETRIES=3', u'VGUUID=tbFpPV-ihCO-f3N6-6wbv-X6jn-uoaS-3dm1KE', u'LOCKRENEWALINTERVALSEC=5', u'LOCKPOLICY=', u'POOL_UUID=28c0b162-6c21-4165-9354-f6c4eb63333a', u'LEASETIMESEC=60', u'IOOPTIMEOUTSEC=10', u'POOL_SPM_ID=-1', u'MASTER_VERSION=3', u'ROLE=Master', u'POOL_SPM_LVER=-1', u'POOL_DESCRIPTION=TestDataCenter', u'POOL_DOMAINS=b07eb16c-66e3-4408-9e6e-71afd174edc7:Active,95d9264e-2578-4c2e-ae17-f760f67a1261:Active', u'_SHA_CKSUM=1b6efa75d9bae6bd4d2d2f028809536af569b064']
clientIFinit::DEBUG::2011-07-31 10:53:57,596::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.b07eb16c-66e3-4408-9e6e-71afd174edc7`ReqID=`8619a893-81c4-4f19-926b-35ebe32e054c`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '1668' at 'public_prepareVolume'
clientIFinit::DEBUG::2011-07-31 10:53:57,597::resourceManager::467::ResourceManager::(registerResource) Trying to register resource 'Storage.b07eb16c-66e3-4408-9e6e-71afd174edc7' for lock type 'shared'
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,597::lvm::551::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' got the operation mutex
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,597::lvm::553::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' released the operation mutex
clientIFinit::DEBUG::2011-07-31 10:53:57,598::resourceManager::508::ResourceManager::(registerResource) Resource 'Storage.b07eb16c-66e3-4408-9e6e-71afd174edc7' is free. Now locking as 'shared' (1 active user)
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,598::lvm::562::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex
clientIFinit::DEBUG::2011-07-31 10:53:57,598::resourceManager::191::ResourceManager.Request::(grant) ResName=`Storage.b07eb16c-66e3-4408-9e6e-71afd174edc7`ReqID=`8619a893-81c4-4f19-926b-35ebe32e054c`::Granted request
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,599::lvm::574::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex
clientIFinit::DEBUG::2011-07-31 10:53:57,599::task::492::TaskManager.Task::(_debug) Task efc04cca-719a-4c4d-b647-413d0d49da4e: _resourcesAcquired: Storage.b07eb16c-66e3-4408-9e6e-71afd174edc7 (shared)
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,599::persistentDict::163::Storage.PersistentDict::(__init__) Created a persistant dict with LvMetadataRW backend
clientIFinit::DEBUG::2011-07-31 10:53:57,600::task::492::TaskManager.Task::(_debug) Task efc04cca-719a-4c4d-b647-413d0d49da4e: ref 1 aborting False
b19b1ddb-adda-477f-bfb9-e6423f332d6f::DEBUG::2011-07-31 10:53:57,600::blockSD::151::Storage.Misc.excCmd::(readlines) '/bin/dd iflag=direct skip=0 bs=2048 if=/dev/95d9264e-2578-4c2e-ae17-f760f67a1261/metadata count=1' (cwd None)
clientIFinit::DEBUG::2011-07-31 10:53:57,600::resourceManager::154::ResourceManager.Request::(__init__) ResName=`b07eb16c-66e3-4408-9e6e-71afd174edc7_imageNS.d24ad0a9-edda-424d-9b7a-1bfbd531ba67`ReqID=`ccc6755f-983d-4fd5-b688-dd0c385d727f`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '1672' at 'public_prepareVolume'
clientIFinit::DEBUG::2011-07-31 10:53:57,604::resourceManager::467::ResourceManager::(registerResource) Trying to register resource 'b07eb16c-66e3-4408-9e6e-71afd174edc7_imageNS.d24ad0a9-edda-424d-9b7a-1bfbd531ba67' for lock type 'exclusive'
clientIFinit::WARNING::2011-07-31 10:53:57,605::resourceManager::500::ResourceManager::(registerResource) Resource factory failed to create resource 'b07eb16c-66e3-4408-9e6e-71afd174edc7_imageNS.d24ad0a9-edda-424d-9b7a-1bfbd531ba67'. Canceling request.
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/resourceManager.py", line 498, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/share/vdsm/storage/resourceFactories.py", line 163, in createResource
    volResourcesList = self.__getResourceCandidatesList(resourceName, lockType)
  File "/usr/share/vdsm/storage/resourceFactories.py", line 111, in __getResourceCandidatesList
    chain = image.Image(repoPath).getChain(sdUUID=self.sdUUID, imgUUID=resourceName)
  File "/usr/share/vdsm/storage/image.py", line 312, in getChain
    uuidlist = volclass.getImageVolumes(self.repoPath, sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/blockVolume.py", line 502, in getImageVolumes
    return lvm.lvsByTag(sdUUID, "%s%s" % (TAG_PREFIX_IMAGE, imgUUID))
  File "/usr/share/vdsm/storage/lvm.py", line 1168, in lvsByTag
    return [lv.name for lv in getLV(vg) if tag in lv.tags]
  File "/usr/share/vdsm/storage/lvm.py", line 815, in getLV
    raise se.LogicalVolumeDoesNotExistError("%s/%s" % (vgName, lvName))


Version-Release number of selected component (if applicable):

Setup:
1) Two iSCSI storage domains on the same server
2) One vdsm host

Steps to Reproduce:
1. Block the connection from the host to the storage (using iptables).
2. Wait 10 minutes.
3. Unblock the connection.
4. Activate the master storage domain.
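
The block/unblock step can be sketched as below. This is an assumption-laden illustration, not taken from the bug itself: the storage address is borrowed from the iscsiadm output in comment 7, and 3260 is the standard iSCSI target port. The commands are echoed rather than executed, since the real rules require root.

```shell
#!/bin/sh
# Dry-run sketch of the iptables block/unblock step (assumed address/port).
STORAGE=10.34.63.202   # assumption: storage server address from comment 7
PORT=3260              # standard iSCSI target port

# Block all host-to-storage iSCSI traffic:
echo "iptables -A OUTPUT -d $STORAGE -p tcp --dport $PORT -j DROP"
# ...wait ~10 minutes, then delete the rule to unblock:
echo "iptables -D OUTPUT -d $STORAGE -p tcp --dport $PORT -j DROP"
```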
  
Additional info:
The full log is attached.

Almost all of our negative storage tests use iptables, so this issue is a test blocker for them.

Comment 1 Evgeniy German 2011-07-31 11:05:21 UTC
Created attachment 516003 [details]
vdsm log

Comment 6 Tomas Dosek 2011-08-25 10:16:37 UTC
Verified with vdsm-4.9-95.el6 - the scenario specified above no longer reproduces on the current build. The master domain was properly activated and the data center's status was set to Up.

Comment 7 Jakub Libosvar 2011-09-05 12:58:10 UTC
Re-opening, reproducible in vdsm-4.9-96.el6.x86_64.

Scenario:
1. Have two hosts and one data domain.
2. Block the storage using iptables on both hosts.
3. After the SPM goes non-operational and the DC goes to maintenance, delete the blocking iptables rules on both hosts.
4. Activate the non-operational host and the master domain.

----
The former HSM cannot see the iSCSI LUN; the iSCSI session is active but the block device is not:

[root@srh-11 ~]# iscsiadm -m session
tcp: [11] 10.34.63.202:3260,1 iqn.2011-01.brq.str-01:jlibosva
[root@srh-11 ~]# multipath -ll
1jlibosva2 dm-3 RH-str01,VIRTUAL-DISK
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 14:0:0:2 sdd 8:48 failed faulty running
1jlibosva1 dm-4 RH-str01,VIRTUAL-DISK
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 14:0:0:1 sdb 8:16 failed faulty running


Note that this faulty host was SPM for a while even while its connection to the storage was blocked.


Running 'service vdsmd restart' solves the problem.

Not sure if this is the same bug or a regression, since there is no prepareVolume task in the logs, but the steps to reproduce seem very similar.

Comment 8 Jakub Libosvar 2011-09-05 12:58:34 UTC
Created attachment 521499 [details]
vdsm, spm-lock, metadata logs

Comment 9 Saggi Mizrahi 2011-09-06 12:39:00 UTC
You can clearly see in the log that LVM has read failures on all drives. This means that even though you unblocked the connection, something along the stack didn't pick that up.

I suggest you take this up with LVM and then continue down the stack until you find the culprit.

Log excerpt:

Thread-137::DEBUG::2011-09-05 14:50:42,110::lvm::374::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"a%1IET_0005000d|1jlibosva1|1jlibosva2%\\", \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 24bc5ef6-8a37-48f1-a6fc-6b0a4e3140b5' (cwd None)

Thread-137::DEBUG::2011-09-05 14:50:42,240::lvm::374::Storage.Misc.excCmd::(cmd) FAILED: <err> = '  /dev/mapper/1jlibosva2: read failed after 0 of 4096 at 214748299264: Input/output error\n  /dev/mapper/1jlibosva2: read failed after 0 of 4096 at 214748356608: Input/output error\n  /dev/mapper/1jlibosva2: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/1jlibosva2 was disabled\n  /dev/mapper/1jlibosva1: read failed after 0 of 4096 at 214748299264: Input/output error\n  /dev/mapper/1jlibosva1: read failed after 0 of 4096 at 214748356608: Input/output error\n  /dev/mapper/1jlibosva1: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/1jlibosva1 was disabled\n  /dev/mapper/1IET_0005000d: read failed after 0 of 4096 at 16106061824: Input/output error\n  /dev/mapper/1IET_0005000d: read failed after 0 of 4096 at 16106119168: Input/output error\n  /dev/mapper/1IET_0005000d: read failed after 0 of 4096 at 0: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/1IET_0005000d was disabled\n  Volume group "24bc5ef6-8a37-48f1-a6fc-6b0a4e3140b5" not found\n'; <rc> = 5

Comment 10 Jakub Libosvar 2011-09-06 19:17:19 UTC
After a few hours the devices were still faulty - I'd say iscsid doesn't refresh the LUNs. I manually ran 'iscsiadm -m session -R' to rescan, and that solved the problem. If you think this is iscsid's responsibility rather than vdsm's, I'm moving this bug back to VERIFIED.
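
The manual recovery can be sketched as the two-command sequence below. Both commands appear verbatim in this bug's comments; they are echoed rather than executed here because they need root and a live iSCSI session.

```shell
#!/bin/sh
# Recovery sketch: force a LUN rescan on all iSCSI sessions, then verify
# that the multipath paths have recovered.
RESCAN="iscsiadm -m session -R"   # rescan LUNs on every active session
VERIFY="multipath -ll"            # paths should no longer show 'failed faulty'
echo "$RESCAN"
echo "$VERIFY"
```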

Comment 11 errata-xmlrpc 2011-12-06 07:36:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2011-1782.html