
Bug 820486

Summary: [multipath] extend storage domain fails in 2 hosts topology due to race in multipath
Product: Red Hat Enterprise Virtualization Manager Reporter: Haim <hateya>
Component: ovirt-engine Assignee: Ayal Baron <abaron>
Status: CLOSED CURRENTRELEASE QA Contact: yeylon <yeylon>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.1.0 CC: abaron, acathrow, amureini, bazulay, danken, dron, gickowic, iheim, istein, jkt, lpeer, nsoffer, Rhev-m-bugs, scohen, srevivo, vchen, yeylon
Target Milestone: ---   
Target Release: 3.3.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: is27 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-02-08 11:02:58 EST Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Bug Depends On: 844180    
Bug Blocks:    
Attachments (description / flags):
vdsm logs / none
vdsm logs from long run / none
vdsm + engine logs / none

Description Haim 2012-05-10 02:56:49 EDT
Description of problem:

- edit storage domain (2 hosts on data-center, 1 is SPM)
- discover new target
- login to target
- select new lun 
- hit enter (VG extend flow)

Result: the operation fails, as the second host failed to connect to the storage.

What really happens:

- connectStorageServer is sent to the second host - successful
- getDevicesVisibility is sent to the second host afterwards - failed:
 
Thread-57316::INFO::2012-05-10 09:04:57,230::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getDevicesVisibility, args: ( guids=['1pvcreate-lun21324475'])
Thread-57316::DEBUG::2012-05-10 09:04:57,231::task::495::TaskManager.Task::(_debug) Task c192225e-5922-4301-a634-fb8386edfd2e: moving from state init -> state preparing
Thread-57316::DEBUG::2012-05-10 09:04:57,231::task::495::TaskManager.Task::(_debug) Task c192225e-5922-4301-a634-fb8386edfd2e: finished: {'visible': {'1pvcreate-lun21324475': False}}
Thread-57316::DEBUG::2012-05-10 09:04:57,232::task::495::TaskManager.Task::(_debug) Task c192225e-5922-4301-a634-fb8386edfd2e: moving from state preparing -> state finished
Thread-57316::DEBUG::2012-05-10 09:04:57,232::resourceManager::786::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-57316::DEBUG::2012-05-10 09:04:57,233::resourceManager::821::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-57316::DEBUG::2012-05-10 09:04:57,233::task::495::TaskManager.Task::(_debug) Task c192225e-5922-4301-a634-fb8386edfd2e: ref 0 aborting False
Thread-57316::INFO::2012-05-10 09:04:57,234::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getDevicesVisibility, Return response: {'status': {'message': 'OK', 'code': 0}, 'visible': {'1pvcreate-lun21324475': False}}

multipath fails to map the new device in time, hence the command fails; the backend then sends disconnectStorageServer and the operation is rolled back

Tested on vdsm 6.2.z++ (vdsm-4.9-112.14.1.g407afd.newLvmOn3_0.x86_64)
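The race described above can be illustrated with a small self-contained simulation (illustrative Python only, not actual vdsm or engine code; the class and function names here are made up for the sketch). The iSCSI login returns immediately, but the multipath map for the new LUN appears only after a delay, so a visibility check issued right after the connect loses the race unless it waits:

```python
import threading
import time


class FakeHost:
    """Simulates a host where multipath maps a new LUN after a short delay."""

    def __init__(self, map_delay=0.05):
        self._visible = False
        self._delay = map_delay

    def connect_storage_server(self):
        # The login itself succeeds immediately; the device mapping happens
        # asynchronously, like an iscsiadm login followed by a delayed
        # multipath map creation.
        threading.Timer(self._delay, self._map_device).start()
        return 0

    def _map_device(self):
        self._visible = True

    def get_devices_visibility(self):
        return self._visible


def extend_without_wait(host):
    # Mirrors the reported flow: check visibility right after connect,
    # racing with the asynchronous multipath mapping.
    host.connect_storage_server()
    return host.get_devices_visibility()


def extend_with_wait(host, timeout=1.0):
    # Same flow, but polling until the device shows up or we time out.
    host.connect_storage_server()
    deadline = time.monotonic() + timeout
    while not host.get_devices_visibility():
        if time.monotonic() >= deadline:
            return False
        time.sleep(0.01)
    return True
```

Running `extend_without_wait` against a fresh `FakeHost` reproduces the failure, while `extend_with_wait` succeeds, which is the essence of the polling fix discussed in the later comments.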
Comment 2 RHEL Product and Program Management 2012-05-14 00:04:48 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 3 Dan Kenigsberg 2012-05-21 15:03:09 EDT
Eduardo, Haim, don't we have an old bug about this race with multipath? Haven't we agreed that it can only be solved with polling+timeout on Engine?
Comment 5 Eduardo Warszawski 2012-07-08 05:52:10 EDT
The host is connected to the target in order to view a specific lun.
It takes some time (a few seconds) until the mapper is refreshed and the multipath device for the lun is created.
Device visibility is tested before that happens and therefore fails.

If connect succeeded but the device is still invisible, the engine should retry after a few seconds.
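The retry suggested here could look like the following sketch. It assumes a hypothetical client callable that mirrors vdsm's getDevicesVisibility verb, returning a dict of guid to boolean; the function name and signature are illustrative, not the actual engine API:

```python
import time


def wait_for_device_visibility(get_devices_visibility, guids,
                               timeout=30, interval=2):
    """Poll until all guids are visible on the host, or raise on timeout.

    get_devices_visibility is assumed to take a list of guids and return
    a dict mapping each guid to a boolean, mirroring the shape of vdsm's
    getDevicesVisibility response ({'visible': {guid: bool}}).
    """
    deadline = time.monotonic() + timeout
    while True:
        visible = get_devices_visibility(guids)
        missing = [g for g in guids if not visible.get(g)]
        if not missing:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "devices never became visible: %s" % missing)
        time.sleep(interval)
```

With such a helper, the engine would call connectStorageServer, then wait for visibility, and only then proceed with the VG extend, instead of failing the whole flow on the first negative visibility report.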
Comment 6 Eduardo Warszawski 2012-08-01 08:52:04 EDT
*** Bug 807759 has been marked as a duplicate of this bug. ***
Comment 7 Eduardo Warszawski 2012-08-09 03:24:26 EDT
*** Bug 834851 has been marked as a duplicate of this bug. ***
Comment 9 Eduardo Warszawski 2012-08-09 04:31:19 EDT
*** Bug 836122 has been marked as a duplicate of this bug. ***
Comment 11 Eduardo Warszawski 2012-08-12 10:09:04 EDT
Reproduced in a loop, with manual connection (i.e., without connectStorageServer).
Comment 12 Gadi Ickowicz 2012-08-12 10:25:08 EDT
Created attachment 603786 [details]
vdsm logs

vdsm logs with the failed vm creation.

relevant lines from log:

Thread-1775::DEBUG::2012-08-12 16:50:54,815::BindingXMLRPC::864::vds::(wrapper) client [10.35.160.95]::call vmCreate with ({'bridge': 'rhevm', 'macAddr': 'aa:bb:dd:aa:bb:cc', 'nicModel': 'pv', 'vmId': '7096b660-d557-49e9-b682-f21cfea20fa3', 'devices': [{'index': '0', 'iface': 'ide', 'format': 'raw', 'readonly': 'False', 'propagateErrors': 'off', 'device': 'disk', 'GUID': '3514f0c5695800005', 'type': 'disk'}], 'memSize': '256', 'vmName': '7096b660-d557-49e9-b682-f21cfea20fa3', 'display': 'vnc'},) {}
Thread-1775::INFO::2012-08-12 16:50:54,831::API::579::vds::(_getNetworkIp) network None: using 0
Thread-1775::INFO::2012-08-12 16:50:54,832::API::216::vds::(create) vmContainerLock acquired by vm 7096b660-d557-49e9-b682-f21cfea20fa3
Thread-1776::DEBUG::2012-08-12 16:50:54,833::vm::564::vm.Vm::(_startUnderlyingVm) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::Start
Thread-1775::DEBUG::2012-08-12 16:50:54,834::API::229::vds::(create) Total desktops after creation of 7096b660-d557-49e9-b682-f21cfea20fa3 is 1
Thread-1776::DEBUG::2012-08-12 16:50:54,834::vm::568::vm.Vm::(_startUnderlyingVm) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::_ongoingCreations acquired
Thread-1775::DEBUG::2012-08-12 16:50:54,835::BindingXMLRPC::870::vds::(wrapper) return vmCreate with {'status': {'message': 'Done', 'code': 0}, 'vmList': {'status': 'WaitForLaunch', 'bridge': 'rhevm', 'macAddr': 'aa:bb:dd:aa:bb:cc', 'vmId': '7096b660-d557-49e9-b682-f21cfea20fa3', 'nicModel': 'pv', 'displayIp': '0', 'pid': '0', 'devices': [{'index': '0', 'iface': 'ide', 'format': 'raw', 'readonly': 'False', 'propagateErrors': 'off', 'device': 'disk', 'GUID': '3514f0c5695800005', 'type': 'disk'}], 'vmType': 'kvm', 'displaySecurePort': '-1', 'memSize': '256', 'displayPort': '-1', 'clientIp': '', 'vmName': '7096b660-d557-49e9-b682-f21cfea20fa3', 'display': 'vnc'}}
Thread-1776::INFO::2012-08-12 16:50:54,835::libvirtvm::1289::vm.Vm::(_run) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::VM wrapper has started
Thread-1776::DEBUG::2012-08-12 16:50:54,836::vm::580::vm.Vm::(_startUnderlyingVm) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::_ongoingCreations released
Thread-1776::ERROR::2012-08-12 16:50:54,837::vm::604::vm.Vm::(_startUnderlyingVm) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 570, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1294, in _run
    self.preparePaths(devices[vm.DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 616, in preparePaths
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 190, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'index': '0', 'iface': 'ide', 'format': 'raw', 'readonly': 'False', 'propagateErrors': 'off', 'device': 'disk', 'GUID': '3514f0c5695800005', 'type': 'disk'}
Thread-1776::DEBUG::2012-08-12 16:50:54,840::vm::920::vm.Vm::(setDownStatus) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::Changed state to Down: Bad volume specification {'index': '0', 'iface': 'ide', 'format': 'raw', 'readonly': 'False', 'propagateErrors': 'off', 'device': 'disk', 'GUID': '3514f0c5695800005', 'type': 'disk'}
Comment 13 Gadi Ickowicz 2012-08-13 02:36:53 EDT
Created attachment 603857 [details]
vdsm logs from long run

After 2000 runs, the test failed 13 times.
Comment 14 Ayal Baron 2012-08-22 06:32:11 EDT
This should be mitigated by "Bug 844180 - getDeviceList takes too long due to synchronous scsi-rescan that is taking too long", which should improve the statistics.
Not closing because we still need to investigate whether we can solve this in a non-racy way.
Comment 16 Gadi Ickowicz 2012-09-10 03:20:09 EDT
Created attachment 611318 [details]
vdsm + engine logs

I ran into this race in an automated test using 2 hosts as well. When creating the first storage domain, the createStorageDomain succeeds, but createStoragePool fails (because it was run on the second host, which had connectStorageServer sent shortly before the createStoragePool)

Thread-41::INFO::2012-09-09 14:57:05,426::logUtils::37::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=3, spUUID='46d3b675-aec7-4f01-82a9-4a4459b49210', conList=[{'connection': '10.35.160.7', 'iqn': 'iqn.2008-05.com.xtremio:001b21b545c0', 'portal': '1', 'user': '', 'password': '******', 'id': '77823ee5-d5a0-4b80-adba-1120528ee440', 'port': '3260'}], options=None)
Thread-41::DEBUG::2012-09-09 14:57:05,427::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:001b21b545c0 -I default -p 10.35.160.7:3260 --op=new' (cwd None)
Thread-41::DEBUG::2012-09-09 14:57:05,443::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-41::DEBUG::2012-09-09 14:57:05,444::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:001b21b545c0 -I default -p 10.35.160.7:3260 -l' (cwd None)
Thread-41::DEBUG::2012-09-09 14:57:05,988::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-41::DEBUG::2012-09-09 14:57:05,994::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:001b21b545c0 -I default -p 10.35.160.7:3260 -n node.startup -v manual --op=update' (cwd None)
Thread-41::DEBUG::2012-09-09 14:57:06,024::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-41::DEBUG::2012-09-09 14:57:06,025::lvm::460::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-41::DEBUG::2012-09-09 14:57:06,029::lvm::462::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-41::DEBUG::2012-09-09 14:57:06,030::lvm::472::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-41::DEBUG::2012-09-09 14:57:06,030::lvm::474::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-41::DEBUG::2012-09-09 14:57:06,030::lvm::493::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-41::DEBUG::2012-09-09 14:57:06,031::lvm::495::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-41::INFO::2012-09-09 14:57:06,031::logUtils::39::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '77823ee5-d5a0-4b80-adba-1120528ee440'}]}
Thread-41::DEBUG::2012-09-09 14:57:06,031::task::1172::TaskManager.Task::(prepare) Task=`76d94924-4e07-45e0-b966-cfe25b9107bf`::finished: {'statuslist': [{'status': 0, 'id': '77823ee5-d5a0-4b80-adba-1120528ee440'}]}
Thread-41::DEBUG::2012-09-09 14:57:06,032::task::588::TaskManager.Task::(_updateState) Task=`76d94924-4e07-45e0-b966-cfe25b9107bf`::moving from state preparing -> state finished
Thread-41::DEBUG::2012-09-09 14:57:06,032::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-41::DEBUG::2012-09-09 14:57:06,032::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-41::DEBUG::2012-09-09 14:57:06,033::task::978::TaskManager.Task::(_decref) Task=`76d94924-4e07-45e0-b966-cfe25b9107bf`::ref 0 aborting False
Thread-42::DEBUG::2012-09-09 14:57:06,065::BindingXMLRPC::164::vds::(wrapper) [10.35.70.44]
Thread-42::DEBUG::2012-09-09 14:57:06,066::task::588::TaskManager.Task::(_updateState) Task=`1c4fd0ec-8a52-4431-a020-1abb8eab3355`::moving from state init -> state preparing
Thread-42::INFO::2012-09-09 14:57:06,066::logUtils::37::dispatcher::(wrapper) Run and protect: createStoragePool(poolType=None, spUUID='46d3b675-aec7-4f01-82a9-4a4459b49210', poolName='TestDataCenter', masterDom='925bec04-5d43-4be2-b799-eca7ddc0ecb8', domList=['925bec04-5d43-4be2-b799-eca7ddc0ecb8'], masterVersion=1, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
Thread-42::DEBUG::2012-09-09 14:57:06,067::misc::1080::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-42::DEBUG::2012-09-09 14:57:06,067::misc::1082::SamplingMethod::(__call__) Got in to sampling method
Thread-42::DEBUG::2012-09-09 14:57:06,067::misc::1080::SamplingMethod::(__call__) Trying to enter sampling method (storage.iscsi.rescan)
Thread-42::DEBUG::2012-09-09 14:57:06,068::misc::1082::SamplingMethod::(__call__) Got in to sampling method
Thread-42::DEBUG::2012-09-09 14:57:06,068::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
Thread-42::DEBUG::2012-09-09 14:57:06,085::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-42::DEBUG::2012-09-09 14:57:06,086::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-42::DEBUG::2012-09-09 14:57:06,862::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/multipath' (cwd None)
Thread-42::DEBUG::2012-09-09 14:57:06,923::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-42::DEBUG::2012-09-09 14:57:06,923::lvm::460::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,924::lvm::462::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,924::lvm::472::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,925::lvm::474::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,925::lvm::493::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,925::lvm::495::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,926::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-42::DEBUG::2012-09-09 14:57:06,926::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,928::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"a%3514f0c56958003d0|3514f0c56958003d1|3514f0c56958003d2%\\", \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 925bec04-5d43-4be2-b799-eca7ddc0ecb8' (cwd None)
Thread-42::DEBUG::2012-09-09 14:57:07,232::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = '  Volume group "925bec04-5d43-4be2-b799-eca7ddc0ecb8" not found\n'; <rc> = 5
Thread-42::WARNING::2012-09-09 14:57:07,234::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['  Volume group "925bec04-5d43-4be2-b799-eca7ddc0ecb8" not found']
Thread-42::DEBUG::2012-09-09 14:57:07,235::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Comment 17 Nir Soffer 2013-12-11 03:19:40 EST
(In reply to Gadi Ickowicz from comment #16)
> Created attachment 611318 [details]
> vdsm + engine logs
> 
> I ran into this race in an automated test using 2 hosts as well. When
> creating the first storage domain, the createStorageDomain succeeds, but
> createStoragePool fails (because it was run on the second host, which had
> connectStorageServer sent shortly before the createStoragePool)

On which version? Does this happen on 3.3?
Comment 18 Gadi Ickowicz 2013-12-16 03:58:44 EST
(In reply to Nir Soffer from comment #17)
> (In reply to Gadi Ickowicz from comment #16)
> > Created attachment 611318 [details]
> > vdsm + engine logs
> > 
> > I ran into this race in an automated test using 2 hosts as well. When
> > creating the first storage domain, the createStorageDomain succeeds, but
> > createStoragePool fails (because it was run on the second host, which had
> > connectStorageServer sent shortly before the createStoragePool)
> 
> On which version? Does this happen on 3.3?

Just tested it on is27, it works fine
Comment 19 Nir Soffer 2013-12-16 05:26:21 EST
(In reply to Gadi Ickowicz from comment #18)
> (In reply to Nir Soffer from comment #17)
> > (In reply to Gadi Ickowicz from comment #16)
> > > Created attachment 611318 [details]
> > > vdsm + engine logs
> > > 
> > > I ran into this race in an automated test using 2 hosts as well. When
> > > creating the first storage domain, the createStorageDomain succeeds, but
> > > createStoragePool fails (because it was run on the second host, which had
> > > connectStorageServer sent shortly before the createStoragePool)
> > 
> > On which version? Does this happen on 3.3?
> 
> Just tested it on is27, it works fine

Do you want to close this?