Bug 820486
| Summary: | [multipath] extend storage domain fails in 2 hosts topology due to race in multipath | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Haim <hateya> | ||||||||
| Component: | ovirt-engine | Assignee: | Ayal Baron <abaron> | ||||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | yeylon <yeylon> | ||||||||
| Severity: | high | Docs Contact: | |||||||||
| Priority: | unspecified | ||||||||||
| Version: | 3.1.0 | CC: | abaron, acathrow, amureini, bazulay, danken, dron, gickowic, iheim, istein, jkt, lpeer, nsoffer, Rhev-m-bugs, scohen, srevivo, vchen, yeylon | ||||||||
| Target Milestone: | --- | ||||||||||
| Target Release: | 3.3.0 | ||||||||||
| Hardware: | x86_64 | ||||||||||
| OS: | Linux | ||||||||||
| Whiteboard: | storage | ||||||||||
| Fixed In Version: | is27 | Doc Type: | Bug Fix | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2014-02-08 16:02:58 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | 844180 | ||||||||||
| Bug Blocks: | |||||||||||
| Attachments: | |
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

Eduardo, Haim, don't we have an old bug about this race with multipath? Haven't we agreed that it can only be solved with polling plus a timeout on the engine side?

The host is connected to the target in order to view a specific LUN. It takes time (a few seconds) until the mapper is refreshed and the multipath device for the LUN is created. Device visibility is tested before that happens and therefore fails. If connect succeeded but the device is still invisible, the engine should retry after a few seconds.

*** Bug 807759 has been marked as a duplicate of this bug. ***

*** Bug 834851 has been marked as a duplicate of this bug. ***

*** Bug 836122 has been marked as a duplicate of this bug. ***

Reproduced in a loop, with manual connection (i.e., without connectStorageServer).

Created attachment 603786 [details]
vdsm logs
vdsm logs with the failed vm creation.
relevant lines from log:
Thread-1775::DEBUG::2012-08-12 16:50:54,815::BindingXMLRPC::864::vds::(wrapper) client [10.35.160.95]::call vmCreate with ({'bridge': 'rhevm', 'macAddr': 'aa:bb:dd:aa:bb:cc', 'nicModel': 'pv', 'vmId': '7096b660-d557-49e9-b682-f21cfea20fa3', 'devices': [{'index': '0', 'iface': 'ide', 'format': 'raw', 'readonly': 'False', 'propagateErrors': 'off', 'device': 'disk', 'GUID': '3514f0c5695800005', 'type': 'disk'}], 'memSize': '256', 'vmName': '7096b660-d557-49e9-b682-f21cfea20fa3', 'display': 'vnc'},) {}
Thread-1775::INFO::2012-08-12 16:50:54,831::API::579::vds::(_getNetworkIp) network None: using 0
Thread-1775::INFO::2012-08-12 16:50:54,832::API::216::vds::(create) vmContainerLock acquired by vm 7096b660-d557-49e9-b682-f21cfea20fa3
Thread-1776::DEBUG::2012-08-12 16:50:54,833::vm::564::vm.Vm::(_startUnderlyingVm) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::Start
Thread-1775::DEBUG::2012-08-12 16:50:54,834::API::229::vds::(create) Total desktops after creation of 7096b660-d557-49e9-b682-f21cfea20fa3 is 1
Thread-1776::DEBUG::2012-08-12 16:50:54,834::vm::568::vm.Vm::(_startUnderlyingVm) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::_ongoingCreations acquired
Thread-1775::DEBUG::2012-08-12 16:50:54,835::BindingXMLRPC::870::vds::(wrapper) return vmCreate with {'status': {'message': 'Done', 'code': 0}, 'vmList': {'status': 'WaitForLaunch', 'bridge': 'rhevm', 'macAddr': 'aa:bb:dd:aa:bb:cc', 'vmId': '7096b660-d557-49e9-b682-f21cfea20fa3', 'nicModel': 'pv', 'displayIp': '0', 'pid': '0', 'devices': [{'index': '0', 'iface': 'ide', 'format': 'raw', 'readonly': 'False', 'propagateErrors': 'off', 'device': 'disk', 'GUID': '3514f0c5695800005', 'type': 'disk'}], 'vmType': 'kvm', 'displaySecurePort': '-1', 'memSize': '256', 'displayPort': '-1', 'clientIp': '', 'vmName': '7096b660-d557-49e9-b682-f21cfea20fa3', 'display': 'vnc'}}
Thread-1776::INFO::2012-08-12 16:50:54,835::libvirtvm::1289::vm.Vm::(_run) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::VM wrapper has started
Thread-1776::DEBUG::2012-08-12 16:50:54,836::vm::580::vm.Vm::(_startUnderlyingVm) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::_ongoingCreations released
Thread-1776::ERROR::2012-08-12 16:50:54,837::vm::604::vm.Vm::(_startUnderlyingVm) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::The vm start process failed
Traceback (most recent call last):
File "/usr/share/vdsm/vm.py", line 570, in _startUnderlyingVm
self._run()
File "/usr/share/vdsm/libvirtvm.py", line 1294, in _run
self.preparePaths(devices[vm.DISK_DEVICES])
File "/usr/share/vdsm/vm.py", line 616, in preparePaths
drive['path'] = self.cif.prepareVolumePath(drive, self.id)
File "/usr/share/vdsm/clientIF.py", line 190, in prepareVolumePath
raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'index': '0', 'iface': 'ide', 'format': 'raw', 'readonly': 'False', 'propagateErrors': 'off', 'device': 'disk', 'GUID': '3514f0c5695800005', 'type': 'disk'}
Thread-1776::DEBUG::2012-08-12 16:50:54,840::vm::920::vm.Vm::(setDownStatus) vmId=`7096b660-d557-49e9-b682-f21cfea20fa3`::Changed state to Down: Bad volume specification {'index': '0', 'iface': 'ide', 'format': 'raw', 'readonly': 'False', 'propagateErrors': 'off', 'device': 'disk', 'GUID': '3514f0c5695800005', 'type': 'disk'}
Created attachment 603857 [details]
vdsm logs from long run
After 2000 runs, the test failed 13 times.
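The engine-side fix suggested earlier (poll device visibility with a timeout instead of failing on the first check) could be sketched as follows. This is a minimal illustration: the function name, parameters, and defaults are hypothetical, and `get_visibility` is only assumed to behave like vdsm's getDevicesVisibility verb, returning a dict of `{guid: bool}`.

```python
import time

def wait_for_devices_visible(get_visibility, guids, timeout=30.0, interval=2.0):
    """Poll until all GUIDs are visible or the timeout expires.

    Hypothetical helper, not vdsm's actual API. get_visibility is assumed
    to return {guid: bool}, like the getDevicesVisibility verb.
    Returns True once every GUID is visible, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        visible = get_visibility(guids)
        if all(visible.get(g) for g in guids):
            return True
        if time.monotonic() >= deadline:
            # Caller may roll back here, as the engine does today.
            return False
        time.sleep(interval)
```

With such a loop, a multipath map that appears a few seconds after login would be picked up on a later iteration instead of failing the whole flow.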
This should be mitigated by bug 844180 ("getDeviceList takes too long due to synchronous scsi-rescan that is taking too long"), which should improve the statistics. Not closing, because we still need to investigate whether we can solve this in a non-racy way.

Created attachment 611318 [details]
vdsm + engine logs
I ran into this race in an automated test using 2 hosts as well. When creating the first storage domain, the createStorageDomain succeeds, but createStoragePool fails (because it was run on the second host, which had connectStorageServer sent shortly before the createStoragePool)
Thread-41::INFO::2012-09-09 14:57:05,426::logUtils::37::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=3, spUUID='46d3b675-aec7-4f01-82a9-4a4459b49210', conList=[{'connection': '10.35.160.7', 'iqn': 'iqn.2008-05.com.xtremio:001b21b545c0', 'portal': '1', 'user': '', 'password': '******', 'id': '77823ee5-d5a0-4b80-adba-1120528ee440', 'port': '3260'}], options=None)
Thread-41::DEBUG::2012-09-09 14:57:05,427::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:001b21b545c0 -I default -p 10.35.160.7:3260 --op=new' (cwd None)
Thread-41::DEBUG::2012-09-09 14:57:05,443::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-41::DEBUG::2012-09-09 14:57:05,444::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:001b21b545c0 -I default -p 10.35.160.7:3260 -l' (cwd None)
Thread-41::DEBUG::2012-09-09 14:57:05,988::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-41::DEBUG::2012-09-09 14:57:05,994::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:001b21b545c0 -I default -p 10.35.160.7:3260 -n node.startup -v manual --op=update' (cwd None)
Thread-41::DEBUG::2012-09-09 14:57:06,024::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-41::DEBUG::2012-09-09 14:57:06,025::lvm::460::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-41::DEBUG::2012-09-09 14:57:06,029::lvm::462::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-41::DEBUG::2012-09-09 14:57:06,030::lvm::472::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-41::DEBUG::2012-09-09 14:57:06,030::lvm::474::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-41::DEBUG::2012-09-09 14:57:06,030::lvm::493::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-41::DEBUG::2012-09-09 14:57:06,031::lvm::495::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-41::INFO::2012-09-09 14:57:06,031::logUtils::39::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '77823ee5-d5a0-4b80-adba-1120528ee440'}]}
Thread-41::DEBUG::2012-09-09 14:57:06,031::task::1172::TaskManager.Task::(prepare) Task=`76d94924-4e07-45e0-b966-cfe25b9107bf`::finished: {'statuslist': [{'status': 0, 'id': '77823ee5-d5a0-4b80-adba-1120528ee440'}]}
Thread-41::DEBUG::2012-09-09 14:57:06,032::task::588::TaskManager.Task::(_updateState) Task=`76d94924-4e07-45e0-b966-cfe25b9107bf`::moving from state preparing -> state finished
Thread-41::DEBUG::2012-09-09 14:57:06,032::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-41::DEBUG::2012-09-09 14:57:06,032::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-41::DEBUG::2012-09-09 14:57:06,033::task::978::TaskManager.Task::(_decref) Task=`76d94924-4e07-45e0-b966-cfe25b9107bf`::ref 0 aborting False
Thread-42::DEBUG::2012-09-09 14:57:06,065::BindingXMLRPC::164::vds::(wrapper) [10.35.70.44]
Thread-42::DEBUG::2012-09-09 14:57:06,066::task::588::TaskManager.Task::(_updateState) Task=`1c4fd0ec-8a52-4431-a020-1abb8eab3355`::moving from state init -> state preparing
Thread-42::INFO::2012-09-09 14:57:06,066::logUtils::37::dispatcher::(wrapper) Run and protect: createStoragePool(poolType=None, spUUID='46d3b675-aec7-4f01-82a9-4a4459b49210', poolName='TestDataCenter', masterDom='925bec04-5d43-4be2-b799-eca7ddc0ecb8', domList=['925bec04-5d43-4be2-b799-eca7ddc0ecb8'], masterVersion=1, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
Thread-42::DEBUG::2012-09-09 14:57:06,067::misc::1080::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-42::DEBUG::2012-09-09 14:57:06,067::misc::1082::SamplingMethod::(__call__) Got in to sampling method
Thread-42::DEBUG::2012-09-09 14:57:06,067::misc::1080::SamplingMethod::(__call__) Trying to enter sampling method (storage.iscsi.rescan)
Thread-42::DEBUG::2012-09-09 14:57:06,068::misc::1082::SamplingMethod::(__call__) Got in to sampling method
Thread-42::DEBUG::2012-09-09 14:57:06,068::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
Thread-42::DEBUG::2012-09-09 14:57:06,085::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-42::DEBUG::2012-09-09 14:57:06,086::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-42::DEBUG::2012-09-09 14:57:06,862::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/multipath' (cwd None)
Thread-42::DEBUG::2012-09-09 14:57:06,923::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-42::DEBUG::2012-09-09 14:57:06,923::lvm::460::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,924::lvm::462::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,924::lvm::472::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,925::lvm::474::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,925::lvm::493::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,925::lvm::495::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,926::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-42::DEBUG::2012-09-09 14:57:06,926::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-42::DEBUG::2012-09-09 14:57:06,928::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"a%3514f0c56958003d0|3514f0c56958003d1|3514f0c56958003d2%\\", \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 925bec04-5d43-4be2-b799-eca7ddc0ecb8' (cwd None)
Thread-42::DEBUG::2012-09-09 14:57:07,232::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ' Volume group "925bec04-5d43-4be2-b799-eca7ddc0ecb8" not found\n'; <rc> = 5
Thread-42::WARNING::2012-09-09 14:57:07,234::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "925bec04-5d43-4be2-b799-eca7ddc0ecb8" not found']
Thread-42::DEBUG::2012-09-09 14:57:07,235::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
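For context, the refresh sequence visible in the log above (iSCSI session rescan, then multipath, then LVM cache invalidation and `vgs`) can be sketched as below. This is a simplified illustration reconstructed from the log lines, not vdsm's actual implementation. The race is that `multipath` returns before udev/device-mapper has necessarily created the new map, so the subsequent `vgs` may not see the new PV.

```python
import subprocess

def refresh_storage(run=subprocess.check_call):
    """Simplified sketch of the refresh flow seen in the log above.

    Illustrative only, not vdsm's real code. `run` defaults to
    subprocess.check_call; tests can inject a fake to record the calls.
    """
    # 1. Rescan all iSCSI sessions for newly exposed LUNs.
    run(["iscsiadm", "-m", "session", "-R"])
    # 2. Ask multipath to build maps for newly visible paths. This
    #    returns before the /dev/mapper node is guaranteed to exist.
    run(["multipath"])
    # 3. vdsm then invalidates its LVM caches and queries the expected
    #    VG; if the map is not up yet, this fails with
    #    'Volume group ... not found', exactly as in the log above.
    run(["lvm", "vgs", "--noheadings"])
```

Injecting a recording function for `run` makes the ordering easy to verify without touching real storage.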
(In reply to Gadi Ickowicz from comment #16)
> Created attachment 611318 [details]
> vdsm + engine logs
>
> I ran into this race in an automated test using 2 hosts as well. When
> creating the first storage domain, the createStorageDomain succeeds, but
> createStoragePool fails (because it was run on the second host, which had
> connectStorageServer sent shortly before the createStoragePool)

On which version? Does this happen on 3.3?

(In reply to Nir Soffer from comment #17)
> On which version? Does this happen on 3.3?

Just tested it on is27, it works fine.

(In reply to Gadi Ickowicz from comment #18)
> Just tested it on is27, it works fine

Do you want to close this?
Description of problem:

- edit storage domain (2 hosts in the data center, 1 is SPM)
- discover new target
- login to target
- select new LUN
- hit enter (VG extend flow)

Result: the operation fails, as the second host failed to connect to storage.

What really happens:

- connectStorageServer is sent to the second host - successful
- getDevicesVisibility is sent to the second host afterwards - failed:

Thread-57316::INFO::2012-05-10 09:04:57,230::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getDevicesVisibility, args: ( guids=['1pvcreate-lun21324475'])
Thread-57316::DEBUG::2012-05-10 09:04:57,231::task::495::TaskManager.Task::(_debug) Task c192225e-5922-4301-a634-fb8386edfd2e: moving from state init -> state preparing
Thread-57316::DEBUG::2012-05-10 09:04:57,231::task::495::TaskManager.Task::(_debug) Task c192225e-5922-4301-a634-fb8386edfd2e: finished: {'visible': {'1pvcreate-lun21324475': False}}
Thread-57316::DEBUG::2012-05-10 09:04:57,232::task::495::TaskManager.Task::(_debug) Task c192225e-5922-4301-a634-fb8386edfd2e: moving from state preparing -> state finished
Thread-57316::DEBUG::2012-05-10 09:04:57,232::resourceManager::786::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-57316::DEBUG::2012-05-10 09:04:57,233::resourceManager::821::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-57316::DEBUG::2012-05-10 09:04:57,233::task::495::TaskManager.Task::(_debug) Task c192225e-5922-4301-a634-fb8386edfd2e: ref 0 aborting False
Thread-57316::INFO::2012-05-10 09:04:57,234::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getDevicesVisibility, Return response: {'status': {'message': 'OK', 'code': 0}, 'visible': {'1pvcreate-lun21324475': False}}

multipath fails to map the new device in time, hence the command fails; the backend then sends disconnectStorageServer and the operation is rolled back.

Tested on vdsm 6.2.z++ (vdsm-4.9-112.14.1.g407afd.newLvmOn3_0.x86_64)