Bug 967311
Summary: | [storage] In scale environment, some hosts become Non Responsive when adding first Storage Domain – java.util.concurrent.TimeoutException
---|---
Product: | Red Hat Enterprise Virtualization Manager
Reporter: | vvyazmin <vvyazmin>
Component: | vdsm
Assignee: | Allon Mureinik <amureini>
Status: | CLOSED CURRENTRELEASE
QA Contact: | Yuri Obshansky <yobshans>
Severity: | medium
Docs Contact: |
Priority: | unspecified
Version: | 3.2.0
CC: | acanan, adahms, amureini, bazulay, iheim, jkt, lpeer, scohen, yeylon
Target Milestone: | ---
Target Release: | 3.5.0
Hardware: | x86_64
OS: | Linux
Whiteboard: | storage
Fixed In Version: |
Doc Type: | Bug Fix
Doc Text: |
Story Points: | ---
Clone Of: |
Environment: |
Last Closed: | 2014-06-09 13:24:44 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | Storage
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Attachments: |
Created attachment 755807 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm
What extra info is needed from me?

All hosts except the SPM went to Non-Operational state after activating the Storage Domain. I changed one host's state to Maintenance and activated it successfully.

Bug verified on version: 3.4.0-0.16.rc.el6ev
OS Version: RHEL - 6Server - 6.5.0.1.el6
Kernel Version: 2.6.32 - 431.5.1.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.6
LIBVIRT Version: libvirt-0.10.2-29.el6_5.5
VDSM Version: vdsm-4.14.7-0.2.rc.el6ev

2014-05-21 17:21:45,402 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2c1cc336] Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=304, mMessage=Cannot find master domain: 'spUUID=ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, msdUUID=8e319f62-698e-4386-9866-a24cc5529be6']]
2014-05-21 17:21:45,402 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2c1cc336] HostName = fake_host_20
2014-05-21 17:21:45,402 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2c1cc336] Command ConnectStoragePoolVDSCommand(HostName = fake_host_20, HostId = 47fda055-3e3c-4b50-a029-f8df9d630597, storagePoolId = ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, vds_spm_id = 20, masterDomainId = 8e319f62-698e-4386-9866-a24cc5529be6, masterVersion = 1) execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, msdUUID=8e319f62-698e-4386-9866-a24cc5529be6'

Created attachment 898042 [details]
Snapshot
Hosts in Non-Operational and Unsigned states
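For triage, the status code and message can be pulled out of engine.log lines such as the ConnectStoragePoolVDSCommand return value quoted above. This is a minimal sketch, assuming the line keeps the `mCode=..., mMessage=...]]` layout seen in this report. The class `EngineLogStatus` and method `parseStatus` are hypothetical helpers and not part of the engine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EngineLogStatus {

    // Matches "mCode=<digits>, mMessage=<text>]]" as it appears in the
    // StatusForXmlRpc fragment of the log line (layout assumed from this report).
    private static final Pattern STATUS =
            Pattern.compile("mCode=(\\d+), mMessage=(.*?)\\]\\]");

    /** Returns {code, message}, or null when the line carries no status fragment. */
    static String[] parseStatus(String line) {
        Matcher m = STATUS.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String line = "Command ConnectStoragePoolVDSCommand return value "
                + "StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=304, "
                + "mMessage=Cannot find master domain: "
                + "'spUUID=ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, "
                + "msdUUID=8e319f62-698e-4386-9866-a24cc5529be6']]";
        String[] status = parseStatus(line);
        System.out.println("code=" + status[0] + " message=" + status[1]);
    }
}
```

Running this over the whole log and grouping by the `HostName =` lines would show which hosts returned code 304; that grouping step is left out here.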
Created attachment 898345 [details]
Engine log
Created attachment 898346 [details]
vdsm log
I've changed Severity to Urgent because, after bug verification, some of the hosts switched to the Unsigned state and I can do nothing with RHEVM except clean up and populate the data again. This looks like a very important bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html
Created attachment 753321 [details]
## Logs rhevm, vdsm, libvirt

Description of problem:
In a scale environment, some hosts become Non Responsive when adding the first Storage Domain – java.util.concurrent.TimeoutException

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF17.1 environment:
RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 50 hosts (in my case, 50 fake hosts)
2. Add the first Storage Domain

Actual results:
Some hosts become Non-Responsive

Expected results:
The first Storage Domain is added successfully (in a scale environment) without problems

Impact on user:

Workaround:
Put the host into maintenance mode, then reinstall VDSM via the UI

Additional info:

/var/log/ovirt-engine/engine.log

2013-05-26 13:18:07,260 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-61) [2d63dceb] Command ConnectStorageServerVDS execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2013-05-26 13:18:07,261 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-4-thread-61) [2d63dceb] FINISH, ConnectStorageServerVDSCommand, log id: cc17602
2013-05-26 13:18:07,261 ERROR [org.ovirt.engine.core.bll.storage.ConnectSingleAsyncOperation] (pool-4-thread-61) [2d63dceb] Failed to connect host Fake_Host_039 to storage pool 005_Fake_Host_DataCenter. Exception: {3}: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
    at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:167) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.runConnectionStorageToDomain(ISCSIStorageHelper.java:54) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.runConnectionStorageToDomain(ISCSIStorageHelper.java:29) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.connectStorageToDomainByVdsId(ISCSIStorageHelper.java:216) [engine-bll.jar:]
    at org.ovirt.engine.core.bll.storage.ConnectSingleAsyncOperation.execute(ConnectSingleAsyncOperation.java:18) [engine-bll.jar:]
    at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:42) [engine-utils.jar:]
    at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:31) [engine-utils.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalCallable.call(ThreadPoolUtil.java:99) [engine-utils.jar:]
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_19]
    at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_19]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_19]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_19]
    at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_19]

/var/log/vdsm/vdsm.log
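The engine.log stack trace above shows the timeout surfacing through a FutureTask running on a ThreadPoolExecutor. As an illustration only, the sketch below shows one generic way a java.util.concurrent.TimeoutException can appear when many per-host calls share a small pool and each result is awaited with a deadline. It is not the engine's code and makes no claim about the root cause of this bug: `ConnectTimeoutSketch`, `connectHost`, the pool size and the timings are all hypothetical, and it assumes Java 8 or later for the lambda.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ConnectTimeoutSketch {

    /** Hypothetical stand-in for a slow per-host storage connection call. */
    static Boolean connectHost(int hostId, long workMillis) throws InterruptedException {
        Thread.sleep(workMillis);
        return Boolean.TRUE;
    }

    /**
     * Submits one connect task per host to a fixed-size pool, then waits up to
     * timeoutMillis for each result in turn. Returns the ids of the hosts whose
     * result was not ready before the deadline.
     */
    static List<Integer> connectAll(int hosts, int poolSize, long workMillis, long timeoutMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (int i = 0; i < hosts; i++) {
            final int hostId = i;
            futures.add(pool.submit(() -> connectHost(hostId, workMillis)));
        }
        List<Integer> timedOut = new ArrayList<>();
        try {
            for (int i = 0; i < futures.size(); i++) {
                try {
                    futures.get(i).get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // The task is still queued or running; record the host as timed out.
                    timedOut.add(i);
                }
            }
        } finally {
            pool.shutdownNow(); // interrupt whatever is still running
        }
        return timedOut;
    }

    public static void main(String[] args) throws Exception {
        // Enough workers for every host: each call finishes well inside the deadline.
        System.out.println("wide pool timed out:   " + connectAll(3, 3, 10, 5000));
        // One worker for three slow calls: results are not ready before the deadline.
        System.out.println("narrow pool timed out: " + connectAll(3, 1, 300, 100));
    }
}
```

In the narrow-pool case the later tasks have not even started when their deadline expires, which is the kind of queueing effect a 50-host setup could expose. Whether that is what happened here would need the thread dump and vdsm logs attached to this bug.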