Bug 967311

Summary: [storage] In scale environment, some hosts become Non Responsive when adding first Storage Domain – java.util.concurrent.TimeoutException
Product: Red Hat Enterprise Virtualization Manager Reporter: vvyazmin <vvyazmin>
Component: vdsm Assignee: Allon Mureinik <amureini>
Status: CLOSED CURRENTRELEASE QA Contact: Yuri Obshansky <yobshans>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.2.0 CC: acanan, adahms, amureini, bazulay, iheim, jkt, lpeer, scohen, yeylon
Target Milestone: ---   
Target Release: 3.5.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-06-09 13:24:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
## Logs rhevm, vdsm, libvirt none
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm none
Snapshot none
Engine log none
vdsm log none

Description vvyazmin@redhat.com 2013-05-26 13:02:57 UTC
Created attachment 753321 [details]
## Logs rhevm, vdsm, libvirt

Description of problem:
In a scale environment, some hosts become Non Responsive when the first Storage Domain is added. The engine log shows java.util.concurrent.TimeoutException.

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF17.1 environment:

RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 50 hosts (in my case, 50 fake hosts)
2. Add the first Storage Domain
  
Actual results:
Some hosts become Non-Responsive

Expected results:
Adding the first Storage Domain (in a scale environment) succeeds without problems

Impact on user:

Workaround:
Put the host into maintenance mode, then reinstall VDSM via the UI

Additional info:

/var/log/ovirt-engine/engine.log
2013-05-26 13:18:07,260 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-61) [2d63dceb] Command ConnectStorageServerVDS execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2013-05-26 13:18:07,261 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-4-thread-61) [2d63dceb] FINISH, ConnectStorageServerVDSCommand, log id: cc17602
2013-05-26 13:18:07,261 ERROR [org.ovirt.engine.core.bll.storage.ConnectSingleAsyncOperation] (pool-4-thread-61) [2d63dceb] Failed to connect host Fake_Host_039 to storage pool 005_Fake_Host_DataCenter. Exception: {3}: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:167) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.runConnectionStorageToDomain(ISCSIStorageHelper.java:54) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.runConnectionStorageToDomain(ISCSIStorageHelper.java:29) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.connectStorageToDomainByVdsId(ISCSIStorageHelper.java:216) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ConnectSingleAsyncOperation.execute(ConnectSingleAsyncOperation.java:18) [engine-bll.jar:]
        at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:42) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:31) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalCallable.call(ThreadPoolUtil.java:99) [engine-utils.jar:]
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_19]
        at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_19]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_19]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_19]
        at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_19]
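
The trace above ends in java.util.concurrent.TimeoutException, which is the exception the JDK raises when a bounded wait on a Future expires. The following is a minimal sketch of that mechanism only, not the engine's code: the class name, the callWithTimeout helper, and the simulated slow call are hypothetical stand-ins for a host request that does not answer within the allowed time.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {

    // Hypothetical helper: runs the call on a worker thread and waits
    // at most timeoutMs milliseconds for its result.
    public static String callWithTimeout(Callable<String> call, long timeoutMs)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = pool.submit(call);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // The bounded wait expired before the call finished.
                future.cancel(true); // interrupt the worker thread
                return "timeout";
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A call that answers immediately.
        System.out.println(callWithTimeout(() -> "connected", 1000));
        // A call that takes longer than the allowed wait (simulated with sleep).
        System.out.println(callWithTimeout(() -> {
            Thread.sleep(5000);
            return "connected";
        }, 100));
    }
}
```

In this sketch the caller gives up after the timeout even though the slow call might still have succeeded later, which matches the symptom in the log: the exception alone says the wait expired, not why the host was slow.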

/var/log/vdsm/vdsm.log

Comment 5 vvyazmin@redhat.com 2013-06-02 10:45:02 UTC
Created attachment 755807 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Comment 12 Aharon Canan 2014-05-12 11:46:10 UTC
What extra info is needed from me?

Comment 13 Yuri Obshansky 2014-05-21 14:44:19 UTC
All hosts except the SPM went to the Non-Operational state after activating the Storage Domain. I changed one host's state to Maintenance and activated it successfully.

Bug verified on version: 3.4.0-0.16.rc.el6ev
OS Version: RHEL - 6Server - 6.5.0.1.el6
Kernel Version: 2.6.32 - 431.5.1.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.6
LIBVIRT Version: libvirt-0.10.2-29.el6_5.5
VDSM Version: vdsm-4.14.7-0.2.rc.el6ev


2014-05-21 17:21:45,402 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2c1cc336] Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=304, mMessage=Cannot find master domain: 'spUUID=ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, msdUUID=8e319f62-698e-4386-9866-a24cc5529be6']]
2014-05-21 17:21:45,402 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2c1cc336] HostName = fake_host_20
2014-05-21 17:21:45,402 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2c1cc336] Command ConnectStoragePoolVDSCommand(HostName = fake_host_20, HostId = 47fda055-3e3c-4b50-a029-f8df9d630597, storagePoolId = ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, vds_spm_id = 20, masterDomainId = 8e319f62-698e-4386-9866-a24cc5529be6, masterVersion = 1) execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, msdUUID=8e319f62-698e-4386-9866-a24cc5529be6'
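
With many hosts, it can help to pull the pool and master domain IDs out of these messages mechanically. Below is a rough sketch of one way to do that. The class and method names are hypothetical, the pattern is based only on the message text quoted in this comment, and it assumes each log record sits on a single unwrapped line.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MasterDomainErrorParser {

    // Matches the quoted detail of a "Cannot find master domain" message
    // and captures the two 36-character UUIDs.
    private static final Pattern DETAIL = Pattern.compile(
        "Cannot find master domain: 'spUUID=([0-9a-f-]{36}), msdUUID=([0-9a-f-]{36})'");

    // Returns {spUUID, msdUUID} if the line carries the error, otherwise empty.
    public static Optional<String[]> parse(String line) {
        Matcher m = DETAIL.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2) });
    }

    public static void main(String[] args) {
        // Sample taken from the message text quoted in this comment.
        String line = "mCode=304, mMessage=Cannot find master domain: "
            + "'spUUID=ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, "
            + "msdUUID=8e319f62-698e-4386-9866-a24cc5529be6']]";
        parse(line).ifPresent(ids ->
            System.out.println("pool=" + ids[0] + " masterDomain=" + ids[1]));
    }
}
```

Running parse over every line of engine.log and counting the matches per host would show whether all hosts report the same missing master domain or only a subset.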

Comment 14 Yuri Obshansky 2014-05-21 14:48:09 UTC
Created attachment 898042 [details]
Snapshot

Hosts in Non-Operational and Unassigned states

Comment 15 Yuri Obshansky 2014-05-22 11:18:52 UTC
Created attachment 898345 [details]
Engine log

Comment 16 Yuri Obshansky 2014-05-22 11:19:23 UTC
Created attachment 898346 [details]
vdsm log

Comment 17 Yuri Obshansky 2014-05-22 11:22:20 UTC
I've changed the Severity to Urgent because, after bug verification, some of the hosts switched to the Unassigned state and I can do nothing with RHEVM except clean up and populate the data again. This looks like a very important bug.

Comment 19 errata-xmlrpc 2014-06-09 13:24:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html