Bug 967311 - [storage] In scale environment, some hosts become Non Responsive when adding first Storage Domain – java.util.concurrent.TimeoutException
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.5.0
Assigned To: Allon Mureinik
QA Contact: Yuri Obshansky
Whiteboard: storage
Depends On:
Blocks:
Reported: 2013-05-26 09:02 EDT by vvyazmin@redhat.com
Modified: 2016-02-10 11:33 EST
CC List: 9 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-09 09:24:44 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
## Logs rhevm, vdsm, libvirt (2.12 MB, application/x-gzip)
2013-05-26 09:02 EDT, vvyazmin@redhat.com
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (3.57 MB, application/x-gzip)
2013-06-02 06:45 EDT, vvyazmin@redhat.com
Snapshot (688.04 KB, image/x-xcf)
2014-05-21 10:48 EDT, Yuri Obshansky
Engine log (3.24 MB, application/x-gzip)
2014-05-22 07:18 EDT, Yuri Obshansky
vdsm log (521.76 KB, application/x-gzip)
2014-05-22 07:19 EDT, Yuri Obshansky

Description vvyazmin@redhat.com 2013-05-26 09:02:57 EDT
Created attachment 753321 [details]
## Logs rhevm, vdsm, libvirt

Description of problem:
In a scale environment, some hosts become Non Responsive when adding the first Storage Domain. The engine log shows java.util.concurrent.TimeoutException.

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF17.1 environment:

RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 50 hosts (in my case, 50 fake hosts)
2. Add the first Storage Domain
  
Actual results:
Some hosts become Non-Responsive

Expected results:
The first Storage Domain is added successfully (in a scale environment) without problems

Impact on user:

Workaround:
Put the host into maintenance mode, then reinstall VDSM via the UI

Additional info:

/var/log/ovirt-engine/engine.log
2013-05-26 13:18:07,260 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-61) [2d63dceb] Command ConnectStorageServerVDS execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2013-05-26 13:18:07,261 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-4-thread-61) [2d63dceb] FINISH, ConnectStorageServerVDSCommand, log id: cc17602
2013-05-26 13:18:07,261 ERROR [org.ovirt.engine.core.bll.storage.ConnectSingleAsyncOperation] (pool-4-thread-61) [2d63dceb] Failed to connect host Fake_Host_039 to storage pool 005_Fake_Host_DataCenter. Exception: {3}: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:167) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.runConnectionStorageToDomain(ISCSIStorageHelper.java:54) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.runConnectionStorageToDomain(ISCSIStorageHelper.java:29) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ISCSIStorageHelper.connectStorageToDomainByVdsId(ISCSIStorageHelper.java:216) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.storage.ConnectSingleAsyncOperation.execute(ConnectSingleAsyncOperation.java:18) [engine-bll.jar:]
        at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:42) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.SyncronizeNumberOfAsyncOperations$AsyncOpThread.call(SyncronizeNumberOfAsyncOperations.java:31) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalCallable.call(ThreadPoolUtil.java:99) [engine-utils.jar:]
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_19]
        at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_19]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_19]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_19]
        at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_19]
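
For readers unfamiliar with the exception in the trace above: one common way a java.util.concurrent.TimeoutException arises is when a caller bounds its wait on a Future while many operations compete for a small worker pool. The sketch below only illustrates that general mechanism. It is not the engine's code, and the class name, method name, pool sizes, and timings are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not taken from ovirt-engine.
final class ConnectFanOutSketch {

    /** Submits one simulated "connect" per host and returns how many waits timed out. */
    static int countTimeouts(int hosts, int poolSize, long connectMillis, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Boolean>> pending = new ArrayList<>();
            for (int i = 0; i < hosts; i++) {
                // Each task stands in for a slow remote call to one host.
                pending.add(pool.submit(() -> {
                    Thread.sleep(connectMillis);
                    return true;
                }));
            }
            int timedOut = 0;
            for (Future<Boolean> f : pending) {
                try {
                    // Bounded wait: throws TimeoutException if the result is not ready in time.
                    f.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    timedOut++;
                    f.cancel(true); // give up on this host
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            return timedOut;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // 50 simulated hosts, 5 workers, 200 ms per connect, 100 ms wait per result.
        System.out.println("timed out: " + countTimeouts(50, 5, 200, 100));
    }
}
```

In this toy model, raising the pool size or the timeout reduces the number of timed-out waits. Whether the same applies to the engine depends on where its timeout is actually enforced, which this report does not establish.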

/var/log/vdsm/vdsm.log
Comment 5 vvyazmin@redhat.com 2013-06-02 06:45:02 EDT
Created attachment 755807 [details]
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm
Comment 12 Aharon Canan 2014-05-12 07:46:10 EDT
What extra info is needed from me?
Comment 13 Yuri Obshansky 2014-05-21 10:44:19 EDT
All hosts except the SPM went to the Non-Operational state after the Storage Domain was activated. I switched one host to Maintenance and then activated it successfully.

Bug verified on version: 3.4.0-0.16.rc.el6ev
OS Version: RHEL - 6Server - 6.5.0.1.el6
Kernel Version: 2.6.32 - 431.5.1.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.6
LIBVIRT Version: libvirt-0.10.2-29.el6_5.5
VDSM Version: vdsm-4.14.7-0.2.rc.el6ev


2014-05-21 17:21:45,402 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2c1cc336] Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=304, mMessage=Cannot find master domain: 'spUUID=ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, msdUUID=8e319f62-698e-4386-9866-a24cc5529be6']]
2014-05-21 17:21:45,402 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2c1cc336] HostName = fake_host_20
2014-05-21 17:21:45,402 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2c1cc336] Command ConnectStoragePoolVDSCommand(HostName = fake_host_20, HostId = 47fda055-3e3c-4b50-a029-f8df9d630597, storagePoolId = ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, vds_spm_id = 20, masterDomainId = 8e319f62-698e-4386-9866-a24cc5529be6, masterVersion = 1) execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, msdUUID=8e319f62-698e-4386-9866-a24cc5529be6'
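
When triaging many hosts, it can help to pull the status code and message out of log lines like the ones above. The following is a minimal sketch that assumes the `mCode=..., mMessage=...]]` layout shown in this excerpt. The class and method names are hypothetical, and the pattern has only been checked against the line quoted here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes the "mCode=<n>, mMessage=<text>]]" layout seen in this log.
final class VdsStatusParser {

    private static final Pattern STATUS =
            Pattern.compile("mCode=(\\d+), mMessage=(.*?)\\]\\]");

    /** Returns the numeric status code, or -1 if the line has no status block. */
    static int parseCode(String line) {
        Matcher m = STATUS.matcher(line);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Returns the status message, or null if the line has no status block. */
    static String parseMessage(String line) {
        Matcher m = STATUS.matcher(line);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String line = " StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=304, "
                + "mMessage=Cannot find master domain: "
                + "'spUUID=ffe1e4cc-6d84-41e8-91b0-7e2d4f1a9050, "
                + "msdUUID=8e319f62-698e-4386-9866-a24cc5529be6']]";
        System.out.println(parseCode(line) + " -> " + parseMessage(line));
    }
}
```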
Comment 14 Yuri Obshansky 2014-05-21 10:48:09 EDT
Created attachment 898042 [details]
Snapshot

Hosts in Non-Operational and Unassigned states
Comment 15 Yuri Obshansky 2014-05-22 07:18:52 EDT
Created attachment 898345 [details]
Engine log
Comment 16 Yuri Obshansky 2014-05-22 07:19:23 EDT
Created attachment 898346 [details]
vdsm log
Comment 17 Yuri Obshansky 2014-05-22 07:22:20 EDT
I've changed Severity to Urgent because, after bug verification, some of the hosts switched to the Unassigned state and I can do nothing with RHEVM except clean up and populate the data again. This looks like a very important bug.
Comment 19 errata-xmlrpc 2014-06-09 09:24:44 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html
