Description of problem:

The HE VM does not migrate to another node when ovirtmgmt is down. ovirt-ha-agent tries to restart the HE VM on another node, but it fails to do so. When I check hosted-engine --vm-status I see the following output:

[root@rhsqa-grafton4 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : rhsqa-grafton4.lab.eng.blr.redhat.com
Host ID : 1
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : f08a8d7e
local_conf_timestamp : 0
Host timestamp : 70746
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=70746 (Tue Jan 17 11:51:42 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=0 (Thu Jan 1 05:30:00 1970)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStart
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : rhsqa-grafton5.lab.eng.blr.redhat.com
Host ID : 2
Engine status : {"reason": "Storage of VM is locked. Is another host already starting the VM?", "health": "bad", "vm": "already_locked", "detail": "down"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 06e4fbae
local_conf_timestamp : 0
Host timestamp : 70816
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=70816 (Tue Jan 17 11:51:31 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=0 (Thu Jan 1 05:30:00 1970)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

--== Host 3 status ==--

conf_on_shared_storage : True
Status up-to-date : False
Hostname : rhsqa-grafton6.lab.eng.blr.redhat.com
Host ID : 3
Engine status : unknown stale-data
Score : 1800
stopped : False
Local maintenance : False
crc32 : cc238e1d
local_conf_timestamp : 0
Host timestamp : 69282
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=69282 (Tue Jan 17 11:25:57 2017)
    host-id=3
    score=1800
    vm_conf_refresh_time=0 (Thu Jan 1 05:30:00 1970)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

When I execute vdsClient -s 0 list table I see that both VMs are up and running, but I cannot ping them.

[root@rhsqa-grafton6 ~]# vdsClient -s 0 list table
0015d21c-fc4e-440b-8601-31ab990b8712  20937  applinuxvm    Up
ae48ef65-f289-42ec-8b1f-94215115e95d  19825  HostedEngine  Up    10.70.44.74

Version-Release number of selected component (if applicable):

[root@rhsqa-grafton4 ~]# rpm -qa | grep ovirt
ovirt-imageio-daemon-0.5.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-0.0.master.git46cacd3.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0-0.0.master.git118000f.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-host-deploy-1.6.0-0.2.master.gitb76ad50.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy an HC setup on arbiter volumes.
2. Start running the HE and app VMs on the third node, which is the arbiter node.
3. Have power management enabled.
4. Bring down the ovirtmgmt network by running the command ifdown <nic>.

Actual results:
The HE VM does not restart on another node, and I/O stops on the app VM. ovirt-ha-agent tries to restart the HE VM on another node but fails.
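For reference, a minimal sketch of the failure injection from step 4 and of how the failover attempt can be watched from a surviving host (the interface name is only an example taken from a later comment in this bug; substitute whatever NIC actually carries ovirtmgmt):

# on the host currently running the HE VM (assumes ens3f0 carries ovirtmgmt)
ifdown ens3f0

# on one of the other hosts: watch the shared HA metadata and the agent/broker decisions
watch -n 10 hosted-engine --vm-status
tail -f /var/log/ovirt-hosted-engine-ha/agent.log /var/log/ovirt-hosted-engine-ha/broker.log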
Expected results:
HE and app vms should restart on other nodes.

Additional info:
I see the following tracebacks in the ovirt-broker.log file:

FilesystemBackend dom_type=glusterfs sd_uuid=4373df07-8826-4376-9a93-14463a8ce537'
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle
    data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch
    .set_storage_domain(client, sd_type, **options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 66, in set_storage_domain
    self._backends[client].connect()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 461, in connect
    self._dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 108, in get_domain_path
    " in {1}".format(sd_uuid, parent))
BackendFailureException: path to storage domain 4373df07-8826-4376-9a93-14463a8ce537 not found in /rhev/data-center/mnt/glusterSD

But this domain actually exists:

[root@rhsqa-grafton4 ~]# ls -l /rhev/data-center/mnt/glusterSD/10.70.36.82\:_engine/
total 4
drwxr-xr-x. 5 vdsm kvm 4096 Jan 13 11:59 4373df07-8826-4376-9a93-14463a8ce537
-rwxr-xr-x. 1 vdsm kvm    0 Jan 17 11:59 __DIRECT_IO_TEST__

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/concurrent.py", line 180, in run
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 216, in process_requests
    count=1,
  File "/usr/lib64/python2.7/asyncore.py", line 220, in loop
    poll_fun(timeout, map)
  File "/usr/lib64/python2.7/asyncore.py", line 181, in poll2
    if obj.readable():
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 61, in readable
    return self._delegate_call("readable")
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 143, in _delegate_call
    return getattr(self.__impl, name)(self)
AttributeError: 'NoneType' object has no attribute 'readable'

gluster volume info engine:
===============================
[root@rhsqa-grafton4 ~]# gluster volume info engine

Volume Name: engine
Type: Replicate
Volume ID: 6364da28-c876-49d1-ba59-0826e172e539
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.82:/rhgs/engine/enginebrick
Brick2: 10.70.36.83:/rhgs/engine/enginebrick
Brick3: 10.70.36.84:/rhgs/engine/enginebrick (arbiter)
Options Reconfigured:
auth.ssl-allow: 10.70.36.84,10.70.36.82,10.70.36.83
server.ssl: on
client.ssl: on
cluster.granular-entry-heal: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 4MB
storage.owner-gid: 36
storage.owner-uid: 36
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
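Side note: a quick way to cross-check what the broker complains about above; the UUID and mount prefix are taken verbatim from the traceback, the rest is just a sketch:

# is the storage domain directory visible under the glusterSD mount the broker scans?
SD_UUID=4373df07-8826-4376-9a93-14463a8ce537
ls -d /rhev/data-center/mnt/glusterSD/*/"$SD_UUID" 2>/dev/null || echo "not visible under glusterSD"

# and is the engine volume actually mounted at that point?
mount | grep glusterSD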
sos reports can be found in the link below:
===========================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1413845/
On active host (puma18) with HE-VM, I've set physical NIC associated with ovirtmgmt network down. puma19 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"} Score : 3400 stopped : False Local maintenance : False crc32 : cd4acc61 local_conf_timestamp : 656 Host timestamp : 641 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=641 (Sun Mar 5 22:45:53 2017) host-id=1 score=3400 vm_conf_refresh_time=656 (Sun Mar 5 22:46:08 2017) conf_on_shared_storage=True maintenance=False state=EngineDown stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma18 Host ID : 2 Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"} Score : 3400 stopped : False Local maintenance : False crc32 : b17d39e0 local_conf_timestamp : 36444 Host timestamp : 36429 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=36429 (Sun Mar 5 22:45:05 2017) host-id=2 score=3400 vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False puma19 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"} Score : 3400 stopped : False Local maintenance : False crc32 : 2f77164d local_conf_timestamp : 746 Host timestamp : 731 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=731 (Sun Mar 5 22:47:23 2017) host-id=1 score=3400 vm_conf_refresh_time=746 (Sun Mar 5 22:47:38 2017) conf_on_shared_storage=True maintenance=False state=EngineStart stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : puma18 Host ID : 2 Engine status : unknown stale-data Score : 3400 stopped : False Local maintenance : False crc32 : b17d39e0 local_conf_timestamp : 36444 Host timestamp : 36429 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=36429 (Sun Mar 5 22:45:05 2017) host-id=2 score=3400 vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False HE-VM getting started on second host (puma19). 
puma19 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "powering up"} Score : 3400 stopped : False Local maintenance : False crc32 : 8a8ae896 local_conf_timestamp : 809 Host timestamp : 794 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=794 (Sun Mar 5 22:48:26 2017) host-id=1 score=3400 vm_conf_refresh_time=809 (Sun Mar 5 22:48:41 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : puma18 Host ID : 2 Engine status : unknown stale-data Score : 3400 stopped : False Local maintenance : False crc32 : b17d39e0 local_conf_timestamp : 36444 Host timestamp : 36429 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=36429 (Sun Mar 5 22:45:05 2017) host-id=2 score=3400 vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False Engine got started on puma19: puma19 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"} Score : 3400 stopped : False Local maintenance : False crc32 : 08091f98 local_conf_timestamp : 869 Host timestamp : 854 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=854 (Sun Mar 5 22:49:26 2017) host-id=1 score=3400 vm_conf_refresh_time=869 (Sun Mar 5 22:49:41 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : puma18 Host ID : 2 Engine status : unknown stale-data Score : 3400 stopped : False Local maintenance : False crc32 : b17d39e0 local_conf_timestamp : 36444 Host timestamp : 36429 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=36429 (Sun Mar 5 22:45:05 2017) host-id=2 score=3400 vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False I've failed to connect to engine using WEBUI and there was no ping to it. I also see that somehow score was not deducted for puma18, although it lost it's connectivity to network. 
puma19 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"} Score : 3400 stopped : False Local maintenance : False crc32 : 24cb1887 local_conf_timestamp : 1169 Host timestamp : 1154 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=1154 (Sun Mar 5 22:54:26 2017) host-id=1 score=3400 vm_conf_refresh_time=1169 (Sun Mar 5 22:54:41 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : puma18 Host ID : 2 Engine status : unknown stale-data Score : 3400 stopped : False Local maintenance : False crc32 : b17d39e0 local_conf_timestamp : 36444 Host timestamp : 36429 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=36429 (Sun Mar 5 22:45:05 2017) host-id=2 score=3400 vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False Components on hosts: libvirt-client-2.0.0-10.el7_3.4.x86_64 qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64 rhevm-appliance-20160721.0-2.el7ev.noarch mom-0.5.9-1.el7ev.noarch ovirt-hosted-engine-setup-2.1.0.3-1.el7ev.noarch ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch sanlock-3.4.0-1.el7.x86_64 ovirt-vmconsole-host-1.0.4-1.el7ev.noarch vdsm-4.19.6-1.el7ev.x86_64 ovirt-host-deploy-1.6.0-1.el7ev.noarch ovirt-vmconsole-1.0.4-1.el7ev.noarch ovirt-imageio-common-1.0.0-0.el7ev.noarch ovirt-imageio-daemon-1.0.0-0.el7ev.noarch ovirt-setup-lib-1.1.0-1.el7ev.noarch ovirt-hosted-engine-ha-2.1.0.3-1.el7ev.noarch Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016 Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.3 (Maipo) On engine: rhev-guest-tools-iso-4.1-3.el7ev.noarch rhevm-dependencies-4.1.0-1.el7ev.noarch rhevm-doc-4.1.0-2.el7ev.noarch rhevm-branding-rhev-4.1.0-1.el7ev.noarch rhevm-setup-plugins-4.1.0-1.el7ev.noarch rhevm-4.1.1.2-0.1.el7.noarch Linux version 3.10.0-514.6.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Feb 17 19:21:31 EST 2017 Linux 3.10.0-514.6.2.el7.x86_64 #1 SMP Fri Feb 17 19:21:31 EST 2017 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.3 (Maipo)
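Regarding the score not being deducted on puma18: the gateway/network penalty decisions are logged by the HA agent and broker on that host, so they can be inspected directly. A sketch (log paths are the ovirt-hosted-engine-ha defaults; the grep patterns are only a guess at the relevant messages):

# on puma18: did the agent notice the network failure and apply a score penalty?
grep -iE 'gateway|network|penal|score' /var/log/ovirt-hosted-engine-ha/agent.log | tail -n 40
grep -i 'ping' /var/log/ovirt-hosted-engine-ha/broker.log | tail -n 20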
Target release should be set once a package build is known to fix an issue. Since this bug is not in the MODIFIED state, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
There is still a lot of work to do on this bug; why did you try to verify it? It is not finished yet.
(In reply to Denis Chaplygin from comment #4)
> There is still a lot of work on that bug, why you tried to verify it? It's
> not yet finished at all.

Because it was moved to ON_QA:

Sandro Bonazzola 2017-03-03 09:18:42 EST
Status: MODIFIED → ON_QA
Target Release: --- → 2.1.0.4
Denis - why is it targeted to 4.1.4? Isn't 4.1.3 realistic?
Did you have just a single physical NIC for both the ovirtmgmt and gluster networks, or separate NICs for ovirtmgmt and Gluster?
(In reply to Denis Chaplygin from comment #7)
> Did you had just a single physical NIC for both ovirtmgmt and gluster
> networks or separate NICs for ovirtmgmt and Gluster?

I did not use gluster storage in my environment; I was using NFS instead. The host had several NICs, but management traffic was coming in through only one of them. Both management and storage traffic passed through a single NIC.
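For completeness, one way to confirm whether management and storage traffic share a physical NIC (a rough sketch; the address below is only an example, use the actual NFS server or gluster brick IP):

# which physical NIC backs the ovirtmgmt bridge?
ip link show master ovirtmgmt

# which local interface would be used to reach the storage server?
ip route get 10.70.36.82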
Hosted engine itself works just fine - when the host running the VM loses its connection to the storage, the VM is automatically restarted. For some reason the engine itself is not able to start, with the following error in server.log:

2017-04-13 14:48:51,344Z ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@1c4f021c for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
        at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:498) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1369) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.finalizeInternal(AbstractOperationContext.java:1328) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.finalizeStep(AbstractOperationContext.java:1301) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.access$300(AbstractOperationContext.java:1185) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext.executeResultHandlerPhase(AbstractOperationContext.java:767) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:644) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:370) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1329) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.ModelControllerImpl.boot(ModelControllerImpl.java:493) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:387) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:349) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.server.ServerService.boot(ServerService.java:397) [wildfly-server-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.server.ServerService.boot(ServerService.java:366) [wildfly-server-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:299) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_121]

2017-04-13 14:48:51,346Z ERROR [org.jboss.as.controller.client] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@1c4f021c for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException

It is definitely not a gluster-related issue, not a storage issue, and not a lack-of-entropy issue (I checked all of those). Restarting the ovirt-engine unit manually after a short delay fixes the issue.
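A sketch of the manual workaround described above (assumes console access to the HE VM and the standard ovirt-engine systemd unit and log locations):

# from a host: open a console to the hosted-engine VM
hosted-engine --console

# inside the engine VM: restart the engine service and confirm the deployment finishes this time
systemctl restart ovirt-engine
tail -f /var/log/ovirt-engine/server.log   # watch for the deployment completing instead of WFLYCTL0190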
Denis, please provide the engine and server logs from the time you saw the stack trace mentioned in comment #9.
Denis, can you try to increase deployment-timeout in the urn:jboss:domain:deployment-scanner subsystem and rerun? You can find it in ovirt-engine.xml.
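For reference, the attribute sits on the deployment-scanner element of the generated WildFly configuration. A sketch of locating it (the path below is the one shown later in this bug; the value is in seconds, and the runtime file may be regenerated from a template when the service starts, so a manual edit may not survive a restart):

# locate the current setting in the generated configuration
grep -n 'deployment-timeout' /var/lib/ovirt-engine/jboss_runtime/config/ovirt-engine.xml
# raise deployment-timeout="..." on the <deployment-scanner ...> element, then restart ovirt-engine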
Created attachment 1272355 [details] server.log
Created attachment 1272356 [details] engine.log
I increased it tenfold:

[root@hc-he-lynx ovirt-engine]# cat /var/lib/ovirt-engine/jboss_runtime/config/ovirt-engine.xml | grep deployment-timeout
    <deployment-scanner path="/var/lib/ovirt-engine/jboss_runtime/deployments" deployment-timeout="12000"

but I still get the same error:

2017-04-18 15:03:53,976Z ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@481734ad for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
Denis, is there anything that could cause the hosted-engine VM to be slow? I wonder how performant the storage holding the VM image is. I was told that this issue does not occur on storage types other than gluster.
Well, that was one of the first things I excluded :-) The behaviour is observable on both glusterfs-based and NFS-based storage.
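For what it's worth, a crude way to sanity-check write latency on the hosted-engine storage domain from a host; this is only a sketch, and the mount point is the gluster example from the bug description, so adjust it to the actual engine storage domain mount:

# direct-I/O write test against the engine storage domain mount
cd /rhev/data-center/mnt/glusterSD/10.70.36.82:_engine
dd if=/dev/zero of=__latency_test__ bs=4k count=1000 oflag=direct conv=fsync
rm -f __latency_test__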
Worked for me on regular environment on regular hosted-engine deployed over NFS storage domain with additional pair of NFS data storage domains. puma18 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"health": "good", "vm": "up", "detail": "up"} Score : 3400 stopped : False Local maintenance : False crc32 : cb960afb local_conf_timestamp : 453498 Host timestamp : 453483 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=453483 (Mon Jun 19 17:59:26 2017) host-id=1 score=3400 vm_conf_refresh_time=453498 (Mon Jun 19 17:59:42 2017) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma18 Host ID : 2 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"} Score : 3400 stopped : False Local maintenance : False crc32 : 40bde156 local_conf_timestamp : 15729 Host timestamp : 15714 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=15714 (Mon Jun 19 18:00:09 2017) host-id=2 score=3400 vm_conf_refresh_time=15729 (Mon Jun 19 18:00:24 2017) conf_on_shared_storage=True maintenance=False state=EngineDown stopped=False puma18 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : puma19 Host ID : 1 Engine status : unknown stale-data Score : 3400 stopped : False Local maintenance : False crc32 : cb960afb local_conf_timestamp : 453498 Host timestamp : 453483 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=453483 (Mon Jun 19 17:59:26 2017) host-id=1 score=3400 vm_conf_refresh_time=453498 (Mon Jun 19 17:59:42 2017) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma18 Host ID : 2 Engine status : {"health": "good", "vm": "up", "detail": "up"} Score : 3400 stopped : False Local maintenance : False crc32 : 0dac2bfd local_conf_timestamp : 15947 Host timestamp : 15931 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=15931 (Mon Jun 19 18:03:46 2017) host-id=2 score=3400 vm_conf_refresh_time=15947 (Mon Jun 19 18:04:02 2017) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False As you may see, HE-VM got migrated from puma19 to puma18, once I've turned off NIC on puma19, which was associated with ovirtmgmt bridge. 
puma18 was RHVH with these components: sanlock-3.4.0-1.el7.x86_64 ovirt-node-ng-nodectl-4.1.3-0.20170608.1.el7.noarch ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch mom-0.5.9-1.el7ev.noarch qemu-kvm-rhev-2.6.0-28.el7_3.10.x86_64 vdsm-4.19.18-1.el7ev.x86_64 ovirt-host-deploy-1.6.6-1.el7ev.noarch ovirt-imageio-common-1.0.0-0.el7ev.noarch ovirt-vmconsole-1.0.4-1.el7ev.noarch libvirt-client-2.0.0-10.el7_3.9.x86_64 ovirt-imageio-daemon-1.0.0-0.el7ev.noarch ovirt-hosted-engine-ha-2.1.2-1.el7ev.noarch ovirt-hosted-engine-setup-2.1.3-1.el7ev.noarch ovirt-setup-lib-1.1.3-1.el7ev.noarch ovirt-vmconsole-host-1.0.4-1.el7ev.noarch Linux version 3.10.0-514.21.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Apr 22 02:41:35 EDT 2017 Linux 3.10.0-514.21.1.el7.x86_64 #1 SMP Sat Apr 22 02:41:35 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux release 7.3 Engine: rhevm-dependencies-4.1.1-1.el7ev.noarch rhevm-doc-4.1.3-1.el7ev.noarch rhevm-4.1.3.4-0.1.el7.noarch rhev-guest-tools-iso-4.1-5.el7ev.noarch rhevm-setup-plugins-4.1.2-1.el7ev.noarch rhevm-branding-rhev-4.1.0-2.el7ev.noarch Linux version 3.10.0-681.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Jun 13 22:55:13 EDT 2017 Linux 3.10.0-681.el7.x86_64 #1 SMP Tue Jun 13 22:55:13 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.4 Beta (Maipo) Moving to verified, if there will be any issues related to HC, please reopen and specify as hyperconverged specific.
Reopening this bug: when tested with a hyperconverged setup I still see that when ovirtmgmt is down, HE is not able to start on another node.

engine.log:
================================
2017-06-28 07:06:54,812-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [] START, GlusterServersListVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'}), log id: 3b5317
2017-06-28 07:07:03,934-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [] EVENT_ID: HOST_INTERFACE_STATE_DOWN(610), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Interface ens3f0 on host rhsqa-grafton2.lab.eng.blr.redhat.com, changed state to down
2017-06-28 07:07:09,255-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler3) [] START, GlusterTasksListVDSCommand(HostName = rhsqa-grafton2.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab'}), log id: 5a272251
2017-06-28 07:07:10,437-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler3) [] FINISH, GlusterTasksListVDSCommand, return : [], log id: 5a272251
2017-06-28 07:07:13,535-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler9) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:07:13,535-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler9) [] Failed to fetch vms info for host 'rhsqa-grafton3.lab.eng.blr.redhat.com' - skipping VMs monitoring.
2017-06-28 07:07:13,541-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler4) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10, 802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GlusterServersListVDS failed: Heartbeat exceed ed 2017-06-28 07:07:13,541-04 ERROR [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [] Command 'GlusterServersListVDSCommand(Ho stName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGeneric Exception: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,541-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [] FINISH, GlusterServersListVDSCommand, lo g id: 3b5317 2017-06-28 07:07:13,541-04 ERROR [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler4) [] Error while refreshing server data for cluster 'Default' fro m database: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded (Failed with error V DS_NETWORK_ERROR and code 5022) 2017-06-28 07:07:13,542-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler4) [] START, GlusterVolumesListVDSCommand(Host Name = rhsqa-grafton3.lab.eng.blr.redhat.com, GlusterVolumesListVDSParameters:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'}), log id: 5da48e6e 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10, 802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command SpmStatusVDS failed: Heartbeat exceeded 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10 ,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GetStatsVDS failed: Heartbeat exceeded 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler10) [] Command 'GetStatsVDSCommand(HostName = rhsqa-gr afton3.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af', vds='Host[rhsqa-grafton3.lab.eng.blr.redh at.com,29e0cdcd-312d-41c7-bb2b-2777aac609af]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [] Failed getting vds stats, host='rhsqa-grafton3.lab.eng.blr.redhat.com'(29e0cdcd-312d-41c7-bb2b-2777aac609af): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [] Failure to refresh host 'rhsqa-grafton3.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler6) [] 
Command 'SpmStatusVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, SpmStatusVDSCommandParameters:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af', storagePoolId='00000001-0001-0001-0001-000000000311'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,542-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler10) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton3.lab.eng.blr.redhat.com'(29e0cdcd-312d-41c7-bb2b-2777aac609af): VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,543-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-33) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,544-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-50) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,544-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-39) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,544-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-31) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,552-04 INFO [org.ovirt.engine.core.bll.storage.pool.SetStoragePoolStatusCommand] (DefaultQuartzScheduler6) [6e07475e] Running command: SetStoragePoolStatusCommand internal: true. 
Entities affected : ID: 00000001-0001-0001-0001-000000000311 Type: StoragePool 2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '91166baa-54d0-48fd-9cb8-d225f6b9b8a2' status from 'Active' to 'Unknown', reason: null 2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '323bccf6-0fb4-4031-b53f-9f7be6cf5182' status from 'Active' to 'Unknown', reason: null 2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '8b7e37eb-6582-473f-90c9-96ad1039229a' status from 'Active' to 'Unknown', reason: null 2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '90898c1f-87e9-4324-aae4-a35c4a9ae2fa' status from 'Active' to 'Unknown', reason: null 2017-06-28 07:07:13,569-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-50) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,570-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-31) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,579-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-39) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,581-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-33) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,594-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [6e07475e] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR(987), Correlation ID: 6e07475e, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center Default. 
Setting Data Center status to Non Responsive (On host rhsqa-grafton3.lab.eng.blr.redhat.com, Error: Network error during communication with the Host.). 2017-06-28 07:08:09,656-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:08:09,656-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler6) [6e07475e] Exception during connection 2017-06-28 07:08:09,656-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler6) [6e07475e] Failed to invoke scheduled method updatingTimerElapsed: null 2017-06-28 07:08:09,681-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler9) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton2.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,681-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler9) [] Failed to fetch vms info for host 'rhsqa-grafton2.lab.eng.blr.redhat.com' - skipping VMs monitoring. 2017-06-28 07:08:09,681-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton1.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='c0db6d13-b14b-4a50-bba5-55d54476963f'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,681-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler8) [] Failed to fetch vms info for host 'rhsqa-grafton1.lab.eng.blr.redhat.com' - skipping VMs monitoring. 
2017-06-28 07:08:09,685-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton2.lab.eng.blr.redhat.com command GetStatsVDS failed: Heartbeat exceeded 2017-06-28 07:08:09,686-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler7) [] Command 'GetStatsVDSCommand(HostName = rhsqa-grafton2.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab', vds='Host[rhsqa-grafton2.lab.eng.blr.redhat.com,7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,686-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler7) [] Failed getting vds stats, host='rhsqa-grafton2.lab.eng.blr.redhat.com'(7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,686-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler7) [] Failure to refresh host 'rhsqa-grafton2.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,686-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler7) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton2.lab.eng.blr.redhat.com'(7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab): VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,686-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-30) [] Host 'rhsqa-grafton2.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:08:09,686-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-41) [] Host 'rhsqa-grafton2.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued. 
2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton1.lab.eng.blr.redhat.com command GetStatsVDS failed: Heartbeat exceeded 2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler5) [] Command 'GetStatsVDSCommand(HostName = rhsqa-grafton1.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='c0db6d13-b14b-4a50-bba5-55d54476963f', vds='Host[rhsqa-grafton1.lab.eng.blr.redhat.com,c0db6d13-b14b-4a50-bba5-55d54476963f]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler5) [] Failed getting vds stats, host='rhsqa-grafton1.lab.eng.blr.redhat.com'(c0db6d13-b14b-4a50-bba5-55d54476963f): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler5) [] Failure to refresh host 'rhsqa-grafton1.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,688-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler5) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton1.lab.eng.blr.redhat.com'(c0db6d13-b14b-4a50-bba5-55d54476963f): VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,689-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-32) [] Host 'rhsqa-grafton1.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:08:09,703-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-30) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton2.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:08:09,781-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler3) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,781-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler3) [] Failed to fetch vms info for host 'rhsqa-grafton3.lab.eng.blr.redhat.com' - skipping VMs monitoring. 
2017-06-28 07:08:09,786-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GetCapabilitiesVDS failed: Heartbeat exceeded 2017-06-28 07:08:09,786-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler10) [] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand' return value 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSInfoReturn@7a9b8b62' 2017-06-28 07:08:09,786-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler10) [] HostName = rhsqa-grafton3.lab.eng.blr.redhat.com 2017-06-28 07:08:09,786-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler10) [] Command 'GetCapabilitiesVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af', vds='Host[rhsqa-grafton3.lab.eng.blr.redhat.com,29e0cdcd-312d-41c7-bb2b-2777aac609af]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,786-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [] Failure to refresh host 'rhsqa-grafton3.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,786-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler10) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton3.lab.eng.blr.redhat.com'(29e0cdcd-312d-41c7-bb2b-2777aac609af): VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,787-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler4) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GlusterVolumesListVDS failed: Heartbeat exceeded 2017-06-28 07:08:09,787-04 ERROR [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler4) [] Command 'GlusterVolumesListVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, GlusterVolumesListVDSParameters:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,787-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler4) [] FINISH, GlusterVolumesListVDSCommand, log id: 5da48e6e 2017-06-28 07:08:09,787-04 ERROR [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler4) [] Error while refreshing Gluster lightweight data of cluster 'Default': EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded (Failed with error VDS_NETWORK_ERROR and code 5022) 2017-06-28 07:08:10,444-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler8) [] No up server in cluster 2017-06-28 07:08:13,502-04 INFO [org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler9) [] No UP server found in cluster 'Default' for snapshot monitoring 2017-06-28 07:08:13,504-04 INFO 
[org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler9) [] No UP server found in cluster 'Default' for snapshot configurations monitoring 2017-06-28 07:09:08,750-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:09:08,750-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler4) [] Exception during connection 2017-06-28 07:09:08,750-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler4) [] Failure to refresh host 'rhsqa-grafton1.lab.eng.blr.redhat.com' runtime info: org/apache/commons/lang/exception/ExceptionUtils 2017-06-28 07:09:08,751-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler4) [] Failed to invoke scheduled method onTimer: null 2017-06-28 07:09:10,446-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler4) [] No up server in cluster 2017-06-28 07:11:00,843-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:11:00,843-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler5) [] Exception during connection 2017-06-28 07:11:00,843-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler5) [] Failure to refresh host 'rhsqa-grafton3.lab.eng.blr.redhat.com' runtime info: org/apache/commons/lang/exception/ExceptionUtils 2017-06-28 07:11:00,843-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler5) [] Failed to invoke scheduled method onTimer: null 2017-06-28 07:11:10,450-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler5) [] No up server in cluster 2017-06-28 07:11:56,893-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:11:56,893-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler2) [] Exception during connection 2017-06-28 07:11:56,894-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler2) [] Failed to invoke scheduled method poll: null 2017-06-28 07:12:10,452-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler3) [] No up server in cluster 2017-06-28 07:12:52,949-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:12:52,949-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler9) [] Exception during connection 2017-06-28 07:12:52,949-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler9) [] Failed to invoke scheduled method poll: null 2017-06-28 07:13:10,460-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler1) [] No up server in cluster 2017-06-28 07:13:13,507-04 INFO [org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler1) [] No UP server found in cluster 'Default' for snapshot monitoring 2017-06-28 07:13:13,508-04 INFO [org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler1) [] No UP server found in cluster 'Default' for snapshot configurations monitoring 2017-06-28 07:13:48,991-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:13:48,991-04 ERROR 
[org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler8) [] Exception during connection
2017-06-28 07:13:48,992-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler8) [] Failed to invoke scheduled method updatingTimerElapsed: null
2017-06-28 07:14:10,462-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler5) [] No up server in cluster :

from server.log:
======================================
2017-06-28 05:48:55,631-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-1) Client session created
2017-06-28 05:48:55,692-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-1) Client session created
2017-06-28 05:48:55,769-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-2) Server version string: SSH-2.0-OpenSSH_6.6.1
2017-06-28 05:48:55,771-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-3) Kex: server->client aes128-ctr hmac-sha2-256 none
2017-06-28 05:48:55,771-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-3) Kex: client->server aes128-ctr hmac-sha2-256 none
2017-06-28 05:48:55,831-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-2) Server version string: SSH-2.0-OpenSSH_6.6.1
2017-06-28 05:48:55,833-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-3) Kex: server->client aes128-ctr hmac-sha2-256 none
2017-06-28 05:48:55,833-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-3) Kex: client->server aes128-ctr hmac-sha2-256 none
2017-06-28 05:59:33,455-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-1) Client session created
2017-06-28 05:59:33,586-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-2) Server version string: SSH-2.0-OpenSSH_6.6.1
2017-06-28 05:59:33,589-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-3) Kex: server->client aes128-ctr hmac-sha2-256 none
2017-06-28 05:59:33,589-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-3) Kex: client->server aes128-ctr hmac-sha2-256 none
2017-06-28 06:19:07,399-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-13) Response content length is not known
2017-06-28 06:19:09,232-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-40) Response content length is not known
2017-06-28 06:19:11,436-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-9) Response content length is not known
2017-06-28 06:19:11,978-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-23) Response content length is not known
2017-06-28 07:08:09,656-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler6) Job DEFAULT.org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy.updatingTimerElapsed#-9223372036854775775 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
        at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source) [:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:09:08,751-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler4) Job DEFAULT.org.ovirt.engine.core.vdsbroker.VdsManager.onTimer#-9223372036854775801 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
        at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source) [:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:10:04,799-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler10) Job DEFAULT.org.ovirt.engine.core.vdsbroker.VdsManager.onTimer#-9223372036854775735 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
        at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source) [:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:11:00,843-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler5) Job DEFAULT.org.ovirt.engine.core.vdsbroker.VdsManager.onTimer#-9223372036854775731 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
        at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source) [:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:11:56,894-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler2) Job DEFAULT.org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher.poll#-9223372036854775799 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
        at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source) [:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
Please provide engine and vdsm logs. I wonder why vdsm was down in the first place.
I have copied the engine and vdsm logs to the location below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1413845/
Created attachment 1292799 [details] Fencing policies which are enabled on cluster
I checked the logs and I see a bunch of connection issues (Connection reset by peer). Just before the issue you referred to in comment #20 I saw:

2017-06-28 07:07:13,594-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [6e07475e] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR(987), Correlation ID: 6e07475e, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host rhsqa-grafton3.lab.eng.blr.redhat.com, Error: Network error during communication with the Host.).
2017-06-28 07:08:09,656-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed

It seems that vdsm could have been killed by sanlock. Please make sure that your storage is accessible. By the way, please provide vdsm logs from the time the issue occurred.
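For reference, a minimal sketch of how those two points could be checked on the affected host (the mount path below is a placeholder for the hosted-engine storage domain mount, not a value taken from this environment):

# Confirm the hosted-engine storage domain is still mounted and readable
# (replace <server>:_engine with the actual storage server and volume)
ls -l /rhev/data-center/mnt/glusterSD/<server>:_engine/

# Confirm vdsm is still running and was not terminated
systemctl status vdsmd
journalctl -u vdsmd --since "1 hour ago" | tail -n 50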
We don't seem to be able to reproduce this on a non-hyperconverged setup. Pushing to 4.1.5 for now. Can you please give us access to an environment where this issue occurs?
(In reply to Doron Fediuck from comment #47)
> We don't seem to be able to reproduce this on a non-hyperconverged setup.

I'm able to reproduce it on my NFS environment with exactly the same symptoms - the VM is restarted, but the engine itself never finishes deploying.
This is unrelated to migration. We were able to confirm that the VM started properly on a different host. The main issue is that the J2EE application starts but appears to be blocked. Still looking into this.
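For anyone debugging the same state, a rough sketch of how the blocked application could be inspected from inside the HE VM (assuming console or SSH access to the VM; the log path is the default oVirt engine location, and none of this is taken from the original analysis):

# Attach to the hosted-engine VM's console from the host currently running it
hosted-engine --console

# Inside the VM: check whether the engine service is up but not progressing
systemctl status ovirt-engine

# Look at the last activity in the engine's own log
tail -n 50 /var/log/ovirt-engine/engine.log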
There were quite significant changes to the JBoss and hosted-engine code; can you please check whether this is still happening?
Sure, will retest and update this. Keeping the needinfo on me till then.
(In reply to RamaKasturi from comment #53)
> Sure, will retest and update this. Keeping the needinfo on me till then.

Ping?
Hi Martin & Yaniv,

Apologies for the late reply. I was on PTO last week and did not get a chance to retest this yet. Will test it out this week and clear the needinfo.

Thanks,
kasturi
(In reply to RamaKasturi from comment #55)
> Hi Martin & Yaniv,
>
> Apologies for the late reply. I was on PTO last week and did not get a
> chance to retest this yet. Will test it out this week and clear the
> needinfo.
>
> Thanks,
> kasturi

Re-setting the NEEDINFO till we get a response.
Re-opening this bug as I see that the HE VM does not get migrated when ovirtmgmt is down.

Versions:
=================
Red Hat Virtualization Manager Version: 4.1.7.4-0.1.el7
vdsm-4.19.35-1.el7ev.x86_64

Following traceback seen on the machine where ovirtmgmt was down:
================================================================
MainThread::ERROR::2017-11-03 17:59:11,343::brokerlink::280::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::ERROR::2017-11-03 17:59:11,349::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 191, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 64, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
    hostname=socket.gethostname())
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 118, in notify
    .format(event_type, detail, options, e))
RequestError: Failed to send notification about state_transition, detail StartState-ReinitializeFSM, options {'hostname': 'yarrow.lab.eng.blr.redhat.com'}: Connection timed out
MainThread::ERROR::2017-11-03 17:59:11,350::agent::206::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::WARNING::2017-11-03 17:59:16,355::agent::209::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '4'
The latest sos reports are copied to the location below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1413845/
1. Which interface is used for storage?
2. (Just an idea) Can you block DNS (via an iptables rule) without failing the ovirtmgmt network and retry?
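In case it helps, one possible way to do point 2 (a sketch only; it assumes DNS is not needed for the storage path, and the rules should be removed after the test):

# Block outgoing DNS while leaving ovirtmgmt itself up
iptables -I OUTPUT -p udp --dport 53 -j DROP
iptables -I OUTPUT -p tcp --dport 53 -j DROP

# ... run the test, then remove the rules again
iptables -D OUTPUT -p udp --dport 53 -j DROP
iptables -D OUTPUT -p tcp --dport 53 -j DROP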
Hmm, what is the network topology here? I assume separate ovirtmgmt and storage networks are used and only ovirtmgmt is down, right?
(In reply to Martin Sivák from comment #61)
> Hmm, what is the network topology here? I assume separate ovirtmgmt and
> storage networks are used and only ovirtmgmt is down, right?

Yes, that is right, Martin.
The patch I posted fixes just a symptom.

> > Hmm, what is the network topology here? I assume separate ovirtmgmt and
> > storage networks are used and only ovirtmgmt is down, right?
>
> Yes, that is right, Martin.

This means sanlock won't fence the host, because the lockspace is still intact and accessible. On the other hand, the agent will see status updates and the score will be properly reduced.

I posted a patch that should properly handle notification failures and that will hopefully allow the logic to behave properly.

Btw, you really should have opened a new bug about this, as we are mixing multiple different cases here...
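To illustrate the expected behaviour, a small sketch of how the intact lockspace and the reduced score can be observed on the host whose ovirtmgmt is down (illustrative commands only, not part of the patch; the exact score values depend on the configured penalties):

# sanlock should still list its lockspaces, since the storage network is up
sanlock client status

# the local host's score should drop in the shared metadata over the next minutes
watch -n 10 'hosted-engine --vm-status | grep -iE "hostname|score"'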
(In reply to Martin Sivák from comment #63)
> The patch I posted fixes just a symptom.
>
> This means sanlock won't fence the host, because the lockspace is still
> intact and accessible. On the other hand, the agent will see status updates
> and the score will be properly reduced.
>
> I posted a patch that should properly handle notification failures and that
> will hopefully allow the logic to behave properly.
>
> Btw, you really should have opened a new bug about this, as we are mixing
> multiple different cases here...

Hi Martin,

If you look at the description of the bug, the whole point is about exactly this and not about anything else. This was the description of the problem:

"I see that HE vm does not migrate to another node when ovirtmgmt is down. ovirt-ha-agent tries to restart the host on another node but it fails to do so. when i check hosted-engine --vm-status i see the following output."

Thanks,
kasturi
It doesn't happen for me in a regular environment, so I assume it is relevant for HC environments only; please move it back to the HC team. Just be aware of bug https://bugzilla.redhat.com/show_bug.cgi?id=1528292, which can prevent ovirt-engine from starting.
Kasturi, can you please take care of verifying this bug?
Tested with RHV 4.2.3 with the following steps:
1. Created a hyperconverged setup with 3 nodes.
2. When the ovirtmgmt bridge is down on one of the hosts, the HE VM gets migrated to the other host automatically.

A rough sketch of the commands involved is shown below.
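This is only a sketch of how such a check can be driven, assuming shell access to the hosts; the interface name and the exact commands used during the actual verification may differ:

# On the host currently running the HE VM: take the management bridge down
ip link set ovirtmgmt down

# On another host: watch the HA agents react; the HE VM should come up elsewhere
watch -n 10 'hosted-engine --vm-status | grep -iE "hostname|engine status|score"'

# Afterwards, bring the bridge back up on the first host
ip link set ovirtmgmt up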
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.