Description of problem:

The HE VM does not migrate to another node when ovirtmgmt is down. ovirt-ha-agent tries to restart the HE VM on another node, but it fails to do so. When I check hosted-engine --vm-status I see the following output:

[root@rhsqa-grafton4 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : rhsqa-grafton4.lab.eng.blr.redhat.com
Host ID : 1
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : f08a8d7e
local_conf_timestamp : 0
Host timestamp : 70746
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=70746 (Tue Jan 17 11:51:42 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=0 (Thu Jan 1 05:30:00 1970)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStart
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : rhsqa-grafton5.lab.eng.blr.redhat.com
Host ID : 2
Engine status : {"reason": "Storage of VM is locked. Is another host already starting the VM?", "health": "bad", "vm": "already_locked", "detail": "down"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 06e4fbae
local_conf_timestamp : 0
Host timestamp : 70816
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=70816 (Tue Jan 17 11:51:31 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=0 (Thu Jan 1 05:30:00 1970)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

--== Host 3 status ==--

conf_on_shared_storage : True
Status up-to-date : False
Hostname : rhsqa-grafton6.lab.eng.blr.redhat.com
Host ID : 3
Engine status : unknown stale-data
Score : 1800
stopped : False
Local maintenance : False
crc32 : cc238e1d
local_conf_timestamp : 0
Host timestamp : 69282
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=69282 (Tue Jan 17 11:25:57 2017)
    host-id=3
    score=1800
    vm_conf_refresh_time=0 (Thu Jan 1 05:30:00 1970)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

When I execute vdsClient -s 0 list table I see that both VMs are up and running, but I cannot ping them.

[root@rhsqa-grafton6 ~]# vdsClient -s 0 list table
0015d21c-fc4e-440b-8601-31ab990b8712  20937  applinuxvm    Up
ae48ef65-f289-42ec-8b1f-94215115e95d  19825  HostedEngine  Up    10.70.44.74

Version-Release number of selected component (if applicable):

[root@rhsqa-grafton4 ~]# rpm -qa | grep ovirt
ovirt-imageio-daemon-0.5.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-0.0.master.git46cacd3.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0-0.0.master.git118000f.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-host-deploy-1.6.0-0.2.master.gitb76ad50.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy an HC setup on arbiter volumes.
2. Start running the HE and app VMs on the third node, which is the arbiter node.
3. Have power management enabled.
4. Bring down the ovirtmgmt network by running the command ifdown <nic>.

Actual results:
The HE VM does not restart on another node, and I/O stops on the app VM. ovirt-ha-agent tries to restart the HE VM on another node but fails.
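For reference, a minimal sketch of the failure injection from step 4 and of how the failover attempt can be watched from a surviving host (the interface name is only an example taken from a later comment in this bug; substitute whatever NIC actually carries ovirtmgmt):

# on the host currently running the HE VM (assumes ens3f0 carries ovirtmgmt)
ifdown ens3f0

# on one of the other hosts: watch the shared HA metadata and the agent/broker decisions
watch -n 10 hosted-engine --vm-status
tail -f /var/log/ovirt-hosted-engine-ha/agent.log /var/log/ovirt-hosted-engine-ha/broker.log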
Expected results:
HE and app vms should restart on other nodes.

Additional info:
I see the following tracebacks in the ovirt-broker.log file:

FilesystemBackend dom_type=glusterfs sd_uuid=4373df07-8826-4376-9a93-14463a8ce537'
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle
    data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch
    .set_storage_domain(client, sd_type, **options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 66, in set_storage_domain
    self._backends[client].connect()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 461, in connect
    self._dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 108, in get_domain_path
    " in {1}".format(sd_uuid, parent))
BackendFailureException: path to storage domain 4373df07-8826-4376-9a93-14463a8ce537 not found in /rhev/data-center/mnt/glusterSD

But this domain actually exists:

[root@rhsqa-grafton4 ~]# ls -l /rhev/data-center/mnt/glusterSD/10.70.36.82\:_engine/
total 4
drwxr-xr-x. 5 vdsm kvm 4096 Jan 13 11:59 4373df07-8826-4376-9a93-14463a8ce537
-rwxr-xr-x. 1 vdsm kvm    0 Jan 17 11:59 __DIRECT_IO_TEST__

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/concurrent.py", line 180, in run
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 216, in process_requests
    count=1,
  File "/usr/lib64/python2.7/asyncore.py", line 220, in loop
    poll_fun(timeout, map)
  File "/usr/lib64/python2.7/asyncore.py", line 181, in poll2
    if obj.readable():
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 61, in readable
    return self._delegate_call("readable")
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 143, in _delegate_call
    return getattr(self.__impl, name)(self)
AttributeError: 'NoneType' object has no attribute 'readable'

gluster volume info engine:
===============================
[root@rhsqa-grafton4 ~]# gluster volume info engine

Volume Name: engine
Type: Replicate
Volume ID: 6364da28-c876-49d1-ba59-0826e172e539
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.82:/rhgs/engine/enginebrick
Brick2: 10.70.36.83:/rhgs/engine/enginebrick
Brick3: 10.70.36.84:/rhgs/engine/enginebrick (arbiter)
Options Reconfigured:
auth.ssl-allow: 10.70.36.84,10.70.36.82,10.70.36.83
server.ssl: on
client.ssl: on
cluster.granular-entry-heal: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 4MB
storage.owner-gid: 36
storage.owner-uid: 36
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
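Side note: a quick way to cross-check what the broker complains about above; the UUID and mount prefix are taken verbatim from the traceback, the rest is just a sketch:

# is the storage domain directory visible under the glusterSD mount the broker scans?
SD_UUID=4373df07-8826-4376-9a93-14463a8ce537
ls -d /rhev/data-center/mnt/glusterSD/*/"$SD_UUID" 2>/dev/null || echo "not visible under glusterSD"

# and is the engine volume actually mounted at that point?
mount | grep glusterSD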
sos reports can be found in the link below:
===========================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1413845/
On active host (puma18) with HE-VM, I've set physical NIC associated with ovirtmgmt network down. puma19 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"} Score : 3400 stopped : False Local maintenance : False crc32 : cd4acc61 local_conf_timestamp : 656 Host timestamp : 641 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=641 (Sun Mar 5 22:45:53 2017) host-id=1 score=3400 vm_conf_refresh_time=656 (Sun Mar 5 22:46:08 2017) conf_on_shared_storage=True maintenance=False state=EngineDown stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma18 Host ID : 2 Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"} Score : 3400 stopped : False Local maintenance : False crc32 : b17d39e0 local_conf_timestamp : 36444 Host timestamp : 36429 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=36429 (Sun Mar 5 22:45:05 2017) host-id=2 score=3400 vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False puma19 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"} Score : 3400 stopped : False Local maintenance : False crc32 : 2f77164d local_conf_timestamp : 746 Host timestamp : 731 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=731 (Sun Mar 5 22:47:23 2017) host-id=1 score=3400 vm_conf_refresh_time=746 (Sun Mar 5 22:47:38 2017) conf_on_shared_storage=True maintenance=False state=EngineStart stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : puma18 Host ID : 2 Engine status : unknown stale-data Score : 3400 stopped : False Local maintenance : False crc32 : b17d39e0 local_conf_timestamp : 36444 Host timestamp : 36429 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=36429 (Sun Mar 5 22:45:05 2017) host-id=2 score=3400 vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False HE-VM getting started on second host (puma19). 
puma19 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "powering up"} Score : 3400 stopped : False Local maintenance : False crc32 : 8a8ae896 local_conf_timestamp : 809 Host timestamp : 794 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=794 (Sun Mar 5 22:48:26 2017) host-id=1 score=3400 vm_conf_refresh_time=809 (Sun Mar 5 22:48:41 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : puma18 Host ID : 2 Engine status : unknown stale-data Score : 3400 stopped : False Local maintenance : False crc32 : b17d39e0 local_conf_timestamp : 36444 Host timestamp : 36429 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=36429 (Sun Mar 5 22:45:05 2017) host-id=2 score=3400 vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False Engine got started on puma19: puma19 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"} Score : 3400 stopped : False Local maintenance : False crc32 : 08091f98 local_conf_timestamp : 869 Host timestamp : 854 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=854 (Sun Mar 5 22:49:26 2017) host-id=1 score=3400 vm_conf_refresh_time=869 (Sun Mar 5 22:49:41 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : puma18 Host ID : 2 Engine status : unknown stale-data Score : 3400 stopped : False Local maintenance : False crc32 : b17d39e0 local_conf_timestamp : 36444 Host timestamp : 36429 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=36429 (Sun Mar 5 22:45:05 2017) host-id=2 score=3400 vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False I've failed to connect to engine using WEBUI and there was no ping to it. I also see that somehow score was not deducted for puma18, although it lost it's connectivity to network. 
puma19 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"} Score : 3400 stopped : False Local maintenance : False crc32 : 24cb1887 local_conf_timestamp : 1169 Host timestamp : 1154 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=1154 (Sun Mar 5 22:54:26 2017) host-id=1 score=3400 vm_conf_refresh_time=1169 (Sun Mar 5 22:54:41 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : puma18 Host ID : 2 Engine status : unknown stale-data Score : 3400 stopped : False Local maintenance : False crc32 : b17d39e0 local_conf_timestamp : 36444 Host timestamp : 36429 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=36429 (Sun Mar 5 22:45:05 2017) host-id=2 score=3400 vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False Components on hosts: libvirt-client-2.0.0-10.el7_3.4.x86_64 qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64 rhevm-appliance-20160721.0-2.el7ev.noarch mom-0.5.9-1.el7ev.noarch ovirt-hosted-engine-setup-2.1.0.3-1.el7ev.noarch ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch sanlock-3.4.0-1.el7.x86_64 ovirt-vmconsole-host-1.0.4-1.el7ev.noarch vdsm-4.19.6-1.el7ev.x86_64 ovirt-host-deploy-1.6.0-1.el7ev.noarch ovirt-vmconsole-1.0.4-1.el7ev.noarch ovirt-imageio-common-1.0.0-0.el7ev.noarch ovirt-imageio-daemon-1.0.0-0.el7ev.noarch ovirt-setup-lib-1.1.0-1.el7ev.noarch ovirt-hosted-engine-ha-2.1.0.3-1.el7ev.noarch Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016 Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.3 (Maipo) On engine: rhev-guest-tools-iso-4.1-3.el7ev.noarch rhevm-dependencies-4.1.0-1.el7ev.noarch rhevm-doc-4.1.0-2.el7ev.noarch rhevm-branding-rhev-4.1.0-1.el7ev.noarch rhevm-setup-plugins-4.1.0-1.el7ev.noarch rhevm-4.1.1.2-0.1.el7.noarch Linux version 3.10.0-514.6.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Feb 17 19:21:31 EST 2017 Linux 3.10.0-514.6.2.el7.x86_64 #1 SMP Fri Feb 17 19:21:31 EST 2017 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.3 (Maipo)
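Regarding the score not being deducted on puma18: the gateway/network penalty decisions are logged by the HA agent and broker on that host, so they can be inspected directly. A sketch (log paths are the ovirt-hosted-engine-ha defaults; the grep patterns are only a guess at the relevant messages):

# on puma18: did the agent notice the network failure and apply a score penalty?
grep -iE 'gateway|network|penal|score' /var/log/ovirt-hosted-engine-ha/agent.log | tail -n 40
grep -i 'ping' /var/log/ovirt-hosted-engine-ha/broker.log | tail -n 20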
Target release should be set once a package build is known to fix an issue. Since this bug is not in the MODIFIED state, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
There is still a lot of work to do on this bug; why did you try to verify it? It is not finished yet.
(In reply to Denis Chaplygin from comment #4)
> There is still a lot of work on that bug, why you tried to verify it? It's
> not yet finished at all.

Because it was moved to ON_QA:

Sandro Bonazzola 2017-03-03 09:18:42 EST
Status: MODIFIED → ON_QA
Target Release: --- → 2.1.0.4
Denis - why is it targeted to 4.1.4? Isn't 4.1.3 realistic?
Did you have just a single physical NIC for both the ovirtmgmt and gluster networks, or separate NICs for ovirtmgmt and Gluster?
(In reply to Denis Chaplygin from comment #7)
> Did you had just a single physical NIC for both ovirtmgmt and gluster
> networks or separate NICs for ovirtmgmt and Gluster?

I did not use gluster storage in my environment; I was using NFS instead. The host had several NICs, but management traffic was coming in through only one of them. Both management and storage traffic passed through a single NIC.
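For completeness, one way to confirm whether management and storage traffic share a physical NIC (a rough sketch; the address below is only an example, use the actual NFS server or gluster brick IP):

# which physical NIC backs the ovirtmgmt bridge?
ip link show master ovirtmgmt

# which local interface would be used to reach the storage server?
ip route get 10.70.36.82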
Hosted engine itself works just fine - when the host running the VM loses its connection to the storage, the VM is automatically restarted. For some reason the engine itself is not able to start, with the following error in server.log:

2017-04-13 14:48:51,344Z ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@1c4f021c for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
        at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:498) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1369) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.finalizeInternal(AbstractOperationContext.java:1328) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.finalizeStep(AbstractOperationContext.java:1301) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext$Step.access$300(AbstractOperationContext.java:1185) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext.executeResultHandlerPhase(AbstractOperationContext.java:767) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:644) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:370) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1329) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.ModelControllerImpl.boot(ModelControllerImpl.java:493) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:387) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:349) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.server.ServerService.boot(ServerService.java:397) [wildfly-server-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.server.ServerService.boot(ServerService.java:366) [wildfly-server-2.2.0.Final.jar:2.2.0.Final]
        at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:299) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_121]

2017-04-13 14:48:51,346Z ERROR [org.jboss.as.controller.client] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@1c4f021c for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException

It is definitely not a gluster-related issue, not a storage issue, and not a lack-of-entropy issue (I checked all of those). Restarting the ovirt-engine unit manually after a short delay fixes the issue.
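A sketch of the manual workaround described above (assumes console access to the HE VM and the standard ovirt-engine systemd unit and log locations):

# from a host: open a console to the hosted-engine VM
hosted-engine --console

# inside the engine VM: restart the engine service and confirm the deployment finishes this time
systemctl restart ovirt-engine
tail -f /var/log/ovirt-engine/server.log   # watch for the deployment completing instead of WFLYCTL0190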
Denis, please provide the engine and server logs from the time you saw the stack trace mentioned in comment #9.
Denis, can you try to increase deployment-timeout in the urn:jboss:domain:deployment-scanner subsystem and rerun? You can find it in ovirt-engine.xml.
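For reference, the attribute sits on the deployment-scanner element of the generated WildFly configuration. A sketch of locating it (the path below is the one shown later in this bug; the value is in seconds, and the runtime file may be regenerated from a template when the service starts, so a manual edit may not survive a restart):

# locate the current setting in the generated configuration
grep -n 'deployment-timeout' /var/lib/ovirt-engine/jboss_runtime/config/ovirt-engine.xml
# raise deployment-timeout="..." on the <deployment-scanner ...> element, then restart ovirt-engine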
Created attachment 1272355 [details] server.log
Created attachment 1272356 [details] engine.log
I increased it tenfold:

[root@hc-he-lynx ovirt-engine]# cat /var/lib/ovirt-engine/jboss_runtime/config/ovirt-engine.xml | grep deployment-timeout
    <deployment-scanner path="/var/lib/ovirt-engine/jboss_runtime/deployments" deployment-timeout="12000"

but I still get the same error:

2017-04-18 15:03:53,976Z ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@481734ad for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
Denis, is there anything that could cause the hosted-engine VM to be slow? I wonder how performant the storage holding the VM image is. I was told that this issue does not occur on storage types other than gluster.
Well, that was one of the first things I excluded :-) The behaviour is observable on both glusterfs-based and NFS-based storage.
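For what it's worth, a crude way to sanity-check write latency on the hosted-engine storage domain from a host; this is only a sketch, and the mount point is the gluster example from the bug description, so adjust it to the actual engine storage domain mount:

# direct-I/O write test against the engine storage domain mount
cd /rhev/data-center/mnt/glusterSD/10.70.36.82:_engine
dd if=/dev/zero of=__latency_test__ bs=4k count=1000 oflag=direct conv=fsync
rm -f __latency_test__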
Worked for me on regular environment on regular hosted-engine deployed over NFS storage domain with additional pair of NFS data storage domains. puma18 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma19 Host ID : 1 Engine status : {"health": "good", "vm": "up", "detail": "up"} Score : 3400 stopped : False Local maintenance : False crc32 : cb960afb local_conf_timestamp : 453498 Host timestamp : 453483 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=453483 (Mon Jun 19 17:59:26 2017) host-id=1 score=3400 vm_conf_refresh_time=453498 (Mon Jun 19 17:59:42 2017) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma18 Host ID : 2 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"} Score : 3400 stopped : False Local maintenance : False crc32 : 40bde156 local_conf_timestamp : 15729 Host timestamp : 15714 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=15714 (Mon Jun 19 18:00:09 2017) host-id=2 score=3400 vm_conf_refresh_time=15729 (Mon Jun 19 18:00:24 2017) conf_on_shared_storage=True maintenance=False state=EngineDown stopped=False puma18 ~]# hosted-engine --vm-status --== Host 1 status ==-- conf_on_shared_storage : True Status up-to-date : False Hostname : puma19 Host ID : 1 Engine status : unknown stale-data Score : 3400 stopped : False Local maintenance : False crc32 : cb960afb local_conf_timestamp : 453498 Host timestamp : 453483 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=453483 (Mon Jun 19 17:59:26 2017) host-id=1 score=3400 vm_conf_refresh_time=453498 (Mon Jun 19 17:59:42 2017) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False --== Host 2 status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : puma18 Host ID : 2 Engine status : {"health": "good", "vm": "up", "detail": "up"} Score : 3400 stopped : False Local maintenance : False crc32 : 0dac2bfd local_conf_timestamp : 15947 Host timestamp : 15931 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=15931 (Mon Jun 19 18:03:46 2017) host-id=2 score=3400 vm_conf_refresh_time=15947 (Mon Jun 19 18:04:02 2017) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False As you may see, HE-VM got migrated from puma19 to puma18, once I've turned off NIC on puma19, which was associated with ovirtmgmt bridge. 
puma18 was RHVH with these components: sanlock-3.4.0-1.el7.x86_64 ovirt-node-ng-nodectl-4.1.3-0.20170608.1.el7.noarch ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch mom-0.5.9-1.el7ev.noarch qemu-kvm-rhev-2.6.0-28.el7_3.10.x86_64 vdsm-4.19.18-1.el7ev.x86_64 ovirt-host-deploy-1.6.6-1.el7ev.noarch ovirt-imageio-common-1.0.0-0.el7ev.noarch ovirt-vmconsole-1.0.4-1.el7ev.noarch libvirt-client-2.0.0-10.el7_3.9.x86_64 ovirt-imageio-daemon-1.0.0-0.el7ev.noarch ovirt-hosted-engine-ha-2.1.2-1.el7ev.noarch ovirt-hosted-engine-setup-2.1.3-1.el7ev.noarch ovirt-setup-lib-1.1.3-1.el7ev.noarch ovirt-vmconsole-host-1.0.4-1.el7ev.noarch Linux version 3.10.0-514.21.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Apr 22 02:41:35 EDT 2017 Linux 3.10.0-514.21.1.el7.x86_64 #1 SMP Sat Apr 22 02:41:35 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux release 7.3 Engine: rhevm-dependencies-4.1.1-1.el7ev.noarch rhevm-doc-4.1.3-1.el7ev.noarch rhevm-4.1.3.4-0.1.el7.noarch rhev-guest-tools-iso-4.1-5.el7ev.noarch rhevm-setup-plugins-4.1.2-1.el7ev.noarch rhevm-branding-rhev-4.1.0-2.el7ev.noarch Linux version 3.10.0-681.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Jun 13 22:55:13 EDT 2017 Linux 3.10.0-681.el7.x86_64 #1 SMP Tue Jun 13 22:55:13 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.4 Beta (Maipo) Moving to verified, if there will be any issues related to HC, please reopen and specify as hyperconverged specific.
Reopening this bug: when tested with a hyperconverged setup I still see that when ovirtmgmt is down, HE is not able to start on another node.

engine.log:
================================
2017-06-28 07:06:54,812-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [] START, GlusterServersListVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'}), log id: 3b5317
2017-06-28 07:07:03,934-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [] EVENT_ID: HOST_INTERFACE_STATE_DOWN(610), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Interface ens3f0 on host rhsqa-grafton2.lab.eng.blr.redhat.com, changed state to down
2017-06-28 07:07:09,255-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler3) [] START, GlusterTasksListVDSCommand(HostName = rhsqa-grafton2.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab'}), log id: 5a272251
2017-06-28 07:07:10,437-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler3) [] FINISH, GlusterTasksListVDSCommand, return : [], log id: 5a272251
2017-06-28 07:07:13,535-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler9) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:07:13,535-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler9) [] Failed to fetch vms info for host 'rhsqa-grafton3.lab.eng.blr.redhat.com' - skipping VMs monitoring.
2017-06-28 07:07:13,541-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler4) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10, 802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GlusterServersListVDS failed: Heartbeat exceed ed 2017-06-28 07:07:13,541-04 ERROR [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [] Command 'GlusterServersListVDSCommand(Ho stName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGeneric Exception: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,541-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [] FINISH, GlusterServersListVDSCommand, lo g id: 3b5317 2017-06-28 07:07:13,541-04 ERROR [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler4) [] Error while refreshing server data for cluster 'Default' fro m database: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded (Failed with error V DS_NETWORK_ERROR and code 5022) 2017-06-28 07:07:13,542-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler4) [] START, GlusterVolumesListVDSCommand(Host Name = rhsqa-grafton3.lab.eng.blr.redhat.com, GlusterVolumesListVDSParameters:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'}), log id: 5da48e6e 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10, 802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command SpmStatusVDS failed: Heartbeat exceeded 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10 ,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GetStatsVDS failed: Heartbeat exceeded 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler10) [] Command 'GetStatsVDSCommand(HostName = rhsqa-gr afton3.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af', vds='Host[rhsqa-grafton3.lab.eng.blr.redh at.com,29e0cdcd-312d-41c7-bb2b-2777aac609af]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [] Failed getting vds stats, host='rhsqa-grafton3.lab.eng.blr.redhat.com'(29e0cdcd-312d-41c7-bb2b-2777aac609af): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [] Failure to refresh host 'rhsqa-grafton3.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler6) [] 
Command 'SpmStatusVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, SpmStatusVDSCommandParameters:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af', storagePoolId='00000001-0001-0001-0001-000000000311'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,542-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler10) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton3.lab.eng.blr.redhat.com'(29e0cdcd-312d-41c7-bb2b-2777aac609af): VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:07:13,543-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-33) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,544-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-50) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,544-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-39) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,544-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-31) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,552-04 INFO [org.ovirt.engine.core.bll.storage.pool.SetStoragePoolStatusCommand] (DefaultQuartzScheduler6) [6e07475e] Running command: SetStoragePoolStatusCommand internal: true. 
Entities affected : ID: 00000001-0001-0001-0001-000000000311 Type: StoragePool 2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '91166baa-54d0-48fd-9cb8-d225f6b9b8a2' status from 'Active' to 'Unknown', reason: null 2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '323bccf6-0fb4-4031-b53f-9f7be6cf5182' status from 'Active' to 'Unknown', reason: null 2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '8b7e37eb-6582-473f-90c9-96ad1039229a' status from 'Active' to 'Unknown', reason: null 2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '90898c1f-87e9-4324-aae4-a35c4a9ae2fa' status from 'Active' to 'Unknown', reason: null 2017-06-28 07:07:13,569-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-50) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,570-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-31) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,579-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-39) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,581-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-33) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:07:13,594-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [6e07475e] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR(987), Correlation ID: 6e07475e, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center Default. 
Setting Data Center status to Non Responsive (On host rhsqa-grafton3.lab.eng.blr.redhat.com, Error: Network error during communication with the Host.). 2017-06-28 07:08:09,656-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:08:09,656-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler6) [6e07475e] Exception during connection 2017-06-28 07:08:09,656-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler6) [6e07475e] Failed to invoke scheduled method updatingTimerElapsed: null 2017-06-28 07:08:09,681-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler9) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton2.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,681-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler9) [] Failed to fetch vms info for host 'rhsqa-grafton2.lab.eng.blr.redhat.com' - skipping VMs monitoring. 2017-06-28 07:08:09,681-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton1.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='c0db6d13-b14b-4a50-bba5-55d54476963f'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,681-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler8) [] Failed to fetch vms info for host 'rhsqa-grafton1.lab.eng.blr.redhat.com' - skipping VMs monitoring. 
2017-06-28 07:08:09,685-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton2.lab.eng.blr.redhat.com command GetStatsVDS failed: Heartbeat exceeded 2017-06-28 07:08:09,686-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler7) [] Command 'GetStatsVDSCommand(HostName = rhsqa-grafton2.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab', vds='Host[rhsqa-grafton2.lab.eng.blr.redhat.com,7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,686-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler7) [] Failed getting vds stats, host='rhsqa-grafton2.lab.eng.blr.redhat.com'(7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,686-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler7) [] Failure to refresh host 'rhsqa-grafton2.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,686-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler7) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton2.lab.eng.blr.redhat.com'(7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab): VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,686-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-30) [] Host 'rhsqa-grafton2.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:08:09,686-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-41) [] Host 'rhsqa-grafton2.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued. 
2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton1.lab.eng.blr.redhat.com command GetStatsVDS failed: Heartbeat exceeded 2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler5) [] Command 'GetStatsVDSCommand(HostName = rhsqa-grafton1.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='c0db6d13-b14b-4a50-bba5-55d54476963f', vds='Host[rhsqa-grafton1.lab.eng.blr.redhat.com,c0db6d13-b14b-4a50-bba5-55d54476963f]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler5) [] Failed getting vds stats, host='rhsqa-grafton1.lab.eng.blr.redhat.com'(c0db6d13-b14b-4a50-bba5-55d54476963f): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler5) [] Failure to refresh host 'rhsqa-grafton1.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,688-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler5) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton1.lab.eng.blr.redhat.com'(c0db6d13-b14b-4a50-bba5-55d54476963f): VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,689-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-32) [] Host 'rhsqa-grafton1.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:08:09,703-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-30) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton2.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued. 2017-06-28 07:08:09,781-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler3) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,781-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler3) [] Failed to fetch vms info for host 'rhsqa-grafton3.lab.eng.blr.redhat.com' - skipping VMs monitoring. 
2017-06-28 07:08:09,786-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GetCapabilitiesVDS failed: Heartbeat exceeded 2017-06-28 07:08:09,786-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler10) [] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand' return value 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSInfoReturn@7a9b8b62' 2017-06-28 07:08:09,786-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler10) [] HostName = rhsqa-grafton3.lab.eng.blr.redhat.com 2017-06-28 07:08:09,786-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler10) [] Command 'GetCapabilitiesVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af', vds='Host[rhsqa-grafton3.lab.eng.blr.redhat.com,29e0cdcd-312d-41c7-bb2b-2777aac609af]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,786-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [] Failure to refresh host 'rhsqa-grafton3.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,786-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler10) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton3.lab.eng.blr.redhat.com'(29e0cdcd-312d-41c7-bb2b-2777aac609af): VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,787-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler4) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GlusterVolumesListVDS failed: Heartbeat exceeded 2017-06-28 07:08:09,787-04 ERROR [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler4) [] Command 'GlusterVolumesListVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, GlusterVolumesListVDSParameters:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded 2017-06-28 07:08:09,787-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler4) [] FINISH, GlusterVolumesListVDSCommand, log id: 5da48e6e 2017-06-28 07:08:09,787-04 ERROR [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler4) [] Error while refreshing Gluster lightweight data of cluster 'Default': EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded (Failed with error VDS_NETWORK_ERROR and code 5022) 2017-06-28 07:08:10,444-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler8) [] No up server in cluster 2017-06-28 07:08:13,502-04 INFO [org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler9) [] No UP server found in cluster 'Default' for snapshot monitoring 2017-06-28 07:08:13,504-04 INFO 
[org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler9) [] No UP server found in cluster 'Default' for snapshot configurations monitoring 2017-06-28 07:09:08,750-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:09:08,750-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler4) [] Exception during connection 2017-06-28 07:09:08,750-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler4) [] Failure to refresh host 'rhsqa-grafton1.lab.eng.blr.redhat.com' runtime info: org/apache/commons/lang/exception/ExceptionUtils 2017-06-28 07:09:08,751-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler4) [] Failed to invoke scheduled method onTimer: null 2017-06-28 07:09:10,446-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler4) [] No up server in cluster 2017-06-28 07:11:00,843-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:11:00,843-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler5) [] Exception during connection 2017-06-28 07:11:00,843-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler5) [] Failure to refresh host 'rhsqa-grafton3.lab.eng.blr.redhat.com' runtime info: org/apache/commons/lang/exception/ExceptionUtils 2017-06-28 07:11:00,843-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler5) [] Failed to invoke scheduled method onTimer: null 2017-06-28 07:11:10,450-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler5) [] No up server in cluster 2017-06-28 07:11:56,893-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:11:56,893-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler2) [] Exception during connection 2017-06-28 07:11:56,894-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler2) [] Failed to invoke scheduled method poll: null 2017-06-28 07:12:10,452-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler3) [] No up server in cluster 2017-06-28 07:12:52,949-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:12:52,949-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler9) [] Exception during connection 2017-06-28 07:12:52,949-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler9) [] Failed to invoke scheduled method poll: null 2017-06-28 07:13:10,460-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler1) [] No up server in cluster 2017-06-28 07:13:13,507-04 INFO [org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler1) [] No UP server found in cluster 'Default' for snapshot monitoring 2017-06-28 07:13:13,508-04 INFO [org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler1) [] No UP server found in cluster 'Default' for snapshot configurations monitoring 2017-06-28 07:13:48,991-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed 2017-06-28 07:13:48,991-04 ERROR 
[org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler8) [] Exception during connection
2017-06-28 07:13:48,992-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler8) [] Failed to invoke scheduled method updatingTimerElapsed: null
2017-06-28 07:14:10,462-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler5) [] No up server in cluster :

from server.log:
======================================
2017-06-28 05:48:55,631-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-1) Client session created
2017-06-28 05:48:55,692-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-1) Client session created
2017-06-28 05:48:55,769-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-2) Server version string: SSH-2.0-OpenSSH_6.6.1
2017-06-28 05:48:55,771-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-3) Kex: server->client aes128-ctr hmac-sha2-256 none
2017-06-28 05:48:55,771-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-3) Kex: client->server aes128-ctr hmac-sha2-256 none
2017-06-28 05:48:55,831-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-2) Server version string: SSH-2.0-OpenSSH_6.6.1
2017-06-28 05:48:55,833-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-3) Kex: server->client aes128-ctr hmac-sha2-256 none
2017-06-28 05:48:55,833-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-3) Kex: client->server aes128-ctr hmac-sha2-256 none
2017-06-28 05:59:33,455-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-1) Client session created
2017-06-28 05:59:33,586-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-2) Server version string: SSH-2.0-OpenSSH_6.6.1
2017-06-28 05:59:33,589-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-3) Kex: server->client aes128-ctr hmac-sha2-256 none
2017-06-28 05:59:33,589-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-3) Kex: client->server aes128-ctr hmac-sha2-256 none
2017-06-28 06:19:07,399-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-13) Response content length is not known
2017-06-28 06:19:09,232-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-40) Response content length is not known
2017-06-28 06:19:11,436-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-9) Response content length is not known
2017-06-28 06:19:11,978-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-23) Response content length is not known
2017-06-28 07:08:09,656-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler6) Job DEFAULT.org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy.updatingTimerElapsed#-9223372036854775775 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
        at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source) [:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:09:08,751-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler4) Job DEFAULT.org.ovirt.engine.core.vdsbroker.VdsManager.onTimer#-9223372036854775801 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
        at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source) [:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:10:04,799-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler10) Job DEFAULT.org.ovirt.engine.core.vdsbroker.VdsManager.onTimer#-9223372036854775735 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
        at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source) [:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:11:00,843-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler5) Job DEFAULT.org.ovirt.engine.core.vdsbroker.VdsManager.onTimer#-9223372036854775731 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
        at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source) [:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:11:56,894-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler2) Job DEFAULT.org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher.poll#-9223372036854775799 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
        at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source) [:1.8.0_131]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
Please provide engine and vdsm logs. I wonder why vdsm was down in the first place.
I have copied the engine and vdsm logs to the location below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1413845/
Created attachment 1292799 [details] Fencing policies which are enabled on cluster
I checked the logs and I see a bunch of connection issues (Connection reset by peer). Just before the issue you referred to in comment #20 I saw:

2017-06-28 07:07:13,594-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [6e07475e] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR(987), Correlation ID: 6e07475e, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host rhsqa-grafton3.lab.eng.blr.redhat.com, Error: Network error during communication with the Host.).
2017-06-28 07:08:09,656-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed

It seems that vdsm could have been killed by sanlock. Please make sure that your storage is accessible. By the way, please provide vdsm logs from the time the issue occurred.
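For reference, a minimal sketch of how those two points could be checked on the affected host (the mount path below is a placeholder for the hosted-engine storage domain mount, not a value taken from this environment):

# Confirm the hosted-engine storage domain is still mounted and readable
# (replace <server>:_engine with the actual storage server and volume)
ls -l /rhev/data-center/mnt/glusterSD/<server>:_engine/

# Confirm vdsm is still running and was not terminated
systemctl status vdsmd
journalctl -u vdsmd --since "1 hour ago" | tail -n 50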
We don't seem to be able to reproduce this on a non-hyperconverged setup. Pushing to 4.1.5 for now. Can you please give us access to an environment where this issue occurs?
(In reply to Doron Fediuck from comment #47)
> We don't seem to be able to reproduce this on a non-hyperconverged setup.

I'm able to reproduce it on my NFS environment with exactly the same symptoms - the VM is restarted, but the engine itself never finishes deploying.
This is unrelated to migration. We were able to confirm that the VM started properly on a different host. The main issue is that the J2EE application starts but appears to be blocked. Still looking into this.
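For anyone debugging the same state, a rough sketch of how the blocked application could be inspected from inside the HE VM (assuming console or SSH access to the VM; the log path is the default oVirt engine location, and none of this is taken from the original analysis):

# Attach to the hosted-engine VM's console from the host currently running it
hosted-engine --console

# Inside the VM: check whether the engine service is up but not progressing
systemctl status ovirt-engine

# Look at the last activity in the engine's own log
tail -n 50 /var/log/ovirt-engine/engine.log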
There were quite significant changes to the JBoss and hosted-engine code; can you please check whether this is still happening?
Sure, will retest and update this. Keeping the needinfo on me till then.
(In reply to RamaKasturi from comment #53)
> Sure, will retest and update this. Keeping the needinfo on me till then.

Ping?
Hi Martin & Yaniv,

Apologies for the late reply. I was on PTO last week and did not get a chance to retest this yet. Will test it out this week and clear the needinfo.

Thanks,
kasturi
(In reply to RamaKasturi from comment #55)
> Hi Martin & Yaniv,
>
> Apologies for the late reply. I was on PTO last week and did not get a
> chance to retest this yet. Will test it out this week and clear the
> needinfo.
>
> Thanks,
> kasturi

Re-setting the NEEDINFO till we get a response.
Re-opening this bug as I see that the HE VM does not get migrated when ovirtmgmt is down.

Versions:
=================
Red Hat Virtualization Manager Version: 4.1.7.4-0.1.el7
vdsm-4.19.35-1.el7ev.x86_64

Following traceback seen on the machine where ovirtmgmt was down:
================================================================
MainThread::ERROR::2017-11-03 17:59:11,343::brokerlink::280::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::ERROR::2017-11-03 17:59:11,349::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 191, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 64, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
    hostname=socket.gethostname())
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 118, in notify
    .format(event_type, detail, options, e))
RequestError: Failed to send notification about state_transition, detail StartState-ReinitializeFSM, options {'hostname': 'yarrow.lab.eng.blr.redhat.com'}: Connection timed out
MainThread::ERROR::2017-11-03 17:59:11,350::agent::206::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::WARNING::2017-11-03 17:59:16,355::agent::209::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '4'
The latest sos reports are copied to the location below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1413845/
1. Which interface is used for storage?
2. (Just an idea) Can you block DNS (via an iptables rule) without failing the ovirtmgmt network and retry?
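In case it helps, one possible way to do point 2 (a sketch only; it assumes DNS is not needed for the storage path, and the rules should be removed after the test):

# Block outgoing DNS while leaving ovirtmgmt itself up
iptables -I OUTPUT -p udp --dport 53 -j DROP
iptables -I OUTPUT -p tcp --dport 53 -j DROP

# ... run the test, then remove the rules again
iptables -D OUTPUT -p udp --dport 53 -j DROP
iptables -D OUTPUT -p tcp --dport 53 -j DROP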
Hmm, what is the network topology here? I assume separate ovirtmgmt and storage networks are used and only ovirtmgmt is down, right?
(In reply to Martin Sivák from comment #61)
> Hmm, what is the network topology here? I assume separate ovirtmgmt and
> storage networks are used and only ovirtmgmt is down, right?

Yes, that is right, Martin.
The patch I posted fixes just a symptom.

> > Hmm, what is the network topology here? I assume separate ovirtmgmt and
> > storage networks are used and only ovirtmgmt is down, right?
>
> Yes, that is right, Martin.

This means sanlock won't fence the host, because the lockspace is still intact and accessible. On the other hand, the agent will see status updates and the score will be properly reduced.

I posted a patch that should properly handle notification failures and that will hopefully allow the logic to behave properly.

Btw, you really should have opened a new bug about this, as we are mixing multiple different cases here...
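To illustrate the expected behaviour, a small sketch of how the intact lockspace and the reduced score can be observed on the host whose ovirtmgmt is down (illustrative commands only, not part of the patch; the exact score values depend on the configured penalties):

# sanlock should still list its lockspaces, since the storage network is up
sanlock client status

# the local host's score should drop in the shared metadata over the next minutes
watch -n 10 'hosted-engine --vm-status | grep -iE "hostname|score"'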
(In reply to Martin Sivák from comment #63)
> The patch I posted fixes just a symptom.
>
> This means sanlock won't fence the host, because the lockspace is still
> intact and accessible. On the other hand, the agent will see status updates
> and the score will be properly reduced.
>
> I posted a patch that should properly handle notification failures and that
> will hopefully allow the logic to behave properly.
>
> Btw, you really should have opened a new bug about this, as we are mixing
> multiple different cases here...

Hi Martin,

If you look at the description of the bug, the whole point is about exactly this and not about anything else. This was the description of the problem:

"I see that HE vm does not migrate to another node when ovirtmgmt is down. ovirt-ha-agent tries to restart the host on another node but it fails to do so. when i check hosted-engine --vm-status i see the following output."

Thanks,
kasturi
It doesn't happen for me in a regular environment, so I assume it is relevant for HC environments only; please move it back to the HC team. Just be aware of bug https://bugzilla.redhat.com/show_bug.cgi?id=1528292, which can prevent ovirt-engine from starting.
Kasturi, can you please take care of verifying this bug?
Tested with RHV 4.2.3 with the following steps:
1. Created a hyperconverged setup with 3 nodes.
2. When the ovirtmgmt bridge is down on one of the hosts, the HE VM gets migrated to the other host automatically.

A rough sketch of the commands involved is shown below.
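This is only a sketch of how such a check can be driven, assuming shell access to the hosts; the interface name and the exact commands used during the actual verification may differ:

# On the host currently running the HE VM: take the management bridge down
ip link set ovirtmgmt down

# On another host: watch the HA agents react; the HE VM should come up elsewhere
watch -n 10 'hosted-engine --vm-status | grep -iE "hostname|engine status|score"'

# Afterwards, bring the bridge back up on the first host
ip link set ovirtmgmt up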
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.