Bug 1413845

| Summary: | Engine fails to start properly | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | RamaKasturi <knarra> |
| Component: | Agent | Assignee: | Martin Sivák <msivak> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | urgent | Docs Contact: | |
| Priority: | medium | | |
| Version: | 2.1.0 | CC: | alukiano, bugs, dchaplyg, dfediuck, fdelorey, knarra, mavital, mperina, pkliczew, sabose, sasundar |
| Target Milestone: | ovirt-4.2.0 | Keywords: | Reopened, Triaged |
| Target Release: | 2.2.0 | Flags: | rule-engine: ovirt-4.2+, rule-engine: blocker+, rule-engine: planning_ack+, msivak: devel_ack+, mavital: testing_ack+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | The status of the engine VM wasn't correctly detected due to a change in how vdsm reports it and this was breaking the migration. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-05-10 06:31:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1439657 | | |
| Bug Blocks: | 1277939, 1433896 | | |
| Attachments: | | | |
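The Doc Text above points at a change in how vdsm reports the engine VM status. As a rough illustration of what the HA agent has to classify, here is a minimal Python sketch (not the actual ovirt-hosted-engine-ha code; the field names come from the status dumps below, and the classification rules are assumptions):

```python
# Illustrative sketch only, not the ovirt-hosted-engine-ha implementation.
# Field names ("vm", "health", ...) are taken from the Engine status dumps
# in this report; the classification rules here are assumptions.
import json

def classify_engine_status(raw):
    """Map a vdsm-reported engine-status JSON string to a coarse verdict."""
    status = json.loads(raw)
    vm = status.get("vm", "unknown")          # e.g. "up" / "down"
    health = status.get("health", "unknown")  # e.g. "good" / "bad"
    if vm == "down":
        return "ENGINE_DOWN"
    if vm == "up" and health == "good":
        return "ENGINE_UP"
    # "vm": "up" with "health": "bad" covers both "powering up" and the
    # "failed liveliness check" states seen on puma18/puma19 below.
    return "ENGINE_UP_BAD_HEALTH"

print(classify_engine_status(
    '{"reason": "failed liveliness check", "health": "bad", '
    '"vm": "up", "detail": "up"}'))  # -> ENGINE_UP_BAD_HEALTH
```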
sos reports can be found in the link below:
===========================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1413845/

On the active host (puma18), which was running the HE VM, I set the physical NIC associated with the ovirtmgmt network down.
puma19 ~]# hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : puma19
Host ID : 1
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : cd4acc61
local_conf_timestamp : 656
Host timestamp : 641
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=641 (Sun Mar 5 22:45:53 2017)
host-id=1
score=3400
vm_conf_refresh_time=656 (Sun Mar 5 22:46:08 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineDown
stopped=False
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : puma18
Host ID : 2
Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : b17d39e0
local_conf_timestamp : 36444
Host timestamp : 36429
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=36429 (Sun Mar 5 22:45:05 2017)
host-id=2
score=3400
vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
puma19 ~]# hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : puma19
Host ID : 1
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 2f77164d
local_conf_timestamp : 746
Host timestamp : 731
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=731 (Sun Mar 5 22:47:23 2017)
host-id=1
score=3400
vm_conf_refresh_time=746 (Sun Mar 5 22:47:38 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineStart
stopped=False
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : False
Hostname : puma18
Host ID : 2
Engine status : unknown stale-data
Score : 3400
stopped : False
Local maintenance : False
crc32 : b17d39e0
local_conf_timestamp : 36444
Host timestamp : 36429
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=36429 (Sun Mar 5 22:45:05 2017)
host-id=2
score=3400
vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
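(Aside: the Extra metadata block in these status dumps is a plain key=value text format. Below is a minimal, purely illustrative parser; a hypothetical helper, not part of the hosted-engine tooling.)

```python
def parse_extra_metadata(block):
    """Parse the key=value lines of an 'Extra metadata' block into a dict.

    Values stay strings; human-readable timestamp suffixes such as
    "641 (Sun Mar 5 22:45:53 2017)" are stripped to the raw number.
    """
    meta = {}
    for line in block.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skips the "Extra metadata (valid at timestamp):" header
        key, _, value = line.partition("=")
        meta[key] = value.split(" (")[0]
    return meta

sample = """timestamp=641 (Sun Mar 5 22:45:53 2017)
host-id=1
score=3400
state=EngineDown"""
print(parse_extra_metadata(sample)["state"])  # EngineDown
```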
The HE VM is starting on the second host (puma19).
puma19 ~]# hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : puma19
Host ID : 1
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "powering up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 8a8ae896
local_conf_timestamp : 809
Host timestamp : 794
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=794 (Sun Mar 5 22:48:26 2017)
host-id=1
score=3400
vm_conf_refresh_time=809 (Sun Mar 5 22:48:41 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : False
Hostname : puma18
Host ID : 2
Engine status : unknown stale-data
Score : 3400
stopped : False
Local maintenance : False
crc32 : b17d39e0
local_conf_timestamp : 36444
Host timestamp : 36429
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=36429 (Sun Mar 5 22:45:05 2017)
host-id=2
score=3400
vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
The engine started on puma19:
puma19 ~]# hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : puma19
Host ID : 1
Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 08091f98
local_conf_timestamp : 869
Host timestamp : 854
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=854 (Sun Mar 5 22:49:26 2017)
host-id=1
score=3400
vm_conf_refresh_time=869 (Sun Mar 5 22:49:41 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : False
Hostname : puma18
Host ID : 2
Engine status : unknown stale-data
Score : 3400
stopped : False
Local maintenance : False
crc32 : b17d39e0
local_conf_timestamp : 36444
Host timestamp : 36429
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=36429 (Sun Mar 5 22:45:05 2017)
host-id=2
score=3400
vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
I failed to connect to the engine using the web UI, and there was no ping response from it.
I also see that the score was somehow not deducted for puma18, even though it lost its network connectivity.
puma19 ~]# hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : puma19
Host ID : 1
Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 24cb1887
local_conf_timestamp : 1169
Host timestamp : 1154
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=1154 (Sun Mar 5 22:54:26 2017)
host-id=1
score=3400
vm_conf_refresh_time=1169 (Sun Mar 5 22:54:41 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : False
Hostname : puma18
Host ID : 2
Engine status : unknown stale-data
Score : 3400
stopped : False
Local maintenance : False
crc32 : b17d39e0
local_conf_timestamp : 36444
Host timestamp : 36429
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=36429 (Sun Mar 5 22:45:05 2017)
host-id=2
score=3400
vm_conf_refresh_time=36444 (Sun Mar 5 22:45:21 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
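The score observation above comes down to how the agent turns detected problems into score penalties. A minimal sketch of the idea follows; the 3400 base score matches these dumps, but the penalty names and values are assumptions, not the agent's actual constants:

```python
# Sketch of the scoring idea: a base score minus penalties for detected
# problems. 3400 matches the dumps above; the penalties are assumed values.
BASE_SCORE = 3400
PENALTIES = {
    "gateway_unreachable": 1600,  # assumed value
    "engine_bad_health": 800,     # assumed value
}

def compute_score(problems):
    """Return the host score given a set of detected problem flags."""
    score = BASE_SCORE
    for problem in problems:
        score -= PENALTIES.get(problem, 0)
    return max(score, 0)

# puma18 lost ovirtmgmt connectivity yet still reported 3400, i.e. as if
# compute_score(set()) ran instead of compute_score({"gateway_unreachable"}).
print(compute_score({"gateway_unreachable"}))  # 1800
```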
Components on hosts:
libvirt-client-2.0.0-10.el7_3.4.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
rhevm-appliance-20160721.0-2.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
vdsm-4.19.6-1.el7ev.x86_64
ovirt-host-deploy-1.6.0-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.3-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)
On engine:
rhev-guest-tools-iso-4.1-3.el7ev.noarch
rhevm-dependencies-4.1.0-1.el7ev.noarch
rhevm-doc-4.1.0-2.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.0-1.el7ev.noarch
rhevm-4.1.1.2-0.1.el7.noarch
Linux version 3.10.0-514.6.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Feb 17 19:21:31 EST 2017
Linux 3.10.0-514.6.2.el7.x86_64 #1 SMP Fri Feb 17 19:21:31 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Target release should be set once a package build is known to fix the issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

There is still a lot of work left on this bug; why did you try to verify it? It's not finished at all.

(In reply to Denis Chaplygin from comment #4)
> There is still a lot of work left on this bug; why did you try to verify
> it? It's not finished at all.

Because it was moved to ON_QA.

Sandro Bonazzola 2017-03-03 09:18:42 EST
Status: MODIFIED → ON_QA
Target Release: --- → 2.1.0.4

Denis - why is it targeted to 4.1.4? Isn't 4.1.3 realistic?

Did you have just a single physical NIC for both the ovirtmgmt and gluster networks, or separate NICs for ovirtmgmt and Gluster?

(In reply to Denis Chaplygin from comment #7)
> Did you have just a single physical NIC for both the ovirtmgmt and gluster
> networks, or separate NICs for ovirtmgmt and Gluster?

I did not use gluster storage for my environment; I was using NFS instead. The host had several NICs, although management traffic came through only one of them. Both management and storage traffic passed through a single NIC.

Hosted engine itself works just fine: when the host running the VM loses its connection to the storage, the VM is automatically restarted.

For some reason the engine itself is not able to start, with the following error in server.log:
2017-04-13 14:48:51,344Z ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@1c4f021c for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:498) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1369) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.AbstractOperationContext$Step.finalizeInternal(AbstractOperationContext.java:1328) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.AbstractOperationContext$Step.finalizeStep(AbstractOperationContext.java:1301) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.AbstractOperationContext$Step.access$300(AbstractOperationContext.java:1185) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.AbstractOperationContext.executeResultHandlerPhase(AbstractOperationContext.java:767) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:644) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:370) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1329) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.ModelControllerImpl.boot(ModelControllerImpl.java:493) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:387) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:349) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.server.ServerService.boot(ServerService.java:397) [wildfly-server-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.server.ServerService.boot(ServerService.java:366) [wildfly-server-2.2.0.Final.jar:2.2.0.Final]
at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:299) [wildfly-controller-2.2.0.Final.jar:2.2.0.Final]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_121]
2017-04-13 14:48:51,346Z ERROR [org.jboss.as.controller.client] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@1c4f021c for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException
It is definitely not a gluster-related issue, not a storage issue, and not a lack-of-entropy issue (I checked all of those). Manually restarting the ovirt-engine unit after a short delay fixes the issue.
Denis, please provide the engine and server logs from the time you saw the stack trace mentioned in comment #9.

Denis, can you try to increase deployment-timeout in the urn:jboss:domain:deployment-scanner subsystem and rerun? You can find it in ovirt-engine.xml.

Created attachment 1272355 [details]
server.log
Created attachment 1272356 [details]
engine.log
I increased it tenfold:
[root@hc-he-lynx ovirt-engine]# cat /var/lib/ovirt-engine/jboss_runtime/config/ovirt-engine.xml |grep deployment-timeout
<deployment-scanner path="/var/lib/ovirt-engine/jboss_runtime/deployments" deployment-timeout="12000"
But I still get the same error:
2017-04-18 15:03:53,976Z ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler@481734ad for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
Denis, is there anything which could cause the hosted engine VM to be slow? I wonder how performant the storage holding the VM image is. I was told that this issue does not occur on storage types other than gluster.

Well, that was one of the first things I excluded :-) The behaviour is observable on both glusterfs- and nfs-based storage.

Worked for me on a regular environment, with a regular hosted-engine deployed over an NFS storage domain plus an additional pair of NFS data storage domains.
puma18 ~]# hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : puma19
Host ID : 1
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : cb960afb
local_conf_timestamp : 453498
Host timestamp : 453483
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=453483 (Mon Jun 19 17:59:26 2017)
host-id=1
score=3400
vm_conf_refresh_time=453498 (Mon Jun 19 17:59:42 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineUp
stopped=False
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : puma18
Host ID : 2
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 40bde156
local_conf_timestamp : 15729
Host timestamp : 15714
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=15714 (Mon Jun 19 18:00:09 2017)
host-id=2
score=3400
vm_conf_refresh_time=15729 (Mon Jun 19 18:00:24 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineDown
stopped=False
puma18 ~]# hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : False
Hostname : puma19
Host ID : 1
Engine status : unknown stale-data
Score : 3400
stopped : False
Local maintenance : False
crc32 : cb960afb
local_conf_timestamp : 453498
Host timestamp : 453483
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=453483 (Mon Jun 19 17:59:26 2017)
host-id=1
score=3400
vm_conf_refresh_time=453498 (Mon Jun 19 17:59:42 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineUp
stopped=False
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : puma18
Host ID : 2
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 0dac2bfd
local_conf_timestamp : 15947
Host timestamp : 15931
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=15931 (Mon Jun 19 18:03:46 2017)
host-id=2
score=3400
vm_conf_refresh_time=15947 (Mon Jun 19 18:04:02 2017)
conf_on_shared_storage=True
maintenance=False
state=EngineUp
stopped=False
As you can see, the HE VM got migrated from puma19 to puma18 once I turned off the NIC on puma19 that was associated with the ovirtmgmt bridge.
puma18 was RHVH with these components:
sanlock-3.4.0-1.el7.x86_64
ovirt-node-ng-nodectl-4.1.3-0.20170608.1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.10.x86_64
vdsm-4.19.18-1.el7ev.x86_64
ovirt-host-deploy-1.6.6-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.9.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-hosted-engine-ha-2.1.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.3-1.el7ev.noarch
ovirt-setup-lib-1.1.3-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
Linux version 3.10.0-514.21.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Apr 22 02:41:35 EDT 2017
Linux 3.10.0-514.21.1.el7.x86_64 #1 SMP Sat Apr 22 02:41:35 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 7.3
Engine:
rhevm-dependencies-4.1.1-1.el7ev.noarch
rhevm-doc-4.1.3-1.el7ev.noarch
rhevm-4.1.3.4-0.1.el7.noarch
rhev-guest-tools-iso-4.1-5.el7ev.noarch
rhevm-setup-plugins-4.1.2-1.el7ev.noarch
rhevm-branding-rhev-4.1.0-2.el7ev.noarch
Linux version 3.10.0-681.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Jun 13 22:55:13 EDT 2017
Linux 3.10.0-681.el7.x86_64 #1 SMP Tue Jun 13 22:55:13 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 Beta (Maipo)
Moving to VERIFIED. If there are any issues related to HC, please reopen and mark them as hyperconverged-specific.
Reopening this bug: when tested with a hyperconverged setup, I still see that when ovirtmgmt is down, the HE VM is not able to start on another node.
engine.log:
================================
2017-06-28 07:06:54,812-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [] START, GlusterServersListVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'}), log id: 3b5317
2017-06-28 07:07:03,934-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [] EVENT_ID: HOST_INTERFACE_STATE_DOWN(610), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Interface ens3f0 on host rhsqa-grafton2.lab.eng.blr.redhat.com, changed state to down
2017-06-28 07:07:09,255-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler3) [] START, GlusterTasksListVDSCommand(HostName = rhsqa-grafton2.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab'}), log id: 5a272251
2017-06-28 07:07:10,437-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler3) [] FINISH, GlusterTasksListVDSCommand, return: [], log id: 5a272251
2017-06-28 07:07:13,535-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler9) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:07:13,535-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler9) [] Failed to fetch vms info for host 'rhsqa-grafton3.lab.eng.blr.redhat.com' - skipping VMs monitoring.
2017-06-28 07:07:13,541-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler4) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GlusterServersListVDS failed: Heartbeat exceeded
2017-06-28 07:07:13,541-04 ERROR [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [] Command 'GlusterServersListVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:07:13,541-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [] FINISH, GlusterServersListVDSCommand, log id: 3b5317
2017-06-28 07:07:13,541-04 ERROR [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler4) [] Error while refreshing server data for cluster 'Default' from database: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded (Failed with error VDS_NETWORK_ERROR and code 5022)
2017-06-28 07:07:13,542-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler4) [] START, GlusterVolumesListVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, GlusterVolumesListVDSParameters:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'}), log id: 5da48e6e
2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command SpmStatusVDS failed: Heartbeat exceeded
2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GetStatsVDS failed: Heartbeat exceeded
2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler10) [] Command 'GetStatsVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af', vds='Host[rhsqa-grafton3.lab.eng.blr.redhat.com,29e0cdcd-312d-41c7-bb2b-2777aac609af]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [] Failed getting vds stats, host='rhsqa-grafton3.lab.eng.blr.redhat.com'(29e0cdcd-312d-41c7-bb2b-2777aac609af): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [] Failure to refresh host 'rhsqa-grafton3.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:07:13,542-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler6) [] Command 'SpmStatusVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, SpmStatusVDSCommandParameters:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af', storagePoolId='00000001-0001-0001-0001-000000000311'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:07:13,542-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler10) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton3.lab.eng.blr.redhat.com'(29e0cdcd-312d-41c7-bb2b-2777aac609af): VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:07:13,543-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-33) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:07:13,544-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-50) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:07:13,544-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-39) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:07:13,544-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-31) [] Host 'rhsqa-grafton3.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:07:13,552-04 INFO [org.ovirt.engine.core.bll.storage.pool.SetStoragePoolStatusCommand] (DefaultQuartzScheduler6) [6e07475e] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 00000001-0001-0001-0001-000000000311 Type: StoragePool
2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '91166baa-54d0-48fd-9cb8-d225f6b9b8a2' status from 'Active' to 'Unknown', reason: null
2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '323bccf6-0fb4-4031-b53f-9f7be6cf5182' status from 'Active' to 'Unknown', reason: null
2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '8b7e37eb-6582-473f-90c9-96ad1039229a' status from 'Active' to 'Unknown', reason: null
2017-06-28 07:07:13,553-04 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler6) [6e07475e] Storage Pool '00000001-0001-0001-0001-000000000311' - Updating Storage Domain '90898c1f-87e9-4324-aae4-a35c4a9ae2fa' status from 'Active' to 'Unknown', reason: null
2017-06-28 07:07:13,569-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-50) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:07:13,570-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-31) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:07:13,579-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-39) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:07:13,581-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-33) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton3.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:07:13,594-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [6e07475e] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR(987), Correlation ID: 6e07475e, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host rhsqa-grafton3.lab.eng.blr.redhat.com, Error: Network error during communication with the Host.).
2017-06-28 07:08:09,656-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed
2017-06-28 07:08:09,656-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler6) [6e07475e] Exception during connection
2017-06-28 07:08:09,656-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler6) [6e07475e] Failed to invoke scheduled method updatingTimerElapsed: null
2017-06-28 07:08:09,681-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler9) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton2.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,681-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler9) [] Failed to fetch vms info for host 'rhsqa-grafton2.lab.eng.blr.redhat.com' - skipping VMs monitoring.
2017-06-28 07:08:09,681-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton1.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='c0db6d13-b14b-4a50-bba5-55d54476963f'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,681-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler8) [] Failed to fetch vms info for host 'rhsqa-grafton1.lab.eng.blr.redhat.com' - skipping VMs monitoring.
2017-06-28 07:08:09,685-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton2.lab.eng.blr.redhat.com command GetStatsVDS failed: Heartbeat exceeded
2017-06-28 07:08:09,686-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler7) [] Command 'GetStatsVDSCommand(HostName = rhsqa-grafton2.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab', vds='Host[rhsqa-grafton2.lab.eng.blr.redhat.com,7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,686-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler7) [] Failed getting vds stats, host='rhsqa-grafton2.lab.eng.blr.redhat.com'(7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,686-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler7) [] Failure to refresh host 'rhsqa-grafton2.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,686-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler7) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton2.lab.eng.blr.redhat.com'(7e22cac5-c24c-4af3-bbab-6af2d5d7c8ab): VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,686-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-30) [] Host 'rhsqa-grafton2.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:08:09,686-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-41) [] Host 'rhsqa-grafton2.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton1.lab.eng.blr.redhat.com command GetStatsVDS failed: Heartbeat exceeded
2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler5) [] Command 'GetStatsVDSCommand(HostName = rhsqa-grafton1.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='c0db6d13-b14b-4a50-bba5-55d54476963f', vds='Host[rhsqa-grafton1.lab.eng.blr.redhat.com,c0db6d13-b14b-4a50-bba5-55d54476963f]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler5) [] Failed getting vds stats, host='rhsqa-grafton1.lab.eng.blr.redhat.com'(c0db6d13-b14b-4a50-bba5-55d54476963f): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,688-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler5) [] Failure to refresh host 'rhsqa-grafton1.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,688-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler5) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton1.lab.eng.blr.redhat.com'(c0db6d13-b14b-4a50-bba5-55d54476963f): VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,689-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-32) [] Host 'rhsqa-grafton1.lab.eng.blr.redhat.com' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:08:09,703-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-30) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host rhsqa-grafton2.lab.eng.blr.redhat.com is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.
2017-06-28 07:08:09,781-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler3) [] Command 'GetAllVmStatsVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,781-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler3) [] Failed to fetch vms info for host 'rhsqa-grafton3.lab.eng.blr.redhat.com' - skipping VMs monitoring.
2017-06-28 07:08:09,786-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GetCapabilitiesVDS failed: Heartbeat exceeded
2017-06-28 07:08:09,786-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler10) [] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand' return value 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSInfoReturn@7a9b8b62'
2017-06-28 07:08:09,786-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler10) [] HostName = rhsqa-grafton3.lab.eng.blr.redhat.com
2017-06-28 07:08:09,786-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler10) [] Command 'GetCapabilitiesVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af', vds='Host[rhsqa-grafton3.lab.eng.blr.redhat.com,29e0cdcd-312d-41c7-bb2b-2777aac609af]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,786-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [] Failure to refresh host 'rhsqa-grafton3.lab.eng.blr.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,786-04 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler10) [] Failed to refresh VDS, network error, continuing, vds='rhsqa-grafton3.lab.eng.blr.redhat.com'(29e0cdcd-312d-41c7-bb2b-2777aac609af): VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,787-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler4) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhsqa-grafton3.lab.eng.blr.redhat.com command GlusterVolumesListVDS failed: Heartbeat exceeded
2017-06-28 07:08:09,787-04 ERROR [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler4) [] Command 'GlusterVolumesListVDSCommand(HostName = rhsqa-grafton3.lab.eng.blr.redhat.com, GlusterVolumesListVDSParameters:{runAsync='true', hostId='29e0cdcd-312d-41c7-bb2b-2777aac609af'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-06-28 07:08:09,787-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler4) [] FINISH, GlusterVolumesListVDSCommand, log id: 5da48e6e
2017-06-28 07:08:09,787-04 ERROR [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler4) [] Error while refreshing Gluster lightweight data of cluster 'Default': EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded (Failed with error VDS_NETWORK_ERROR and code 5022)
2017-06-28 07:08:10,444-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler8) [] No up server in cluster
2017-06-28 07:08:13,502-04 INFO [org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler9) [] No UP server found in cluster 'Default' for snapshot monitoring
2017-06-28 07:08:13,504-04 INFO [org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler9) [] No UP server found in cluster 'Default' for snapshot configurations monitoring
2017-06-28 07:09:08,750-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed
2017-06-28 07:09:08,750-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler4) [] Exception during connection
2017-06-28 07:09:08,750-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler4) [] Failure to refresh host 'rhsqa-grafton1.lab.eng.blr.redhat.com' runtime info: org/apache/commons/lang/exception/ExceptionUtils
2017-06-28 07:09:08,751-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler4) [] Failed to invoke scheduled method onTimer: null
2017-06-28 07:09:10,446-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler4) [] No up server in cluster
2017-06-28 07:11:00,843-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed
2017-06-28 07:11:00,843-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler5) [] Exception during connection
2017-06-28 07:11:00,843-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler5) [] Failure to refresh host 'rhsqa-grafton3.lab.eng.blr.redhat.com' runtime info: org/apache/commons/lang/exception/ExceptionUtils
2017-06-28 07:11:00,843-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler5) [] Failed to invoke scheduled method onTimer: null
2017-06-28 07:11:10,450-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler5) [] No up server in cluster
2017-06-28 07:11:56,893-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed
2017-06-28 07:11:56,893-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler2) [] Exception during connection
2017-06-28 07:11:56,894-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler2) [] Failed to invoke scheduled method poll: null
2017-06-28 07:12:10,452-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler3) [] No up server in cluster
2017-06-28 07:12:52,949-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed
2017-06-28 07:12:52,949-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler9) [] Exception during connection
2017-06-28 07:12:52,949-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler9) [] Failed to invoke scheduled method poll: null
2017-06-28 07:13:10,460-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler1) [] No up server in cluster
2017-06-28 07:13:13,507-04 INFO [org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler1) [] No UP server found in cluster 'Default' for snapshot monitoring
2017-06-28 07:13:13,508-04 INFO [org.ovirt.engine.core.bll.gluster.GlusterSnapshotSyncJob] (DefaultQuartzScheduler1) [] No UP server found in cluster 'Default' for snapshot configurations monitoring
2017-06-28 07:13:48,991-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed
2017-06-28 07:13:48,991-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler8) [] Exception during connection
2017-06-28 07:13:48,992-04 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler8) [] Failed to invoke scheduled method updatingTimerElapsed: null
2017-06-28 07:14:10,462-04 INFO [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler5) [] No up server in cluster
...
from server.log:
======================================
2017-06-28 05:48:55,631-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-1) Client session created
2017-06-28 05:48:55,692-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-1) Client session created
2017-06-28 05:48:55,769-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-2) Server version string: SSH-2.0-OpenSSH_6.6.1
2017-06-28 05:48:55,771-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-3) Kex: server->client aes128-ctr hmac-sha2-256 none
2017-06-28 05:48:55,771-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[305d521d]-nio2-thread-3) Kex: client->server aes128-ctr hmac-sha2-256 none
2017-06-28 05:48:55,831-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-2) Server version string: SSH-2.0-OpenSSH_6.6.1
2017-06-28 05:48:55,833-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-3) Kex: server->client aes128-ctr hmac-sha2-256 none
2017-06-28 05:48:55,833-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[7e6ae5eb]-nio2-thread-3) Kex: client->server aes128-ctr hmac-sha2-256 none
2017-06-28 05:59:33,455-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-1) Client session created
2017-06-28 05:59:33,586-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-2) Server version string: SSH-2.0-OpenSSH_6.6.1
2017-06-28 05:59:33,589-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-3) Kex: server->client aes128-ctr hmac-sha2-256 none
2017-06-28 05:59:33,589-04 INFO [org.apache.sshd.client.session.ClientSessionImpl] (sshd-SshClient[6e63229]-nio2-thread-3) Kex: client->server aes128-ctr hmac-sha2-256 none
2017-06-28 06:19:07,399-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-13) Response content length is not known
2017-06-28 06:19:09,232-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-40) Response content length is not known
2017-06-28 06:19:11,436-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-9) Response content length is not known
2017-06-28 06:19:11,978-04 INFO [org.apache.commons.httpclient.HttpMethodBase] (org.ovirt.thread.pool-6-thread-23) Response content length is not known
2017-06-28 07:08:09,656-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler6) Job DEFAULT.org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy.updatingTimerElapsed#-9223372036854775775 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source) [:1.8.0_131]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:09:08,751-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler4) Job DEFAULT.org.ovirt.engine.core.vdsbroker.VdsManager.onTimer#-9223372036854775801 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source) [:1.8.0_131]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:10:04,799-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler10) Job DEFAULT.org.ovirt.engine.core.vdsbroker.VdsManager.onTimer#-9223372036854775735 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source) [:1.8.0_131]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:11:00,843-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler5) Job DEFAULT.org.ovirt.engine.core.vdsbroker.VdsManager.onTimer#-9223372036854775731 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source) [:1.8.0_131]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_131]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
2017-06-28 07:11:56,894-04 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler2) Job DEFAULT.org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher.poll#-9223372036854775799 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source) [:1.8.0_131]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_131]
at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_131]
at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
Target release should be set once a package build is known to fix the issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Please provide engine and vdsm logs. I wonder why vdsm was down in the first place.

I have copied the engine and vdsm logs to the location below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1413845/

Created attachment 1292799 [details]
Fencing policies enabled on the cluster
I checked the logs and I see a bunch of connection issues (Connection reset by peer). Just before the issue you referred to in comment #20 I saw:

2017-06-28 07:07:13,594-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [6e07475e] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR(987), Correlation ID: 6e07475e, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host rhsqa-grafton3.lab.eng.blr.redhat.com, Error: Network error during communication with the Host.).
2017-06-28 07:08:09,656-04 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed

It seems that vdsm could have been killed by sanlock. Please make sure that your storage is accessible. By the way, please provide vdsm logs from the time the issue occurred.

We don't seem to be able to reproduce this on a non-hyperconverged setup. Pushing to 4.1.5 for now. Can you please allow us access to an environment with this issue?

(In reply to Doron Fediuck from comment #47)
> We don't seem to be able to reproduce this on a non-hyperconverged setup.

I'm able to reproduce it on my NFS environment with exactly the same symptoms: the VM is restarted, but the engine itself never finishes deploying. This is unrelated to migration.

We managed to see that the VM properly started on a different host. The main issue is that the J2EE application starts but seems to be blocked. Still looking into this.

There were quite significant changes to the JBoss and hosted engine code; can you please check if this is still happening?

Sure, will retest and update this. Keeping the needinfo on me till then.

(In reply to RamaKasturi from comment #53)
> Sure, will retest and update this. Keeping the needinfo on me till then.

Ping?

Hi Martin & Yaniv,
Apologies for the late reply. I was on PTO last week and did not get a chance to retest this yet. Will test it out this week and clear the needinfo.
Thanks
kasturi
(In reply to RamaKasturi from comment #55)
> Hi Martin & Yaniv,
>
> Apologies for the late reply. I was on PTO last week and did not get a
> chance to retest this yet. Will test it out this week and clear the
> needinfo.
>
> Thanks
> kasturi

Re-setting the NEEDINFO till we get a response.

Re-opening this bug as I see that the HE VM does not get migrated when ovirtmgmt is down.
Versions:
=================
Red Hat Virtualization Manager Version: 4.1.7.4-0.1.el7
vdsm-4.19.35-1.el7ev.x86_64
The following traceback is seen on the machine where ovirtmgmt was down:
================================================================
MainThread::ERROR::2017-11-03 17:59:11,343::brokerlink::280::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::ERROR::2017-11-03 17:59:11,349::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 191, in _run_agent
return action(he)
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 64, in action_proper
return he.start_monitoring()
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
hostname=socket.gethostname())
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 118, in notify
.format(event_type, detail, options, e))
RequestError: Failed to send notification about state_transition, detail StartState-ReinitializeFSM, options {'hostname': 'yarrow.lab.eng.blr.redhat.com'}: Connection timed
out
MainThread::ERROR::2017-11-03 17:59:11,350::agent::206::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::WARNING::2017-11-03 17:59:16,355::agent::209::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '4'
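For context, the crash path above is the broker notification failure escaping all the way to the agent's main loop, which forces the "Trying to restart agent" cycle. A minimal sketch of the kind of handling the fix needs (hypothetical names, not the actual patch; the real change lives in ovirt-hosted-engine-ha):

import logging

log = logging.getLogger(__name__)

class RequestError(Exception):
    """Stand-in for ovirt_hosted_engine_ha.lib.exceptions.RequestError."""

def notify_state_transition(broker, event_type, detail, **options):
    # Hypothetical wrapper: a failed notification is logged instead of
    # being allowed to escape start_monitoring(), so the state machine
    # keeps running even when the notification transport is down.
    try:
        broker.notify(event_type, detail, **options)
        return True
    except RequestError as e:
        log.warning("Failed to send notification about %s (%s): %s",
                    event_type, detail, e)
        return False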
Latest sos reports are copied to the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1413845/

1. Which interface is used for storage?
2. (just an idea) Can you block DNS (via an iptables rule) without failing the ovirtmgmt network and retry?

Hmm, what is the network topology here? I assume separate ovirtmgmt and storage networks are used and only ovirtmgmt is down, right?

(In reply to Martin Sivák from comment #61)
> Hmm what is the network topology here? I assume a separate ovirtmgmt and
> storage networks are used and only ovirtmgmt is down, right?

Yes, that is right Martin.

The patch I posted fixes just a symptom.
> > Hmm what is the network topology here? I assume a separate ovirtmgmt and
> > storage networks are used and only ovirtmgmt is down, right?
>
> Yes, that is right Martin.
This means sanlock won't fence the host, because lockspace is still intact and accessible. On the other hand, the agent will see status updates and the score will be properly reduced.
I posted a patch that should properly handle notification failures and that will hopefully allow the logic to behave properly.
Btw, you really should have opened a new bug about this as we are mixing multiple different cases here...
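To make the score reasoning above concrete, here is an illustrative Python sketch (not the agent's real scoring code; the 1600-point gateway penalty is an assumption based on the defaults in this release) of how penalties pull a host below its 3400 base score:

# Illustrative only: penalties for failed checks are subtracted from a
# base score, so a host that keeps its lockspace (no sanlock fencing)
# can still lose the engine VM to a healthier peer.
BASE_SCORE = 3400
GATEWAY_PENALTY = 1600  # assumed penalty for an unreachable gateway

def effective_score(penalties):
    """penalties: iterable of (reason, points) tuples."""
    return max(0, BASE_SCORE - sum(points for _, points in penalties))

# effective_score([("gateway unreachable", GATEWAY_PENALTY)]) == 1800,
# matching the reduced Score: 1800 seen in the status output in the
# description below, so a 3400-score peer is preferred for the engine VM.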
(In reply to Martin Sivák from comment #63)
> The patch I posted fixes just a symptom.
>
> > > Hmm what is the network topology here? I assume a separate ovirtmgmt and
> > > storage networks are used and only ovirtmgmt is down, right?
> >
> > Yes, that is right Martin.
>
> This means sanlock won't fence the host, because lockspace is still intact
> and accessible. On the other hand, the agent will see status updates and the
> score will be properly reduced.
>
> I posted a patch that should properly handle notification failures and that
> will hopefully allow the logic to behave properly.
>
> Btw, you really should have opened a new bug about this as we are mixing
> multiple different cases here...

Hi Martin,

If you look at the description of the bug, the whole point is about this and not about anything else. This was the description of the problem: I see that the HE VM does not migrate to another node when ovirtmgmt is down. ovirt-ha-agent tries to restart the VM on another node but fails to do so. When I check hosted-engine --vm-status I see the following output.

Thanks
kasturi

It does not happen to me in a regular environment, so I assume it is relevant for HC environments only; please move it back to the HC team.

Just be aware of bug https://bugzilla.redhat.com/show_bug.cgi?id=1528292, which can prevent ovirt-engine from starting.

Kasturi, can you please take care of verifying this bug?

Tested with RHV 4.2.3 with the following steps:
1. Created a hyperconverged setup with 3 nodes.
2. When the ovirtmgmt bridge is down on one of the hosts, the HE VM gets migrated to another host automatically.

This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in the oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
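For anyone re-running the verification steps above, a small polling helper along these lines can confirm the migration finished. A sketch only: hosted-engine --vm-status --json exists on recent builds, but the exact JSON layout is version dependent, so the check below just searches the serialized output.

import subprocess
import time

def wait_for_healthy_engine(timeout=600, interval=15):
    # Poll hosted-engine until some host reports a healthy engine VM.
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(
            ["hosted-engine", "--vm-status", "--json"]).decode()
        # The JSON layout differs between versions, so just look for a
        # good health report anywhere in it.
        if '"health": "good"' in out:
            return True
        time.sleep(interval)
    return False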
Description of problem:
I see that the HE VM does not migrate to another node when ovirtmgmt is down. ovirt-ha-agent tries to restart the VM on another node but fails to do so. When I check hosted-engine --vm-status I see the following output.

[root@rhsqa-grafton4 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : rhsqa-grafton4.lab.eng.blr.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : f08a8d7e
local_conf_timestamp               : 0
Host timestamp                     : 70746
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=70746 (Tue Jan 17 11:51:42 2017)
	host-id=1
	score=3400
	vm_conf_refresh_time=0 (Thu Jan  1 05:30:00 1970)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineStart
	stopped=False

--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : rhsqa-grafton5.lab.eng.blr.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "Storage of VM is locked. Is another host already starting the VM?", "health": "bad", "vm": "already_locked", "detail": "down"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 06e4fbae
local_conf_timestamp               : 0
Host timestamp                     : 70816
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=70816 (Tue Jan 17 11:51:31 2017)
	host-id=2
	score=3400
	vm_conf_refresh_time=0 (Thu Jan  1 05:30:00 1970)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineDown
	stopped=False

--== Host 3 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : rhsqa-grafton6.lab.eng.blr.redhat.com
Host ID                            : 3
Engine status                      : unknown stale-data
Score                              : 1800
stopped                            : False
Local maintenance                  : False
crc32                              : cc238e1d
local_conf_timestamp               : 0
Host timestamp                     : 69282
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=69282 (Tue Jan 17 11:25:57 2017)
	host-id=3
	score=1800
	vm_conf_refresh_time=0 (Thu Jan  1 05:30:00 1970)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False

You have new mail in /var/spool/mail/root

When I execute vdsClient -s 0 list table I see that both VMs are up and running, but I cannot ping them.

[root@rhsqa-grafton6 ~]# vdsClient -s 0 list table
0015d21c-fc4e-440b-8601-31ab990b8712  20937  applinuxvm           Up
ae48ef65-f289-42ec-8b1f-94215115e95d  19825  HostedEngine         Up         10.70.44.74

Version-Release number of selected component (if applicable):
[root@rhsqa-grafton4 ~]# rpm -qa | grep ovirt
ovirt-imageio-daemon-0.5.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-0.0.master.git46cacd3.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0-0.0.master.git118000f.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-host-deploy-1.6.0-0.2.master.gitb76ad50.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy the HC setup on arbiter volumes.
2. Now start running the HE and app VMs on the third node, which is the arbiter node.
3. Have power management enabled.
4. Bring down the ovirtmgmt network by running the command ifdown <nic>.

Actual results:
The HE VM does not restart on another node, and I/O stops happening on the app VM. ovirt-ha-agent tries to restart the HE VM on another node but fails.
Expected results:
The HE and app VMs should restart on other nodes.

Additional info:
I see the following tracebacks in the ovirt-broker.log file:

FilesystemBackend dom_type=glusterfs sd_uuid=4373df07-8826-4376-9a93-14463a8ce537'
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle
    data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch
    .set_storage_domain(client, sd_type, **options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 66, in set_storage_domain
    self._backends[client].connect()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 461, in connect
    self._dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 108, in get_domain_path
    " in {1}".format(sd_uuid, parent))
BackendFailureException: path to storage domain 4373df07-8826-4376-9a93-14463a8ce537 not found in /rhev/data-center/mnt/glusterSD

But this domain actually exists:

[root@rhsqa-grafton4 ~]# ls -l /rhev/data-center/mnt/glusterSD/10.70.36.82\:_engine/
total 4
drwxr-xr-x. 5 vdsm kvm 4096 Jan 13 11:59 4373df07-8826-4376-9a93-14463a8ce537
-rwxr-xr-x. 1 vdsm kvm    0 Jan 17 11:59 __DIRECT_IO_TEST__

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 368, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/concurrent.py", line 180, in run
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 216, in process_requests
    count=1,
  File "/usr/lib64/python2.7/asyncore.py", line 220, in loop
    poll_fun(timeout, map)
  File "/usr/lib64/python2.7/asyncore.py", line 181, in poll2
    if obj.readable():
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 61, in readable
    return self._delegate_call("readable")
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 143, in _delegate_call
    return getattr(self.__impl, name)(self)
AttributeError: 'NoneType' object has no attribute 'readable'

gluster volume info engine:
===========================
[root@rhsqa-grafton4 ~]# gluster volume info engine

Volume Name: engine
Type: Replicate
Volume ID: 6364da28-c876-49d1-ba59-0826e172e539
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.82:/rhgs/engine/enginebrick
Brick2: 10.70.36.83:/rhgs/engine/enginebrick
Brick3: 10.70.36.84:/rhgs/engine/enginebrick (arbiter)
Options Reconfigured:
auth.ssl-allow: 10.70.36.84,10.70.36.82,10.70.36.83
server.ssl: on
client.ssl: on
cluster.granular-entry-heal: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 4MB
storage.owner-gid: 36
storage.owner-uid: 36
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
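For reference, the failing lookup in storage_backends.get_domain_path boils down to scanning the mount parent for a directory named after the storage domain UUID. A simplified sketch (an approximation of the logic, not the exact upstream code) shows why a glusterfs mount that is missing or not yet visible under /rhev/data-center/mnt/glusterSD triggers the error even though the domain exists on the volume:

import os

class BackendFailureException(Exception):
    pass

def get_domain_path(sd_uuid, parent="/rhev/data-center/mnt/glusterSD"):
    # Each gluster mount appears as a "<server>:_<volume>" directory under
    # the parent; the storage domain must be a UUID-named directory inside
    # one of them. If the mount is gone (or not yet remounted), nothing
    # matches and the lookup fails even though the data is intact.
    for mount in os.listdir(parent):
        candidate = os.path.join(parent, mount, sd_uuid)
        if os.path.isdir(candidate):
            return candidate
    raise BackendFailureException(
        "path to storage domain {0} not found in {1}".format(sd_uuid, parent))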