Bug 1228171 - engine cant renew connectivity to host - heartbeat exceeded
Summary: engine cant renew connectivity to host - heartbeat exceeded
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.6
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 3.6.0
Assignee: Piotr Kliczewski
QA Contact: Pavel Stehlik
URL:
Whiteboard: infra
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-06-04 10:49 UTC by Roy Golan
Modified: 2019-04-28 11:19 UTC (History)
8 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-07-06 12:26:38 UTC
oVirt Team: Infra
Embargoed:


Attachments (Terms of Use)
engine.log (2.95 KB, text/plain)
2015-06-04 10:58 UTC, Roy Golan
no flags Details
engine.log during debug session (17.82 MB, text/plain)
2015-06-09 06:49 UTC, Roy Golan
no flags Details
DEBUG log (911.30 KB, application/x-gzip)
2015-06-09 08:27 UTC, Roy Golan
no flags Details

Description Roy Golan 2015-06-04 10:49:31 UTC
Description of problem:

Whenever my engine looses connectivity with my hosts it can't reconnect. It always says the heartbeat exceeded 


2015-06-04 13:46:39,376 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-49) [5a0fd004] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2015-06-04 13:46:39,379 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-49) [5a0fd004] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
	at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:183) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:101) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:61) [vdsbroker.jar:]
	at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
	at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:500) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:469) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:113) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:83) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:222) [vdsbroker.jar:]
	at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_75]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_75]
	at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75]
	at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
	at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2015-06-04 13:46:39,401 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-49) [5a0fd004] Failed to refresh VDS, network error, continuing, vds='freak-he-host'(f938999b-2c27-4da9-98ab-f6ffd1e3b8bf): VDSGenericException: VDSNetworkException: Heartbeat exeeded
2015-06-04 13:46:39,534 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (Stomp Reactor) [] Connecting to freak.usersys.redhat.com/10.35.4.174
2015-06-04 13:46:48,757 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-81) [] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand' return value '
TaskStatusListReturnForXmlRpc:{mStatus='StatusForXmlRpc [mCode=5022, mMessage=Heartbeat exeeded]'

Version-Release number of selected component (if applicable):

engine from master - hash f9a7fe22202d37ecce013b82c49988ca12a6b834
vdsm 4.17.0 on RHEL 7.1

How reproducible:
100%

Steps to Reproduce:
1.have an active host
2.cut the engine-host connection somehow
3.bring the connection up after 15 minutes

Comment 1 Roy Golan 2015-06-04 10:58:33 UTC
Created attachment 1034660 [details]
engine.log

Comment 2 Roy Golan 2015-06-04 11:00:02 UTC
also the host status doesn't change. and the non-responding treatment doesn't start at all. the host is effectively non-responding - every cycle there's a network exception.

Comment 3 Oved Ourfali 2015-06-05 07:20:40 UTC
Please also attach the host logs.

Comment 4 Oved Ourfali 2015-06-05 07:23:55 UTC
Another question, you have no fencing, right? 
What happens if you move the host to maintenance and activate it?

Comment 5 Piotr Kliczewski 2015-06-07 19:54:36 UTC
Please provide logs with debug enabled on the engine side and vdsm logs.

Comment 6 Roy Golan 2015-06-09 06:49:11 UTC
Created attachment 1036599 [details]
engine.log during debug session

Comment 7 Piotr Kliczewski 2015-06-09 08:04:26 UTC
Looking at the logs I can see two issues:

1. Network issue - Caused by: java.net.UnknownHostException: freak.usersys.redhat.com
2. There is an issue with vmsMonitoring. Next there is "Unable to process messages" and last "Unrecognized message received". After that I can see heartbeat issues.

I am not able to tell what happened because there was debug set. Can you please check whether there are any exceptions in server.log and vdsm.log?


2015-06-09 09:37:41,848 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler_Worker-21) [] Failed to invoke scheduled method vmsMonitoring: null
2015-06-09 09:37:42,732 INFO  [org.ovirt.engine.core.bll.OvfDataUpdater] (DefaultQuartzScheduler_Worker-41) [] Attempting to update VMs/Templates Ovf.
2015-06-09 09:37:42,734 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-41) [31ab2e3] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected :  ID: 00000001-0001-0001-0001-00000000014a Type: StoragePool
2015-06-09 09:37:42,764 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-41) [31ab2e3] Lock freed to object 'EngineLock:{exclusiveLocks='[00000001-0001-0001-0001-00000000014a=<OVF_UPDATE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2015-06-09 09:37:46,179 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-57) [] START, FullListVDSCommand(HostName = , FullListVDSCommandParameters:{runAsync='true', hostId='f938999b-2c27-4da9-98ab-f6ffd1e3b8bf', vds='Host[,f938999b-2c27-4da9-98ab-f6ffd1e3b8bf]', vmIds='[50e59bbd-f829-4761-ba1b-6db1a6acc3b8]'}), log id: 7eb04952
2015-06-09 09:37:46,184 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-57) [] FINISH, FullListVDSCommand, return: [{status=Up, nicModel=rtl8139,pv, emulatedMachine=rhel6.5.0, vmId=50e59bbd-f829-4761-ba1b-6db1a6acc3b8, pid=11380, clientIp=, devices=[Ljava.lang.Object;@679de778, smp=1, vmType=kvm, displayIp=0, displaySecurePort=-1, memSize=4096, displayPort=5900, cpuType=SandyBridge, spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir, guestDiskMapping={1b52d184-a370-48a7-8={name=/dev/vda}}, vmName=HostedEngine, display=vnc, pauseCode=NOERR}], log id: 7eb04952
2015-06-09 09:37:47,394 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler_Worker-57) [] Failed to invoke scheduled method vmsMonitoring: null
2015-06-09 09:37:52,104 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-3) [] START, FullListVDSCommand(HostName = , FullListVDSCommandParameters:{runAsync='true', hostId='f938999b-2c27-4da9-98ab-f6ffd1e3b8bf', vds='Host[,f938999b-2c27-4da9-98ab-f6ffd1e3b8bf]', vmIds='[50e59bbd-f829-4761-ba1b-6db1a6acc3b8]'}), log id: d0b6921
2015-06-09 09:37:52,145 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-3) [] FINISH, FullListVDSCommand, return: [{status=Up, nicModel=rtl8139,pv, emulatedMachine=rhel6.5.0, vmId=50e59bbd-f829-4761-ba1b-6db1a6acc3b8, pid=11380, clientIp=, devices=[Ljava.lang.Object;@5575bf49, smp=1, vmType=kvm, displayIp=0, displaySecurePort=-1, memSize=4096, displayPort=5900, cpuType=SandyBridge, spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir, guestDiskMapping={1b52d184-a370-48a7-8={name=/dev/vda}}, vmName=HostedEngine, display=vnc, pauseCode=NOERR}], log id: d0b6921
2015-06-09 09:38:18,075 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (Stomp Reactor) [] Unable to process messages
2015-06-09 09:38:32,171 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-99) [4437065a] Command 'GetStatsVDSCommand(HostName = freak-he-host, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='f938999b-2c27-4da9-98ab-f6ffd1e3b8bf', vds='Host[freak-he-host,f938999b-2c27-4da9-98ab-f6ffd1e3b8bf]'})' execution failed: VDSGenericException: VDSNetworkException: Unrecognized message received

Comment 8 Roy Golan 2015-06-09 08:27:14 UTC
Created attachment 1036637 [details]
DEBUG log

Comment 9 Piotr Kliczewski 2015-06-17 07:44:05 UTC
There was an issue with partially merged vdsm changes. Please check whether this issue is still there with the latest master.

Comment 10 Roy Golan 2015-07-06 12:26:38 UTC
I see timeouts every once in a while but they are getting corrected and  the heartbeat exceeding problem disappeared after vdsm update to latest master.

If I'll see a wrong behaviour or timeouts I can't explain I'll open a new bug.


Note You need to log in before you can comment on or make changes to this bug.