Description of problem: Whenever my engine looses connectivity with my hosts it can't reconnect. It always says the heartbeat exceeded 2015-06-04 13:46:39,376 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-49) [5a0fd004] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2015-06-04 13:46:39,379 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-49) [5a0fd004] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:183) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:101) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:61) [vdsbroker.jar:] at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:] at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:500) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:469) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:113) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:83) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:222) [vdsbroker.jar:] at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_75] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_75] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75] at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:] at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:] at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:] 2015-06-04 13:46:39,401 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-49) [5a0fd004] Failed to refresh VDS, network error, continuing, vds='freak-he-host'(f938999b-2c27-4da9-98ab-f6ffd1e3b8bf): VDSGenericException: VDSNetworkException: Heartbeat exeeded 2015-06-04 13:46:39,534 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (Stomp Reactor) [] Connecting to freak.usersys.redhat.com/10.35.4.174 2015-06-04 13:46:48,757 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-81) [] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand' return value ' TaskStatusListReturnForXmlRpc:{mStatus='StatusForXmlRpc [mCode=5022, mMessage=Heartbeat exeeded]' Version-Release number of selected component (if applicable): engine from master - hash f9a7fe22202d37ecce013b82c49988ca12a6b834 vdsm 4.17.0 on RHEL 7.1 How reproducible: 100% Steps to Reproduce: 1.have an active host 2.cut the engine-host connection somehow 3.bring the connection up after 15 minutes
Created attachment 1034660 [details] engine.log
also the host status doesn't change. and the non-responding treatment doesn't start at all. the host is effectively non-responding - every cycle there's a network exception.
Please also attach the host logs.
Another question, you have no fencing, right? What happens if you move the host to maintenance and activate it?
Please provide logs with debug enabled on the engine side and vdsm logs.
Created attachment 1036599 [details] engine.log during debug session
Looking at the logs I can see two issues: 1. Network issue - Caused by: java.net.UnknownHostException: freak.usersys.redhat.com 2. There is an issue with vmsMonitoring. Next there is "Unable to process messages" and last "Unrecognized message received". After that I can see heartbeat issues. I am not able to tell what happened because there was debug set. Can you please check whether there are any exceptions in server.log and vdsm.log? 2015-06-09 09:37:41,848 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler_Worker-21) [] Failed to invoke scheduled method vmsMonitoring: null 2015-06-09 09:37:42,732 INFO [org.ovirt.engine.core.bll.OvfDataUpdater] (DefaultQuartzScheduler_Worker-41) [] Attempting to update VMs/Templates Ovf. 2015-06-09 09:37:42,734 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-41) [31ab2e3] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected : ID: 00000001-0001-0001-0001-00000000014a Type: StoragePool 2015-06-09 09:37:42,764 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-41) [31ab2e3] Lock freed to object 'EngineLock:{exclusiveLocks='[00000001-0001-0001-0001-00000000014a=<OVF_UPDATE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}' 2015-06-09 09:37:46,179 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-57) [] START, FullListVDSCommand(HostName = , FullListVDSCommandParameters:{runAsync='true', hostId='f938999b-2c27-4da9-98ab-f6ffd1e3b8bf', vds='Host[,f938999b-2c27-4da9-98ab-f6ffd1e3b8bf]', vmIds='[50e59bbd-f829-4761-ba1b-6db1a6acc3b8]'}), log id: 7eb04952 2015-06-09 09:37:46,184 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-57) [] FINISH, FullListVDSCommand, return: [{status=Up, nicModel=rtl8139,pv, emulatedMachine=rhel6.5.0, vmId=50e59bbd-f829-4761-ba1b-6db1a6acc3b8, pid=11380, clientIp=, devices=[Ljava.lang.Object;@679de778, smp=1, vmType=kvm, displayIp=0, displaySecurePort=-1, memSize=4096, displayPort=5900, cpuType=SandyBridge, spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir, guestDiskMapping={1b52d184-a370-48a7-8={name=/dev/vda}}, vmName=HostedEngine, display=vnc, pauseCode=NOERR}], log id: 7eb04952 2015-06-09 09:37:47,394 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler_Worker-57) [] Failed to invoke scheduled method vmsMonitoring: null 2015-06-09 09:37:52,104 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-3) [] START, FullListVDSCommand(HostName = , FullListVDSCommandParameters:{runAsync='true', hostId='f938999b-2c27-4da9-98ab-f6ffd1e3b8bf', vds='Host[,f938999b-2c27-4da9-98ab-f6ffd1e3b8bf]', vmIds='[50e59bbd-f829-4761-ba1b-6db1a6acc3b8]'}), log id: d0b6921 2015-06-09 09:37:52,145 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-3) [] FINISH, FullListVDSCommand, return: [{status=Up, nicModel=rtl8139,pv, emulatedMachine=rhel6.5.0, vmId=50e59bbd-f829-4761-ba1b-6db1a6acc3b8, pid=11380, clientIp=, devices=[Ljava.lang.Object;@5575bf49, smp=1, vmType=kvm, displayIp=0, displaySecurePort=-1, memSize=4096, displayPort=5900, cpuType=SandyBridge, spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir, guestDiskMapping={1b52d184-a370-48a7-8={name=/dev/vda}}, vmName=HostedEngine, display=vnc, pauseCode=NOERR}], log id: d0b6921 2015-06-09 09:38:18,075 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (Stomp Reactor) [] Unable to process messages 2015-06-09 09:38:32,171 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-99) [4437065a] Command 'GetStatsVDSCommand(HostName = freak-he-host, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='f938999b-2c27-4da9-98ab-f6ffd1e3b8bf', vds='Host[freak-he-host,f938999b-2c27-4da9-98ab-f6ffd1e3b8bf]'})' execution failed: VDSGenericException: VDSNetworkException: Unrecognized message received
Created attachment 1036637 [details] DEBUG log
There was an issue with partially merged vdsm changes. Please check whether this issue is still there with the latest master.
I see timeouts every once in a while but they are getting corrected and the heartbeat exceeding problem disappeared after vdsm update to latest master. If I'll see a wrong behaviour or timeouts I can't explain I'll open a new bug.