Created attachment 977699 [details]
logs

Description of problem:
Our automation jobs fail randomly with:

2015-01-08 11:07:08,085 WARN [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing
2015-01-08 11:07:11,031 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) Connecting to /10.35.160.35
2015-01-08 11:07:11,032 WARN [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) Retry failed
2015-01-08 11:07:11,032 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler_Worker-77) Exception during connection
2015-01-08 11:07:11,034 WARN [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing

Version-Release number of selected component (if applicable):
rhevm-3.5.0-0.26.el6ev.noarch

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Update: This also occurs on rhevm-3.5.0-0.28.el6ev.noarch
What test are you running when you're getting these errors? Any influence on the engine / hosts when this happens? It looks like a failure to connect to the host.
The above log entries are only side effects. Looking at the engine log I can see that we lost connection with a host:

2015-01-08 11:07:07,988 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-35) Command GetStatsVDSCommand(HostName = host_mixed_1, HostId = e4b6edae-aab4-4d58-bd8f-ac82db3806fe, vds=Host[host_mixed_1,e4b6edae-aab4-4d58-bd8f-ac82db3806fe]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2015-01-08 11:07:07,990 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-57) Command SpmStatusVDSCommand(HostName = host_mixed_1, HostId = e4b6edae-aab4-4d58-bd8f-ac82db3806fe, storagePoolId = 2d820c0e-f472-463d-b0d4-5f26b81a873d) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded

Looking at the vdsm logs I can see that the jsonrpc server was stopped:

MainThread::DEBUG::2015-01-08 11:06:59,340::vdsm::58::vds::(sigtermHandler) Received signal 15
MainThread::DEBUG::2015-01-08 11:06:59,340::protocoldetector::144::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
MainThread::INFO::2015-01-08 11:06:59,341::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
Detector thread::DEBUG::2015-01-08 11:06:59,342::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
MainThread::INFO::2015-01-08 11:06:59,349::vmchannels::188::vds::(stop) VM channels listener was stopped.
MainThread::INFO::2015-01-08 11:06:59,349::momIF::91::MOM::(stop) Shutting down MOM

Do you know why it was stopped? I can see that the connection was recovered at 2015-01-08 11:07:20,083.
Hi Oved,
I just create a VM (don't forget this happens randomly) and that is it.

Hi Piotr,
I have no idea what causes this; it occurs randomly and I don't perform any action that could cause it.
According to the log VDSM was shut down. The engine handled that as expected. Closing this one as NOTABUG. Please check who restarted VDSM and why. The engine recovered properly after VDSM was up again.
If the engine is indeed stuck as you mentioned, please contact us immediately (me and Piotr) with the connection details, to examine the environment. We can't fix such an issue without a reproducer. The logs seem okay.
Oved,
Please don't close this bug. You say "The engine handled that as expected", but no one stopped VDSM.
Can you try to reproduce and give us access to the environment? In the meantime moving to 3.5.Z assuming / hoping we will have such an environment.
Raz - in addition I see you have a NullPointerException in AddVmCommand. Could you open another bug on that?
Hi Oved,
I guess the NullPointerException is either the cause of this bug or a result of it.
Can you please check with the latest code whether this bug still occurs?
Didn't see this in vt13.9
For the record, I saw this "Heartbeat exeeded" problem with Engine (ovirt-engine-3.6.0-0.0.master.20150510172243.git48679b7) and VDSM (alpha-1):

2015-06-22 03:08:40,241 ERROR [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler_Worker-99) [4d8ae5c5] Command 'GlusterVolumesListVDSCommand(HostName = H7, GlusterVolumesListVDSParameters:{runAsync='true', hostId='dbe59c2e-5402-4acd-969a-bb3d700b1d70'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded

Maybe it is caused by network problems (e.g. an overloaded network) and it doesn't recover until vdsmd is restarted.
Can you attach your engine and host logs?
Not easy to reproduce. Please leave it closed. I will reopen and provide logs when I have cleaner ones.
Hi, I have news. It happened again on 3.6 alpha-2, but I'm not 100% sure the Heartbeat errors are directly related:

- Everything was working fine *for days*
- I shut down the VM from inside the guest
- The VM was shut down in Engine
- I changed the VM's virtual CPUs from 1 to 2 (this was not possible while the VM was up... probably another bug or UX issue)
- Then I found the Host down in Engine (in "Connecting..." state)
- I restarted vdsmd (twice), but Engine was still unable to connect to the Host
- I restarted Engine and the Host recovered (green)!!! <--- this means that Engine communication with one Host got dizzy until Engine was restarted (other hosts stayed green all the time)
- After reconnecting the Host, its Master Data Storage (gluster) was down (red)
- I had to "Activate" it (is it really necessary to manually reactivate?)
- I was finally able to restart the VM

During all these steps another VM continued running without problems on the same host and storage. This means that the host and storage worked fine all the time. The problem seems to be in Engine. Next time, I will try to only restart Engine, without restarting vdsmd.

We have to fix this for the 3.6 release. Please reopen and set to high priority, or let me know if you prefer a separate BZ.
Do you have engine and vdsm logs?
Created attachment 1055631 [details]
Engine log

On line 10950 the VM was shut down and problems started.
Engine messages: http://pastebin.com/ksDP8VT0
Looks similar to: https://bugzilla.redhat.com/show_bug.cgi?id=1228171
Created attachment 1055634 [details]
VDSM logs
Summary:

Engine messages: http://pastebin.com/ksDP8VT0
Engine Logs: https://bugzilla.redhat.com/attachment.cgi?id=1055631
VDSM Logs: https://bugzilla.redhat.com/attachment.cgi?id=1055634

- Everything was working fine *for days*
- I shut down the VM from inside the guest
- The VM was shut down in Engine
- I changed the VM's virtual CPUs from 1 to 2
- Then I found the Host down in Engine (in "Connecting..." state)
- I restarted vdsmd (twice), but Engine was still unable to connect to the Host
- I restarted Engine and the Host recovered (green)!!! <--- this means that Engine communication with one Host got dizzy until Engine was restarted (other hosts stayed green all the time)
- After reconnecting the Host, its Master Data Storage (gluster) was down (red)
- I activated the storage
- I successfully restarted the VM
I was still seeing the Host continuously going "Non Responsive" and found out there is abnormally high latency between Engine and the host at the moment (800 to 1000 ms ping). Maybe the connection between Engine and the host simply doesn't tolerate high latency. What's the timeout? Can it be increased?
I can confirm the issue you are seeing with very high latency. By default we expect to receive a network frame within 10 seconds, or we assume that there is an issue with the connection and attempt to reconnect. You can tune the heartbeat timeout by changing vdsHeartbeatInSeconds in the engine config.
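The heartbeat described above can be sketched as a plain receive timeout: if no frame arrives within the interval, the client declares the connection dead. This is a minimal illustration, not the actual ovirt-engine code; the 0.5 s interval is scaled down from the real 10 s default so the demo runs quickly.

```python
import socket
import threading
import time

HEARTBEAT = 0.5  # seconds; the engine default is 10 (vdsHeartbeatInSeconds)

# A peer that accepts the connection but never sends a frame,
# simulating a host whose replies are delayed past the heartbeat.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def silent_peer():
    conn, _ = server.accept()      # accept, then stay silent
    time.sleep(2 * HEARTBEAT)
    conn.close()

threading.Thread(target=silent_peer, daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
client.settimeout(HEARTBEAT)       # silence longer than this counts as failure
try:
    client.recv(4096)
    status = "frame received"
except socket.timeout:
    # this is the condition the engine reports as
    # "VDSNetworkException: Heartbeat exeeded"
    status = "heartbeat exceeded"
client.close()
print(status)
```

With an ~1 s round trip on a saturated link, several consecutive frames can easily arrive late, which matches the random "Non Responsive" flapping reported here.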
Host is "Non Responsive" but its VMs are up and stats are being updated on Engine. It seems like 1) the Engine-host communication protocol is timing out too fast and 2) it's not recovering. You will probably be able to reproduce this issue by testing the Engine-Host connection over a high-latency link. I will be checking whether the host recovers without restarting Engine.

---

Ok, I'll increase vdsHeartbeatInSeconds, but please note that the latency is bad (around 1 [s]) but not that bad (10 [s])... and that the Host doesn't seem to recover without restarting Engine. I'm entering IRC (I'm kripper)
I found this related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1201779

1) engine-config -g vdsHeartbeatInSeconds
   Error fetching vdsHeartbeatInSeconds value: no such entry. Please verify key name and property file support.
2) I didn't have a "vdsHeartbeatInSeconds.type=Integer" entry in my /etc/ovirt-engine/engine-config/engine-config.properties, so I added it.
3) Now I get:
   engine-config -g vdsHeartbeatInSeconds
   vdsHeartbeatInSeconds: 10 version: general
4) I restarted Engine, but the Host still didn't recover
5) I restarted vdsmd and the Host recovered.

I will check whether the problem persists with the new vdsHeartbeatInSeconds config. My engine was probably set up using master, so all we have to do is check whether we are adding the vdsHeartbeatInSeconds entry when upgrading from previous versions. I saw many related BZ reports. Maybe it would be sensible to validate important configuration entries like this one, since I'm probably not the only one missing this config entry.

Thanks Piotr. I believe we are done.
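The fix in step 2 (adding the missing type declaration) could be automated along these lines. This is a hedged sketch: the key name and the real file path /etc/ovirt-engine/engine-config/engine-config.properties are taken from this thread, but the script deliberately works on a temporary file, and ensure_entry is a hypothetical helper, not part of any oVirt tool.

```python
import tempfile
from pathlib import Path

# The entry that was missing from engine-config.properties in step 2.
ENTRY = "vdsHeartbeatInSeconds.type=Integer"

def ensure_entry(path: Path, entry: str = ENTRY) -> bool:
    """Append `entry` to the properties file if absent.

    Returns True when the file was changed, False when the entry
    was already there (so repeated runs are idempotent).
    """
    lines = path.read_text().splitlines() if path.exists() else []
    if entry in lines:
        return False
    lines.append(entry)
    path.write_text("\n".join(lines) + "\n")
    return True

# Demo on a throwaway copy, never on the live engine config:
props = Path(tempfile.mkdtemp()) / "engine-config.properties"
props.write_text("SomeOtherKey.type=String\n")  # pre-existing content
print(ensure_entry(props))   # first run adds the entry
print(ensure_entry(props))   # second run is a no-op
```

An upgrade-time check like this would catch installations (such as the one in this report) where the property declaration is missing, before engine-config fails with "no such entry".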
Please keep in mind that setting the heartbeat too high will influence host life cycle logic. You need to balance the value for your environment: keep it as small as possible while still keeping the connection open.
Ok. The default 10 [s] is fine IMO. The issues are that:

1) The entry was not defined, and I suspect it is treated as less than 10 in this case
2) The Host seems unable to recover even when the latency is fixed

For replicating 2), I would try using a very low heartbeat to simulate a failure and see if it always recovers.
Just for the record, 'vdsHeartbeatInSeconds' must be set to a safe value depending on your network latency (a value that will never be exceeded). Too low a value will make the Data Center unstable because of false positives (a VM will be migrated without a reason), while too high a value will delay the migration of VMs when a host actually goes down.

Going further, you will probably have problems when one Engine controls hosts on different networks (one local and one overseas), since apparently the same vdsHeartbeatInSeconds value is used for all hosts.

Besides, for a saturated network with 1 [s] latency, vdsHeartbeatInSeconds = 10 was not enough, so it seems the value must currently be guessed by trial and error. Maybe Engine should learn reasonable delays per host and detect when there is a clear outlier.

For this overseas setup, I'm going with vdsHeartbeatInSeconds = 300 to avoid problems for the moment.