Bug 1180079 - JSON-RPC | engine gets stuck randomly | ERROR (ResponseWorker) Exception during connection
Summary: JSON-RPC | engine gets stuck randomly | ERROR (ResponseWorker) Exception during connection
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.1
Assignee: Piotr Kliczewski
QA Contact:
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2015-01-08 10:04 UTC by Raz Tamir
Modified: 2016-02-10 19:15 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-22 07:29:17 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
logs (2.46 MB, application/x-bzip), 2015-01-08 10:04 UTC, Raz Tamir
Engine log (6.88 MB, text/plain), 2015-07-24 06:48 UTC, Christopher Pereira
VDSM logs (542.25 KB, application/x-rar), 2015-07-24 06:57 UTC, Christopher Pereira

Description Raz Tamir 2015-01-08 10:04:11 UTC
Created attachment 977699 [details]
logs

Description of problem:
Our automation jobs fail randomly with:
2015-01-08 11:07:08,085 WARN  [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing
2015-01-08 11:07:11,031 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) Connecting to /10.35.160.35
2015-01-08 11:07:11,032 WARN  [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) Retry failed
2015-01-08 11:07:11,032 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler_Worker-77) Exception during connection
2015-01-08 11:07:11,034 WARN  [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) Exception thrown during message processing



Version-Release number of selected component (if applicable):
rhevm-3.5.0-0.26.el6ev.noarch

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Raz Tamir 2015-01-08 10:05:53 UTC
Update:
Also occur on rhevm-3.5.0-0.28.el6ev.noarch

Comment 2 Oved Ourfali 2015-01-08 11:44:05 UTC
What test are you running when you're getting these errors?
Any influence on the engine / hosts when this happens?
It seems like a failure to connect to the host.

Comment 3 Piotr Kliczewski 2015-01-08 11:50:52 UTC
The above log entries are only side effects. Looking at the engine log, I can see that we lost the connection with a host:

2015-01-08 11:07:07,988 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-35) Command GetStatsVDSCommand(HostName = host_mixed_1, HostId = e4b6edae-aab4-4d58-bd8f-ac82db3806fe, vds=Host[host_mixed_1,e4b6edae-aab4-4d58-bd8f-ac82db3806fe]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2015-01-08 11:07:07,990 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-57) Command SpmStatusVDSCommand(HostName = host_mixed_1, HostId = e4b6edae-aab4-4d58-bd8f-ac82db3806fe, storagePoolId = 2d820c0e-f472-463d-b0d4-5f26b81a873d) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded

Looking at the vdsm logs I can see that jsonrpc server was stopped:

MainThread::DEBUG::2015-01-08 11:06:59,340::vdsm::58::vds::(sigtermHandler) Received signal 15
MainThread::DEBUG::2015-01-08 11:06:59,340::protocoldetector::144::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
MainThread::INFO::2015-01-08 11:06:59,341::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
Detector thread::DEBUG::2015-01-08 11:06:59,342::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
MainThread::INFO::2015-01-08 11:06:59,349::vmchannels::188::vds::(stop) VM channels listener was stopped.
MainThread::INFO::2015-01-08 11:06:59,349::momIF::91::MOM::(stop) Shutting down MOM

Do you know the reason it was stopped? I can see that the connection was recovered at 2015-01-08 11:07:20,083.

Comment 4 Raz Tamir 2015-01-08 11:58:21 UTC
Hi Oved, just creating a VM (don't forget this happens randomly), and that is it.

Hi Piotr, I have no idea what causes this, but it occurs randomly and I don't perform any action that could cause it.

Comment 5 Oved Ourfali 2015-01-08 12:13:23 UTC
According to the log VDSM was shut down.
The engine handled that as expected.

Closing this one as NOTABUG.
Please check who restarted VDSM and why. The engine recovered properly after VDSM was up again.
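
One way to check is in the host's own logs (a sketch; the exact service manager and log paths depend on the host OS, and the time window below is just taken from this report):

# systemd-based hosts: show vdsmd service events around the incident
journalctl -u vdsmd --since "2015-01-08 11:00" --until "2015-01-08 11:15"

# init-based hosts: look for the service restart and the SIGTERM handler
grep vdsmd /var/log/messages
grep "Received signal" /var/log/vdsm/vdsm.log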

Comment 6 Oved Ourfali 2015-01-08 12:22:17 UTC
If the engine is indeed stuck as you mentioned, please contact us immediately (me and Piotr) with the connection details, to examine the environment. We can't fix such an issue without a reproducer. The logs seem okay.

Comment 7 Raz Tamir 2015-01-08 12:54:21 UTC
Oved,
Please don't close this bug.
"The engine handled that as expected." but no one stops vdsm

Comment 8 Oved Ourfali 2015-01-08 12:57:12 UTC
Can you try to reproduce and give us access to the environment?
In the meantime moving to 3.5.Z assuming / hoping we will have such an environment.

Comment 9 Oved Ourfali 2015-01-08 13:14:06 UTC
Raz - in addition, I see you have a NullPointerException in AddVmCommand. Could you open another bug on that?

Comment 11 Raz Tamir 2015-01-11 09:03:38 UTC
Hi Oved,
The NullPointerException is either the cause of this bug, I guess, or a result of it.

Comment 14 Piotr Kliczewski 2015-02-04 12:05:48 UTC
Can you please check with the latest code whether this bug still occurs?

Comment 15 Raz Tamir 2015-02-04 12:12:06 UTC
Didn't see this in vt13.9

Comment 16 Christopher Pereira 2015-06-26 23:46:04 UTC
For the record,

I saw this "Heartbeat exeeded" problem with Engine (ovirt-engine-3.6.0-0.0.master.20150510172243.git48679b7) and VDSM (alpha-1):

2015-06-22 03:08:40,241 ERROR [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler_Worker-99) [4d8ae5c5] Command 'GlusterVolumesListVDSCommand(HostName = H7, GlusterVolumesListVDSParameters:{runAsync='true', hostId='dbe59c2e-5402-4acd-969a-bb3d700b1d70'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded

Maybe it is caused by network problems (e.g. an overloaded network) and it doesn't recover until vdsmd is restarted.

Comment 17 Oved Ourfali 2015-06-27 05:49:08 UTC
Can you attach your engine and host logs?

Comment 18 Christopher Pereira 2015-07-07 21:20:26 UTC
Not easy to reproduce.
Please leave it closed.
I will reopen and provide logs when I have cleaner ones.

Comment 19 Christopher Pereira 2015-07-24 03:14:04 UTC
Hi, I have news.

It happened again on 3.6 alpha-2, but I'm not 100% sure if the Heartbeat errors are directly related:

- Everything was working fine *for days*
- I shut down the VM from inside the guest
- VM was shut down in Engine
- I changed the VM's virtual CPUs from 1 to 2 (this was not possible while the VM was up... probably another bug or UX issue).
- Then I found the Host down in Engine (in "Connecting..." state)
- I restarted vdsmd (twice), but the Engine was still unable to connect to the Host
- I restarted the Engine and the Host recovered (green)!!! <--- this means that Engine communication with one Host got stuck until the Engine was restarted (other hosts stayed green all the time).

- After reconnecting the Host, its Master Data Storage (gluster) was down (red)
- I had to "Activate" it (is it really necessary to manually reactivate?)
- I was finally able to restart the VM

During all these steps another VM continued running without problems on the same host and storage. This means that the host and storage worked fine all the time. The problem seems to be in the Engine.

Next time, I will try to only restart the Engine, without restarting vdsmd.

We have to fix this for the 3.6 release.
Please reopen and set to high priority, or let me know if you prefer a separate BZ.

Comment 20 Piotr Kliczewski 2015-07-24 06:42:35 UTC
Do you have engine and vdsm logs?

Comment 21 Christopher Pereira 2015-07-24 06:48:05 UTC
Created attachment 1055631 [details]
Engine log

On line 10950 the VM was shut down and the problems started.

Comment 22 Christopher Pereira 2015-07-24 06:49:26 UTC
Engine messages: http://pastebin.com/ksDP8VT0

Comment 23 Christopher Pereira 2015-07-24 06:50:54 UTC
Looks similar to: https://bugzilla.redhat.com/show_bug.cgi?id=1228171

Comment 24 Christopher Pereira 2015-07-24 06:57:53 UTC
Created attachment 1055634 [details]
VDSM logs

Comment 25 Christopher Pereira 2015-07-24 07:01:12 UTC
Summary:

Engine messages: http://pastebin.com/ksDP8VT0
Engine Logs: https://bugzilla.redhat.com/attachment.cgi?id=1055631
VDSM Logs: https://bugzilla.redhat.com/attachment.cgi?id=1055634

- Everything was working fine *for days*
- I shut down the VM from inside the guest
- VM was shut down in Engine
- I changed the VM's virtual CPUs from 1 to 2
- Then I found the Host down in Engine (in "Connecting..." state)
- I restarted vdsmd (twice), but the Engine was still unable to connect to the Host
- I restarted the Engine and the Host recovered (green)!!! <--- this means that Engine communication with one Host got stuck until the Engine was restarted (other hosts stayed green all the time).

- After reconnecting the Host, its Master Data Storage (gluster) was down (red)
- I activated the storage
- I successfully restarted the VM

Comment 26 Christopher Pereira 2015-07-24 07:10:51 UTC
I was still seeing the Host continuously becoming "Non Responsive" and found out there is abnormally high latency between the Engine and the host at the moment (800 to 1000 ms ping).

Maybe it's just that the connection between the Engine and the host doesn't tolerate high latency.
What's the timeout? Can it be increased?

Comment 27 Piotr Kliczewski 2015-07-24 07:19:32 UTC
I can see confirmation of the issue you are seeing with the very high latency. By default we expect to receive a network frame within 10 seconds, or we assume that there is an issue with the connection and we attempt to reconnect.

You can tune the heartbeat timeout by changing vdsHeartbeatInSeconds in the engine config.
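
For example (a sketch assuming a standard engine install; 30 is just an illustrative value, and the change only takes effect after an engine restart):

# show the current value
engine-config -g vdsHeartbeatInSeconds
# set a higher value, e.g. 30 seconds
engine-config -s vdsHeartbeatInSeconds=30
# restart the engine so the new value is picked up
service ovirt-engine restart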

Comment 28 Christopher Pereira 2015-07-24 07:24:07 UTC
The Host is "Non Responsive", but its VMs are up and stats are being updated in the Engine.
It seems like 1) the Engine-host communication protocol is timing out too fast and 2) it's not recovering.

You will probably be able to reproduce this issue by testing the Engine-Host connection over a high-latency link.
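
(A sketch of one way to simulate that with netem on the engine machine; eth0 is just an example interface name:)

# add ~800 ms of artificial delay towards the hosts
tc qdisc add dev eth0 root netem delay 800ms
# remove it again when done
tc qdisc del dev eth0 root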

I will be checking if the host recovers without restarting engine.

---

Ok, I'll increase vdsHeartbeatInSeconds, but please note that the latency is bad (around 1 [s]), but not too bad (not 10 [s])... and that the Host doesn't seem to recover without restarting the Engine.

I'm entering IRC (I'm kripper)

Comment 29 Christopher Pereira 2015-07-24 07:44:32 UTC
I found this related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1201779

1) engine-config -g vdsHeartbeatInSeconds
Error fetching vdsHeartbeatInSeconds value: no such entry. Please verify key name and property file support.

2) I didn't have a "vdsHeartbeatInSeconds.type=Integer" entry in my /etc/ovirt-engine/engine-config/engine-config.properties, so I added it (see the sketch after this list).

3) Now I get:
engine-config -g vdsHeartbeatInSeconds
vdsHeartbeatInSeconds: 10 version: general

4) I restarted Engine, but the Host still didn't recover

5) I restarted vdsmd and the Host recovered. I will be checking if the problem persists with the new 'vdsHeartbeatInSeconds' config.
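
For reference, a sketch of the entry I mean (path taken from my install; adjust if your layout differs):

# appended to /etc/ovirt-engine/engine-config/engine-config.properties
vdsHeartbeatInSeconds.type=Integer

With that in place, 'engine-config -g vdsHeartbeatInSeconds' reports the value and 'engine-config -s vdsHeartbeatInSeconds=<seconds>' can change it.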

My engine was probably set up using master, so all we have to do is check whether we are adding the 'vdsHeartbeatInSeconds' entry when upgrading from previous versions.

I saw many related BZ reports. Maybe it would be sane to check for important configurations like this one, since I'm probably not the only one missing this config entry.

Thanks Piotr. I believe we are done.

Comment 30 Piotr Kliczewski 2015-07-24 07:47:58 UTC
Please keep in mind that having too big a heartbeat will influence the host life-cycle logic. You need to balance the value for your environment, keeping it as small as possible while still keeping the connection open.

Comment 31 Christopher Pereira 2015-07-24 07:56:11 UTC
Ok. The default 10 [s] is fine IMO.
The issues are that:

1) The entry was not defined, and I suspect it is treated as less than 10 in this case

2) Host seems to be unable to recover even when latency is fixed.

To replicate 2), I would try to use a very low heartbeat to simulate a failure and see if it always recovers.

Comment 32 Christopher Pereira 2015-07-24 09:01:51 UTC
Just for the record, 'vdsHeartbeatInSeconds' must be set to a safe value depending on your network latency (a value that will never be exceeded).
Values that are too low will make the Data Center unstable because of false positives (VMs will be migrated without reason), while values that are too high will delay VM migration when a host actually goes down.

Going further, you will probably have problems when one Engine controls hosts on different networks (one local and one overseas), since apparently the same vdsHeartbeatInSeconds value is used for all hosts.

Besides, for a saturated network with 1 [s] latency, vdsHeartbeatInSeconds = 10 was not enough, so it seems like the value must currently be guessed by trial and error.

Maybe Engine should learn reasonable delays per host and detect when there is a clear outlier.

For this overseas setup, I'm going for vdsHeartbeatInSeconds = 300 to avoid problems at the moment.
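
A rough way to pick a starting value (a sketch; the address is just an example from earlier in this report):

# measure RTT towards the host for about a minute and look at the max
ping -c 60 10.35.160.35 | tail -n 1
# the last line ends with "rtt min/avg/max/mdev = ..."; choose vdsHeartbeatInSeconds comfortably above the worst case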

