Bug 1201779 - engine periodically logs network errors to nodes
Summary: engine periodically logs network errors to nodes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.6.0
Assignee: Piotr Kliczewski
QA Contact: Pavel Stehlik
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2015-03-13 13:35 UTC by Chris Adams
Modified: 2016-02-10 19:34 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-06-08 12:37:22 UTC
oVirt Team: Infra


Attachments
engine log (124.24 KB, application/x-xz)
2015-04-07 13:55 UTC, Chris Adams
vdsm log from node3 (765.69 KB, application/x-xz)
2015-04-07 13:55 UTC, Chris Adams
vdsm log from node9 (785.82 KB, application/x-xz)
2015-04-07 13:56 UTC, Chris Adams

Description Chris Adams 2015-03-13 13:35:42 UTC
I am periodically getting errors from the engine about nodes "not responding" in the web UI.  There doesn't appear to be any network issue: all the VMs stay up, and the nodes come back quickly.  It doesn't appear to affect any particular node or follow any repeating time pattern (it can happen several times in a few hours, or go a couple of days between occurrences).

My setup is oVirt 3.5.1: hosted engine on CentOS 6, nodes running CentOS 7, all up to date as of the 3.5.1 release.  There are 7 nodes (2 of which run the hosted engine HA, the other 5 are "plain"), and each node has 2 gigabit NICs connected.  There are 2 switches in a logical stack; each node has a link to each switch in a bond running LACP, with an 802.1q VLAN trunk on top of the bond.  Each node also has a separate (non-bond, non-trunk) IPMI/DRAC network connection for power management.

Here are a couple of engine.log entries (please let me know what logs might help with this).  Note that the second one is trying to access node9, the node currently hosting the engine, so that traffic should never have left the physical box, which would rule out any switch/etc. problem.

2015-03-10 04:42:23,310 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-40) [75b9e6d9] Command ListVDSCommand(HostName = node5, HostId = 8dfd0195-f386-4e16-9379-a5287221d5bd, vds=Host[node5,8dfd0195-f386-4e16-9379-a5287221d5bd]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded

2015-03-12 18:35:40,094 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-45) [24b50a69] Command GetStatsVDSCommand(HostName = node9, HostId = 869be57d-4333-4d21-93a6-f440de32ec4c, vds=Host[node9,869be57d-4333-4d21-93a6-f440de32ec4c]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

Comment 1 gabicr 2015-03-16 09:46:02 UTC
Same here!

3.5.1, not hosted engine


2015-03-16 11:39:02,250 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-98) Failure to refresh Vds runtime info: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:183) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand.executeVdsBrokerCommand(ListVDSCommand.java:24) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:96) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:418) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.fetchRunningVms(VdsUpdateRunTimeInfo.java:991) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVmStats(VdsUpdateRunTimeInfo.java:940) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:658) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) [:1.7.0_75]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_75]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

Comment 2 gabicr 2015-03-16 09:47:27 UTC
One more observation: it seems to happen mainly on the SPM node!

Comment 3 Pedro Silva 2015-03-16 11:03:37 UTC
I can also confirm this possible bug.

oVirt 3.5.1 Hosted Engine with 3 Fedora 20 nodes, also with 2 LACP-bonded NICs each for the ovirtmgmt network. The VM networks are on other NICs. All nodes are also gluster nodes.

No "heartbeat exceeded" warnings though. Instead:

2015-03-14 03:14:50,079 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (org.ovirt.thread.pool-8-thread-11) Could not get Host statistics for Host jazz, Error is org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)

Comment 4 Piotr Kliczewski 2015-03-22 19:53:55 UTC
Usually "heartbeat exceeded" hints at network issues. You could try to increase the heartbeat interval (10 seconds by default).

When we see a message timeout, it could mean that the operation took longer than expected (3 minutes).

Please provide logs from both ends for analysis.
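
For reference, a minimal sketch of how the heartbeat interval is typically inspected and raised on the engine host, assuming the standard engine-config tool (the value 20 is illustrative; the flag name vdsHeartbeatInSeconds is confirmed later in this bug):

    engine-config -g vdsHeartbeatInSeconds      # show the current value (default: 10)
    engine-config -s vdsHeartbeatInSeconds=20   # example: double the default interval
    service ovirt-engine restart                # restart the engine to pick up the change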

Comment 5 Lior Vernia 2015-04-02 12:41:27 UTC
Both Sahina and Dima talked to me today about this bug... It doesn't seem to be that uncommon.

Comment 6 Oved Ourfali 2015-04-02 12:49:12 UTC
(In reply to Lior Vernia from comment #5)
> Both Sahina and Dima talked to me today about this bug... It seems to be not
> that uncommon.

Sahina - if it happens all the time, can you try to increase the heartbeat interval? (The flag name is vdsHeartbeatInSeconds.)

Comment 7 Piotr Kliczewski 2015-04-03 14:28:30 UTC
Lior, this issue is not related to what Sahina and Dima saw. Please take a look at BZ #1208525.

This bug seems to be a networking issue. Please provide steps to reproduce if you think it is not a networking issue.

Comment 8 Chris Adams 2015-04-03 15:14:32 UTC
Sorry for the delayed response; I have been away and then busy with projects.

What logs specifically are you looking for?

Comment 9 Lior Vernia 2015-04-05 08:26:17 UTC
Nope, just thought it was related, haven't run into either of them myself.

Comment 10 Piotr Kliczewski 2015-04-07 06:51:52 UTC
Please provide vdsm and engine logs.

Comment 11 Chris Adams 2015-04-07 13:55:34 UTC
Created attachment 1011762 [details]
engine log

Comment 12 Chris Adams 2015-04-07 13:55:52 UTC
Created attachment 1011763 [details]
vdsm log from node3

Comment 13 Chris Adams 2015-04-07 13:56:09 UTC
Created attachment 1011764 [details]
vdsm log from node9

Comment 14 Chris Adams 2015-04-07 13:57:02 UTC
I've added the engine log, along with vdsm logs from a couple of nodes from the times they had errors logged this morning.  Please let me know anything else I can do to help track this down.

Comment 15 Piotr Kliczewski 2015-04-10 08:42:33 UTC
Looking at the engine logs, I found that we are missing heartbeats and the connection to the host is then restored. Looking at the node9 vdsm logs, I can see very frequent local connections (xmlrpc). The interval between those connections is less than a second most of the time, sometimes more.

Do you know what could cause this number of local connections (probably vdsClient)?

I suggest increasing the heartbeat interval, which would reduce the probability of reconnection on the engine side.

Comment 16 Chris Adams 2015-04-10 12:56:50 UTC
I don't know what is causing the local connections; I haven't changed anything from the installed defaults, AFAIK.  node9 is running the hosted engine HA agent/broker; could that be the cause?  That wouldn't explain all my connection issues, though, because only node8 and node9 are running that (are the same connections logged in the node3 log?).

How can I tell what is making the xmlrpc connections?

Comment 17 Piotr Kliczewski 2015-04-10 12:59:42 UTC
I haven't seen xmlrpc calls on node3.

I would try to find the process which is connecting to vdsm.
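
For reference, a minimal sketch of how one might identify the connecting process on a node, assuming vdsm is listening on its default TCP port 54321 and the ss and lsof tools are installed (run as root so process names are visible):

    ss -tnp | grep 54321     # established TCP connections on the vdsm port, with owning PIDs
    lsof -nP -iTCP:54321     # alternative view of the same sockets, one line per endpoint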

Comment 18 Chris Adams 2015-04-10 13:40:22 UTC
It does appear that ovirt-ha-broker makes a lot of connections to vdsm (averaging a little over one per second, although they appear "clumped" around a roughly 5 second interval).

However, since that doesn't happen on node3 (or node1-5 in my setup), I don't think that's the source of the problem.  All of the nodes experience the communication errors.

I have raised the heartbeat interval to 20 seconds; I'll see what impact that has (although that seems to me like it probably just works around the real underlying issue).

Comment 19 Chris Adams 2015-04-13 14:09:09 UTC
Raising the timeout had no effect; I'm still getting timeout errors.

Comment 20 Piotr Kliczewski 2015-04-16 08:47:12 UTC
Please confirm that you changed the vdsHeartbeatInSeconds value.

Comment 21 Chris Adams 2015-04-16 12:55:13 UTC
[root@engine ~]# engine-config -g vdsHeartbeatInSeconds
vdsHeartbeatInSeconds: 20 version: general

Comment 22 Piotr Kliczewski 2015-04-17 11:33:35 UTC
We will try to reproduce your issue. Can you please provide information about your environment?

Comment 23 Chris Adams 2015-04-17 13:53:12 UTC
I have Dell PowerEdge servers (5x R610 and 2x R420), all running CentOS 7, specifically 7.0 with updates through the end of January (I'm adding RAM to the nodes next week and plan to update to 7.1 then).  I'm running the hosted engine on CentOS 6.6, with the 2x R420 servers configured for that.

Each of the servers has two connected NICs, connected to a pair of Dell PowerConnect 6248 switches in a stack (so one logical switch).  The links are in a LAG running LACP, configured in oVirt as "mode=802.3ad xmit_hash_policy=layer2+3 lacp_rate=1".  That carries a VLAN trunk for all the network access.
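
As an aside, a quick sketch of how the LACP state of such a bond can be checked on a node (the device name bond0 is hypothetical; the actual bond name may differ):

    cat /proc/net/bonding/bond0   # shows bonding mode, LACP partner info, and per-slave link status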

Neither the switch nor the servers have logged any errors on the NICs (transmit errors, LACP dropouts, etc.).  I also ran fping at a 0.2 second interval for, IIRC, about 12 hours against all the nodes and the hosted engine, and got zero drops (but did see an oVirt communication issue during that time).  I'm pretty confident that the network is good (so I really hope I'm not wasting your time with a wild goose chase on this).
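
For reference, a sketch of the kind of fping run described above (the hostnames are placeholders; -l loops indefinitely, -p sets the per-target period in milliseconds, and -Q prints a loss summary at the given interval in seconds):

    fping -l -p 200 -Q 60 node1 node2 node3 engine-vm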

Storage for both the hosted engine and the VMs is on iSCSI (separate LUNs).  There is also an external NFS server, but that's just used for ISOs.

I know updating to CentOS 7.1 will bring a major libvirt update; I don't know if that's potentially a factor in this.

Let me know if there's any other info you need.  Thanks.

Comment 24 Chris Adams 2015-04-28 18:49:42 UTC
So, an update: I did maintenance on my nodes last Monday (Apr 20), adding RAM (to prepare for more VMs, not to fix a shortage), and upgraded the nodes to the latest CentOS (from 7.0 plus some updates to 7.1 with current updates), still on oVirt 3.5.1.  Since that time, I have not had a node communication error logged.

I'm not 100% sure this is solved, because I could go several days between errors before, but I don't think I ever went a week.

Comment 25 Oved Ourfali 2015-06-08 12:37:22 UTC
Closing based on the last comment.
If still relevant, please re-open.

