Bug 1395625 - [downstream clone - 4.0.6] Servers' state isn't stable and they change to non-responsive every few minutes if one host in the DC is non-responsive with the engine
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm-jsonrpc-java
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ovirt-4.0.6
Assignee: Piotr Kliczewski
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On: 1393714
Blocks:
 
Reported: 2016-11-16 10:11 UTC by rhev-integ
Modified: 2017-01-23 13:15 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1393714
Environment:
Last Closed: 2017-01-23 13:15:39 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 66799 0 master MERGED Closing channel could take too much reactor's time 2020-10-23 04:24:43 UTC
oVirt gerrit 66838 0 ovirt-4.0 MERGED Closing channel could take too much reactor's time 2020-10-23 04:24:43 UTC

Comment 1 rhev-integ 2016-11-16 10:11:38 UTC
Created attachment 1219244 [details]
engine logs

Description of problem:
The servers' state isn't stable; they change state to non-responsive every few minutes.

The servers on the latest master are moving between non-responsive and up every few minutes.

2016-11-10 10:16:47,116 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Internal server error: null
2016-11-10 10:16:47,116 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Internal server error: null
2016-11-10 10:16:47,128 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler3) [31d4820c] Command 'GetAllVmStatsVDSCommand(HostName = orchid-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='a8ac96b7-bce5-4039-b4ee-e608f34ceac7'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2016-11-10 10:16:47,128 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler3) [31d4820c] Failed to fetch vms info for host 'orchid-vds1.qa.lab.tlv.redhat.com' - skipping VMs monitoring.
2016-11-10 10:16:47,154 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler2) [7e87bdde] Command 'GetAllVmStatsVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='cd60d5ed-3b0e-46e5-be1a-272ed6516c45'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2016-11-10 10:16:47,154 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler2) [7e87bdde] Failed to fetch vms info for host 'navy-vds1.qa.lab.tlv.redhat.com' - skipping VMs monitoring.
2016-11-10 10:16:47,157 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler7) [7e920fc] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM orchid-vds1.qa.lab.tlv.redhat.com command failed: Heartbeat exceeded
2016-11-10 10:16:47,158 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler7) [7e920fc] Command 'GetStatsVDSCommand(HostName = orchid-vds1.qa.lab.tlv.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='a8ac96b7-bce5-4039-b4ee-e608f34ceac7', vds='Host[orchid-vds1.qa.lab.tlv.redhat.com,a8ac96b7-bce5-4039-b4ee-e608f34ceac7]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2016-11-10 10:16:47,158 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler7) [7e920fc] Failed getting vds stats, host='orchid-vds1.qa.lab.tlv.redhat.com'(a8ac96b7-bce5-4039-b4ee-e608f34ceac7): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2016-11-10 10:16:47,158 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler7) [7e920fc] Failure to refresh host 'orchid-vds1.qa.lab.tlv.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2016-11-10 10:16:47,158 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler7) [7e920fc] Failed to refresh VDS, network error, continuing, vds='orchid-vds1.qa.lab.tlv.redhat.com'(a8ac96b7-bce5-4039-b4ee-e608f34ceac7): VDSGenericException: VDSNetworkException: Heartbeat exceeded
2016-11-10 10:16:47,159 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-17) [31d4820c] Host 'orchid-vds1.qa.lab.tlv.redhat.com' is not responding.
2016-11-10 10:16:47,164 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [33fad06d] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM orchid-vds1.qa.lab.tlv.redhat.com command failed: Heartbeat exceeded
2016-11-10 10:16:47,165 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler10) [33fad06d] Command 'SpmStatusVDSCommand(HostName = orchid-vds1.qa.lab.tlv.redhat.com, SpmStatusVDSCommandParameters:{runAsync='true', hostId='a8ac96b7-bce5-4039-b4ee-e608f34ceac7', storagePoolId='ea5798e5-47b6-4e81-8005-cceffdddcd5b'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2016-11-10 10:16:47,180 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [db08d74] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM navy-vds1.qa.lab.tlv.redhat.com command failed: Heartbeat exceeded
2016-11-10 10:16:47,181 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler8) [db08d74] Command 'GetStatsVDSCommand(HostName = navy-vds1.qa.lab.tlv.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='cd60d5ed-3b0e-46e5-be1a-272ed6516c45', vds='Host[navy-vds1.qa.lab.tlv.redhat.com,cd60d5ed-3b0e-46e5-be1a-272ed6516c45]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2016-11-10 10:16:47,181 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler8) [db08d74] Failed getting vds stats, host='navy-vds1.qa.lab.tlv.redhat.com'(cd60d5ed-3b0e-46e5-be1a-272ed6516c45): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2016-11-10 10:16:47,181 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler8) [db08d74] Failure to refresh host 'navy-vds1.qa.lab.tlv.redhat.com' runtime info: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2016-11-10 10:16:47,181 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler8) [db08d74] Failed to refresh VDS, network error, continuing, vds='navy-vds1.qa.lab.tlv.redhat.com'(cd60d5ed-3b0e-46e5-be1a-272ed6516c45): VDSGenericException: VDSNetworkException: Heartbeat exceeded

2016-11-10 10:21:11,124 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to orchid-vds1.qa.lab.tlv.redhat.com/10.35.128.22
2016-11-10 10:21:11,579 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to navy-vds1.qa.lab.tlv.redhat.com/10.35.128.14
2016-11-10 10:21:11,591 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "2d5664fa-cc8e-459d-b282-623f619b6445"
2016-11-10 10:21:11,591 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "3aecd3b6-43f3-4caf-8642-486156591a97"

Version-Release number of selected component (if applicable):
4.1.0-0.0.master.20161109091313.gitedb19fb.el7.centos

How reproducible:
100%

This comment was originally posted by mburman

Comment 3 rhev-integ 2016-11-16 10:11:45 UTC
Can you attach vdsm.log as well?

This comment was originally posted by oourfali

Comment 4 rhev-integ 2016-11-16 10:11:52 UTC
Created attachment 1219247 [details]
vdsm logs

This comment was originally posted by mburman

Comment 5 rhev-integ 2016-11-16 10:11:57 UTC
Looking at the logs, I see that the issues started with:

2016-11-10 03:45:58,466 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages: Connection reset by peer

This is just after the installation of camel-vdsa.qa.lab.tlv.redhat.com. Later we see plenty of failures for other hosts due to heartbeat exceptions.

Please make sure that there are no networking issues between engine, vdsm and storage.

This comment was originally posted by pkliczew

Comment 6 rhev-integ 2016-11-16 10:12:03 UTC
Also, I see in supervdsm on navy-vds1 that the "n1" interface failed to get an IP address. Is that a required network?

This comment was originally posted by oourfali

Comment 7 rhev-integ 2016-11-16 10:12:09 UTC
I tried to connect to camel-vdsa.qa.lab.tlv.redhat.com but was unable to. It seems that adding this host caused network instability, which resulted in heartbeat failures for other hosts, and later the engine was not able to reconnect to them.

Do we know what happened to the camel host?

This comment was originally posted by pkliczew

Comment 8 rhev-integ 2016-11-16 10:12:16 UTC
We don't have network issues. I think I understand what triggers this behavior; it happens on 4.0.5 as well.
Once a host in an OVS cluster is non-responsive, this behavior starts and all the other servers in the DC (legacy cluster) begin changing state to non-responsive and back to up.

Attaching new engine logs. I initiated a reboot on an OVS host, this behavior started right away, and all hosts in the legacy cluster became non-responsive.
I think it is somehow related to OVS, but I don't understand exactly how.

This comment was originally posted by mburman

Comment 9 rhev-integ 2016-11-16 10:12:21 UTC
(In reply to Piotr Kliczewski from comment #5)
> I tried to connect to camel-vdsa.qa.lab.tlv.redhat.com but I was unable to.
> It seems that adding this host caused network instability which resulted in
> heartbeats for other hosts and later engine was not able to reconnect to
> them.
> 
> Do we know what happened to camel host?

Yes, it's an OVS host (in an OVS cluster) which I rebooted in order to reproduce this bug.

This comment was originally posted by mburman

Comment 10 rhev-integ 2016-11-16 10:12:27 UTC
Created attachment 1219253 [details]
engine log

This comment was originally posted by mburman

Comment 11 rhev-integ 2016-11-16 10:12:34 UTC
Michael,

It seems that it could be related. Please provide a tcpdump from the engine.

In my opinion, this looks like a network team issue.

This comment was originally posted by pkliczew

Comment 12 rhev-integ 2016-11-16 10:12:41 UTC
I checked the logs and it was not the installation but the host upgrade manager.

This comment was originally posted by pkliczew

Comment 13 rhev-integ 2016-11-16 10:12:47 UTC
It looks like OVS affects engine-to-VDSM communication, which needs to be investigated by the network team. A tcpdump taken on the engine host would still be nice to have.

This comment was originally posted by pkliczew

Comment 14 rhev-integ 2016-11-16 10:12:54 UTC
Created attachment 1219292 [details]
tcpdump

It's a tcpdump on the engine with one of the servers that changes state once the host in the OVS cluster goes down (src+dst).

This comment was originally posted by mburman

Comment 15 rhev-integ 2016-11-16 10:13:02 UTC
Updating this report: this issue happens with legacy servers as well. Once a server is non-responsive with the engine, all the other servers in the DC are affected,
both on master and 4.0.5.

This comment was originally posted by mburman

Comment 16 rhev-integ 2016-11-16 10:13:08 UTC
Can someone investigate this report? It is 100% reproducible on all setups and versions. And it's not network related: we tested whether there was a network issue in the communication between the engine and the other hosts in the setup once one server went into the non-responsive state, and we had no disconnections between the engine and the other hosts. It looks like the engine is mistaken/confused and wrong about its connections to the other servers in the setup.

We ran the following script on the engine against the servers in the setup:

from time import time, sleep
from vdsm import jsonrpcvdscli
import sys

# Target host defaults to localhost unless passed on the command line.
host = sys.argv[1] if len(sys.argv) > 1 else 'localhost'

# Open a JSON-RPC connection to the host's VDSM.
s = jsonrpcvdscli.connect(host=host)

# Ping VDSM twice a second and print the round-trip time; report any
# request that gets no response.
while True:
    t0 = time()
    try:
        s.ping()
        print(time() - t0)
    except Exception:
        print("No response for JSON-RPC Host.ping request")
    sleep(0.5)

There were no failure messages and the ping round-trip time was consistent, but the engine reported that the servers' state was changing every few minutes. This may be caused by recent changes in vdsm-jsonrpc.

This comment was originally posted by mburman

Comment 17 rhev-integ 2016-11-16 10:13:15 UTC
Michael,

Please do one test for me. Take any version you tested and configure a bunch of hosts. Block one of them and see whether you get the same results.
Once you reproduce the same issue without using OVS, please attach the logs.

This comment was originally posted by pkliczew

Comment 18 rhev-integ 2016-11-16 10:13:22 UTC
Piotr, 

I'm attaching the engine log in debug mode from our tests today.

This comment was originally posted by mburman

Comment 19 rhev-integ 2016-11-16 10:13:28 UTC
Created attachment 1220866 [details]
engine in debug

This comment was originally posted by mburman

Comment 20 rhev-integ 2016-11-16 10:13:35 UTC
Michael, can you check which versions are affected?

This comment was originally posted by pkliczew

Comment 21 rhev-integ 2016-11-16 10:13:40 UTC
(In reply to Piotr Kliczewski from comment #18)
> Michael, can you check which versions are affected?

Latest master 4.1 and rhevm-4.0.5.5-0.1.el7ev.noarch

This comment was originally posted by mburman

Comment 22 rhev-integ 2016-11-16 10:13:47 UTC
During my investigation of the issue I saw that when one host is down, every 3-4 minutes the other hosts go down for less than a second with a "heartbeat exceeded" error and then come back Up again. This continues for as long as that host is down. The fix was tested on Michael's environment (see the illustrative sketch after this comment).

This comment was originally posted by pkliczew
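
The gerrit patches linked above are titled "Closing channel could take too much reactor's time", which matches the symptom described in comment 22: while one host stays down, tearing down its channel ties up the reactor thread long enough for the heartbeats of the healthy hosts to expire. As an illustrative sketch only (the class and method names below are hypothetical, not the actual vdsm-jsonrpc-java code), the general pattern is to hand the potentially slow close off to a dedicated worker so the reactor loop keeps servicing reads, writes, and heartbeats for the remaining connections:

import java.nio.channels.SocketChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; not the actual vdsm-jsonrpc-java implementation.
public class NonBlockingClose {

    // Single worker dedicated to slow teardown work, kept off the
    // reactor's selector thread.
    private final ExecutorService closer = Executors.newSingleThreadExecutor();

    // Close a channel without blocking the reactor loop. If close() stalls
    // (for example on SSL shutdown), only this worker waits; the reactor
    // keeps servicing the other connections, so their heartbeats stay alive.
    public void closeAsync(final SocketChannel channel) {
        closer.submit(new Runnable() {
            @Override
            public void run() {
                try {
                    channel.close();
                } catch (Exception e) {
                    // The connection is being discarded anyway; ignore.
                }
            }
        });
    }
}

With a single dedicated worker, a stalled close delays only other pending closes, never the reactor's heartbeat handling for the hosts that are still up.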

Comment 23 Michael Burman 2016-12-11 14:16:27 UTC
Verified on - 4.0.6.3-0.1.el7ev

Comment 24 Martin Perina 2017-01-23 13:15:39 UTC
Closing manually, as the fix is included in vdsm-jsonrpc-java-1.2.10-1.el7ev, delivered as part of https://rhn.redhat.com/errata/RHBA-2017-0046.html, but we forgot to add the bug to that errata.

