Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1109722

Summary: Hosts do not change status properly during a network outage.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Roman Hodain <rhodain>
Component: ovirt-engine
Assignee: Eli Mesika <emesika>
Status: CLOSED NOTABUG
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: 3.4.0
CC: acathrow, amureini, gklein, iheim, laravot, lpeer, ofrenkel, oourfali, pstehlik, Rhev-m-bugs, rhodain, yeylon
Target Milestone: ---
Keywords: Triaged
Target Release: 3.5.0
Hardware: All
OS: Linux
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-07-27 09:46:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
	Engine logs
	Hypervisor statuses in time

Description Roman Hodain 2014-06-16 08:42:12 UTC
Description of problem:
	When a network outage occurs, the hypervisor remains in the UP state
for 180s.

Version-Release number of selected component (if applicable):
	rhevm-3.4.0-0.21.el6ev.noarch

How reproducible:
	100%

Steps to Reproduce:

	1. Have a hypervisor in up state.

	2. Run the following on the hypervisor:

		# iptables -I INPUT -p tcp --dport 54321 -j REJECT

Actual results:
	The hypervisor remains in the UP state; after 180s it flips to
Connecting and is almost immediately fenced.

Expected results:
	The hypervisor flips to the Connecting state immediately when the
network issue is detected, and then to the non-responding state.

Comment 1 Omer Frenkel 2014-06-18 07:48:01 UTC
Please attach engine.log.
I assume that this causes a timeout in requests from the engine to VDSM, and the timeout as defined in the engine is what you see:

/share/ovirt-engine/bin/engine-config.sh --get vdsTimeout
vdsTimeout: 180 version: general

If this is the case, I don't think this is a bug.
Usually, on a real network error, some error returns faster than the timeout.

Comment 6 Eli Mesika 2014-07-23 14:41:58 UTC
I have gone over the code, and it seems that we are setting the "Connecting" status at the right time.

Please try to reproduce, and at the moment you have hosts in UP rather than in Connecting, capture and attach the result of the following query:

select vds_name,status from vds_static a ,vds_dynamic b where a.vds_id=b.vds_id;


What I would like to check is whether this is a UI issue where the UI for some reason is not updated with the new status while the database has already been updated.
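The query above simply pairs each host's static name with its dynamic status via the shared vds_id. A minimal sketch of that join, using SQLite with hypothetical data (the real engine schema lives in PostgreSQL, and the status code used here is illustrative):

```python
import sqlite3

# In-memory stand-in for the engine DB: one host row in each table,
# joined on vds_id exactly as in the query from the comment.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE vds_static  (vds_id TEXT PRIMARY KEY, vds_name TEXT);
CREATE TABLE vds_dynamic (vds_id TEXT PRIMARY KEY, status INTEGER);
INSERT INTO vds_static  VALUES ('cf443954', '10.34.57.155');
INSERT INTO vds_dynamic VALUES ('cf443954', 2);  -- 2 = Connecting (illustrative)
""")
rows = db.execute(
    "SELECT vds_name, status FROM vds_static a, vds_dynamic b "
    "WHERE a.vds_id = b.vds_id"
).fetchall()
print(rows)  # [('10.34.57.155', 2)]
```

Comparing these DB rows against what the UI shows is exactly how a stale-UI hypothesis can be separated from a backend one.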

Comment 7 Roman Hodain 2014-07-24 08:20:10 UTC
(In reply to Eli Mesika from comment #6)
> I have gone over the code, and it seems that we are setting the "Connecting"
> status at the right time.
> 
> Please try to reproduce, and at the moment you have hosts in UP rather
> than in Connecting, capture and attach the result of the following query:
> 
> select vds_name,status from vds_static a ,vds_dynamic b where
> a.vds_id=b.vds_id;
> 
> 
> What I would like to check is whether this is a UI issue where the UI for
> some reason is not updated with the new status while the database has
> already been updated.

I have reproduced the issue. The hypervisor and the manager are time-synced.

I have run the following commands on the hypervisor:

	[root@vm-155 ~]# date;iptables -I INPUT -p tcp --dport 54321 -j REJECT
	Thu Jul 24 04:09:28 EDT 2014

	[root@vm-155 ~]# iptables -L
	Chain INPUT (policy ACCEPT)
	target     prot opt source               destination         
	REJECT     tcp  --  anywhere             anywhere            tcp dpt:54321 reject-with icmp-port-unreachable 
	
	Chain FORWARD (policy ACCEPT)
	target     prot opt source               destination         
	
	Chain OUTPUT (policy ACCEPT)
	target     prot opt source               destination

The engine.log is also attached, as is the status of the hypervisor from
the DB.

Comment 8 Roman Hodain 2014-07-24 08:20:55 UTC
Created attachment 920462 [details]
Engine logs

Comment 9 Roman Hodain 2014-07-24 08:22:24 UTC
Created attachment 920464 [details]
Hypervisor statuses in time

Comment 10 Eli Mesika 2014-07-24 08:52:02 UTC
How come all the vds_name, vds_id values are the same:

10.34.57.155 | cf443954-bf2f-47ee-a8a9-7d078d5b5b11

Comment 11 Roman Hodain 2014-07-24 09:04:29 UTC
(In reply to Eli Mesika from comment #10)
> How come all the vds_name, vds_id values are the same:
> 
> 10.34.57.155 | cf443954-bf2f-47ee-a8a9-7d078d5b5b11

Sorry for the confusion. There is just one hypervisor in the environment. The file was generated by the following command:

while true; do /usr/share/ovirt-engine/dbscripts/engine-psql.sh -q -t -c "select CURRENT_TIMESTAMP(0), vds_name,a.vds_id, status from vds_static a ,vds_dynamic b where a.vds_id=b.vds_id"| grep -v '^$'>>/tmp/vds_status.out; sleep 1; done

I started it before the test and stopped it after the test.
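The polling loop above can be sketched in Python (fetch_status() is a hypothetical stand-in for the engine-psql.sh query; the real loop samples once per second, while the sketch uses a zero interval to stay fast):

```python
import datetime
import time


def fetch_status():
    # Stand-in for the engine-psql.sh query: the real loop selects
    # vds_name and status from vds_static joined with vds_dynamic.
    return [("10.34.57.155", "Up")]


def poll(samples, interval=0.0):
    # Append one timestamped line per host per sample, mirroring the
    # vds_status.out file produced by the shell loop.
    out = []
    for _ in range(samples):
        now = datetime.datetime.now(datetime.timezone.utc).replace(microsecond=0)
        for name, status in fetch_status():
            out.append(f"{now.isoformat()} | {name} | {status}")
        time.sleep(interval)  # 1 s in the real loop
    return out


lines = poll(3)
print(lines)
```

A one-second sample rate is fine here because the transitions of interest (UP to Connecting to Non Responsive) are tens of seconds apart.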

Comment 13 Eli Mesika 2014-07-27 09:46:13 UTC
After checking the scenario, it seems that we get an immediate notification when the engine=>hypervisor connection is blocked.
In the BZ case, the hypervisor=>VDSM connection was blocked using iptables; the engine will not notice that before the 180-second timeout.
My conclusion is that this behavior has not changed (I also ran 'git blame' on the relevant code and saw no changes in behavior).

So, bottom line: it seems that the engine never listened for hypervisor=>VDSM disconnects, so the only way was (and is) to wait for the timeout on the executed VDSM operation and get the exception.

After discussing the results with Oved, it was decided to close this as NOTABUG.