Bug 953115

Summary: Host is flapping between non-operational and UP status when required network is down.
Product: Red Hat Enterprise Virtualization Manager Reporter: Meni Yakove <myakove>
Component: ovirt-engine Assignee: Mike Kolesnik <mkolesni>
Status: CLOSED CURRENTRELEASE QA Contact: Meni Yakove <myakove>
Severity: high Docs Contact:
Priority: urgent    
Version: 3.2.0 CC: acathrow, dyasny, iheim, lpeer, mpavlik, Rhev-m-bugs, sgrinber, yeylon, ykaul, yzaslavs
Target Milestone: ---   
Target Release: 3.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: network
Fixed In Version: sf14 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
engine.log none

Description Meni Yakove 2013-04-17 11:57:38 UTC
Description of problem:
When a required network is down, the host changes status to non-operational, and after a few minutes it changes back to UP status for a few more minutes.
Also, when activating it manually, the status of the host changes to UP and then to non-operational again.
While the host is in UP state I can run a VM on it (even though the required network is down).


Version-Release number of selected component (if applicable):
rhevm-3.2.0-10.19.beta2.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create network NET1 on the cluster and make it a required network.
2. Attach NET1 to eth1 on the host. 
3. On the host run: ifconfig eth1 down.
  
Actual results:
Host is flapping between non-operational and UP status

Expected results:
Host should be non-operational at all times.

Comment 1 Meni Yakove 2013-04-17 12:00:51 UTC
Created attachment 736817 [details]
engine.log

Comment 2 Moti Asayag 2013-04-18 06:55:03 UTC
The source of this bug is the difference between the data collected by GetVdsCapabilities and the data collected by GetVdsStats.

When activating a host, GetVdsCapabilities is called and, from the network point of view, verifies only that the required networks exist on the host.
GetVdsCapabilities doesn't contain any information about the state of the underlying nics.

The state of the underlying nics is verified regularly by GetVdsStats, which checks that the nics configured for the logical networks are valid; otherwise the host is set to non-operational status.

Every 5 minutes, the Auto-Recovery manager tries to activate any of the hosts in non-operational status and the same scenario recurs.
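
The cycle described above can be illustrated with a small, self-contained Python sketch. It is not ovirt-engine code: the Host class, activate(), collect_stats() and simulate() are hypothetical names, and the model assumes stats collection notices the down nic in the same minute as the activation, whereas on a real setup the host stays UP for a few minutes first.

```python
from dataclasses import dataclass

UP = "Up"
NON_OPERATIONAL = "NonOperational"


@dataclass
class Host:
    # Hypothetical model, not an ovirt-engine class.
    attached_networks: set   # what a capabilities query would report
    nics_up: dict            # network name -> link state, visible only to stats
    status: str = UP


def activate(host, required):
    # Activation looks only at which networks are attached, not at nic state.
    host.status = UP if required <= host.attached_networks else NON_OPERATIONAL


def collect_stats(host, required):
    # Assumption: stats are collected only while the host is Up.
    if host.status != UP:
        return
    if any(not host.nics_up.get(net, False) for net in required):
        host.status = NON_OPERATIONAL


def simulate(minutes, recovery_interval=5):
    """Return the (minute, new_status) transitions of a host whose NET1 nic is down."""
    required = {"NET1"}
    host = Host(attached_networks={"NET1"}, nics_up={"NET1": False})
    transitions = []

    def record(minute, previous):
        if host.status != previous:
            transitions.append((minute, host.status))

    for minute in range(minutes):
        # Auto-recovery: periodically try to activate non-operational hosts.
        if minute % recovery_interval == 0 and host.status == NON_OPERATIONAL:
            previous = host.status
            activate(host, required)
            record(minute, previous)
        previous = host.status
        collect_stats(host, required)
        record(minute, previous)
    return transitions


for minute, status in simulate(11):
    print(minute, status)
```

In this model every auto-recovery pass moves the host to Up, because the network is still attached, and the next stats check moves it back to non-operational, because the nic is still down.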

Comment 3 Simon Grinberg 2013-04-18 08:30:35 UTC
This can be done in 3 ways:
1 - Before trying to activate, the auto recovery should get stats and not try to activate if the error condition still exists - not sure this can be done for storage though. 
2 - Every getVdsStats call that indicates the interface is still down should reset the auto-recovery timer 
3 - Before moving from Non-Operational to Up, call getVdsStats to determine whether to move the host to Up in the first place.

We are seeing this incorrect behaviour just too often and it needs to be solved for all cases.

Comment 5 Moti Asayag 2013-04-18 09:49:43 UTC
(In reply to comment #3)
> This can be done in 3 ways:
> 1 - Before trying to activate, the auto recovery should get stats and not
> try to activate if the error condition still exists - not sure this can be
> done for storage though. 

We can prevent the auto-recovery from running if the host has an invalid nic for a required network, since activating such a host would eventually end in non-operational status again. 
Statistics are not collected from a host in non-operational status.
Therefore, if a host moves to non-operational for a network reason, the user will have to activate it manually.
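
A minimal Python sketch of this filtering idea follows. It assumes the engine records why a host became non-operational at the moment of the transition, since no stats arrive afterwards. The Host class, the NETWORK_UNREACHABLE reason code and hosts_to_auto_recover() are hypothetical names for illustration, not the actual patch.

```python
from dataclasses import dataclass
from typing import Optional

NON_OPERATIONAL = "NonOperational"
NETWORK_UNREACHABLE = "NetworkUnreachable"  # hypothetical reason code


@dataclass
class Host:
    # Hypothetical model, not an ovirt-engine class.
    name: str
    status: str
    non_operational_reason: Optional[str] = None


def hosts_to_auto_recover(hosts):
    """Return non-operational hosts that auto-recovery should still try to activate.

    Hosts that went non-operational for a network reason are skipped,
    so they stay non-operational until a user activates them manually.
    """
    return [
        h for h in hosts
        if h.status == NON_OPERATIONAL
        and h.non_operational_reason != NETWORK_UNREACHABLE
    ]


hosts = [
    Host("host-a", NON_OPERATIONAL, NETWORK_UNREACHABLE),
    Host("host-b", NON_OPERATIONAL, "StorageDomainUnreachable"),
    Host("host-c", "Up"),
]
print([h.name for h in hosts_to_auto_recover(hosts)])
```

The trade-off is the one stated above: a host skipped this way does not come back on its own once the nic is fixed.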

> 2 - Every getVdsStats call that indicates the interface is still down should
> reset the auto-recovery timer 

The issue with 2 is that the timer isn't per-host; a single timer attempts to activate all of the hosts.
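
To illustrate why a shared timer makes option 2 awkward, here is a small Python sketch. SharedRecoveryTimer and recovery_runs() are hypothetical names, and the model assumes one global deadline with a 5-minute interval: a reset triggered on behalf of one host postpones recovery for every other host as well.

```python
class SharedRecoveryTimer:
    """A single deadline shared by all hosts (hypothetical model)."""

    def __init__(self, interval):
        self.interval = interval
        self.next_run = interval

    def reset(self, now):
        self.next_run = now + self.interval

    def due(self, now):
        return now >= self.next_run


def recovery_runs(reset_minutes, total_minutes, interval=5):
    """Return the minutes at which the shared auto-recovery pass would run.

    reset_minutes: minutes at which some host reports its nic is still down
    and therefore resets the shared timer, as option 2 proposes.
    """
    timer = SharedRecoveryTimer(interval)
    runs = []
    for now in range(total_minutes):
        if now in reset_minutes:
            timer.reset(now)
        if timer.due(now):
            runs.append(now)
            timer.reset(now)
    return runs


print(recovery_runs(set(), 12))            # no resets
print(recovery_runs(set(range(12)), 12))   # one host resets every minute
```

With no resets the pass runs on schedule. When one host resets the timer every minute, the pass never runs, so hosts that are non-operational for unrelated reasons are never retried either.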

> 3 - Before moving from Non-Operational to Up, call getVdsStats to determine
> whether to move the host to Up in the first place.
> 

This requires invoking GetVdsStats from the host refresh capabilities flow.

> We are seeing this incorrect behaviour just too often and it needs to be
> solved for all cases.

Comment 6 Mike Kolesnik 2013-04-18 14:04:49 UTC
Pushed a patch to fix according to solution #1, as presented by Moti.

Comment 8 Martin Pavlik 2013-04-26 14:50:02 UTC
In SF14 the host is non-operational at all times, as expected.

Comment 9 Itamar Heim 2013-06-11 09:44:17 UTC
3.2 has been released

Comment 10 Itamar Heim 2013-06-11 09:44:26 UTC
3.2 has been released

Comment 11 Itamar Heim 2013-06-11 09:55:28 UTC
3.2 has been released