Bug 953115 - Host is flapping between non-operational and UP status when required network is down.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 3.2.0
Assignee: Mike Kolesnik
QA Contact: Meni Yakove
URL:
Whiteboard: network
Depends On:
Blocks:
Reported: 2013-04-17 11:57 UTC by Meni Yakove
Modified: 2016-02-10 19:47 UTC

Fixed In Version: sf14
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments
engine.log (3.00 MB, text/x-log)
2013-04-17 12:00 UTC, Meni Yakove


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 14047 0 None None None Never

Description Meni Yakove 2013-04-17 11:57:38 UTC
Description of problem:
When a required network is down, the host changes status to non-operational, and after a few minutes it changes back to UP status for a few more minutes.
Also, when the host is activated manually, its status changes to UP and then to non-operational again.
While the host is in UP state I can run a VM on it (even though the required network is down).


Version-Release number of selected component (if applicable):
rhevm-3.2.0-10.19.beta2.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create network NET1 on the cluster and mark it as a required network.
2. Attach NET1 to eth1 on the host. 
3. On the host run: ifconfig eth1 down.
  
Actual results:
Host is flapping between non-operational and UP status

Expected results:
Host should be non-operational at all times.

Comment 1 Meni Yakove 2013-04-17 12:00:51 UTC
Created attachment 736817
engine.log

Comment 2 Moti Asayag 2013-04-18 06:55:03 UTC
The source of this bug is the difference between the data collected by GetVdsCapabilities and the data collected by GetVdsStats.

When a host is activated, GetVdsCapabilities is called, and from the network point of view it only verifies that the required networks exist on the host.
GetVdsCapabilities doesn't contain any information about the state of the underlying nics.

The state of the underlying nics is verified regularly by GetVdsStats, which checks that the nics configured for the logical networks are valid; otherwise the host is set to non-operational status.

Every 5 minutes, the Auto-Recovery manager tries to activate the hosts in non-operational status, and the same scenario recurs.
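
The cycle described above can be illustrated with a small toy model. This is only a sketch, not engine code: the Host fields, the function names and the one-minute stats period are assumptions made for illustration; only the 5-minute auto-recovery interval comes from this comment.

```python
from dataclasses import dataclass

AUTO_RECOVERY_INTERVAL_MIN = 5  # from the comment above
STATS_INTERVAL_MIN = 1          # assumed polling period, illustration only


@dataclass
class Host:
    required_networks_attached: bool  # what the capabilities check can see
    nic_up: bool                      # what only the stats check can see
    status: str = "Up"


def capabilities_check(host: Host) -> bool:
    # Toy stand-in for GetVdsCapabilities: network presence, no link state.
    return host.required_networks_attached


def stats_check(host: Host) -> bool:
    # Toy stand-in for GetVdsStats: the nic under the required network must be up.
    return host.nic_up


def simulate(host: Host, minutes: int) -> list:
    """Return the host status recorded at the end of each simulated minute."""
    history = []
    for minute in range(1, minutes + 1):
        if host.status == "Up" and minute % STATS_INTERVAL_MIN == 0:
            # Stats are only evaluated while the host is Up.
            if not stats_check(host):
                host.status = "NonOperational"
        elif (host.status == "NonOperational"
              and minute % AUTO_RECOVERY_INTERVAL_MIN == 0):
            # Auto-recovery re-activates based on capabilities alone.
            if capabilities_check(host):
                host.status = "Up"
        history.append(host.status)
    return history
```

With the required network attached but its nic down, simulate(Host(True, False), 10) keeps returning to "Up" every fifth minute and dropping back to "NonOperational" on the next stats check, which is the flapping reported here.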

Comment 3 Simon Grinberg 2013-04-18 08:30:35 UTC
This can be done in 3 ways:
1 - Before trying to activate, the auto recovery should get stats and not try to activate if the error conditions still exist - not sure this can be done for storage though.
2 - Every getVdsStats call that indicates the interface is still down should reset the auto-recovery timer
3 - Before moving from Non-Operational to Up, call getVdsStats to determine whether to move the host to Up in the first place.

We are seeing this incorrect behaviour just too often and it needs to be solved for all cases.

Comment 5 Moti Asayag 2013-04-18 09:49:43 UTC
(In reply to comment #3)
> This can be done in 3 ways:
> 1 - Before trying to activate, the auto recovery should get stats and not
> try to activate if the error conditions still exist - not sure this
> can be done for storage though.

We can prevent the auto-recovery from running for a host that has an invalid nic for a required network, since activating it would eventually end in non-operational status again.
Statistics are not collected from a host while it is in non-operational status.
Therefore, if a host moves to non-operational for a network reason, the user will have to activate the host manually.
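
A rough sketch of that idea, for illustration only: this is not the actual patch, and the Host fields and the NETWORK_REASON label are hypothetical names, not the engine's real identifiers.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical label for "a nic of a required network is down";
# the engine's real reason values are not shown in this report.
NETWORK_REASON = "REQUIRED_NETWORK_NIC_DOWN"


@dataclass
class Host:
    name: str
    status: str
    non_operational_reason: Optional[str] = None


def auto_recovery_candidates(hosts: List[Host]) -> List[Host]:
    """Return the non-operational hosts the periodic job should try to activate.

    Hosts that went non-operational for a network reason are skipped,
    so they stay non-operational until a user activates them manually.
    """
    return [
        h for h in hosts
        if h.status == "NonOperational"
        and h.non_operational_reason != NETWORK_REASON
    ]
```

The trade-off is the one stated above: because no statistics are collected in non-operational status, a host skipped this way does not come back on its own once the nic is up again.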

> 2 - Every getVdsStats call that indicates the interface is still down should
> reset the auto-recovery timer 

The issue with 2 is that the timer isn't per-host; a single timer attempts to activate all of the hosts.

> 3 - Before moving from Non-Operational to Up, call getVdsStats to determine
> whether to move the host to Up in the first place.
> 

This requires invoking GetVdsStats from the host refresh capabilities flow.

> We are seeing this incorrect behaviour just too often and it needs to be
> solved for all cases.

Comment 6 Mike Kolesnik 2013-04-18 14:04:49 UTC
Pushed a patch to fix according to solution #1, as presented by Moti.

Comment 8 Martin Pavlik 2013-04-26 14:50:02 UTC
In SF14 the host is non-operational at all times, as expected.

Comment 9 Itamar Heim 2013-06-11 09:44:17 UTC
3.2 has been released

Comment 10 Itamar Heim 2013-06-11 09:44:26 UTC
3.2 has been released

Comment 11 Itamar Heim 2013-06-11 09:55:28 UTC
3.2 has been released

