Bug 953115

Summary:

Host is flapping between non-operational and UP status when required network is down.

Product:

Red Hat Enterprise Virtualization Manager

Reporter:

Meni Yakove <myakove>

Component:

ovirt-engine

Assignee:

Mike Kolesnik <mkolesni>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Meni Yakove <myakove>

Severity:

high

Priority:

urgent

Version:

3.2.0

CC:

acathrow, dyasny, iheim, lpeer, mpavlik, Rhev-m-bugs, sgrinber, yeylon, ykaul, yzaslavs

Target Release:

3.2.0

Hardware:

x86_64

OS:

Linux

Whiteboard:

network

Fixed In Version:

sf14

Doc Type:

Bug Fix

Type:

Bug

oVirt Team:

Network

Attachments:

Description	Flags
engine.log	none

Description Meni Yakove 2013-04-17 11:57:38 UTC

Description of problem:
When a required network is down, the host changes status to non-operational, and after a few minutes it changes back to UP status for a few more minutes.
Also, when the host is activated manually, its status changes to UP and then to non-operational again.
While the host is in the UP state I can run a VM on it (even though the required network is down).


Version-Release number of selected component (if applicable):
rhevm-3.2.0-10.19.beta2.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create network NET1 on the cluster and make it a required network.
2. Attach NET1 to eth1 on the host. 
3. On the host run: ifconfig eth1 down.
  
Actual results:
Host is flapping between non-operational and UP status

Expected results:
Host should stay non-operational at all times while the required network is down.

Comment 1 Meni Yakove 2013-04-17 12:00:51 UTC

Created attachment 736817
engine.log

Comment 2 Moti Asayag 2013-04-18 06:55:03 UTC

The source of this bug is the difference between the data collected by GetVdsCapabilities and the data collected by GetVdsStats.

When a host is activated, GetVdsCapabilities is called and verifies, from the network point of view, that the required networks exist on the host.
GetVdsCapabilities doesn't contain any information about the state of the underlying nics.

The state of the underlying nics is verified regularly by GetVdsStats, which checks that the nics configured for the logical networks are valid; otherwise the host is set to non-operational status.

Every 5 minutes, the Auto-Recovery manager tries to activate any of the hosts in non-operational status and the same scenario recurs.
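
The cycle described above can be sketched as a small simulation. This is an illustrative model only: the names `Host`, `refresh_capabilities`, `refresh_stats` and `auto_recovery` are hypothetical stand-ins and are not taken from the ovirt-engine code base.

```python
from dataclasses import dataclass, field

UP = "Up"
NON_OPERATIONAL = "NonOperational"


@dataclass
class Host:
    # Logical networks configured on the host, mapped to the nic carrying them.
    networks: dict
    # Link state per nic (True means the link is up).
    nic_up: dict
    status: str = UP
    history: list = field(default_factory=list)

    def set_status(self, status):
        # Record only real transitions so the flapping is easy to see.
        if status != self.status:
            self.status = status
            self.history.append(status)


def refresh_capabilities(host, required):
    """Activation-time check: only that required networks are configured."""
    ok = all(net in host.networks for net in required)
    host.set_status(UP if ok else NON_OPERATIONAL)


def refresh_stats(host, required):
    """Periodic check: nics behind required networks must have link."""
    if host.status != UP:
        return  # this model only polls hosts that are Up
    if any(not host.nic_up[host.networks[net]] for net in required):
        host.set_status(NON_OPERATIONAL)


def auto_recovery(host, required):
    """Periodic job: try to activate the host if it is non-operational."""
    if host.status == NON_OPERATIONAL:
        refresh_capabilities(host, required)


def simulate(cycles):
    # NET1 is attached to eth1, and eth1 has been brought down.
    host = Host(networks={"NET1": "eth1"}, nic_up={"eth1": False})
    required = ["NET1"]
    for _ in range(cycles):
        refresh_stats(host, required)   # notices that eth1 is down
        auto_recovery(host, required)   # reactivates: NET1 is still configured
    return host.history


print(simulate(2))
```

Because the activation-time check never looks at link state, every recovery attempt in this model succeeds and the next stats pass undoes it, which matches the flapping in the report.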

Comment 3 Simon Grinberg 2013-04-18 08:30:35 UTC

This can be done in 3 ways:
1 - Before trying to activate, the auto recovery should get stats and not try to activate if the error conditions still exist - not sure this can be done for storage though.
2 - Every getVdsStats call that indicates the interface is still down should reset the auto-recovery timer.
3 - Before moving from Non-Operational to Up, call getVdsStats to determine whether to move the host to Up in the first place.

We are seeing this incorrect behaviour far too often and it needs to be solved for all cases.

Comment 5 Moti Asayag 2013-04-18 09:49:43 UTC

(In reply to comment #3)
> This can be done in 3 ways:
> 1 - Before trying to activate, the auto recovery should get stats and not
> try to activate if the error conditions still exist - not sure this
> can be done for storage though.

We can prevent the auto-recovery from running if the host has an invalid nic for a required network, since such a host would eventually end up in non-operational status again.
Statistics are not collected from a host while it is in non-operational status.
Therefore, if a host moves to non-operational for a network reason, the user will have to activate it manually.

> 2 - Every getVdsStats call that indicates the interface is still down should
> reset the auto-recovery timer 

The issue with 2 is that the timer isn't per-host; a single timer attempts to activate all of the hosts.

> 3 - Before moving from Non-Operational to Up, call getVdsStats to determine
> whether to move the host to Up in the first place.
> 

This requires invoking GetVdsStats from the host refresh-capabilities flow.

> We are seeing this incorrect behaviour far too often and it needs to be
> solved for all cases.
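
For comparison, option 3 could look roughly like the sketch below, where the decision to leave Non-Operational consults both a capabilities-style view (which networks are configured) and a stats-style view (which nics have link). The function name and the dictionary shapes are assumptions made for illustration, not engine APIs.

```python
def decide_status(configured_networks, nic_link_up, required_networks):
    """Return the status a host should get when leaving Non-Operational.

    configured_networks: dict of network name -> nic name (capabilities view)
    nic_link_up:         dict of nic name -> bool (stats view)
    """
    for net in required_networks:
        nic = configured_networks.get(net)
        if nic is None:
            return "NonOperational"  # required network is not configured
        if not nic_link_up.get(nic, False):
            return "NonOperational"  # nic behind the network has no link
    return "Up"


# NET1 is configured on eth1, but eth1 is down: the host must not go Up.
print(decide_status({"NET1": "eth1"}, {"eth1": False}, ["NET1"]))
```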

Comment 6 Mike Kolesnik 2013-04-18 14:04:49 UTC

Pushed a patch to fix according to solution #1, as presented by Moti.
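
As a rough illustration of what solution #1 amounts to (not the patch itself), the auto-recovery job can skip any non-operational host whose last known state shows a down nic behind a required network. The host dictionaries and function names below are hypothetical.

```python
def has_down_required_nic(host, required_networks):
    """True if any required network is missing or sits on a nic without link."""
    return any(
        not host["nic_up"].get(host["networks"].get(net), False)
        for net in required_networks
    )


def hosts_to_recover(hosts, required_networks):
    """Select the non-operational hosts that are worth trying to activate."""
    return [
        h["name"]
        for h in hosts
        if h["status"] == "NonOperational"
        and not has_down_required_nic(h, required_networks)
    ]


hosts = [
    {"name": "host-a", "status": "NonOperational",
     "networks": {"NET1": "eth1"}, "nic_up": {"eth1": False}},
    {"name": "host-b", "status": "NonOperational",
     "networks": {"NET1": "eth1"}, "nic_up": {"eth1": True}},
    {"name": "host-c", "status": "Up",
     "networks": {"NET1": "eth1"}, "nic_up": {"eth1": True}},
]

# host-a is skipped because eth1 is still down; host-c is already Up.
print(hosts_to_recover(hosts, ["NET1"]))
```

The trade-off noted in comment 5 still applies: a host skipped this way stays non-operational until a user activates it manually.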

Comment 8 Martin Pavlik 2013-04-26 14:50:02 UTC

In SF14 the host stays non-operational at all times, as expected.

Comment 9 Itamar Heim 2013-06-11 09:44:17 UTC

3.2 has been released
