Bug 953115
Summary: Host is flapping between non-operational and UP status when required network is down.
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.2.0
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: urgent
Reporter: Meni Yakove <myakove>
Assignee: Mike Kolesnik <mkolesni>
QA Contact: Meni Yakove <myakove>
CC: acathrow, dyasny, iheim, lpeer, mpavlik, Rhev-m-bugs, sgrinber, yeylon, ykaul, yzaslavs
Target Release: 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard: network
Fixed In Version: sf14
Doc Type: Bug Fix
Type: Bug
oVirt Team: Network
Description
Meni Yakove
2013-04-17 11:57:38 UTC
Created attachment 736817
engine.log
The source of this bug is the difference between the data collected by GetVdsCapabilities and the data collected by GetVdsStats. When a host is activated, GetVdsCapabilities is called and verifies, from the network point of view, that the required networks exist on the host. GetVdsCapabilities doesn't contain any information about the state of the underlying nics. The state of the underlying nics is checked regularly by GetVdsStats, which verifies that the nics configured for the logical networks are valid; otherwise the host is set to non-operational status. Every 5 minutes, the Auto-Recovery manager tries to activate every host in non-operational status, and the same scenario recurs.

This can be done in 3 ways:
1 - Before trying to activate, the auto recovery should get stats and not try to activate if the error conditions still exist - not sure this can be done for storage though.
2 - Every getVdsStats call that indicates the interface is still down should reset the auto-recovery timer.
3 - Before moving from Non-Operational to Up, call getVdsStats to determine whether to move the host to Up in the first place.
We are seeing this incorrect behaviour just too often and it needs to be solved for all cases.

(In reply to comment #3)
> This can be done in 3 ways:
> 1 - Before trying to activate, the auto recovery should get stats and not
> try to activate if the error conditions still exist - not sure this
> can be done for storage though.

We can prevent the auto-recovery from running if the host has an invalid nic for a required network, since such a host would eventually end up in non-operational status again. Statistics are not collected from a host in non-operational status. Therefore, if a host moves to non-operational for a network reason, the user will have to activate the host manually.

> 2 - Every getVdsStats call that indicates the interface is still down should
> reset the auto-recovery timer.

The issue with 2 is that the timer isn't per-host; a single timer attempts to activate all of the hosts.

> 3 - Before moving from Non-Operational to Up, call getVdsStats to determine
> whether to move the host to Up in the first place.

This requires invoking GetVdsStats from the host refresh capabilities flow.

> We are seeing this incorrect behaviour just too often and it needs to be
> solved for all cases.

Pushed a patch to fix according to solution #1, as presented by Moti.

In SF14 the host is non-operational at all times, as expected.

3.2 has been released.
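
The flapping described above, and the effect of solution #1, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the ovirt-engine code or the actual patch: the `Host` model, the `reason` field, the network and NIC names, and every function name are hypothetical. The two check functions stand in for what the thread says GetVdsCapabilities and GetVdsStats verify.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    UP = "Up"
    NON_OPERATIONAL = "NonOperational"


@dataclass
class Host:
    # Hypothetical model: required network name -> NIC name, NIC name -> link is up.
    name: str
    required_networks: dict
    nic_up: dict
    status: Status = Status.UP
    reason: str = ""


def capabilities_ok(host):
    # Stand-in for the GetVdsCapabilities check: the required networks exist
    # on the host. NIC link state is not visible here.
    return all(nic in host.nic_up for nic in host.required_networks.values())


def stats_ok(host):
    # Stand-in for the GetVdsStats check: NICs behind required networks are up.
    return all(host.nic_up.get(nic, False) for nic in host.required_networks.values())


def monitor(host):
    # Periodic monitoring; only hosts that are Up have their stats checked.
    if host.status is Status.UP and not stats_ok(host):
        host.status = Status.NON_OPERATIONAL
        host.reason = "network"


def auto_recover(hosts, skip_network_failures):
    # One shared timer tick that tries to activate every non-operational host.
    for host in hosts:
        if host.status is not Status.NON_OPERATIONAL:
            continue
        if skip_network_failures and host.reason == "network":
            continue  # solution #1 (as sketched): leave the host for manual activation
        if capabilities_ok(host):
            host.status = Status.UP
            host.reason = ""


def simulate(skip_network_failures, cycles=3):
    # A host whose required network is configured but whose NIC link is down.
    host = Host("host1", {"net1": "eth0"}, {"eth0": False})
    history = []
    for _ in range(cycles):
        monitor(host)
        history.append(host.status)
        auto_recover([host], skip_network_failures)
        history.append(host.status)
    return history


if __name__ == "__main__":
    print([s.value for s in simulate(skip_network_failures=False)])
    print([s.value for s in simulate(skip_network_failures=True)])
```

Without the skip, the simulated host alternates between NonOperational and Up on every cycle, because the activation check never looks at link state. With the skip, it stays NonOperational until someone activates it manually, which matches the trade-off noted in the reply to option 1.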