Bug 1842344

Summary: Status loop due to host initialization not checking network status, monitoring finding the network issue and auto-recovery.
Product: Red Hat Enterprise Virtualization Manager Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-engine Assignee: Eli Mesika <emesika>
Status: CLOSED ERRATA QA Contact: Michael Burman <mburman>
Severity: low Docs Contact:
Priority: unspecified    
Version: 4.3.9 CC: dholler, mburman, mperina
Target Milestone: ovirt-4.4.3 Keywords: Reopened
Target Release: 4.4.3   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ovirt-engine-4.4.3.7 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-24 13:09:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2020-06-01 02:28:52 UTC
Description of problem:

Host initialization does not seem to check if all required networks are configured and up, so the host can briefly switch to UP without proper networking configured or all interfaces up. Then host monitoring finds the missing/bad network and moves it to Non-Op.

In conjunction with auto-recovery, this causes the host to loop through these states:
Non Operational -> Up -> Non Operational -> Up ...

It can become annoying and produces a lot of events. I am also concerned that the small window during which the host is incorrectly set to Up could cause further issues.

In more detail:

1. Interface is down, host moved to Non-Op

2020-06-01 12:16:19,143+10 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-33) [] Host 'host2.kvm' moved to Non-Operational state because interface/s which are down are needed by required network/s in the current cluster: 'eth2 (storage-B)'

2020-06-01 12:16:19,317+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-33) [6faa2e26] EVENT_ID: VDS_SET_NONOPERATIONAL(517), Host host2.kvm moved to Non-Operational state.

2. Autorecovery

2020-06-01 12:20:00,057+10 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [] Autorecovering hosts id: ff49f88c-a98d-4aa5-9fff-831bd0b80b5d , name : host2.kvm

3. Switches to Up

2020-06-01 12:20:04,874+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [3647a97a] EVENT_ID: VDS_DETECTED(13), Status of host host2.kvm was set to Up.

4. Host monitoring picks it up again

2020-06-01 12:21:10,840+10 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [] Host 'host2.kvm' moved to Non-Operational state because interface/s which are down are needed by required network/s in the current cluster: 'eth2 (storage-B)'

5. GOTO 2
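The loop in steps 1-5 can be sketched as a tiny state machine. This is a hypothetical model for illustration, not oVirt engine code: `activate` stands in for AutoRecoveryManager, `monitor` for HostMonitoring, and the `check_networks_on_activation` flag represents the missing required-network check during host initialization.

```python
def activate(host, check_networks_on_activation):
    """Auto-recovery attempts to bring a Non-Operational host back Up.

    With the fix modeled here, activation re-checks required networks and
    keeps the host Non-Operational; without it, the host briefly goes Up.
    """
    if check_networks_on_activation and not host["required_nets_up"]:
        return "NonOperational"
    return "Up"

def monitor(host, status):
    """Host monitoring demotes an Up host whose required network
    (e.g. 'eth2 (storage-B)') sits on a downed interface."""
    if status == "Up" and not host["required_nets_up"]:
        return "NonOperational"
    return status

def run_cycles(check_networks_on_activation, cycles=3):
    """Simulate a few auto-recovery/monitoring rounds with the link down."""
    host = {"required_nets_up": False}  # cable pulled / link down
    seen = []
    for _ in range(cycles):
        status = activate(host, check_networks_on_activation)
        seen.append(status)
        seen.append(monitor(host, status))
    return seen
```

Without the activation-time check the simulated host oscillates Up -> Non-Operational on every round, matching the log excerpts above; with the check it stays Non-Operational throughout.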

Version-Release number of selected component (if applicable):
rhvm-4.3.9.4-11.el7.noarch (customer)
rhvm-4.4.0-0.34.master.el8ev.noarch (labs)

How reproducible:
Always

Steps to Reproduce:
1. Configure a required network on a Host
2. Pull the cable of the interface
     OR, if on nested KVM, take the link down from the hypervisor
     (note: domif-getlink only reports link state; domif-setlink changes it):
   $ virsh domif-setlink host2 vnet9 down

Actual results:
Host status loops: Up -> Non Operational -> Up -> Non Operational

Expected results:
Non Operational

Comment 9 Michael Burman 2020-09-21 08:07:04 UTC
The bug failed QA on rhvm-4.4.3.3-0.19.el8ev.noarch and vdsm-4.40.30-1.el8ev.x86_64

Scenario 1 works fine - detach a required network from a host - the host does not enter the status loop and remains Non-Operational the whole time, as expected.

Scenario 2 - which is also reported here in the bug - link down an interface that has a required network attached to it - FAILED. It still behaves the same as before: the host keeps cycling between Non-Operational and Up. The issue is not fixed.

Comment 16 Michael Burman 2020-10-25 10:19:43 UTC
Verified on - 4.4.3.8-0.1.el8ev and vdsm-4.40.35-1.el8ev.x86_64

Comment 20 errata-xmlrpc 2020-11-24 13:09:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: Red Hat Virtualization security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5179