Bug 1842344 - Status loop due to host initialization not checking network status, monitoring finding the network issue and auto-recovery.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ovirt-4.4.3
Target Release: 4.4.3
Assignee: Eli Mesika
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-01 02:28 UTC by Germano Veit Michel
Modified: 2023-12-15 18:02 UTC
CC: 3 users

Fixed In Version: ovirt-engine-4.4.3.7
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-24 13:09:21 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:5179 0 None None None 2020-11-24 13:09:46 UTC
oVirt gerrit 110164 0 master MERGED core: adding networking checks to auto recovery 2020-11-09 09:56:36 UTC
oVirt gerrit 111677 0 master MERGED fix nics retrieval in auto recovery 2020-11-09 09:56:35 UTC

Description Germano Veit Michel 2020-06-01 02:28:52 UTC
Description of problem:

Host initialization does not seem to check whether all required networks are configured and up, so the host can briefly switch to Up without proper networking configured or all interfaces up. Host monitoring then finds the missing/bad network and moves the host to Non-Operational.

In conjunction with auto-recovery, this causes the host to loop through these states:
Non Operational -> Up -> Non Operational -> Up ...

This can become annoying and produces a lot of events. I am also concerned that the small window in which the host is incorrectly set to Up could cause further issues.

See in more details:

1. Interface is down, host moved to Non-Op

2020-06-01 12:16:19,143+10 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-33) [] Host 'host2.kvm' moved to Non-Operational state because interface/s which are down are needed by required network/s in the current cluster: 'eth2 (storage-B)'

2020-06-01 12:16:19,317+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-33) [6faa2e26] EVENT_ID: VDS_SET_NONOPERATIONAL(517), Host host2.kvm moved to Non-Operational state.

2. Autorecovery

2020-06-01 12:20:00,057+10 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-36) [] Autorecovering hosts id: ff49f88c-a98d-4aa5-9fff-831bd0b80b5d , name : host2.kvm

3. Switches to Up

2020-06-01 12:20:04,874+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [3647a97a] EVENT_ID: VDS_DETECTED(13), Status of host host2.kvm was set to Up.

4. Host monitoring picks it up again

2020-06-01 12:21:10,840+10 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [] Host 'host2.kvm' moved to Non-Operational state because interface/s which are down are needed by required network/s in the current cluster: 'eth2 (storage-B)'

5. GOTO 2
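The merged gerrit patches (110164, 111677) address this loop by adding networking checks to auto-recovery. As a rough sketch of that idea only - this is not the actual ovirt-engine code, and the class, record, and method names below are hypothetical - the recovery path could skip hosts whose required networks sit on down NICs:

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: before auto-recovery moves a Non-Operational host
// back to Up, verify that every NIC carrying a required cluster network
// is actually up. Names are illustrative, not the real ovirt-engine API.
public class AutoRecoverySketch {

    record Nic(String name, String network, boolean up) {}

    /** Returns true only if no required network sits on a down NIC. */
    static boolean eligibleForRecovery(List<Nic> hostNics, Set<String> requiredNetworks) {
        return hostNics.stream()
                .filter(nic -> requiredNetworks.contains(nic.network()))
                .allMatch(Nic::up);
    }

    public static void main(String[] args) {
        List<Nic> nics = List.of(
                new Nic("eth0", "ovirtmgmt", true),
                new Nic("eth2", "storage-B", false)); // link is down, as in the logs above
        // storage-B is required and its NIC is down -> skip auto-recovery,
        // leaving the host Non-Operational instead of flapping back to Up.
        System.out.println(eligibleForRecovery(nics, Set.of("ovirtmgmt", "storage-B")));
    }
}
```

With such a check in place, step 2 of the loop above never fires while `eth2` is down, so the host stays Non-Operational until the link is restored.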

Version-Release number of selected component (if applicable):
rhvm-4.3.9.4-11.el7.noarch (customer)
rhvm-4.4.0-0.34.master.el8ev.noarch (labs)

How reproducible:
Always

Steps to Reproduce:
1. Configure a required network on a Host
2. Pull the cable of the interface
     OR -if on nested KVM- take the link down:
   $ virsh domif-setlink host2 vnet9 down

Actual results:
Host status loops: Up -> Non Operational -> Up -> Non Operational

Expected results:
Non Operational

Comment 9 Michael Burman 2020-09-21 08:07:04 UTC
The bug has failed QA on rhvm-4.4.3.3-0.19.el8ev.noarch and vdsm-4.40.30-1.el8ev.x86_64.

Scenario 1 works fine - detach a required network from a host - the host does not enter the status loop and remains Non-Operational the whole time, as expected.

Scenario 2 - which is also the one reported in this bug - take down the link of an interface that has a required network attached to it - FAILED - it still behaves the same as before. The host loops between Non-Operational and Up. The issue is not fixed.

Comment 16 Michael Burman 2020-10-25 10:19:43 UTC
Verified on - 4.4.3.8-0.1.el8ev and vdsm-4.40.35-1.el8ev.x86_64

Comment 20 errata-xmlrpc 2020-11-24 13:09:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: Red Hat Virtualization security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5179

