1916171 – [NON-HE] Host is reported 'up' by RHV while it is rebooting

Summary: [NON-HE] Host is reported 'up' by RHV while it is rebooting

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Infra
Sub Component:
Version:	4.4.4.7
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Artur Socha
QA Contact:	Lucie Leistnerova
Docs Contact:
URL:
Whiteboard:
Depends On:	1936897
Blocks:

Reported:	2021-01-14 11:48 UTC by msheena
Modified:	2021-10-05 11:30 UTC
CC List:	14 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2021-10-05 11:30:20 UTC
oVirt Team:	Infra
Embargoed:
Dependent Products:

Description msheena 2021-01-14 11:48:02 UTC

Description of problem
======================
Given I have a non-HE RHV environment
When I SSH into one of the hosts in the cluster (not SPM)
And I execute `# reboot -f`
Then RHV reports the host status 'up' while the host is going through a reboot

Version-Release number of selected component (if applicable)
============================================================
4.4.4.7-0.1.el8ev

How reproducible
================
100% on non-HE deployments.
* A workaround is to restart the ovirt-engine service *
* This seems to reproduce on deployments that have been running for some period of time; the exact period was not determined empirically *

Steps to Reproduce
==================
1. SSH to root user of one of the hosts in the cluster.
2. Execute `# reboot -f` on the host.
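
To capture when each status change is actually seen while reproducing, a small polling helper may be useful. This is a minimal sketch and not part of the original report: `get_status` is a hypothetical callable that returns the host status as a string (however it is obtained from the engine), and `record_transitions` and its parameters are illustrative names.

```python
import time
from typing import Callable, List, Optional, Tuple


def record_transitions(
    get_status: Callable[[], str],
    duration_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[float, str]]:
    """Poll get_status() and return (seconds since start, status) per change.

    get_status is a hypothetical callable supplied by the caller; this
    sketch does not talk to the engine itself.
    """
    start = clock()
    transitions: List[Tuple[float, str]] = []
    last: Optional[str] = None
    while True:
        elapsed = clock() - start
        if elapsed > duration_s:
            break
        status = get_status()
        # Record only changes, so the result reads as a status timeline.
        if status != last:
            transitions.append((elapsed, status))
            last = status
        sleep(interval_s)
    return transitions
```

Starting the poll right before running `reboot -f` and using an interval of one second or less should make it visible whether 'connecting' appears within the expected 3 seconds.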

Actual results
==============
The host status remains 'up' until the host finishes rebooting, at which point the host transitions to the 'connecting' state for less than 2 seconds and then back to 'up'.

Expected results
================
The host transitions to the 'connecting' state within 3 seconds of the reboot, and then to 'non-responsive'. Only when the host finishes rebooting is it reported as 'connecting' and then 'up'.
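
The expected behaviour can be written down as a check on an observed status timeline. This is a minimal sketch under stated assumptions, not something from the original report: the timeline is a list of `(seconds_since_reboot, status)` pairs, the status strings are the ones used in this report, and `EXPECTED_ORDER` and `matches_expected` are illustrative names.

```python
from typing import List, Tuple

# Status order described under "Expected results", starting from the
# state the host is in just before the reboot (assumption: 'up').
EXPECTED_ORDER = ["up", "connecting", "non-responsive", "connecting", "up"]


def matches_expected(
    transitions: List[Tuple[float, str]],
    max_connecting_delay_s: float = 3.0,
) -> bool:
    """Return True if the timeline follows the expected status order and
    the first 'connecting' is seen within max_connecting_delay_s of the
    reboot (timestamps are seconds since the reboot was issued)."""
    statuses = [status for _, status in transitions]
    if statuses != EXPECTED_ORDER:
        return False
    first_connecting_at = transitions[1][0]
    return first_connecting_at <= max_connecting_delay_s
```

A timeline like the one under "Actual results" ('up', then a brief 'connecting' after the reboot completes, then 'up') fails this check because 'non-responsive' is never reported.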

Additional info
===============
# As noted above, a possible workaround for this situation is restarting the ovirt-engine service.

# This is possibly related to bug 1846338, but that cannot be determined at the moment without a deeper investigation.

# I wasn't able to measure how long it takes for my environment to become "faulty" and stop reporting the correct status for the rebooted host. Ideally, an environment used to reproduce this bug should have been live for more than a day or two.

Comment 3 Martin Perina 2021-03-09 17:06:01 UTC

We need to wait until we get more information from GC, as introduced in BZ1936897.

Comment 5 Martin Perina 2021-06-17 13:00:18 UTC

Closing for now; feel free to reopen if this is still reproducible on the latest version.

Comment 6 Martin Perina 2021-07-19 11:49:13 UTC

Reopening because it is currently easy to reproduce.

Comment 8 Martin Perina 2021-10-05 11:30:20 UTC

Unfortunately, we are again unable to reproduce the issue, so we need to close the bug.
