Bug 1916171 - [NON-HE] Host is reported 'up' by RHV while it is rebooting
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 4.4.4.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Artur Socha
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Depends On: 1936897
Blocks:
Reported: 2021-01-14 11:48 UTC by msheena
Modified: 2021-10-05 11:30 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-05 11:30:20 UTC
oVirt Team: Infra
Embargoed:



Description msheena 2021-01-14 11:48:02 UTC
Description of problem
======================
Given I have a non-HE RHV environment
When I SSH into one of the hosts in the cluster (not SPM)
And I execute `# reboot -f`
Then RHV reports the host status 'up' while the host is going through a reboot

Version-Release number of selected component (if applicable)
============================================================
4.4.4.7-0.1.el8ev

How reproducible
================
100% on non-HE deployments.
* This can be worked around by restarting the ovirt-engine service *
* It seems to reproduce on deployments that have been running for some period of time; the exact period was not determined empirically *

Steps to Reproduce
==================
1. SSH to root user of one of the hosts in the cluster.
2. Execute `# reboot -f` on the host.

Actual results
==============
The host status remains 'up' until the host finishes rebooting, at which point the host transitions to the 'connecting' state for less than 2 seconds and then back to 'up'.

Expected results
================
The host transitions to the 'connecting' state within 3 seconds of the reboot, and then to 'non-responsive'. Only when the host finishes rebooting is it reported as 'connecting' and then 'up'.
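
The expected sequence above can be written down as a small check over recorded status samples. This is only a sketch: the (timestamp, status) samples are assumed to be collected by some external polling of the engine, which is not shown here, and the function names, the status strings, and the `max_delay` parameter are hypothetical names chosen for illustration, not part of ovirt-engine.

```python
from typing import List, Tuple

# Expected order of distinct host statuses around a forced reboot,
# taken from the "Expected results" text (strings are illustrative).
EXPECTED = ["up", "connecting", "non-responsive", "connecting", "up"]

Sample = Tuple[float, str]  # (seconds since start of observation, status)


def collapse(samples: List[Sample]) -> List[Sample]:
    """Drop consecutive duplicate statuses, keeping the first timestamp of each run."""
    runs: List[Sample] = []
    for ts, status in samples:
        if not runs or runs[-1][1] != status:
            runs.append((ts, status))
    return runs


def matches_expected(samples: List[Sample], reboot_ts: float,
                     max_delay: float = 3.0) -> bool:
    """Return True if the samples follow EXPECTED and the first
    'connecting' status appears within max_delay seconds of the reboot."""
    runs = collapse(samples)
    if [status for _, status in runs] != EXPECTED:
        return False
    first_connecting_ts = runs[1][0]
    return first_connecting_ts - reboot_ts <= max_delay
```

With samples like those in "Actual results" (the host stays 'up', then goes briefly to 'connecting' and back to 'up'), the 'non-responsive' step is missing and the check returns False.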

Additional info
===============
# As noted above, a possible workaround for this situation is restarting the ovirt-engine service.

# This is possibly a 'family member' of bug 1846338, but this cannot be determined at the moment, without a deeper investigation.

# I wasn't able to measure how long it takes for my environment to become "faulty" and stop reporting the correct status for the rebooted host. Ideally, an environment used to reproduce this bug should have been live for more than a day or two.

Comment 3 Martin Perina 2021-03-09 17:06:01 UTC
We need to wait till we get more information from GC as introduced in BZ1936897

Comment 5 Martin Perina 2021-06-17 13:00:18 UTC
Closing for now, feel free to reopen if this is still reproducible on the latest version

Comment 6 Martin Perina 2021-07-19 11:49:13 UTC
Reopening because it is currently easy to reproduce

Comment 8 Martin Perina 2021-10-05 11:30:20 UTC
Unfortunately, we are again unable to reproduce the issue, so we need to close the bug

