Bug 1897114
| Summary: | Add additional logging information to be able to understand why host is stuck in Unassigned state | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Miguel Martin <mmartinv> |
| Component: | ovirt-engine | Assignee: | Artur Socha <asocha> |
| Status: | CLOSED ERRATA | QA Contact: | Pavol Brilla <pbrilla> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.3.8-1 | CC: | asocha, dfodor, emarcus, gdeolive, lleistne, mjankula, mkalinin, mperina, nsurati, redhat.bugzilla |
| Target Milestone: | ovirt-4.4.10 | Keywords: | Reopened, ZStream |
| Target Release: | 4.4.10 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-4.4.10.1 | Doc Type: | Enhancement |
| Doc Text: | In this release, monitoring of host refresh capabilities functionality was improved to help debug very rare production issues that sometimes caused the Red Hat Virtualization Manager to lose connectivity with the Red Hat Virtualization hosts. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-02-08 10:04:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1985906 | | |
Description (Miguel Martin, 2020-11-12 10:58:25 UTC)
We are in exactly the same situation:

Description of problem:
After a network issue between ovirt-engine and the hypervisors (8 of them), one hypervisor did not recover from the ovirt-engine point of view. The hypervisor itself is fine and working normally (the cluster is empty because it is still being built): I can SSH to the hypervisor from the ovirt-engine machine, access Cockpit, and so on.

At first the host was in the "Non Responding" state, and engine.log contained:

2020-12-09 18:05:48,069+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-38) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM XXXXXXXXXXX command Get Host Capabilities failed: Message timeout which can be caused by communication issues

I switched the host to Maintenance mode and that seemed to work. However, when I tried to activate it, the host was still unavailable, this time in the "Unassigned" state. The only option ovirt-engine proposes is "stop or reboot" via SSH or fencing.

Version-Release number of selected component (if applicable):
ovirt-engine.noarch 4.3.9.4-1.el7

How reproducible:
Unknown

Actual results:
The hypervisor is stuck forever in the 'Unassigned' state.

Expected results:
The hypervisor is activated normally; SSH from the ovirt-engine host to the hypervisor works.

Additional info:
Hypervisor: oVirt Node 4.3.10. I restarted vdsmd and mom on the hypervisor.

Closing this as CURRENTRELEASE, since we cannot reproduce it on the latest RHV 4.4. If it happens again on the latest version, please provide the relevant logs and reopen this bug.

If this is a testing environment and the customer is able to reproduce the issue, then I suggest the following:

1. Get the hosts into Up status.
2. Enable debug logging in RHVM: https://access.redhat.com/solutions/3880281
3. Enable VDSM debug logs on each host: https://access.redhat.com/articles/2919931#setting-log-level-permanently-5
4. Try to reproduce the issue.
5. Create an RHVM thread dump after the issue appears: https://access.redhat.com/solutions/3227681
6. Gather logs using sos-logcollector from RHVM and the affected host.

Is it possible, Nirav? Debug logs can be huge, but if the customer is able to reproduce the issue, it might give us some more clues, because we haven't been able to reproduce it so far. Thanks

The customer came back and said they are not concentrating on this issue right now; they are first working on resolving the storage issue that caused the original outage. At this point they do not know how reproducible it is. Do you want them to try to enable the debug logging now, or wait until the patch [1] gets approved upstream and we offer them the patch? Hopefully by that time we would know more about their storage issues too, and it will be easier to troubleshoot one issue at a time.

[1] https://gerrit.ovirt.org/c/ovirt-engine/+/117034/

@mperina

*** Bug 1975685 has been marked as a duplicate of this bug. ***

We are not able to reproduce the original/reopened issues in our environment. Thus the verification is not of the issue itself, but of the fix proposed by the developer.
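The fix adds a HostMonitoringWatchdog that periodically checks whether host monitoring is actually being executed and logs a warning when it is not (see the engine.log excerpts below). As a minimal sketch of that idea, assuming a simple timestamp-per-host scheme: this is not the actual ovirt-engine implementation, and apart from the HostMonitoringWatchdog class name taken from the log lines, all names, signatures, and thresholds here are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a host-monitoring watchdog: each successful host refresh
// records a timestamp, and a scheduled task periodically checks those
// timestamps and warns when a host has not been monitored within the
// expected interval. Names and thresholds are illustrative only.
public class HostMonitoringWatchdog {

    // Hypothetical threshold; the real value is a fix implementation detail.
    private static final Duration EXPECTED_INTERVAL = Duration.ofSeconds(2);

    // hostId -> last time monitoring completed for that host
    private final Map<UUID, Instant> lastMonitored = new ConcurrentHashMap<>();
    private final Map<UUID, String> hostNames = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Called by the monitoring code after each successful host refresh. */
    public void markMonitored(UUID hostId, String hostName) {
        hostNames.put(hostId, hostName);
        lastMonitored.put(hostId, Instant.now());
    }

    /** Starts the periodic watchdog check. */
    public void start() {
        scheduler.scheduleAtFixedRate(this::check, 5, 5, TimeUnit.SECONDS);
    }

    private void check() {
        Instant now = Instant.now();
        lastMonitored.forEach((hostId, last) -> {
            long elapsedMs = Duration.between(last, now).toMillis();
            if (elapsedMs > EXPECTED_INTERVAL.toMillis()) {
                // Mirrors the shape of the WARN lines seen in engine.log
                System.out.printf(
                        "WARN Monitoring not executed for the host %s [%s] for %dms%n",
                        hostNames.get(hostId), hostId, elapsedMs);
            }
        });
    }
}
```

In the shipped fix, judging by the log excerpts below, the check runs on the engineThreadMonitoringThreadPool and reports the elapsed time in milliseconds, matching the "Monitoring not executed for the host ... for NNNNms" messages.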
If there is a thread monitoring issue, it is logged in engine.log as a warning:

2022-01-26 10:36:35,689+02 WARN [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoringWatchdog] (EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] Monitoring not executed for the host HOST_FQDN [e173b19b-aca5-4633-826a-4c918fd8249b] for 2913ms
2022-01-26 10:36:35,689+02 WARN [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoringWatchdog] (EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] Monitoring not executed for the host HOST_FQDN [1d75fb3e-0e22-4f5c-98ad-abbf5e04502d] for 2909ms

Hopefully it will help with future debugging of similar issues.

# yum list ovirt-engine
Installed Packages
ovirt-engine.noarch 4.4.10.4-0.1.el8ev

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV Manager (ovirt-engine) [ovirt-4.4.10]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0461