Description of problem:
After a power outage, the hypervisor hosting the HE is stuck forever in the 'Unassigned' state.

Version-Release number of selected component (if applicable):
ovirt-engine-4.3.8.2-0.4.el7.noarch

How reproducible:
Unknown

Actual results:
The hypervisor is stuck forever in the 'Unassigned' state.

Expected results:
The hypervisor is activated normally.

Additional info:
Restarting the HE didn't work.
Restarting the vdsmd service didn't work.
We are hitting exactly the same issue.

Description of problem:
After a network issue between ovirt-engine and the hypervisors (8 of them), one of them never recovered from the ovirt-engine point of view. The hypervisor itself is fine and working normally (the cluster is empty because it is still being built). From the ovirt-engine host I can SSH to the hypervisor, access Cockpit, and so on.

At first it was in the 'Non Responsive' state, and engine.log contained:

2020-12-09 18:05:48,069+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-38) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM XXXXXXXXXXX command Get Host Capabilities failed: Message timeout which can be caused by communication issues

I switched the host to Maintenance mode and that seemed to work. However, when I tried to activate it, the host was still unavailable, but this time in the 'Unassigned' state. The only options ovirt-engine offers are stop or reboot, via SSH or fencing.

Version-Release number of selected component (if applicable):
ovirt-engine.noarch 4.3.9.4-1.el7

How reproducible:
Unknown

Actual results:
The hypervisor is stuck forever in the 'Unassigned' state.

Expected results:
The hypervisor is activated normally; SSH from the ovirt-engine host to the hypervisor works.

Additional info:
Hypervisor: oVirt Node 4.3.10
I restarted vdsmd and mom on the hypervisor.
Closing this as CURRENTRELEASE, since we cannot reproduce it on the latest RHV 4.4. If it happens again on the latest version, please provide the relevant logs and reopen this bug.
If this is a testing environment and the customer is able to reproduce the issue, then I suggest the following (a hedged command sketch for some of these steps follows below):

1. Get the hosts into Up status
2. Enable debug logging in RHVM: https://access.redhat.com/solutions/3880281
3. Enable VDSM debug logs on each host: https://access.redhat.com/articles/2919931#setting-log-level-permanently-5
4. Try to reproduce the issue
5. Create an RHVM thread dump after the issue occurs: https://access.redhat.com/solutions/3227681
6. Gather logs using sos-logcollector from RHVM and the affected host

Is it possible, Nirav? Debug logs can be huge, but if the customer is able to reproduce the issue, it might give us some more clues, because we haven't been able to reproduce it so far. Thanks
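To make the log-related steps more concrete, here is a minimal sketch of steps 3, 5 and 6. The authoritative procedures are in the linked KB articles; the pgrep pattern and log paths below are assumptions, so verify them against your version before use:

# Step 3: raise VDSM logging to DEBUG on the host (assuming the
# logger.conf-based method from the linked article): edit
# /etc/vdsm/logger.conf, set level=DEBUG in the relevant [logger_*]
# and [handler_*] sections, then restart VDSM to pick it up:
systemctl restart vdsmd

# Step 5: JVM thread dump of the engine after reproducing the issue
# (assuming standard JVM behavior: SIGQUIT makes the JVM print a
# thread dump to its stdout, which for the engine lands in
# /var/log/ovirt-engine/console.log). The pgrep pattern is an
# assumption; make sure it resolves to the engine's java process,
# not the service wrapper:
ENGINE_JAVA_PID="$(pgrep -f 'jboss-modules.jar' | head -1)"
kill -3 "$ENGINE_JAVA_PID"

# Step 6: collect logs from RHVM and the affected host with the
# log collector tool shipped with RHVM:
ovirt-log-collector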
The customer came back and said they are not focusing on this issue right now; they are first working on resolving the storage issue that caused the original outage. At this point they do not know how reproducible it is.

Do you want them to try enabling the debug logging now, or wait until the patch [1] gets approved upstream and we offer them the patch? Hopefully by that time we would know more about their storage issues too, and it will be easier to troubleshoot only one issue at a time.

[1] https://gerrit.ovirt.org/c/ovirt-engine/+/117034/
*** Bug 1975685 has been marked as a duplicate of this bug. ***
We are not able to reproduce the original/reopened issues in our environment. Thus the verification is not of the issue itself, but of the fix proposed by the developer. If the thread-monitoring issue occurs, it will be logged in engine.log as a warning:

2022-01-26 10:36:35,689+02 WARN [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoringWatchdog] (EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] Monitoring not executed for the host HOST_FQDN [e173b19b-aca5-4633-826a-4c918fd8249b] for 2913ms
2022-01-26 10:36:35,689+02 WARN [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoringWatchdog] (EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] Monitoring not executed for the host HOST_FQDN [1d75fb3e-0e22-4f5c-98ad-abbf5e04502d] for 2909ms

Hopefully this will help with future debugging of similar issues.

# yum list ovirt-engine
Installed Packages
ovirt-engine.noarch 4.4.10.4-0.1.el8ev
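If you hit a similar hang on a fixed version, a quick way to check whether the watchdog detected stalled host monitoring is to search engine.log for the watchdog class (assuming the default engine log location):

grep 'HostMonitoringWatchdog' /var/log/ovirt-engine/engine.log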
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV Manager (ovirt-engine) [ovirt-4.4.10]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0461