Bug 1897114
| Summary: | Add additional logging information to be able to understand why host is stuck in Unassigned state | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Miguel Martin <mmartinv> |
| Component: | ovirt-engine | Assignee: | Artur Socha <asocha> |
| Status: | CLOSED ERRATA | QA Contact: | Pavol Brilla <pbrilla> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.3.8-1 | CC: | asocha, dfodor, emarcus, gdeolive, lleistne, mjankula, mkalinin, mperina, nsurati, redhat.bugzilla |
| Target Milestone: | ovirt-4.4.10 | Keywords: | Reopened, ZStream |
| Target Release: | 4.4.10 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-4.4.10.1 | Doc Type: | Enhancement |
| Doc Text: | In this release, monitoring of host refresh capabilities functionality was improved to help debug very rare production issues that sometimes caused the Red Hat Virtualization Manager to lose connectivity with the Red Hat Virtualization hosts. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-02-08 10:04:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1985906 | | |
Description (Miguel Martin, 2020-11-12 10:58:25 UTC)
We are in exactly the same situation:

Description of problem:
After a network issue between ovirt-engine and the hypervisors (8 of them), one hypervisor did not recover from the ovirt-engine point of view. The hypervisor itself is fine and working normally (the cluster is empty because it is still being built): I can SSH to the hypervisor from the ovirt-engine machine, access Cockpit, and so on.

At first the host was in the "Non Responding" state, and engine.log contained:

2020-12-09 18:05:48,069+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-38) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM XXXXXXXXXXX command Get Host Capabilities failed: Message timeout which can be caused by communication issues

I switched the host to Maintenance mode and that seemed to work. However, when I tried to activate it, the host was still unavailable, this time in the "Unassigned" state. The only option ovirt-engine proposes is "stop or reboot" via SSH or fencing.

Version-Release number of selected component (if applicable):
ovirt-engine.noarch 4.3.9.4-1.el7

How reproducible:
Unknown

Actual results:
The hypervisor is stuck forever in the 'Unassigned' state.

Expected results:
The hypervisor is activated normally; SSH from the ovirt-engine host to the hypervisor works.

Additional info:
Hypervisor: oVirt Node 4.3.10. I restarted vdsmd and mom on the hypervisor.

Closing this as CURRENTRELEASE, since we cannot reproduce it on the latest RHV 4.4. If it happens again on the latest version, please provide the relevant logs and reopen this bug.

If this is a testing environment and the customer is able to reproduce the issue, then I suggest the following:

1. Get the hosts into Up status.
2. Enable debug logging in RHVM: https://access.redhat.com/solutions/3880281
3. Enable VDSM debug logs on each host: https://access.redhat.com/articles/2919931#setting-log-level-permanently-5
4. Try to reproduce the issue.
5. Create an RHVM thread dump after the issue appears: https://access.redhat.com/solutions/3227681
6. Gather logs using sos-logcollector from RHVM and the affected host.

Is it possible, Nirav? Debug logs can be huge, but if the customer is able to reproduce the issue, it might give us some more clues, because we haven't been able to reproduce it so far. Thanks

The customer came back and said they are not concentrating on this issue right now; they are first working on resolving the storage issue that caused the original outage. At this point they do not know how reproducible it is. Do you want them to try to enable the debug logging now, or wait until the patch [1] gets approved upstream and we offer them the patch? Hopefully by that time we would know more about their storage issues too, and it will be easier to troubleshoot one issue at a time.

[1] https://gerrit.ovirt.org/c/ovirt-engine/+/117034/

@mperina

*** Bug 1975685 has been marked as a duplicate of this bug. ***

We are not able to reproduce the original/reopened issues in our environment. Thus the verification is not of the issue itself, but of the fix proposed by the developer.
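The fix adds a HostMonitoringWatchdog that periodically checks whether host monitoring is actually being executed and logs a warning when it is not (see the engine.log excerpts below). As a minimal sketch of that idea, assuming a simple timestamp-per-host scheme: this is not the actual ovirt-engine implementation, and apart from the HostMonitoringWatchdog class name taken from the log lines, all names, signatures, and thresholds here are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a host-monitoring watchdog: each successful host refresh
// records a timestamp, and a scheduled task periodically checks those
// timestamps and warns when a host has not been monitored within the
// expected interval. Names and thresholds are illustrative only.
public class HostMonitoringWatchdog {

    // Hypothetical threshold; the real value is a fix implementation detail.
    private static final Duration EXPECTED_INTERVAL = Duration.ofSeconds(2);

    // hostId -> last time monitoring completed for that host
    private final Map<UUID, Instant> lastMonitored = new ConcurrentHashMap<>();
    private final Map<UUID, String> hostNames = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Called by the monitoring code after each successful host refresh. */
    public void markMonitored(UUID hostId, String hostName) {
        hostNames.put(hostId, hostName);
        lastMonitored.put(hostId, Instant.now());
    }

    /** Starts the periodic watchdog check. */
    public void start() {
        scheduler.scheduleAtFixedRate(this::check, 5, 5, TimeUnit.SECONDS);
    }

    private void check() {
        Instant now = Instant.now();
        lastMonitored.forEach((hostId, last) -> {
            long elapsedMs = Duration.between(last, now).toMillis();
            if (elapsedMs > EXPECTED_INTERVAL.toMillis()) {
                // Mirrors the shape of the WARN lines seen in engine.log
                System.out.printf(
                        "WARN Monitoring not executed for the host %s [%s] for %dms%n",
                        hostNames.get(hostId), hostId, elapsedMs);
            }
        });
    }
}
```

In the shipped fix, judging by the log excerpts below, the check runs on the engineThreadMonitoringThreadPool and reports the elapsed time in milliseconds, matching the "Monitoring not executed for the host ... for NNNNms" messages.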
If there is a thread monitoring issue, it is logged in engine.log as a warning:

2022-01-26 10:36:35,689+02 WARN [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoringWatchdog] (EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] Monitoring not executed for the host HOST_FQDN [e173b19b-aca5-4633-826a-4c918fd8249b] for 2913ms
2022-01-26 10:36:35,689+02 WARN [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoringWatchdog] (EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] Monitoring not executed for the host HOST_FQDN [1d75fb3e-0e22-4f5c-98ad-abbf5e04502d] for 2909ms

Hopefully it will help with future debugging of similar issues.

# yum list ovirt-engine
Installed Packages
ovirt-engine.noarch 4.4.10.4-0.1.el8ev

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV Manager (ovirt-engine) [ovirt-4.4.10]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0461