Bug 1658117 - [downstream clone - 4.2.8] GetAllVmStatsVDSCommand sent host to Not-Responding status after upgrade
Summary: [downstream clone - 4.2.8] GetAllVmStatsVDSCommand sent host to Not-Respondin...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.6
Hardware: x86_64
OS: Linux
unspecified
low
Target Milestone: ovirt-4.2.8
: ---
Assignee: Ravi Nori
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On: 1645383
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-12-11 09:49 UTC by RHV bug bot
Modified: 2022-03-13 16:26 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1645383
Environment:
Last Closed: 2019-01-22 12:44:51 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44347 0 None None None 2021-12-10 18:31:51 UTC
Red Hat Product Errata RHBA-2019:0121 0 None None None 2019-01-22 12:44:57 UTC
oVirt gerrit 95311 0 master MERGED engine: GetAllVmStatsVDSCommand sent host to Not-Responding status after upgrade 2018-12-11 09:49:30 UTC
oVirt gerrit 96111 0 ovirt-engine-4.2 MERGED engine: GetAllVmStatsVDSCommand sent host to Not-Responding status after upgrade 2018-12-11 11:25:11 UTC

Description RHV bug bot 2018-12-11 09:49:13 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1645383 +++
======================================================================

Description of problem:

This is a bit weird. The user upgrades a RHEL host via the UI, everything goes fine except that shortly after the host being restored to Up (after reboot), the engine times out on a GetAllVmStatsVDSCommand. This timeout should be 180s, but 180s back the engine should not even be trying to send this command. See:

1. Upgrade starts:

2018-10-30 13:29:28,178-04 INFO  [org.ovirt.engine.core.bll.hostdeploy.HostUpgradeCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-50) [72678092-5071-42ef-8518-1103d55ce063] Host 'host.removed.com' is on maintenance mode. Proceeding with Upgrade process.

2. VDSM restart during package upgrade (why ReactorClient error if the host is in maintenance?)

2018-10-30 13:38:51,572-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host 'host.removed.com', last response arrived 12658 ms ago.
>>> 2018-10-30 13:38:49,352-0400 INFO  (MainThread) [vds] (PID: 4605) I am the actual vdsm 4.20.39.1-1.el7ev host.removed.com (3.10.0-862.14.4.el7.x86_64) (vdsmd:149)

3. Upgrade finished, rebooting host

2018-10-30 13:57:16,754-04 INFO  [org.ovirt.engine.core.bll.SshHostRebootCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-10) [301bff94] Opening SSH reboot session on host host.removed.com

2018-10-30 13:57:17,696-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-10) [72678092-5071-42ef-8518-1103d55ce063] EVENT_ID: HOST_UPGRADE_FINISHED(842), Host host.removed.com upgrade was completed successfully.

4. Waiting for the host to come up:

Not up...

2018-10-30 14:02:20,137-04 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to host.removed.com/A.B.C.D
2018-10-30 14:02:38,164-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [] Unable to RefreshCapabilities: NoRouteToHostException: No route to host

VDSM starting...

2018-10-30 14:02:59,217-04 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to host.removed.com/A.B.C.D
2018-10-30 14:02:59,218-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [] Unable to RefreshCapabilities: ConnectException: Connection refused

5. VDSM starts after reboot, host goes to up state

>>>> 2018-10-30 14:03:23,731-0400 INFO  (MainThread) [vds] (PID: 18910) I am the actual vdsm 4.20.39.1-1.el7ev host.removed.com (3.10.0-957.el7.x86_64) (vdsmd:149)

2018-10-30 14:04:39,089-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [1b0b6659] EVENT_ID: VDS_DETECTED(13), Status of host host.removed.com was set to Up.

6. Command times out, this command should not have even been triggered?

2018-10-30 14:05:22,258-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [] Command 'GetAllVmStatsVDSCommand(HostName = host.removed.com, VdsIdVDSCommandParametersBase:{hostId='16380971-9d41-41dd-8227-91ea4d17943d'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

7. Not responding due to timed out command

2018-10-30 14:05:22,265-04 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-188063) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Host host.removed.com is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.

8. 10 seconds later is back to up status

2018-10-30 14:05:32,756-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [27de3b3c] EVENT_ID: VDS_DETECTED(13), Status of host host.removed.com was set to Up.

But 14:05:22 - 180s = 14:02:22. The host was not connected yet, vdsm was not runing. This command should not have been triggered at that time?

Version-Release number of selected component (if applicable):
rhvm-4.2.6.4-0.1.el7ev.noarch

How reproducible:
Unknown

(Originally by Germano Veit Michel)

Comment 4 RHV bug bot 2018-12-11 09:49:23 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)

Comment 6 Petr Matyáš 2018-12-14 12:15:53 UTC
Verified on ovirt-engine-4.2.8.1-0.1.el7ev.noarch

Comment 8 errata-xmlrpc 2019-01-22 12:44:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0121


Note You need to log in before you can comment on or make changes to this bug.