Bug 1658117

Summary: [downstream clone - 4.2.8] GetAllVmStatsVDSCommand sent host to Not-Responding status after upgrade
Product: Red Hat Enterprise Virtualization Manager Reporter: RHV bug bot <rhv-bugzilla-bot>
Component: ovirt-engineAssignee: Ravi Nori <rnori>
Status: CLOSED ERRATA QA Contact: Petr Matyáš <pmatyas>
Severity: low Docs Contact:
Priority: unspecified    
Version: 4.2.6CC: lleistne, mperina, ratamir, Rhev-m-bugs
Target Milestone: ovirt-4.2.8Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1645383 Environment:
Last Closed: 2019-01-22 12:44:51 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1645383    
Bug Blocks:    

Description RHV bug bot 2018-12-11 09:49:13 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1645383 +++
======================================================================

Description of problem:

This is a bit weird. The user upgrades a RHEL host via the UI, everything goes fine except that shortly after the host being restored to Up (after reboot), the engine times out on a GetAllVmStatsVDSCommand. This timeout should be 180s, but 180s back the engine should not even be trying to send this command. See:

1. Upgrade starts:

2018-10-30 13:29:28,178-04 INFO  [org.ovirt.engine.core.bll.hostdeploy.HostUpgradeCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-50) [72678092-5071-42ef-8518-1103d55ce063] Host 'host.removed.com' is on maintenance mode. Proceeding with Upgrade process.

2. VDSM restart during package upgrade (why ReactorClient error if the host is in maintenance?)

2018-10-30 13:38:51,572-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host 'host.removed.com', last response arrived 12658 ms ago.
>>> 2018-10-30 13:38:49,352-0400 INFO  (MainThread) [vds] (PID: 4605) I am the actual vdsm 4.20.39.1-1.el7ev host.removed.com (3.10.0-862.14.4.el7.x86_64) (vdsmd:149)

3. Upgrade finished, rebooting host

2018-10-30 13:57:16,754-04 INFO  [org.ovirt.engine.core.bll.SshHostRebootCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-10) [301bff94] Opening SSH reboot session on host host.removed.com

2018-10-30 13:57:17,696-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-10) [72678092-5071-42ef-8518-1103d55ce063] EVENT_ID: HOST_UPGRADE_FINISHED(842), Host host.removed.com upgrade was completed successfully.

4. Waiting for the host to come up:

Not up...

2018-10-30 14:02:20,137-04 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to host.removed.com/A.B.C.D
2018-10-30 14:02:38,164-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [] Unable to RefreshCapabilities: NoRouteToHostException: No route to host

VDSM starting...

2018-10-30 14:02:59,217-04 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to host.removed.com/A.B.C.D
2018-10-30 14:02:59,218-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [] Unable to RefreshCapabilities: ConnectException: Connection refused

5. VDSM starts after reboot, host goes to up state

>>>> 2018-10-30 14:03:23,731-0400 INFO  (MainThread) [vds] (PID: 18910) I am the actual vdsm 4.20.39.1-1.el7ev host.removed.com (3.10.0-957.el7.x86_64) (vdsmd:149)

2018-10-30 14:04:39,089-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [1b0b6659] EVENT_ID: VDS_DETECTED(13), Status of host host.removed.com was set to Up.

6. Command times out, this command should not have even been triggered?

2018-10-30 14:05:22,258-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [] Command 'GetAllVmStatsVDSCommand(HostName = host.removed.com, VdsIdVDSCommandParametersBase:{hostId='16380971-9d41-41dd-8227-91ea4d17943d'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

7. Not responding due to timed out command

2018-10-30 14:05:22,265-04 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-188063) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Host host.removed.com is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.

8. 10 seconds later is back to up status

2018-10-30 14:05:32,756-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [27de3b3c] EVENT_ID: VDS_DETECTED(13), Status of host host.removed.com was set to Up.

But 14:05:22 - 180s = 14:02:22. The host was not connected yet, vdsm was not runing. This command should not have been triggered at that time?

Version-Release number of selected component (if applicable):
rhvm-4.2.6.4-0.1.el7ev.noarch

How reproducible:
Unknown

(Originally by Germano Veit Michel)

Comment 4 RHV bug bot 2018-12-11 09:49:23 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)

Comment 6 Petr Matyáš 2018-12-14 12:15:53 UTC
Verified on ovirt-engine-4.2.8.1-0.1.el7ev.noarch

Comment 8 errata-xmlrpc 2019-01-22 12:44:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0121