+++ This bug is a downstream clone. The original bug is: +++ +++ bug 1645383 +++ ====================================================================== Description of problem: This is a bit weird. The user upgrades a RHEL host via the UI, everything goes fine except that shortly after the host being restored to Up (after reboot), the engine times out on a GetAllVmStatsVDSCommand. This timeout should be 180s, but 180s back the engine should not even be trying to send this command. See: 1. Upgrade starts: 2018-10-30 13:29:28,178-04 INFO [org.ovirt.engine.core.bll.hostdeploy.HostUpgradeCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-50) [72678092-5071-42ef-8518-1103d55ce063] Host 'host.removed.com' is on maintenance mode. Proceeding with Upgrade process. 2. VDSM restart during package upgrade (why ReactorClient error if the host is in maintenance?) 2018-10-30 13:38:51,572-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host 'host.removed.com', last response arrived 12658 ms ago. >>> 2018-10-30 13:38:49,352-0400 INFO (MainThread) [vds] (PID: 4605) I am the actual vdsm 4.20.39.1-1.el7ev host.removed.com (3.10.0-862.14.4.el7.x86_64) (vdsmd:149) 3. Upgrade finished, rebooting host 2018-10-30 13:57:16,754-04 INFO [org.ovirt.engine.core.bll.SshHostRebootCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-10) [301bff94] Opening SSH reboot session on host host.removed.com 2018-10-30 13:57:17,696-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-10) [72678092-5071-42ef-8518-1103d55ce063] EVENT_ID: HOST_UPGRADE_FINISHED(842), Host host.removed.com upgrade was completed successfully. 4. Waiting for the host to come up: Not up... 2018-10-30 14:02:20,137-04 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to host.removed.com/A.B.C.D 2018-10-30 14:02:38,164-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [] Unable to RefreshCapabilities: NoRouteToHostException: No route to host VDSM starting... 2018-10-30 14:02:59,217-04 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to host.removed.com/A.B.C.D 2018-10-30 14:02:59,218-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [] Unable to RefreshCapabilities: ConnectException: Connection refused 5. VDSM starts after reboot, host goes to up state >>>> 2018-10-30 14:03:23,731-0400 INFO (MainThread) [vds] (PID: 18910) I am the actual vdsm 4.20.39.1-1.el7ev host.removed.com (3.10.0-957.el7.x86_64) (vdsmd:149) 2018-10-30 14:04:39,089-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [1b0b6659] EVENT_ID: VDS_DETECTED(13), Status of host host.removed.com was set to Up. 6. Command times out, this command should not have even been triggered? 2018-10-30 14:05:22,258-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [] Command 'GetAllVmStatsVDSCommand(HostName = host.removed.com, VdsIdVDSCommandParametersBase:{hostId='16380971-9d41-41dd-8227-91ea4d17943d'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues 7. Not responding due to timed out command 2018-10-30 14:05:22,265-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-188063) [] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Host host.removed.com is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued. 8. 10 seconds later is back to up status 2018-10-30 14:05:32,756-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-52) [27de3b3c] EVENT_ID: VDS_DETECTED(13), Status of host host.removed.com was set to Up. But 14:05:22 - 180s = 14:02:22. The host was not connected yet, vdsm was not runing. This command should not have been triggered at that time? Version-Release number of selected component (if applicable): rhvm-4.2.6.4-0.1.el7ev.noarch How reproducible: Unknown (Originally by Germano Veit Michel)
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.2.z': '?'}', ] For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.2.z': '?'}', ] For more info please contact: rhv-devops (Originally by rhv-bugzilla-bot)
Verified on ovirt-engine-4.2.8.1-0.1.el7ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0121