Bug 1922094

Summary: Upgrading host with "reboot after upgrade" option fails
Product: [oVirt] ovirt-engine
Reporter: Sandro Bonazzola <sbonazzo>
Component: BLL.Infra
Assignee: Dana <delfassy>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Docs Contact:
Priority: medium
Version: 4.4.4.7
CC: bugs, dfodor, gdeolive, mperina, pmatyas
Target Milestone: ovirt-4.4.5
Keywords: TestBlocker
Target Release: ---
Flags: pm-rhel: ovirt-4.4+, pm-rhel: blocker?, gdeolive: testing_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-18 15:14:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sandro Bonazzola 2021-01-29 08:53:54 UTC
Description of problem:
- Engine upgraded to 4.4.4.7 along with all updates to CentOS 8.3
- Engine reports cluster level 4.4 and available updates for the host, which is running CentOS 8.2 with the latest 4.4.3
- Moved host to maintenance and started upgrade process
- Upgrade fails


Version-Release number of selected component (if applicable):
ovirt 4.4.4


Additional info: (host name has been replaced with '***********************')

Within engine logs:
2021-01-29 08:06:59,143Z INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] No interaction with host '***********************' for 20000 ms.
2021-01-29 08:07:01,643Z ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host '***********************', last response arrived 22501 ms ago.

and later:
2021-01-29 08:41:46,123Z ERROR [org.ovirt.engine.core.bll.SshHostRebootCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [7145d526] SSH reboot command failed on host '***********************': SSH session timeout host 'root@***********************'
Stdout: 
Stderr: 
2021-01-29 08:41:46,185Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [7145d526] EVENT_ID: SYSTEM_FAILED_SSH_HOST_RESTART(198), A restart using SSH initiated by the engine to Host node1 has failed.
2021-01-29 08:41:46,195Z INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [7145d526] START, SetVdsStatusVDSCommand(HostName = node1, SetVdsStatusVDSCommandParameters:{hostId='25133933-f7c5-49bc-be67-49fd32bfbd27', status='InstallFailed', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 1b5daf78
2021-01-29 08:41:46,200Z INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [7145d526] FINISH, SetVdsStatusVDSCommand, return: , log id: 1b5daf78
2021-01-29 08:41:46,200Z ERROR [org.ovirt.engine.core.bll.hostdeploy.UpgradeHostInternalCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [7145d526] Engine failed to restart via ssh host 'node1' ('25133933-f7c5-49bc-be67-49fd32bfbd27') after upgrade
2021-01-29 08:41:46,217Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2725cfaf-5397-4a40-9b36-74ba6d18a085] EVENT_ID: HOST_UPGRADE_FAILED(841), Failed to upgrade Host node1 (User: admin@internal-authz).
2021-01-29 08:41:53,874Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-50) [2725cfaf-5397-4a40-9b36-74ba6d18a085] EVENT_ID: HOST_UPGRADE_FAILED(841), Failed to upgrade Host node1 (User: admin@internal-authz).
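
For anyone triaging a similar failure, the excerpt above can be filtered down to the ERROR entries that share the upgrade's correlation ID (7145d526). The following is a minimal sketch, not part of oVirt: the line layout is inferred only from the lines quoted in this report, and parse_line, errors_for and SAMPLE are hypothetical names introduced for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed layout, inferred from the excerpt in this report only:
# <timestamp>Z <LEVEL> [<logger>] (<thread>) [<correlation id>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})Z\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]*)\)\s+"
    r"\[(?P<corr>[^\]]*)\]\s+"
    r"(?P<msg>.*)$"
)


class Entry(NamedTuple):
    ts: str
    level: str
    logger: str  # short class name, without the package prefix
    corr: str    # correlation ID; empty for lines logged with "[]"
    msg: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one log line; return None for lines that do not fit the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # e.g. continuation lines such as "Stdout:"
    short_logger = m["logger"].rsplit(".", 1)[-1]
    return Entry(m["ts"], m["level"], short_logger, m["corr"], m["msg"])


def errors_for(lines: Iterable[str], corr: str) -> List[Entry]:
    """Return the ERROR entries carrying the given correlation ID."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e and e.level == "ERROR" and e.corr == corr]


# Lines copied from the excerpt above.
SAMPLE = [
    "2021-01-29 08:06:59,143Z INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] No interaction with host '***********************' for 20000 ms.",
    "2021-01-29 08:41:46,123Z ERROR [org.ovirt.engine.core.bll.SshHostRebootCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [7145d526] SSH reboot command failed on host '***********************': SSH session timeout host 'root@***********************'",
    "Stdout: ",
    "2021-01-29 08:41:46,200Z ERROR [org.ovirt.engine.core.bll.hostdeploy.UpgradeHostInternalCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [7145d526] Engine failed to restart via ssh host 'node1' ('25133933-f7c5-49bc-be67-49fd32bfbd27') after upgrade",
]

if __name__ == "__main__":
    for entry in errors_for(SAMPLE, "7145d526"):
        print(entry.ts, entry.logger, entry.msg)
```

Grouping by correlation ID keeps the SSH reboot failure and the resulting upgrade failure together while leaving out unrelated entries such as the earlier reactor timeout, which is logged with an empty ID.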

Comment 1 RHEL Program Management 2021-01-29 08:54:02 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 Martin Perina 2021-01-29 09:20:11 UTC
Reducing severity, as this is happening only on some systems and we don't have a clear reproducer, just a few ideas that could prevent this issue.

Comment 4 RHEL Program Management 2021-01-29 09:20:20 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 5 Petr Matyáš 2021-02-16 09:34:28 UTC
IMO this should be marked as duplicate of bug 1917809

Otherwise this is also FailedQA with literally the same steps as in bug 1917809#c1

Comment 6 Martin Perina 2021-02-16 13:45:57 UTC
Does it fail on any host or only on specific hosts? I haven't been able to reproduce it on any of my servers.

Comment 7 Petr Matyáš 2021-02-16 13:48:17 UTC
This fails consistently on my upgraded engine with any host I have in there. (Running just the SSH restart action is enough to reproduce it.)

Comment 10 Sandro Bonazzola 2021-03-18 15:14:32 UTC
This bugzilla is included in the oVirt 4.4.5 release, published on March 18th 2021.

Since the problem described in this bug report should be resolved in the oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.