Bug 1917809

Summary: When running reboot after reinstall of a host reboot is reported as failed
Product: [oVirt] ovirt-engine
Reporter: Petr Matyáš <pmatyas>
Component: BLL.Infra
Assignee: Dana <delfassy>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: medium
Docs Contact:
Priority: high
Version: 4.4.5
CC: bugs, lleistne, mburman, mperina
Target Milestone: ovirt-4.4.5
Flags: pm-rhel: ovirt-4.4+, pmatyas: testing_ack+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-engine-4.4.5.5
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-18 15:12:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Petr Matyáš 2021-01-19 12:30:21 UTC
Created attachment 1748710
engine log

Description of problem:
I'm using the new feature to reboot a host after reinstall. Even though the reboot is initiated, the host ends up in InstallFailed status and the SSH reboot command is reported as failed.
This issue has already appeared with regular upgrades, but only occasionally and only on some specific environments (maybe hardware-related?).

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.5-0.11.el8ev.noarch

How reproducible:
100% on specific environments

Steps to Reproduce:
1. Reboot a host after upgrade/reinstall/install.

Actual results:
The host ends up in InstallFailed status.

Expected results:
The host is rebooted successfully.

Additional info:
2021-01-19 14:12:58,926+02 ERROR [org.ovirt.engine.core.bll.SshHostRebootCommand] (EE-ManagedThreadFactory-engine-Thread-2172) [455be915-6f4b-4731-bdf5-a3098d1a38d1] SSH reboot command failed on host 'aqua-vds2': SSH session timeout host 'root@aqua-vds2'
Stdout:
Stderr:
2021-01-19 14:12:58,949+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2172) [455be915-6f4b-4731-bdf5-a3098d1a38d1] EVENT_ID: SYSTEM_FAILED_SSH_HOST_RESTART(198), A restart using SSH initiated by the engine to Host host_mixed_1 has failed.
2021-01-19 14:12:58,955+02 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2172) [455be915-6f4b-4731-bdf5-a3098d1a38d1] START, SetVdsStatusVDSCommand(HostName = host_mixed_1, SetVdsStatusVDSCommandParameters:{hostId='9922639d-dcb6-4f06-ba82-5e52ea984502', status='InstallFailed', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 407bb397
2021-01-19 14:12:58,958+02 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2172) [455be915-6f4b-4731-bdf5-a3098d1a38d1] FINISH, SetVdsStatusVDSCommand, return: , log id: 407bb397
2021-01-19 14:12:58,958+02 ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-2172) [455be915-6f4b-4731-bdf5-a3098d1a38d1] Engine failed to restart via ssh host 'host_mixed_1' ('9922639d-dcb6-4f06-ba82-5e52ea984502') after host install
2021-01-19 14:12:58,962+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2172) [455be915-6f4b-4731-bdf5-a3098d1a38d1] EVENT_ID: VDS_INSTALL_FAILED(505), Host host_mixed_1 installation failed. Please refer to /var/log/ovirt-engine/engine.log and log logs under /var/log/ovirt-engine/host-deploy/ for further details..
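
To follow one failed flow through a larger engine.log, the ERROR entries can be grouped by the bracketed correlation ID that the lines above share. The following is a minimal sketch, not part of the engine: the line layout in LINE_RE is inferred only from the excerpts in this report, and the helper names parse_line and failed_reboot_flows are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout of an engine.log line, inferred only from the excerpts in
# this report:
# <date> <time>,<ms><tz> <LEVEL> [<logger>] (<thread>) [<correlation id>] <msg>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]+)\)\s+"
    r"\[(?P<corr>[^\]]*)\]\s+"
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match.

    Continuation lines such as 'Stdout:' and 'Stderr:' do not match and
    therefore yield None.
    """
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def failed_reboot_flows(lines):
    """Group ERROR entries by correlation ID.

    Only flows in which SshHostRebootCommand itself logged an ERROR are kept.
    """
    errors = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] == "ERROR":
            errors[entry["corr"]].append(entry)
    return {
        corr: entries
        for corr, entries in errors.items()
        if any(e["logger"].endswith("SshHostRebootCommand") for e in entries)
    }
```

Passing the lines of /var/log/ovirt-engine/engine.log to failed_reboot_flows returns one list of ERROR entries per failed reboot flow. Related events may be logged under a different correlation ID, so the grouping is a starting point rather than a complete trace.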

Comment 1 Petr Matyáš 2021-02-12 13:38:07 UTC
Using ovirt-engine-4.4.5.5-0.13.el8ev.noarch

Also there seems to be a brand new WARN that looks related, although I might have just missed it last time.

2021-02-12 14:03:14,736+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [a214d28b-c63f-429a-9522-7d7d0ac4ef1e] EVENT_ID: ANSIBLE_RUNNER_EVENT_NOTIFICATION(559), Update of host dell-r210ii-14. Remove temporary yum configuration file.
2021-02-12 14:03:14,747+01 WARN  [org.ovirt.engine.core.dal.job.ExecutionMessageDirector] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [1402e89] The message key 'SshHostReboot' is missing from 'bundles/ExecutionMessages'
2021-02-12 14:03:14,777+01 INFO  [org.ovirt.engine.core.bll.SshHostRebootCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [1402e89] Running command: SshHostRebootCommand internal: true. Entities affected :  ID: fc1136af-cf95-44b0-a011-e33eb35505cc Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2021-02-12 14:03:14,784+01 INFO  [org.ovirt.engine.core.bll.SshHostRebootCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [1402e89] Opening SSH reboot session on host dell-r210ii-14.dn
2021-02-12 14:03:15,259+01 ERROR [org.ovirt.engine.core.bll.SshHostRebootCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [1402e89] SSH reboot command failed on host 'dell-r210ii-14.dn': SSH session closed during connection 'root'
Stdout:
Stderr:
2021-02-12 14:03:15,278+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [1402e89] EVENT_ID: SYSTEM_FAILED_SSH_HOST_RESTART(198), A restart using SSH initiated by the engine to Host dell-r210ii-14 has failed.
2021-02-12 14:03:15,289+01 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [1402e89] START, SetVdsStatusVDSCommand(HostName = dell-r210ii-14, SetVdsStatusVDSCommandParameters:{hostId='fc1136af-cf95-44b0-a011-e33eb35505cc', status='InstallFailed', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 434b6a83
2021-02-12 14:03:15,296+01 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [1402e89] FINISH, SetVdsStatusVDSCommand, return: , log id: 434b6a83
2021-02-12 14:03:15,296+01 ERROR [org.ovirt.engine.core.bll.hostdeploy.UpgradeHostInternalCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [1402e89] Engine failed to restart via ssh host 'dell-r210ii-14' ('fc1136af-cf95-44b0-a011-e33eb35505cc') after upgrade
2021-02-12 14:03:15,320+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [a214d28b-c63f-429a-9522-7d7d0ac4ef1e] EVENT_ID: HOST_UPGRADE_FAILED(841), Failed to upgrade Host dell-r210ii-14 (User: admin@internal-authz).

Comment 2 Petr Matyáš 2021-02-12 13:41:07 UTC
Forgot to mention the host was not rebooted at all.

Comment 3 Martin Perina 2021-02-12 15:49:17 UTC
So something strange is going on with your host. Is there anything in journalctl, audit.log, or other system logs on the host that could explain why the engine cannot connect to your host?

Comment 5 Dana 2021-03-02 14:37:57 UTC
While investigating this on Petr's environment, I found that I couldn't even reinstall the host using a password (it immediately failed on ssh-copy-id). After removing the host from the environment and installing it afresh, reinstall and upgrade completed successfully, including the reboot.

Comment 6 Petr Matyáš 2021-03-03 08:42:11 UTC
Verified on ovirt-engine-4.4.5.7-0.1.el8ev.noarch

If this bug still appears for you, just remove the host and add it back.

Comment 7 Sandro Bonazzola 2021-03-18 15:12:49 UTC
This bugzilla is included in the oVirt 4.4.5 release, published on March 18th, 2021.

Since the problem described in this bug report should be resolved in the oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.