Bug 1836303 - After host upgrade from 4.3.9 to 4.4.0 reinstall fail without giving reason
Summary: After host upgrade from 4.3.9 to 4.4.0 reinstall fail without giving reason
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: ovirt-host-deploy-ansible
Version: 4.4.0.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.4.1-1
Target Release: ---
Assignee: Dana
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On: 1855959
Blocks:
Reported: 2020-05-15 15:12 UTC by Sandro Bonazzola
Modified: 2020-08-05 06:28 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-05 06:28:17 UTC
oVirt Team: Infra
Embargoed:
pm-rhel: ovirt-4.4+


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 109418 | 0 | None | MERGED | engine: fail and add detailed audit log message when fingerprint changed | 2020-08-10 21:03:23 UTC

Description Sandro Bonazzola 2020-05-15 15:12:49 UTC
Upgraded engine from 4.3.9 to 4.4.0, host still on 4.3.9, one VM running, everything normal.

Moved one host to maintenance and turned off.

Reinstalled the host with oVirt Node 4.4.0 rc2.

Within the engine the host is not active. Moved to maintenance.
Tried to reinstall the host and it failed, telling me to look at

/var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20200515145913-node0.lab-4115d42e-604c-4ef1-a247-26fa21bafac3.log

its content is:
2020-05-15 14:59:16 UTC - TASK [Gathering Facts] *********************************************************                                                                                                        
2020-05-15 14:59:16 UTC - PLAY RECAP *********************************************************************                                                                                                        
node0.lab                  : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0 

which doesn't really give any valuable information.
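
As an illustration of how little the recap carries, here is a minimal Python sketch that parses a recap line like the one above and turns it into a hint. The function names `parse_recap_line` and `diagnose` are hypothetical and are not part of ovirt-engine or Ansible.

```python
import re

# Matches one host line of an Ansible PLAY RECAP, e.g.
# "node0.lab : ok=0 changed=0 unreachable=1 failed=0 ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap_line(line):
    """Return (host, {counter: value}) for a recap line, or None if it does not match."""
    match = RECAP_LINE.match(line.strip())
    if match is None:
        return None
    counters = {
        key: int(value)
        for key, value in re.findall(r"(\w+)=(\d+)", match.group("counters"))
    }
    return match.group("host"), counters


def diagnose(line):
    """Turn a recap line into a short human-readable hint (hypothetical helper)."""
    parsed = parse_recap_line(line)
    if parsed is None:
        return "not a recap line"
    host, counters = parsed
    if counters.get("unreachable", 0) > 0:
        return f"{host}: unreachable; check the engine log for the underlying error"
    if counters.get("failed", 0) > 0:
        return f"{host}: one or more tasks failed"
    return f"{host}: no failures recorded"
```

The recap only says that the host was unreachable; the reason has to come from somewhere else.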

The real issue is found in the engine log:

2020-05-15 15:02:23,121Z ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-780) [42aa371c-5223-416e-9266-47b105cd528e] ssh-copy-id command failed on host 'node0.lab': Invalid fingerprint SHA256:eWJXTDz5aWN2GZD/Y0RU6yrZSQyhvs4CwvW0Bm8uU0w, expected SHA256:Z8uzhgEqtTEZvYw/q9a/bs2mCNQDksvVh439VtgaPog

The host fingerprint changed because the host was fully reinstalled with a different operating system.
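
A rough Python sketch of pulling the two fingerprints out of that engine log message, assuming the message is on a single line and keeps the wording shown above. The helper name `find_fingerprint_mismatch` is hypothetical, and reading the first fingerprint as the one the host presented is my interpretation of the message.

```python
import re

# Pattern for the engine log message quoted in this report.
# Assumption: the first fingerprint is what the host presented and the second
# is what the engine had stored for it.
FINGERPRINT_ERROR = re.compile(
    r"failed on host '(?P<host>[^']+)': "
    r"Invalid fingerprint (?P<presented>SHA256:[A-Za-z0-9+/]+), "
    r"expected (?P<expected>SHA256:[A-Za-z0-9+/]+)"
)


def find_fingerprint_mismatch(log_line):
    """Return a dict with host, presented and expected fingerprints, or None."""
    match = FINGERPRINT_ERROR.search(log_line)
    return match.groupdict() if match else None
```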

Just fetching the new fingerprint from the host solves the issue and the system can then start to be re-deployed.
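
For reference, the SHA256:... strings in the error are OpenSSH-style fingerprints: the unpadded base64 of the SHA-256 digest of the raw public key blob. A minimal Python sketch of that computation follows; the function name `sha256_fingerprint` is hypothetical, the input is assumed to be in the usual "<key-type> <base64-blob> [comment]" public key format, and this is not how the engine itself fetches or stores the fingerprint.

```python
import base64
import hashlib


def sha256_fingerprint(public_key_line):
    """Compute an OpenSSH-style SHA256 fingerprint from a public key line.

    Expects '<key-type> <base64-blob> [comment]', as found in a .pub file.
    """
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<key-type> <base64-blob> [comment]'")
    blob = base64.b64decode(fields[1])      # raw key blob
    digest = hashlib.sha256(blob).digest()  # 32-byte digest
    # OpenSSH prints the digest as base64 with the '=' padding stripped.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

Comparing the value computed for the reinstalled host's key with the "expected" value in the log shows whether the engine is still holding the old key.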

We need to improve the error handling for the case where the host fingerprint has changed.

Comment 1 Martin Perina 2020-05-15 17:44:24 UTC
(In reply to Sandro Bonazzola from comment #0)
> Upgraded engine from 4.3.9 to 4.4.0, host still on 4.3.9, one VM running,
> everything normal.
> 
> Moved one host to maintenance and turned off.
> 
> Reinstalled the host with oVirt Node 4.4.0 rc2.
> 
> Within the engine the host is not active. Moved to maintenance.
> Tried to reinstall host and it failed telling to look at 

This is not a supported flow; the only supported flow for upgrading a host from 4.3 to 4.4 is:

1. Move host to Maintenance
2. Remove host from engine
3. Reinstall OS on the host
4. Add host to engine

> 
> /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20200515145913-
> node0.lab-4115d42e-604c-4ef1-a247-26fa21bafac3.log
> 
> its content is:
> 2020-05-15 14:59:16 UTC - TASK [Gathering Facts]
> *********************************************************                   
> 
> 2020-05-15 14:59:16 UTC - PLAY RECAP
> *********************************************************************       
> 
> node0.lab                  : ok=0    changed=0    unreachable=1    failed=0 
> skipped=0    rescued=0    ignored=0 
> 
> which doesn't really give any valuable information.
> 
> The real issue is found in engine log:
> 
> 2020-05-15 15:02:23,121Z ERROR
> [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand]
> (EE-ManagedThreadFactory-engine-Thread-780)
> [42aa371c-5223-416e-9266-47b105cd528e] ssh-copy-id command failed on
> host 'node0.lab': Invalid fingerprint
> SHA256:eWJXTDz5aWN2GZD/Y0RU6yrZSQyhvs4CwvW0Bm8uU0w, expected
> SHA256:Z8uzhgEqtTEZvYw/q9a/bs2mCNQDksvVh439VtgaPog
> 
> host fingerprint changed due to host full reinstall with different operating
> system.
> 
> Just fetching the new fingerprint from the host solves the issue and the
> system can then start to be re-deployed.
> 
> We need to improve the error handling on fingerprint changed.

I agree, we need to show the above error as an error event in audit_log.

Comment 2 Michal Skrivanek 2020-05-16 03:27:11 UTC
(In reply to Martin Perina from comment #1)
> This is not supported flow, the only supported flow of upgrading host from
> 4.3 to 4.4 is:
> 
> 1. Move host to Maintenance
> 2. Remove host from engine
> 3. Reinstall OS on the host
> 4. Add host to engine

Any particular reason for #2 besides the fingerprint? What would it take to fix this limitation? Host removal is... annoying.

Comment 3 Martin Perina 2020-05-18 09:46:36 UTC
(In reply to Michal Skrivanek from comment #2)
> (In reply to Martin Perina from comment #1)
> > This is not supported flow, the only supported flow of upgrading host from
> > 4.3 to 4.4 is:
> > 
> > 1. Move host to Maintenance
> > 2. Remove host from engine
> > 3. Reinstall OS on the host
> > 4. Add host to engine
> 
> Any particular reason for #2 besides fingerprint? What would it take to fix
> this limitation? Host removal is..annoying

Host removal is just one step. As for the consequences of skipping it, we would need to retest, but here are a few examples:

1. If the host is a hosted engine host and we didn't reinstall it with the hosted engine option set to deploy, we would have an inconsistency between the engine DB and the host.
2. If the host is part of an OVS cluster and we just reinstalled it, we would lose the OVS setup on the host itself, but the host would still be referenced in the OVS database on the engine.

And there might be other issues.

We have never supported changing the OS of a host while it is in maintenance (for example, switching from RHV-H to RHEL-H or vice versa), so I really don't see any reason why we should support the even more problematic change from EL7 to EL8.

Comment 4 Martin Perina 2020-06-16 12:05:12 UTC
Moving to MODIFIED, as the fix that displays the invalid fingerprint error is in place.

Comment 5 Lukas Svaty 2020-07-14 10:27:24 UTC
Moving milestone due to dependent bug.

Comment 6 Petr Matyáš 2020-07-28 12:29:39 UTC
Verified on ovirt-engine-4.4.1.10-0.1.el8ev.noarch

Comment 7 Sandro Bonazzola 2020-08-05 06:28:17 UTC
This bugzilla is included in oVirt 4.4.1.1 Async release, published on July 13th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.1.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

