Bug 1836303

Summary: After host upgrade from 4.3.9 to 4.4.0, reinstall fails without giving a reason
Product: [oVirt] ovirt-engine
Reporter: Sandro Bonazzola <sbonazzo>
Component: ovirt-host-deploy-ansible
Assignee: Dana <delfassy>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: medium
Priority: medium
Version: 4.4.0.3
CC: bugs, lsvaty, michal.skrivanek, mperina
Target Milestone: ovirt-4.4.1-1
Flags: pm-rhel: ovirt-4.4+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-08-05 06:28:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1855959
Bug Blocks:

Description Sandro Bonazzola 2020-05-15 15:12:49 UTC
Upgraded engine from 4.3.9 to 4.4.0, host still on 4.3.9, one VM running, everything normal.

Moved one host to maintenance and turned off.

Reinstalled the host with oVirt Node 4.4.0 rc2.

Within the engine the host is not active. Moved it to maintenance.
Tried to reinstall the host and it failed, telling me to look at

/var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20200515145913-node0.lab-4115d42e-604c-4ef1-a247-26fa21bafac3.log

its content is:
2020-05-15 14:59:16 UTC - TASK [Gathering Facts] *********************************************************
2020-05-15 14:59:16 UTC - PLAY RECAP *********************************************************************
node0.lab                  : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0

which doesn't really give any valuable information.

The real issue is found in engine log:

2020-05-15 15:02:23,121Z ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-780) [42aa371c-5223-416e-9266-47b105cd528e] ssh-copy-id command failed on host 'node0.lab': Invalid fingerprint SHA256:eWJXTDz5aWN2GZD/Y0RU6yrZSQyhvs4CwvW0Bm8uU0w, expected SHA256:Z8uzhgEqtTEZvYw/q9a/bs2mCNQDksvVh439VtgaPog

The host fingerprint changed due to the full reinstall of the host with a different operating system.

Just fetching the new fingerprint from the host solves the issue, and the host can then be redeployed.
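
For reference, the failing check can be reproduced outside the engine. The sketch below fetches the host's current SSH host key fingerprint and compares it against a stored one; this is only an illustration, not the engine's actual code. It assumes paramiko is installed and reuses the hostname and expected value from this report:

# Sketch only: compare a host's live SSH fingerprint with the stored one.
import base64
import hashlib

import paramiko

def fetch_fingerprint(host, port=22):
    """Return the host's SSH key fingerprint in OpenSSH 'SHA256:...' form."""
    transport = paramiko.Transport((host, port))
    try:
        transport.start_client(timeout=10)       # SSH handshake only, no auth
        key = transport.get_remote_server_key()  # the host's public key
    finally:
        transport.close()
    digest = hashlib.sha256(key.asbytes()).digest()
    # OpenSSH prints the base64 digest without '=' padding
    return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")

expected = "SHA256:Z8uzhgEqtTEZvYw/q9a/bs2mCNQDksvVh439VtgaPog"
actual = fetch_fingerprint("node0.lab")
if actual != expected:
    print("Invalid fingerprint %s, expected %s" % (actual, expected))

After a full OS reinstall the two values differ, which is exactly the error the engine logs above.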

We need to improve the error handling when the fingerprint has changed.

Comment 1 Martin Perina 2020-05-15 17:44:24 UTC
(In reply to Sandro Bonazzola from comment #0)
> Upgraded engine from 4.3.9 to 4.4.0, host still on 4.3.9, one VM running,
> everything normal.
> 
> Moved one host to maintenance and turned off.
> 
> Reinstalled the host with oVirt Node 4.4.0 rc2.
> 
> Within the engine the host is not active. Moved it to maintenance.
> Tried to reinstall the host and it failed, telling me to look at

This is not supported flow, the only supported flow of upgrading host from 4.3 to 4.4 is:

1. Move host to Maintenance
2. Remove host from engine
3. Reinstall OS on the host
4. Add host to engine
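
For illustration, this flow can be scripted against the REST API with the oVirt Python SDK (ovirtsdk4). This is only a sketch: the URL, credentials and cluster name are placeholders, and step 3 of course happens outside the SDK:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='...',   # placeholder
    ca_file='ca.pem',
)
hosts_service = connection.system_service().hosts_service()

# 1. Move host to Maintenance
host = hosts_service.list(search='name=node0.lab')[0]
host_service = hosts_service.host_service(host.id)
host_service.deactivate()

# 2. Remove host from engine
host_service.remove()

# 3. Reinstall OS on the host (out of band, e.g. PXE/ISO)

# 4. Add host to engine; the engine fetches the new SSH fingerprint here
hosts_service.add(
    types.Host(
        name='node0.lab',
        address='node0.lab',
        root_password='...',  # placeholder
        cluster=types.Cluster(name='Default'),
    ),
)
connection.close()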

> 
> /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20200515145913-
> node0.lab-4115d42e-604c-4ef1-a247-26fa21bafac3.log
> 
> its content is:
> 2020-05-15 14:59:16 UTC - TASK [Gathering Facts]
> *********************************************************                   
> 
> 2020-05-15 14:59:16 UTC - PLAY RECAP
> *********************************************************************       
> 
> node0.lab                  : ok=0    changed=0    unreachable=1    failed=0 
> skipped=0    rescued=0    ignored=0 
> 
> which doesn't really give any valuable information.
> 
> The real issue is found in engine log:
> 
> 2020-05-15 15:02:23,121Z ERROR
> [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand]
> (EE-ManagedThreadFactory-engine-Thread-780)
> [42aa371c-5223-416e-9266-47b105cd528e] ssh-copy-id command failed on
> host 'node0.lab': Invalid fingerprint
> SHA256:eWJXTDz5aWN2GZD/Y0RU6yrZSQyhvs4CwvW0Bm8uU0w, expected
> SHA256:Z8uzhgEqtTEZvYw/q9a/bs2mCNQDksvVh439VtgaPog
> 
> The host fingerprint changed due to the full reinstall of the host with a
> different operating system.
> 
> Just fetching the new fingerprint from the host solves the issue, and the
> host can then be redeployed.
> 
> We need to improve the error handling when the fingerprint has changed.

I agree, we need to show the above error as an error event in audit_log.

Comment 2 Michal Skrivanek 2020-05-16 03:27:11 UTC
(In reply to Martin Perina from comment #1)
> This is not supported flow, the only supported flow of upgrading host from
> 4.3 to 4.4 is:
> 
> 1. Move host to Maintenance
> 2. Remove host from engine
> 3. Reinstall OS on the host
> 4. Add host to engine

Any particular reason for #2 besides the fingerprint? What would it take to fix this limitation? Host removal is... annoying.

Comment 3 Martin Perina 2020-05-18 09:46:36 UTC
(In reply to Michal Skrivanek from comment #2)
> (In reply to Martin Perina from comment #1)
> > This is not supported flow, the only supported flow of upgrading host from
> > 4.3 to 4.4 is:
> > 
> > 1. Move host to Maintenance
> > 2. Remove host from engine
> > 3. Reinstall OS on the host
> > 4. Add host to engine
> 
> Any particular reason for #2 besides the fingerprint? What would it take to fix
> this limitation? Host removal is... annoying.

Removing the host is just one step. As for the consequences of skipping it, we would need to retest, but here are a few examples:

1. If the host is a hosted engine host and we didn't reinstall it with the hosted engine option set to deploy, we would have an inconsistency between the engine DB and the host
2. If the host is part of an OVS cluster and we just reinstalled it, we would lose the OVS setup on the host itself, but the host would still be mentioned in the OVS database on the engine

And there might be other issues.

We have never supported changing the OS of a host in maintenance (for example, switching from RHV-H to RHEL-H and vice versa), so I really don't see any reason why we should support the even more problematic change from EL7 to EL8.

Comment 4 Martin Perina 2020-06-16 12:05:12 UTC
Moving to MODIFIED, as with the fix the error about the invalid fingerprint is now displayed.

Comment 5 Lukas Svaty 2020-07-14 10:27:24 UTC
Moving milestone due to a dependent bug.

Comment 6 Petr Matyáš 2020-07-28 12:29:39 UTC
Verified on ovirt-engine-4.4.1.10-0.1.el8ev.noarch

Comment 7 Sandro Bonazzola 2020-08-05 06:28:17 UTC
This bugzilla is included in the oVirt 4.4.1.1 Async release, published on July 13th 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.1.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.