Bug 1949290

Summary: cold migration and resize failing in nova-compute: ssh: Host key verification failed
Product: Red Hat OpenStack
Reporter: Pavel Sedlák <psedlak>
Component: tripleo-ansible
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED ERRATA
QA Contact: Joe H. Rahme <jhakimra>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 16.1 (Train)
CC: aschultz, bdobreli, dasmith, dpeacock, drosenfe, eglynn, jhakimra, jpretori, kchamart, mschuppe, sbauza, sgordon, spower, vromanso
Target Milestone: z6
Keywords: AutomationBlocker, Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tripleo-ansible-0.5.1-1.20210323173504.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2058027 (view as bug list)
Environment:
Last Closed: 2021-05-26 11:43:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1911891, 1974985, 2058027

Description Pavel Sedlák 2021-04-13 21:22:43 UTC
Tempest tests for cold migration and resize fail:
> tempest.lib.exceptions.TimeoutException: Request timed out
> Details: (ServerDiskConfigTestJSON:test_resize_server_from_auto_to_manual)
> Server b592c193-88cd-4958-bf15-44b90b6531ed failed to reach VERIFY_RESIZE status and task state "None" within the required time (300 s).
> Current status: ACTIVE. Current task state: None.

Reproducible in OSP CI phase 2, though possibly not on every run or in every setup.

Nova compute log shows:
> 2021-04-13 18:12:34.754 8 DEBUG oslo_concurrency.processutils [req-f7a2d6f8-51a7-4f4c-9f47-9640018c0b52 314612978bc24e9eb344156a3fc7f9b8 b0fe0b6a054f468e81b851d79e358729 - default default] 'ssh -o BatchMode=yes 172.17.1.115 mkdir -p /var/lib/nova/instances/b592c193-88cd-4958-bf15-44b90b6531ed' failed. Not Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:457
> 2021-04-13 18:12:34.792 8 INFO nova.compute.manager [req-f7a2d6f8-51a7-4f4c-9f47-9640018c0b52 314612978bc24e9eb344156a3fc7f9b8 b0fe0b6a054f468e81b851d79e358729 - default default] [instance: b592c193-88cd-4958-bf15-44b90b6531ed] Setting instance back to active after: Instance rollback performed due to: Resize error: not able to execute ssh command: Unexpected error while running command.
> Command: ssh -o BatchMode=yes 172.17.1.115 mkdir -p /var/lib/nova/instances/b592c193-88cd-4958-bf15-44b90b6531ed
> Exit code: 255
> Stdout: ''
> Stderr: 'Host key verification failed.\r\n'
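
The command in the log runs with BatchMode=yes, so ssh cannot prompt to accept an unknown host key and exits with an error when the destination has no usable known_hosts entry. A minimal Python sketch of a check for such an entry follows. The function name known_hosts_has and the sample data are hypothetical and only illustrate the idea; the sketch handles plain-text entries only (no hashed hostnames, wildcard patterns, or [host]:port forms).

```python
def known_hosts_has(known_hosts_text, host, key_type="ssh-rsa"):
    """Return True if a plain-text known_hosts entry lists `host` with `key_type`.

    Hypothetical helper for illustration; not part of nova or tripleo-ansible.
    """
    for line in known_hosts_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and marker lines (@cert-authority, @revoked).
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # malformed entry: need hostnames, key type, key
        hostnames, entry_type = fields[0], fields[1]
        if host in hostnames.split(",") and entry_type == key_type:
            return True
    return False


# Made-up sample content; the keys are truncated placeholders.
SAMPLE = """\
# sample entries
172.17.1.50,compute-0 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAB
compute-1 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA
"""

if __name__ == "__main__":
    # The address from the failing command has no entry in the sample.
    print(known_hosts_has(SAMPLE, "172.17.1.115"))
```

Running the same check against the real known_hosts file used by the nova-compute container on the source compute node would show whether the destination address is missing an RSA entry.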

This may be the same issue as https://bugs.launchpad.net/tripleo/+bug/1923403.


Versions from undercloud-0:
> ansible.noarch                                2.9.19-1.el8ae                                  @rhosp-ansible-2.9       
> openstack-tempest.noarch                      1:24.0.0-1.20201113224606.c73e6b1.el8ost        @rhelosp-16.1            
> openstack-tripleo-common.noarch               11.4.1-1.20210407183434.75bd92a.el8ost          @rhelosp-16.1            
> openstack-tripleo-common-containers.noarch    11.4.1-1.20210407183434.75bd92a.el8ost          @rhelosp-16.1            
> openstack-tripleo-heat-templates.noarch       11.3.2-1.20210408163446.29a02c1.el8ost          @rhelosp-16.1            
> openstack-tripleo-image-elements.noarch       10.6.2-1.20201113215051.7dc0fa1.el8ost          @rhelosp-16.1            
> openstack-tripleo-puppet-elements.noarch      11.2.2-1.20201114042506.f061f90.el8ost          @rhelosp-16.1            
> openstack-tripleo-validations.noarch          11.3.2-1.20210408103437.4db92ba.el8ost          @rhelosp-16.1            
> tripleo-ansible.noarch                        0.5.1-1.20210323173503.902c3c8.el8ost           @rhelosp-16.1            

Versions from compute-1:
> ansible.noarch                                2.9.19-1.el8ae                                  @rhos-16.1-rhel-8-ansible      
> puppet-nova.noarch                            15.6.1-1.20201114010908.51a6857.el8ost          @rhos-16.1                     
> ### podman images compute:
> undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute                16.1_20210409.1   452734cc0544   8 hours ago    1.94 GB

Versions from compute-1 container nova-compute:
> 2021-04-13T01:37:06Z SUBDEBUG Installed: openstack-nova-common-1:20.4.1-1.20210406183726.1ee93b9.el8ost.noarch
> 2021-04-13T11:08:36Z SUBDEBUG Installed: openstack-nova-compute-1:20.4.1-1.20210406183726.1ee93b9.el8ost.noarch
> 2021-04-09T13:49:12Z SUBDEBUG Installed: puppet-tripleo-11.5.0-1.20210406223722.f716ef5.el8ost.noarch
> 2021-04-09T13:49:12Z SUBDEBUG Installed: openstack-tripleo-common-container-base-11.4.1-1.20210407183434.75bd92a.el8ost.noarch

Comment 2 Martin Schuppert 2021-04-14 06:19:53 UTC
This was introduced by https://bugzilla.redhat.com/show_bug.cgi?id=1911891: with ANSIBLE_INJECT_FACT_VARS=False set, tripleo_ssh_known_hosts no longer gets the ansible_ssh_host_key_rsa_public information.
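
For context: when fact injection is disabled, Ansible stops creating the top-level ansible_* fact variables, and the same data is only reachable through the ansible_facts dictionary (there, under the key ssh_host_key_rsa_public). The Python sketch below models that lookup with plain dictionaries. It is an illustration under that assumption, not the code from the referenced review; rsa_host_key, known_hosts_line, and the sample values are hypothetical.

```python
def rsa_host_key(hostvars):
    """Return a host's RSA public key from either fact location, or None."""
    # Namespaced location: populated whether or not facts are injected.
    key = hostvars.get("ansible_facts", {}).get("ssh_host_key_rsa_public")
    if key is None:
        # Legacy top-level variable: only present when facts are injected.
        key = hostvars.get("ansible_ssh_host_key_rsa_public")
    return key


def known_hosts_line(names, hostvars):
    """Build one known_hosts line for `names`, or None if no key is known."""
    key = rsa_host_key(hostvars)
    if key is None:
        return None
    return "{} ssh-rsa {}".format(",".join(names), key)


# Made-up host variables as they would look with injection disabled:
# the key exists only inside ansible_facts.
NOT_INJECTED = {"ansible_facts": {"ssh_host_key_rsa_public": "AAAAB3Nza"}}

if __name__ == "__main__":
    # A lookup of the top-level name alone finds nothing here.
    print(NOT_INJECTED.get("ansible_ssh_host_key_rsa_public"))
    # Reading from ansible_facts still yields a usable entry.
    print(known_hosts_line(["172.17.1.115", "compute-1"], NOT_INJECTED))
```

A lookup that reads only the top-level variable produces no RSA entry for the host, which is consistent with the "Host key verification failed" error in the description.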

Comment 3 David Peacock 2021-04-14 14:48:59 UTC
Waiting for https://review.opendev.org/c/openstack/tripleo-ansible/+/786159 to hit master; should be the fix once it's merged and backported.

Comment 7 David Rosenfeld 2021-04-16 13:14:17 UTC
The Phase 2 jobs referenced in comment 1 that found this BZ are passing now.

Comment 18 errata-xmlrpc 2021-05-26 11:43:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenStack Platform 16.1.6 (tripleo-ansible) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2119