Bug 1845957
| Summary: | [16.1] Migration of instance fails due to ssh key misconfiguration | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Maxim Babushkin <mbabushk> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Alex Schultz <aschultz> |
| Status: | CLOSED ERRATA | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 16.1 (Train) | CC: | aschultz, bdobreli, dasmith, eglynn, emacchi, hakhande, jhajyahy, jhakimra, jslagle, kchamart, kecarter, mburns, oblaut, owalsh, pbabbar, rhayakaw, sbauza, sclewis, sgordon, slinaber, supadhya, vromanso |
| Target Milestone: | z2 | Keywords: | AutomationBlocker, Triaged |
| Target Release: | 16.1 (Train on RHEL 8.2) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-heat-templates-11.3.2-0.20200708133447.c21cc82.el8ost | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-28 15:37:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Maxim Babushkin
2020-06-10 13:39:50 UTC
The sosreports link: http://rhos-release.virt.bos.redhat.com/log/bz1845957/

FYI, with compose RHOS-16.1-RHEL-8-20200611.n.0 we are still facing the live migration issue. Requesting this as a blocker.

(In reply to Sanjay Upadhyay from comment #2)
> FYI, with compose RHOS-16.1-RHEL-8-20200611.n.0 we are still facing the
> live migration issue. Requesting this as a blocker.

Could you please attach /etc/ssh/ssh_known_hosts from the compute node?

(In reply to Ollie Walsh from comment #3)
> Could you please attach /etc/ssh/ssh_known_hosts from the compute node?

...that is, ssh_known_hosts from the live migration source compute. The host key from the destination compute host would also be helpful: /etc/ssh/ssh_host_*.pub.

Created attachment 1697630 [details]
computehciovsdpdk-0 /etc/ssh/ssh_known_hosts
Created attachment 1697631 [details]
computehciovsdpdk-0 /etc/ssh/ssh_host_*
Created attachment 1697632 [details]
computehciovsdpdk-1 /etc/ssh/ssh_known_hosts
Created attachment 1697634 [details]
computehciovsdpdk-1 /etc/ssh/ssh_host_rsa_key
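For reference, the requested files sit at standard OpenSSH locations; a minimal sketch of collecting them on each compute node:

```bash
# Standard OpenSSH paths; nothing here is specific to this environment.
cat /etc/ssh/ssh_known_hosts   # known-hosts entries used for migration ssh
cat /etc/ssh/ssh_host_*.pub    # the node's own public host keys
```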
Provided requested files from both compute nodes.

(In reply to Maxim Babushkin from comment #9)
> Provided requested files from both compute nodes.

I asked for the public key, e.g. /etc/ssh/ssh_host_rsa_key.pub, but I think the private key can be used to generate this...

Public rsa key for computehciovsdpdk-1 (generated from the attached private key):

AAAAB3NzaC1yc2EAAAADAQABAAABgQCq2Xys18mxUBr4JHDBT2HQlfUB4KqJcysaw/79MMpCGIkaSeBwX+Q9uvo71YVfg5Z3boC/Ch7JMRF3ffAgvthQCIh2zYVVi8R2klyTBjHSFTUkufbirKfd9J01fc7PNfwkWO5mTQM9T0XTUm7X2HwcndyK8MW+ADLMUFFehIuRvLJcOXo5YQl/lISkm5sslKp1KkmVobU2A53zIHduweZEnzzxHd+rJveICI+kAhQ8X7CXBOM3HPgJSVXiiukixf+4dZzMq9pQhnc8Aj22fAlXq+sF+SocyB8pS3yRcbNO0fJclSRQSByL3myfwHQbGrrNIJ/dr3eGASiUqQHXolIL8mRHTPuTKX2CmA0VROV8rfxJQwsPDBDe6WCfFEeV/dSABY4/VcSmjDhRV2V4aQhVobO35iZs/3389OjlMOJQk5prGVF5dmn1x5KT2XlWiZrLOENg/cklKTTCmcnP81IUZfZv3z11qdkjCCoeudpK7Af2eivKhSGM83nURPWzugc=

Which matches the entry on /etc/ssh/ssh_known_hosts on computehciovsdpdk-0:

[192.0.90.19]*,[computehciovsdpdk-1.localdomain]*,[computehciovsdpdk-1]*,[10.10.130.167]*,[computehciovsdpdk-1.internalapi]*,[computehciovsdpdk-1.internalapi.localdomain]*,[10.10.131.142]*,[computehciovsdpdk-1.tenant]*,[computehciovsdpdk-1.tenant.localdomain]*,[10.10.132.122]*,[computehciovsdpdk-1.storage]*,[computehciovsdpdk-1.storage.localdomain]*,[10.10.133.146]*,[computehciovsdpdk-1.storagemgmt]*,[computehciovsdpdk-1.storagemgmt.localdomain]*, ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCq2Xys18mxUBr4JHDBT2HQlfUB4KqJcysaw/79MMpCGIkaSeBwX+Q9uvo71YVfg5Z3boC/Ch7JMRF3ffAgvthQCIh2zYVVi8R2klyTBjHSFTUkufbirKfd9J01fc7PNfwkWO5mTQM9T0XTUm7X2HwcndyK8MW+ADLMUFFehIuRvLJcOXo5YQl/lISkm5sslKp1KkmVobU2A53zIHduweZEnzzxHd+rJveICI+kAhQ8X7CXBOM3HPgJSVXiiukixf+4dZzMq9pQhnc8Aj22fAlXq+sF+SocyB8pS3yRcbNO0fJclSRQSByL3myfwHQbGrrNIJ/dr3eGASiUqQHXolIL8mRHTPuTKX2CmA0VROV8rfxJQwsPDBDe6WCfFEeV/dSABY4/VcSmjDhRV2V4aQhVobO35iZs/3389OjlMOJQk5prGVF5dmn1x5KT2XlWiZrLOENg/cklKTTCmcnP81IUZfZv3z11qdkjCCoeudpK7Af2eivKhSGM83nURPWzugc=
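A minimal sketch of that check with standard OpenSSH tooling (hostnames and paths as in the attachments above):

```bash
# On computehciovsdpdk-1: regenerate the public key from the private host key.
ssh-keygen -y -f /etc/ssh/ssh_host_rsa_key

# On computehciovsdpdk-0: show the known-hosts entry recorded for that peer
# and compare the key material with the output above.
grep computehciovsdpdk-1 /etc/ssh/ssh_known_hosts
```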
Are you enabling the infrared option to set up ssh keys?

No. I'm not using any explicit ssh key setup option of infrared. In my opinion, this should happen automatically and be configured by tripleo.

(In reply to Maxim Babushkin from comment #14)
> No.
> I'm not using any explicit ssh key setup option of infrared.
> In my opinion, this should happen automatically and be configured by tripleo.

Indeed, that's why I'm asking. The only thing I can think of is that *something else* is changing the ssh host keys after nova_migration_target has started. Is it possible to get on this env for a closer look?

I will install my setup tomorrow and keep it for you to debug.

Looks like a regression in the latest 16.1 composes (RHOS-16.1-RHEL-8-20200611.n.0, RHOS-16.1-RHEL-8-20200610.n.0). This is an RC blocker for us; changing component to nova for their analysis.

It's either tripleo-ansible/t-h-t or an infra issue.

(In reply to Ollie Walsh from comment #12)
> Which matches the entry on /etc/ssh/ssh_known_hosts on computehciovsdpdk-0:
> [192.0.90.19]*,[computehciovsdpdk-1.localdomain]*,[computehciovsdpdk-1]*,[10.10.130.167]*,[computehciovsdpdk-1.internalapi]*,[computehciovsdpdk-1.internalapi.localdomain]*,[10.10.131.142]*,[computehciovsdpdk-1.tenant]*,[computehciovsdpdk-1.tenant.localdomain]*,[10.10.132.122]*,[computehciovsdpdk-1.storage]*,[computehciovsdpdk-1.storage.localdomain]*,[10.10.133.146]*,[computehciovsdpdk-1.storagemgmt]*,[computehciovsdpdk-1.storagemgmt.localdomain]*, ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCq2Xys18mxUBr4JHDBT2HQlfUB4KqJcysaw/79MMpCGIkaSeBwX+Q9uvo71YVfg5Z3boC/Ch7JMRF3ffAgvthQCIh2zYVVi8R2klyTBjHSFTUkufbirKfd9J01fc7PNfwkWO5mTQM9T0XTUm7X2HwcndyK8MW+ADLMUFFehIuRvLJcOXo5YQl/lISkm5sslKp1KkmVobU2A53zIHduweZEnzzxHd+rJveICI+kAhQ8X7CXBOM3HPgJSVXiiukixf+4dZzMq9pQhnc8Aj22fAlXq+sF+SocyB8pS3yRcbNO0fJclSRQSByL3myfwHQbGrrNIJ/dr3eGASiUqQHXolIL8mRHTPuTKX2CmA0VROV8rfxJQwsPDBDe6WCfFEeV/dSABY4/VcSmjDhRV2V4aQhVobO35iZs/3389OjlMOJQk5prGVF5dmn1x5KT2XlWiZrLOENg/cklKTTCmcnP81IUZfZv3z11qdkjCCoeudpK7Af2eivKhSGM83nURPWzugc=

There is an issue with this entry: 192.0.90.19 is the undercloud ctrl_plane IP, which suggests this is https://bugs.launchpad.net/tripleo/+bug/1861296.

I don't believe this is the same issue as https://bugs.launchpad.net/tripleo/+bug/1861296. That one was caused by bad jinja2 syntax that resulted in missing hosts/IPs in the ssh known-hosts entry, and the patch for it was merged to upstream ussuri and not backported.
Reproduced this on stable/train (see the CLI sketch below the steps):
1. Deploy an overcloud.
2. Delete the overcloud.
3. Deploy an overcloud with the same stack name and same host names.
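Those steps as a rough undercloud CLI sketch; the stack name and environment file here are illustrative placeholders, not taken from the original report:

```bash
# Hypothetical reproduction from the undercloud, stack name "overcloud-0".
openstack overcloud deploy --stack overcloud-0 --templates -e env.yaml
openstack overcloud delete overcloud-0 --yes
# Redeploy within the 2-hour fact-cache window, same stack name and
# same node hostnames:
openstack overcloud deploy --stack overcloud-0 --templates -e env.yaml
```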
The cached ansible facts from the 1st deployment (overcloud-0_1) are used in the 2nd deployment (overcloud-0):
```
[CentOS-7.8 - root@undercloud mistral]# grep host_key overcloud-0/.ansible/fact_cache/overcloud-0-novacompute-0
"ansible_ssh_host_key_ecdsa_public": "AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBG1POUEid7AiBJNsHexvyy4D3oyhKP8ht7zHZ7FktsOb7PrLZVe0wWOxP/X6TdMZYLeTpDDsCo+gEXQXlVZ+hC8=",
"ansible_ssh_host_key_ed25519_public": "AAAAC3NzaC1lZDI1NTE5AAAAIDD5gi10zP5St8MrsvoUqAbwoZGRHbY2PI7hUA0m3rpd",
"ansible_ssh_host_key_rsa_public": "AAAAB3NzaC1yc2EAAAADAQABAAABAQCglZI/tVpWdC+71yBsE3HQIkoFcnSSIrtHLxXHGO/M382Z6lNK22oR7athjzsQIKaf6gW+paNI+Uf1DcebHQPpIqYHUl64XlyjayZ5xwdbK/dTgxCLRXvYousIC21Lg/7cpi2aY1dhQ8zLZXKnIveydS+twNRZ1Haol5pWIuB52WgX7idAysMkU6Smsxs/uxsJlMJ6Dby2IK5jXS/N5XM4aHo0gWBZ4Ea4UADXyJKfrrjrjLZHSc58Cp0WFAfgQukfTk9BnUzGVNBLF/w1ihalV1PkbBvv16+PKEDfwXnX49KJ75s76HVh+bD5KLVCCA0QSGLJilC7QqGUVXFlTpSB",
[CentOS-7.8 - root@undercloud mistral]# grep host_key overcloud-0_1/.ansible/fact_cache/overcloud-0-novacompute-0
"ansible_ssh_host_key_ecdsa_public": "AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBG1POUEid7AiBJNsHexvyy4D3oyhKP8ht7zHZ7FktsOb7PrLZVe0wWOxP/X6TdMZYLeTpDDsCo+gEXQXlVZ+hC8=",
"ansible_ssh_host_key_ed25519_public": "AAAAC3NzaC1lZDI1NTE5AAAAIDD5gi10zP5St8MrsvoUqAbwoZGRHbY2PI7hUA0m3rpd",
"ansible_ssh_host_key_rsa_public": "AAAAB3NzaC1yc2EAAAADAQABAAABAQCglZI/tVpWdC+71yBsE3HQIkoFcnSSIrtHLxXHGO/M382Z6lNK22oR7athjzsQIKaf6gW+paNI+Uf1DcebHQPpIqYHUl64XlyjayZ5xwdbK/dTgxCLRXvYousIC21Lg/7cpi2aY1dhQ8zLZXKnIveydS+twNRZ1Haol5pWIuB52WgX7idAysMkU6Smsxs/uxsJlMJ6Dby2IK5jXS/N5XM4aHo0gWBZ4Ea4UADXyJKfrrjrjLZHSc58Cp0WFAfgQukfTk9BnUzGVNBLF/w1ihalV1PkbBvv16+PKEDfwXnX49KJ75s76HVh+bD5KLVCCA0QSGLJilC7QqGUVXFlTpSB",
```
```
()[nova@overcloud-0-novacompute-0 /]$ ssh overcloud-0-novacompute-1
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:c3KSB9JQENyKvCM5fe/UVUUO6CgvGoORNjvFz1wo18E.
Please contact your system administrator.
Add correct host key in /dev/null to get rid of this message.
Offending RSA key in /etc/ssh/ssh_known_hosts:6
RSA host key for [overcloud-0-novacompute-1]:2022 has changed and you have requested strict checking.
Host key verification failed.
```
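One way to see which key the migration target is actually presenting (port 2022, per the error above) with standard OpenSSH tools; a sketch assuming the same hostname:

```bash
# Fetch the RSA key served on the nova migration port and print its
# fingerprint, for comparison with the stale known_hosts entry.
ssh-keyscan -t rsa -p 2022 overcloud-0-novacompute-1 2>/dev/null | ssh-keygen -lf -
```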
The facts are cached for 2 hours, so it's only likely to be an issue when a deployment is deleted and immediately redeployed.

This should be easy to work around, e.g. remove /var/lib/mistral/<stack_name> after the overcloud delete, or just use a different overcloud stack name.
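A sketch of that workaround ("overcloud-0" is a placeholder stack name):

```bash
# Drop the stale ansible fact cache after deleting the overcloud,
# per the path given in the comment above.
sudo rm -rf /var/lib/mistral/overcloud-0
```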
Hi Ollie,

Thanks for reproducing it and finding the root cause. Why doesn't tripleo make sure to clean up all the leftovers after a stack delete? Making this manual adds one more step for the user to remember to perform.

I think we do delete that stack name; the facts end up in a different spot. I believe this change was backported for upgrades, so we'll likely need to address that: https://review.opendev.org/#/c/725515/3/tripleo_common/actions/ansible.py

https://review.opendev.org/#/c/682855 was the original change, where the facts end up in /var/tmp. We can force the clearing of the cache at the start of a deployment to avoid this.

Note this is only likely to be an issue for dev/test/POC deployments. It's extremely unlikely that a production deployment would be deployed and then, within the next 2 hours, deleted and redeployed.

Verification steps:
1. Deploy an overcloud.
2. Delete the overcloud.
3. Deploy an overcloud with the same stack name and same host names.

New ssh keys were generated and the cached ones were not used.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284