Bug 1671127 - live-migration not working with pre-provisioned servers and config-download deploy method
Summary: live-migration not working with pre-provisioned servers and config-download deploy method
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z6
Target Release: 13.0 (Queens)
Assignee: Martin Schuppert
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-30 19:55 UTC by Boris Deschenes
Modified: 2019-04-30 17:27 UTC (History)
11 users (show)

Fixed In Version: openstack-tripleo-heat-templates-8.3.1-5.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-30 17:27:36 UTC
Target Upstream Version:
Embargoed:


Attachments
grep ecdsa /etc/ssh/ssh_known_hosts (a single line with errors) (247.19 KB, image/png)
2019-03-29 14:34 UTC, Vincent S. Cojot


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 649021 None MERGED Include ssh known_hosts entries for non-default port 2020-07-14 10:58:28 UTC
OpenStack gerrit 649022 None MERGED Simplify ssh known_hosts entries for non-default port 2020-07-14 10:58:28 UTC
Red Hat Product Errata RHBA-2019:0939 0 None None None 2019-04-30 17:27:54 UTC

Description Boris Deschenes 2019-01-30 19:55:50 UTC
Description of problem:
In a pre-provisioned deployment, live-migration will not work because /etc/ssh/ssh_known_hosts is not properly populated.

Version-Release number of selected component (if applicable):
13

How reproducible:
Should affect all pre-provisioned deployments

Steps to Reproduce:
1. deploy a pre-provisioned overcloud
2. look at /etc/ssh/ssh_known_hosts before and after the deployment
3. the hosts' ECDSA public keys are not inserted into /etc/ssh/ssh_known_hosts

Actual results:
live migration will fail with the following error message in /var/log/containers/nova/nova-compute.log on the source node:

Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+ssh://nova_migration.localdomain:2022/system?keyfile=/etc/nova/migration/identity: Cannot recv data: Host key verification failed.: Connection reset by peer: libvirtError: operation failed: Failed to connect to remote libvirt URI qemu+ssh://nova_migration.localdomain:2022/system?keyfile=/etc/nova/migration/identity: Cannot recv data: Host key verification failed.: Connection reset by peer

Expected results:
live migration should work on a brand new pre-provisioned deployment

Additional info:
this can be fixed easily by populating /etc/ssh/ssh_known_hosts on all compute nodes with the ECDSA host public keys, which you can get with ssh-keyscan:

simply execute the following on all compute hosts:
ssh-keyscan -t ecdsa compute001.internalapi.localdomain compute002.internalapi.localdomain ... ... >> /etc/ssh/ssh_known_hosts
docker restart nova_libvirt nova_compute

with this, I was able to live-migrate with shared storage (iSCSI) as well as local storage (--block-migrate)

Also, if you look at /etc/ssh/ssh_known_hosts right after the deployment, you will see a very long first line listing all known machines and all their hostnames from the different interfaces. I saw error messages about this line being too long with only 16 nodes. I wonder what the purpose of this line is, and whether it is actually read or simply ignored.
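
A quick way to check whether a given host on that long line is still parseable is ssh-keygen's lookup mode (a sketch; substitute any hostname the line should cover):

# prints the matching known_hosts line(s) on a hit; no output means the entry
# was not found or could not be parsed
ssh-keygen -F compute001.internalapi.localdomain -f /etc/ssh/ssh_known_hosts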

Comment 1 Martin Schuppert 2019-02-01 11:11:13 UTC
(In reply to Boris Deschenes from comment #0)

Could you provide the /etc/ssh/ssh_known_hosts and the logs showing the error message after the deployment? An sosreport with the --all-logs option would be best.

Comment 2 Boris Deschenes 2019-02-04 14:13:18 UTC
Hi,

The pre-provisioned servers (RHEL 7.5) we provide for the overcloud do not contain any /etc/ssh/ssh_known_hosts initially, and after a successful deployment, /etc/ssh/ssh_known_hosts contains the following (a single-line file):

192.168.20.13,controller001.localdomain,controller001,100.101.0.11,controller001.storage.localdomain,controller001.storage,100.101.0.11,controller001.storagemgmt.localdomain,controller001.storagemgmt,192.168.20.13,controller001.internalapi.localdomain,controller001.internalapi,192.168.50.12,controller001.tenant.localdomain,controller001.tenant,192.168.10.12,controller001.external.localdomain,controller001.external,100.101.0.11,controller001.management.localdomain,controller001.management,100.101.0.11,controller001.ctlplane.localdomain,controller001.ctlplane ecdsa192.168.20.22,controller002.localdomain,controller002,100.101.0.12,controller002.storage.localdomain,controller002.storage,100.101.0.12,controller002.storagemgmt.localdomain,controller002.storagemgmt,192.168.20.22,controller002.internalapi.localdomain,controller002.internalapi,192.168.50.10,controller002.tenant.localdomain,controller002.tenant,192.168.10.13,controller002.external.localdomain,controller002.external,100.101.0.12,controller002.management.localdomain,controller002.management,100.101.0.12,controller002.ctlplane.localdomain,controller002.ctlplane ecdsa192.168.20.15,controller003.localdomain,controller003,100.101.0.13,controller003.storage.localdomain,controller003.storage,100.101.0.13,controller003.storagemgmt.localdomain,controller003.storagemgmt,192.168.20.15,controller003.internalapi.localdomain,controller003.internalapi,192.168.50.24,controller003.tenant.localdomain,controller003.tenant,192.168.10.10,controller003.external.localdomain,controller003.external,100.101.0.13,controller003.management.localdomain,controller003.management,100.101.0.13,controller003.ctlplane.localdomain,controller003.ctlplane ecdsa192.168.20.35,compute001.localdomain,compute001,100.101.0.21,compute001.storage.localdomain,compute001.storage,100.101.0.21,compute001.storagemgmt.localdomain,compute001.storagemgmt,192.168.20.35,compute001.internalapi.localdomain,compute001.internalapi,192.168.50.20,compute001.tenant.localdomain,compute001.tenant,192.168.10.31,compute001.external.localdomain,compute001.external,100.101.0.21,compute001.management.localdomain,compute001.management,100.101.0.21,compute001.ctlplane.localdomain,compute001.ctlplane ecdsa192.168.20.21,compute002.localdomain,compute002,100.101.0.22,compute002.storage.localdomain,compute002.storage,100.101.0.22,compute002.storagemgmt.localdomain,compute002.storagemgmt,192.168.20.21,compute002.internalapi.localdomain,compute002.internalapi,192.168.50.11,compute002.tenant.localdomain,compute002.tenant,192.168.10.11,compute002.external.localdomain,compute002.external,100.101.0.22,compute002.management.localdomain,compute002.management,100.101.0.22,compute002.ctlplane.localdomain,compute002.ctlplane ecdsa

sosreport will follow

Boris

Comment 3 Boris Deschenes 2019-02-04 14:57:41 UTC
Created attachment 1526816 [details]
compute001 sosreport

Comment 5 Martin Schuppert 2019-03-14 16:45:19 UTC
I have closed this bug as it has been waiting for more info for at least 4
weeks. We only do this to ensure that we don't accumulate stale bugs which
can't be addressed. If you are able to provide the requested information,
please feel free to re-open this bug.

Comment 10 Vincent S. Cojot 2019-03-28 17:45:32 UTC
Hi,
Yes, the pre-deployed servers do have this:
[root@qasite1-compute001 ~]# ls -la /etc/ssh/ssh_host_rsa_key.pub /etc/ssh/ssh_host_ecdsa_key.pub /etc/ssh/ssh_host_ed25519_key.pub
-rw-r--r--. 1 root root 162 Aug 30  2018 /etc/ssh/ssh_host_ecdsa_key.pub
-rw-r--r--. 1 root root  82 Aug 30  2018 /etc/ssh/ssh_host_ed25519_key.pub
-rw-r--r--. 1 root root 382 Aug 30  2018 /etc/ssh/ssh_host_rsa_key.pub

The problem is also present in that lab:
[root@qasite1-compute001 ~]# cat /etc/ssh/ssh_known_hosts 
192.168.20.11,qasite1-controller001.localdomain,qasite1-controller001,172.18.250.6,qasite1-controller001.storage.localdomain,qasite1-controller001.storage,172.18.250.6,qasite1-controller001.storagemgmt.localdomain,qasite1-controller001.storagemgmt,192.168.20.11,qasite1-controller001.internalapi.localdomain,qasite1-controller001.internalapi,192.168.9.21,qasite1-controller001.tenant.localdomain,qasite1-controller001.tenant,172.18.250.74,qasite1-controller001.external.localdomain,qasite1-controller001.external,172.18.250.6,qasite1-controller001.management.localdomain,qasite1-controller001.management,172.18.250.6,qasite1-controller001.ctlplane.localdomain,qasite1-controller001.ctlplane ecdsa192.168.20.21,qasite1-controller002.localdomain,qasite1-controller002,172.18.250.7,qasite1-controller002.storage.localdomain,qasite1-controller002.storage,172.18.250.7,qasite1-controller002.storagemgmt.localdomain,qasite1-controller002.storagemgmt,192.168.20.21,qasite1-controller002.internalapi.localdomain,qasite1-controller002.internalapi,192.168.9.10,qasite1-controller002.tenant.localdomain,qasite1-controller002.tenant,172.18.250.73,qasite1-controller002.external.localdomain,qasite1-controller002.external,172.18.250.7,qasite1-controller002.management.localdomain,qasite1-controller002.management,172.18.250.7,qasite1-controller002.ctlplane.localdomain,qasite1-controller002.ctlplane ecdsa192.168.20.12,qasite1-controller003.localdomain,qasite1-controller003,172.18.250.8,qasite1-controller003.storage.localdomain,qasite1-controller003.storage,172.18.250.8,qasite1-controller003.storagemgmt.localdomain,qasite1-controller003.storagemgmt,192.168.20.12,qasite1-controller003.internalapi.localdomain,qasite1-controller003.internalapi,192.168.9.13,qasite1-controller003.tenant.localdomain,qasite1-controller003.tenant,172.18.250.80,qasite1-controller003.external.localdomain,qasite1-controller003.external,172.18.250.8,qasite1-controller003.management.localdomain,qasite1-controller003.management,172.18.250.8,qasite1-controller003.ctlplane.localdomain,qasite1-controller003.ctlplane ecdsa192.168.20.22,qasite1-compute001.localdomain,qasite1-compute001,172.18.250.9,qasite1-compute001.storage.localdomain,qasite1-compute001.storage,172.18.250.9,qasite1-compute001.storagemgmt.localdomain,qasite1-compute001.storagemgmt,192.168.20.22,qasite1-compute001.internalapi.localdomain,qasite1-compute001.internalapi,192.168.9.17,qasite1-compute001.tenant.localdomain,qasite1-compute001.tenant,172.18.250.77,qasite1-compute001.external.localdomain,qasite1-compute001.external,172.18.250.9,qasite1-compute001.management.localdomain,qasite1-compute001.management,172.18.250.9,qasite1-compute001.ctlplane.localdomain,qasite1-compute001.ctlplane ecdsa192.168.20.10,qasite1-compute002.localdomain,qasite1-compute002,172.18.250.10,qasite1-compute002.storage.localdomain,qasite1-compute002.storage,172.18.250.10,qasite1-compute002.storagemgmt.localdomain,qasite1-compute002.storagemgmt,192.168.20.10,qasite1-compute002.internalapi.localdomain,qasite1-compute002.internalapi,192.168.9.20,qasite1-compute002.tenant.localdomain,qasite1-compute002.tenant,172.18.250.79,qasite1-compute002.external.localdomain,qasite1-compute002.external,172.18.250.10,qasite1-compute002.management.localdomain,qasite1-compute002.management,172.18.250.10,qasite1-compute002.ctlplane.localdomain,qasite1-compute002.ctlplane ecdsa

Comment 12 Vincent S. Cojot 2019-03-28 18:08:07 UTC
This is OSP13z3 or OSP13z4:
openstack-glance-16.0.1-3.el7ost.noarch
openstack-heat-api-10.0.1-2.el7ost.noarch
openstack-heat-api-cfn-10.0.1-2.el7ost.noarch
openstack-heat-common-10.0.1-2.el7ost.noarch
openstack-heat-engine-10.0.1-2.el7ost.noarch
openstack-ironic-api-10.1.3-5.el7ost.noarch
openstack-ironic-common-10.1.3-5.el7ost.noarch
openstack-ironic-conductor-10.1.3-5.el7ost.noarch
openstack-ironic-inspector-7.2.1-2.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-4.el7ost.noarch
openstack-keystone-13.0.1-1.el7ost.noarch
openstack-mistral-api-6.0.3-1.el7ost.noarch
openstack-mistral-common-6.0.3-1.el7ost.noarch
openstack-mistral-engine-6.0.3-1.el7ost.noarch
openstack-mistral-executor-6.0.3-1.el7ost.noarch
openstack-neutron-12.0.3-5.el7ost.noarch
openstack-neutron-common-12.0.3-5.el7ost.noarch
openstack-neutron-ml2-12.0.3-5.el7ost.noarch
openstack-neutron-openvswitch-12.0.3-5.el7ost.noarch
openstack-nova-api-17.0.5-3.d7864fbgit.el7ost.noarch
openstack-nova-common-17.0.5-3.d7864fbgit.el7ost.noarch
openstack-nova-compute-17.0.5-3.d7864fbgit.el7ost.noarch
openstack-nova-conductor-17.0.5-3.d7864fbgit.el7ost.noarch
openstack-nova-placement-api-17.0.5-3.d7864fbgit.el7ost.noarch
openstack-nova-scheduler-17.0.5-3.d7864fbgit.el7ost.noarch
openstack-selinux-0.8.14-14.el7ost.noarch
openstack-swift-account-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-swift-container-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-swift-object-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-swift-proxy-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-tempest-18.0.0-2.el7ost.noarch
openstack-tripleo-common-8.6.3-13.el7ost.noarch
openstack-tripleo-common-containers-8.6.3-13.el7ost.noarch
openstack-tripleo-heat-templates-8.0.4-20.el7ost.noarch
openstack-tripleo-image-elements-8.0.1-1.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.1-1.el7ost.noarch
openstack-tripleo-ui-8.3.2-1.el7ost.noarch
openstack-tripleo-validations-8.4.2-1.el7ost.noarch
openstack-zaqar-6.0.1-1.el7ost.noarch

(undercloud) [stack@qasite1-director ~]$ rpm -qa|grep instack|sort
instack-8.1.1-0.20180313084439.0d768a3.el7ost.noarch
instack-undercloud-8.4.3-4.el7ost.noarch

Comment 13 Vincent S. Cojot 2019-03-28 18:09:47 UTC
Where in the t-h-t templates does the creation of the proper ssh keys + known_hosts for normal/live migration happen?

If possible and while we work on this issue, I'd like to try to re-create everything properly (and as expected by nova).

Comment 14 Martin Schuppert 2019-03-29 10:53:33 UTC
> Where in the t-h-t templates does the creation of the proper ssh keys + known_hosts for normal/live migration happen?

The SSH keys themselves are not created by THT. They get created when sshd starts for the first time on a node, which we can see from the provided sosreport:

Aug 30 16:04:46 localhost sshd-keygen: Generating SSH2 RSA host key: [  OK  ]
Aug 30 16:04:47 localhost sshd-keygen: Generating SSH2 ECDSA host key: [  OK  ]
Aug 30 16:04:47 localhost sshd-keygen: Generating SSH2 ED25519 host key: [  OK  ]

THT [1], together with the tripleo-ssh-known-hosts role from tripleo-common, creates and fills the ssh_known_hosts file.


We also see in the log the lines that should get added, including the keys:

Feb  4 13:40:38 compute001 ansible-lineinfile: Invoked with directory_mode=None force=None remote_src=None backrefs=False insertafter=None path=/etc/ssh/ssh_known_hosts owner=None follow=False validate=None group=None insertbefore=None unsafe_writes=None create=True setype=None content=NOT_LOGGING_PARAMETER serole=None state=present selevel=None regexp=None line=192.168.20.13,controller001.localdomain,controller001,100.101.0.11,controller001.storage.localdomain,controller001.storage,100.101.0.11,controller001.storagemgmt.localdomain,controller001.storagemgmt,192.168.20.13,controller001.internalapi.localdomain,controller001.internalapi,192.168.50.12,controller001.tenant.localdomain,controller001.tenant,192.168.10.12,controller001.external.localdomain,controller001.external,100.101.0.11,controller001.management.localdomain,controller001.management,100.101.0.11,controller001.ctlplane.localdomain,controller001.ctlplane ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1dpKy2EGDd3kqKm3X2CtbaVfCZemQhKL3bnkgWnY+JiiY3P0EmPrySKhusy84OowmpLPHi6lqT+q7b2VKX45+ACr8AXmqHQDTwwpO45ivTmdCu+BzLks66Bg0DLh1X6KC0oERAhI60cxKDQepXeLhiB/QazQkEXHpv6T3TOAJlYf1RyztjXnJDwHLo25NmhDiyYO9udGtA0LtBpP97eF5OfShXmw3tdfVqwDRViB6YHiH8UKc/d1PFyKslY1p73f94FZaleDW2l91PZgJDDKhxmCSspaIlQZiATmmtVvrMAlhVm6kf5Zc1C0A7uwIIwaaz4hVbwSXCFPDbWLM8Dh1 src=None seuser=None delimiter=None mode=None attributes=None backup=False
Feb  4 13:40:38 compute001 ansible-lineinfile: Invoked with directory_mode=None force=None remote_src=None backrefs=False insertafter=None path=/etc/ssh/ssh_known_hosts owner=None follow=False validate=None group=None insertbefore=None unsafe_writes=None create=True setype=None content=NOT_LOGGING_PARAMETER serole=None state=present selevel=None regexp=None line=192.168.20.15,controller003.localdomain,controller003,100.101.0.13,controller003.storage.localdomain,controller003.storage,100.101.0.13,controller003.storagemgmt.localdomain,controller003.storagemgmt,192.168.20.15,controller003.internalapi.localdomain,controller003.internalapi,192.168.50.24,controller003.tenant.localdomain,controller003.tenant,192.168.10.10,controller003.external.localdomain,controller003.external,100.101.0.13,controller003.management.localdomain,controller003.management,100.101.0.13,controller003.ctlplane.localdomain,controller003.ctlplane ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1dpKy2EGDd3kqKm3X2CtbaVfCZemQhKL3bnkgWnY+JiiY3P0EmPrySKhusy84OowmpLPHi6lqT+q7b2VKX45+ACr8AXmqHQDTwwpO45ivTmdCu+BzLks66Bg0DLh1X6KC0oERAhI60cxKDQepXeLhiB/QazQkEXHpv6T3TOAJlYf1RyztjXnJDwHLo25NmhDiyYO9udGtA0LtBpP97eF5OfShXmw3tdfVqwDRViB6YHiH8UKc/d1PFyKslY1p73f94FZaleDW2l91PZgJDDKhxmCSspaIlQZiATmmtVvrMAlhVm6kf5Zc1C0A7uwIIwaaz4hVbwSXCFPDbWLM8Dh1 src=None seuser=None delimiter=None mode=None attributes=None backup=False
Feb  4 13:40:38 compute001 ansible-lineinfile: Invoked with directory_mode=None force=None remote_src=None backrefs=False insertafter=None path=/etc/ssh/ssh_known_hosts owner=None follow=False validate=None group=None insertbefore=None unsafe_writes=None create=True setype=None content=NOT_LOGGING_PARAMETER serole=None state=present selevel=None regexp=None line=192.168.20.22,controller002.localdomain,controller002,100.101.0.12,controller002.storage.localdomain,controller002.storage,100.101.0.12,controller002.storagemgmt.localdomain,controller002.storagemgmt,192.168.20.22,controller002.internalapi.localdomain,controller002.internalapi,192.168.50.10,controller002.tenant.localdomain,controller002.tenant,192.168.10.13,controller002.external.localdomain,controller002.external,100.101.0.12,controller002.management.localdomain,controller002.management,100.101.0.12,controller002.ctlplane.localdomain,controller002.ctlplane ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1dpKy2EGDd3kqKm3X2CtbaVfCZemQhKL3bnkgWnY+JiiY3P0EmPrySKhusy84OowmpLPHi6lqT+q7b2VKX45+ACr8AXmqHQDTwwpO45ivTmdCu+BzLks66Bg0DLh1X6KC0oERAhI60cxKDQepXeLhiB/QazQkEXHpv6T3TOAJlYf1RyztjXnJDwHLo25NmhDiyYO9udGtA0LtBpP97eF5OfShXmw3tdfVqwDRViB6YHiH8UKc/d1PFyKslY1p73f94FZaleDW2l91PZgJDDKhxmCSspaIlQZiATmmtVvrMAlhVm6kf5Zc1C0A7uwIIwaaz4hVbwSXCFPDbWLM8Dh1 src=None seuser=None delimiter=None mode=None attributes=None backup=False
Feb  4 13:40:39 compute001 ansible-lineinfile: Invoked with directory_mode=None force=None remote_src=None backrefs=False insertafter=None path=/etc/ssh/ssh_known_hosts owner=None follow=False validate=None group=None insertbefore=None unsafe_writes=None create=True setype=None content=NOT_LOGGING_PARAMETER serole=None state=present selevel=None regexp=None line=192.168.20.21,compute002.localdomain,compute002,100.101.0.22,compute002.storage.localdomain,compute002.storage,100.101.0.22,compute002.storagemgmt.localdomain,compute002.storagemgmt,192.168.20.21,compute002.internalapi.localdomain,compute002.internalapi,192.168.50.11,compute002.tenant.localdomain,compute002.tenant,192.168.10.11,compute002.external.localdomain,compute002.external,100.101.0.22,compute002.management.localdomain,compute002.management,100.101.0.22,compute002.ctlplane.localdomain,compute002.ctlplane ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1dpKy2EGDd3kqKm3X2CtbaVfCZemQhKL3bnkgWnY+JiiY3P0EmPrySKhusy84OowmpLPHi6lqT+q7b2VKX45+ACr8AXmqHQDTwwpO45ivTmdCu+BzLks66Bg0DLh1X6KC0oERAhI60cxKDQepXeLhiB/QazQkEXHpv6T3TOAJlYf1RyztjXnJDwHLo25NmhDiyYO9udGtA0LtBpP97eF5OfShXmw3tdfVqwDRViB6YHiH8UKc/d1PFyKslY1p73f94FZaleDW2l91PZgJDDKhxmCSspaIlQZiATmmtVvrMAlhVm6kf5Zc1C0A7uwIIwaaz4hVbwSXCFPDbWLM8Dh1 src=None seuser=None delimiter=None mode=None attributes=None backup=False
Feb  4 13:40:39 compute001 ansible-lineinfile: Invoked with directory_mode=None force=None remote_src=None backrefs=False insertafter=None path=/etc/ssh/ssh_known_hosts owner=None follow=False validate=None group=None insertbefore=None unsafe_writes=None create=True setype=None content=NOT_LOGGING_PARAMETER serole=None state=present selevel=None regexp=None line=192.168.20.35,compute001.localdomain,compute001,100.101.0.21,compute001.storage.localdomain,compute001.storage,100.101.0.21,compute001.storagemgmt.localdomain,compute001.storagemgmt,192.168.20.35,compute001.internalapi.localdomain,compute001.internalapi,192.168.50.20,compute001.tenant.localdomain,compute001.tenant,192.168.10.31,compute001.external.localdomain,compute001.external,100.101.0.21,compute001.management.localdomain,compute001.management,100.101.0.21,compute001.ctlplane.localdomain,compute001.ctlplane ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1dpKy2EGDd3kqKm3X2CtbaVfCZemQhKL3bnkgWnY+JiiY3P0EmPrySKhusy84OowmpLPHi6lqT+q7b2VKX45+ACr8AXmqHQDTwwpO45ivTmdCu+BzLks66Bg0DLh1X6KC0oERAhI60cxKDQepXeLhiB/QazQkEXHpv6T3TOAJlYf1RyztjXnJDwHLo25NmhDiyYO9udGtA0LtBpP97eF5OfShXmw3tdfVqwDRViB6YHiH8UKc/d1PFyKslY1p73f94FZaleDW2l91PZgJDDKhxmCSspaIlQZiATmmtVvrMAlhVm6kf5Zc1C0A7uwIIwaaz4hVbwSXCFPDbWLM8Dh1 src=None seuser=None delimiter=None mode=None attributes=None backup=False

Note: it seems the controllers were cloned (?), as they have the same host public key.
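
A quick way to confirm that (a sketch, assuming the internalapi names resolve from where it is run):

# identical fingerprints across all three controllers would confirm cloned host keys
for h in controller001 controller002 controller003; do
    ssh-keyscan -t rsa ${h}.internalapi.localdomain 2>/dev/null | ssh-keygen -lf -
done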

Let's run only the common_roles tag, which the tripleo-ssh-known-hosts role is part of, with verbose output:

* create inventory if not already done
$ tripleo-ansible-inventory --static-yaml-inventory ~/inventory.yaml

* download the playbooks
$ openstack overcloud config download --name overcloud --config-dir ~/config-download

* run the deploy_steps_playbook.yaml playbook, limited to the ComputeDeployedServer role and the common_roles tag (you might need to change the private key and ansible_ssh_user)
$ ansible-playbook -i ~/inventory.yaml --private-key /home/stack/.ssh/id_rsa  --limit ComputeDeployedServer --tags common_roles -e ansible_ssh_user=stack --become -vvv config-download/tripleo-nWbCb4-config/deploy_steps_playbook.yaml >> ansible-common_roles.log

With -vvv, when the facts are gathered we should see ansible_ssh_host_key_rsa_public, and later the lines which get created.
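
For example, a couple of greps against the verbose log should confirm both (a sketch; the log name is taken from the command above):

# count hosts that reported an RSA host key fact
$ grep -c ansible_ssh_host_key_rsa_public ansible-common_roles.log
# count the lineinfile invocations that write known_hosts entries
$ grep lineinfile ansible-common_roles.log | grep -c /etc/ssh/ssh_known_hosts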

> # gtar cvzf /tmp/var_log_mistral.tar.gz /var/log/mistral/*log

What I wanted was /var/lib/mistral, in order to get the playbooks and the ansible.log for the deployment.

Can we get a tar of /var/lib/mistral, the ansible-common_roles.log from the above run, and the templates?

I could not reproduce the issue.

Note: the versions used are really old:
* openstack-tripleo-heat-templates-8.0.4-20.el7ost.noarch is not even z3, it is z2. Current latest is 8.2.0-6.2.
* openstack-tripleo-common-8.6.3-13.el7ost.noarch, current latest is 8.6.6-16

[1] https://github.com/openstack/tripleo-heat-templates/commit/54010e2358850df97d34f5f9e67b89a800dba67d#diff-fdfa9108b5c67b2c4ce1dae2a05ec0c2

Comment 15 Vincent S. Cojot 2019-03-29 14:02:35 UTC
Hi Martin,
I'll be uploading those archives shortly.
Thanks,
Vincent

Comment 16 Vincent S. Cojot 2019-03-29 14:03:01 UTC
(undercloud) [stack@qasite1-director tripleo-Bt6Efr-config]$  ansible-playbook -i ../inventory.yaml --private-key /home/stack/.ssh/id_rsa  --limit ComputeDl360g10V1 --tags common_roles -e ansible_ssh_user=stack --become -vvv deploy_steps_playbook.yaml >> ~/ansible-common_roles.log

Comment 19 Vincent S. Cojot 2019-03-29 14:33:31 UTC
Also, I see you are mostly mentioning RSA, but the problem seems to lie with the 'ecdsa' keys in /etc/ssh/ssh_known_hosts.
Please look at the attached screenshot.

Comment 20 Vincent S. Cojot 2019-03-29 14:34:16 UTC
Created attachment 1549468 [details]
grep ecdsa /etc/ssh/ssh_known_hosts (a single line with errors)

Comment 21 Vincent S. Cojot 2019-03-29 16:21:40 UTC
For the record, I'm currently having good results with this workaround:

# rm -fv /etc/ssh/ssh_known_hosts
# sed -n '/HEAT_HOSTS_START/,/HEAT_HOSTS_END/p' /etc/hosts|awk '{ if ( $1 ~ /^[12]/ ) print $0 }' |xargs ssh-keyscan -p 2022 >> /etc/ssh/ssh_known_hosts 2>/dev/null
# sed -n '/HEAT_HOSTS_START/,/HEAT_HOSTS_END/p' /etc/hosts|awk '{ if ( $1 ~ /^[12]/ ) print $0 }' |xargs ssh-keyscan  >> /etc/ssh/ssh_known_hosts 2>/dev/null

Comment 22 Martin Schuppert 2019-03-29 16:49:33 UTC
(In reply to Vincent S. Cojot from comment #21)

It seems the correct information was added to the file, but only the RSA keys: two lines, one for each compute node, which carry the key.

    "invocation": {
        "module_args": {
            "attributes": null, 
            "backrefs": false, 
            "backup": false, 
            "content": null, 
            "create": true, 
            "delimiter": null, 
            "directory_mode": null, 
            "follow": false, 
            "force": null, 
            "group": null, 
            "insertafter": null, 
            "insertbefore": null, 
            "line": "192.168.20.10,qasite1-compute002.localdomain,qasite1-compute002,172.18.250.10,qasite1-compute002.storage.localdomain,qasite1-compute002.storage,172.18.250.10,qasite1-compute002.storagemgmt.localdomain,qasite1-compute002.storagemgmt,192.168.20.10,qasite1-compute002.internalapi.localdomain,qasite1-compute002.internalapi,192.168.9.20,qasite1-compute002.tenant.localdomain,qasite1-compute002.tenant,172.18.250.79,qasite1-compute002.external.localdomain,qasite1-compute002.external,172.18.250.10,qasite1-compute002.management.localdomain,qasite1-compute002.management,172.18.250.10,qasite1-compute002.ctlplane.localdomain,qasite1-compute002.ctlplane ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1dpKy2EGDd3kqKm3X2CtbaVfCZemQhKL3bnkgWnY+JiiY3P0EmPrySKhusy84OowmpLPHi6lqT+q7b2VKX45+ACr8AXmqHQDTwwpO45ivTmdCu+BzLks66Bg0DLh1X6KC0oERAhI60cxKDQepXeLhiB/QazQkEXHpv6T3TOAJlYf1RyztjXnJDwHLo25NmhDiyYO9udGtA0LtBpP97eF5OfShXmw3tdfVqwDRViB6YHiH8UKc/d1PFyKslY1p73f94FZaleDW2l91PZgJDDKhxmCSspaIlQZiATmmtVvrMAlhVm6kf5Zc1C0A7uwIIwaaz4hVbwSXCFPDbWLM8Dh1", 
            "mode": null, 
            "owner": null, 
            "path": "/etc/ssh/ssh_known_hosts", 
            "regexp": null, 
            "remote_src": null, 
            "selevel": null, 
            "serole": null, 
            "setype": null, 
            "seuser": null, 
            "src": null, 
            "state": "present", 
            "unsafe_writes": null, 
            "validate": null
        }


    "invocation": {
        "module_args": {
            "attributes": null, 
            "backrefs": false, 
            "backup": false, 
            "content": null, 
            "create": true, 
            "delimiter": null, 
            "directory_mode": null, 
            "follow": false, 
            "force": null, 
            "group": null, 
            "insertafter": null, 
            "insertbefore": null, 
            "line": "192.168.20.22,qasite1-compute001.localdomain,qasite1-compute001,172.18.250.9,qasite1-compute001.storage.localdomain,qasite1-compute001.storage,172.18.250.9,qasite1-compute001.storagemgmt.localdomain,qasite1-compute001.storagemgmt,192.168.20.22,qasite1-compute001.internalapi.localdomain,qasite1-compute001.internalapi,192.168.9.17,qasite1-compute001.tenant.localdomain,qasite1-compute001.tenant,172.18.250.77,qasite1-compute001.external.localdomain,qasite1-compute001.external,172.18.250.9,qasite1-compute001.management.localdomain,qasite1-compute001.management,172.18.250.9,qasite1-compute001.ctlplane.localdomain,qasite1-compute001.ctlplane ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1dpKy2EGDd3kqKm3X2CtbaVfCZemQhKL3bnkgWnY+JiiY3P0EmPrySKhusy84OowmpLPHi6lqT+q7b2VKX45+ACr8AXmqHQDTwwpO45ivTmdCu+BzLks66Bg0DLh1X6KC0oERAhI60cxKDQepXeLhiB/QazQkEXHpv6T3TOAJlYf1RyztjXnJDwHLo25NmhDiyYO9udGtA0LtBpP97eF5OfShXmw3tdfVqwDRViB6YHiH8UKc/d1PFyKslY1p73f94FZaleDW2l91PZgJDDKhxmCSspaIlQZiATmmtVvrMAlhVm6kf5Zc1C0A7uwIIwaaz4hVbwSXCFPDbWLM8Dh1", 
            "mode": null, 
            "owner": null, 
            "path": "/etc/ssh/ssh_known_hosts", 
            "regexp": null, 
            "remote_src": null, 
            "selevel": null, 
            "serole": null, 
            "setype": null, 
            "seuser": null, 
            "src": null, 
            "state": "present", 
            "unsafe_writes": null, 
            "validate": null
        }

So I think when you delete /etc/ssh/ssh_known_hosts and run the Ansible common_roles tag, you should end up with a good ssh_known_hosts file. I need to check where the ecdsa entry comes from; AFAIK this is not done by the playbook.

Can you please also add the ansible.log from /var/lib/mistral/<id>/ansible.log.

Comment 23 Vincent S. Cojot 2019-03-29 16:59:44 UTC
I think it is not sufficient to inject those keys into /etc/ssh/ssh_known_hosts, as the connections reaching out to the other computes on port 2022 aren't going to be able to make use of the plain (port 22) entries.
At least, that's what happened here; it's the reason why I have two ssh-keyscan lines.

Comment 24 Vincent S. Cojot 2019-03-29 17:04:28 UTC
I just noticed there is -nothing- on the director under /var/lib/mistral; it's just an empty directory.

Comment 25 Vincent S. Cojot 2019-03-29 22:20:50 UTC
Here's what I'm currently using post-deployment:

#!/bin/bash
echo "#Rebuilt by post-deploy-fix-server-knownhosts.sh, do NOT edit manually" > /etc/ssh/ssh_known_hosts

# Pass #1 (normal hosts)
for ktype in rsa ecdsa
do
        sed -n '/HEAT_HOSTS_START/,/HEAT_HOSTS_END/p' /etc/hosts| \
                awk '{ if ( $1 ~ /^[12]/ ) print $0 }' | \
                xargs ssh-keyscan -t ${ktype} >> /etc/ssh/ssh_known_hosts 2>/dev/null
done

# Pass #2 (hosts on port 2022)
for ktype in rsa ecdsa
do
        sed -n '/HEAT_HOSTS_START/,/HEAT_HOSTS_END/p' /etc/hosts| \
                awk '{ if ( $1 ~ /^[12]/ ) print $0 }' | \
                xargs ssh-keyscan -t ${ktype} -p 2022 >> /etc/ssh/ssh_known_hosts 2>/dev/null
done
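
To roll this out to all computes in one pass, an ad-hoc run along these lines should work (a sketch; the inventory path, group name, and script location are assumptions):

$ ansible -i ~/inventory.yaml nova_compute --become \
    -m script -a /home/stack/post-deploy-fix-server-knownhosts.sh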

Comment 26 Martin Schuppert 2019-04-01 13:09:08 UTC
(In reply to Vincent S. Cojot from comment #23)

The SSH host keys in the migration container are merged in from the ones on the host; they are the same. So there should be no need to scan and add additional entries for the SSH daemon on port 2022 [1].

The issue is the following:
* with config-download we add the ssh-rsa key to the ssh_known_hosts file via the Ansible role
* while with the os-collect-config method we add the ecdsa key to ssh_known_hosts [2]

When the SSH connection for the live migration is made, the RSA key cannot be matched due to the non-default port 2022, and the default ecdsa key gets offered instead. Since we don't have it in ssh_known_hosts with config-download, the verification, and therefore the migration, fails.

We can see this when we run ssh in verbose mode on OSP13:
()[root@compute-0 /]$ ssh -p2022 -vvv  -i /etc/nova/migration/identity nova_migration.localdomain                                                                                                                                                                   
...
debug1: Authenticating to compute-1.internalapi.localdomain:2022 as 'nova_migration'
debug3: put_host_port: [compute-1.internalapi.localdomain]:2022
debug3: hostkeys_foreach: reading file "/etc/ssh/ssh_known_hosts"
...
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256

-> ecdsa gets selected but we only have rsa keys:

debug1: Server host key: ecdsa-sha2-nistp256 SHA256:94Hm16YJquI0Q+GR0/F3E0PFDg1hrIa4egqmQCaJQUY
debug3: put_host_port: [172.17.1.19]:2022
debug3: put_host_port: [compute-1.internalapi.localdomain]:2022
debug3: hostkeys_foreach: reading file "/etc/ssh/ssh_known_hosts"
debug3: hostkeys_foreach: reading file "/etc/ssh/ssh_known_hosts"
debug1: checking without port identifier
debug3: hostkeys_foreach: reading file "/etc/ssh/ssh_known_hosts"
debug3: record_hostkey: found key type RSA in file /etc/ssh/ssh_known_hosts:1
debug3: load_hostkeys: loaded 1 keys from compute-1.internalapi.localdomain
debug3: hostkeys_foreach: reading file "/etc/ssh/ssh_known_hosts"
debug3: record_hostkey: found key type RSA in file /etc/ssh/ssh_known_hosts:1
debug3: load_hostkeys: loaded 1 keys from 172.17.1.19
The authenticity of host '[compute-1.internalapi.localdomain]:2022 ([172.17.1.19]:2022)' can't be established.
ECDSA key fingerprint is SHA256:94Hm16YJquI0Q+GR0/F3E0PFDg1hrIa4egqmQCaJQUY.
ECDSA key fingerprint is MD5:c1:d9:44:e2:fe:47:63:e1:fc:7c:1e:7c:5c:87:c7:74.
Are you sure you want to continue connecting (yes/no)?
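
To cross-check that this really is the target host's key, one can fingerprint what the port-2022 daemon offers (a sketch, run from the source compute); it should print the same SHA256:94Hm... fingerprint as above:

$ ssh-keyscan -t ecdsa -p 2022 compute-1.internalapi.localdomain 2>/dev/null | ssh-keygen -lf -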

When we do the same on master or OSP14:
debug1: Authenticating to compute-1.internalapi.localdomain:2022 as 'nova_migration'
debug3: put_host_port: [compute-1.internalapi.localdomain]:2022
debug3: hostkeys_foreach: reading file "/etc/ssh/ssh_known_hosts"
debug3: record_hostkey: found key type RSA in file /etc/ssh/ssh_known_hosts:1
debug3: load_hostkeys: loaded 1 keys from [compute-1.internalapi.localdomain]:2022
debug3: order_hostkeyalgs: prefer hostkeyalgs: ssh-rsa-cert-v01,rsa-sha2-512,rsa-sha2-256,ssh-rsa
...
debug1: kex: host key algorithm: rsa-sha2-512

-> RSA!

debug1: Server host key: ssh-rsa SHA256:DrCJMDbpYINjQEQe9eZSTVMmBWxU2EbDXyOhzhIJy3M
debug3: put_host_port: [172.17.1.12]:2022
debug3: put_host_port: [compute-1.internalapi.localdomain]:2022
debug3: hostkeys_foreach: reading file "/etc/ssh/ssh_known_hosts"
debug3: record_hostkey: found key type RSA in file /etc/ssh/ssh_known_hosts:1
debug3: load_hostkeys: loaded 1 keys from [compute-1.internalapi.localdomain]:2022
debug3: hostkeys_foreach: reading file "/etc/ssh/ssh_known_hosts"
debug3: record_hostkey: found key type RSA in file /etc/ssh/ssh_known_hosts:1
debug3: load_hostkeys: loaded 1 keys from [172.17.1.12]:2022
debug1: Host '[compute-1.internalapi.localdomain]:2022' is known and matches the RSA host key.

-> matches!

When we manually answer yes and run the above test again on OSP13, we see the following entry gets added; note the port information:

()[root@compute-0 /root]$ cat .ssh/known_hosts 
[compute-1.internalapi.localdomain]:2022,[172.17.1.19]:2022 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBBuzoUMcE8exWix91S9KuOeyTwq9oZ5UDid+AeX8s8ZAXAq15rVELpvcwKUTycBNJ4Ew0IbUY9bA35BI32z6we8=

Checking rocky/OSP14, this was fixed with the following two changes:

commit 7a05790d51ebba48cf7e9bf230a4cb87acac7315
Author: Oliver Walsh <owalsh>
Date:   Thu Aug 30 12:23:45 2018 +0100

    Simplify ssh known_hosts entries for non-default port
    
    '[host]*' matches both default port and non-default port.
    
    Change-Id: Id83bed36f3ab7f8d0fbdbd03f3960307af62fc84
    Related-bug: #1789452
    (cherry picked from commit c70d197d36d48ce29486e2744a8bc9fda630dd11)

commit 3d5d275c6dfb11ea9410ec9aa926df500cfd22c1
Author: Oliver Walsh <owalsh>
Date:   Tue Aug 28 16:34:21 2018 +0100

    Include ssh known_hosts entries for non-default port
    
    The ssh client no longer appears to accept the regular known hosts entry when
    the target is running on a non-default port.
    Adding '[host]:*' should fix this, regardless of the port.
    However this does not work for the default port so we must include both.
    
    Change-Id: I519ff6053676870dff1bdff60fb1f6b2aa5ee8c9
    Closes-bug: #1789452
    (cherry picked from commit 876683f317640aa6877f991f2d4d0b098de3b7b3)
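
With both changes applied, each known_hosts line should carry the host in its plain form and in the bracketed wildcard form, roughly like this (an illustrative entry; names taken from the debug output above, key shortened):

compute-1.internalapi.localdomain,172.17.1.19,[compute-1.internalapi.localdomain]:*,[172.17.1.19]:* ssh-rsa AAAA...

so the client-side lookup matches whether it connects on the default port or on 2022.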

I'll work on getting this fixed in queens/OSP13.

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/docker/services/nova-migration-target.yaml#L131
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/role.role.j2.yaml#L839

Comment 29 Vincent S. Cojot 2019-04-02 20:01:45 UTC
And here is a playbook + template that fixes the issue too (that's the solution we'll be using here going forward until this BZ gets resolved):

[stack@director utilities]$ cat rebuild_ssh_known_hosts.yaml 
---
- hosts: nova_compute
  vars:
      ssh_pubkey_names:
         ansible_ssh_host_key_ed25519_public: ssh-ed25519
         ansible_ssh_host_key_ecdsa_public: ecdsa-sha2-nistp256
         ansible_ssh_host_key_rsa_public: ssh-rsa
      ssh_host_suffixes:
         - "internalapi"
         - "internalapi.localdomain"
         - "ctlplane"
         - "ctlplane.localdomain"

  tasks:
    - name: rebuild /etc/ssh/ssh_known_hosts
      become: true
      template:
        src: /home/stack/overcloud/utilities/ssh_known_hosts.j2
        dest: /etc/ssh/ssh_known_hosts
        owner: root
        group: root
        mode: '0644'

    - name: restart nova containers after updating /etc/ssh/ssh_known_hosts
      become: true
      shell: >
          docker restart nova_migration_target nova_compute

[stack@director utilities]$ cat ssh_known_hosts.j2 
{% for host in groups['nova_compute'] %}

{#- Iterate over all IPs on each host -#}
{% for ip in hostvars[host]['ansible_all_ipv4_addresses'] %}
{% if "172.31" not in ip  %}
{% endif %}
{% for ssh_pubkey,shortname in ssh_pubkey_names.items() %}
{% if ssh_pubkey in hostvars[host] %}
{% if "172.31" not in ip  %}
{{ ip }} {{ shortname }} {{ hostvars[host][ssh_pubkey] }}
{% endif %}
{% endif %}
{% endfor %}
{% endfor %}

{#- Iterate over all possible hostnames on each host -#}
{%- for ssh_suffix in ssh_host_suffixes -%}
{% for ssh_pubkey,shortname in ssh_pubkey_names.items() %}
{% if ssh_pubkey in hostvars[host] %}
{{ hostvars[host]['ansible_hostname']}}.{{ ssh_suffix }} {{ shortname }} {{ hostvars[host][ssh_pubkey] }}
{% endif %}
{% endfor %}
{% endfor %}

{% endfor %}
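
To apply it, run the playbook against the same static inventory as before (a sketch; it assumes the inventory defines the nova_compute group the play targets):

[stack@director utilities]$ ansible-playbook -i ~/inventory.yaml rebuild_ssh_known_hosts.yaml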

Comment 40 errata-xmlrpc 2019-04-30 17:27:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0939

