Bug 1707099

Summary: Unable to log in as cloud-user to an overcloud (OC) bare metal (BM) instance booted with the overcloud-full image.
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: openstack-tripleo-heat-templates
Assignee: Bob Fournier <bfournie>
Status: CLOSED ERRATA
QA Contact: Alexander Chuzhoy <sasha>
Severity: high
Priority: high
Version: 15.0 (Stein)
CC: apevec, bfournie, dalvarez, derekh, dsneddon, dtantsur, hjensas, lhh, lmartins, mbooth, mburns, mlammon, racedoro, rpittau
Target Milestone: beta
Keywords: Regression, Triaged
Target Release: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-10.5.1-0.20190528190414.f41e7e4.1.el8ost
Doc Type: No Doc Update
Last Closed: 2019-09-21 11:21:45 UTC
Type: Bug
Attachments: cloud-init log

Description Alexander Chuzhoy 2019-05-06 19:47:23 UTC
Unable to log in as cloud-user to an overcloud (OC) bare metal (BM) instance booted with the overcloud-full image.

Environment:
python3-ironic-lib-2.16.2-0.20190312184129.2020575.el8ost.noarch
puppet-ironic-14.4.1-0.20190420120354.f4220d1.el8ost.noarch
openstack-ironic-api-12.1.1-0.20190420011811.98a9730.el8ost.noarch
python3-ironicclient-2.7.0-0.20190312102843.4af8a79.el8ost.noarch
python3-ironic-inspector-client-3.5.0-0.20190313131319.9bb1150.el8ost.noarch
openstack-ironic-common-12.1.1-0.20190420011811.98a9730.el8ost.noarch



Steps to reproduce:

Deploy OC with BMaaS.
Boot a BM instance with the same image used to deploy the OC, using these commands:


openstack keypair create --public-key ~/.ssh/id_rsa.pub stack-key
openstack server create --image overcloud-full --flavor baremetal --nic net-id=$(openstack network show baremetal -f value -c id) --key-name stack-key bmserver1


Wait until the server is up.

Try to ssh into the machine as cloud-user.
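
For example (a minimal sketch; the instance IP, 192.168.24.74, is taken from the result below):

openstack server show bmserver1 -f value -c addresses
ssh cloud-user@192.168.24.74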


Result:

(undercloud) [stack@undercloud-0 ~]$ ssh cloud-user@192.168.24.74
Warning: Permanently added '192.168.24.74' (ECDSA) to the list of known hosts.
cloud-user@192.168.24.74: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).



Logged into the instance as root and saw that the .ssh/authorized_keys file for cloud-user is empty. The user itself did exist on the instance.
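
A minimal way to confirm this from the root session (a sketch, assuming the default home directory layout):

id cloud-user
ls -l /home/cloud-user/.ssh/authorized_keys
wc -c /home/cloud-user/.ssh/authorized_keys    # 0 bytes on the affected instance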

Comment 1 Alexander Chuzhoy 2019-05-06 19:48:32 UTC
On the same setup, I booted a regular instance with the cirros image and was able to log in as the cirros user using the public key provided with --key-name.

Comment 3 Bob Fournier 2019-05-10 13:19:55 UTC
Sasha - on the instance we should see cloud-init adding the cloud-user, for example (on another system):

[cloud-user@instance1 log]$ sudo grep cloud-user /var/log/cloud-init.log
2019-04-10 20:28:41,370 - __init__.py[DEBUG]: Adding user cloud-user
2019-04-10 20:28:41,371 - util.py[DEBUG]: Running hidden command to protect sensitive input/output logstring: ['useradd', 'cloud-user', '--comment', 'Cloud User', '--groups', 'adm,systemd-journal,wheel', '--shell', '/bin/bash', '-m']
2019-04-10 20:28:41,481 - util.py[DEBUG]: Running command ['passwd', '-l', 'cloud-user'] with allowed return codes [0] (shell=False, capture=True)
2019-04-10 20:28:41,565 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user (recursive=True)
2019-04-10 20:28:41,567 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh (recursive=False)
2019-04-10 20:28:41,567 - util.py[DEBUG]: Changing the ownership of /home/cloud-user/.ssh to 1000:1001
2019-04-10 20:28:41,580 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh (recursive=True)
2019-04-10 20:28:41,581 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh (recursive=False)
2019-04-10 20:28:41,582 - util.py[DEBUG]: Writing to /home/cloud-user/.ssh/authorized_keys - wb: [384] 433 bytes
2019-04-10 20:28:41,583 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh/authorized_keys (recursive=False)
2019-04-10 20:28:41,584 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh/authorized_keys (recursive=False)
2019-04-10 20:28:41,584 - util.py[DEBUG]: Changing the ownership of /home/cloud-user/.ssh/authorized_keys to 1000:1001
2019-04-10 20:28:41,585 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh (recursive=True)
2019-04-10 20:28:44,462 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh (recursive=True)
2019-04-10 20:28:44,463 - util.py[DEBUG]: Reading from /home/cloud-user/.ssh/authorized_keys (quiet=False)
2019-04-10 20:28:44,463 - util.py[DEBUG]: Read 433 bytes from /home/cloud-user/.ssh/authorized_keys

When you're logged in as root, can you retrieve /var/log/cloud-init.log? If you can't get this file, can you check whether the cloud-init package is installed?
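
Something along these lines from the root session should do (a sketch, not the exact commands):

rpm -q cloud-init
grep cloud-user /var/log/cloud-init.log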

Comment 4 Alexander Chuzhoy 2019-05-10 19:07:56 UTC
Created attachment 1566779 [details]
cloud-init log

Comment 5 Bob Fournier 2019-05-10 19:36:39 UTC
On Sasha's system, it looks like cloud-user is being set up fine, but 0 bytes are written to /home/cloud-user/.ssh/authorized_keys.

[root@host-192-168-24-97 ~]# grep cloud-user /var/log/cloud-init.log
2019-05-10 18:46:09,239 - __init__.py[DEBUG]: Adding user cloud-user
2019-05-10 18:46:09,239 - util.py[DEBUG]: Running hidden command to protect sensitive input/output logstring: ['useradd', 'cloud-user', '--comment', 'Cloud User', '--groups', 'adm,systemd-journal,wheel', '--shell', '/bin/bash', '-m']
2019-05-10 18:46:09,637 - util.py[DEBUG]: Running command ['passwd', '-l', 'cloud-user'] with allowed return codes [0] (shell=False, capture=True)
2019-05-10 18:46:09,658 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user (recursive=True)
2019-05-10 18:46:09,660 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh (recursive=False)
2019-05-10 18:46:09,660 - util.py[DEBUG]: Changing the ownership of /home/cloud-user/.ssh to 1000:1001
2019-05-10 18:46:09,661 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh (recursive=True)
2019-05-10 18:46:09,662 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh (recursive=False)
2019-05-10 18:46:09,662 - util.py[DEBUG]: Writing to /home/cloud-user/.ssh/authorized_keys - wb: [600] 0 bytes  <======
2019-05-10 18:46:09,662 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh/authorized_keys (recursive=False)
2019-05-10 18:46:09,663 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh/authorized_keys (recursive=False)
2019-05-10 18:46:09,663 - util.py[DEBUG]: Changing the ownership of /home/cloud-user/.ssh/authorized_keys to 1000:1001
2019-05-10 18:46:09,663 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh (recursive=True)
2019-05-10 18:46:11,003 - util.py[DEBUG]: Restoring selinux mode for /home/cloud-user/.ssh (recursive=True)
2019-05-10 18:46:11,004 - util.py[DEBUG]: Reading from /home/cloud-user/.ssh/authorized_keys (quiet=False)
2019-05-10 18:46:11,004 - util.py[DEBUG]: Read 0 bytes from /home/cloud-user/.ssh/authorized_keys  <=====

Comment 6 Bob Fournier 2019-05-10 19:44:41 UTC
Looks like there's an issue getting the metadata for Sasha's instance:
cloud-init-output.log:2019-05-10 18:45:49,439 - url_helper.py[WARNING]: Calling 'http://192.168.24.1/latest/meta-data/instance-id' failed [101/120s]: request error [HTTPConnectionPool(host='192.168.24.1', port=80): Max retries exceeded with url: /latest/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f28a44d8390>, 'Connection to 192.168.24.1 timed out. (connect timeout=50.0)'))]
cloud-init-output.log:2019-05-10 18:46:07,460 - url_helper.py[WARNING]: Calling 'http://192.168.24.1/latest/meta-data/instance-id' failed [119/120s]: request error [HTTPConnectionPool(host='192.168.24.1', port=80): Max retries exceeded with url: /latest/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f28a44d8ba8>, 'Connection to 192.168.24.1 timed out. (connect timeout=17.0)'))]
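
Both endpoints can be probed by hand from inside the instance; a sketch using the URLs from the logs above:

curl -sv --max-time 10 http://192.168.24.1/latest/meta-data/instance-id
curl -sv --max-time 10 http://169.254.169.254/latest/meta-data/instance-id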

Comment 7 Bob Fournier 2019-05-10 20:14:55 UTC
Also, there's no route to the metadata server 169.254.169.254:
[root@host-192-168-24-97 ~]# ip route
default via 192.168.24.1 dev eth0 proto dhcp metric 100 
192.168.24.0/24 dev eth0 proto kernel scope link src 192.168.24.97 metric 100

versus what we see on a working instance on OSP-13:
[cloud-user@instance1 ~]$ ip route
default via 172.21.2.1 dev em1 proto dhcp metric 100 
169.254.169.254 via 172.21.2.1 dev em1 proto dhcp metric 100 
172.21.2.0/24 dev em1 proto kernel scope link src 172.21.2.206 metric 100

Comment 8 Dmitry Tantsur 2019-05-13 09:33:01 UTC
We should be using the config drive, not the metadata service. I wonder if something has changed that prevents the config drive from being read.
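
For reference, cloud-init's datasource probe order is controlled by datasource_list in /etc/cloud/cloud.cfg (or a drop-in under /etc/cloud/cloud.cfg.d/); a typical entry that prefers the config drive looks roughly like this (whether the image's actual list matches is an assumption worth checking):

datasource_list: [ ConfigDrive, OpenStack, Ec2, None ]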

Comment 9 Derek Higgins 2019-05-16 15:54:51 UTC
Sasha - 
Can you log onto the instance, mount the config drive (it should be a 64 MB partition), and see if the files in there contain your key?
Also, can you take a look at ~/.ssh/id_rsa.pub to make sure it isn't an empty file for some reason?
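
A sketch of that check (assuming the config drive carries the conventional "config-2" label):

sudo mkdir -p /mnt/config
sudo mount -o ro /dev/disk/by-label/config-2 /mnt/config
cat /mnt/config/openstack/latest/meta_data.json    # the injected key should appear under "public_keys"
wc -c ~/.ssh/id_rsa.pub                            # on the undercloud, as the stack user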

Comment 10 Alexander Chuzhoy 2019-05-17 17:04:26 UTC
If the server is created with "--config-drive true", the issue doesn't reproduce.
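
That is, the reproduction command from the description with the flag added:

openstack server create --image overcloud-full --flavor baremetal --nic net-id=$(openstack network show baremetal -f value -c id) --key-name stack-key --config-drive true bmserver1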

Comment 11 Bob Fournier 2019-05-17 17:11:45 UTC
Thanks for testing, Sasha. Derek found that force_config_drive is set to false in OSP-15, where previously it was true; this is because of the OVN change https://review.opendev.org/#/c/502943/. The release note for this change indicates:
features:
  - Adds ability to configure metadata agent for networking-ovn based deployments.
upgrade:
  - force_config_drive is now set to False in Nova. Instances will now fetch their metadata from the metadata service instead from the config drive.


Using the metadata service isn't working properly...
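
For reference, the Nova option in question lives in nova.conf; a sketch of the stanza as it stood before this change:

[DEFAULT]
# When true, Nova always attaches a config drive to new instances,
# so cloud-init does not depend on reaching the metadata service.
force_config_drive = True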

Comment 12 Alexander Chuzhoy 2019-05-17 19:09:54 UTC
During boot the BM instance's console shows:
[   45.726833] cloud-init[1072]: 2019-05-17 19:08:55,628 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [7/120s]: bad status code [404]
[   47.777999] cloud-init[1072]: 2019-05-17 19:08:57,679 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [9/120s]: bad status code [404]
[   49.833946] cloud-init[1072]: 2019-05-17 19:08:59,735 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [11/120s]: bad status code [404]
[   51.884542] cloud-init[1072]: 2019-05-17 19:09:01,785 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [13/120s]: bad status code [404]
[   53.931933] cloud-init[1072]: 2019-05-17 19:09:03,833 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [15/120s]: bad status code [404]
[   56.997295] cloud-init[1072]: 2019-05-17 19:09:06,898 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [18/120s]: bad status code [404]
[   60.046866] cloud-init[1072]: 2019-05-17 19:09:09,948 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [21/120s]: bad status code [404]
[   63.103655] cloud-init[1072]: 2019-05-17 19:09:13,004 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [24/120s]: bad status code [404]
[   66.156222] cloud-init[1072]: 2019-05-17 19:09:16,057 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [28/120s]: bad status code [404]
[   69.208272] cloud-init[1072]: 2019-05-17 19:09:19,109 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [31/120s]: bad status code [404]
[   73.262001] cloud-init[1072]: 2019-05-17 19:09:23,163 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [35/120s]: bad status code [404]
[   77.490188] cloud-init[1072]: 2019-05-17 19:09:27,391 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [39/120s]: bad status code [404]
[   81.543769] cloud-init[1072]: 2019-05-17 19:09:31,444 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [43/120s]: bad status code [404]

Comment 13 Alexander Chuzhoy 2019-05-17 19:10:51 UTC
Note: this also affects the boot time of the BM instance, which is much longer because cloud-init retries the metadata service for up to 120 seconds.

Comment 14 Riccardo Pittau 2019-05-20 07:36:08 UTC
Looking at the logs, it seems the instance is indeed trying to use the correct metadata server IP, but since the route rule for 169.254.169.254 is missing, it can't reach it.
As a consequence, we first get "no route to host" for 169.254.169.254, and then the server tries to fall back to 192.168.24.1 (the gateway) to retrieve the metadata.
The route rule to reach the metadata IP should be added by Neutron during VM creation, specifically by the Neutron DHCP service.
It would be interesting to verify that the metadata service is reachable after manually adding the route rule, to exclude any other possible networking service issue.
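
A sketch of that verification, assuming the gateway (192.168.24.1, which also fronts the metadata per comment 6) is a valid next hop:

sudo ip route add 169.254.169.254 via 192.168.24.1 dev eth0
curl -s --max-time 10 http://169.254.169.254/2009-04-04/meta-data/instance-id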

Comment 16 Bob Fournier 2019-05-21 19:51:02 UTC
Daniel - per Dmitry's comment in https://review.opendev.org/#/c/502943/32/puppet/services/ovn-controller.yaml, is there any downside to reverting the change that set force_config_drive to false?

Comment 21 Bob Fournier 2019-05-29 20:37:02 UTC
Looks like force_config_drive is now set to true with the latest compose (5/29):

[root@undercloud-0 etc]# rpm -qa | grep tripleo-heat
openstack-tripleo-heat-templates-10.5.1-0.20190528190414.f41e7e4.1.el8ost.noarch

[root@undercloud-0 etc]# grep -r force_config_drive *
puppet/hieradata/service_configs.json:    "nova::compute::force_config_drive": true,

Comment 22 Bob Fournier 2019-05-30 16:11:10 UTC
force_config_drive is also true on the controller nodes:

[heat-admin@controller-0 puppet]$ sudo grep -r force_config_drive *
hieradata/service_configs.json:    "nova::compute::force_config_drive": true,

Comment 23 mlammon 2019-06-04 13:49:48 UTC
Env:

openstack-tripleo-heat-templates-10.5.1-0.20190528190414.f41e7e4.1.el8ost.noarch

(undercloud) [stack@undercloud-0 ~]$ sudo grep -r force_config_drive /etc/puppet/hieradata/service_configs.json
    "nova::compute::force_config_drive": true

The latest test no longer sees the issue; ssh to the instance succeeds.
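
For reference, a minimal re-check mirrors the reproduction steps from the description, with no --config-drive flag needed now that force_config_drive is true:

openstack server create --image overcloud-full --flavor baremetal --nic net-id=$(openstack network show baremetal -f value -c id) --key-name stack-key bmserver1
ssh cloud-user@<instance-ip>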

Comment 26 errata-xmlrpc 2019-09-21 11:21:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811