Bug 1661052 - Failure to deploy overcloud due to failure to generate ssh key
Summary: Failure to deploy overcloud due to failure to generate ssh key
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 14.0 (Rocky)
Assignee: Cédric Jeanneret
QA Contact: Gurenko Alex
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-12-20 00:49 UTC by David Hill
Modified: 2019-04-08 13:33 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-08 13:33:29 UTC
Target Upstream Version:



Description David Hill 2018-12-20 00:49:15 UTC
Description of problem:
Failure to deploy overcloud due to failure to generate ssh key:

 Stack overcloud/88f21287-7329-4e8f-b80c-1dec19038c18 CSaving key "/tmp/tmpvmlQbw/id_rsa" failed: Permission denied
Generating public/private rsa key pair.
Command '['ssh-keygen', '-N', '', '-t', 'rsa', '-b', '4096', '-f', '/tmp/tmpvmlQbw/id_rsa', '-C', 'TripleO split stack short term key']' returned non-zero exit status 1
REATE_COMPLETE 


Version-Release number of selected component (if applicable):


How reproducible:
Intermittent

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 David Hill 2018-12-20 13:10:18 UTC
Looks like we're failing here:

function generate_short_term_keys {
    local tmpdir=$(mktemp -d)
    ssh-keygen -N '' -t rsa -b 4096 -f "$tmpdir/id_rsa" -C "$SHORT_TERM_KEY_COMMENT" > /dev/null
    echo "$tmpdir"
}

Comment 2 Emilien Macchi 2018-12-28 10:47:34 UTC
The bug report is incomplete. Please describe how to reproduce the issue, which version you are using, and whether you performed any manual actions on the SSH keys beforehand.

Comment 3 David Hill 2019-01-02 03:39:56 UTC
I'm running the following script [1] with the following templates [2], using the latest RHEL 7.6 KVM guest image for the undercloud and the latest puddles. The overcloud appears to deploy successfully, since the heat stack is CREATE_COMPLETE:

(undercloud) [stack@undercloud-0-rhosp14 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| id                                   | stack_name | stack_status    | creation_time        | updated_time | project                          |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| 7c88e9f9-7e36-4335-97ea-47fb45f28216 | overcloud  | CREATE_COMPLETE | 2018-12-31T17:17:27Z | None         | 94fd1d4c095d44ecb0a85f7b51241a25 |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+

but I'm failing here:

Deploying overcloud                                         error (2272s)

This means "openstack overcloud deploy" didn't return 0.

This is the last output from the "openstack overcloud deploy" command:

2018-12-31 17:44:12Z [overcloud.AllNodesDeploySteps.BlockStorageDeployment_Step5]: CREATSaving key "/tmp/tmpldn4bS/id_rsa" failed: Permission denied
Generating public/private rsa key pair.
Command '['ssh-keygen', '-N', '', '-t', 'rsa', '-b', '4096', '-f', '/tmp/tmpldn4bS/id_rsa', '-C', 'TripleO split stack short term key']' returned non-zero exit status 1
E_IN_PROGRESS  state changed
2018-12-31 17:44:12Z [overcloud.AllNodesDeploySteps.ObjectStorageDeployment_Step5]: CREATE_COMPLETE  state changed
2018-12-31 17:44:12Z [overcloud.AllNodesDeploySteps.BlockStorageDeployment_Step5]: CREATE_COMPLETE  state changed
2018-12-31 17:44:12Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step5]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:12Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step5]: CREATE_COMPLETE  state changed
2018-12-31 17:44:12Z [overcloud.AllNodesDeploySteps.ComputeDeployment_Step5]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:13Z [overcloud.AllNodesDeploySteps.ComputeDeployment_Step5]: CREATE_COMPLETE  state changed
2018-12-31 17:44:13Z [overcloud.AllNodesDeploySteps.CephStorageDeployment_Step5]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:13Z [overcloud.AllNodesDeploySteps.CephStorageDeployment_Step5]: CREATE_COMPLETE  state changed
2018-12-31 17:44:14Z [overcloud.AllNodesDeploySteps.ComputeExtraConfigPost]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:14Z [overcloud.AllNodesDeploySteps.ObjectStorageExtraConfigPost]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:14Z [overcloud.AllNodesDeploySteps.ControllerExtraConfigPost]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:14Z [overcloud.AllNodesDeploySteps.BlockStorageExtraConfigPost]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:14Z [overcloud.AllNodesDeploySteps.CephStorageExtraConfigPost]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:17Z [overcloud.AllNodesDeploySteps.ComputeExtraConfigPost]: CREATE_COMPLETE  state changed
2018-12-31 17:44:17Z [overcloud.AllNodesDeploySteps.ObjectStorageExtraConfigPost]: CREATE_COMPLETE  state changed
2018-12-31 17:44:17Z [overcloud.AllNodesDeploySteps.ControllerExtraConfigPost]: CREATE_COMPLETE  state changed
2018-12-31 17:44:18Z [overcloud.AllNodesDeploySteps.BlockStorageExtraConfigPost]: CREATE_COMPLETE  state changed
2018-12-31 17:44:19Z [overcloud.AllNodesDeploySteps.CephStorageExtraConfigPost]: CREATE_COMPLETE  state changed
2018-12-31 17:44:20Z [overcloud.AllNodesDeploySteps.CephStoragePostConfig]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:20Z [overcloud.AllNodesDeploySteps.BlockStoragePostConfig]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:20Z [overcloud.AllNodesDeploySteps.CephStoragePostConfig]: CREATE_COMPLETE  state changed
2018-12-31 17:44:20Z [overcloud.AllNodesDeploySteps.BlockStoragePostConfig]: CREATE_COMPLETE  state changed
2018-12-31 17:44:20Z [overcloud.AllNodesDeploySteps.ComputePostConfig]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:20Z [overcloud.AllNodesDeploySteps.ControllerPostConfig]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:20Z [overcloud.AllNodesDeploySteps.ComputePostConfig]: CREATE_COMPLETE  state changed
2018-12-31 17:44:20Z [overcloud.AllNodesDeploySteps.ControllerPostConfig]: CREATE_COMPLETE  state changed
2018-12-31 17:44:20Z [overcloud.AllNodesDeploySteps.ObjectStoragePostConfig]: CREATE_IN_PROGRESS  state changed
2018-12-31 17:44:20Z [overcloud.AllNodesDeploySteps.ObjectStoragePostConfig]: CREATE_COMPLETE  state changed
2018-12-31 17:44:21Z [overcloud.AllNodesDeploySteps]: CREATE_COMPLETE  Stack CREATE completed successfully
2018-12-31 17:44:21Z [overcloud.AllNodesDeploySteps]: CREATE_COMPLETE  state changed
2018-12-31 17:44:21Z [overcloud]: CREATE_COMPLETE  Stack CREATE completed successfully

 Stack overcloud/7c88e9f9-7e36-4335-97ea-47fb45f28216 CREATE_COMPLETE 

Deploying overcloud configuration
Enabling ssh admin (tripleo-admin) for hosts:
192.0.2.15 192.0.2.8 192.0.2.12 192.0.2.22
Using ssh user heat-admin for initial connection.
Using ssh key at /home/stack/.ssh/id_rsa for initial connection.
Removing short term keys locally


I'm using the following templates:

(undercloud) [stack@undercloud-0-rhosp14 ~]$ rpm -qa | grep heat-templ
openstack-tripleo-heat-templates-9.0.1-0.20181013060906.el7ost.noarch




[1] https://github.com/david-hill/cloud/blob/14.0/create_undercloud.sh
[2] https://github.com/david-hill/rhosp14/tree/14.0-internal

Comment 4 David Hill 2019-01-02 03:41:30 UTC
If I re-run this, it completes successfully and returns 0, which is what I'd expect to happen the first time I run it.

Comment 5 David Hill 2019-01-02 03:42:48 UTC
[root@undercloud-0-rhosp14 audit]# grep denied *
[root@undercloud-0-rhosp14 audit]#

Comment 6 Jill Rouleau 2019-01-02 21:15:51 UTC
There's quite a bit of obfuscation going on in the script; it would help if we had the actual overcloud deploy command that can reproduce this issue. Are there any unusual permissions on /tmp on the undercloud, or can you capture the permissions on the $tmpdir before it is deleted (or temporarily comment out the deletion step [0]) on an occasion where ssh-keygen fails?

[0] https://github.com/openstack/tripleo-heat-templates/blob/a0b72fa415d57171621144e104bac561cf9ef211/deployed-server/scripts/enable-ssh-admin.sh#L93
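
The permission capture suggested above could look roughly like the following. This is a minimal sketch, not the upstream script: describe_dir and keygen_with_diagnostics are hypothetical names, the key comment string is a placeholder, and the -Z option assumes GNU ls. Unlike the function quoted in comment 1, it reports the directory state when ssh-keygen fails and leaves the directory in place for inspection.

```shell
#!/bin/bash
# Hypothetical debugging helpers; not part of enable-ssh-admin.sh.

# Print mode, owner and, where supported, the SELinux context of a directory.
describe_dir() {
    local dir=$1
    # GNU ls adds the SELinux context with -Z; fall back to plain -ld otherwise.
    ls -ldZ -- "$dir" 2>/dev/null || ls -ld -- "$dir"
}

# Generate a throwaway key pair. On failure, report the directory state on
# stderr and keep the directory instead of deleting it.
keygen_with_diagnostics() {
    local tmpdir
    tmpdir=$(mktemp -d) || return 1
    if ! ssh-keygen -q -N '' -t rsa -b 4096 -f "$tmpdir/id_rsa" -C 'debug key' >/dev/null; then
        {
            echo "ssh-keygen failed; directory state:"
            describe_dir "$tmpdir"
            id
        } >&2
        return 1
    fi
    echo "$tmpdir"
}
```

Calling keygen_with_diagnostics from the same context that runs the deployment should show whether the directory's ownership, mode, or label differs on the occasions where key generation fails.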

Comment 10 David Hill 2019-01-07 22:07:49 UTC
It looks like it might be an SELinux issue after all: I noticed the behavior changed recently, and SELinux was no longer set to permissive.

I got the following denied AVCs:

audit.log.2:type=AVC msg=audit(1546897000.464:230073): avc:  denied  { write } for  pid=332637 comm="ssh-keygen" name="tmpi54FXC" dev="vda1" ino=102719051 scontext=system_u:system_r:ssh_keygen_t:s0 tcontext=system_u:object_r:initrc_tmp_t:s0 tclass=dir
audit.log.2:type=AVC msg=audit(1546897000.464:230073): avc:  denied  { add_name } for  pid=332637 comm="ssh-keygen" name="id_rsa" scontext=system_u:system_r:ssh_keygen_t:s0 tcontext=system_u:object_r:initrc_tmp_t:s0 tclass=dir
audit.log.2:type=AVC msg=audit(1546897000.464:230073): avc:  denied  { create } for  pid=332637 comm="ssh-keygen" name="id_rsa" scontext=system_u:system_r:ssh_keygen_t:s0 tcontext=system_u:object_r:initrc_tmp_t:s0 tclass=file
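
All three records above have the same source type (ssh_keygen_t) and target type (initrc_tmp_t). To make that pattern easier to spot in a longer log, each record can be reduced to its key fields. This is only a sketch: summarize_avc is a hypothetical helper name, and the sed expression assumes the record layout shown above.

```shell
#!/bin/bash
# Hypothetical helper: reduce AVC denial records read from stdin to
# "<comm> <permission> <source type> <target type> <object class>".
summarize_avc() {
    sed -n 's/.*avc:  *denied  *{ \([^}]*\) }.*comm="\([^"]*\)".*scontext=[^:]*:[^:]*:\([^:]*\):.*tcontext=[^:]*:[^:]*:\([^:]*\):.* tclass=\([a-z_]*\).*/\2 \1 \3 \4 \5/p'
}

# Example invocation (the log path is the usual auditd default):
#   grep -h 'comm="ssh-keygen"' /var/log/audit/audit.log* | summarize_avc
```

For the first record above this prints "ssh-keygen write ssh_keygen_t initrc_tmp_t dir".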


Previously I was setting SELinux to permissive using this:
puppet-stack-config/os-apply-config/etc/puppet/hieradata/CentOS.yaml:tripleo::selinux::mode: permissive
puppet-stack-config/os-apply-config/etc/puppet/hieradata/RedHat.yaml:tripleo::selinux::mode: permissive

but for some reason it looks like it's no longer effective. I added a custom hiera_data.yaml file to undercloud.conf that contains:
tripleo::selinux::mode: permissive

and redeployed the undercloud, then redeployed the overcloud...
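
For anyone wanting to apply the same workaround, the override file can be created with a few lines of shell. This is a sketch under assumptions: write_selinux_override is a hypothetical helper name, the file path in the example is illustrative, and wiring the file in through the hieradata_override option of undercloud.conf is my reading of the step described above, not something the report spells out.

```shell
#!/bin/bash
# Hypothetical helper: write a hiera override that sets SELinux to permissive
# on the undercloud. The key is the one quoted in this report.
write_selinux_override() {
    local target=$1
    cat > "$target" <<'EOF'
tripleo::selinux::mode: permissive
EOF
}

# Example (illustrative path):
#   write_selinux_override /home/stack/hiera_data.yaml
# and, in undercloud.conf (assumed option name):
#   hieradata_override = /home/stack/hiera_data.yaml
```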

Comment 11 David Hill 2019-01-08 13:27:11 UTC
So it was an SELinux problem as mentioned above: after restoring permissive mode, I successfully deployed an overcloud and tested it:

[jenkins@zappa linux-stable-new]$ bash reproduce_rhosp14.sh

Fetching image                                              done (97s)
Copying base image                                          done (144s)
Resizing base disk                                          done (0s)
Customizing image                                           done (110s)
Waiting for VM to come up                                   done (18s)
Waiting for SSH to come up                                  done (265s)
Creating VMs for control                                    done (58s)
Creating VMs for compute                                    done (10s)
Creating VMs for ceph                                       done (40s)
Resuming stopped vbmc engines                               done (20s)
Waiting for VM to reboot                                    done (729s)
Copying instackenv to 192.168.122.2                         done (1s)
Sending overcloud images to undercloud                      done (47s)
Waiting for undercloud deployment                           done (3957s)
Getting new images                                          done (1391s)
Uploading RHEL image                                        done (11s)
Waiting for introspection                                   done (480s)
Waiting for overcloud deployment                            done (7231s)
Waiting for overcloud test                                  done (810s)
Reproduce 0

Comment 12 Cédric Jeanneret 2019-01-29 14:38:11 UTC
Hello,

It's weird: the AVC points to a type (initrc_tmp_t) that doesn't match the one (tmp_t) in the "ls -laZd" output you provided above.

Care to provide the versions of your SELinux-related packages? I'll try to reproduce it in my lab and dig into this weird type a bit.

Cheers,

C.

Comment 13 Cédric Jeanneret 2019-01-29 17:31:03 UTC
Hello,

So I'm trying to reproduce that on RHEL 7.6 with SELinux enforcing, but I'm unable to get this error. Here's the relevant information from my env:

container-selinux-2.74-1.el7.noarch
libselinux-2.5-14.1.el7.x86_64
libselinux-python-2.5-14.1.el7.x86_64
libselinux-ruby-2.5-14.1.el7.x86_64
libselinux-utils-2.5-14.1.el7.x86_64
openstack-selinux-0.8.15-1.el7ost.noarch
openvswitch-selinux-extra-policy-1.0-9.el7fdp.noarch
selinux-policy-3.13.1-229.el7_6.6.noarch
selinux-policy-targeted-3.13.1-229.el7_6.6.noarch

python-tripleoclient-10.6.1-0.20181010222413.8c8f259.el7ost.noarch

Interesting log part:
Starting ssh admin enablement workflow
ssh admin enablement workflow - RUNNING.
ssh admin enablement workflow - RUNNING.
ssh admin enablement workflow - RUNNING.
ssh admin enablement workflow - RUNNING.
ssh admin enablement workflow - COMPLETE.
Removing TripleO short term key from 192.168.24.7
Warning: Permanently added '192.168.24.7' (ECDSA) to the list of known hosts.
Removing TripleO short term key from 192.168.24.14
Warning: Permanently added '192.168.24.14' (ECDSA) to the list of known hosts.
Removing short term keys locally
Enabling ssh admin - COMPLETE.
Waiting for messages on queue 'tripleo' with no timeout.
Config downloaded at /var/lib/mistral/overcloud
Inventory generated at /var/lib/mistral/overcloud/tripleo-ansible-inventory.yaml
Running ansible playbook at /var/lib/mistral/overcloud/deploy_steps_playbook.yaml. See log file at /var/lib/mistral/overcloud/ansible.log for progress. ...

Using /var/lib/mistral/overcloud/ansible.cfg as config file

PLAY [Gather facts from undercloud] ********************************************


Care to provide some info about your versions?

Cheers,

C.

Comment 14 David Hill 2019-01-30 21:35:24 UTC
Hello, 

   You'll be able to reproduce this only if you start your deployment at boot using an init script started via systemd. I'm pretty sure I must be the only customer trying that, so I wouldn't spend much time on it, but if you want to allow that SELinux AVC in the policies, it would save me from messing with SELinux modes...

Thank you very much,

David Hill
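
One way to check whether the launch method is what changes the label is to run the same small diagnostic from an interactive shell and from the systemd-started script, then compare the output. This is a sketch: show_contexts is a hypothetical helper name, and it assumes GNU ls and id. Based on the AVC records in comment 10, and not on a reproduced run, the systemd-launched case would be expected to show initrc_tmp_t on the directory.

```shell
#!/bin/bash
# Hypothetical diagnostic: report the SELinux context of the current process
# and of a freshly created temporary directory.
show_contexts() {
    local tmpdir
    tmpdir=$(mktemp -d) || return 1
    # id -Z only works on SELinux-enabled kernels; report "unknown" elsewhere.
    echo "process: $(id -Z 2>/dev/null || echo unknown)"
    # With GNU ls, -dZ prints the directory's context followed by its path.
    echo "tmpdir:  $(ls -dZ -- "$tmpdir" 2>/dev/null || echo unknown)"
    rmdir -- "$tmpdir"
}
```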

Comment 15 Cédric Jeanneret 2019-01-31 07:01:50 UTC
Hello David,

Interesting use-case indeed. Care to share your systemd unit? I'd be interested in its content, I think there might be a way to set some proper selinux things in there directly.

I'm afraid allowing ssh_keygen_t to write in initrc_tmp_t goes a bit too far, especially for a one-off use case like that. We'd better find a solution within the unit or the script it launches, or maybe with some wrapper.

Cheers,

C.

Comment 16 David Hill 2019-01-31 13:40:48 UTC
Hey Cedric,

   I've investigated that side myself but couldn't find anything, though perhaps my Google searches were not adequate. Here is the service file I'm using [1].

Thanks,

Dave


[1] https://github.com/david-hill/cloud/blob/master/customize.service

Comment 17 David Hill 2019-01-31 13:43:59 UTC
Or maybe I simply need to move my /etc/rc.d files to some other location that is unconfined?

Comment 18 Cédric Jeanneret 2019-01-31 13:54:49 UTC
Hey,

Maybe you can try moving the script to /usr/local/bin, where it gets standard confinement?

Cheers,

C.

Comment 19 Cédric Jeanneret 2019-02-19 07:34:03 UTC
Hello David,

Any news on that?

Cheers,

C.

Comment 20 David Hill 2019-02-19 13:50:59 UTC
Hey Cedric,

    I haven't had time to try moving the files to /usr/local/bin, as I found a workaround (permissive SELinux). I'll do this as soon as I can find the time between two cases.

Thank you very much,

David Hill

