Bug 1503495

Summary: OSP11 -> OSP12 upgrade: major-upgrade-composable-steps-docker on split stack deployments with ceph nodes - ceph-ansible workflow fails with Permission denied
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates Assignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA QA Contact: Marius Cornea <mcornea>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 12.0 (Pike) CC: dbecker, emacchi, gfidente, jomurphy, jschluet, mbultel, mburns, morazi, rhel-osp-director-maint, scohen, yprokule
Target Milestone: beta Keywords: Triaged
Target Release: 12.0 (Pike)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-7.0.3-0.20171024200823.el7ost Doc Type: If docs needed, set a value
Last Closed: 2017-12-13 22:17:19 UTC Type: Bug

Description Marius Cornea 2017-10-18 09:31:15 UTC
Description of problem:
OSP11 -> OSP12 upgrade: major-upgrade-composable-steps-docker on split stack deployments with ceph nodes - ceph-ansible workflow fails with Permission denied

I suspect this happens because the split stack nodes are not reachable via the heat-admin user but via a user called 'stack'.

(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list overcloud --long
overcloud.AllNodesDeploySteps.AllNodesPostUpgradeSteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::Mistral::ExternalResource
  physical_resource_id: 95f4b38d-34f2-4d13-b419-53e1d672c2e1
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR

 [root@undercloud-0 stack]# cat /var/log/mistral/ceph-install-workflow.log 
2017-10-18 05:15:01,832 p=21843 u=mistral |  [DEPRECATION WARNING]: docker is kept for backwards compatibility but usage is 
discouraged. The module documentation details page may explain more about this 
rationale..
This feature will be removed in a future release. Deprecation 
warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
2017-10-18 05:15:03,404 p=21843 u=mistral |  PLAY [confirm whether user really meant to switch from non-containerized to containerized ceph daemons] ***
2017-10-18 05:15:03,431 p=21843 u=mistral |  TASK [exit playbook, if user did not mean to switch from non-containerized to containerized daemons?] ***
2017-10-18 05:15:03,448 p=21843 u=mistral |  skipping: [localhost]
2017-10-18 05:15:03,471 p=21843 u=mistral |  PLAY [make sure docker is present and started] *********************************
2017-10-18 05:15:03,478 p=21843 u=mistral |  TASK [Gathering Facts] *********************************************************
2017-10-18 05:15:03,980 p=21843 u=mistral |  fatal: [192.168.0.51]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nWarning: Permanently added '192.168.0.51' (ECDSA) to the list of known hosts.\r\nPermission denied (publickey,gssapi-keyex,gssapi-with-mic).\r\n", "unreachable": true}
2017-10-18 05:15:03,981 p=21843 u=mistral |  fatal: [192.168.0.64]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nWarning: Permanently added '192.168.0.64' (ECDSA) to the list of known hosts.\r\nPermission denied (publickey,gssapi-keyex,gssapi-with-mic).\r\n", "unreachable": true}
2017-10-18 05:15:03,984 p=21843 u=mistral |  fatal: [192.168.0.65]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nWarning: Permanently added '192.168.0.65' (ECDSA) to the list of known hosts.\r\nPermission denied (publickey,gssapi-keyex,gssapi-with-mic).\r\n", "unreachable": true}
2017-10-18 05:15:03,986 p=21843 u=mistral |  fatal: [192.168.0.50]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nWarning: Permanently added '192.168.0.50' (ECDSA) to the list of known hosts.\r\nPermission denied (publickey,gssapi-keyex,gssapi-with-mic).\r\n", "unreachable": true}
2017-10-18 05:15:03,987 p=21843 u=mistral |  fatal: [192.168.0.63]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nWarning: Permanently added '192.168.0.63' (ECDSA) to the list of known hosts.\r\nPermission denied (publickey,gssapi-keyex,gssapi-with-mic).\r\n", "unreachable": true}
2017-10-18 05:15:03,990 p=21843 u=mistral |  fatal: [192.168.0.52]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nWarning: Permanently added '192.168.0.52' (ECDSA) to the list of known hosts.\r\nPermission denied (publickey,gssapi-keyex,gssapi-with-mic).\r\n", "unreachable": true}
2017-10-18 05:15:03,991 p=21843 u=mistral |  PLAY RECAP *********************************************************************
2017-10-18 05:15:03,991 p=21843 u=mistral |  192.168.0.50               : ok=0    changed=0    unreachable=1    failed=0   
2017-10-18 05:15:03,991 p=21843 u=mistral |  192.168.0.51               : ok=0    changed=0    unreachable=1    failed=0   
2017-10-18 05:15:03,992 p=21843 u=mistral |  192.168.0.52               : ok=0    changed=0    unreachable=1    failed=0   
2017-10-18 05:15:03,992 p=21843 u=mistral |  192.168.0.63               : ok=0    changed=0    unreachable=1    failed=0   
2017-10-18 05:15:03,992 p=21843 u=mistral |  192.168.0.64               : ok=0    changed=0    unreachable=1    failed=0   
2017-10-18 05:15:03,992 p=21843 u=mistral |  192.168.0.65               : ok=0    changed=0    unreachable=1    failed=0   
2017-10-18 05:15:03,992 p=21843 u=mistral |  localhost                  : ok=0    changed=0    unreachable=0    failed=0   
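
When triaging a log like the one above, it can help to pull out just the hosts that Ansible marked UNREACHABLE. The following is a minimal Python sketch, not part of TripleO or ceph-ansible; the function name `unreachable_hosts` and the abbreviated sample lines (the `msg` field is omitted) are illustrative assumptions.

```python
import re

# Matches the "fatal: [host]: UNREACHABLE!" lines Ansible writes to the log.
UNREACHABLE_RE = re.compile(r"fatal: \[(?P<host>[^\]]+)\]: UNREACHABLE!")


def unreachable_hosts(log_text):
    """Return the sorted, de-duplicated hosts reported as UNREACHABLE."""
    return sorted({m.group("host") for m in UNREACHABLE_RE.finditer(log_text)})


# Abbreviated sample modeled on ceph-install-workflow.log (msg field omitted).
SAMPLE_LOG = (
    "2017-10-18 05:15:03,448 p=21843 u=mistral |  skipping: [localhost]\n"
    "2017-10-18 05:15:03,980 p=21843 u=mistral |  fatal: [192.168.0.51]: "
    'UNREACHABLE! => {"changed": false, "unreachable": true}\n'
    "2017-10-18 05:15:03,986 p=21843 u=mistral |  fatal: [192.168.0.50]: "
    'UNREACHABLE! => {"changed": false, "unreachable": true}\n'
)

if __name__ == "__main__":
    for host in unreachable_hosts(SAMPLE_LOG):
        print(host)
```

Running the same function over the full /var/log/mistral/ceph-install-workflow.log contents would list every node that rejected the SSH connection.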


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.3-0.20171014102841.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11 split stack deployment with Ceph nodes
2. Upgrade to OSP12

Actual results:
Upgrade fails while running the ceph-ansible playbook because the split stack nodes are unreachable.

Expected results:
With split stack nodes, ansible should use the 'stack' user, per the documentation.

Additional info:

Comment 1 Marius Cornea 2017-10-18 10:06:59 UTC
The ansible command appears to be the one below, so it's trying to use the tripleo-admin user to reach the nodes:

Command: ansible-playbook /usr/share/ceph-ansible/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml --user tripleo-admin --become --become-user root --extra-vars {"monitor_secret": "***", "ceph_conf_overrides": {"global": {"rgw_s3_auth_use_keystone": "true", "rgw_keystone_admin_password": "***", "osd_pool_default_pgp_num": 128, "rgw_keystone_url": "http://10.0.0.16:5000", "rgw_keystone_admin_project": "service", "rgw_keystone_accepted_roles": "Member, _member_, admin", "osd_pool_default_size": 3, "osd_pool_default_pg_num": 128, "rgw_keystone_api_version": 3, "rgw_keystone_admin_user": "swift", "rgw_keystone_admin_domain": "default"}}, "fetch_directory": "/tmp/file-mistral-actionWim7Dr", "user_config": true, "ceph_docker_image_tag": "latest", "ceph_release": "jewel", "containerized_deployment": true, "public_network": "10.0.0.128/25", "generate_fsid": false, "monitor_address_block": "10.0.0.128/25", "admin_secret": "***", "keys": [{"mon_cap": "allow r", "osd_cap": "allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=backups, allow rwx pool=vms, allow rwx pool=images, allow rwx pool=metrics", "name": "client.openstack", "key": "AQAg6+ZZAAAAABAAlyZ5Uw/EYSLGW1gZ0YxXiQ==", "mode": "0644"}, {"mon_cap": "allow r, allow command \\\\\\\"auth del\\\\\\\", allow command \\\\\\\"auth caps\\\\\\\", allow command \\\\\\\"auth get\\\\\\\", allow command \\\\\\\"auth get-or-create\\\\\\\"", "mds_cap": "allow *", "name": "client.manila", "mode": "0644", "key": "AQAg6+ZZAAAAABAA3lxxib+rri81tZGrM3pDog==", "osd_cap": "allow rw"}, {"mon_cap": "allow rw", "osd_cap": "allow rwx", "name": "client.radosgw", "key": "AQAg6+ZZAAAAABAA4VsKb08ebtlPTwFqHqcHZQ==", "mode": "0644"}], "openstack_keys": [{"mon_cap": "allow r", "osd_cap": "allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=backups, allow rwx pool=vms, allow rwx pool=images, allow rwx pool=metrics", "name": "client.openstack", "key": "AQAg6+ZZAAAAABAAlyZ5Uw/EYSLGW1gZ0YxXiQ==", "mode": "0644"}, {"mon_cap": "allow r, allow command \\\\\\\"auth del\\\\\\\", allow command \\\\\\\"auth caps\\\\\\\", allow command \\\\\\\"auth get\\\\\\\", allow command \\\\\\\"auth get-or-create\\\\\\\"", "mds_cap": "allow *", "name": "client.manila", "mode": "0644", "key": "AQAg6+ZZAAAAABAA3lxxib+rri81tZGrM3pDog==", "osd_cap": "allow rw"}, {"mon_cap": "allow rw", "osd_cap": "allow rwx", "name": "client.radosgw", "key": "AQAg6+ZZAAAAABAA4VsKb08ebtlPTwFqHqcHZQ==", "mode": "0644"}], "osd_objectstore": "filestore", "pools": [], "ntp_service_enabled": false, "ceph_docker_image": "ceph/rhceph-2-rhel7", "cluster_network": "10.0.1.0/25", "fsid": "efcfa5d6-b3c7-11e7-979d-525400fe98cd", "openstack_config": true, "ceph_docker_registry": "192.168.0.1:8787", "ceph_stable": true, "devices": ["/dev/vdb", "/dev/vdc"], "ceph_origin": "distro", "openstack_pools": [{"rule_name": "", "pg_num": 128, "name": "volumes"}, {"rule_name": "", "pg_num": 128, "name": "backups"}, {"rule_name": "", "pg_num": 128, "name": "vms"}, {"rule_name": "", "pg_num": 128, "name": "images"}, {"rule_name": "", "pg_num": 128, "name": "metrics"}], "ip_version": "ipv4", "ireallymeanit": "yes", "docker": true} --forks 8 --ssh-common-args "-o StrictHostKeyChecking=no" --ssh-extra-args "-o UserKnownHostsFile=/dev/null" --inventory-file /tmp/ansible-mistral-actionSQbLV7/inventory.yaml --private-key /tmp/ansible-mistral-actionSQbLV7/ssh_private_key --skip-tags package-install,with_pkg
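
The connection-related options are easy to lose in a command line this long. As a rough aid, the Python sketch below extracts the SSH user, private key and inventory path from a logged command string. The helper name `connection_options` is hypothetical, and the sample command is abbreviated (the --extra-vars payload and several flags are left out).

```python
import re

# Only match the option when it starts a token, so "--become-user" is not
# mistaken for "--user".
OPTION_RE = re.compile(r"(?<!\S)--(user|private-key|inventory-file)\s+(\S+)")


def connection_options(command):
    """Return a dict of the connection-related ansible-playbook options."""
    return {name: value for name, value in OPTION_RE.findall(command)}


# Abbreviated form of the logged command; --extra-vars and other flags omitted.
SAMPLE_COMMAND = (
    "ansible-playbook /usr/share/ceph-ansible/infrastructure-playbooks/"
    "switch-from-non-containerized-to-containerized-ceph-daemons.yml "
    "--user tripleo-admin --become --become-user root --forks 8 "
    "--inventory-file /tmp/ansible-mistral-actionSQbLV7/inventory.yaml "
    "--private-key /tmp/ansible-mistral-actionSQbLV7/ssh_private_key"
)

if __name__ == "__main__":
    for name, value in sorted(connection_options(SAMPLE_COMMAND).items()):
        print(f"{name}: {value}")
```

For the command in this comment, the `user` entry is tripleo-admin, which is the account the split stack nodes do not have.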

Comment 2 Marius Cornea 2017-10-18 10:12:07 UTC
It looks like the workflow which creates the tripleo-admin user tries to use the nova servers but for split stack there are no nova servers:

in /var/log/mistral/engine.log:
2017-10-18 05:14:57.477 12036 INFO workflow_trace [req-9a3fe81b-734f-4370-aca6-f5c452a5ca69 1f7068fda10c46ebb15865f3160fe5b5 08da50fc73114b118f112d645e8631dd - default default] Starting workflow [name=tripleo.access.v1.create_admin_via_nova, input={tasks: [{u'name': u'create user tripleo-admin', u'user': {u'name': u'tripleo-admin'}}, {u'copy': {u'dest': u'/etc/sudoers.d/tripleo-admin', u'content': u'tripleo-admin ALL=(ALL) NOPASSWD:ALL\n', u'mode': 288}, u'name': u'grant admin rights to user tripleo-admin'}, {u'name': u'ensure .ssh dir exists for user tripleo-admin', u'file': {u'owner': u'tripleo-admin', u'path': u'/home/tripleo-admin/.ssh', u'state': u'directory', u'group': u'tripleo-admin', u'mode': 448}}, {u'name': u'ensure authorized_keys file exists for user tripleo-admin', u'file': {u'owner': u'tripleo-admin', u'path': u'/home/tripleo-admin/.ssh/authorized_keys', u'state': u'touch', u'group': u'tripleo-admin', u'mode': 448}}, {u'lineinfile': {u'path': u'/home/tripleo-admin/.ssh/authorized_keys', u'line': u'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCxSo8pgiWjuwaE2nOS4gC1TKUILxyKVwTPFc1mlsqRN/3D7BQxRAen2Uy5r4ksGc975QbD2hg12cWTdVs/QrTdZU+408fjRyTA6jxPTOdALAgUvoxQriPC+fITZTIsfBPpM7/qO+jMrdnVTKQtTqG8wg+ZwIxZlOLpT+Q2FuMmtt3HGt5Co33RZTmRuZRUQe9A6hcxbEx3UySIkfCt5X1nNEy/vRRvS8Crm0au9OKQpeWFCwAz5ReczwPYFv7Q+rRryUgdWNUuUVUKJBpLekXV2fFOIx5s627QkmglL7kDnOUCOXvN6Ie30CFn4k48YCldhQTb44uE2O1JlC8JUaxD Generated by TripleO', u'regexp': u'Generated by TripleO'}, u'name': u'authorize TripleO Mistral key for user tripleo-admin'}]...] 
2017-10-18 05:14:57.488 12036 INFO workflow_trace [req-9a3fe81b-734f-4370-aca6-f5c452a5ca69 1f7068fda10c46ebb15865f3160fe5b5 08da50fc73114b118f112d645e8631dd - default default] Workflow 'tripleo.access.v1.create_admin_via_nova' [IDLE -> RUNNING, msg=None] (execution_id=7d0b714c-29cb-4a05-9448-a69315150849)
2017-10-18 05:14:57.881 12036 INFO mistral.engine.engine_server [req-9a3fe81b-734f-4370-aca6-f5c452a5ca69 1f7068fda10c46ebb15865f3160fe5b5 08da50fc73114b118f112d645e8631dd - default default] Received RPC request 'on_action_complete'[action_ex_id=dfe75f79-b7cf-4c68-a8f9-e4493dbf3b81, result=Result [data=[], error=None, cancel=False]]
2017-10-18 05:14:57.898 12036 INFO workflow_trace [req-9a3fe81b-734f-4370-aca6-f5c452a5ca69 1f7068fda10c46ebb15865f3160fe5b5 08da50fc73114b118f112d645e8631dd - default default] Action 'nova.servers_list' (dfe75f79-b7cf-4c68-a8f9-e4493dbf3b81)(task=get_servers) [RUNNING -> SUCCESS, result = []] 
2017-10-18 05:14:57.971 12036 INFO workflow_trace [req-9a3fe81b-734f-4370-aca6-f5c452a5ca69 1f7068fda10c46ebb15865f3160fe5b5 08da50fc73114b118f112d645e8631dd - default default] Task 'get_servers' (714174f5-4eb9-472a-a756-3adf50c542d6) [RUNNING -> SUCCESS, msg=None] (execution_id=7d0b714c-29cb-4a05-9448-a69315150849)
2017-10-18 05:14:58.069 12036 INFO workflow_trace [req-9a3fe81b-734f-4370-aca6-f5c452a5ca69 1f7068fda10c46ebb15865f3160fe5b5 08da50fc73114b118f112d645e8631dd - default default] Task 'create_admin' (181dba13-9170-4c42-a483-cd36cb4313e5) [RUNNING -> SUCCESS, msg=None] (execution_id=7d0b714c-29cb-4a05-9448-a69315150849)
2017-10-18 05:14:58.901 12036 INFO workflow_trace [req-9a3fe81b-734f-4370-aca6-f5c452a5ca69 1f7068fda10c46ebb15865f3160fe5b5 08da50fc73114b118f112d645e8631dd - default default] Workflow 'tripleo.access.v1.create_admin_via_nova' [RUNNING -> SUCCESS, msg=None] (execution_id=7d0b714c-29cb-4a05-9448-a69315150849)
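
The trace shows nova.servers_list returning an empty result and the workflow still ending in SUCCESS. The Python sketch below is a simplified model of that behavior, not the actual Mistral workflow definition: the function name `run_create_admin` and its return shape are assumptions made for illustration. It shows why iterating over zero servers succeeds without ever creating the user.

```python
def run_create_admin(servers, create_admin):
    """Simplified model: apply create_admin to each server nova returned.

    With an empty server list the loop body never runs, nothing can fail,
    and the run is reported as successful although no user was created.
    """
    configured = []
    for server in servers:
        create_admin(server)
        configured.append(server)
    return "SUCCESS", configured


if __name__ == "__main__":
    calls = []
    # Split stack case: nova knows no overcloud servers, so the list is empty.
    state, configured = run_create_admin([], calls.append)
    print(state, configured, calls)
```

In this model the caller cannot tell "all nodes configured" apart from "no nodes found", which matches the later SSH failure for tripleo-admin.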

Comment 3 Giulio Fidente 2017-10-18 12:14:44 UTC
One idea is to check whether the user already exists (similarly to https://review.openstack.org/#/c/509001/) and skip creation if it does.

In this scenario the operator will have to create the user manually on the nodes.
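
The check-then-skip idea can be sketched as follows. This is an illustrative Python sketch only, not the change in the linked review nor the fix that shipped: `user_exists` and `plan_admin_tasks` are hypothetical helpers, and the task names are copied from the create_admin_via_nova input logged in Comment 2.

```python
import pwd

# Task names as they appear in the workflow input in engine.log.
ADMIN_TASKS = [
    "create user tripleo-admin",
    "grant admin rights to user tripleo-admin",
    "ensure .ssh dir exists for user tripleo-admin",
    "ensure authorized_keys file exists for user tripleo-admin",
    "authorize TripleO Mistral key for user tripleo-admin",
]


def user_exists(name):
    """Return True if a local account with this name exists (Unix only)."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


def plan_admin_tasks(name, exists=user_exists):
    """Return the tasks still to run; empty if the user is already there."""
    if exists(name):
        return []
    return list(ADMIN_TASKS)


if __name__ == "__main__":
    print(plan_admin_tasks("tripleo-admin"))
```

The `exists` parameter is injectable so the decision logic can be exercised without touching the local account database; on a real node the default `pwd` lookup would run on that node.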

Comment 11 errata-xmlrpc 2017-12-13 22:17:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462