Bug 1562209

Summary: [FFU]: ceph-ansible gets triggered (and fails) when removing a compute node post FFU
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Lukas Bezdicka <lbezdick>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: urgent
Version: 13.0 (Queens)
CC: dbecker, gfidente, johfulto, mandreou, mbracho, mbultel, mburns, morazi, pgrist, rhel-osp-director-maint
Target Milestone: beta
Keywords: Triaged
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-8.0.2-3.el7ost python-tripleoclient-9.2.1-3.el7ost
Last Closed: 2018-06-27 13:49:35 UTC
Type: Bug
Bug Blocks: 1558787
Attachments: ceph-install-workflow.log

Description Marius Cornea 2018-03-29 19:00:58 UTC
Description of problem:
FFU: ceph-ansible gets triggered (and fails) when removing a compute node post FFU

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy OSP10 with 3 controller + 2 computes + 3 ceph osd nodes
2. Upgrade to OSP13 via FFU procedure
3. Last step of the procedure was to upgrade the ceph nodes by switching the ceph-related services to containers, running the deploy command below:

#!/bin/bash
openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack QualtiyEng \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--control-scale 3 \
--control-flavor controller \
--compute-scale 2 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/docker-images.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-ceph-newton.yaml \
-e /home/stack/ceph-ansible-env.yaml

4. Successfully upgrade the deployment
5. Remove one of the 2 compute nodes from the stack:
openstack overcloud node delete --stack QualtiyEng 0c2ceb6a-1648-44e2-9e27-3e10b50b8685

Actual results:
Node gets deleted but stack update fails:

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 6e6b6e96-78f5-4916-acb9-11193e01964f | ceph-0       | ACTIVE | -          | Running     | ctlplane=192.168.24.22 |
| 62c3dc3b-b366-4cfa-8162-de775d9f0ca7 | ceph-1       | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| 93e3088c-a317-49fa-8a4f-1597280e8e84 | ceph-2       | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| edfc16a2-c589-44ab-ae92-34767c6a64b4 | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.14 |
| 5403b97f-b800-4897-8918-8f67663f60f1 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 41ccaa91-153d-419d-89af-960be8b21541 | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
| bff1bcaf-f631-442a-801e-4ba40bfeaa3c | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
(undercloud) [stack@undercloud-0 ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| e25e11b5-5b3b-4c2c-aaa3-ee8f8e899a9f | QualtiyEng | 685e34f3d6b24ef5af5075745629db22 | UPDATE_FAILED | 2018-03-28T23:41:22Z | 2018-03-29T18:25:12Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+

We can see the failure in /var/log/mistral/ceph-install-workflow.log, which is attached to this BZ.

Expected results:

There are multiple issues here as far as I can tell; maybe not all of them are strictly related to the upgrade process, but we can split them into multiple BZs if needed.

1. ceph-ansible gets triggered when a node removal is requested. Does ceph-ansible need to configure anything when a compute node (a client of the Ceph cluster) gets removed from the deployment?

2. Based on the log output, the play that switches the Ceph mons from non-containerized to containerized gets run. This should not be the case, as the switch already happened in a previous step and the services are already running inside containers at this point.

3. Nevertheless, the ceph-ansible playbook should be idempotent and not fail. From what I can tell from the log, it fails on:

"stderr": "Error EPERM: Are you SURE? Pool 'metrics' already has an enabled application; pass --yes-i-really-mean-it to proceed anyway"
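For context, that error comes from Ceph's pool-application command, which refuses to re-apply an application to a pool that already has one unless it is forced. A minimal sketch of an idempotent wrapper follows; the pool name 'metrics' is from the log above, while the helper and the application name 'openstack_app' are hypothetical illustrations, not the actual ceph-ansible task:

```shell
# Hypothetical helper, not the actual ceph-ansible task. Passing
# --yes-i-really-mean-it lets the command succeed even when the pool
# already has an application enabled, avoiding the EPERM failure seen
# in ceph-install-workflow.log.
enable_pool_app() {
    pool="$1"; app="$2"
    ceph osd pool application enable "$pool" "$app" --yes-i-really-mean-it
}
# Example (requires a reachable Ceph cluster):
# enable_pool_app metrics openstack_app
```

ceph-ansible would need an equivalent force or already-enabled guard in its task for a re-run to pass cleanly.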

Additional info:

Attaching sosreport and ceph-install-workflow.log

Comment 2 Marius Cornea 2018-03-29 19:02:36 UTC
Created attachment 1414867 [details]
ceph-install-workflow.log

Comment 4 John Fulton 2018-03-29 19:58:57 UTC
(In reply to Marius Cornea from comment #0)
> There are multiple issues here as I can tell, maybe not all of them strictly
> related to the upgrade process but we can split them into multiple BZs if
> needed.
> 
> 1. ceph-ansible gets triggered when a node removal is requested. Does
> ceph-ansible need to configure anything when a compute nodes(client for ceph
> cluster) gets removed from the deployment?

An optimization could be made where what needs to be done is discovered first, and it could then be decided that ceph-ansible doesn't need to run, but doing that is difficult. Running the correct playbook, however, _should be_ idempotent, so I think it is better to focus on ensuring idempotence.

> 2. Based on the log output the switch from non-containerized to
> containerized ceph mon play gets run which should not be the case as this
> has already happened in a previous step and the services are already running
> inside containers at this point.

The compute node that is left over (24.14 as per `nova list` above) is the only client node in the inventory, which is correct:

[root@undercloud-0 ansible-mistral-actiondh26qI]# grep clients -A 5 inventory.yaml 
clients:
  hosts:
    192.168.24.14: {}
mdss:
  hosts: {}
mgrs:
[root@undercloud-0 ansible-mistral-actiondh26qI]# 

but, as you say, the wrong playbook, rolling_update.yml, ran:

[root@undercloud-0 ansible-mistral-actiondh26qI]# grep ansible-playbook ansible-playbook-command.sh
ansible-playbook -vv /usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actiondh26qI/inventory.yaml --private-key /tmp/ansible-mistral-actiondh26qI/ssh_private_key --skip-tags package-install,with_pkg "$@"
[root@undercloud-0 ansible-mistral-actiondh26qI]# 

That playbook was used in the FFU and seems to be "left over", which is a problem.

One way to address it is to document that the user needs to run the node deletion using something like: 

openstack overcloud node delete --stack QualtiyEng 0c2ceb6a-1648-44e2-9e27-3e10b50b8685 -e foo.yaml

where foo.yaml contains the correct playbook: 

 CephAnsiblePlaybook: ['/usr/share/ceph-ansible/site-docker.yml.sample']

The above could be used as a workaround in the meantime if you want to try it. 
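Spelled out, the workaround might look like the sketch below. The file name foo.yaml and the CephAnsiblePlaybook value are from this comment; the parameter_defaults wrapper is an assumption based on the usual TripleO environment-file layout:

```shell
# Sketch of the workaround environment file; the parameter_defaults
# wrapper is an assumption based on the usual TripleO environment layout.
cat > foo.yaml <<'EOF'
parameter_defaults:
  CephAnsiblePlaybook: ['/usr/share/ceph-ansible/site-docker.yml.sample']
EOF

# Then pass it on the delete so the stack update runs the normal deploy
# playbook instead of rolling_update.yml (not run here):
# openstack overcloud node delete --stack QualtiyEng 0c2ceb6a-1648-44e2-9e27-3e10b50b8685 -e foo.yaml
```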

Giulio: Is there a better way to get the stack to "remember" the right playbook?

> 3. Nevertheless the ceph-ansible playbook should be idempotent and not fail,
> from what I can tell from the log it fails on:
> 
> "stderr": "Error EPERM: Are you SURE? Pool 'metrics' already has an enabled
> application; pass --yes-i-really-mean-it to proceed anyway"

This seems like an idempotence bug in ceph-ansible itself. I opened the following bug for it: 

 https://bugzilla.redhat.com/show_bug.cgi?id=1562220

Comment 5 Marios Andreou 2018-04-02 13:00:43 UTC
Please triage this; we are going through the list and assigning round-robin. Thanks. (DFG:Upgrades triage call)

Comment 13 errata-xmlrpc 2018-06-27 13:49:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086