Description of problem:

FFU: ceph-ansible gets triggered (and fails) when removing a compute node post FFU

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph OSD nodes
2. Upgrade to OSP13 via the FFU procedure
3. The last step of the procedure was to upgrade the ceph nodes by switching the ceph-related services to containers, running the deploy command below:

#!/bin/bash
openstack overcloud deploy \
  --timeout 100 \
  --templates /usr/share/openstack-tripleo-heat-templates \
  --stack QualtiyEng \
  --libvirt-type kvm \
  --ntp-server clock.redhat.com \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 2 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
  -e /home/stack/virt/internal.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/virt/network/network-environment.yaml \
  -e /home/stack/virt/hostnames.yml \
  -e /home/stack/virt/debug.yaml \
  -e /home/stack/virt/docker-images.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-ceph-newton.yaml \
  -e /home/stack/ceph-ansible-env.yaml

4. Successfully upgrade the deployment
5.
Remove one of the 2 compute nodes from the stack:

openstack overcloud node delete --stack QualtiyEng 0c2ceb6a-1648-44e2-9e27-3e10b50b8685

Actual results:

The node gets deleted but the stack update fails:

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 6e6b6e96-78f5-4916-acb9-11193e01964f | ceph-0       | ACTIVE | -          | Running     | ctlplane=192.168.24.22 |
| 62c3dc3b-b366-4cfa-8162-de775d9f0ca7 | ceph-1       | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| 93e3088c-a317-49fa-8a4f-1597280e8e84 | ceph-2       | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| edfc16a2-c589-44ab-ae92-34767c6a64b4 | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.14 |
| 5403b97f-b800-4897-8918-8f67663f60f1 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 41ccaa91-153d-419d-89af-960be8b21541 | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
| bff1bcaf-f631-442a-801e-4ba40bfeaa3c | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+

(undercloud) [stack@undercloud-0 ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| e25e11b5-5b3b-4c2c-aaa3-ee8f8e899a9f | QualtiyEng | 685e34f3d6b24ef5af5075745629db22 | UPDATE_FAILED | 2018-03-28T23:41:22Z | 2018-03-29T18:25:12Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+

We can see the failure in /var/log/mistral/ceph-install-workflow.log, which is attached to this BZ.

Expected results:

There are multiple issues here as far as I can tell; maybe not all of them are strictly related to the upgrade process, but we can split them into multiple BZs if needed.

1. ceph-ansible gets triggered when a node removal is requested. Does ceph-ansible need to configure anything when a compute node (a client of the ceph cluster) gets removed from the deployment?

2. Based on the log output, the switch-from-non-containerized-to-containerized ceph mon play gets run. This should not be the case, as the switch already happened in a previous step and the services are already running inside containers at this point.

3. Nevertheless, the ceph-ansible playbook should be idempotent and not fail. From what I can tell from the log, it fails on:

"stderr": "Error EPERM: Are you SURE? Pool 'metrics' already has an enabled application; pass --yes-i-really-mean-it to proceed anyway"

Additional info:

Attaching sosreport and ceph-install-workflow.log
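For issue 3, the usual way to make a task like this idempotent is to query the current state first and only attempt the change when it is actually needed. A minimal sketch of that guard pattern follows; pool_app_get is a hypothetical stand-in for `ceph osd pool application get <pool>`, so no real cluster is touched:

```shell
# Guard-before-change sketch: only enable the pool application when it is not
# already set, so a re-run converges instead of failing with EPERM.
pool_app_get() { echo "gnocchi"; }  # pretend pool 'metrics' already has this app enabled

enable_app() {
  local pool=$1 app=$2
  if [ "$(pool_app_get "$pool")" = "$app" ]; then
    # Already converged: report success instead of erroring out.
    echo "already enabled, nothing to do"
  else
    # In a real task this would call the actual ceph CLI.
    echo "would run: ceph osd pool application enable $pool $app"
  fi
}

enable_app metrics gnocchi   # prints "already enabled, nothing to do"
```

Re-running the function any number of times produces the same result, which is the behavior the playbook task is missing.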
Created attachment 1414867 [details] ceph-install-workflow.log
(In reply to Marius Cornea from comment #0)
> There are multiple issues here as I can tell, maybe not all of them strictly
> related to the upgrade process but we can split them into multiple BZs if
> needed.
>
> 1. ceph-ansible gets triggered when a node removal is requested. Does
> ceph-ansible need to configure anything when a compute nodes(client for ceph
> cluster) gets removed from the deployment?

An optimization could be made where the work to be done is discovered up front and ceph-ansible is skipped when there is nothing for it to do, but doing that is difficult. Running the correct playbook, however, _should_be_ idempotent, so I think it is better to focus on ensuring idempotence.

> 2. Based on the log output the switch from non-containerized to
> containerized ceph mon play gets run which should not be the case as this
> has already happened in a previous step and the services are already running
> inside containers at this point.

The compute node that is left over (24.14, as per `nova list` above) is the only client node in the inventory, which is correct:

[root@undercloud-0 ansible-mistral-actiondh26qI]# grep clients -A 5 inventory.yaml
clients:
  hosts:
    192.168.24.14: {}
mdss:
  hosts: {}
mgrs:
[root@undercloud-0 ansible-mistral-actiondh26qI]#

but, as you say, the wrong playbook, rolling_update.yml, ran:

[root@undercloud-0 ansible-mistral-actiondh26qI]# grep ansible-playbook ansible-playbook-command.sh
ansible-playbook -vv /usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actiondh26qI/inventory.yaml --private-key /tmp/ansible-mistral-actiondh26qI/ssh_private_key --skip-tags package-install,with_pkg "$@"
[root@undercloud-0 ansible-mistral-actiondh26qI]#

That playbook was used during the FFU and appears to be "left over" in the stack, which is the problem.
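For anyone hitting this, the same grep can confirm which playbook a stack operation is about to re-run before deleting anything. The sketch below recreates the captured command file so the extraction is reproducible; on a real undercloud you would grep the actual ansible-playbook-command.sh under the Mistral action directory in /tmp:

```shell
# Recreate the captured ansible-playbook command line (taken from the output
# above) in a temp dir, then pull out the playbook path with grep -o.
dir=$(mktemp -d)
cat > "$dir/ansible-playbook-command.sh" <<'EOF'
ansible-playbook -vv /usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml --user tripleo-admin --become --become-user root
EOF

# rolling_update.yml here means the FFU playbook is still the one the stack
# remembers, instead of the normal containerized-deploy playbook.
grep -o '/usr/share/ceph-ansible/[^ ]*\.yml' "$dir/ansible-playbook-command.sh"
```

If this prints infrastructure-playbooks/rolling_update.yml, the node delete will re-run the upgrade play.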
One way to address it is to document that the user needs to run the node deletion with something like:

openstack overcloud node delete --stack QualtiyEng 0c2ceb6a-1648-44e2-9e27-3e10b50b8685 -e foo.yaml

where foo.yaml contains the correct playbook:

CephAnsiblePlaybook: ['/usr/share/ceph-ansible/site-docker.yml.sample']

The above could be used as a workaround in the meantime if you want to try it.

Giulio: Is there a better way to get the stack to "remember" the right playbook?

> 3. Nevertheless the ceph-ansible playbook should be idempotent and not fail,
> from what I can tell from the log it fails on:
>
> "stderr": "Error EPERM: Are you SURE? Pool 'metrics' already has an enabled
> application; pass --yes-i-really-mean-it to proceed anyway"

This seems like an idempotence bug in ceph-ansible itself. I opened the following bug for it:

https://bugzilla.redhat.com/show_bug.cgi?id=1562220
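Written out as a complete TripleO environment file, the workaround would look something like the following. The file name foo.yaml is just a placeholder, and the parameter sits under parameter_defaults as TripleO parameters normally do:

```yaml
# foo.yaml -- pass with `-e foo.yaml` to the node delete command above.
parameter_defaults:
  # Point the ceph-ansible run back at the normal containerized-deploy
  # playbook instead of the FFU rolling_update.yml left over in the stack.
  CephAnsiblePlaybook: ['/usr/share/ceph-ansible/site-docker.yml.sample']
```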
Please triage this; we are going through the list and assigning round-robin. Thanks. (DFG:Upgrades triage call)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086