Bug 1562209 - [FFU]: ceph-ansible gets triggered (and fails) when removing a compute node post FFU
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 13.0 (Queens)
Assignee: Lukas Bezdicka
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks: 1558787
 
Reported: 2018-03-29 19:00 UTC by Marius Cornea
Modified: 2018-06-27 13:50 UTC
CC: 10 users

Fixed In Version: openstack-tripleo-heat-templates-8.0.2-3.el7ost python-tripleoclient-9.2.1-3.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 13:49:35 UTC
Target Upstream Version:
Embargoed:


Attachments
ceph-install-workflow.log (7.37 MB, text/plain)
2018-03-29 19:02 UTC, Marius Cornea
no flags


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 562457 0 'None' MERGED No-op Mistral workflow resources for update/upgrade/ffwd 2020-03-25 15:40:36 UTC
OpenStack gerrit 564033 0 'None' MERGED Introduce Ceph upgrade environments 2020-03-25 15:40:36 UTC
OpenStack gerrit 564034 0 'None' MERGED Introduce Ceph upgrade command 2020-03-25 15:40:36 UTC
Red Hat Product Errata RHEA-2018:2086 0 None None None 2018-06-27 13:50:27 UTC

Description Marius Cornea 2018-03-29 19:00:58 UTC
Description of problem:
FFU: ceph-ansible gets triggered (and fails) when removing a compute node post FFU

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy OSP10 with 3 controller + 2 computes + 3 ceph osd nodes
2. Upgrade to OSP13 via FFU procedure
3. The last step of the procedure was to upgrade the ceph nodes by switching the ceph-related services to containers, running the deploy command below:

#!/bin/bash
openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack QualtiyEng \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--control-scale 3 \
--control-flavor controller \
--compute-scale 2 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/virt/docker-images.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-ceph-newton.yaml \
-e /home/stack/ceph-ansible-env.yaml

4. Successfully upgrade the deployment
5. Remove one of the 2 compute nodes from the stack:
openstack overcloud node delete --stack QualtiyEng 0c2ceb6a-1648-44e2-9e27-3e10b50b8685

Actual results:
The node gets deleted but the stack update fails:

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 6e6b6e96-78f5-4916-acb9-11193e01964f | ceph-0       | ACTIVE | -          | Running     | ctlplane=192.168.24.22 |
| 62c3dc3b-b366-4cfa-8162-de775d9f0ca7 | ceph-1       | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| 93e3088c-a317-49fa-8a4f-1597280e8e84 | ceph-2       | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| edfc16a2-c589-44ab-ae92-34767c6a64b4 | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.14 |
| 5403b97f-b800-4897-8918-8f67663f60f1 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 41ccaa91-153d-419d-89af-960be8b21541 | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
| bff1bcaf-f631-442a-801e-4ba40bfeaa3c | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
(undercloud) [stack@undercloud-0 ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| e25e11b5-5b3b-4c2c-aaa3-ee8f8e899a9f | QualtiyEng | 685e34f3d6b24ef5af5075745629db22 | UPDATE_FAILED | 2018-03-28T23:41:22Z | 2018-03-29T18:25:12Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+

We can see the failure in /var/log/mistral/ceph-install-workflow.log, which is attached to this BZ.
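To locate the failing task quickly, a simple grep over that log should work (the error string is the one quoted under Expected results below):

grep -n 'Error EPERM' /var/log/mistral/ceph-install-workflow.log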

Expected results:

There are multiple issues here as far as I can tell; maybe not all of them are strictly related to the upgrade process, but we can split them into multiple BZs if needed.

1. ceph-ansible gets triggered when a node removal is requested. Does ceph-ansible need to configure anything when a compute node (a client of the ceph cluster) gets removed from the deployment?

2. Based on the log output, the switch from non-containerized to containerized ceph mon play gets run, which should not be the case, as this already happened in a previous step and the services are already running inside containers at this point.

3. Nevertheless, the ceph-ansible playbook should be idempotent and not fail. From what I can tell from the log, it fails on:

"stderr": "Error EPERM: Are you SURE? Pool 'metrics' already has an enabled application; pass --yes-i-really-mean-it to proceed anyway"

Additional info:

Attaching sosreport and ceph-install-workflow.log.

Comment 2 Marius Cornea 2018-03-29 19:02:36 UTC
Created attachment 1414867 [details]
ceph-install-workflow.log

Comment 4 John Fulton 2018-03-29 19:58:57 UTC
(In reply to Marius Cornea from comment #0)
> There are multiple issues here as far as I can tell; maybe not all of them are
> strictly related to the upgrade process, but we can split them into multiple
> BZs if needed.
> 
> 1. ceph-ansible gets triggered when a node removal is requested. Does
> ceph-ansible need to configure anything when a compute node (a client of the
> ceph cluster) gets removed from the deployment?

An optimization could be made where the work to be done is discovered first and ceph-ansible is skipped when it has nothing to do, but doing that is difficult. Running the correct playbook, however, _should be_ idempotent, so I think it is better to focus on ensuring idempotence.
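As a rough sketch of the kind of idempotence meant here (this is not ceph-ansible's actual task; pool and application names are illustrative), the enable step could be guarded so re-runs become no-ops:

# Enable the application only if the pool does not already report it
if ! ceph osd pool application get metrics | grep -q '"rbd"'; then
  ceph osd pool application enable metrics rbd
fi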

> 2. Based on the log output, the switch from non-containerized to containerized
> ceph mon play gets run, which should not be the case, as this already happened
> in a previous step and the services are already running inside containers at
> this point.

The compute node that is left over (24.14 as per `nova list` above) is the only client node in the inventory, which is right:

[root@undercloud-0 ansible-mistral-actiondh26qI]# grep clients -A 5 inventory.yaml 
clients:
  hosts:
    192.168.24.14: {}
mdss:
  hosts: {}
mgrs:
[root@undercloud-0 ansible-mistral-actiondh26qI]# 

but, as you say, the wrong playbook, rolling_update.yml, ran: 

[root@undercloud-0 ansible-mistral-actiondh26qI]# grep ansible-playbook ansible-playbook-command.sh
ansible-playbook -vv /usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actiondh26qI/inventory.yaml --private-key /tmp/ansible-mistral-actiondh26qI/ssh_private_key --skip-tags package-install,with_pkg "$@"
[root@undercloud-0 ansible-mistral-actiondh26qI]# 

That playbook was used in the FFU and seems to be "left over", which is a problem.
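One way to confirm the stale value (a diagnostic sketch; the exact shape of the stored environment output may differ) is to inspect the stack's saved environment on the undercloud:

openstack stack environment show QualtiyEng | grep -A1 CephAnsiblePlaybook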

One way to address it is to document that the user needs to run the node deletion using something like: 

openstack overcloud node delete --stack QualtiyEng 0c2ceb6a-1648-44e2-9e27-3e10b50b8685 -e foo.yaml

where foo.yaml contains the correct playbook: 

 CephAnsiblePlaybook: ['/usr/share/ceph-ansible/site-docker.yml.sample']

The above could be used as a workaround in the meantime if you want to try it. 
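For concreteness, a sketch of the full workaround (the parameter_defaults wrapper and the file path are assumptions; the UUID is just the node being removed, reused from the reproduction above):

cat > /home/stack/foo.yaml <<'EOF'
parameter_defaults:
  CephAnsiblePlaybook: ['/usr/share/ceph-ansible/site-docker.yml.sample']
EOF
openstack overcloud node delete --stack QualtiyEng 0c2ceb6a-1648-44e2-9e27-3e10b50b8685 -e /home/stack/foo.yaml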

Giulio: Is there a better way to get the stack to "remember" the right playbook?

> 3. Nevertheless, the ceph-ansible playbook should be idempotent and not fail.
> From what I can tell from the log, it fails on:
> 
> "stderr": "Error EPERM: Are you SURE? Pool 'metrics' already has an enabled
> application; pass --yes-i-really-mean-it to proceed anyway"

This seems like an idempotence bug in ceph-ansible itself. I opened the following bug for it: 

 https://bugzilla.redhat.com/show_bug.cgi?id=1562220

Comment 5 Marios Andreou 2018-04-02 13:00:43 UTC
Please triage this; we are going through the list and assigning round robin. Thanks. (DFG:Upgrades triage call)

Comment 13 errata-xmlrpc 2018-06-27 13:49:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

