Bug 1562209
Summary: [FFU]: ceph-ansible gets triggered (and fails) when removing a compute node post FFU
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Marius Cornea <mcornea>
Assignee: Lukas Bezdicka <lbezdick>
QA Contact: Marius Cornea <mcornea>
CC: dbecker, gfidente, johfulto, mandreou, mbracho, mbultel, mburns, morazi, pgrist, rhel-osp-director-maint
Target Milestone: beta
Target Release: 13.0 (Queens)
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-8.0.2-3.el7ost, python-tripleoclient-9.2.1-3.el7ost
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2018-06-27 13:49:35 UTC
Bug Blocks: 1558787
Attachments: ceph-install-workflow.log (attachment 1414867)
Description (Marius Cornea, 2018-03-29 19:00:58 UTC)

Created attachment 1414867: ceph-install-workflow.log
(In reply to Marius Cornea from comment #0)

> There are multiple issues here as far as I can tell, maybe not all of them strictly related to the upgrade process, but we can split them into multiple BZs if needed.
>
> 1. ceph-ansible gets triggered when a node removal is requested. Does ceph-ansible need to configure anything when a compute node (a client of the ceph cluster) gets removed from the deployment?

An optimization could be made where the pending work is discovered and, based on that, it is decided that ceph-ansible doesn't need to run, but doing that is difficult. Running the correct playbook, however, _should_be_ idempotent, so I think it is better to focus on ensuring idempotence.

> 2. Based on the log output, the switch from non-containerized to containerized ceph mon play gets run, which should not be the case as this has already happened in a previous step and the services are already running inside containers at this point.

The compute node that is left over (24.14 as per `nova list` above) is the only client node in the inventory, which is right:

[root@undercloud-0 ansible-mistral-actiondh26qI]# grep clients -A 5 inventory.yaml
clients:
  hosts:
    192.168.24.14: {}
mdss:
  hosts: {}
mgrs:
[root@undercloud-0 ansible-mistral-actiondh26qI]#

but, as you say, the wrong playbook, rolling_update.yml, ran:

[root@undercloud-0 ansible-mistral-actiondh26qI]# grep ansible-playbook ansible-playbook-command.sh
ansible-playbook -vv /usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actiondh26qI/inventory.yaml --private-key /tmp/ansible-mistral-actiondh26qI/ssh_private_key --skip-tags package-install,with_pkg "$@"
[root@undercloud-0 ansible-mistral-actiondh26qI]#

That playbook was used during the FFU and appears to be "left over", which is a problem. One way to address it is to document that the user needs to run the node deletion with something like:

openstack overcloud node delete --stack QualtiyEng 0c2ceb6a-1648-44e2-9e27-3e10b50b8685 -e foo.yaml

where foo.yaml contains the correct playbook:

CephAnsiblePlaybook: ['/usr/share/ceph-ansible/site-docker.yml.sample']

The above could be used as a workaround in the meantime if you want to try it (a sketch of what foo.yaml might look like follows at the end of this report).

Giulio: is there a better way to get the stack to "remember" the right playbook?

> 3. Nevertheless the ceph-ansible playbook should be idempotent and not fail; from what I can tell from the log it fails on:
>
> "stderr": "Error EPERM: Are you SURE? Pool 'metrics' already has an enabled application; pass --yes-i-really-mean-it to proceed anyway"

This seems like an idempotence bug in ceph-ansible itself. I opened the following bug for it: https://bugzilla.redhat.com/show_bug.cgi?id=1562220 (see also the sketch of the failing command at the end of this report).

Please triage this; we are going through the list and assigning round robin. Thanks. (DFG:Upgrades triage call)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086
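For reference, a minimal sketch of what the foo.yaml workaround file could look like. The parameter name and playbook path are taken from the comment above; wrapping them in a parameter_defaults section is an assumption based on the usual TripleO environment file layout:

# foo.yaml (hypothetical layout): pin ceph-ansible back to the containerized
# deploy playbook so that node deletion does not re-run the FFU's
# rolling_update.yml
parameter_defaults:
  CephAnsiblePlaybook: ['/usr/share/ceph-ansible/site-docker.yml.sample']

It would then be passed to the delete command with -e foo.yaml as shown above. To check which playbook the stack currently remembers, something like `openstack stack environment show <stack> | grep CephAnsiblePlaybook` may work, assuming the heat OpenStack client plugin is available on the undercloud.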
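On the idempotence failure itself: the Ceph monitor refuses `osd pool application enable` on a pool that already has an enabled application unless the override flag is passed, which is the EPERM quoted in item 3. A hedged sketch of the command involved ("metrics" is the pool named in the log; the application name "gnocchi" is purely illustrative):

# Enable an application tag on the pool (succeeds when no conflicting
# application tag is already set):
ceph osd pool application enable metrics gnocchi
# When the monitor refuses with "Error EPERM: Are you SURE? Pool 'metrics'
# already has an enabled application; pass --yes-i-really-mean-it to proceed
# anyway", the override flag forces the call through:
ceph osd pool application enable metrics gnocchi --yes-i-really-mean-it

An idempotent playbook would either skip the task when the application is already enabled or pass the override flag; which approach the ceph-ansible fix tracked in bug 1562220 took is not stated here.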