Description of problem:

FFU: ceph-ansible gets triggered (and fails) when removing a compute node post FFU

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph OSD nodes
2. Upgrade to OSP13 via the FFU procedure
3. The last step of the procedure was to upgrade the ceph nodes by switching the ceph-related services to containers, running the deploy command below:

#!/bin/bash
openstack overcloud deploy \
  --timeout 100 \
  --templates /usr/share/openstack-tripleo-heat-templates \
  --stack QualtiyEng \
  --libvirt-type kvm \
  --ntp-server clock.redhat.com \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 2 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
  -e /home/stack/virt/internal.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/virt/network/network-environment.yaml \
  -e /home/stack/virt/hostnames.yml \
  -e /home/stack/virt/debug.yaml \
  -e /home/stack/virt/docker-images.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-ceph-newton.yaml \
  -e /home/stack/ceph-ansible-env.yaml

4. Successfully upgrade the deployment
5.
Remove one of the 2 compute nodes from the stack:

openstack overcloud node delete --stack QualtiyEng 0c2ceb6a-1648-44e2-9e27-3e10b50b8685

Actual results:

The node gets deleted but the stack update fails:

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 6e6b6e96-78f5-4916-acb9-11193e01964f | ceph-0       | ACTIVE | -          | Running     | ctlplane=192.168.24.22 |
| 62c3dc3b-b366-4cfa-8162-de775d9f0ca7 | ceph-1       | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| 93e3088c-a317-49fa-8a4f-1597280e8e84 | ceph-2       | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| edfc16a2-c589-44ab-ae92-34767c6a64b4 | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.14 |
| 5403b97f-b800-4897-8918-8f67663f60f1 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 41ccaa91-153d-419d-89af-960be8b21541 | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
| bff1bcaf-f631-442a-801e-4ba40bfeaa3c | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+

(undercloud) [stack@undercloud-0 ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| e25e11b5-5b3b-4c2c-aaa3-ee8f8e899a9f | QualtiyEng | 685e34f3d6b24ef5af5075745629db22 | UPDATE_FAILED | 2018-03-28T23:41:22Z | 2018-03-29T18:25:12Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+

We can see the failure in /var/log/mistral/ceph-install-workflow.log, which is attached to this BZ.

Expected results:

There are multiple issues here as far as I can tell; maybe not all of them are strictly related to the upgrade process, but we can split them into multiple BZs if needed.

1. ceph-ansible gets triggered when a node removal is requested. Does ceph-ansible need to configure anything when a compute node (a client of the ceph cluster) gets removed from the deployment?

2. Based on the log output, the switch-from-non-containerized-to-containerized ceph mon play gets run. This should not be the case, as the switch already happened in a previous step and the services are already running inside containers at this point.

3. Nevertheless, the ceph-ansible playbook should be idempotent and not fail. From what I can tell from the log, it fails on:

"stderr": "Error EPERM: Are you SURE? Pool 'metrics' already has an enabled application; pass --yes-i-really-mean-it to proceed anyway"

Additional info:

Attaching sosreport and ceph-install-workflow.log
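For issue 3, the usual way to make a task like this idempotent is to query the current state first and only attempt the change when it is actually needed. A minimal sketch of that guard pattern follows; pool_app_get is a hypothetical stand-in for `ceph osd pool application get <pool>`, so no real cluster is touched:

```shell
# Guard-before-change sketch: only enable the pool application when it is not
# already set, so a re-run converges instead of failing with EPERM.
pool_app_get() { echo "gnocchi"; }  # pretend pool 'metrics' already has this app enabled

enable_app() {
  local pool=$1 app=$2
  if [ "$(pool_app_get "$pool")" = "$app" ]; then
    # Already converged: report success instead of erroring out.
    echo "already enabled, nothing to do"
  else
    # In a real task this would call the actual ceph CLI.
    echo "would run: ceph osd pool application enable $pool $app"
  fi
}

enable_app metrics gnocchi   # prints "already enabled, nothing to do"
```

Re-running the function any number of times produces the same result, which is the behavior the playbook task is missing.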
Created attachment 1414867 [details] ceph-install-workflow.log
(In reply to Marius Cornea from comment #0)
> There are multiple issues here as I can tell, maybe not all of them strictly
> related to the upgrade process but we can split them into multiple BZs if
> needed.
>
> 1. ceph-ansible gets triggered when a node removal is requested. Does
> ceph-ansible need to configure anything when a compute nodes(client for ceph
> cluster) gets removed from the deployment?

An optimization could be made where the work to be done is discovered up front and ceph-ansible is skipped when there is nothing for it to do, but doing that is difficult. Running the correct playbook, however, _should_be_ idempotent, so I think it is better to focus on ensuring idempotence.

> 2. Based on the log output the switch from non-containerized to
> containerized ceph mon play gets run which should not be the case as this
> has already happened in a previous step and the services are already running
> inside containers at this point.

The compute node that is left over (24.14, as per `nova list` above) is the only client node in the inventory, which is correct:

[root@undercloud-0 ansible-mistral-actiondh26qI]# grep clients -A 5 inventory.yaml
clients:
  hosts:
    192.168.24.14: {}
mdss:
  hosts: {}
mgrs:
[root@undercloud-0 ansible-mistral-actiondh26qI]#

but, as you say, the wrong playbook, rolling_update.yml, ran:

[root@undercloud-0 ansible-mistral-actiondh26qI]# grep ansible-playbook ansible-playbook-command.sh
ansible-playbook -vv /usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml --user tripleo-admin --become --become-user root --extra-vars {"ireallymeanit": "yes"} --inventory-file /tmp/ansible-mistral-actiondh26qI/inventory.yaml --private-key /tmp/ansible-mistral-actiondh26qI/ssh_private_key --skip-tags package-install,with_pkg "$@"
[root@undercloud-0 ansible-mistral-actiondh26qI]#

That playbook was used during the FFU and appears to be "left over" in the stack, which is the problem.
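For anyone hitting this, the same grep can confirm which playbook a stack operation is about to re-run before deleting anything. The sketch below recreates the captured command file so the extraction is reproducible; on a real undercloud you would grep the actual ansible-playbook-command.sh under the Mistral action directory in /tmp:

```shell
# Recreate the captured ansible-playbook command line (taken from the output
# above) in a temp dir, then pull out the playbook path with grep -o.
dir=$(mktemp -d)
cat > "$dir/ansible-playbook-command.sh" <<'EOF'
ansible-playbook -vv /usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml --user tripleo-admin --become --become-user root
EOF

# rolling_update.yml here means the FFU playbook is still the one the stack
# remembers, instead of the normal containerized-deploy playbook.
grep -o '/usr/share/ceph-ansible/[^ ]*\.yml' "$dir/ansible-playbook-command.sh"
```

If this prints infrastructure-playbooks/rolling_update.yml, the node delete will re-run the upgrade play.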
One way to address it is to document that the user needs to run the node deletion with something like:

openstack overcloud node delete --stack QualtiyEng 0c2ceb6a-1648-44e2-9e27-3e10b50b8685 -e foo.yaml

where foo.yaml contains the correct playbook:

CephAnsiblePlaybook: ['/usr/share/ceph-ansible/site-docker.yml.sample']

The above could be used as a workaround in the meantime if you want to try it.

Giulio: Is there a better way to get the stack to "remember" the right playbook?

> 3. Nevertheless the ceph-ansible playbook should be idempotent and not fail,
> from what I can tell from the log it fails on:
>
> "stderr": "Error EPERM: Are you SURE? Pool 'metrics' already has an enabled
> application; pass --yes-i-really-mean-it to proceed anyway"

This seems like an idempotence bug in ceph-ansible itself. I opened the following bug for it:

https://bugzilla.redhat.com/show_bug.cgi?id=1562220
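Written out as a complete TripleO environment file, the workaround would look something like the following. The file name foo.yaml is just a placeholder, and the parameter sits under parameter_defaults as TripleO parameters normally do:

```yaml
# foo.yaml -- pass with `-e foo.yaml` to the node delete command above.
parameter_defaults:
  # Point the ceph-ansible run back at the normal containerized-deploy
  # playbook instead of the FFU rolling_update.yml left over in the stack.
  CephAnsiblePlaybook: ['/usr/share/ceph-ansible/site-docker.yml.sample']
```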
Please triage this; we are going through the list and assigning round-robin. Thanks. (DFG:Upgrades triage call)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086