Description of problem:
openstack overcloud upgrade run --roles Controller is not idempotent. As a result, if the upgrade fails, the user cannot recover from the failure and continue with the upgrade.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.2-0.20180416194362.29a5ad5.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph nodes
2. Upgrade undercloud to OSP11/12/13
3. Run FFU prepare: openstack overcloud ffwd-upgrade prepare
4. Run FFU run: openstack overcloud ffwd-upgrade run
5. Upgrade controller: openstack overcloud upgrade run --roles Controller --skip-tags validation
6. When launching the containers, simulate a failure (stop one docker-puppet-* container, for example)
7. Re-run openstack overcloud upgrade run --roles Controller --skip-tags validation

Actual results:
'Disable the haproxy cluster resource' fails because the haproxy pacemaker resource doesn't exist anymore:

2018-04-24 22:10:48,073 p=22874 u=mistral | TASK [Disable the haproxy cluster resource] ************************************
2018-04-24 22:10:48,100 p=22874 u=mistral | skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-04-24 22:10:48,123 p=22874 u=mistral | skipping: [192.168.24.16] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-04-24 22:10:50,801 p=22874 u=mistral | FAILED - RETRYING: Disable the haproxy cluster resource (5 retries left).
2018-04-24 22:10:58,270 p=22874 u=mistral | FAILED - RETRYING: Disable the haproxy cluster resource (4 retries left).
2018-04-24 22:11:05,773 p=22874 u=mistral | FAILED - RETRYING: Disable the haproxy cluster resource (3 retries left).
2018-04-24 22:11:13,349 p=22874 u=mistral | FAILED - RETRYING: Disable the haproxy cluster resource (2 retries left).
2018-04-24 22:11:20,953 p=22874 u=mistral | FAILED - RETRYING: Disable the haproxy cluster resource (1 retries left).
2018-04-24 22:11:28,452 p=22874 u=mistral | fatal: [192.168.24.18]: FAILED! => {"attempts": 5, "changed": false, "error": "Error: resource/clone/master/group/bundle 'haproxy-bundle' does not exist\n", "msg": "Failed, to set the resource haproxy-bundle to the state disable", "output": "", "rc": 1}

Expected results:
The upgrade tasks are idempotent, so the operator can re-run the upgrade commands and recover from failed upgrade attempts.

Additional info:
Created attachment 1426332 [details] logs.tar.gz Attaching logs + playbooks.
@Marios, this BZ was assigned to you during the triage duty call. Please feel free to reassign.
Marking triaged. This might be related to BZ 1571549 and https://review.openstack.org/#/c/563073/, but I need to look at the upgrade tasks and at what specifically failed here.
I think this is related to BZ 1571549, since the two will have a similar fix, but this one needs its own fix. I'm digging into it today and will post something. Thanks.
Hi mcornea, after more investigation [0], and especially after looking at the attached logs (very helpful, thanks very much), I think this is indeed a duplicate of BZ 1571549. Can you please try again, making sure you have openstack-tripleo-heat-templates-8.0.2-5.el7ost or newer, with the fix in https://review.openstack.org/#/c/563588/? If it reproduces, I can investigate further as a matter of urgency; otherwise we can close this as a duplicate. Thanks.

[0] This is from the logs, but for convenience: the failed trace is at http://pastebin.test.redhat.com/585636 and the relevant upgrade tasks from the environment are at http://pastebin.test.redhat.com/585641. So it looks like you were running the 'already containerized' tasks, but without the check that is added in the /#/c/563588/ review.
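
For illustration only, here is a minimal Python sketch of the kind of guard that makes a "disable the cluster resource" step safe to re-run: check whether the resource exists first, and treat a missing resource as nothing to do. This is not the change from the review above. The function names (run_command, resource_exists, disable_resource) are hypothetical, and the sketch assumes that `pcs resource show <name>` exits non-zero when the resource is unknown.

```python
import subprocess
from typing import Callable, Sequence, Tuple

# A runner takes an argv list and returns (exit code, combined output).
# Injecting it keeps the logic testable without a pacemaker cluster.
Runner = Callable[[Sequence[str]], Tuple[int, str]]


def run_command(argv: Sequence[str]) -> Tuple[int, str]:
    """Run a command and return its exit code and combined stdout/stderr."""
    proc = subprocess.run(
        list(argv),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def resource_exists(name: str, run: Runner = run_command) -> bool:
    """Assumption: `pcs resource show <name>` fails for an unknown resource."""
    rc, _ = run(["pcs", "resource", "show", name])
    return rc == 0


def disable_resource(name: str, run: Runner = run_command) -> str:
    """Disable a pacemaker resource; a missing resource counts as already done.

    Returns "skipped" when the resource does not exist and "disabled" on
    success. Raises RuntimeError when the disable command itself fails.
    """
    if not resource_exists(name, run):
        return "skipped"
    rc, output = run(["pcs", "resource", "disable", name])
    if rc != 0:
        raise RuntimeError("failed to disable %s: %s" % (name, output.strip()))
    return "disabled"
```

With a guard like this, a second run after a partial failure would return "skipped" for a resource that an earlier run already removed, instead of retrying and failing as in the trace above.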
Moving this to ON_QA as discussed so it can be picked up for testing. As per comment #5, we hope this is fixed by BZ 1571549.
I was able to run the controllers upgrade twice so this issue is fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086