Bug 1571500

Summary: FFU: openstack overcloud upgrade run --roles Controller is not idempotent
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates Assignee: Emilien Macchi <emacchi>
Status: CLOSED ERRATA QA Contact: Gurenko Alex <agurenko>
Severity: urgent Docs Contact:
Priority: high    
Version: 13.0 (Queens) CC: ccamacho, dbecker, jfrancoa, mbracho, mburns, mcornea, morazi, sathlang, sclewis
Target Milestone: rc Keywords: Triaged
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-8.0.2-5.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-27 13:53:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs.tar.gz none

Description Marius Cornea 2018-04-25 02:25:51 UTC
Description of problem:
openstack overcloud upgrade run --roles Controller is not idempotent. As a result, if the upgrade fails partway through, the user cannot re-run the command to recover and continue with the upgrade.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.2-0.20180416194362.29a5ad5.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph nodes
2. Upgrade undercloud to OSP11/12/13
3. Run FFU prepare: openstack overcloud ffwd-upgrade prepare
4. Run FFU run: openstack overcloud ffwd-upgrade run 
5. Upgrade controller: openstack overcloud upgrade run --roles Controller --skip-tags validation
6. While the containers are being launched, simulate a failure (for example, stop one docker-puppet-* container)
7. Re-run openstack overcloud upgrade run --roles Controller --skip-tags validation
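
The re-run in steps 5 and 7 could be scripted roughly as follows. This is only a sketch of a re-run check, not part of the original reproduction: the helper names (run, rerun_check, rerun_succeeds) are hypothetical, the failure in step 6 still has to be injected by hand between attempts, and the command line is the one quoted in step 5.

```python
import subprocess
from typing import Callable, List, Sequence

# Command taken from step 5 of this report.
UPGRADE_CMD = [
    "openstack", "overcloud", "upgrade", "run",
    "--roles", "Controller", "--skip-tags", "validation",
]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def rerun_check(
    cmd: Sequence[str] = UPGRADE_CMD,
    attempts: int = 2,
    runner: Callable[[Sequence[str]], int] = run,
) -> List[int]:
    """Run the same command `attempts` times and collect the exit codes."""
    return [runner(cmd) for _ in range(attempts)]


def rerun_succeeds(codes: List[int]) -> bool:
    """Treat the command as re-runnable if the final attempt exited 0."""
    return bool(codes) and codes[-1] == 0


if __name__ == "__main__":
    codes = rerun_check()
    print("exit codes:", codes, "re-run ok:", rerun_succeeds(codes))
```

With the behaviour reported here, the second attempt would be expected to exit non-zero, so rerun_succeeds would return False.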

Actual results:
'Disable the haproxy cluster resource' fails because the haproxy pacemaker resource doesn't exist anymore:

2018-04-24 22:10:48,073 p=22874 u=mistral |  TASK [Disable the haproxy cluster resource] ************************************
2018-04-24 22:10:48,100 p=22874 u=mistral |  skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-04-24 22:10:48,123 p=22874 u=mistral |  skipping: [192.168.24.16] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-04-24 22:10:50,801 p=22874 u=mistral |  FAILED - RETRYING: Disable the haproxy cluster resource (5 retries left).
2018-04-24 22:10:58,270 p=22874 u=mistral |  FAILED - RETRYING: Disable the haproxy cluster resource (4 retries left).
2018-04-24 22:11:05,773 p=22874 u=mistral |  FAILED - RETRYING: Disable the haproxy cluster resource (3 retries left).
2018-04-24 22:11:13,349 p=22874 u=mistral |  FAILED - RETRYING: Disable the haproxy cluster resource (2 retries left).
2018-04-24 22:11:20,953 p=22874 u=mistral |  FAILED - RETRYING: Disable the haproxy cluster resource (1 retries left).
2018-04-24 22:11:28,452 p=22874 u=mistral |  fatal: [192.168.24.18]: FAILED! => {"attempts": 5, "changed": false, "error": "Error: resource/clone/master/group/bundle 'haproxy-bundle' does not exist\n", "msg": "Failed, to set the resource haproxy-bundle to the state disable", "output": "", "rc": 1}


Expected results:
The upgrade tasks are idempotent, so the operator can re-run the upgrade commands and recover from failed upgrade attempts.

Additional info:
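For illustration only, a minimal sketch of what an idempotent version of the failing step could look like: check that the resource exists before trying to disable it, and treat a missing resource as "nothing to do". This is not the fix that was merged for this bug. The helper names are hypothetical, and the sketch assumes that `pcs resource show <name>` exits non-zero when the resource does not exist.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical helper type: runs a command and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit code."""
    return subprocess.run(
        list(cmd), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def disable_if_present(resource: str, runner: Runner = run) -> str:
    """Disable a pacemaker resource only if it currently exists.

    Returns "skipped" when the resource is absent, so a second run after a
    partial failure does not error out on a resource that is already gone.
    """
    # Assumption: a non-zero exit code here means the resource is unknown.
    if runner(["pcs", "resource", "show", resource]) != 0:
        return "skipped"
    if runner(["pcs", "resource", "disable", resource]) != 0:
        raise RuntimeError(f"failed to disable {resource}")
    return "disabled"
```

Called as disable_if_present("haproxy-bundle") on a node where the bundle does not exist, this returns "skipped" instead of retrying and failing as in the log above.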

Comment 1 Marius Cornea 2018-04-25 02:37:15 UTC
Created attachment 1426332 [details]
logs.tar.gz

Attaching logs + playbooks.

Comment 2 Jose Luis Franco 2018-04-30 14:44:25 UTC
@Marios, this BZ was assigned to you during the triage duty call. Please feel free to reassign.

Comment 3 Marios Andreou 2018-05-02 11:59:17 UTC
Marking triaged. This might be related to BZ 1571549 and https://review.openstack.org/#/c/563073/, but I need to look at the upgrade tasks and at what specifically failed here.

Comment 4 Marios Andreou 2018-05-03 11:56:00 UTC
I think this is related to BZ 1571549, as the two will have similar fixes, but this one needs its own fix. I'm digging into it today and will post something. Thanks.

Comment 5 Marios Andreou 2018-05-03 12:48:41 UTC
Hi mcornea, after more investigation [0], especially looking at the attached logs (very helpful, thanks very much), I think this is indeed a duplicate of BZ 1571549. Can you please try again, making sure you have openstack-tripleo-heat-templates-8.0.2-5.el7ost or newer with the fix in https://review.openstack.org/#/c/563588/?

If it reproduces, I can investigate further as a matter of urgency; otherwise we can close this as a duplicate. Thanks.

[0] This is all in the logs, but for convenience: the failed trace is at http://pastebin.test.redhat.com/585636 and the relevant upgrade tasks from the environment are at http://pastebin.test.redhat.com/585641
So it looks like you were running the 'already containerized' tasks, but were missing the check that is added in the /#/c/563588/ review.

Comment 11 Marios Andreou 2018-05-18 11:25:22 UTC
Moving this to ON_QA as discussed so it can be picked up for testing. As per comment #5, we hope this is fixed by BZ 1571549.

Comment 13 Marius Cornea 2018-05-18 16:18:45 UTC
I was able to run the controllers upgrade twice so this issue is fixed.

Comment 16 errata-xmlrpc 2018-06-27 13:53:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086