Bug 1571500

Summary:

FFU: openstack overcloud upgrade run --roles Controller is not idempotent

Product:

Red Hat OpenStack

Reporter:

Marius Cornea <mcornea>

Component:

openstack-tripleo-heat-templates

Assignee:

Emilien Macchi <emacchi>

Status:

CLOSED ERRATA

QA Contact:

Gurenko Alex <agurenko>

Severity:

urgent

Docs Contact:

Priority:

high

Version:

13.0 (Queens)

CC:

ccamacho, dbecker, jfrancoa, mbracho, mburns, mcornea, morazi, sathlang, sclewis

Target Milestone:

Keywords:

Triaged

Target Release:

13.0 (Queens)

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

openstack-tripleo-heat-templates-8.0.2-5.el7ost

Last Closed:

2018-06-27 13:53:37 UTC

Type:

Bug

Attachments:

Description	Flags
logs.tar.gz	none

Description Marius Cornea 2018-04-25 02:25:51 UTC

Description of problem:
openstack overcloud upgrade run --roles Controller is not idempotent. As a result, if the command fails, the user cannot re-run it to recover and continue with the upgrade.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.2-0.20180416194362.29a5ad5.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph nodes
2. Upgrade undercloud to OSP11/12/13
3. Run FFU prepare: openstack overcloud ffwd-upgrade prepare
4. Run FFU run: openstack overcloud ffwd-upgrade run 
5. Upgrade controller: openstack overcloud upgrade run --roles Controller --skip-tags validation
6. When launching the containers, simulate a failure (for example, stop one docker-puppet-* container)
7. Re-run openstack overcloud upgrade run --roles Controller --skip-tags validation

Actual results:
'Disable the haproxy cluster resource' fails because the haproxy pacemaker resource doesn't exist anymore:

2018-04-24 22:10:48,073 p=22874 u=mistral |  TASK [Disable the haproxy cluster resource] ************************************
2018-04-24 22:10:48,100 p=22874 u=mistral |  skipping: [192.168.24.11] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-04-24 22:10:48,123 p=22874 u=mistral |  skipping: [192.168.24.16] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-04-24 22:10:50,801 p=22874 u=mistral |  FAILED - RETRYING: Disable the haproxy cluster resource (5 retries left).
2018-04-24 22:10:58,270 p=22874 u=mistral |  FAILED - RETRYING: Disable the haproxy cluster resource (4 retries left).
2018-04-24 22:11:05,773 p=22874 u=mistral |  FAILED - RETRYING: Disable the haproxy cluster resource (3 retries left).
2018-04-24 22:11:13,349 p=22874 u=mistral |  FAILED - RETRYING: Disable the haproxy cluster resource (2 retries left).
2018-04-24 22:11:20,953 p=22874 u=mistral |  FAILED - RETRYING: Disable the haproxy cluster resource (1 retries left).
2018-04-24 22:11:28,452 p=22874 u=mistral |  fatal: [192.168.24.18]: FAILED! => {"attempts": 5, "changed": false, "error": "Error: resource/clone/master/group/bundle 'haproxy-bundle' does not exist\n", "msg": "Failed, to set the resource haproxy-bundle to the state disable", "output": "", "rc": 1}
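
The failure above comes from disabling a resource that is already gone. A minimal sketch of a guard that turns a second pass into a no-op is shown here. It is not the fix from the linked reviews; the helper names `disable_if_present` and `_run` are hypothetical, and it assumes that `pcs resource show <name>` exits non-zero for an unknown resource.

```python
import subprocess


def _run(argv):
    """Run a command and return its exit code (output is discarded)."""
    return subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def disable_if_present(resource, run=_run):
    """Disable a pacemaker resource only when it still exists.

    Returns "disabled" when the resource was found and disabled, and
    "skipped" when it is already gone, so a re-run does not fail.
    `run` is a hypothetical hook (argv list -> exit code) for testing.
    """
    # Assumption: `pcs resource show <name>` fails for an unknown resource.
    if run(["pcs", "resource", "show", resource]) != 0:
        return "skipped"
    if run(["pcs", "resource", "disable", resource]) != 0:
        raise RuntimeError("failed to disable %s" % resource)
    return "disabled"
```

With this shape, the first pass disables haproxy-bundle and a later pass, after the resource has been removed, reports "skipped" instead of exhausting its retries.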


Expected results:
The upgrade tasks are idempotent, so the operator can re-run the upgrade commands and recover from failed upgrade attempts.
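
One way to check this expectation is to run the controller upgrade command twice and confirm that both passes succeed. The sketch below assumes it is run from the undercloud with the usual credentials loaded; `run_twice`, `is_rerunnable`, and the `runner` hook are hypothetical names, and only the command string is taken from the steps in this report.

```python
import shlex
import subprocess

# Command taken verbatim from steps 5 and 7 of this report.
UPGRADE_CMD = (
    "openstack overcloud upgrade run --roles Controller --skip-tags validation"
)


def run_twice(cmd=UPGRADE_CMD, runner=None):
    """Run cmd twice in a row and return both exit codes.

    `runner` is a hypothetical hook (argv list -> exit code) so the check
    can be exercised without an undercloud; by default the real command runs.
    """
    if runner is None:
        runner = lambda argv: subprocess.run(argv).returncode
    argv = shlex.split(cmd)
    return [runner(argv), runner(argv)]


def is_rerunnable(exit_codes):
    """A re-runnable command succeeds on every pass."""
    return all(code == 0 for code in exit_codes)
```

For example, `is_rerunnable(run_twice())` would be expected to return False on the affected package version, where the second pass fails.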

Additional info:

Comment 1 Marius Cornea 2018-04-25 02:37:15 UTC

Created attachment 1426332 [details]
logs.tar.gz

Attaching logs + playbooks.

Comment 2 Jose Luis Franco 2018-04-30 14:44:25 UTC

@Marios, this BZ has been assigned to you during the triage duty call. Please feel free to reassign.

Comment 3 Marios Andreou 2018-05-02 11:59:17 UTC

Marking as triaged. This might be related to BZ 1571549 and https://review.openstack.org/#/c/563073/, but I need to look at the upgrade tasks and what specifically failed here.

Comment 4 Marios Andreou 2018-05-03 11:56:00 UTC

I think this is related to BZ 1571549, as the two will have a similar fix, but this one needs its own fix. I'm digging into it today and will post an update, thanks.

Comment 5 Marios Andreou 2018-05-03 12:48:41 UTC

Hi mcornea, after more investigation [0], especially looking at the attached logs (very helpful, thanks very much), I think this is indeed a duplicate of BZ 1571549. Can you please try again, making sure you have openstack-tripleo-heat-templates-8.0.2-5.el7ost or newer with the fix in https://review.openstack.org/#/c/563588/?

If it reproduces, I can investigate further as a matter of urgency; otherwise we can close this as a duplicate, thanks.

[0] This is all in the logs, but for convenience: the failed trace looks like http://pastebin.test.redhat.com/585636 and the relevant upgrade tasks from the environment look like http://pastebin.test.redhat.com/585641
So it looks like you were running the 'already containerized' tasks, but were missing the check that is added in the /#/c/563588/ review.

Comment 11 Marios Andreou 2018-05-18 11:25:22 UTC

Moving this to ON_QA as discussed so it can be picked up for testing. As per comment #5, we hope this is fixed by BZ 1571549.

Comment 13 Marius Cornea 2018-05-18 16:18:45 UTC

I was able to run the controllers upgrade twice, so this issue is fixed.

Comment 16 errata-xmlrpc 2018-06-27 13:53:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086