Bug 1631382

Summary: [UPGRADES][14] external-upgrade failed during ceph upgrade
Product: Red Hat OpenStack
Reporter: Yurii Prokulevych <yprokule>
Component: python-tripleoclient
Assignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA
QA Contact: Yurii Prokulevych <yprokule>
Severity: high
Priority: high
Version: 14.0 (Rocky)
CC: augol, ccamacho, gfidente, hbrock, jpichon, jslagle, jstransk, mbracho, mburns, sclewis, yprokule
Target Milestone: beta
Keywords: Triaged
Target Release: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: python-tripleoclient-10.6.1-0.20180929200237.1d8dcb6.el7ost
Last Closed: 2019-01-11 11:53:11 UTC
Type: Bug

Description Yurii Prokulevych 2018-09-20 13:00:21 UTC
Description of problem:
-----------------------
Running the following command failed:
openstack overcloud external-upgrade run \
    --stack QE-Cloud-0 2>&1 
...
ERROR openstack [-] Update failed with: {u'status': u'RUNNING', u'message': u'ason\\": \\"Conditional result was False\\"}", "", "TASK [ceph-osd : include common.yml]
... (the full command log is attached with the sosreports)


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
python-tripleoclient-10.5.1-0.20180901082351.6d7aa74.el7ost.noarch
python-tripleoclient-heat-installer-10.5.1-0.20180901082351.6d7aa74.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade the undercloud (UC) to RHOS-14
2. Upgrade the overcloud (OC) to RHOS-14
3. Start the Ceph upgrade:
    openstack overcloud external-upgrade run

Actual results:
---------------
The upgrade failed with no obvious reason logged.

Comment 2 Jiri Stransky 2018-10-01 12:49:12 UTC
Based on our earlier investigation, it seems that ceph-ansible finished fine and only the CLI command failed.

It should be fine to ignore the failure and continue with the upgrade procedure, but this needs fixing before release. I'll triage this as high/high, but I'll keep the blocker flag.

Comment 4 Jiri Stransky 2018-10-02 14:22:14 UTC
Yurii, is this consistently reproducible? I ran the external-upgrade command and couldn't reproduce the error; the command itself seems to work fine for me.

I wonder whether the ceph-ansible output is too big and somehow breaks the Zaqar message processing. That's the only sensible explanation I can think of. The task "run ceph-ansible" was successful, then we see incomplete output from it, and the CLI command crashes. That suggests nothing was broken in Ansible, but that something exceeded a limit while the log output was being communicated.

I'll add DFG:Ceph too, since we'll likely want to address this in the Ceph composable service Ansible tasks, perhaps by cutting the output into smaller chunks. Looking into this.
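
To illustrate the "smaller chunks" idea, here is a minimal Python sketch. It is not the patch that was later proposed for this bug, and it does not use any Zaqar or tripleoclient API. The function name `chunk_lines` and the `max_bytes` limit are hypothetical; the real message size limit, if that is indeed the cause, would have to be taken from the actual deployment configuration.

```python
def chunk_lines(lines, max_bytes):
    """Group log lines into chunks whose UTF-8 size stays within max_bytes.

    Hypothetical helper for illustration only. A single line longer than
    max_bytes is still emitted, alone, in its own (oversized) chunk.
    """
    chunks = []
    current = []
    current_size = 0
    for line in lines:
        # +1 accounts for the newline that would join the lines in a message.
        size = len(line.encode("utf-8")) + 1
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current = []
            current_size = 0
        current.append(line)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be sent as a separate message, so that no single message carries the whole ceph-ansible log.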

Comment 5 Jiri Stransky 2018-10-02 16:18:47 UTC
We still don't know the root cause of this bug with certainty, but if my guess above is correct, we may be able to solve it this way:

https://review.openstack.org/#/c/607302/

Comment 6 Jiri Stransky 2018-10-04 14:13:38 UTC
Merged to master, merging to stable/rocky.

Comment 7 Jiri Stransky 2018-10-04 14:15:11 UTC
I'll cancel needinfo on Yurii. The patch we have now is our best shot anyway :)

Comment 14 Yurii Prokulevych 2018-12-17 14:19:48 UTC
Verified with:
- openstack-tripleo-heat-templates-9.0.1-0.20181013060906.el7ost.noarch
- ceph-ansible-3.1.10-1.el7cp.noarch
- python-tripleoclient-10.6.1-0.20181010222412.8c8f259.el7ost.noarch

openstack overcloud external-upgrade run \
    --stack qe-Cloud-0 \
    --tags ceph 2>&1
...
 u'PLAY RECAP *********************************************************************',
 u'ceph-0                     : ok=2    changed=0    unreachable=0    failed=0   ',
 u'ceph-1                     : ok=2    changed=0    unreachable=0    failed=0   ',
 u'ceph-2                     : ok=2    changed=0    unreachable=0    failed=0   ',
 u'compute-0                  : ok=2    changed=0    unreachable=0    failed=0   ',
 u'compute-1                  : ok=2    changed=0    unreachable=0    failed=0   ',
 u'controller-0               : ok=2    changed=0    unreachable=0    failed=0   ',
 u'controller-1               : ok=2    changed=0    unreachable=0    failed=0   ',
 u'controller-2               : ok=2    changed=0    unreachable=0    failed=0   ',
 u'undercloud                 : ok=40   changed=18   unreachable=0    failed=0   ',
 u'',
 u'Monday 17 December 2018  08:20:23 -0500 (0:00:00.024)       0:12:33.132 ******* ',
 u'=============================================================================== ']
[u'Updated nodes - all']
Success
2018-12-17 08:20:25.807 531797 INFO tripleoclient.v1.overcloud_external_upgrade.ExternalUpgradeRun [-] Completed Overcloud External Upgrade Run.
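
A recap like the one above can also be checked mechanically. The following Python sketch assumes the recap lines have already been extracted as plain strings, without the `u'...'` wrappers shown in the log. The name `failed_hosts` and the regular expression are illustrative and are not part of tripleoclient or Ansible.

```python
import re

# Matches one host line of an Ansible "PLAY RECAP" section, e.g.
# "ceph-0   : ok=2    changed=0    unreachable=0    failed=0"
RECAP_RE = re.compile(
    r"^(?P<host>\S+)\s*:\s*ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)\s+"
    r"unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)"
)


def failed_hosts(recap_lines):
    """Return hosts whose recap line reports unreachable or failed tasks."""
    bad = []
    for line in recap_lines:
        match = RECAP_RE.match(line.strip())
        if match and (int(match["unreachable"]) or int(match["failed"])):
            bad.append(match["host"])
    return bad
```

For the recap in this comment the function would return an empty list, because every host reports `unreachable=0` and `failed=0`.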

Comment 16 errata-xmlrpc 2019-01-11 11:53:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045