Bug 1609459

Summary:

Ceph mons die during FFU (or upgrade?)

Product:

Red Hat OpenStack

Reporter:

Ian Pilcher <ipilcher>

Component:

openstack-tripleo-heat-templates

Assignee:

Emilien Macchi <emacchi>

Status:

CLOSED DUPLICATE

QA Contact:

Gurenko Alex <agurenko>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

13.0 (Queens)

CC:

aschultz, gfidente, jfrancoa, johfulto, mburns

Target Milestone:

---

Keywords:

Reopened

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2018-08-08 13:24:24 UTC

Type:

Bug

Bug Depends On:

1499098

Bug Blocks:

Attachments:

Description	Flags
Journal entries showing failed update (monitor failed to restart)	none
Journal entries from standalone update	none

Description Ian Pilcher 2018-07-28 04:02:16 UTC

Created attachment 1471196 [details]
Journal entries showing failed update (monitor failed to restart)

During a fast-forward upgrade, the 'openstack overcloud upgrade run --roles Controller --skip-tags validation' command updates both Ceph and OSP RPMs.  On slower systems, the Ceph monitor service can fail to restart.

The sequence seems to be:

1. The first "wave" of Ceph RPMs is updated - libradosstriper1, ceph-common,
   ceph-base, ceph-selinux.

2. The ceph-selinux update causes a restart of all Ceph services (ceph-mon in
   this case).

3. Because the ceph-mon package hasn't yet been updated, it fails with an
   undefined symbol error.

4. Many other packages are updated ...

5. systemd tries to start the Ceph monitor several times, but it keeps
   failing (see step 3).  Finally, systemd gives up.

6. More packages are updated ...

7. Eventually the remaining Ceph packages are updated - ceph-radosgw,
   ceph-mon, and ceph-osd.  Now the Ceph monitor could be started, but
   it's too late; systemd has already given up.
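
As a rough illustration of the state after step 1, here is a minimal Python sketch that flags a half-updated Ceph package set. It assumes the Ceph packages named in this report normally share one version string when fully updated; the "old"/"new" labels are placeholders, not real version numbers, and find_version_skew / is_partially_updated are hypothetical helper names.

```python
from collections import defaultdict


def find_version_skew(installed):
    """Group package names by version string.

    `installed` maps package name -> version (placeholder strings here).
    More than one key in the result means the set is only partly updated.
    """
    by_version = defaultdict(list)
    for name, version in installed.items():
        by_version[version].append(name)
    return {version: sorted(names) for version, names in by_version.items()}


def is_partially_updated(installed):
    """True when the Ceph packages are not all at the same version."""
    return len(find_version_skew(installed)) > 1


# Hypothetical snapshot after step 1: first wave updated, the rest not yet.
after_first_wave = {
    "libradosstriper1": "new",
    "ceph-common": "new",
    "ceph-base": "new",
    "ceph-selinux": "new",
    "ceph-radosgw": "old",
    "ceph-mon": "old",
    "ceph-osd": "old",
}

if is_partially_updated(after_first_wave):
    print("Ceph packages are at mixed versions:", find_version_skew(after_first_wave))
```

In a state like this, a restart of ceph-mon would be expected to hit the undefined symbol error described in step 3.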

When the Ceph RPMs are updated by themselves, the undefined symbol error still occurs, but the ceph-mon package then gets updated and the service starts successfully on the next attempt.

So this is a problem for the OSP FFU/upgrade workflow because of the large number of packages that are updated at the same time as the Ceph packages.  (It's also exposed by running on slower systems.)
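
The race between systemd's restart attempts and the arrival of the updated ceph-mon package can be sketched with a deliberately simplified Python model. The model assumes systemd retries at a fixed interval and gives up after a fixed number of failed starts; the parameter names loosely mirror systemd's RestartSec and StartLimitBurst, but the numbers are made up and are not the values from the real ceph-mon unit file, which this report does not quote.

```python
def monitor_recovers(fixed_at, restart_sec, start_limit_burst):
    """Return True if any start attempt happens once ceph-mon is updated.

    fixed_at          -- seconds after the first restart at which the
                         ceph-mon package has been updated (hypothetical)
    restart_sec       -- delay between start attempts (hypothetical)
    start_limit_burst -- number of start attempts before systemd gives up
                         (hypothetical)
    """
    for attempt in range(start_limit_burst):
        attempt_time = attempt * restart_sec
        if attempt_time >= fixed_at:
            # The matching ceph-mon binary is installed; this start succeeds.
            return True
    # Every allowed attempt ran against the old binary; systemd has given up.
    return False


# Standalone Ceph update: ceph-mon is updated shortly after ceph-selinux.
print(monitor_recovers(fixed_at=20, restart_sec=10, start_limit_burst=5))

# Full FFU transaction on a slow node: ceph-mon is updated much later.
print(monitor_recovers(fixed_at=600, restart_sec=10, start_limit_burst=5))
```

Under these assumptions the first call reports a recovery and the second does not, which matches the two behaviours described above: the longer the gap between the ceph-selinux and ceph-mon updates, the more likely the retries run out first.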

Comment 1 Ian Pilcher 2018-07-28 04:03:29 UTC

Created attachment 1471197 [details]
Journal entries from standalone update

Comment 2 Ian Pilcher 2018-07-28 17:06:25 UTC

Updating the Ceph packages on the controllers (yum update ceph-common) before running 'openstack overcloud upgrade run --roles Controller --skip-tags validation' allows the upgrade to succeed.
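
The ordering of this workaround can be written down as a small dry-run sketch in Python. The two commands are the ones quoted in this comment; workaround_plan is a hypothetical helper, and the sketch only prints the commands. It does not model where each one runs (the yum command on each controller, the upgrade command wherever the overcloud upgrade is normally driven from).

```python
def workaround_plan():
    """Return the workaround commands in the order they should run."""
    return [
        # Step 1: update Ceph on its own, so that a ceph-mon restart
        # is not caught in the middle of a large package transaction.
        ["yum", "update", "ceph-common"],
        # Step 2: run the controller upgrade as before.
        ["openstack", "overcloud", "upgrade", "run",
         "--roles", "Controller", "--skip-tags", "validation"],
    ]


for command in workaround_plan():
    print(" ".join(command))
```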

Comment 3 Ian Pilcher 2018-08-01 18:00:46 UTC

This is likely an artifact of running 'rhos-release 13' on my overcloud nodes, which enables the Ceph 3 repositories at the wrong time.

Comment 4 Ian Pilcher 2018-08-01 19:45:59 UTC

Upon reflection, I'm going to re-open this.  Given how easy it is to hit this issue with an incorrect repo setup, I think it's worth considering specifically excluding Ceph repos/packages during yum updates.
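
One possible shape for such an exclusion, sketched in Python on the assumption that the update is a plain yum invocation: yum's --exclude option accepts package names or globs. The pattern list below is an assumption built only from the package names mentioned in this report and may well be incomplete; yum_update_excluding is a hypothetical helper, not something the upgrade tooling is known to provide.

```python
# Patterns derived from the packages named in this report (assumption:
# other Ceph-related packages may need to be added).
CEPH_PATTERNS = ["ceph*", "libradosstriper1"]


def yum_update_excluding(patterns):
    """Build a 'yum update' command line that skips the given packages."""
    command = ["yum", "-y", "update"]
    command += ["--exclude={}".format(pattern) for pattern in patterns]
    return command


# Dry run: print the command rather than executing it.
print(" ".join(yum_update_excluding(CEPH_PATTERNS)))
```

Excluding by package name protects against a misconfigured repository set, at the cost of having to keep the pattern list in step with whatever Ceph packages are actually installed.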

Comment 5 John Fulton 2018-08-08 13:24:24 UTC


*** This bug has been marked as a duplicate of bug 1609966 ***