Bug 1609459 - Ceph mons die during FFU (or upgrade?)
Summary: Ceph mons die during FFU (or upgrade?)
Keywords:
Status: CLOSED DUPLICATE of bug 1609966
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Emilien Macchi
QA Contact: Gurenko Alex
URL:
Whiteboard:
Depends On: 1499098
Blocks:
Reported: 2018-07-28 04:02 UTC by Ian Pilcher
Modified: 2018-08-08 13:24 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-08 13:24:24 UTC
Target Upstream Version:
Embargoed:


Attachments
Journal entries showing failed update (monitor failed to restart) (8.47 KB, text/plain)
2018-07-28 04:02 UTC, Ian Pilcher
Journal entries from standalone update (5.50 KB, text/plain)
2018-07-28 04:03 UTC, Ian Pilcher

Description Ian Pilcher 2018-07-28 04:02:16 UTC
Created attachment 1471196
Journal entries showing failed update (monitor failed to restart)

During a fast-forward upgrade, the 'openstack overcloud upgrade run --roles Controller --skip-tags validation' command updates both Ceph and OSP RPMs.  On slower systems, the Ceph monitor service can fail to restart.

The sequence seems to be:

1. The first "wave" of Ceph RPMs is updated - libradosstriper1, ceph-common,
   ceph-base, ceph-selinux.

2. The ceph-selinux update causes a restart of all Ceph services (ceph-mon in
   this case).

3. Because the ceph-mon package hasn't yet been updated, the restarted
   monitor fails with an undefined symbol error.

4. Many other packages are updated ...

5. systemd tries to start the Ceph monitor several times, but it keeps
   failing (see #3).  Finally, systemd gives up.

6. More packages are updated ...

7. Eventually the remaining Ceph packages are updated - ceph-radosgw,
   ceph-mon, and ceph-osd.  Now the Ceph monitor could be started, but
   it's too late; systemd has already given up.

When the Ceph RPMs are updated by themselves, the undefined symbol error still occurs, but the ceph-mon package then gets updated and the service starts successfully on the next attempt.

So this is a problem for the OSP FFU/upgrade workflow because of the large number of packages that are updated at the same time as the Ceph packages.  (It's also exposed by running on slower systems.)
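
Once the remaining Ceph packages are installed (step 7), the monitor can usually be brought back by hand. The following is only a dry-run sketch that prints the commands instead of running them. It assumes the monitor runs as a non-containerized unit named ceph-mon@<short hostname>.service; that unit name and the helper function names are assumptions, not something taken from the attached journals.

```shell
#!/bin/sh
# Dry-run sketch: prints commands rather than running them.
# Assumption: the monitor runs as ceph-mon@<short hostname>.service.

# Build the (assumed) systemd unit name for a given short hostname.
mon_unit() {
    printf 'ceph-mon@%s.service\n' "$1"
}

# Print the commands to check and restart the monitor on one node.
print_mon_recovery() {
    unit=$(mon_unit "$1")
    # Confirm that systemd has marked the unit as failed.
    printf 'systemctl is-failed %s\n' "$unit"
    # Clear the failed state so that systemd accepts a new start request.
    printf 'sudo systemctl reset-failed %s\n' "$unit"
    # Start the monitor now that ceph-mon matches the other Ceph packages.
    printf 'sudo systemctl start %s\n' "$unit"
}

# Use the local node name with any domain part stripped.
node=$(uname -n)
print_mon_recovery "${node%%.*}"
```

Review the printed commands before running them on a controller.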

Comment 1 Ian Pilcher 2018-07-28 04:03:29 UTC
Created attachment 1471197
Journal entries from standalone update

Comment 2 Ian Pilcher 2018-07-28 17:06:25 UTC
Updating the Ceph packages on the controllers (yum update ceph-common) before running 'openstack overcloud upgrade run --roles Controller --skip-tags validation' allows the upgrade to succeed.
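
The order of operations in this workaround can be sketched as a dry run that only prints each step. The controller host names and the heat-admin SSH user are assumptions for illustration; the two commands themselves are the ones quoted in this comment.

```shell
#!/bin/sh
# Dry-run sketch: prints the steps in order instead of executing them.
# Hypothetical host names and SSH user; replace with the real values.
CONTROLLERS="controller-0 controller-1 controller-2"
SSH_USER="heat-admin"

print_workaround_steps() {
    for host in $CONTROLLERS; do
        # Step 1 (each controller): update Ceph ahead of the bulk update.
        printf 'ssh %s@%s sudo yum -y update ceph-common\n' "$SSH_USER" "$host"
    done
    # Step 2 (undercloud): run the controller upgrade as usual.
    printf '%s\n' "openstack overcloud upgrade run --roles Controller --skip-tags validation"
}

print_workaround_steps
```

Updating Ceph first means the Ceph packages are already current when the larger upgrade transaction runs, so the gap between the two Ceph "waves" no longer spans the whole package set.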

Comment 3 Ian Pilcher 2018-08-01 18:00:46 UTC
This is likely an artifact of running 'rhos-release 13' on my overcloud nodes, which enables the Ceph 3 repositories at the wrong time.

Comment 4 Ian Pilcher 2018-08-01 19:45:59 UTC
Upon reflection, I'm going to re-open this.  Given how easy it is to hit this issue with an incorrect repo setup, I think it's worth considering specifically excluding Ceph repos/packages during yum updates.
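
As a rough illustration of what excluding Ceph packages could look like at the yum level, the sketch below builds a command line using yum's --exclude option. The glob patterns are derived only from the package names mentioned in this report and are almost certainly incomplete (other Ceph client libraries would need the same treatment); this is not how the upgrade tooling actually invokes yum.

```shell
#!/bin/sh
# Sketch only: the globs come from the packages named in this report
# and are probably not a complete list of Ceph-related RPMs.
CEPH_EXCLUDES="ceph-* libradosstriper*"

# Print a yum command line that skips the Ceph packages.
yum_update_without_ceph() {
    cmd="yum -y update"
    # Disable pathname expansion so the globs are not matched against files.
    set -f
    for pattern in $CEPH_EXCLUDES; do
        # Single-quote each pattern so it reaches yum unexpanded.
        cmd="$cmd --exclude='$pattern'"
    done
    set +f
    printf '%s\n' "$cmd"
}

yum_update_without_ceph
```

The function only prints the command, so the result can be reviewed before anything is run on a node.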

Comment 5 John Fulton 2018-08-08 13:24:24 UTC

*** This bug has been marked as a duplicate of bug 1609966 ***

