Bug 1609459

Summary: Ceph mons die during FFU (or upgrade?)
Product: Red Hat OpenStack
Reporter: Ian Pilcher <ipilcher>
Component: openstack-tripleo-heat-templates
Assignee: Emilien Macchi <emacchi>
Status: CLOSED DUPLICATE
QA Contact: Gurenko Alex <agurenko>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 13.0 (Queens)
CC: aschultz, gfidente, jfrancoa, johfulto, mburns
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-08-08 13:24:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1499098
Bug Blocks:
Attachments:
Journal entries showing failed update (monitor failed to restart) (Flags: none)
Journal entries from standalone update (Flags: none)

Description Ian Pilcher 2018-07-28 04:02:16 UTC
Created attachment 1471196
Journal entries showing failed update (monitor failed to restart)

During a fast-forward upgrade, the 'openstack overcloud upgrade run --roles Controller --skip-tags validation' command updates both Ceph and OSP RPMs.  On slower systems, the Ceph monitor service can fail to restart.

The sequence seems to be:

1. The first "wave" of Ceph RPMs is updated - libradosstriper1,
   ceph-common, ceph-base, ceph-selinux.

2. The ceph-selinux update triggers a restart of all Ceph services
   (ceph-mon in this case).

3. Because the ceph-mon package hasn't been updated yet, the monitor
   fails to start with an undefined symbol error.

4. Many other packages are updated ...

5. systemd tries to start the Ceph monitor several times, but it keeps
   failing (see step 3).  Finally, systemd gives up.

6. More packages are updated ...

7. Eventually the remaining Ceph packages are updated - ceph-radosgw,
   ceph-mon, and ceph-osd.  Now the Ceph monitor could be started, but
   it's too late; systemd has already given up.
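
The race in the sequence above can be illustrated with a toy model.  This is only a sketch, not how systemd is implemented: the function name, the fixed retry spacing, and the default values for restart_sec and start_limit_burst are assumptions chosen for illustration (the real values come from the ceph-mon unit file on the node), and the model ignores the start-limit interval.

```python
def monitor_survives(mon_updated_at, restart_sec=10.0, start_limit_burst=5):
    """Toy model of the restart race (hypothetical helper, not a systemd API).

    Time 0 is the first restart triggered by the ceph-selinux update.
    Attempt k (0-based) happens at k * restart_sec seconds.  An attempt
    succeeds only if the ceph-mon RPM has already been updated by then,
    i.e. mon_updated_at <= attempt time.  After start_limit_burst failed
    attempts, the model (like systemd) stops retrying.
    """
    for attempt in range(start_limit_burst):
        attempt_time = attempt * restart_sec
        if attempt_time >= mon_updated_at:
            return True   # ceph-mon binary matches the libraries; start succeeds
    return False          # every attempt hit the undefined symbol; unit stays failed


# Standalone Ceph update: ceph-mon is updated shortly after ceph-selinux.
print(monitor_survives(mon_updated_at=15))
# Full upgrade run on a slow node: many other packages are updated first.
print(monitor_survives(mon_updated_at=600))
```

With these assumed numbers the retries span 40 seconds, so the monitor only comes back if the ceph-mon package is updated within that window.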

When the Ceph RPMs are updated by themselves, the undefined symbol error still occurs, but the ceph-mon package is updated shortly afterward and the service starts successfully on the next attempt.

So this is a problem for the OSP FFU/upgrade workflow because of the large number of packages that are updated at the same time as the Ceph packages.  (It's also exposed by running on slower systems.)

Comment 1 Ian Pilcher 2018-07-28 04:03:29 UTC
Created attachment 1471197
Journal entries from standalone update

Comment 2 Ian Pilcher 2018-07-28 17:06:25 UTC
Updating the Ceph packages on the controllers (yum update ceph-common) before running 'openstack overcloud upgrade run --roles Controller --skip-tags validation' allows the upgrade to succeed.

Comment 3 Ian Pilcher 2018-08-01 18:00:46 UTC
This is likely an artifact of running 'rhos-release 13' on my overcloud nodes, which is enabling Ceph 3 repositories at the wrong time.

Comment 4 Ian Pilcher 2018-08-01 19:45:59 UTC
Upon reflection, I'm going to re-open this.  Given how easy it is to hit this issue with an incorrect repo setup, I think it's worth considering specifically excluding Ceph repos/packages during yum updates.
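
As a rough sketch of what such an exclusion could look like, the helper below builds a yum command line that skips packages matching given globs.  The function name and the 'ceph*' pattern are hypothetical; --exclude and -y are standard yum options, but a real exclusion list would also need to cover Ceph libraries whose names do not start with "ceph" (libradosstriper1, for example), and nothing here reflects what the upgrade tasks actually run.

```python
def yum_update_argv(exclude_globs):
    """Build an argv list for 'yum update' that skips matching packages.

    Hypothetical helper for illustration only.  Each glob is passed as a
    separate --exclude option.
    """
    argv = ["yum", "-y", "update"]
    for glob in exclude_globs:
        argv.append("--exclude=" + glob)
    return argv


# Illustrative pattern only; it does not match every Ceph-related RPM.
print(yum_update_argv(["ceph*"]))
```

Passing the list to something like subprocess.run without a shell keeps the globs from being expanded by the shell before yum sees them.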

Comment 5 John Fulton 2018-08-08 13:24:24 UTC

*** This bug has been marked as a duplicate of bug 1609966 ***