Bug 1609459
| Field | Value |
|---|---|
| Summary | Ceph mons die during FFU (or upgrade?) |
| Product | Red Hat OpenStack |
| Component | openstack-tripleo-heat-templates |
| Version | 13.0 (Queens) |
| Status | CLOSED DUPLICATE |
| Severity | unspecified |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Ian Pilcher <ipilcher> |
| Assignee | Emilien Macchi <emacchi> |
| QA Contact | Gurenko Alex <agurenko> |
| CC | aschultz, gfidente, jfrancoa, johfulto, mburns |
| Keywords | Reopened |
| Type | Bug |
| Last Closed | 2018-08-08 13:24:24 UTC |
| Bug Depends On | 1499098 |
Created attachment 1471197
Journal entries from standalone update
Updating the Ceph packages on the controllers (yum update ceph-common) before running 'openstack overcloud upgrade run --roles Controller --skip-tags validation' allows the upgrade to succeed.

This is likely an artifact of running 'rhos-release 13' on my overcloud nodes, which is enabling Ceph 3 repositories at the wrong time.

Upon reflection, I'm going to re-open this. Given how easy it is to hit this issue with an incorrect repo setup, I think it's worth considering specifically excluding Ceph repos/packages during yum updates.

*** This bug has been marked as a duplicate of bug 1609966 ***
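The workaround ordering mentioned in the comments above could be scripted roughly as follows. This is only a sketch: the controller host names, the `heat-admin` SSH user, and the `run` and `update_ceph_first` helpers are assumptions, not taken from this report. The script defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical controller host names; replace with the real ones.
CONTROLLERS="${CONTROLLERS:-controller-0 controller-1 controller-2}"
# Assumption: heat-admin is the SSH user on the overcloud nodes.
SSH_USER="${SSH_USER:-heat-admin}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

update_ceph_first() {
    # Step 1: update the Ceph packages on each controller on their own,
    # so the monitor restart is not separated from the ceph-mon update
    # by a long run of unrelated package updates.
    for node in $CONTROLLERS; do
        run ssh "$SSH_USER@$node" sudo yum -y update ceph-common
    done
    # Step 2: run the controller upgrade from the undercloud, as in the report.
    run openstack overcloud upgrade run --roles Controller --skip-tags validation
}

update_ceph_first
```

Reviewing the dry-run output before setting `DRY_RUN=0` makes it easy to check the host list and the command order first.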
Created attachment 1471196
Journal entries showing failed update (monitor failed to restart)

During a fast-forward upgrade, the 'openstack overcloud upgrade run --roles Controller --skip-tags validation' command updates both Ceph and OSP RPMs. On slower systems, the Ceph monitor service can fail to restart. The sequence seems to be:

1. The first "wave" of Ceph RPMs is updated - libradosstriper1, ceph-common, ceph-base, ceph-selinux.
2. The ceph-selinux update causes a restart of all Ceph services (ceph-mon in this case).
3. Because the ceph-mon package hasn't yet been updated, the monitor fails with an undefined symbol error.
4. Many other packages are updated ...
5. systemd tries to start the Ceph monitor several times, but it keeps failing (see step 3). Finally, systemd gives up.
6. More packages are updated ...
7. Eventually the remaining Ceph packages are updated - ceph-radosgw, ceph-mon, and ceph-osd. Now the Ceph monitor could be started, but it's too late; systemd has already given up.

When the Ceph RPMs are updated by themselves, the undefined symbol error still occurs, but the ceph-mon package then gets updated and the service starts successfully on the next attempt.

So this is a problem for the OSP FFU/upgrade workflow because of the large number of packages that are updated at the same time as the Ceph packages. (It's also exposed by running on slower systems.)
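If the monitor has already reached the point where systemd gives up, one possible recovery once the package transaction has finished is to confirm that the Ceph packages are all at a single version, then clear the unit's failed state and start it again. The following is a minimal sketch, not a tested procedure. It assumes the monitor unit is named `ceph-mon@<short hostname>`, which this report does not confirm, and `run`, `all_same_version` and `recover_mon` are hypothetical helper names. It defaults to a dry run that only prints the `systemctl` commands.

```shell
#!/bin/sh
# Dry run by default: systemctl commands are printed, not executed.
# Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Succeeds only when stdin has at least one line and every line is
# identical, i.e. all queried packages report the same version-release.
all_same_version() {
    count=$(sort -u | wc -l)
    [ "$((count))" -eq 1 ]
}

recover_mon() {
    # Assumption: the monitor unit is ceph-mon@<short hostname>.
    mon_unit="${MON_UNIT:-ceph-mon@$(hostname -s)}"
    if rpm -q --qf '%{VERSION}-%{RELEASE}\n' \
            ceph-common ceph-base ceph-selinux ceph-mon | all_same_version; then
        # Clear the failed/start-limit state, then start the monitor.
        run systemctl reset-failed "$mon_unit"
        run systemctl start "$mon_unit"
    else
        echo "Ceph packages are at mixed versions; finish the update first" >&2
        return 1
    fi
}
```

Calling `recover_mon` on an affected controller prints the two `systemctl` commands when the versions agree, and refuses to act while the first and second "waves" of Ceph packages are still out of step.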