Description of problem:

This behaviour is due to the postinstall/postuninstall scripts of the ceph-selinux package and is less than ideal in a hyperconverged scenario, where MONs must be upgraded before OSDs can be restarted.

$ rpm -qp --scripts ceph-selinux-10.2.7-41.el7cp.x86_64.rpm | head -40
postinstall scriptlet (using /bin/sh):
# backup file_contexts before update
. /etc/selinux/config
FILE_CONTEXT=/etc/selinux/${SELINUXTYPE}/contexts/files/file_contexts
cp ${FILE_CONTEXT} ${FILE_CONTEXT}.pre

# Install the policy
/usr/sbin/semodule -i /usr/share/selinux/packages/ceph.pp

# Load the policy if SELinux is enabled
if ! /usr/sbin/selinuxenabled; then
    # Do not relabel if selinux is not enabled
    exit 0
fi

if diff ${FILE_CONTEXT} ${FILE_CONTEXT}.pre > /dev/null 2>&1; then
    # Do not relabel if file contexts did not change
    exit 0
fi

# Check whether the daemons are running
/usr/bin/systemctl status ceph.target > /dev/null 2>&1
STATUS=$?

# Stop the daemons if they were running
if test $STATUS -eq 0; then
    /usr/bin/systemctl stop ceph.target > /dev/null 2>&1
fi

# Now, relabel the files
/usr/sbin/fixfiles -C ${FILE_CONTEXT}.pre restore 2> /dev/null
rm -f ${FILE_CONTEXT}.pre

# The fixfiles command won't fix label for /var/run/ceph
/usr/sbin/restorecon -R /var/run/ceph > /dev/null 2>&1

# Start the daemons iff they were running before
if test $STATUS -eq 0; then
    /usr/bin/systemctl start ceph.target > /dev/null 2>&1 || :
fi
exit 0

Actual results:
All daemons are restarted.

Expected results:
Daemons are not restarted.

Additional info:
See upstream tracker http://tracker.ceph.com/issues/21672

Steps to Reproduce:

# cp /etc/selinux/targeted/contexts/files/file_contexts /tmp/file_contexts.pre
# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   permissive
Mode from config file:          permissive
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      28
# rpm -q ceph-mon ceph-osd
ceph-mon-10.2.10-0.el7.x86_64
ceph-osd-10.2.10-0.el7.x86_64
# ps auwwx | grep ceph-
ceph  1038  0.0  2.9  357024 30232 ?     Ssl  22:34  0:00 /usr/bin/ceph-mon -f --cluster ceph --id MON1 --setuser ceph --setgroup ceph
ceph  2329  0.1  4.8  885768 49584 ?     Ssl  22:34  0:00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
ceph  2610  0.1  3.6  879044 37536 ?     Ssl  22:34  0:00 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
root  2832  0.0  0.0  112648   976 pts/0 R+   22:43  0:00 grep --color=auto ceph-
# yum -y update ceph-mon
# ps auwwx | grep ceph-
ceph  1038  0.0  3.1  364208 32268 ?     Ssl  22:34  0:00 /usr/bin/ceph-mon -f --cluster ceph --id MON1 --setuser ceph --setgroup ceph
ceph  3932  2.0  3.4  771688 34680 ?     Ssl  22:49  0:00 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
ceph  3934  3.0  3.9  777120 39932 ?     Ssl  22:49  0:00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
root  4056  0.0  0.0  112664   968 pts/0 R+   22:49  0:00 grep --color=auto ceph-
# rpm -q ceph-mon ceph-osd
ceph-mon-12.2.1-0.el7.x86_64
ceph-osd-12.2.1-0.el7.x86_64
# diff /etc/selinux/targeted/contexts/files/file_contexts /tmp/file_contexts.pre
3979d3978
< /usr/bin/ceph-mgr -- system_u:object_r:ceph_exec_t:s0

It's not clear why the MON service wasn't restarted above; it should have been, as the 'journalctl -x' output shows ceph.target being stopped and then started. Possibly a problem with my manual configuration.

I believe the relevant SELinux change in this case is commit 8f6a526f9a36ff847755cba68b6b78b37e8e99cb, but any change in the file contexts will cause this. This was all accomplished with upstream packages, but it simulates the upgrade from RHCS 2 -> RHCS 3 to at least some extent.
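The relabel (and hence the restart) is gated on the file-context comparison in the scriptlet. A minimal sketch of that gate, using throwaway temp files as stand-ins for the real file_contexts rather than touching SELinux at all:

```shell
#!/bin/sh
# Stand-in reproduction of the scriptlet's "did file contexts change?" gate.
# contexts_changed OLD NEW -> exit 0 if the files differ (relabel + restart
# would happen), nonzero otherwise.
contexts_changed() {
    ! diff "$1" "$2" > /dev/null 2>&1
}

old=$(mktemp)
new=$(mktemp)
printf '/usr/bin/ceph-mon -- system_u:object_r:ceph_exec_t:s0\n' > "$old"
cp "$old" "$new"

# Identical contexts: the scriptlet exits early, daemons are left alone.
contexts_changed "$old" "$new" || echo "no change: skip relabel"

# Simulate a new line such as the ceph-mgr entry in the diff above.
printf '/usr/bin/ceph-mgr -- system_u:object_r:ceph_exec_t:s0\n' >> "$new"
contexts_changed "$old" "$new" && echo "changed: relabel, daemons restarted"

rm -f "$old" "$new"
```

Any added, removed, or modified context line makes the diff non-empty, so even a one-line policy change (like the ceph-mgr entry) is enough to trigger the stop/relabel/start cycle.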
Note that the postuninstall script also has the potential to restart the daemons.
The daemon restart is by design. The daemons need to be down when they are being relabelled. Otherwise, they might just keep on writing improperly labelled files. However, we should document this and add a note that if you are running a hyperconverged scenario you should stop the daemons before the upgrade (so this won't really show up as an issue...) and start them after the upgrade.
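The documented workflow could be sketched roughly as below. This is an illustration, not the official upgrade procedure: the unit names and package ordering are assumptions, and the `run` wrapper only prints each command so the sketch is safe to execute as-is (drop the echo to actually run it).

```shell
#!/bin/sh
# Sketch (assumed unit/package names) of a hyperconverged node upgrade that
# avoids the %post-triggered restart. 'run' just echoes; replace with "$@" to
# execute for real.
run() { echo "+ $*" ; }

# 1. Stop all Ceph daemons first, so the ceph-selinux %post relabel finds
#    ceph.target inactive and does not stop/start anything itself.
run systemctl stop ceph.target

# 2. Upgrade and restart MONs before OSDs (required for major upgrades).
run yum -y update ceph-mon
run systemctl start ceph-mon.target

# 3. Then upgrade and restart OSDs, one node at a time.
run yum -y update ceph-osd
run systemctl start ceph-osd.target
```

The key point is step 1: because the scriptlet only restarts daemons it found running, stopping them beforehand keeps the restart order entirely under the administrator's control.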
@Anjana: Why did you move this to 3.1? Are we not documenting that the admins should stop the daemons before the upgrade to 3.0 and start them after the upgrade to 3.0? Because if not, we definitely should.
An example of where this can bite is here - https://bugzilla.redhat.com/show_bug.cgi?id=1609459 (which also explains how the mons can fail to restart). It also causes all OSDs to go down if 'yum update' is naïvely run simultaneously on all OSD nodes, which TripleO sometimes does. Note that I only hit this because running 'rhos-release 13' automatically subscribes the node to the Ceph 3 repos, and I didn't realize that wasn't supposed to happen as part of the OSP fast-forward upgrade workflow. It does seem like this is an awfully easy issue to hit.
I suppose this might be fairly easy to hit if you do not follow the upgrade instructions when upgrading Ceph. AFAIK, we only require MONs to be restarted before OSDs in the case of major upgrades, and I certainly hope we document that you should stop the daemons before doing a major upgrade (iirc, we do). If the daemons are stopped during the upgrade, they won't be restarted by the %post script that runs the relabelling, so you won't run into issues like these.
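That behaviour follows from how the scriptlet records the daemon state: the stop and start branches both key off the exit code of `systemctl status ceph.target` taken before the relabel. A sketch of that gating with `systemctl` replaced by a stub (so it can run without systemd; the real scriptlet obviously calls the real binary):

```shell
#!/bin/sh
# Stub: pretend ceph.target is active iff CEPH_TARGET_ACTIVE=yes.
systemctl() { [ "$CEPH_TARGET_ACTIVE" = yes ] ; }

# Same shape as the %post logic: remember the pre-relabel state, and only
# stop/start if the target was active at that point.
relabel_gate() {
    systemctl status ceph.target > /dev/null 2>&1
    STATUS=$?
    [ $STATUS -eq 0 ] && echo "stop ceph.target"
    echo "relabel"
    [ $STATUS -eq 0 ] && echo "start ceph.target"
    return 0
}

CEPH_TARGET_ACTIVE=yes
relabel_gate    # daemons running: stop, relabel, start
CEPH_TARGET_ACTIVE=no
relabel_gate    # stopped beforehand: relabel only, daemons stay down
```

So an administrator who stops ceph.target before `yum update` gets only the relabel, and the daemons come back up when (and in the order) the administrator chooses.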