Description of problem:

During the upgrade from OSP12 -> 13, the ceph-osd package is removed from the storage node, which kills the running OSD containers. Specifically, the preuninstall scriptlet in the ceph-osd RPM calls "systemctl stop ceph-disk@\*.service ceph-osd@\*.service ceph-osd.target" during the uninstall, which kills the containers.

In this environment the OSD nodes originally have the ceph-osd-10.2.10-28.el7cp RPM.

# rpm -qp --scripts ceph-osd-10.2.10-28.el7cp.x86_64.rpm
[...]
preuninstall scriptlet (using /bin/sh):
if [ $1 -eq 0 ] ; then
    # Package removal, not upgrade
    systemctl --no-reload disable ceph-disk@\*.service ceph-osd@\*.service ceph-osd.target > /dev/null 2>&1 || :
    systemctl stop ceph-disk@\*.service ceph-osd@\*.service ceph-osd.target > /dev/null 2>&1 || :
fi
[...]

From /var/log/messages:

Sep 20 16:11:24 ceph-host systemd: Starting Session 668 of user tripleo-admin.
Sep 20 16:11:25 ceph-host ansible-yum: Invoked with allow_downgrade=False name=['ceph-osd'] list=None install_repoquery=True conf_file=None disable_gpg_check=False state=absent disablerepo=None update_cache=False enablerepo=None exclude=None security=False validate_certs=True installroot=/ skip_broken=False
Sep 20 16:11:27 ceph-host systemd: Stopped target ceph target allowing to start/stop all ceph-osd@.service instances at once.
Sep 20 16:11:27 ceph-host systemd: Stopping ceph target allowing to start/stop all ceph-osd@.service instances at once.
Sep 20 16:11:27 ceph-host systemd: Stopping Ceph OSD...
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: Sending SIGTERM to PID 963224
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: sigterm_cleanup_post
Sep 20 16:11:27 ceph-host journal: Sending SIGTERM to PID 963224
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: 2018-09-20 16:11:27.059566 7f33c78a6700 -1 received signal: Terminated from PID: 963012 task name: /bin/bash /entrypoint.sh UID: 0
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: 2018-09-20 16:11:27.059592 7f33c78a6700 -1 osd.3 168 *** Got signal Terminated ***
Sep 20 16:11:27 ceph-host journal: sigterm_cleanup_post
Sep 20 16:11:27 ceph-host journal: 2018-09-20 16:11:27.059566 7f33c78a6700 -1 received signal: Terminated from PID: 963012 task name: /bin/bash /entrypoint.sh UID: 0
Sep 20 16:11:27 ceph-host journal: 2018-09-20 16:11:27.059592 7f33c78a6700 -1 osd.3 168 *** Got signal Terminated ***
Sep 20 16:11:27 ceph-host journal: 2018-09-20 16:11:27 /entrypoint.sh: Unmounting /dev/vdc1
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: 2018-09-20 16:11:27 /entrypoint.sh: Unmounting /dev/vdc1
Sep 20 16:11:27 ceph-host journal: umount: /var/lib/ceph/osd/ceph-3: target is busy.
Sep 20 16:11:27 ceph-host journal: (In some cases useful info about processes that use
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: umount: /var/lib/ceph/osd/ceph-3: target is busy.
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: (In some cases useful info about processes that use
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: the device is found by lsof(8) or fuser(1))
Sep 20 16:11:27 ceph-host journal: the device is found by lsof(8) or fuser(1))
Sep 20 16:11:27 ceph-host journal: 2018-09-20 16:11:27 /entrypoint.sh: Failed to umount /dev/vdc1
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: 2018-09-20 16:11:27 /entrypoint.sh: Failed to umount /dev/vdc1
Sep 20 16:11:27 ceph-host kernel: XFS (vdc1): Unmounting Filesystem
Sep 20 16:11:27 ceph-host dockerd-current: time="2018-09-20T16:11:27.183225874Z" level=error msg="containerd: deleting container" error="exit status 1: \"container 282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b does not exist\\none or more of the container deletions failed\\n\""
Sep 20 16:11:27 ceph-host dockerd-current: time="2018-09-20T16:11:27.202889350Z" level=warning msg="282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b cleanup: failed to unmount secrets: invalid argument"
Sep 20 16:11:27 ceph-host docker: ceph-osd-ceph-host-vdc
Sep 20 16:11:27 ceph-host dockerd-current: time="2018-09-20T16:11:27.212865010Z" level=error msg="Handler for POST /v1.26/containers/282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b/kill?signal=TERM returned error: Cannot kill container 282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b: No such container: 282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b"
Sep 20 16:11:27 ceph-host dockerd-current: time="2018-09-20T16:11:27.213614631Z" level=error msg="Handler for POST /v1.26/containers/282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b/kill returned error: Cannot kill container 282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b: No such container: 282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b"
Sep 20 16:11:27 ceph-host systemd: Stopped Ceph OSD.
Sep 20 16:11:27 ceph-host yum[963532]: Erased: 2:ceph-osd-10.2.10-28.el7cp.x86_64
Sep 20 16:11:27 ceph-host systemd: Reloading.

Version-Release number of selected component (if applicable):
Current OSP12 to 13 components.
ceph-osd-10.2.10-28.el7cp on the OSD node.

How reproducible:
unknown

Steps to Reproduce:
1. OSP 12 -> 13 upgrade with ceph-osd still installed on the base OSD nodes.

This can also be reproduced on an OSP 13 storage node with running OSD containers:

systemctl stop ceph-disk@\*.service ceph-osd@\*.service ceph-osd.target

Actual results:
OSD containers are killed, leaving ceph in an unhealthy state.

Expected results:
Clean upgrade.

Additional info:
I'll attach additional logs.
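For context, the behavior hinges on the `$1` argument rpm passes to the %preun scriptlet: it is the number of package instances that will remain after the transaction, so 0 means full removal and 1 (or more) means upgrade. A minimal sketch of that branch logic follows; the echo strings are placeholders standing in for the real scriptlet's systemctl calls.

```shell
#!/bin/sh
# Mimic the ceph-osd %preun scriptlet's branch. rpm invokes %preun with the
# count of package instances remaining after the transaction as $1.
preun() {
  if [ "$1" -eq 0 ] ; then
    # Package removal, not upgrade: this is where the real scriptlet runs
    # "systemctl stop ceph-disk@\*.service ceph-osd@\*.service ceph-osd.target",
    # which also kills the containerized OSDs on a migrated node.
    echo "stop ceph-osd units"
  else
    echo "no-op (upgrade)"
  fi
}

preun 0   # plain "yum remove" / "rpm -e" path, as triggered by the upgrade playbook
preun 1   # in-place RPM upgrade path: units are left alone
```

This is why `rpm -e --noscripts ceph-osd` avoids the problem: it skips the scriptlet entirely, so the removal branch never runs.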
I've suggested manually removing the ceph-osd package prior to the upgrade as a work-around:

rpm -e --noscripts ceph-osd
(In reply to Matt Flusche from comment #3)
> I've suggested manually removing the ceph-osd package prior to the upgrade
> as a work-around.
>
> rpm -e --noscripts ceph-osd

hi Matt,

thanks for reporting this bug! Can you check whether re-enabling and restarting the systemd units brings the OSDs back up into a working state?

I don't think we can remove the package before upgrading: during FFU the OSDs are not migrated into containers until the ceph-ansible run finishes, so removing the locally installed package would hit the running cluster anyway.

I believe we can start by changing the ceph-ansible workflow so that it does not remove the package from the OSD nodes at all, then work out where/how best to resolve this; it looks like something to be addressed in either ceph-ansible or the ceph RPM uninstall scripts.
(In reply to Giulio Fidente from comment #4)
> thanks for reporting this bug! Can you see if by re-enabling and re-starting
> the systemd units the OSDs get back up in working state?

Yes, restarting the systemd units recovered the OSDs.

Also, the manual work-around of removing the ceph-osd package (rpm -e --noscripts ceph-osd) prior to the upgrade worked successfully in the production environment.
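For anyone else hitting this, a dry-run sketch of the recovery sequence: re-enable and restart the units the ceph-osd %preun scriptlet disabled and stopped. RUN=echo only prints each command; set RUN to empty to actually execute them on the storage node. The "vdc" instance name is taken from the logs in this report and is per OSD data device, so adjust it to the devices present on your node.

```shell
#!/bin/sh
# Dry-run sketch of OSD recovery after the ceph-osd %preun scriptlet ran.
# RUN=echo prints the commands instead of running them; set RUN= (empty)
# on a real storage node.
RUN=echo

# Re-enable the target the scriptlet disabled with --no-reload.
$RUN systemctl enable ceph-osd.target

# Instance units are named per OSD data device (e.g. ceph-osd@vdc.service,
# matching the ceph-osd-ceph-host-vdc container in the logs above); start
# one per device, then the target.
$RUN systemctl start ceph-osd@vdc.service
$RUN systemctl start ceph-osd.target
```

Afterwards, confirm the cluster is healthy again from a monitor node (e.g. with "ceph -s") before proceeding with the upgrade.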
verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3587