Bug 1631848
Summary: | OSP 12->13 upgrade - ceph OSD containers are killed during removal of ceph-osd rpm | | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Matt Flusche <mflusche> |
Component: | openstack-tripleo-common | Assignee: | Giulio Fidente <gfidente> |
Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | | |
Version: | 13.0 (Queens) | CC: | gabrioux, gfidente, johfulto, lmarsh, mburns, mcornea, slinaber, yprokule |
Target Milestone: | z3 | Keywords: | Triaged, ZStream |
Target Release: | 13.0 (Queens) | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | openstack-tripleo-common-8.6.3-16.el7ost | Doc Type: | Bug Fix |
Doc Text: | When upgrading from Red Hat OpenStack Platform 12 to 13, the ceph-osd package was removed. Removing the package stopped the running OSDs even though they were running in containers and no longer required the package. This release removes the playbook step that removed the package during the upgrade, so Ceph OSDs are no longer unintentionally stopped during the upgrade. | | |
Story Points: | --- | | |
Clone Of: | | Environment: | |
Last Closed: | 2018-11-13 22:28:50 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Matt Flusche
2018-09-21 17:32:08 UTC
I've suggested manually removing the ceph-osd package prior to the upgrade as a work-around:

rpm -e --noscripts ceph-osd

Giulio Fidente:

(In reply to Matt Flusche from comment #3)
> I've suggested manually removing the ceph-osd package prior to the upgrade
> as a work-around.
>
> rpm -e --noscripts ceph-osd

Hi Matt, thanks for reporting this bug! Can you check whether re-enabling and restarting the systemd units brings the OSDs back into a working state?

I don't think we can remove the package before upgrading, because during FFU the OSDs are not migrated into containers until the ceph-ansible run has finished, so removing the locally installed package would hit the running cluster anyway.

I believe we can start by changing the ceph-ansible workflow so that it does not remove the package from the OSD nodes at all, and then see where and how this is best resolved; it looks like something to address in either ceph-ansible or the ceph RPMs' uninstall scripts.

Matt Flusche:

(In reply to Giulio Fidente from comment #4)
> Thanks for reporting this bug! Can you check whether re-enabling and
> restarting the systemd units brings the OSDs back into a working state?

Yes, restarting the systemd units recovered the OSDs. Also, the manual work-around of removing the ceph-osd package (rpm -e --noscripts ceph-osd) prior to the upgrade worked successfully in the production environment.

Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3587
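For reference, a minimal sketch of the pre-upgrade work-around from comment #3 as it might be run on each OSD node. It assumes a containerized Ceph deployment managed by docker; the container-name filter is illustrative and may differ in your environment.

```bash
# Confirm the OSDs on this node are containerized before touching the package
# (container names are assumed to contain "ceph-osd"; adjust the filter if needed).
docker ps --format '{{.Names}}' | grep ceph-osd

# Remove the now-unneeded ceph-osd RPM without running its uninstall scriptlets,
# which would otherwise stop the running OSDs.
rpm -e --noscripts ceph-osd

# Verify the OSD containers are still running afterwards.
docker ps --format '{{.Names}}' | grep ceph-osd
```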
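If the OSDs have already been stopped by the package removal, a rough recovery sketch along the lines of comment #4 and the follow-up, assuming the OSDs are driven by ceph-osd@<id> systemd template units (the instance name "0" below is a placeholder; substitute the ids used on your OSD nodes):

```bash
# List the OSD units on this node and their current state.
systemctl list-units 'ceph-osd@*' --all

# Re-enable and restart each unit that the RPM scriptlets disabled/stopped;
# repeat for every OSD instance on the node.
systemctl enable ceph-osd@0
systemctl restart ceph-osd@0

# From a monitor node, confirm the OSDs rejoined the cluster.
ceph osd tree
```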
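To see why a plain package removal kills the OSDs (and what --noscripts skips), the RPM's uninstall scriptlets can be inspected; the exact contents vary between ceph builds.

```bash
# Print the package scriptlets; the preun/postun sections are what stop and
# disable the ceph-osd systemd units when the package is removed normally.
rpm -q --scripts ceph-osd
```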