Description of problem:

During the upgrade from OSP12 -> 13, the ceph-osd package is removed from the storage node, which kills the running OSD containers. Specifically, the preuninstall scriptlet in the ceph-osd RPM calls "systemctl stop ceph-disk@\*.service ceph-osd@\*.service ceph-osd.target" during the uninstall, which kills the containers.

In this environment the OSD nodes originally have the ceph-osd-10.2.10-28.el7cp RPM.

# rpm -qp --scripts ceph-osd-10.2.10-28.el7cp.x86_64.rpm
[...]
preuninstall scriptlet (using /bin/sh):
if [ $1 -eq 0 ] ; then
    # Package removal, not upgrade
    systemctl --no-reload disable ceph-disk@\*.service ceph-osd@\*.service ceph-osd.target > /dev/null 2>&1 || :
    systemctl stop ceph-disk@\*.service ceph-osd@\*.service ceph-osd.target > /dev/null 2>&1 || :
fi
[...]

From /var/log/messages:

Sep 20 16:11:24 ceph-host systemd: Starting Session 668 of user tripleo-admin.
Sep 20 16:11:25 ceph-host ansible-yum: Invoked with allow_downgrade=False name=['ceph-osd'] list=None install_repoquery=True conf_file=None disable_gpg_check=False state=absent disablerepo=None update_cache=False enablerepo=None exclude=None security=False validate_certs=True installroot=/ skip_broken=False
Sep 20 16:11:27 ceph-host systemd: Stopped target ceph target allowing to start/stop all ceph-osd@.service instances at once.
Sep 20 16:11:27 ceph-host systemd: Stopping ceph target allowing to start/stop all ceph-osd@.service instances at once.
Sep 20 16:11:27 ceph-host systemd: Stopping Ceph OSD...
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: Sending SIGTERM to PID 963224
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: sigterm_cleanup_post
Sep 20 16:11:27 ceph-host journal: Sending SIGTERM to PID 963224
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: 2018-09-20 16:11:27.059566 7f33c78a6700 -1 received signal: Terminated from PID: 963012 task name: /bin/bash /entrypoint.sh UID: 0
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: 2018-09-20 16:11:27.059592 7f33c78a6700 -1 osd.3 168 *** Got signal Terminated ***
Sep 20 16:11:27 ceph-host journal: sigterm_cleanup_post
Sep 20 16:11:27 ceph-host journal: 2018-09-20 16:11:27.059566 7f33c78a6700 -1 received signal: Terminated from PID: 963012 task name: /bin/bash /entrypoint.sh UID: 0
Sep 20 16:11:27 ceph-host journal: 2018-09-20 16:11:27.059592 7f33c78a6700 -1 osd.3 168 *** Got signal Terminated ***
Sep 20 16:11:27 ceph-host journal: 2018-09-20 16:11:27 /entrypoint.sh: Unmounting /dev/vdc1
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: 2018-09-20 16:11:27 /entrypoint.sh: Unmounting /dev/vdc1
Sep 20 16:11:27 ceph-host journal: umount: /var/lib/ceph/osd/ceph-3: target is busy.
Sep 20 16:11:27 ceph-host journal: (In some cases useful info about processes that use
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: umount: /var/lib/ceph/osd/ceph-3: target is busy.
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: (In some cases useful info about processes that use
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: the device is found by lsof(8) or fuser(1))
Sep 20 16:11:27 ceph-host journal: the device is found by lsof(8) or fuser(1))
Sep 20 16:11:27 ceph-host journal: 2018-09-20 16:11:27 /entrypoint.sh: Failed to umount /dev/vdc1
Sep 20 16:11:27 ceph-host ceph-osd-run.sh: 2018-09-20 16:11:27 /entrypoint.sh: Failed to umount /dev/vdc1
Sep 20 16:11:27 ceph-host kernel: XFS (vdc1): Unmounting Filesystem
Sep 20 16:11:27 ceph-host dockerd-current: time="2018-09-20T16:11:27.183225874Z" level=error msg="containerd: deleting container" error="exit status 1: \"container 282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b does not exist\\none or more of the container deletions failed\\n\""
Sep 20 16:11:27 ceph-host dockerd-current: time="2018-09-20T16:11:27.202889350Z" level=warning msg="282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b cleanup: failed to unmount secrets: invalid argument"
Sep 20 16:11:27 ceph-host docker: ceph-osd-ceph-host-vdc
Sep 20 16:11:27 ceph-host dockerd-current: time="2018-09-20T16:11:27.212865010Z" level=error msg="Handler for POST /v1.26/containers/282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b/kill?signal=TERM returned error: Cannot kill container 282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b: No such container: 282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b"
Sep 20 16:11:27 ceph-host dockerd-current: time="2018-09-20T16:11:27.213614631Z" level=error msg="Handler for POST /v1.26/containers/282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b/kill returned error: Cannot kill container 282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b: No such container: 282224798fd53b4c43b24f3dcd31001c72ee217c7b92371ee769ecf938d1ac4b"
Sep 20 16:11:27 ceph-host systemd: Stopped Ceph OSD.
Sep 20 16:11:27 ceph-host yum[963532]: Erased: 2:ceph-osd-10.2.10-28.el7cp.x86_64
Sep 20 16:11:27 ceph-host systemd: Reloading.

Version-Release number of selected component (if applicable):
Current OSP12 to 13 components.
ceph-osd-10.2.10-28.el7cp on the OSD node.

How reproducible:
unknown

Steps to Reproduce:
1. OSP 12 -> 13 upgrade with ceph-osd still installed on the base OSD nodes.

This can also be reproduced on an OSP 13 storage node with running OSD containers:

systemctl stop ceph-disk@\*.service ceph-osd@\*.service ceph-osd.target

Actual results:
OSD containers are killed, leaving ceph in an unhealthy state.

Expected results:
Clean upgrade.

Additional info:
I'll attach additional logs.
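For context, the behavior hinges on the `$1` argument rpm passes to the %preun scriptlet: it is the number of package instances that will remain after the transaction, so 0 means full removal and 1 (or more) means upgrade. A minimal sketch of that branch logic follows; the echo strings are placeholders standing in for the real scriptlet's systemctl calls.

```shell
#!/bin/sh
# Mimic the ceph-osd %preun scriptlet's branch. rpm invokes %preun with the
# count of package instances remaining after the transaction as $1.
preun() {
  if [ "$1" -eq 0 ] ; then
    # Package removal, not upgrade: this is where the real scriptlet runs
    # "systemctl stop ceph-disk@\*.service ceph-osd@\*.service ceph-osd.target",
    # which also kills the containerized OSDs on a migrated node.
    echo "stop ceph-osd units"
  else
    echo "no-op (upgrade)"
  fi
}

preun 0   # plain "yum remove" / "rpm -e" path, as triggered by the upgrade playbook
preun 1   # in-place RPM upgrade path: units are left alone
```

This is why `rpm -e --noscripts ceph-osd` avoids the problem: it skips the scriptlet entirely, so the removal branch never runs.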
I've suggested manually removing the ceph-osd package prior to the upgrade as a work-around:

rpm -e --noscripts ceph-osd
(In reply to Matt Flusche from comment #3)
> I've suggested manually removing the ceph-osd package prior to the upgrade
> as a work-around.
>
> rpm -e --noscripts ceph-osd

hi Matt,

thanks for reporting this bug! Can you check whether re-enabling and restarting the systemd units brings the OSDs back up into a working state?

I don't think we can remove the package before upgrading: during FFU the OSDs are not migrated into containers until the ceph-ansible run finishes, so removing the locally installed package would hit the running cluster anyway.

I believe we can start by changing the ceph-ansible workflow so that it does not remove the package from the OSD nodes at all, then work out where/how best to resolve this; it looks like something to be addressed in either ceph-ansible or the ceph RPM uninstall scripts.
(In reply to Giulio Fidente from comment #4)
> thanks for reporting this bug! Can you see if by re-enabling and re-starting
> the systemd units the OSDs get back up in working state?

Yes, restarting the systemd units recovered the OSDs.

Also, the manual work-around of removing the ceph-osd package (rpm -e --noscripts ceph-osd) prior to the upgrade worked successfully in the production environment.
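For anyone else hitting this, a dry-run sketch of the recovery sequence: re-enable and restart the units the ceph-osd %preun scriptlet disabled and stopped. RUN=echo only prints each command; set RUN to empty to actually execute them on the storage node. The "vdc" instance name is taken from the logs in this report and is per OSD data device, so adjust it to the devices present on your node.

```shell
#!/bin/sh
# Dry-run sketch of OSD recovery after the ceph-osd %preun scriptlet ran.
# RUN=echo prints the commands instead of running them; set RUN= (empty)
# on a real storage node.
RUN=echo

# Re-enable the target the scriptlet disabled with --no-reload.
$RUN systemctl enable ceph-osd.target

# Instance units are named per OSD data device (e.g. ceph-osd@vdc.service,
# matching the ceph-osd-ceph-host-vdc container in the logs above); start
# one per device, then the target.
$RUN systemctl start ceph-osd@vdc.service
$RUN systemctl start ceph-osd.target
```

Afterwards, confirm the cluster is healthy again from a monitor node (e.g. with "ceph -s") before proceeding with the upgrade.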
verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3587