
Bug 1624341

Summary: All OSDs down after OSP FFU
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Container
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: Gregory Charot <gcharot>
Assignee: Erwan Velu <evelu>
QA Contact: Vasishta <vashastr>
Docs Contact:
CC: ceph-eng-bugs, evelu, gabrioux, gfidente, hnallurv, mbracho, shan
Target Milestone: rc
Target Release: 3.2
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-09 12:22:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1578730
Attachments:
  osd-logs (flags: none)
  ceph-ansible-logs (flags: none)
  THT (flags: none)

Description Gregory Charot 2018-08-31 09:23:43 UTC
Description of problem:

When doing an OSP10 to OSP13 fast forward upgrade (FFU), after the Ceph upgrade step

openstack overcloud upgrade run --roles CephStorage

all OSDs are down, yet ceph-ansible does not report any error and the "ceph-upgrade" run terminates successfully. Starting an OSD manually with systemctl start ceph-osd works.
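
For reference, a minimal sketch of that manual workaround on an affected OSD node; the 'ceph-osd*' unit-name pattern is an assumption, so verify how the OSD units are actually named on the node:

# Hypothetical workaround: start every ceph-osd systemd unit on the node.
# Unit naming is assumed; check with: systemctl list-units 'ceph-osd*' --all
for unit in $(systemctl list-units 'ceph-osd*' --all --no-legend | awk '{print $1}'); do
    systemctl start "$unit"
done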

Version-Release number of selected component (if applicable):

13

How reproducible:

On a slow environment (virtualised for training purposes), this happens about 60% of the time.

Steps to Reproduce:
1. Deploy OSP10 at the latest minor release
2. Start the FFU process
3. Run: openstack overcloud upgrade run --roles CephStorage
4. On the OSD nodes, check docker ps, ceph -s, and ceph osd tree

Actual results:

All OSDs down
Mons up

Expected results:

All OSDs up

Additional info:

OSD logs show:
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: 2018-08-30 12:26:17  /entrypoint.sh: Unmounting /dev/vdb1
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: umount: /var/lib/ceph/osd/ceph-2: target is busy.
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: (In some cases useful info about processes that use
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: the device is found by lsof(8) or fuser(1))

It looks as if a TERM signal is sent to the OSD process and /var/lib/ceph/osd/ceph-X is then unmounted before the process has finished exiting, so the process is still holding the mount point. This appears to be a race condition.
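
For illustration, a minimal sketch of the kind of guard that would close this race; the variable names and the 30-second timeout are assumptions, not the actual entrypoint.sh code:

# Hypothetical guard: wait until nothing holds the mount point before
# unmounting, instead of unmounting right after signalling the OSD.
OSD_MOUNT=/var/lib/ceph/osd/ceph-2   # example path taken from the log above

kill -TERM "$OSD_PID"                # $OSD_PID assumed to be the OSD's pid

# Poll for up to 30 seconds until the mount point is no longer in use.
for i in $(seq 1 30); do
    fuser -m "$OSD_MOUNT" >/dev/null 2>&1 || break
    sleep 1
done

umount "$OSD_MOUNT"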

Comment 3 Gregory Charot 2018-08-31 09:26:02 UTC
Created attachment 1480063 [details]
osd-logs

osd-logs - error at the end of the file

Comment 4 Gregory Charot 2018-08-31 09:26:50 UTC
Created attachment 1480065 [details]
ceph-ansible-logs

ceph-ansible logs from mistral

Comment 5 Gregory Charot 2018-08-31 09:27:31 UTC
Created attachment 1480066 [details]
THT

Comment 6 Harish NV Rao 2018-09-05 07:19:52 UTC
@Gregory, if you want this bug to be in the 3.1 release notes, please add 1584264 to the Blocks field. Currently this bug is not targeted at 3.1 (GA Sep 12th).

Comment 8 Erwan Velu 2018-09-05 15:18:01 UTC
I investigated this issue and found some improvements we can make to avoid this situation.

https://github.com/ceph/ceph-container/pull/1179

Comment 9 Erwan Velu 2018-09-06 14:08:15 UTC
I don't know what the default grace time is in the product, but I'd suggest at least 30 seconds, to avoid Docker sending a SIGKILL too soon.
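
For illustration, one way the stop grace period could be raised; the container name and the ExecStop snippet below are assumptions for the sake of the example, not taken from this bug:

# Give the container 30 seconds between SIGTERM and SIGKILL when stopping it:
docker stop --time 30 ceph-osd-lab-ceph01-vdb   # container name is hypothetical

# Or bake the same grace period into the systemd unit wrapping the container:
# ExecStop=/usr/bin/docker stop --time 30 ceph-osd-%i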

Comment 10 Erwan Velu 2018-10-15 15:08:43 UTC
Please, can anyone check whether the default grace time can be increased as well? Improving our code is fine, but it would be safer to also increase the grace time.

Comment 12 Giulio Fidente 2019-01-09 12:22:52 UTC
The latest available version is ceph-ansible-3.2.0-1.el7cp from
http://access.redhat.com/errata/RHBA-2019:0020