Bug 1624341

Summary: All OSDs down after OSP FFU
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Container
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED CURRENTRELEASE
Reporter: Gregory Charot <gcharot>
Assignee: Erwan Velu <evelu>
QA Contact: Vasishta <vashastr>
CC: ceph-eng-bugs, evelu, gabrioux, gfidente, hnallurv, mbracho, shan
Target Milestone: rc
Target Release: 3.2
Type: Bug
Bug Blocks: 1578730
Last Closed: 2019-01-09 12:22:52 UTC

Attachments: osd-logs, ceph-ansible-logs, THT

Description Gregory Charot 2018-08-31 09:23:43 UTC
Description of problem:

When doing an OSP10 to OSP13 FFU, after the Ceph upgrade step

openstack overcloud upgrade run --roles CephStorage

all OSDs are down, yet ceph-ansible does not complain and the "ceph-upgrade" run terminates successfully. Starting the OSDs manually with systemctl start ceph-osd works.
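
For reference, the manual recovery on each OSD node looks roughly like the sketch below. The unit and container names are assumptions based on the ceph-ansible systemd conventions (ceph-osd@<device>, ceph-mon-<short hostname>); adjust them to whatever systemctl list-units shows on the node.

# Check which ceph-osd units are not running
systemctl list-units 'ceph-osd*' --all

# Start each stopped unit; the instance name is the OSD data device on
# that node (hypothetical example below), repeat for every OSD device
systemctl start ceph-osd@vdb

# Verify the OSDs rejoined the cluster (mon container name assumed to
# follow the ceph-ansible ceph-mon-<short hostname> convention)
docker exec "ceph-mon-$(hostname -s)" ceph osd tree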

Version-Release number of selected component (if applicable):

13

How reproducible:

On a slow environment (virtualised for training purposes), it happens about 60% of the time.

Steps to Reproduce:
1. Deploy OSP10 at the latest minor release
2. Start the FFU process
3. Run openstack overcloud upgrade run --roles CephStorage
4. Check docker ps on the OSD nodes, plus ceph -s and ceph osd tree

Actual results:

All OSDs down
Mons up

Expected results:

All OSDs up

Additional info:

OSD logs show:
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: 2018-08-30 12:26:17  /entrypoint.sh: Unmounting /dev/vdb1
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: umount: /var/lib/ceph/osd/ceph-2: target is busy.
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: (In some cases useful info about processes that use
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: the device is found by lsof(8) or fuser(1))

It looks as if a TERM signal is sent to the ceph-osd process and /var/lib/ceph/osd/ceph-X is unmounted before the process has finished, so it is still using the mount point. This appears to be some kind of race condition.
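
To illustrate the missing guard, the entrypoint would need to wait for the mount point to be released before unmounting, along the lines of the sketch below (hypothetical code, not the actual entrypoint.sh; the path and timeout are examples only):

# Wait until no process still holds the OSD data dir, then unmount it.
OSD_PATH=/var/lib/ceph/osd/ceph-2   # example path from the log above
TIMEOUT=30

elapsed=0
# fuser -m returns 0 while some process is still using the mount point
while fuser -m "${OSD_PATH}" > /dev/null 2>&1; do
    if [ "${elapsed}" -ge "${TIMEOUT}" ]; then
        echo "ceph-osd still holds ${OSD_PATH} after ${TIMEOUT}s, not unmounting"
        exit 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
done

umount "${OSD_PATH}"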

Comment 3 Gregory Charot 2018-08-31 09:26:02 UTC
Created attachment 1480063 [details]
osd-logs

osd-logs - error at the end of the file

Comment 4 Gregory Charot 2018-08-31 09:26:50 UTC
Created attachment 1480065 [details]
ceph-ansible-logs

ceph-ansible logs from Mistral

Comment 5 Gregory Charot 2018-08-31 09:27:31 UTC
Created attachment 1480066 [details]
THT

Comment 6 Harish NV Rao 2018-09-05 07:19:52 UTC
@Gregory, if you want this bug to be in the 3.1 release notes, please add 1584264 to the Blocks field. Currently this bug is not targeted at 3.1 (GA Sep 12th).

Comment 8 Erwan Velu 2018-09-05 15:18:01 UTC
I investigated this issue and found some improvements we can make to avoid this situation:

https://github.com/ceph/ceph-container/pull/1179

Comment 9 Erwan Velu 2018-09-06 14:08:15 UTC
I don't know what the default grace time is in the product, but I'd suggest at least 30 seconds to avoid docker sending a SIGKILL too soon.
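
For example, on the OSD nodes the grace period could be raised with a systemd drop-in along these lines (a sketch only; the ceph-osd@.service unit and ceph-osd-%i container names are assumptions based on the ceph-ansible template and should be checked against the deployed unit files):

# Hypothetical drop-in raising the stop grace period to 30s
mkdir -p /etc/systemd/system/ceph-osd@.service.d
cat > /etc/systemd/system/ceph-osd@.service.d/gracetime.conf << 'EOF'
[Service]
ExecStop=
ExecStop=/usr/bin/docker stop --time 30 ceph-osd-%i
EOF

# Reload systemd so the drop-in takes effect
systemctl daemon-reload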

Comment 10 Erwan Velu 2018-10-15 15:08:43 UTC
Please, can anyone check whether the default grace time can be increased too? Improving our code is fine, but it would be safer to increase the grace time as well.

Comment 12 Giulio Fidente 2019-01-09 12:22:52 UTC
The latest available version is ceph-ansible-3.2.0-1.el7cp, from
http://access.redhat.com/errata/RHBA-2019:0020