Bug 1624341

Summary: All OSDs down after OSP FFU
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Container
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED CURRENTRELEASE
Reporter: Gregory Charot <gcharot>
Assignee: Erwan Velu <evelu>
QA Contact: Vasishta <vashastr>
CC: ceph-eng-bugs, evelu, gabrioux, gfidente, hnallurv, mbracho, shan
Target Milestone: rc
Target Release: 3.2
Type: Bug
Bug Blocks: 1578730
Last Closed: 2019-01-09 12:22:52 UTC

Attachments: osd-logs, ceph-ansible-logs, THT

Description Gregory Charot 2018-08-31 09:23:43 UTC
Description of problem:

When doing an OSP10 to OSP13 FFU, after the Ceph upgrade step

openstack overcloud upgrade run --roles CephStorage

all OSDs are down, yet ceph-ansible does not complain and the "ceph-upgrade" run terminates successfully. Starting the OSDs manually with systemctl start ceph-osd works.
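
For reference, the manual recovery on each OSD node looks roughly like the sketch below. The unit and container names are assumptions based on the ceph-ansible systemd conventions (ceph-osd@<device>, ceph-mon-<short hostname>); adjust them to whatever systemctl list-units shows on the node.

# Check which ceph-osd units are not running
systemctl list-units 'ceph-osd*' --all

# Start each stopped unit; the instance name is the OSD data device on
# that node (hypothetical example below), repeat for every OSD device
systemctl start ceph-osd@vdb

# Verify the OSDs rejoined the cluster (mon container name assumed to
# follow the ceph-ansible ceph-mon-<short hostname> convention)
docker exec "ceph-mon-$(hostname -s)" ceph osd tree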

Version-Release number of selected component (if applicable):

13

How reproducible:

On a slow environment (virtualised for training purposes), it happens about 60% of the time.

Steps to Reproduce:
1. Deploy OSP10 at the latest minor release
2. Start the FFU process
3. Run openstack overcloud upgrade run --roles CephStorage
4. Check docker ps on the OSD nodes, plus ceph -s and ceph osd tree

Actual results:

All OSDs down
Mons up

Expected results:

All OSDs up

Additional info:

OSD logs show:
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: 2018-08-30 12:26:17  /entrypoint.sh: Unmounting /dev/vdb1
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: umount: /var/lib/ceph/osd/ceph-2: target is busy.
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: (In some cases useful info about processes that use
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: the device is found by lsof(8) or fuser(1))

It looks as if a TERM signal is sent to the ceph-osd process and /var/lib/ceph/osd/ceph-X is unmounted before the process has finished, so it is still using the mount point. This appears to be some kind of race condition.
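
To illustrate the missing guard, the entrypoint would need to wait for the mount point to be released before unmounting, along the lines of the sketch below (hypothetical code, not the actual entrypoint.sh; the path and timeout are examples only):

# Wait until no process still holds the OSD data dir, then unmount it.
OSD_PATH=/var/lib/ceph/osd/ceph-2   # example path from the log above
TIMEOUT=30

elapsed=0
# fuser -m returns 0 while some process is still using the mount point
while fuser -m "${OSD_PATH}" > /dev/null 2>&1; do
    if [ "${elapsed}" -ge "${TIMEOUT}" ]; then
        echo "ceph-osd still holds ${OSD_PATH} after ${TIMEOUT}s, not unmounting"
        exit 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
done

umount "${OSD_PATH}"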

Comment 3 Gregory Charot 2018-08-31 09:26:02 UTC
Created attachment 1480063 [details]
osd-logs

osd-logs - error at the end of the file

Comment 4 Gregory Charot 2018-08-31 09:26:50 UTC
Created attachment 1480065 [details]
ceph-ansible-logs

ceph-ansible logs from Mistral

Comment 5 Gregory Charot 2018-08-31 09:27:31 UTC
Created attachment 1480066 [details]
THT

Comment 6 Harish NV Rao 2018-09-05 07:19:52 UTC
@Gregory, if you want this bug to be in the 3.1 release notes, please add 1584264 to the Blocks field. Currently this bug is not targeted at 3.1 (GA Sep 12th).

Comment 8 Erwan Velu 2018-09-05 15:18:01 UTC
I investigated this issue and found some improvements we can make to avoid this situation:

https://github.com/ceph/ceph-container/pull/1179

Comment 9 Erwan Velu 2018-09-06 14:08:15 UTC
I don't know what the default grace time is in the product, but I'd suggest at least 30 seconds to avoid docker sending a SIGKILL too soon.
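
For example, on the OSD nodes the grace period could be raised with a systemd drop-in along these lines (a sketch only; the ceph-osd@.service unit and ceph-osd-%i container names are assumptions based on the ceph-ansible template and should be checked against the deployed unit files):

# Hypothetical drop-in raising the stop grace period to 30s
mkdir -p /etc/systemd/system/ceph-osd@.service.d
cat > /etc/systemd/system/ceph-osd@.service.d/gracetime.conf << 'EOF'
[Service]
ExecStop=
ExecStop=/usr/bin/docker stop --time 30 ceph-osd-%i
EOF

# Reload systemd so the drop-in takes effect
systemctl daemon-reload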

Comment 10 Erwan Velu 2018-10-15 15:08:43 UTC
Please, can anyone check whether the default grace time can be increased too? Improving our code is fine, but it would be safer to increase the grace time as well.

Comment 12 Giulio Fidente 2019-01-09 12:22:52 UTC
The latest available version is ceph-ansible-3.2.0-1.el7cp, from
http://access.redhat.com/errata/RHBA-2019:0020