
Bug 1624341

Summary: All OSDs down after OSP FFU
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Container
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: Gregory Charot <gcharot>
Assignee: Erwan Velu <evelu>
QA Contact: Vasishta <vashastr>
Docs Contact:
CC: ceph-eng-bugs, evelu, gabrioux, gfidente, hnallurv, mbracho, shan
Target Milestone: rc
Target Release: 3.2
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-09 12:22:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1578730
Attachments:
  osd-logs (flags: none)
  ceph-ansible-logs (flags: none)
  THT (flags: none)

Description Gregory Charot 2018-08-31 09:23:43 UTC
Description of problem:

When doing an OSP10 to OSP13 fast forward upgrade (FFU), after the Ceph upgrade step

openstack overcloud upgrade run --roles CephStorage

all OSDs are down, yet ceph-ansible does not report any error and the "ceph-upgrade" run terminates successfully. Starting an OSD manually with systemctl start ceph-osd works.
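
For reference, a minimal sketch of that manual workaround on an affected OSD node; the 'ceph-osd*' unit-name pattern is an assumption, so verify how the OSD units are actually named on the node:

# Hypothetical workaround: start every ceph-osd systemd unit on the node.
# Unit naming is assumed; check with: systemctl list-units 'ceph-osd*' --all
for unit in $(systemctl list-units 'ceph-osd*' --all --no-legend | awk '{print $1}'); do
    systemctl start "$unit"
done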

Version-Release number of selected component (if applicable):

13

How reproducible:

On a slow environment (virtualised for training purposes), this happens about 60% of the time.

Steps to Reproduce:
1. Deploy OSP10 at the latest minor release
2. Start the FFU process
3. Run: openstack overcloud upgrade run --roles CephStorage
4. On the OSD nodes, check docker ps, ceph -s, and ceph osd tree

Actual results:

All OSDs down
Mons up

Expected results:

All OSDs up

Additional info:

OSD logs show:
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: 2018-08-30 12:26:17  /entrypoint.sh: Unmounting /dev/vdb1
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: umount: /var/lib/ceph/osd/ceph-2: target is busy.
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: (In some cases useful info about processes that use
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: the device is found by lsof(8) or fuser(1))

It looks as if a TERM signal is sent to the OSD process and /var/lib/ceph/osd/ceph-X is then unmounted before the process has finished exiting, so the process is still holding the mount point. This appears to be a race condition.
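
For illustration, a minimal sketch of the kind of guard that would close this race; the variable names and the 30-second timeout are assumptions, not the actual entrypoint.sh code:

# Hypothetical guard: wait until nothing holds the mount point before
# unmounting, instead of unmounting right after signalling the OSD.
OSD_MOUNT=/var/lib/ceph/osd/ceph-2   # example path taken from the log above

kill -TERM "$OSD_PID"                # $OSD_PID assumed to be the OSD's pid

# Poll for up to 30 seconds until the mount point is no longer in use.
for i in $(seq 1 30); do
    fuser -m "$OSD_MOUNT" >/dev/null 2>&1 || break
    sleep 1
done

umount "$OSD_MOUNT"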

Comment 3 Gregory Charot 2018-08-31 09:26:02 UTC
Created attachment 1480063 [details]
osd-logs

osd-logs - error at the end of the file

Comment 4 Gregory Charot 2018-08-31 09:26:50 UTC
Created attachment 1480065 [details]
ceph-ansible-logs

ceph-ansible logs from mistral

Comment 5 Gregory Charot 2018-08-31 09:27:31 UTC
Created attachment 1480066 [details]
THT

Comment 6 Harish NV Rao 2018-09-05 07:19:52 UTC
@Gregory, if you want this bug to be in the 3.1 release notes, please add 1584264 to the Blocks field. Currently this bug is not targeted at 3.1 (GA Sep 12th).

Comment 8 Erwan Velu 2018-09-05 15:18:01 UTC
I investigated this issue and found some improvements we can make to avoid this situation.

https://github.com/ceph/ceph-container/pull/1179

Comment 9 Erwan Velu 2018-09-06 14:08:15 UTC
I don't know what the default grace time is in the product, but I'd suggest at least 30 seconds, to avoid Docker sending a SIGKILL too soon.
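
For illustration, one way the stop grace period could be raised; the container name and the ExecStop snippet below are assumptions for the sake of the example, not taken from this bug:

# Give the container 30 seconds between SIGTERM and SIGKILL when stopping it:
docker stop --time 30 ceph-osd-lab-ceph01-vdb   # container name is hypothetical

# Or bake the same grace period into the systemd unit wrapping the container:
# ExecStop=/usr/bin/docker stop --time 30 ceph-osd-%i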

Comment 10 Erwan Velu 2018-10-15 15:08:43 UTC
Please, can anyone check whether the default grace time can be increased as well? Improving our code is fine, but it would be safer to also increase the grace time.

Comment 12 Giulio Fidente 2019-01-09 12:22:52 UTC
The latest available version is ceph-ansible-3.2.0-1.el7cp from
http://access.redhat.com/errata/RHBA-2019:0020