Description of problem:
When doing an OSP10 to 13 FFU, after the Ceph upgrade step

  openstack overcloud upgrade run --roles CephStorage

all OSDs are down, yet ceph-ansible does not complain and the "ceph-upgrade" run terminates successfully. Starting an OSD manually with systemctl start ceph-osd works.

Version-Release number of selected component (if applicable):
13

How reproducible:
On a slow environment (virtualised for training purposes) it happens about 60% of the time.

Steps to Reproduce:
1. Deploy OSP10 at the latest minor release
2. Start the FFU process
3. openstack overcloud upgrade run --roles CephStorage
4. Check with docker ps on the OSD nodes / ceph -s / ceph osd tree

Actual results:
All OSDs down, Mons up

Expected results:
All OSDs up

Additional info:
The OSD logs show:

Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: 2018-08-30 12:26:17  /entrypoint.sh: Unmounting /dev/vdb1
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]: umount: /var/lib/ceph/osd/ceph-2: target is busy.
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]:         (In some cases useful info about processes that use
Aug 30 12:26:17 lab-ceph01 ceph-osd-run.sh[66710]:          the device is found by lsof(8) or fuser(1))

It looks as if a TERM signal is sent to the ceph-osd process and the entrypoint then tries to umount /var/lib/ceph/osd/ceph-X while the process has not yet exited and is still using the mount point. This appears to be some kind of race condition.
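For reference, a quick way to confirm the failure and recover an affected node (the unit instance below is an example from this lab; containerized deployments may name the instance after the device, e.g. ceph-osd@vdb, rather than after the OSD id):

  # on an OSD node: no ceph-osd containers running, OSDs marked down
  docker ps
  ceph -s
  ceph osd tree

  # restart the OSD unit by hand; it comes up fine
  systemctl start ceph-osd@vdb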
Created attachment 1480063 [details]
osd-logs

osd-logs - the error is at the end of the file
Created attachment 1480065 [details]
ceph-ansible-logs

ceph-ansible logs from Mistral
Created attachment 1480066 [details] THT
@Gregory, if you want this bug to be in the 3.1 release notes, please add 1584264 to the Blocks field. Currently this bug is not targeted at 3.1 (GA Sep 12th).
I investigated this issue and found some improvements we can make to avoid this situation: https://github.com/ceph/ceph-container/pull/1179
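To illustrate the kind of improvement (this is a sketch, not the literal code from that PR): instead of unmounting immediately after signalling the OSD, wait for the ceph-osd process to exit and retry the umount rather than failing on the first "target is busy":

  # sketch only -- the path and retry counts are illustrative
  osd_dir=/var/lib/ceph/osd/ceph-2
  # wait for the OSD process to release the mount point
  while pgrep -f "ceph-osd" > /dev/null; do
      sleep 1
  done
  # retry the unmount a few times instead of giving up on the first EBUSY
  for i in 1 2 3 4 5; do
      umount "$osd_dir" && break
      sleep 3
  done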
That said, I don't know what the default grace time is in the product, but I'd suggest at least 30 seconds to avoid Docker sending a SIGKILL too soon.
Can anyone also check whether the default grace time can be increased? Improving our code is fine, but it would be safer to increase the grace time as well.
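For example (the container name and unit file path below are assumptions, not the product defaults), the grace time could be raised either on the docker side or on the systemd unit that wraps ceph-osd-run.sh:

  # give docker 30s before it escalates SIGTERM to SIGKILL
  docker stop --time 30 <ceph_osd_container>

  # or raise the stop timeout of the systemd unit, e.g. via a drop-in:
  # /etc/systemd/system/ceph-osd@.service.d/override.conf
  [Service]
  TimeoutStopSec=90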
The latest available version is ceph-ansible-3.2.0-1.el7cp, from http://access.redhat.com/errata/RHBA-2019:0020