Bug 1507770

Summary: 2.4 Container: OSD services are restarting continuously
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Container
Version: 2.4
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Sidhant Agrawal <sagrawal>
Assignee: Sébastien Han <shan>
QA Contact: Vasishta <vashastr>
CC: agunn, anharris, dang, flucifre, gmeno, hchen, hnallurv, jim.curtis, kdreyer, pbyregow, pprakash, prsurve, sagrawal, shan, vashastr
Target Milestone: rc
Target Release: 2.5
Hardware: Unspecified
OS: Unspecified
Fixed In Version: rhceph:ceph-2-rhel-7-docker-candidate-65237-20180109194512
Doc Type: No Doc Update
Type: Bug
Last Closed: 2018-02-21 20:38:32 UTC
Bug Depends On: 1498183
Attachments:
Description: File contains journald logs of an OSD service; Flags: none

Description Sidhant Agrawal 2017-10-31 05:59:15 UTC
Description of problem:
After upgrading from container 2.4 to 2.4 Async, OSD services are restarting.


Version-Release number of selected component (if applicable):
ceph-ansible-2.2.11-1.el7scon.noarch

How reproducible:
always

Steps to Reproduce:
1. Install a Ceph 2.4 cluster in containers
2. Follow the upgrade documentation
3. Upgrade to brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:2.4-4
4. Check the OSD services

Actual results:

Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/mapper/0e7137c3-a6e6-5956-ad49-ba1714e2acf3": lstat /dev/mapper/0e7137c3-a6e6-5956-ad49-ba1714e2acf3: no such file or directory.


Expected results:
OSD services must be running

Additional info:
Oct 31 05:50:02 magna096 dockerd-current[1121]: time="2017-10-31T05:50:02.464118729Z" level=error msg="Handler for POST /v1.24/containers/0838c276f1549d7776a84a757ac9e87542670d0da07a87748e51066205d3927c/start returned error: linux runtime spec devices: error gathering device information while adding custom device \"/dev/mapper/a74d025c-5ccd-5a06-a2a6-2493a28937cc\": lstat /dev/mapper/a74d025c-5ccd-5a06-a2a6-2493a28937cc: no such file or directory"
Oct 31 05:50:02 magna096 ceph-osd-run.sh[30173]: /usr/bin/docker-current: Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/mapper/a74d025c-5ccd-5a06-a2a6-2493a28937cc": lstat /dev/mapper/a74d025c-5ccd-5a06-a2a6-2493a28937cc: no such file or directory.
Oct 31 05:50:02 magna096 dockerd-current[1121]: time="2017-10-31T05:50:02.465165541Z" level=info msg="{Action=remove, LoginUID=4294967295, PID=30201}"
Oct 31 05:50:02 magna096 systemd-udevd[29743]: inotify_add_watch(7, /dev/dm-1, 10) failed: No such file or directory
Oct 31 05:50:03 magna096 systemd-udevd[29743]: inotify_add_watch(7, /dev/dm-1, 10) failed: No such file or directory
Oct 31 05:50:03 magna096 systemd[1]: ceph-osd: main process exited, code=exited, status=127/n/a
Oct 31 05:50:03 magna096 dockerd-current[1121]: time="2017-10-31T05:50:03.469620588Z" level=info msg="{Action=stop, LoginUID=4294967295, PID=30297}"
Oct 31 05:50:03 magna096 dockerd-current[1121]: time="2017-10-31T05:50:03.470107301Z" level=error msg="Handler for POST /v1.24/containers/ceph-osd-magna096-devsdd/stop?t=10 returned error: No such container: ceph-osd-magna096-devsdd"
Oct 31 05:50:03 magna096 docker[30297]: Error response from daemon: No such container: ceph-osd-magna096-devsdd
Oct 31 05:50:03 magna096 dockerd-current[1121]: time="2017-10-31T05:50:03.470130853Z" level=error msg="Handler for POST /v1.24/containers/ceph-osd-magna096-devsdd/stop returned error: No such container: ceph-osd-magna096-devsdd"
Oct 31 05:50:03 magna096 systemd[1]: Unit ceph-osd entered failed state.
Oct 31 05:50:03 magna096 systemd[1]: ceph-osd failed.
Oct 31 05:50:05 magna096 systemd[1]: ceph-osd holdoff time over, scheduling restart.
Oct 31 05:50:05 magna096 systemd[1]: Starting Ceph OSD...
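
The failure signature in the journal excerpt above can be extracted mechanically. The following is a minimal sketch and not part of the original report: it scans journal lines for the Docker "error gathering device information" message and collects the device paths that were reported missing. The function name `missing_devices` and the regular expression are illustrative assumptions.

```python
import re

# Matches the Docker daemon error seen in the journal excerpt and captures
# the device path. The optional backslash handles the escaped quotes that
# dockerd emits inside its msg="..." field.
DEVICE_ERROR = re.compile(
    r'error gathering device information while adding custom device '
    r'\\?"(?P<dev>/[^"\\]+)\\?"'
)


def missing_devices(lines):
    """Return the sorted, de-duplicated device paths Docker could not find."""
    found = set()
    for line in lines:
        match = DEVICE_ERROR.search(line)
        if match:
            found.add(match.group("dev"))
    return sorted(found)
```

Feeding it the lines of a saved journal for the OSD unit gives one entry per missing device, which makes it easier to see whether every failing OSD points at a /dev/mapper path.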

Comment 4 Sébastien Han 2017-10-31 14:59:14 UTC
The new container image along with the new ceph-ansible version will fix that.
As a result, do we need a fix for that image? Can't we provide a new image version instead?

AFAIR the fix for this is huge, so we had better fix it by shipping the new container + ceph-ansible version.

Greg, what do you think?

Comment 7 Sébastien Han 2017-11-03 15:20:25 UTC
Please move this bug to VERIFIED and create a new one for the reboot issue, also please share logs from the ceph-osd@ services.
Thanks.

Comment 8 Harish NV Rao 2017-11-03 15:45:07 UTC
(In reply to leseb from comment #7)
> Please move this bug to VERIFIED and create a new one for the reboot issue,
> also please share logs from the ceph-osd@ services.
> Thanks.

Hi Sebastien,

I am not sure what has been fixed that would justify moving this defect to the VERIFIED state. Can't we track the reboot issue in this defect itself?

I still feel this defect can be used to track the reboot issue, whether it occurs after upgrade or after installation.

I am changing the summary of this bug to match the above. Please check whether that is OK.

summary changed: "2.4 Container: OSD services are restarting continuously"

Regards,
Harish

Comment 10 Sébastien Han 2017-11-03 19:18:54 UTC
Alright, you seem to be using dmcrypt on that OSD. Indeed, there is a bug when doing the reboot. I don't have any quick fix for this. 2.5 has the fix, but it's not backportable, not within this limited window, sorry.
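
The error in the description appears consistent with this: Docker is asked to pass through a /dev/mapper device that does not exist when the container starts. As a diagnostic sketch only (this is not the fix shipped in 2.5, and the helper name `absent_paths` is hypothetical), one could check which of the expected mapper paths are absent on the node after a reboot:

```python
import os


def absent_paths(paths, exists=os.path.exists):
    """Return the subset of paths that do not exist on this host.

    `exists` is injectable so the function can be exercised without
    touching real device nodes.
    """
    return [p for p in paths if not exists(p)]
```

Calling `absent_paths([...])` with the /dev/mapper paths taken from the journal shows whether the encrypted mappings were opened before the OSD unit tried to start.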

Comment 11 Ken Dreyer (Red Hat) 2017-11-07 16:53:56 UTC
Is this same issue present in the older docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:2.4-2 image?

Comment 12 Sébastien Han 2017-11-15 09:15:54 UTC
@Ken I believe it's present in 2.4-2 yes?

Comment 13 Ken Dreyer (Red Hat) 2017-11-15 15:53:21 UTC
The reason I'm asking is to verify this is not a regression between 2.4-2 and 2.4-4. It sounds like it has been an issue all along.

Comment 15 Ken Dreyer (Red Hat) 2018-01-03 15:20:27 UTC
Sébastien would you please confirm this is resolved in ceph-container upstream? Since you're resync'ing from upstream (in bug 1498183) will this be addressed in the resync?

Comment 16 Sébastien Han 2018-01-03 16:40:51 UTC
Ken, I need to revisit https://bugzilla.redhat.com/show_bug.cgi?id=1498183 since we had a lot of back and forth and must make sure we have the right content again. I'll respond in https://bugzilla.redhat.com/show_bug.cgi?id=1498183, I might need to repush.

Comment 17 Sébastien Han 2018-01-03 16:48:52 UTC
Just checked: it looks like I did a final revert in 5ce1f6490314b6489384e1b4f3f6d7b6c91e6b88 and we are good now.

Comment 19 Vasishta 2018-01-24 03:21:38 UTC
Created attachment 1385163 [details]
File contains journald logs of an OSD service

Hi,

Updated the cluster from 2.4 to 2.5 following the doc and rebooted the node with dmcrypt+dedicated OSDs. The OSD services are not coming up.

Moving BZ to ASSIGNED state.

Regards,
Vasishta

Comment 20 Sébastien Han 2018-01-30 10:40:36 UTC
Can we access the machine that has the issue?
Thanks

Comment 21 Harish NV Rao 2018-01-30 16:14:07 UTC
(In reply to leseb from comment #20)
> Can we access the machine that has the issue?
> Thanks
@Seb, we currently don't have the system that shows this issue. We are trying to reproduce it and will update the bug with details once it is reproduced.

Comment 22 Sébastien Han 2018-01-30 16:27:37 UTC
Thanks a lot.

Comment 24 Sébastien Han 2018-01-31 15:58:09 UTC
The current failure is not a surprise.
The correct procedure to upgrade is the following:

* pull latest ceph-ansible code (2.5)
* run rolling_update.yml

Your step 3 should have been what I just listed.

That should be enough.
The doc is outdated, we should remove it.

We worked on something similar with Vasi and we successfully did the update.

Please try and let us know how that goes.
Thanks!
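
The two-step procedure above can be sketched as a small wrapper. This is an illustration under stated assumptions, not the documented procedure: the inventory path and the location of rolling_update.yml inside the ceph-ansible checkout are placeholders to adapt, and the function names are hypothetical. Only `ansible-playbook` and its `-i` inventory option are taken as given.

```python
import subprocess


def rolling_update_cmd(inventory, playbook="rolling_update.yml"):
    """Build the ansible-playbook argv for the rolling update.

    Both arguments are assumptions: point them at your own inventory
    file and at wherever rolling_update.yml lives in your checkout.
    """
    return ["ansible-playbook", "-i", inventory, playbook]


def run_rolling_update(inventory, playbook="rolling_update.yml", cwd=None):
    """Run the playbook from the ceph-ansible directory; raises on failure."""
    cmd = rolling_update_cmd(inventory, playbook)
    return subprocess.run(cmd, cwd=cwd, check=True)
```

Building the argument list separately from running it lets the command be reviewed or logged before anything touches the cluster.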

Comment 25 Vasishta 2018-02-01 15:37:25 UTC
Hi,

QE had not used ceph-ansible to upgrade a containerized cluster from 2.4 to 2.5 before.

Sebastien, thanks for the inputs.

QE tested rolling_update from 2.4 live to 2.5 using ceph-ansible-3.0.22-1.el7cp.noarch and ceph-2-rhel-7-docker-candidate-49560-20180131210934.

Post upgrade, the dmcrypt OSDs came up successfully and were active and running after a node reboot.

I have filed BZ 1541010 to update the doc with steps that guide users to upgrade the cluster using rolling_update.

Moving BZ to VERIFIED state.


Regards,
Vasishta 
AQE, Ceph

Comment 28 errata-xmlrpc 2018-02-21 20:38:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0341