Bug 1507770
| Summary: | 2.4 Container: OSD services are restarting continuously | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Sidhant Agrawal <sagrawal> | ||||
| Component: | Container | Assignee: | Sébastien Han <shan> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Vasishta <vashastr> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 2.4 | CC: | agunn, anharris, dang, flucifre, gmeno, hchen, hnallurv, jim.curtis, kdreyer, pbyregow, pprakash, prsurve, sagrawal, shan, vashastr | ||||
| Target Milestone: | rc | ||||||
| Target Release: | 2.5 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | rhceph:ceph-2-rhel-7-docker-candidate-65237-20180109194512 | Doc Type: | No Doc Update | ||||
| Doc Text: | | Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-02-21 20:38:32 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 1498183 | ||||||
| Bug Blocks: | |||||||
| Attachments: | ||||||
The new container image along with the new ceph-ansible version will fix that.

As a result, do we need to get a fix for that image? Can't we provide a new image version instead?

AFAIR the fix for this is huge, so we had better fix it by having the new container + ceph-ansible version. Greg, what do you think?

Please move this bug to VERIFIED and create a new one for the reboot issue, also please share logs from the ceph-osd@ services.
Thanks.

(In reply to leseb from comment #7)
> Please move this bug to VERIFIED and create a new one for the reboot issue,
> also please share logs from the ceph-osd@ services.
> Thanks.

Hi Sebastien,
I am not sure what has been fixed that would justify moving this defect to the VERIFIED state. Can't we track the reboot issue in this defect itself? I still feel this defect can be used to track the reboot issue, whether it occurred after an upgrade or after an installation. I am changing the summary of this bug to match the above. Please check whether it is OK.
summary changed: "2.4 Container: OSD services are restarting continuously"
Regards,
Harish

Alright, you seem to be using dmcrypt on that OSD. Indeed there is a bug when doing the reboot. I don't have any quick fix for this. 2.5 has the fix; it's not backportable, not with this limited window, sorry.

Is this same issue present in the older docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:2.4-2 image?

@Ken I believe it's present in 2.4-2, yes?

The reason I'm asking is to verify this is not a regression between 2.4-2 and 2.4-4. It sounds like it has been an issue all along.

Sébastien, would you please confirm this is resolved in ceph-container upstream? Since you're resync'ing from upstream (in bug 1498183), will this be addressed in the resync?

Ken, I need to revisit https://bugzilla.redhat.com/show_bug.cgi?id=1498183 since we had a lot of back and forth and must make sure we have the right content again. I'll respond in https://bugzilla.redhat.com/show_bug.cgi?id=1498183; I might need to repush.
Just checked; it looks like I did a final revert in 5ce1f6490314b6489384e1b4f3f6d7b6c91e6b88 and we are good now.

Created attachment 1385163
File contains journald logs of an OSD service
Hi,
Updated the cluster from 2.4 to 2.5 following the doc and rebooted the node that has dmcrypt+dedicated OSDs; the OSD services are not coming up.
Moving BZ to ASSIGNED state.
Regards,
Vasishta
Can we access the machine that has the issue?
Thanks

(In reply to leseb from comment #20)
> Can we access the machine that has the issue?
> Thanks

@Seb, we don't have the system as of now which has this issue. We are trying to get it reproduced. Will update the bug with details once the issue is reproduced.

Thanks a lot. The current failure is not a surprise. The correct procedure to upgrade is the following:
* pull latest ceph-ansible code (2.5)
* run rolling_update.yml

Your step 3 should have been what I just listed. That should be enough. The doc is outdated, we should remove it. We worked on something similar with Vasi and we successfully did the update. Please try and let us know how that goes. Thanks!

Hi,
QE had not used ceph-ansible to upgrade containerized cluster from 2.4 to 2.5 before.

Sebastien, Thanks for the inputs. QE tested rolling_update from 2.4 live to 2.5 using ceph-ansible-3.0.22-1.el7cp.noarch and ceph-2-rhel-7-docker-candidate-49560-20180131210934. Post upgrade, dmcrypt OSDs could successfully come up active and running after node reboot. I have filed BZ 1541010 to update the Doc with steps to guide user to upgrade cluster using rolling_update. Moving BZ to VERIFIED state.

Regards,
Vasishta
AQE, Ceph

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:0341
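For reference, the two-step upgrade procedure recommended in the comments (pull the ceph-ansible 2.5 code, then run rolling_update.yml) could be scripted roughly as below. This is only a sketch: the inventory file name `hosts` and the helper `rolling_update_command` are hypothetical, the command is assumed to be run from the ceph-ansible checkout, and the playbook path may differ depending on where rolling_update.yml sits in that checkout. The helper only builds the command line; it does not run anything.

```python
import shlex


def rolling_update_command(inventory, playbook="rolling_update.yml"):
    """Build (but do not run) the ansible-playbook call for a rolling update.

    `inventory` is the path to the Ansible inventory file; the value used in
    the example below is a placeholder, not taken from the bug report.
    """
    return ["ansible-playbook", "-i", inventory, playbook]


# Hypothetical inventory name, for illustration only.
cmd = rolling_update_command("hosts")

# Print a shell-safe rendering so the command can be reviewed before running.
print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the command instead of executing it keeps the sketch safe to try on a workstation; passing `cmd` to `subprocess.run` would be the obvious next step on a real admin node.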
Description of problem:
After upgrading from container 2.4 to 2.4 Async, OSD services are restarting.

Version-Release number of selected component (if applicable):
ceph-ansible-2.2.11-1.el7scon.noarch

How reproducible:
always

Steps to Reproduce:
1. Install ceph 2.4 cluster in container
2. Follow the upgrade documentation.
3. Now upgrade to brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhceph:2.4-4
4. Check OSD services

Actual results:
Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/mapper/0e7137c3-a6e6-5956-ad49-ba1714e2acf3": lstat /dev/mapper/0e7137c3-a6e6-5956-ad49-ba1714e2acf3: no such file or directory.

Expected results:
OSD services must be running

Additional info:
Oct 31 05:50:02 magna096 dockerd-current[1121]: time="2017-10-31T05:50:02.464118729Z" level=error msg="Handler for POST /v1.24/containers/0838c276f1549d7776a84a757ac9e87542670d0da07a87748e51066205d3927c/start returned error: linux runtime spec devices: error gathering device information while adding custom device \"/dev/mapper/a74d025c-5ccd-5a06-a2a6-2493a28937cc\": lstat /dev/mapper/a74d025c-5ccd-5a06-a2a6-2493a28937cc: no such file or directory"
Oct 31 05:50:02 magna096 ceph-osd-run.sh[30173]: /usr/bin/docker-current: Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/mapper/a74d025c-5ccd-5a06-a2a6-2493a28937cc": lstat /dev/mapper/a74d025c-5ccd-5a06-a2a6-2493a28937cc: no such file or directory.
Oct 31 05:50:02 magna096 dockerd-current[1121]: time="2017-10-31T05:50:02.465165541Z" level=info msg="{Action=remove, LoginUID=4294967295, PID=30201}"
Oct 31 05:50:02 magna096 systemd-udevd[29743]: inotify_add_watch(7, /dev/dm-1, 10) failed: No such file or directory
Oct 31 05:50:03 magna096 systemd-udevd[29743]: inotify_add_watch(7, /dev/dm-1, 10) failed: No such file or directory
Oct 31 05:50:03 magna096 systemd[1]: ceph-osd: main process exited, code=exited, status=127/n/a
Oct 31 05:50:03 magna096 dockerd-current[1121]: time="2017-10-31T05:50:03.469620588Z" level=info msg="{Action=stop, LoginUID=4294967295, PID=30297}"
Oct 31 05:50:03 magna096 dockerd-current[1121]: time="2017-10-31T05:50:03.470107301Z" level=error msg="Handler for POST /v1.24/containers/ceph-osd-magna096-devsdd/stop?t=10 returned error: No such container: ceph-osd-magna096-devsdd"
Oct 31 05:50:03 magna096 docker[30297]: Error response from daemon: No such container: ceph-osd-magna096-devsdd
Oct 31 05:50:03 magna096 dockerd-current[1121]: time="2017-10-31T05:50:03.470130853Z" level=error msg="Handler for POST /v1.24/containers/ceph-osd-magna096-devsdd/stop returned error: No such container: ceph-osd-magna096-devsdd"
Oct 31 05:50:03 magna096 systemd[1]: Unit ceph-osd entered failed state.
Oct 31 05:50:03 magna096 systemd[1]: ceph-osd failed.
Oct 31 05:50:05 magna096 systemd[1]: ceph-osd holdoff time over, scheduling restart.
Oct 31 05:50:05 magna096 systemd[1]: Starting Ceph OSD...
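
The restart loop in the journal above appears to come down to docker being handed a /dev/mapper path that does not exist at container start. When triaging a longer journal, it may help to pull out just the missing device paths. The following is a minimal sketch, assuming the journal has been saved as plain text; the function name `missing_mapper_devices` and the regular expression are illustrative only, derived from the log lines quoted in this report, and are not part of any Ceph or Docker tooling.

```python
import re

# Pattern derived solely from the error text quoted in this bug report;
# other docker versions may word the message differently (assumption).
MISSING_DEVICE_RE = re.compile(
    r"lstat (?P<device>/dev/mapper/[0-9a-f-]+): no such file or directory"
)


def missing_mapper_devices(journal_text):
    """Return unique /dev/mapper paths reported missing, in first-seen order."""
    seen = []
    for match in MISSING_DEVICE_RE.finditer(journal_text):
        device = match.group("device")
        if device not in seen:
            seen.append(device)
    return seen


# One of the log lines from this report, used as sample input.
sample = (
    "Oct 31 05:50:02 magna096 ceph-osd-run.sh[30173]: /usr/bin/docker-current: "
    "Error response from daemon: linux runtime spec devices: error gathering "
    "device information while adding custom device "
    '"/dev/mapper/a74d025c-5ccd-5a06-a2a6-2493a28937cc": lstat '
    "/dev/mapper/a74d025c-5ccd-5a06-a2a6-2493a28937cc: no such file or directory."
)

print(missing_mapper_devices(sample))
```

Matching on the `lstat ...` portion rather than the quoted device name avoids having to handle both the escaped and unescaped quoting styles that dockerd and ceph-osd-run.sh use for the same error.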