Bug 1628713
| Summary: | ceph-osd containers should be referenced by OSD number not just device name | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Ben England <bengland> |
| Component: | Container | Assignee: | Sébastien Han <shan> |
| Status: | CLOSED DUPLICATE | QA Contact: | Vasishta <vashastr> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.0 | CC: | ceph-eng-bugs, dwilson, evelu, gabrioux, johfulto, twilkins |
| Target Milestone: | rc | ||
| Target Release: | 3.* | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-09-13 19:08:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description of problem:

On a system with many ceph-osd containers going up and down, it becomes extremely difficult to troubleshoot or manage OSDs. Specifically:

- container names are based on the block device name, which has nothing to do with the OSD number and is not guaranteed to be stable across reboots
- systemd unit files are named by block device name, not OSD number, so stopping, starting, or checking the status of a given OSD number is not easy

OSD containers are named like this:

    ceph-osd-overcloud-compute-12-sdad

so to see the logs for osd.N, I first have to translate N into a container ID/name like the one above (although sometimes they are named by OSD number instead, like `expose_partitions_169`).

On the systemd unit files, `systemctl | grep osd` shows:

    [root@overcloud-compute-13 ~]# systemctl | grep osd
    ceph-osd   loaded activating auto-restart   Ceph OSD
    ...

so I can't easily ask whether the osd.N container is up and running.

Version-Release number of selected component (if applicable):

RHCS 3.0 container image 3-9, running ceph-common-12.2.4-10.el7cp.x86_64, on RHOSP 13 GA dated June 21.

Expected results:

You should be able to do things like:

    # docker ps | grep osd | grep _N_ | awk '{ print $1 }'

to get the container ID for osd.N, at least. I'm not asking for the container name to change, because sometimes you also need to find which container is using device /dev/sdY, for hardware-repair purposes.
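As an illustration of the requested workflow, here is a minimal shell sketch of the lookup run against canned `docker ps` output; the container names (`ceph-osd-<N>`) and IDs below are hypothetical, invented only to show the pipeline, since the real containers are not named this way today.

```shell
#!/bin/sh
# Hypothetical `docker ps` output, assuming containers carried the OSD
# number in their names (this sample data is invented for illustration).
sample_docker_ps() {
cat <<'EOF'
3f1c2a9b1d2e   rhceph:3-9   "entrypoint.sh"   Up 2 hours   ceph-osd-169
8a7d6e5f4c3b   rhceph:3-9   "entrypoint.sh"   Up 2 hours   ceph-osd-12
EOF
}

# Container ID for osd.N, in the spirit of the report's
#   docker ps | grep osd | grep _N_ | awk '{ print $1 }'
osd_container_id() {
    sample_docker_ps | grep "ceph-osd-$1\$" | awk '{ print $1 }'
}

osd_container_id 169   # prints 3f1c2a9b1d2e
```

The `$`-anchored grep matters: without it, asking for osd.12 would also match osd.120, osd.121, and so on.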
And you should be able to do something like:

    # systemctl -l | grep osd | grep _N_

to find the systemd unit that goes with osd.N.

Additional info:

The only known way to discover the container name for osd.N is something like:

    # for c in `docker ps | grep osd | awk '{ print $1 }'` ; do docker exec -it $c /bin/bash -c 'ls /var/lib/ceph/osd/*/fsid' ; done

And to find it across the entire cluster, you have to turn this into an ansible command:

    # ansible -u heat-admin -b -i osds.list -m shell -a 'find-osd-num.sh' all

where find-osd-num.sh is just the preceding shell command in a script file. Ughh.
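The per-container `ls` trick above works because, in the default Ceph layout (an assumption here), the fsid path embeds the OSD number as `/var/lib/ceph/osd/ceph-<N>/fsid`. Extracting N from such a path is then a one-line sed, sketched below:

```shell
#!/bin/sh
# Extract the OSD number from an fsid path of the form
# /var/lib/ceph/osd/ceph-<N>/fsid (the default Ceph layout; assumed here).
osd_num_from_path() {
    # strip everything up to "ceph-", then the trailing "/fsid"
    echo "$1" | sed -e 's|.*/ceph-||' -e 's|/fsid$||'
}

osd_num_from_path /var/lib/ceph/osd/ceph-169/fsid   # prints 169
```

Piping the output of the `docker exec ... ls` loop through a helper like this would yield the container-to-OSD-number mapping directly, instead of leaving the translation to the operator's eyes.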