
Bug 1628713

Summary: ceph-osd containers should be referenced by OSD number not just device name
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Ben England <bengland>
Component: Container
Assignee: Sébastien Han <shan>
Status: CLOSED DUPLICATE
QA Contact: Vasishta <vashastr>
Severity: high
Priority: medium
Docs Contact:
Version: 3.0
CC: ceph-eng-bugs, dwilson, evelu, gabrioux, johfulto, twilkins
Target Milestone: rc
Target Release: 3.*
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-13 19:08:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ben England 2018-09-13 18:52:25 UTC
Description of problem:

On a system with lots of ceph-osd containers going up and down, it becomes extremely difficult to troubleshoot or manage OSDs.  Specifically:

- container names are based on the block device name, which has nothing to do with the OSD number and is not guaranteed to be stable across reboots
- systemd unit files are also named by block device name, not OSD number, so if I want to stop, start, or check the status of a specific Ceph OSD, it's not easy to do.

The problem is that OSD containers are named like this:

ceph-osd-overcloud-compute-12-sdad

So if I want to see logs for osd.N, I first have to translate that OSD number into a container ID/name like the one above.

Sometimes, though, containers are named by OSD number instead, like this:

expose_partitions_169
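
With the device-based name, the only direct handle for logs is the device itself. For example (using the device-named container from the first example above, which only helps if you already know which device backs the OSD you care about):

# docker logs ceph-osd-overcloud-compute-12-sdad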

On the systemd side, if I run systemctl | grep osd, I see:

[root@overcloud-compute-13 ~]# systemctl | grep osd
ceph-osd      loaded activating auto-restart Ceph OSD
...

So I can't easily ask whether the osd.N container is up and running.
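
Today the only handle is the device name; a rough sketch of checking a single OSD now, assuming ceph-ansible's device-based unit template (the instance name here is an assumption for this host):

# systemctl status ceph-osd@sdad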


Version-Release number of selected component (if applicable):

RHCS 3.0 - container image 3-9, running ceph-common-12.2.4-10.el7cp.x86_64
RHOSP 13 GA dated June 21.


Expected results:

You should be able to do things like:

# docker ps | grep osd | grep _N_ | awk '{ print $1 }'

to get the container ID for osd.N, at least.  I'm not asking for a container name change, because sometimes you need to be able to find which container is using device /dev/sdY, for hardware repair purposes.
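
As a usage sketch, fetching logs for a hypothetical osd.169 would then be a one-liner (the _169_ match assumes the OSD number ends up delimited in the container name, as proposed above):

# docker logs $(docker ps | grep osd | grep _169_ | awk '{ print $1 }')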

And you should be able to do something like:

# systemctl -l | grep osd | grep _N_

to find the systemd unit that goes with osd.N.
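
For example, restarting a hypothetical osd.169 could then look something like this (again assuming the OSD number appears in the unit name, and that the unit name is the first field of the systemctl listing):

# systemctl restart $(systemctl -l | grep osd | grep _169_ | awk '{ print $1 }')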


Additional info:

The only known way to discover the container name for osd.N is to do something like:

# for c in `docker ps | grep osd | awk '{ print $1 }'` ; do docker exec -it $c /bin/bash -c 'ls /var/lib/ceph/osd/*/fsid' ; done

And to find it across the entire cluster, you have to turn this into an Ansible command and do something like:

# ansible -u heat-admin -b -i osds.list -m shell -a 'find-osd-num.sh' all

where find-osd-num.sh is just the preceding shell command in a script file.  Ughh.
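
For reference, a minimal sketch of what find-osd-num.sh might contain (hypothetical; it just wraps the docker loop above and also prints the short hostname and container ID next to each result):

#!/bin/bash
# For every running ceph-osd container, print the short hostname, the
# container ID, and the OSD fsid path (whose directory name contains the OSD number).
for c in $(docker ps | grep osd | awk '{ print $1 }') ; do
    echo -n "$(hostname -s) $c "
    docker exec "$c" /bin/bash -c 'ls /var/lib/ceph/osd/*/fsid'
done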