Bug 1628713 - ceph-osd containers should be referenced by OSD number not just device name
Summary: ceph-osd containers should be referenced by OSD number not just device name
Keywords:
Status: CLOSED DUPLICATE of bug 1544836
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Container
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: 3.*
Assignee: Sébastien Han
QA Contact: Vasishta
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-09-13 18:52 UTC by Ben England
Modified: 2018-09-13 19:08 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-13 19:08:34 UTC
Embargoed:



Description Ben England 2018-09-13 18:52:25 UTC
Description of problem:

On a system with lots of ceph-osd containers going up and down, it becomes extremely difficult to troubleshoot problems with OSDs or manage OSDs.  Specifically:

- Container names are based on the block device name, which has nothing to do with the OSD number and is not guaranteed to be stable across reboots.
- systemd unit files are named by block device name, not OSD number, so stopping, starting, or checking the status of a given OSD number is not easy.

The problem is that OSD containers are named like this:

ceph-osd-overcloud-compute-12-sdad

So to see logs for osd.N, I first have to translate N into a container ID/name like the one above.

Sometimes, though, containers are named by OSD number, like this:

expose_partitions_169

As for the systemd unit files, if I do systemctl | grep osd, I see:

[root@overcloud-compute-13 ~]# systemctl | grep osd
ceph-osd      loaded activating auto-restart Ceph OSD
...

So I can't easily ask whether the osd.N container is up and running.


Version-Release number of selected component (if applicable):

RHCS 3.0 - container image 3-9, running ceph-common-12.2.4-10.el7cp.x86_64
RHOSP 13 GA dated June 21.


Expected results:

You should be able to do things like:

# docker ps | grep osd | grep _N_ | awk '{ print $1 }'

to get the container ID for osd.N, at least.  I'm not asking for a container name change, because sometimes you need to be able to find which container is using device /dev/sdY, for hardware repair purposes.

And you should be able to do something like:

# systemctl -l | grep osd | grep _N_

to find the systemd unit that goes with osd.N.
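
For comparison, the device-based lookup that the current naming already supports can be sketched as below. This is only an illustration: it assumes every OSD container name ends in -<device>, as in ceph-osd-overcloud-compute-12-sdad, and the function names are made up for this example.

```shell
#!/bin/bash
# Sketch only: assumes OSD container names end in "-<device>",
# e.g. ceph-osd-overcloud-compute-12-sdad (hypothetical helper names).

# Read container names on stdin, print those whose name ends in -<device>.
# Accepts either "sdad" or "/dev/sdad".
names_for_device() {
    local dev="${1#/dev/}"          # strip a leading /dev/ if present
    grep -e "-${dev}\$"             # anchor at end so sda does not match sdad
}

# Print the running container(s) that use the given block device.
container_for_device() {
    docker ps --format '{{.Names}}' | names_for_device "$1"
}
```

Usage would be something like container_for_device /dev/sdad. The request in this bug is for an equally simple filter keyed on the OSD number.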


Additional info:

The only known way to discover the container name for OSD N is to do something like:

# for c in `docker ps | grep osd | awk '{ print $1 }'` ; do docker exec -it $c /bin/bash -c 'ls /var/lib/ceph/osd/*/fsid' ; done

And to find it in the entire cluster, you have to turn this into an ansible command and do something like:

# ansible -u heat-admin -b -i osds.list -m shell -a 'find-osd-num.sh' all

where find-osd-num.sh is just the preceding shell command in a script file.  Ughh.
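
A slightly more readable variant of that script might look like the sketch below, which prints one "osd.N <container name>" line per OSD. It rests on assumptions not confirmed in this report: that the OSD data directory inside each container is /var/lib/ceph/osd/<cluster>-<N> (which is what the ls above relies on), and that OSD container names contain ceph-osd. The function names are invented for illustration.

```shell
#!/bin/bash
# Sketch only: map each running ceph-osd container on this host to its OSD number.
# Assumes the OSD data dir inside the container is /var/lib/ceph/osd/<cluster>-<N>.

# Extract N from a path such as /var/lib/ceph/osd/ceph-169/fsid.
osd_id_from_path() {
    local dir
    dir=$(basename "$(dirname "$1")")   # e.g. ceph-169
    echo "${dir##*-}"                   # drop everything through the last '-'
}

# Print "osd.N <container name>" for each running container named *ceph-osd*.
list_osd_containers() {
    local c p
    for c in $(docker ps --filter name=ceph-osd --format '{{.Names}}'); do
        # -it is dropped here because no interactive TTY is needed
        for p in $(docker exec "$c" /bin/bash -c 'ls /var/lib/ceph/osd/*/fsid' 2>/dev/null); do
            echo "osd.$(osd_id_from_path "$p") $c"
        done
    done
}
```

With these functions in a script that ends by calling list_osd_containers, something like list_osd_containers | grep '^osd.169 ' would give the container for one OSD on that host.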

