
Bug 1628713

Summary: ceph-osd containers should be referenced by OSD number not just device name
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Ben England <bengland>
Component: Container
Assignee: Sébastien Han <shan>
Status: CLOSED DUPLICATE
QA Contact: Vasishta <vashastr>
Severity: high
Priority: medium
Docs Contact:
Version: 3.0
CC: ceph-eng-bugs, dwilson, evelu, gabrioux, johfulto, twilkins
Target Milestone: rc
Target Release: 3.*
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-13 19:08:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ben England 2018-09-13 18:52:25 UTC
Description of problem:

On a system with lots of ceph-osd containers going up and down, it becomes extremely difficult to troubleshoot or manage OSDs.  Specifically:

- container names are based on the block device name, which has nothing to do with the OSD number and is not guaranteed to be stable across reboots
- systemd unit files are also named by block device name, not OSD number, so if I want to stop, start, or check the status of a specific Ceph OSD, it's not easy to do.

The problem is that OSD containers are named like this:

ceph-osd-overcloud-compute-12-sdad

So if I want to see logs for osd.N, I first have to translate that OSD number into a container ID/name like the one above.

Sometimes, though, containers are named by OSD number instead, like this:

expose_partitions_169
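
With the device-based name, the only direct handle for logs is the device itself. For example (using the device-named container from the first example above, which only helps if you already know which device backs the OSD you care about):

# docker logs ceph-osd-overcloud-compute-12-sdad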

On the systemd side, if I run systemctl | grep osd, I see:

[root@overcloud-compute-13 ~]# systemctl | grep osd
ceph-osd      loaded activating auto-restart Ceph OSD
...

So I can't easily ask whether the osd.N container is up and running.
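
Today the only handle is the device name; a rough sketch of checking a single OSD now, assuming ceph-ansible's device-based unit template (the instance name here is an assumption for this host):

# systemctl status ceph-osd@sdad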


Version-Release number of selected component (if applicable):

RHCS 3.0 - container image 3-9, running ceph-common-12.2.4-10.el7cp.x86_64
RHOSP 13 GA dated June 21.


Expected results:

You should be able to do things like:

# docker ps | grep osd | grep _N_ | awk '{ print $1 }'

to get the container ID for osd.N, at least.  I'm not asking for a container name change, because sometimes you need to be able to find which container is using device /dev/sdY, for hardware repair purposes.
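
As a usage sketch, fetching logs for a hypothetical osd.169 would then be a one-liner (the _169_ match assumes the OSD number ends up delimited in the container name, as proposed above):

# docker logs $(docker ps | grep osd | grep _169_ | awk '{ print $1 }')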

And you should be able to do something like:

# systemctl -l | grep osd | grep _N_

to find the systemd unit that goes with osd.N.
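
For example, restarting a hypothetical osd.169 could then look something like this (again assuming the OSD number appears in the unit name, and that the unit name is the first field of the systemctl listing):

# systemctl restart $(systemctl -l | grep osd | grep _169_ | awk '{ print $1 }')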


Additional info:

The only known way to discover the container name for osd.N is to do something like:

# for c in `docker ps | grep osd | awk '{ print $1 }'` ; do docker exec -it $c /bin/bash -c 'ls /var/lib/ceph/osd/*/fsid' ; done

And to find it across the entire cluster, you have to turn this into an Ansible command and do something like:

# ansible -u heat-admin -b -i osds.list -m shell -a 'find-osd-num.sh' all

where find-osd-num.sh is just the preceding shell command in a script file.  Ughh.
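
For reference, a minimal sketch of what find-osd-num.sh might contain (hypothetical; it just wraps the docker loop above and also prints the short hostname and container ID next to each result):

#!/bin/bash
# For every running ceph-osd container, print the short hostname, the
# container ID, and the OSD fsid path (whose directory name contains the OSD number).
for c in $(docker ps | grep osd | awk '{ print $1 }') ; do
    echo -n "$(hostname -s) $c "
    docker exec "$c" /bin/bash -c 'ls /var/lib/ceph/osd/*/fsid'
done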