Bug 2103256
| Summary: | [RFE] Create a maintenance pod to use ceph tools | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Vikhyat Umrao <vumrao> |
| Component: | rook | Assignee: | Parth Arora <paarora> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Mahesh Shetty <mashetty> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.11 | CC: | assingh, bkunal, linuxkidd, mhackett, mmuench, muagarwa, nravinas, ocs-bugs, odf-bz-bot, owasserm, paarora, tdesala, tnielsen |
| Target Milestone: | --- | Keywords: | FutureFeature |
| Target Release: | ODF 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | Feature: Add debug mode for mon and osd. Reason: To run advanced maintenance operations on the cluster seamlessly. Result: The cluster is now capable of running debug mode using the krew plugin. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-02-08 14:06:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Michael Kidd's scripts: a standalone cephadm script to dump the PG log and trim the PG dups: https://github.com/linuxkidd/ceph-misc/blob/master/osd_pglog_trim.sh. The same thing needs to be achieved in ODF; the ODF variant of the script follows the steps mentioned in comment#0: https://github.com/linuxkidd/ceph-misc/blob/master/osd_pglog_trim.odf.sh. Another issue is that these steps are not the same in all ODF versions; they change from version to version, which causes more problems. Also, this new maintenance container should be able to use a new container image from an internal, external, or local source.

The approach I see doable for such a tool to perform maintenance on OSD pods would require:
1. Scale down the rook operator
2. Scale down the OSD daemon deployment
3. Create a new OSD "debug" deployment
   a. Copy the existing OSD daemon deployment and give it a new name (e.g. rook-ceph-osd-<id>-debug)
   b. Remove the startup/liveness probes
   c. Replace the main ceph-osd container with the debug container that runs the desired image (by default it will be the same image as the osd pod)

When done debugging the OSD pod:
   a. Delete the OSD debug deployment
   b. Scale back up the OSD deployment
   c. Start the operator back up again

To allow the operator to continue running even while the maintenance is being run, we would need a code change to skip an OSD reconcile if there is a debug deployment found for the OSD.

With the exception of the operator code change just mentioned, the tool should be fully implementable in the krew plugin [1] with commands such as the following:
1. `oc scale --replicas=0 deploy/rook-ceph-operator` # temporary until the operator fix is implemented
2. `kubectl rook-ceph create-debug-osd --id <osd-id> [--image <alternate-image>]` # scales the OSD down and starts the debug OSD pod
3. Perform debug operations in the debug osd pod
4. `kubectl rook-ceph restore-osd --id <osd-id>` # deletes the debug pod and scales back up the original osd pod
5. `oc scale --replicas=1 deploy/rook-ceph-operator` # temporary until the operator fix is implemented

Does this approach work for the support team? I don't see a simpler approach to this tool in Rook. The only way to access an OSD in Rook is by mounting it with the same PVs and using the same init containers that the OSD daemon pod uses. We just don't have the same simple view of hardware that is seen by cephadm on bare metal. Rook OSDs are backed by PVs, which are an abstraction from the host.

A positive point of this approach is that the krew plugin updates are independent from the product and do not need to wait for product releases.

[1] https://github.com/rook/kubectl-rook-ceph
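As an illustration of step 3 above, a minimal shell sketch of the manual equivalent of `create-debug-osd` (assuming OSD id 0, the `openshift-storage` namespace, and `jq` available; this is only a sketch of the steps described in comment #6, not the plugin's actual implementation):

```bash
osdid=0
ns=openshift-storage

# Stop reconciliation and the OSD daemon (steps 1 and 2 above)
oc -n $ns scale deploy/rook-ceph-operator --replicas=0
oc -n $ns scale deploy/rook-ceph-osd-$osdid --replicas=0

# Step 3: copy the OSD deployment, rename it, drop the probes, and keep the
# osd container idle so tools can run against the stopped daemon's data
oc -n $ns get deploy/rook-ceph-osd-$osdid -o json \
  | jq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.ownerReferences, .status)
        | .metadata.name += "-debug"
        | del(.spec.template.spec.containers[0].livenessProbe,
              .spec.template.spec.containers[0].startupProbe)
        | .spec.template.spec.containers[0].command = ["sleep", "infinity"]
        | .spec.template.spec.containers[0].args = []' \
  | oc -n $ns create -f -

# When done: delete rook-ceph-osd-<id>-debug, scale the original OSD
# deployment back to 1, and scale the operator back up.
```

The proposed plugin commands would wrap this copy-and-idle pattern so it does not have to be re-scripted by hand for every ODF version.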
(In reply to Travis Nielsen from comment #6)
> The approach I see doable for such a tool to perform maintenance on OSD pods would require: [...]

Thanks, Travis. To start with, this looks good to me. One quick question: the above workflow will be independent of the pod type, right? The same method can be used for MDS, MON, MGR, etc.

Two updates:
1. An upstream issue for the krew plugin is opened to describe that work item: https://github.com/rook/kubectl-rook-ceph/issues/35
2. A rook PR is opened for the operator to skip reconcile of mon or osd deployments that are in maintenance: https://github.com/rook/rook/pull/10585

But how helpful will it really be for the operator to skip reconciling those mons/osds in debug mode? We will still need to scale down the operator in other scenarios anyway. Should we just abandon the rook PR and simply scale down the operator for this scenario too? Consistency seems better than needing to remember in which scenarios the operator must be stopped.

(In reply to Travis Nielsen from comment #11)
> Should we just abandon the rook PR and just scale down the operator for this scenario too? [...]

Thank you, Travis. Sounds good to me! @r.martinez and @linuxkidd - please review the workflow and provide your feedback.

Travis - if I understand correctly, steps 1 and 5 from comment #6 would be needed in all cases, whether MON, OSD, MDS, etc.:
1. `oc scale --replicas=0 deploy/rook-ceph-operator`
5. `oc scale --replicas=1 deploy/rook-ceph-operator`

But when it comes to MONs and OSDs we would need steps 2 to 4 in addition to 1 and 5:
2. `kubectl rook-ceph create-debug-osd --id <osd-id> [--image <alternate-image>]` # scales the OSD down and starts the debug OSD pod
3. Perform debug operations in the debug osd pod
4. `kubectl rook-ceph restore-osd --id <osd-id>` # deletes the debug pod and scales back up the original osd pod

and all of this will be taken care of by the krew plugin. For daemons other than MONs and OSDs it would be the ceph-tools pod with steps 1 and 5. I hope my understanding is correct?

Vikhyat

Yes, your understanding is correct, thanks.

(In reply to Travis Nielsen from comment #6)
> 2. kubectl rook-ceph create-debug-osd --id <osd-id> [--image <alternate-image>] # scales the OSD down and starts the debug OSD pod

Travis, one quick question to confirm my understanding. https://github.com/rook/rook/pull/10585 - this PR is needed for the above command, right? I mean, when a user runs this command we need to make sure that the operator is not reconciling the OSD and MON pods. Or is it needed so that the following command can be skipped:

`oc scale --replicas=0 deploy/rook-ceph-operator`

so that we can directly run:

`kubectl rook-ceph create-debug-osd --id <osd-id>`

Thanks,
Vikhyat

(In reply to Vikhyat Umrao from comment #17)
> https://github.com/rook/rook/pull/10585 - This PR is needed for the above command right? [...]

Correct, if PR 10585 is merged, the operator will be aware of the debug pods and skip them during reconcile. There will be no need to scale down the operator in that case.

But if that fix is not in the release against which the debug tool is run, the operator will need to be scaled down. Otherwise, the original mon/osd daemon pods will be scaled up the next time the operator reconciles.

(In reply to Travis Nielsen from comment #18)
> Correct, if PR 10585 is merged, the operator will be aware of the debug pods and skip them during reconcile. [...]

Awesome. Thanks, Travis! And if I understand correctly, the current plan is to take this PR in 4.12.

So ODF releases before 4.12 would need the operator to be scaled down, but from 4.12 onward it is just a matter of running `kubectl rook-ceph create-debug-osd --id <osd-id>` directly, and the scale-down part will be taken care of by this PR's feature.

(In reply to Vikhyat Umrao from comment #19)
> So ODF releases before 4.12 would need the operator to be scaled down but from 4.12 and above it is just directly running the `kubectl rook-ceph create-debug-osd --id <osd-id>` [...]

Correct, when running this tool against releases older than 4.12, the operator would need to be scaled down before the tool is run and scaled back up after it's done. We could also see about backporting it further if helpful.
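To summarize the version split discussed above, a short sketch of the two workflows for an OSD (command names as proposed in comment #6; the final plugin syntax may differ):

```bash
# ODF < 4.12: the operator does not yet skip daemons in debug mode, so it
# must be scaled down for the duration of the maintenance.
oc scale --replicas=0 deploy/rook-ceph-operator
kubectl rook-ceph create-debug-osd --id <osd-id>
# ... run the maintenance tools in the debug pod ...
kubectl rook-ceph restore-osd --id <osd-id>
oc scale --replicas=1 deploy/rook-ceph-operator

# ODF >= 4.12 (with rook PR 10585): the operator skips mon/osd deployments
# that are in debug mode, so only the plugin commands are needed.
kubectl rook-ceph create-debug-osd --id <osd-id>
# ... maintenance ...
kubectl rook-ceph restore-osd --id <osd-id>
```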
(In reply to Travis Nielsen from comment #20)
> Correct, when running this tool against older than 4.12, the operator would need to be scaled down before the tool is run, and scaled back up after it's done. [...]

Thank you, Travis.

The changes have all been merged and are currently available in the following releases:

# Operator changes

Merged to 4.12: the operator will ignore mon and osd pods that are currently operating in debug mode. This change could be backported if needed to avoid the need to stop the operator pod when debugging these daemons.

# Krew Plugin

The tool to create the debug pods for mons and osds is merged to the Rook krew plugin; see the doc here: https://github.com/rook/kubectl-rook-ceph#debug-mode

This tool is not part of the downstream product; it is to be installed locally by the support team member who is doing the debugging. The tool is expected to run against any version of Rook/ODF, though QE validation would be helpful.
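For reference, usage of the merged debug mode looks roughly like the following. The syntax below reflects my reading of the debug-mode section of the linked README and should be verified against it; `rook-ceph-osd-0` is just an example deployment name:

```bash
# Put an OSD into debug mode: the plugin scales down rook-ceph-osd-0 and
# starts a matching debug deployment without the ceph-osd daemon running
kubectl rook-ceph debug start rook-ceph-osd-0

# ... exec into the debug pod and run ceph-objectstore-tool / ceph-bluestore-tool ...

# Leave debug mode: the debug deployment is removed and the original OSD
# deployment is scaled back up
kubectl rook-ceph debug stop rook-ceph-osd-0
```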
[RFE] Create a maintenance pod to use ceph tools

In the current ODF setup, if we want to use ceph tools - for example ceph-objectstore-tool, ceph-bluestore-tool, ceph-monstore-tool, etc. - there are multi-level steps involved in accessing them, because the essential requirement for these tools is that the daemon against which they are being used must be down while the tools run.

This RFE proposes a concept similar to what we have in the standalone ceph orchestrator cephadm, called `cephadm shell`. The cephadm shell can create a dummy container for maintenance purposes. It has three basic major options:

- `--fsid <cluster fsid>`
- `--image <the new container image to be used in this maintenance container>`
- `--name <name/type of the daemon you want to create>`

For example, for an OSD, if you want to create a maintenance container to run ceph-objectstore-tool (COT) with a new image, we just need to run the following command:

```bash
cephadm shell --fsid ${fsid} --name osd.${osdid} ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osdid} --op list-pgs > /var/log/ceph/${fsid}/osd.${osdid}/osd.${osdid}_list-pgs.txt #2>/dev/null
```

But to achieve the same thing in ODF you have to do the following:

1. Scale down the rook-ceph and ocs operators:

```bash
oc scale deployment {rook-ceph,ocs}-operator --replicas=0 -n openshift-storage
```

2. Back up the osd deployment yaml:

```bash
oc get deployment rook-ceph-osd-${osdid} -o yaml > ${osdid}.yaml
```

3. Remove the livenessProbe for the osd.${osdid} pod and put the pod to sleep:

```bash
# log and waitOSDPod are helper functions defined in the linked
# osd_pglog_trim.odf.sh script.
osdpod=$(oc get pod -l osd=${osdid} -o name)

# Remove the livenessProbe with a JSON patch
resp=$(oc patch deployment rook-ceph-osd-${osdid} -n openshift-storage --type=json -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]')
RETVAL=$?
if [ $RETVAL -ne 0 ]; then
  log "ERROR: Failed to remove livenessProbe osd.${osdid} - ret: $RETVAL"
  exit $RETVAL
fi
if [ $(echo $resp | grep -c "no change") -eq 0 ]; then
  waitOSDPod ${osdid} ${osdpod}
fi

log "INFO: Sleeping osd.${osdid} pod"
osdpod=$(oc get pod -l osd=${osdid} -o name)

# Replace the osd container command with sleep, optionally switching the image
if [ ! -z "$imagerepo" ]; then
  resp=$(oc patch deployment rook-ceph-osd-${osdid} -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"image": "'${imagerepo}'", "name": "osd", "command": ["sleep", "infinity"], "args": []}]}}}}')
  RETVAL=$?
else
  resp=$(oc patch deployment rook-ceph-osd-${osdid} -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "osd", "command": ["sleep"], "args": ["infinity"]}]}}}}')
  RETVAL=$?
fi
if [ $RETVAL -ne 0 ]; then
  log "ERROR: Failed to sleep osd.${osdid} - ret: $RETVAL"
  exit $RETVAL
fi
if [ $(echo $resp | grep -c "no change") -eq 0 ]; then
  waitOSDPod ${osdid} ${osdpod}
fi
```
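The manual procedure above stops once the OSD pod is put to sleep; when the maintenance work (e.g. the PG log trim) is finished, the reverse steps implied in comment #6 would look roughly like this (a sketch, not part of the original script):

```bash
# Scaling the operators back up lets rook reconcile the patched OSD
# deployment back to its original spec and restart the daemon.
oc scale deployment {rook-ceph,ocs}-operator --replicas=1 -n openshift-storage

# If the patched deployment should be reverted explicitly instead of waiting
# for the operator, the backup taken in step 2 can be restored:
oc replace --force -f ${osdid}.yaml -n openshift-storage
```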