Bug 2103256
| Summary: | [RFE] Create a maintenance pod to use ceph tools | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Vikhyat Umrao <vumrao> |
| Component: | rook | Assignee: | Parth Arora <paarora> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Mahesh Shetty <mashetty> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.11 | CC: | assingh, bkunal, linuxkidd, mhackett, mmuench, muagarwa, nravinas, ocs-bugs, odf-bz-bot, owasserm, paarora, tdesala, tnielsen |
| Target Milestone: | --- | Keywords: | FutureFeature |
| Target Release: | ODF 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | Feature: Add debug mode for mon and osd. Reason: To run advanced maintenance operations on the cluster seamlessly. Result: The cluster is now capable of running debug mode using the krew plugin. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-02-08 14:06:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Michael Kidd's scripts: a standalone cephadm script to dump the PG log and trim the PG dups: https://github.com/linuxkidd/ceph-misc/blob/master/osd_pglog_trim.sh. The same thing needs to be achieved in ODF; the ODF variant of the script follows the steps mentioned in comment#0: https://github.com/linuxkidd/ceph-misc/blob/master/osd_pglog_trim.odf.sh. Another issue is that these steps are not the same in all ODF versions; they change from version to version, which causes more problems. Also, this new maintenance container should be able to use a new container image from an internal, external, or local source.

The approach I see doable for such a tool to perform maintenance on OSD pods would require:
1. Scale down the rook operator
2. Scale down the OSD daemon deployment
3. Create a new OSD "debug" deployment
   a. Copy the existing OSD daemon deployment and give it a new name (e.g. rook-ceph-osd-<id>-debug)
   b. Remove the startup/liveness probes
   c. Replace the main ceph-osd container with the debug container that runs the desired image (by default it will be the same image as the osd pod)

When done debugging the OSD pod:
   a. Delete the OSD debug deployment
   b. Scale back up the OSD deployment
   c. Start the operator back up again

To allow the operator to continue running even while the maintenance is being run, we would need a code change to skip an OSD reconcile if there is a debug deployment found for the OSD.

With the exception of the operator code change just mentioned, the tool should be fully implementable in the krew plugin [1] with commands such as the following:
1. `oc scale --replicas=0 deploy/rook-ceph-operator` # temporary until the operator fix is implemented
2. `kubectl rook-ceph create-debug-osd --id <osd-id> [--image <alternate-image>]` # scales the OSD down and starts the debug OSD pod
3. Perform debug operations in the debug osd pod
4. `kubectl rook-ceph restore-osd --id <osd-id>` # deletes the debug pod and scales back up the original osd pod
5. `oc scale --replicas=1 deploy/rook-ceph-operator` # temporary until the operator fix is implemented

Does this approach work for the support team? I don't see a simpler approach to this tool in Rook. The only way to access an OSD in Rook is by mounting it with the same PVs and using the same init containers that the OSD daemon pod uses. We just don't have the same simple view of hardware that is seen by cephadm on bare metal. Rook OSDs are backed by PVs, which are an abstraction from the host.

A positive point of this approach is that the krew plugin updates are independent from the product and do not need to wait for product releases.

[1] https://github.com/rook/kubectl-rook-ceph
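As an illustration of step 3 above, a minimal shell sketch of the manual equivalent of `create-debug-osd` (assuming OSD id 0, the `openshift-storage` namespace, and `jq` available; this is only a sketch of the steps described in comment #6, not the plugin's actual implementation):

```bash
osdid=0
ns=openshift-storage

# Stop reconciliation and the OSD daemon (steps 1 and 2 above)
oc -n $ns scale deploy/rook-ceph-operator --replicas=0
oc -n $ns scale deploy/rook-ceph-osd-$osdid --replicas=0

# Step 3: copy the OSD deployment, rename it, drop the probes, and keep the
# osd container idle so tools can run against the stopped daemon's data
oc -n $ns get deploy/rook-ceph-osd-$osdid -o json \
  | jq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.ownerReferences, .status)
        | .metadata.name += "-debug"
        | del(.spec.template.spec.containers[0].livenessProbe,
              .spec.template.spec.containers[0].startupProbe)
        | .spec.template.spec.containers[0].command = ["sleep", "infinity"]
        | .spec.template.spec.containers[0].args = []' \
  | oc -n $ns create -f -

# When done: delete rook-ceph-osd-<id>-debug, scale the original OSD
# deployment back to 1, and scale the operator back up.
```

The proposed plugin commands would wrap this copy-and-idle pattern so it does not have to be re-scripted by hand for every ODF version.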
(In reply to Travis Nielsen from comment #6)
> The approach I see doable for such a tool to perform maintenance on OSD pods would require: [...]

Thanks, Travis. To start with, this looks good to me. One quick question: the above workflow will be independent of the pod type, right? The same method can be used for MDS, MON, MGR, etc.

Two updates:
1. An upstream issue for the krew plugin is opened to describe that work item: https://github.com/rook/kubectl-rook-ceph/issues/35
2. A rook PR is opened for the operator to skip reconcile of mon or osd deployments that are in maintenance: https://github.com/rook/rook/pull/10585

But how helpful will it really be for the operator to skip reconciling those mons/osds in debug mode? We will still need to scale down the operator in other scenarios anyway. Should we just abandon the rook PR and simply scale down the operator for this scenario too? Consistency seems better than needing to remember in which scenarios the operator must be stopped.

(In reply to Travis Nielsen from comment #11)
> Should we just abandon the rook PR and just scale down the operator for this scenario too? [...]

Thank you, Travis. Sounds good to me! @r.martinez and @linuxkidd - please review the workflow and provide your feedback.

Travis - if I understand correctly, steps 1 and 5 from comment #6 would be needed in all cases, whether MON, OSD, MDS, etc.:
1. `oc scale --replicas=0 deploy/rook-ceph-operator`
5. `oc scale --replicas=1 deploy/rook-ceph-operator`

But when it comes to MONs and OSDs we would need steps 2 to 4 in addition to 1 and 5:
2. `kubectl rook-ceph create-debug-osd --id <osd-id> [--image <alternate-image>]` # scales the OSD down and starts the debug OSD pod
3. Perform debug operations in the debug osd pod
4. `kubectl rook-ceph restore-osd --id <osd-id>` # deletes the debug pod and scales back up the original osd pod

and all of this will be taken care of by the krew plugin. For daemons other than MONs and OSDs it would be the ceph-tools pod with steps 1 and 5. I hope my understanding is correct?

Vikhyat

Yes, your understanding is correct, thanks.

(In reply to Travis Nielsen from comment #6)
> 2. kubectl rook-ceph create-debug-osd --id <osd-id> [--image <alternate-image>] # scales the OSD down and starts the debug OSD pod

Travis, one quick question to confirm my understanding. https://github.com/rook/rook/pull/10585 - this PR is needed for the above command, right? I mean, when a user runs this command we need to make sure that the operator is not reconciling the OSD and MON pods. Or is it needed so that the following command can be skipped:

`oc scale --replicas=0 deploy/rook-ceph-operator`

so that we can directly run:

`kubectl rook-ceph create-debug-osd --id <osd-id>`

Thanks,
Vikhyat

(In reply to Vikhyat Umrao from comment #17)
> https://github.com/rook/rook/pull/10585 - This PR is needed for the above command right? [...]

Correct, if PR 10585 is merged, the operator will be aware of the debug pods and skip them during reconcile. There will be no need to scale down the operator in that case.

But if that fix is not in the release against which the debug tool is run, the operator will need to be scaled down. Otherwise, the original mon/osd daemon pods will be scaled up the next time the operator reconciles.

(In reply to Travis Nielsen from comment #18)
> Correct, if PR 10585 is merged, the operator will be aware of the debug pods and skip them during reconcile. [...]

Awesome. Thanks, Travis! And if I understand correctly, the current plan is to take this PR in 4.12.

So ODF releases before 4.12 would need the operator to be scaled down, but from 4.12 onward it is just a matter of running `kubectl rook-ceph create-debug-osd --id <osd-id>` directly, and the scale-down part will be taken care of by this PR's feature.

(In reply to Vikhyat Umrao from comment #19)
> So ODF releases before 4.12 would need the operator to be scaled down but from 4.12 and above it is just directly running the `kubectl rook-ceph create-debug-osd --id <osd-id>` [...]

Correct, when running this tool against releases older than 4.12, the operator would need to be scaled down before the tool is run and scaled back up after it's done. We could also see about backporting it further if helpful.
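To summarize the version split discussed above, a short sketch of the two workflows for an OSD (command names as proposed in comment #6; the final plugin syntax may differ):

```bash
# ODF < 4.12: the operator does not yet skip daemons in debug mode, so it
# must be scaled down for the duration of the maintenance.
oc scale --replicas=0 deploy/rook-ceph-operator
kubectl rook-ceph create-debug-osd --id <osd-id>
# ... run the maintenance tools in the debug pod ...
kubectl rook-ceph restore-osd --id <osd-id>
oc scale --replicas=1 deploy/rook-ceph-operator

# ODF >= 4.12 (with rook PR 10585): the operator skips mon/osd deployments
# that are in debug mode, so only the plugin commands are needed.
kubectl rook-ceph create-debug-osd --id <osd-id>
# ... maintenance ...
kubectl rook-ceph restore-osd --id <osd-id>
```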
(In reply to Travis Nielsen from comment #20)
> Correct, when running this tool against older than 4.12, the operator would need to be scaled down before the tool is run, and scaled back up after it's done. [...]

Thank you, Travis.

The changes have all been merged and are currently available in the following releases:

# Operator changes

Merged to 4.12: the operator will ignore mon and osd pods that are currently operating in debug mode. This change could be backported if needed to avoid the need to stop the operator pod when debugging these daemons.

# Krew Plugin

The tool to create the debug pods for mons and osds is merged to the Rook krew plugin; see the doc here: https://github.com/rook/kubectl-rook-ceph#debug-mode

This tool is not part of the downstream product; it is to be installed locally by the support team member who is doing the debugging. The tool is expected to run against any version of Rook/ODF, though QE validation would be helpful.
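For reference, usage of the merged debug mode looks roughly like the following. The syntax below reflects my reading of the debug-mode section of the linked README and should be verified against it; `rook-ceph-osd-0` is just an example deployment name:

```bash
# Put an OSD into debug mode: the plugin scales down rook-ceph-osd-0 and
# starts a matching debug deployment without the ceph-osd daemon running
kubectl rook-ceph debug start rook-ceph-osd-0

# ... exec into the debug pod and run ceph-objectstore-tool / ceph-bluestore-tool ...

# Leave debug mode: the debug deployment is removed and the original OSD
# deployment is scaled back up
kubectl rook-ceph debug stop rook-ceph-osd-0
```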
[RFE] Create a maintenance pod to use ceph tools

In the current ODF setup, if we want to use ceph tools - for example ceph-objectstore-tool, ceph-bluestore-tool, ceph-monstore-tool, etc. - there are multi-level steps involved in accessing them, because the essential requirement for these tools is that the daemon against which they are being used must be down while the tools run.

This RFE proposes a concept similar to what we have in the standalone ceph orchestrator cephadm, called `cephadm shell`. The cephadm shell can create a dummy container for maintenance purposes. It has three basic major options:

- `--fsid <cluster fsid>`
- `--image <the new container image to be used in this maintenance container>`
- `--name <name/type of the daemon you want to create>`

For example, for an OSD, if you want to create a maintenance container to run ceph-objectstore-tool (COT) with a new image, we just need to run the following command:

```bash
cephadm shell --fsid ${fsid} --name osd.${osdid} ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osdid} --op list-pgs > /var/log/ceph/${fsid}/osd.${osdid}/osd.${osdid}_list-pgs.txt #2>/dev/null
```

But to achieve the same thing in ODF you have to do the following:

1. Scale down the rook-ceph and ocs operators:

```bash
oc scale deployment {rook-ceph,ocs}-operator --replicas=0 -n openshift-storage
```

2. Back up the osd deployment yaml:

```bash
oc get deployment rook-ceph-osd-${osdid} -o yaml > ${osdid}.yaml
```

3. Remove the livenessProbe for the osd.${osdid} pod and put the pod to sleep:

```bash
# log and waitOSDPod are helper functions defined in the linked
# osd_pglog_trim.odf.sh script.
osdpod=$(oc get pod -l osd=${osdid} -o name)

# Remove the livenessProbe with a JSON patch
resp=$(oc patch deployment rook-ceph-osd-${osdid} -n openshift-storage --type=json -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]')
RETVAL=$?
if [ $RETVAL -ne 0 ]; then
  log "ERROR: Failed to remove livenessProbe osd.${osdid} - ret: $RETVAL"
  exit $RETVAL
fi
if [ $(echo $resp | grep -c "no change") -eq 0 ]; then
  waitOSDPod ${osdid} ${osdpod}
fi

log "INFO: Sleeping osd.${osdid} pod"
osdpod=$(oc get pod -l osd=${osdid} -o name)

# Replace the osd container command with sleep, optionally switching the image
if [ ! -z "$imagerepo" ]; then
  resp=$(oc patch deployment rook-ceph-osd-${osdid} -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"image": "'${imagerepo}'", "name": "osd", "command": ["sleep", "infinity"], "args": []}]}}}}')
  RETVAL=$?
else
  resp=$(oc patch deployment rook-ceph-osd-${osdid} -n openshift-storage -p '{"spec": {"template": {"spec": {"containers": [{"name": "osd", "command": ["sleep"], "args": ["infinity"]}]}}}}')
  RETVAL=$?
fi
if [ $RETVAL -ne 0 ]; then
  log "ERROR: Failed to sleep osd.${osdid} - ret: $RETVAL"
  exit $RETVAL
fi
if [ $(echo $resp | grep -c "no change") -eq 0 ]; then
  waitOSDPod ${osdid} ${osdpod}
fi
```
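The manual procedure above stops once the OSD pod is put to sleep; when the maintenance work (e.g. the PG log trim) is finished, the reverse steps implied in comment #6 would look roughly like this (a sketch, not part of the original script):

```bash
# Scaling the operators back up lets rook reconcile the patched OSD
# deployment back to its original spec and restart the daemon.
oc scale deployment {rook-ceph,ocs}-operator --replicas=1 -n openshift-storage

# If the patched deployment should be reverted explicitly instead of waiting
# for the operator, the backup taken in step 2 can be restored:
oc replace --force -f ${osdid}.yaml -n openshift-storage
```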