Description of problem (please be as detailed as possible and provide log snippets):

When the underlying disk of an OSD fails, there is currently no way (other than using the toolbox) to remove the OSD from the cluster. Thus the cluster will always stay in WARNING state (after data is rebalanced).

Rook intentionally does not implement this and tells people to use the toolbox - since we do not want to document toolbox steps in OCS, we need to have OCS do the OSD purging instead:
https://rook.io/docs/rook/v1.2/ceph-osd-mgmt.html

Marking this for 4.4, since this will mostly be a problem with local disks, but it could just as well happen in the cloud when an EBS volume fails.

Version of all relevant components (if applicable):
All versions, new feature

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Cluster will be stuck in HEALTH_WARN, but is usable

Is there any workaround available to the best of your knowledge?
Use the toolbox with undocumented/unsupported steps.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
No

Expected results:
I would assume that the OCS Operator has some feature that can be triggered when the user is certain that a disk will not return. On activating that feature, I would assume the following happens:
* ceph osd out {id}
* Wait until recovery finishes
* If recovery does not finish within 24h, alert
* When recovery is finished: ceph osd purge {id} --yes-i-really-mean-it

This follows the documented steps at:
* https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
* https://github.com/rook/rook/blob/master/Documentation/ceph-osd-mgmt.md#remove-an-osd

Additional info:
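For illustration, here is a minimal shell sketch of the flow the requested operator feature would automate, assuming admin-level access to the Ceph cluster (today that means the toolbox or equivalent); the 24h alerting is only hinted at in a comment:

    # Hedged sketch only - not an endorsed procedure.
    OSD_ID=1

    # Take the failed OSD out so its data is rebalanced to the remaining OSDs.
    ceph osd out osd.${OSD_ID}

    # Wait until the OSD no longer holds any data (the proposed operator
    # feature would raise an alert if this takes longer than 24h).
    while ! ceph osd safe-to-destroy osd.${OSD_ID}; do
        sleep 60
    done

    # Remove the OSD from the CRUSH map, OSD map, and auth database.
    ceph osd purge ${OSD_ID} --yes-i-really-mean-it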
This should be part of disk replacement. We don't know what a 'failed disk' is until the user decides to replace it.
That's why I said in the Expected results:

> I would assume that the OCS Operator has some feature that can be triggered when the user is certain that a disk will not return

So yes, the user needs to trigger this and then the operator needs to perform the work. I agree that this would be part of the disk replacement procedure. Currently there's no way we can document this without using the toolbox, and this is not limited to local disks (it is just more frequent with local disks).
4.3 is about to be released and 4.4 is pretty much closed. Moving to 4.5.
@Chris Until there is an e2e story for replacing a disk, what option are you suggesting other than documenting ceph commands from the toolbox? The OCS operator, similar to Rook, cannot make the decision to remove an OSD unless the admin confirms the action. Until there is a complete UX for failed disks, I don't see any option other than documenting the toolbox commands.
To clarify: I never said that the OCS Operator would automatically detect that a disk has failed (and then do operations). What I said is that the user detects that a disk is permanently lost (for whatever reason) and then does _something_ to the OCS Operator that triggers actions by the OCS Operator. At the very beginning of this is a manual action by a user (who is triggered by an alarm, for example).

I know that UI changes are not possible in the short term, but I'm sure we can figure something out that works with the CLI. I can imagine two options:

#1 Remove the specific storageClassDeviceSet from the CephCluster CR
Removing the item from the list of storageClassDeviceSets would trigger the OSD deletion, including the removal of the OSD deployment. We would then re-add an item to this list to recreate everything for a new OSD.

#2 Remove the OSD deployment
If OCS can monitor the running Deployments, then it would be better UX if people could delete the Deployment and this would trigger the OSD removal... since the Deployment is actually the instantiation of the OSD inside of the Kubernetes cluster. Since we do not change the CephCluster CR, the OCS Operator should try to create a new OSD Deployment with a new OSD ;)

I think #1 is probably safer to do, but harder to explain... while with #2 we could have people accidentally delete OSD deployments with potentially fatal results...
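To make the two options concrete, a rough hedged sketch (resource names like ocs-storagecluster and rook-ceph-osd-1 are illustrative examples, and neither option is an endorsed procedure):

    # Option 1: drop (or shrink) the affected storageClassDeviceSet. In OCS this
    # would have to go through the StorageCluster CR, since the OCS operator
    # owns and reconciles the CephCluster CR.
    oc edit storagecluster ocs-storagecluster -n openshift-storage
    #   ...remove or reduce the count of the device set backing the failed OSD,
    #   then re-add it to get a replacement OSD created.

    # Option 2: delete the OSD's Deployment and let the operator react to it.
    oc delete deployment rook-ceph-osd-1 -n openshift-storage
    #   Today Rook simply re-creates the deployment for the same OSD; the
    #   proposal is that this deletion would instead trigger purging the OSD.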
If the warning went away after the data was rebalanced, would that mitigate the problem? It seems like OCS shouldn't be surfacing a warning if the data isn't at risk.

Replacing a disk needs a lot of thought to find the right solution. The challenge is how to do this as a one-off task instead of with the typical "desired state" pattern that operators implement. Until we have an e2e solution, we can't get around documenting Ceph commands from the toolbox.

Some thoughts on those two examples:

#1
- If you remove the storageClassDeviceSet from the cluster CR, you are signaling that your desired state is that you don't want any of the OSDs in that device set anymore. If there is only one OSD then that works in this case, but if "count" is higher than one then you will be destroying other OSDs at the same time.
- You cannot edit the CephCluster directly or else the OCS operator will reset it at the next reconcile.

#2
If Rook sees that a deployment is removed, it will automatically re-create it for the missing OSD. The desired state is to keep OSDs running, not to allow them to be removed.
Moving back to 4.4. I fully agree that releasing 4.4 with 2 new platforms that rely on LSO will require this BZ to be fixed.
Hello,

I am not aware of a way to handle purging a failed OSD (`ceph osd purge`) without using the toolbox as of now. Also, currently the ceph command for purging an OSD feels safer compared to editing the CephCluster CR or deleting the OSD deployment.

The minimal steps to remove a failed OSD:

1. Delete the failed OSD's deployment.
2. Delete the PVC and PV related to the failed OSD.
3. Once the OSD fails and goes down, it will be automatically marked out of the cluster after approximately 5 to 7 minutes. In that case, the `ceph osd out` command would not be necessary. Once the OSD is out, recovery will start.
4. Purge the OSD. The strictly needed command from the Ceph side is `ceph osd purge osd.id --yes-i-really-mean-it`.
5. If the OSD is not purged, it will remain in the CRUSH map indefinitely, and so will appear in the `ceph osd tree` output.
No development work on this will happen in OCS 4.4. At best this is a documentation effort which is already underway. Do we want to convert this BZ into a documentation bug or leave it for tracking a scripted solution that obfuscates the Ceph commands (which, again, will not happen in 4.4)?
1. Once the OSD is down, mark the down OSD out if it isn't already out: ceph osd out {osd-id}
2. After marking the OSD out, allow backfilling/recovery to complete.
3. Once data is fully recovered, remove the OSD from the Ceph cluster: ceph osd purge <ID> --yes-i-really-mean-it
4. Remove the OSD pod deployment and PVC.
5. Log in to the machine with <bad_osd> (oc debug node/<node_w_bad_osd>).
6. Record /dev/disk/by-id/<bad_id> (ls -alh /mnt/local-storage/localblock).
7. Edit the localvolume local-block CR and remove /dev/disk/by-id/<bad_id> for <bad_osd>.
8. Log in to the machine with <bad_osd> (oc debug node/<node_w_bad_osd>).
9. Remove the symlink (rm /mnt/local-storage/localblock/sdc) for <bad_device_name>.
10. Delete the PV related to the OSD.
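A hedged sketch of the Kubernetes-side commands for steps 4-10 above (the OSD id, PVC name, node name, LSO namespace, and device names are placeholders, not from a verified environment):

    OSD_ID=1
    NS=openshift-storage

    # Step 4: remove the OSD deployment and its PVC.
    oc delete deployment rook-ceph-osd-${OSD_ID} -n ${NS}
    oc delete pvc <ocs-deviceset-pvc-name> -n ${NS}

    # Steps 5-6: find the failed device's by-id symlink on the node.
    oc debug node/<node_w_bad_osd> -- chroot /host ls -alh /mnt/local-storage/localblock

    # Step 7: remove the /dev/disk/by-id/<bad_id> entry from the LocalVolume CR.
    oc edit localvolume local-block -n local-storage   # namespace may differ per install

    # Steps 8-9: remove the stale symlink on the node.
    oc debug node/<node_w_bad_osd> -- chroot /host rm /mnt/local-storage/localblock/<bad_device_name>

    # Step 10: delete the released PV that was backed by the failed disk.
    oc delete pv <pv-name>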
If we decide to document the ceph commands to purge an OSD (removing it from the crush map), then we also need to explain to users how to access the toolbox. Until now, we have decided that we will not include any toolbox usage in the documentation.

This BZ is to figure out if and how we can solve this issue either programmatically or by changing our previous decision about telling people about the toolbox in the documentation. Pinging Yaniv, since he usually wants to avoid the toolbox.

As for the exact Ceph commands that would be necessary to run to remove an OSD in OCS - we already have them here:
https://docs.google.com/document/d/1adk4MeyOxU48XsAlK7LU5at4WMf0DCcEwcn66QckIfs/edit
As I explained earlier, we have no intentions to allow users to access the ceph toolbox.
In that case, we need to find a way to run these commands on the Ceph cluster without the toolbox. At this moment I do not see how we could do this only by changing/adding documentation steps. Pinging Jose in case he has an idea on how to proceed.
Of course we want to avoid the toolbox, but we need a more complete solution in 4.5 before that is possible.

If the issue is the complexity or extra step of starting the toolbox, another way to run ceph commands is from the rook operator. Something like this:

1. Connect to the operator pod:
   oc rsh <operator-pod>

2. Copy the config for ceph commands into the place ceph expects it by default, to make it easier to run the ceph commands:
   cp /var/lib/rook/openshift-storage/openshift-storage.config /etc/ceph/ceph.conf

3. Run the ceph commands:
   ceph osd purge ...
(In reply to Travis Nielsen from comment #17)
> Of course we want to avoid the toolbox, but need a more complete solution in
> 4.5 before that is possible.
> 
> If the issue is the complexity or extra step of starting the toolbox,
> another way to run ceph commands is from the rook operator. Something like
> this:
> 
> 1. Connect to the operator pod
>    oc rsh <operator-pod>
> 
> 2. Copy the config for ceph commands into the place ceph expects them by
>    default to make it easier to run the ceph commands
>    cp /var/lib/rook/openshift-storage/openshift-storage.config /etc/ceph/ceph.conf
> 
> 3. Run the ceph commands
>    ceph osd purge ...

I'm expecting either an 'oc ... <yaml file>' or, better, a UI-based workflow for disk replacement, for OCS BM GA.
> I'm expecting either an 'oc ... <yaml file>' or better, a UI based workflow for disk replacement, for OCS BM GA.

Agreed, any ceph commands are just temporary until we have the full solution.
The design for a more complete solution is captured in this upstream issue: https://github.com/rook/rook/issues/5258. This defines the job that would trigger Rook to purge an OSD, which we could target for the 4.5 release. Once we have the job implemented, we can consider what UI to build around launching the job. Let's keep this BZ to track the short-term documentation issue for 4.4.
If we're keeping this as a documentation issue, we should move it to the documentation component. However, the documentation will REQUIRE the use of the Ceph toolbox. So either we document nothing and move this BZ to OCS 4.5, leaving OSD replacement as a Support-only operation, or we relent and document the toolbox commands for 4.4. Looping in Sahina for visibility.
(In reply to Jose A. Rivera from comment #21)
> If we're keeping this as a documentation issue, we should move it to the
> documentation component. However, the documentation will REQUIRE the use of
> the Ceph toolbox. So either we document nothing and move this BZ to OCS 4.5,
> leaving OSD replacement as a Support-only operation, or we relent and
> document the toolbox commands for 4.4.
> 
> Looping in Sahina for visibility.

See https://bugzilla.redhat.com/show_bug.cgi?id=1821219#c17 for an alternative to running the toolbox.
(In reply to Travis Nielsen from comment #22)
> See https://bugzilla.redhat.com/show_bug.cgi?id=1821219#c17 for an
> alternative to running the toolbox

Is this a safe operation? I wasn't aware we'd be recommending this in production clusters. That would at least avoid the toolbox Pod, but still of course require the Ceph commands.
(In reply to Jose A. Rivera from comment #23)
> (In reply to Travis Nielsen from comment #22)
> > See https://bugzilla.redhat.com/show_bug.cgi?id=1821219#c17 for an
> > alternative to running the toolbox
> 
> Is this a safe operation? I wasn't aware we'd be recommending this in
> production clusters. That would at least avoid the toolbox Pod, but still of
> course require the Ceph commands.

This will simply run another process inside the operator pod. It won't affect the main operator process.

After prototyping both approaches, 1) running a command in the operator pod and 2) running a command in a job, I much prefer the simplicity of the operator command.

1) Here are the commands to run in the operator. No `oc rsh` is necessary, we can exec directly. Just copy/paste a few commands into the console where you execute "oc".

OSD_ID_TO_REMOVE=1
ROOK_NAMESPACE=openshift-storage

echo "finding operator pod"
ROOK_OPERATOR_POD=$(oc -n ${ROOK_NAMESPACE} get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}')

echo "marking osd out"
oc exec -it ${ROOK_OPERATOR_POD} -n ${ROOK_NAMESPACE} -- \
  ceph osd out osd.${OSD_ID_TO_REMOVE} \
  --cluster=${ROOK_NAMESPACE} --conf=/var/lib/rook/${ROOK_NAMESPACE}/${ROOK_NAMESPACE}.config --keyring=/var/lib/rook/${ROOK_NAMESPACE}/client.admin.keyring

echo "purging osd"
oc exec -it ${ROOK_OPERATOR_POD} -n ${ROOK_NAMESPACE} -- \
  ceph osd purge ${OSD_ID_TO_REMOVE} --force --yes-i-really-mean-it \
  --cluster=${ROOK_NAMESPACE} --conf=/var/lib/rook/${ROOK_NAMESPACE}/${ROOK_NAMESPACE}.config --keyring=/var/lib/rook/${ROOK_NAMESPACE}/client.admin.keyring

2) Here is the yaml to start a job. IMO, managing a separate yaml file adds too much complexity for supportability. This only runs a single ceph command. We actually need at least two ceph commands, so we would need to run two jobs or else have two containers in the job.

apiVersion: batch/v1
kind: Job
metadata:
  name: remove-osd-OSD_ID_TO_REMOVE
  namespace: ROOK_CLUSTER_NAMESPACE
  labels:
    app: rook-remove-osd-OSD_ID_TO_REMOVE
spec:
  template:
    spec:
      containers:
        - name: remove-osd-OSD_ID_TO_REMOVE
          image: ROOK_IMAGE
          volumeMounts:
            - mountPath: /etc/ceph/keyring-store/
              name: rook-ceph-mons-keyring
              readOnly: true
            - mountPath: /etc/ceph
              name: rook-config-override
              readOnly: true
          env:
            - name: ROOK_CEPH_MON_HOST
              valueFrom:
                secretKeyRef:
                  key: mon_host
                  name: rook-ceph-config
            - name: ROOK_CEPH_MON_INITIAL_MEMBERS
              valueFrom:
                secretKeyRef:
                  key: mon_initial_members
                  name: rook-ceph-config
          command: ["bash", "-c"]
          args:
            - ceph osd purge OSD_ID_TO_REMOVE --yes-i-really-mean-it --id=admin --mon-host=$(ROOK_CEPH_MON_HOST) --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS) --keyring=/etc/ceph/keyring-store/keyring
      volumes:
        - name: rook-ceph-mons-keyring
          secret:
            defaultMode: 420
            secretName: rook-ceph-mons-keyring
        - name: rook-config-override
          configMap:
            name: rook-config-override
            defaultMode: 420
            items:
              - key: config
                mode: 292
                path: ceph.conf
      restartPolicy: OnFailure
The statement that a Job is only able to execute a single command is not true.

The "nicest" way to execute multiple commands is to create a bash script in a ConfigMap, then mount that inside the Job's Pod and run it with bash. This gives you all the flexibility; you can in-line the bash script in your YAML manifest next to the Job description and apply it all at once. If you do not want to create a second object next to the Job, you can still execute multiple commands by chaining them with either '&&' or ';'.

This is not arguing whether 1) or 2) should be used, but I wanted to explain that you can run multiple commands in a single container ;)

PS: The advantage of using a Job vs. running a local bash script is that the Job is preserved with all the command output. This way it is easier to diagnose later whether something that went wrong was (or was not) caused by this maintenance operation.

PS2: I think the restartPolicy should be set to "Never"
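For illustration, a minimal sketch of the ConfigMap-script-in-a-Job pattern described above (the names osd-purge-script and purge-osd-job are placeholders, and the ceph config/keyring mounts from the earlier Job example are omitted for brevity):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: osd-purge-script
      namespace: openshift-storage
    data:
      purge-osd.sh: |
        #!/bin/bash
        # Multiple ceph commands can run here; assumes the ceph config and
        # keyring are mounted as in the Job example above.
        set -e
        ceph osd out osd.${OSD_ID}
        ceph osd purge ${OSD_ID} --yes-i-really-mean-it
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: purge-osd-job
      namespace: openshift-storage
    spec:
      template:
        spec:
          restartPolicy: Never          # per the PS2 suggestion above
          containers:
            - name: purge-osd
              image: ROOK_IMAGE         # placeholder, as in the earlier example
              command: ["bash", "/scripts/purge-osd.sh"]
              env:
                - name: OSD_ID
                  value: "1"
              volumeMounts:
                - name: script
                  mountPath: /scripts
                # ceph config/keyring volumeMounts omitted for brevity
          volumes:
            - name: script
              configMap:
                name: osd-purge-script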
We could ship an OpenShift Template that gets deployed with ocs-operator's initialization. Then the user just has to instantiate the template with the osd-id as a parameter. That way, the process could be triggered with a single command.
The template could contain a Job that runs a ConfigMap bash script.
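As a rough illustration of that idea (purely a sketch; the template name, parameter name, and job contents are hypothetical and not what was eventually merged):

    apiVersion: template.openshift.io/v1
    kind: Template
    metadata:
      name: ocs-osd-removal            # hypothetical name
      namespace: openshift-storage
    parameters:
      - name: FAILED_OSD_ID            # hypothetical parameter name
        required: true
    objects:
      - apiVersion: batch/v1
        kind: Job
        metadata:
          name: ocs-osd-removal-${FAILED_OSD_ID}
          namespace: openshift-storage
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: remove-osd
                  image: ROOK_IMAGE    # placeholder
                  command: ["bash", "/scripts/purge-osd.sh"]
                  env:
                    - name: OSD_ID
                      value: "${FAILED_OSD_ID}"
                  volumeMounts:
                    - name: script
                      mountPath: /scripts
              volumes:
                - name: script
                  configMap:
                    name: osd-purge-script   # the ConfigMap script sketched above

The admin would then instantiate it with a single `oc process` command, as sketched further below.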
(In reply to Rohan CJ from comment #26)
> We could ship an Openshift Template that gets deployed with ocs-operator's
> initialization. Then the user just has to instantiate the template with
> osd-id as a parameter. That way, the process could be triggered with a
> single command.

Hmm... interesting. With this idea, OCS Initialization could create a Template for a Job, and if we take Chris' idea, also create a ConfigMap with the required script. These are things that would be easy to implement in ocs-operator, and it could be done early next week if we can converge on a design.
Updates have been made to Annette's doc so we can now avoid the toolbox. Servesha has also verified that these steps are working. The doc currently uses the approach of running the ceph commands in the operator. The PR is still going through final modifications before handing off to QE, but it is certainly open to comments:
https://github.com/red-hat-storage/ocs-training/pull/155/files

Chris, agreed there are smarter ways to do things with the job. My argument for 4.4 would be to keep it simple and go with the document as written. It really doesn't feel like we would be simplifying things nearly enough to justify any code changes for 4.4. Let's save the engineering effort to improve this in 4.5.

The idea of using the template for running a job is a very good one that we will be able to use for the 4.5 solution.
After discussing more with Jose, Annette, & JC, we recommend going ahead with the template solution for 4.4.

- The ugly oc commands with all the ceph parameters become completely hidden from the user, thus largely achieving our goal of hiding ceph commands in 4.4 instead of waiting for 4.5.
- The admin doesn't have to deal with any yaml directly. The template is created under the covers by the OCS operator.
- The only detail of Ceph the admin will ever see is that they need to find the OSD ID.
- This is not throwaway work. In 4.5 we will need a template anyway to start the rook job that does the more complete work to remove the OSD. The template will need to be updated, but we can re-use all of this work.

Servesha, let's discuss in more detail what is needed to create the template for 4.4.
(In reply to Travis Nielsen from comment #30)
> After discussing more with Jose, Annette, & JC we would recommend going
> ahead with the template solution for 4.4.
> - The ugly oc commands with all the ceph parameters become completely hidden
> from the user, thus largely achieving our goal now to hide ceph commands for
> 4.4 instead of waiting for 4.5.
> - The admin doesn't have to deal with any yaml directly. The template is
> created under the covers by the OCS operator.
> - The only detail of Ceph the admin will ever see is that they need to find
> the OSD ID

How will the admin find the ID?
@Travis sure. Ack!
Created attachment 1680215 [details]
OSD failed Prometheus alert

The firing alert has the OSD ID in its labels.
> How will the admin find the ID?

I added a screenshot of the alert that fires when an OSD is down... as you can see, the labels include the OSD ID.
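If the alert labels are not at hand, a couple of hedged CLI alternatives for finding the ID (assuming the usual Rook naming conventions):

    # OSD pods/deployments are named rook-ceph-osd-<ID>; a pod that is not
    # Running/Ready points at the failed OSD.
    oc get pods -n openshift-storage -l app=rook-ceph-osd

    # Or ask Ceph directly (e.g. via the operator pod as shown earlier);
    # the failed OSD shows up as "down" in the tree.
    ceph osd tree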
(In reply to Chris Blum from comment #34)
> > How will the admin find the ID?
> 
> I added a screenshot of the alert that is firing when an OSD is down... as
> you can see, the labels include the OSD ID.

Thanks - looks reasonable enough for 4.4.
Hello,

@Travis, I tested the example job template[1] manually to purge the failed OSD. It works! I can make the respective changes in ocs-operator to merge a fix.

Sample output:

servesha$ oc logs -f rook-ceph-toolbox-job-0-b6b7g -n openshift-storage
marked out osd.0.
purged osd.0

[1] https://github.com/travisn/rook/blob/toolbox-job-purge/cluster/examples/kubernetes/ceph/toolbox-job.yaml
I am a bit confused about where we are with this BZ:

(1) We have an issue in rook with a design description: https://github.com/rook/rook/issues/5258

(2) We have a WIP PR in ocs-operator with a template for the job: https://github.com/openshift/ocs-operator/pull/481

(3) Do we also need a patch in rook?

@Servesha?
@Travis, maybe you can provide clarity as well.
(In reply to Michael Adam from comment #41)
> I am a bit confused where we are with this BZ:
> 
> (1) we have an issue in rook with a design description:
> https://github.com/rook/rook/issues/5258

This is the design for 4.5, where we will provide more functionality in a rook job to fully clean up an OSD. This will be a separate work item from this BZ.

> (2) we have a WIP PR in ocs operator with a template for the job:
> https://github.com/openshift/ocs-operator/pull/481

Correct, we need to merge this. Servesha, Ashish, and I just discussed it and are hoping to have it merged tomorrow.

> (3) do we also need a patch in rook?

Yes, a small change was needed in rook to support this. It has been merged to the rook release-4.4 branch: https://github.com/openshift/rook/pull/44

(4) Once the OCS PR is merged we can make the final updates to the docs, which are drafted here: https://github.com/red-hat-storage/ocs-training/pull/155/files
Opened the PR based on the proposal. The ocs-operator will reconcile the template and make sure it is present. Then, in case of an OSD failure, the admin can take control of deciding whether or not to remove the OSD. The admin can use the `oc process` command with the failed OSD ID as a parameter to the template. I'm looking forward to pushing all the changes by tomorrow. Thanks
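For illustration, instantiating such a template would look roughly like this (the template and parameter names are assumed for the sketch and may not match the merged PR exactly):

    # Render the template with the failed OSD ID and create the resulting Job.
    oc process -n openshift-storage ocs-osd-removal \
        -p FAILED_OSD_ID=1 | oc create -f -

    # Then follow the job's pod logs to confirm the OSD was marked out and purged
    # (job name assumed to embed the OSD ID, as in the template sketch above).
    oc logs -n openshift-storage -l job-name=ocs-osd-removal-1 -f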
Moving back to Assigned since the bot was too fast and made the change before the remaining work items were done.
Thanks Travis, your explanation makes the picture perfectly clear! Servesha, thanks for the details! Let's try to move this patch forward tomorrow, as it was already expected today for the RC.. :-)
master PR is merged, backport PR created: https://github.com/openshift/ocs-operator/pull/484
Backport PR is merged!
@michael @anjana please provide input on comment#40.

Also, we now need a draft of the complete procedure for Disk Replacement in LSO, including the steps to incorporate the fix which this BZ provides.

Some queries:
--------------------
1. Do we need a separate BZ for Documentation to track the Disk Replacement procedure? If yes, we need to raise it ASAP.
2. With the fix for this BZ, do we expect any change in code? If yes, do we need to use an OCS 4.4-rc2 build (not yet available) and beyond for Disk Replacement tests?

Currently, we have tried disk replacement using the draft doc shared by UAT, hence please confirm the official process.
(In reply to Servesha from comment #39)
> Hello,
> 
> @Travis, I tested the example job template[1] manually to purge the failed
> osd. It works! I can make respective changes in ocs-operator to merge a fix.
> 
> Sample output:
> 'servesha$ oc logs -f rook-ceph-toolbox-job-0-b6b7g -n openshift-storage
> marked out osd.0.
> purged osd.0'
> 
> [1]
> https://github.com/travisn/rook/blob/toolbox-job-purge/cluster/examples/
> kubernetes/ceph/toolbox-job.yaml

Can you pass the correct link? I am not able to access this yaml file.
4.4.0-414.ci contains the fix. This is 4.4.0-rc2.

(In reply to Neha Berry from comment #49)
> @michael @anjana please provide an input on comment#40
> 
> Also, we now need a draft of the complete procedure for Disk Replacement in
> LSO including the steps to incorporate the fix which this BZ provides
> 
> Some queries:
> --------------------
> 
> 1. Do we need a separate BZ for Documentation to track Disk Replacement
> procedure? If yes, we need to raise it ASAP.

Per Kusuma's reply, yes, and this exists.

> 2. With the fix for this BZ, do we expect any change in code? If yes, do we
> need to use OCS 4.4-rc2 build (not yet available) and beyond for Disk
> Replacement tests?

Yes. rc2 is now available with this fix.

> Currently, we have tried disk replacement using the draft doc shared by UAT,
> hence please confirm the official process.
(In reply to Bipin from comment #50)
> Can you pass the correct link? I am not able to access this yaml file.

Sure. I copied the contents into the Pastebin[1] since the link is not working.

[1] http://pastebin.test.redhat.com/859192
@pratik It would be better to verify the fix once https://bugzilla.redhat.com/show_bug.cgi?id=1827978 is also ON_QA and a new RC build is available with the --force option for purging OSDs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2393
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days