Bug 1821219 - [LSO] OSD is not being removed upon disk failure
Summary: [LSO] OSD is not being removed upon disk failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: OCS 4.4.0
Assignee: Servesha
QA Contact: Pratik Surve
Docs Contact: Erin Donnelly
URL:
Whiteboard:
Depends On:
Blocks: 1826482
 
Reported: 2020-04-06 10:21 UTC by Chris Blum
Modified: 2023-09-14 05:55 UTC (History)
CC: 21 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When an underlying disk of an OSD failed, the cluster always remained in `WARNING` state after data re-balancing because there was no way to remove the OSD from the cluster. Cleaning up the failed OSD has now been simplified with a job that the administrator can launch.
Clone Of:
Environment:
Last Closed: 2020-06-04 12:54:39 UTC
Embargoed:


Attachments (Terms of Use)
OSD failed Prometheus alert (588.10 KB, image/jpeg)
2020-04-20 09:17 UTC, Chris Blum


Links
Github openshift ocs-operator pull 481 (closed): A job template to remove failed disk (last updated 2021-01-21 20:17:19 UTC)
Github rook rook issues 5258 (closed): Simplify cleaning up a failed OSD with a job that can be launched by the admin (last updated 2021-01-21 20:17:18 UTC)
Red Hat Product Errata RHBA-2020:2393 (last updated 2020-06-04 12:54:53 UTC)

Description Chris Blum 2020-04-06 10:21:58 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When the underlying disk of an OSD fails, there currently is no way (other than using the toolbox) to remove the OSD from the cluster. Thus the cluster will always stay in WARNING state (after data is rebalanced).
Rook intentionally does not implement this and tells people to use the toolbox - since we do not want to document toolbox steps in OCS, we need OCS to do the OSD purging instead:
https://rook.io/docs/rook/v1.2/ceph-osd-mgmt.html

Marking this for 4.4, since this will mostly be a problem with local-disks, but could just as well happen in the cloud when an EBS fails.

Version of all relevant components (if applicable):
All versions, new feature

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Cluster will be stuck in HEALTH WARN, but is usable

Is there any workaround available to the best of your knowledge?
Use the toolbox with undocumented/unsupported steps.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
No


Expected results:
I would assume that the OCS-Operator has some feature that can be triggered when the user is certain that a disk will not return. On activating that feature, I would assume the following happens:
* ceph osd out {id}
* Wait till recovery finishes
   * If recovery does not finish within 24h, raise an alert
* When recovery is finished: ceph osd purge {id} --yes-i-really-mean-it
(A rough sketch of this sequence is included after the links below.)

This follows the documented steps at:
* https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
* https://github.com/rook/rook/blob/master/Documentation/ceph-osd-mgmt.md#remove-an-osd
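
For illustration, a minimal sketch of that manual sequence as it would look from a ceph-capable shell (e.g. the toolbox); the OSD ID, the timing, and the health check used here are simplified assumptions:

    OSD_ID=1                                   # hypothetical ID of the failed OSD
    ceph osd out osd.${OSD_ID}                 # mark it out so data is rebalanced away
    # simplified wait-for-recovery loop; in practice watch 'ceph status' until all PGs are active+clean
    while ceph status | grep -Eq "recover|backfill"; do
        echo "recovery still in progress..."; sleep 60
    done
    ceph osd purge ${OSD_ID} --yes-i-really-mean-it   # remove the OSD from the CRUSH map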

Additional info:

Comment 3 Yaniv Kaul 2020-04-06 11:23:44 UTC
This should be part of disk replacement. We don't know what 'failed disk' is until the user decides to replace it.

Comment 4 Chris Blum 2020-04-06 11:26:46 UTC
That's why I said in the Expected results:
# I would assume that the OCS-Operator has some feature that can be triggered when the user is certain that a disk will not return

So yes the user needs to trigger this and then the operator needs to perform work. I agree that this would be part of the disk replacement procedure.
Currently there's no way we can document this without using the toolbox, and this is not limited to local disks (it is just more frequent with local disks).

Comment 5 Michael Adam 2020-04-07 12:07:14 UTC
4.3 is about to be released and 4.4 is pretty much closed. Moving to 4.5.

Comment 7 Travis Nielsen 2020-04-08 15:17:28 UTC
@Chris Until there is an e2e story for replacing a disk, what other option are you suggesting than to document ceph commands from the toolbox? The OCS operator, similar to Rook, cannot make the decision to remove an OSD unless the admin confirms the action. Until there is a complete UX for failed disks I don't see another option than documenting the toolbox commands.

Comment 8 Chris Blum 2020-04-08 15:41:41 UTC
To clarify: I never said that the OCS Operator would automatically detect that a disk has failed (and perform operations).

What I said is that the user detects that a disk is permanently lost (for whatever reason) and then does _something_ to the OCS Operator that triggers actions by the OCS Operator. At the very beginning of this is a manual action by a user (who is prompted by an alert, for example).

I know that UI changes are not possible in a short-term, but I'm sure we can figure something out that works with the CLI.

I can imagine two options:

# 1 Remove the specific storageClassDeviceSets from the CephCluster CR

Removing the item from the list of storageClassDeviceSets would trigger the OSD deletion including the removal of the OSD deployment.
We would then re-add an item to this list to recreate everything for a new OSD.

# 2 Remove the OSD deployment

If the OCS can monitor the running Deployments, then it would be better UX if people could delete the Deployment and this would trigger the OSD removal... since the Deployment is actually the instantiation of the OSD inside of the Kubernetes cluster.
Since we do not change the CephCluster CR, the OCS Operator should try to create a new OSD Deployment with a new OSD ;)

I think #1 is probably safer to do, but harder to explain... while with #2 we could have people accidentally delete OSD deployments with potentially fatal results...

Comment 9 Travis Nielsen 2020-04-08 17:53:50 UTC
If the warning went away after the data was rebalanced, does that mitigate the problem? Seems like OCS shouldn't be surfacing a warning if the data isn't at risk. 

Replacing a disk needs a lot of thought to find the solution. The challenge is how to do this as a one-off task instead of with the typical "desired state" pattern that operators implement. Until we have an e2e solution, we can't get around documenting Ceph commands from the toolbox. 

Some thoughts on those two examples:

#1 
- If you remove the storageClassDeviceSet from the cluster CR you are signaling that your desired state is that you don't want any of the OSDs in that device set anymore. If there is only one OSD then that works in this case, but if "count" is higher than one then you will be destroying other OSDs at the same time.
- You cannot edit the CephCluster directly or else the OCS operator will reset it at the next reconcile. 

#2 If Rook sees that a deployment is removed, it will automatically re-create it for the missing OSD. The desired state is to keep OSDs running, not allow them to be removed.

Comment 10 Raz Tamir 2020-04-12 12:49:34 UTC
Moving back to 4.4.
I fully agree that releasing 4.4 with 2 new platforms that rely on LSO will require this BZ to be fixed.

Comment 11 Servesha 2020-04-14 11:35:50 UTC
Hello,

As of now I am not aware of a way to handle purging of a failed osd (`ceph osd purge`) without using the toolbox. Also, currently the ceph command for purging the osd feels safer compared to editing the cephcluster CR or deleting the osd deployment.

The minimal steps to remove a failed osd (a rough sketch of these follows below):
1. Delete the failed osd deployment
2. Delete the pvc and pv related to the failed osd
3. Once the osd fails and goes down, it will be automatically marked out of the cluster after approx 5 to 7 minutes. In that case, the `ceph osd out` command would not be necessary. Once the osd is out, recovery will start.
4. Purge the osd. The strictly needed command from the Ceph side is `ceph osd purge osd.id --yes-i-really-mean-it`
5. If the osd is not purged, it will remain in the crush map indefinitely, and likewise in the `ceph osd tree` output.
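
A rough shell sketch of those steps (the deployment, PVC, and PV names are placeholders/assumptions, and the ceph commands still assume a ceph-capable shell such as the toolbox):

    OSD_ID=0                                                           # hypothetical ID of the failed osd
    oc -n openshift-storage delete deployment rook-ceph-osd-${OSD_ID}  # step 1
    oc -n openshift-storage delete pvc <pvc-of-failed-osd>             # step 2
    oc delete pv <pv-of-failed-osd>                                    # step 2
    # step 3: the down osd is marked out automatically after ~5-7 minutes; wait for recovery to finish
    ceph osd purge osd.${OSD_ID} --yes-i-really-mean-it                # step 4
    ceph osd tree                                                      # step 5: confirm it is gone from the crush map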

Comment 12 Jose A. Rivera 2020-04-14 14:09:52 UTC
No development work on this will happen in OCS 4.4. At best this is a documentation effort which is already underway. Do we want to convert this BZ into a documentation bug or leave it for tracking a scripted solution that obfuscates the Ceph commands (which, again, will not happen in 4.4)?

Comment 13 Servesha 2020-04-14 14:28:29 UTC
1. Once the osd is down, mark the down osd out if it isn't already out: ceph osd out {osd-id}
2. After marking the osd out, allow backfilling/recovery to complete
3. Once data is fully recovered, remove the OSD from the Ceph cluster: ceph osd purge <ID> --yes-i-really-mean-it
4. Remove the osd pod deployment and pvc
5. Log in to the machine with <bad_osd> (oc debug node/<node_w_bad_osd>)
6. Record /dev/disk/by-id/<bad_id> (ls -alh /mnt/local-storage/localblock)
7. Edit the localvolume local-block CR and remove /dev/disk/by-id/<bad_id> for <bad_osd>
8. Log in to the machine with <bad_osd> (oc debug node/<node_w_bad_osd>)
9. Remove the symlink (rm /mnt/local-storage/localblock/sdc) for <bad_device_name>
10. Delete the pv related to the osd
(A rough shell sketch of steps 5-10 follows below.)
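
A rough shell sketch of steps 5-10 (node, device, PV, and namespace names are placeholders/assumptions):

    # steps 5-6: shell onto the node and record the by-id path behind the symlink
    oc debug node/<node_w_bad_osd>
    #   (inside the debug shell)  chroot /host
    #                             ls -alh /mnt/local-storage/localblock   -> note /dev/disk/by-id/<bad_id>

    # step 7: remove /dev/disk/by-id/<bad_id> from the devicePaths list (LSO namespace assumed to be local-storage)
    oc edit localvolume local-block -n local-storage

    # steps 8-9: back on the node, remove the stale symlink for <bad_device_name>
    oc debug node/<node_w_bad_osd> -- chroot /host rm /mnt/local-storage/localblock/<bad_device_name>

    # step 10: delete the pv that belonged to the failed osd
    oc delete pv <pv-of-bad-osd>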

Comment 14 Chris Blum 2020-04-14 15:38:56 UTC
If we decide to document the ceph commands to purge an OSD (removing it from the crush map), then we also need to explain to users how to access the toolbox. Until now, we have decided that we will not include any toolbox usage in the documentation.

This BZ is to figure out if and how we can solve this issue either programmatically or by changing our previous decision on telling people about the toolbox in the documentation.

Pinging Yaniv, since he usually wants to avoid the toolbox.

As for the exact Ceph commands that would be necessary to run to remove an OSD in OCS - we already have them here:
https://docs.google.com/document/d/1adk4MeyOxU48XsAlK7LU5at4WMf0DCcEwcn66QckIfs/edit

Comment 15 Yaniv Kaul 2020-04-14 15:51:53 UTC
As I explained earlier, we have no intentions to allow users to access the ceph toolbox.

Comment 16 Chris Blum 2020-04-14 15:54:43 UTC
In that case, we need to find a way to run these commands on the Ceph cluster without the toolbox. At this moment I do not see how we could do this only by changing/adding documentation steps.

Pinging Jose in case he has an idea on how to proceed.

Comment 17 Travis Nielsen 2020-04-14 15:55:26 UTC
Of course we want to avoid the toolbox, but we need a more complete solution in 4.5 before that is possible. 

If the issue is the complexity or extra step of starting the toolbox, another way to run ceph commands is from the rook operator. Something like this:

1. Connect to the operator pod
   oc rsh <operator-pod>

2. Copy the config for ceph commands into the place ceph expects them by default to make it easier to run the ceph commands
   cp /var/lib/rook/openshift-storage/openshift-storage.config /etc/ceph/ceph.conf

3. Run the ceph commands
   ceph osd purge ...

Comment 18 Yaniv Kaul 2020-04-14 16:52:07 UTC
(In reply to Travis Nielsen from comment #17)
> Of course we want to avoid the toolbox, but need a more complete solution in
> 4.5 before that is possible. 
> 
> If the issue is the complexity or extra step of starting the toolbox,
> another way to run ceph commands is from the rook operator. Something like
> this:
> 
> 1. Connect to the operator pod
>    oc rsh <operator-pod>
> 
> 2. Copy the config for ceph commands into the place ceph expects them by
> default to make it easier to run the ceph commands
>    cp /var/lib/rook/openshift-storage/openshift-storage.config
> /etc/ceph/ceph.conf
> 
> 3. Run the ceph commands
>    ceph osd purge ...


I'm expecting either an 'oc ... <yaml file>' or better, a UI based workflow for disk replacement, for OCS BM GA.

Comment 19 Travis Nielsen 2020-04-14 17:53:51 UTC
> I'm expecting either an 'oc ... <yaml file>' or better, a UI based workflow for disk replacement, for OCS BM GA.

Agreed, any ceph commands are just temporary until we have the full solution.

Comment 20 Travis Nielsen 2020-04-14 23:07:12 UTC
The design for a more complete solution is captured in this upstream issue: https://github.com/rook/rook/issues/5258. This defines the job that would trigger Rook to purge an OSD that we could target in the 4.5 release. 
Once we have the job implemented, we can consider what UI to build around launching the job.

Let's keep this BZ to track the short-term documentation issue for 4.4.

Comment 21 Jose A. Rivera 2020-04-15 14:09:22 UTC
If we're keeping this as a documentation issue, we should move it to the documentation component. However, the documentation will REQUIRE the use of the Ceph toolbox. So either we document nothing and move this BZ to OCS 4.5, leaving OSD replacement as a Support-only operation, or we relent and document the toolbox commands for 4.4.

Looping in Sahina for visibility.

Comment 22 Travis Nielsen 2020-04-15 14:12:26 UTC
(In reply to Jose A. Rivera from comment #21)
> If we're keeping this as a documentation issue, we should move it to the
> documentation component. However, the documentation will REQUIRE the use of
> the Ceph toolbox. So either we document nothing and move this BZ to OCS 4.5,
> leaving OSD replacement as a Support-only operation, or we relent and
> document the toolbox commands for 4.4.
> 
> Looping in Sahina for visibility.

See https://bugzilla.redhat.com/show_bug.cgi?id=1821219#c17 for an alternative to running the toolbox

Comment 23 Jose A. Rivera 2020-04-16 18:15:02 UTC
(In reply to Travis Nielsen from comment #22)
> See https://bugzilla.redhat.com/show_bug.cgi?id=1821219#c17 for an
> alternative to running the toolbox

Is this a safe operation? I wasn't aware we'd be recommending this in production clusters. That would at least avoid the toolbox Pod, but still of course require the Ceph commands.

Comment 24 Travis Nielsen 2020-04-17 05:32:21 UTC
(In reply to Jose A. Rivera from comment #23)
> (In reply to Travis Nielsen from comment #22)
> > See https://bugzilla.redhat.com/show_bug.cgi?id=1821219#c17 for an
> > alternative to running the toolbox
> 
> Is this a safe operation? I wasn't aware we'd be recommending this in
> production clusters. That would at least avoid the toolbox Pod, but still of
> course require the Ceph commands.

This will simply run another process inside the operator pod. It won't affect the main operator process.

After prototyping both approaches 1) running a command in the operator pod and 2) running a command in a job, I much prefer the simplicity of the operator command.

1) Here are the commands to run in the operator pod. No `oc rsh` is necessary; we can exec directly. Just copy/paste a few commands into the console where you execute "oc".

OSD_ID_TO_REMOVE=1
ROOK_NAMESPACE=openshift-storage

echo "finding operator pod"
ROOK_OPERATOR_POD=$(oc -n ${ROOK_NAMESPACE} get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}')

echo "marking osd out"
oc exec -it ${ROOK_OPERATOR_POD} -n ${ROOK_NAMESPACE} -- \
  ceph osd out osd.${OSD_ID_TO_REMOVE} \
  --cluster=${ROOK_NAMESPACE} --conf=/var/lib/rook/${ROOK_NAMESPACE}/${ROOK_NAMESPACE}.config --keyring=/var/lib/rook/${ROOK_NAMESPACE}/client.admin.keyring

echo "purging osd"
oc exec -it ${ROOK_OPERATOR_POD} -n ${ROOK_NAMESPACE} -- \
  ceph osd purge ${OSD_ID_TO_REMOVE} --force --yes-i-really-mean-it \
  --cluster=${ROOK_NAMESPACE} --conf=/var/lib/rook/${ROOK_NAMESPACE}/${ROOK_NAMESPACE}.config --keyring=/var/lib/rook/${ROOK_NAMESPACE}/client.admin.keyring
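
(For completeness, a hedged verification step using the same pattern; not strictly required:)

echo "verifying the osd is gone from the crush map"
oc exec -it ${ROOK_OPERATOR_POD} -n ${ROOK_NAMESPACE} -- \
  ceph osd tree \
  --cluster=${ROOK_NAMESPACE} --conf=/var/lib/rook/${ROOK_NAMESPACE}/${ROOK_NAMESPACE}.config --keyring=/var/lib/rook/${ROOK_NAMESPACE}/client.admin.keyring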


2) Here is the yaml to start a job. Managing a separate yaml file is much more complex from a supportability perspective, IMO.
This only runs a single ceph command. We actually need at least two ceph commands, so we would need to run two jobs or else have two containers in the job.

apiVersion: batch/v1
kind: Job
metadata:
  name: remove-osd-OSD_ID_TO_REMOVE
  namespace: ROOK_CLUSTER_NAMESPACE
  labels:
    app: rook-remove-osd-OSD_ID_TO_REMOVE
spec:
  template:
    spec:
      containers:
      - name: remove-osd-OSD_ID_TO_REMOVE
        image: ROOK_IMAGE
        volumeMounts:
        - mountPath: /etc/ceph/keyring-store/
          name: rook-ceph-mons-keyring
          readOnly: true
        - mountPath: /etc/ceph
          name: rook-config-override
          readOnly: true
        env:
        - name: ROOK_CEPH_MON_HOST
          valueFrom:
            secretKeyRef:
              key: mon_host
              name: rook-ceph-config
        - name: ROOK_CEPH_MON_INITIAL_MEMBERS
          valueFrom:
            secretKeyRef:
              key: mon_initial_members
              name: rook-ceph-config
        command: ["bash", "-c"]
        args:
        - ceph osd purge OSD_ID_TO_REMOVE --yes-i-really-mean-it --id=admin --mon-host=$(ROOK_CEPH_MON_HOST) --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS) --keyring=/etc/ceph/keyring-store/keyring
      volumes:
      - name: rook-ceph-mons-keyring
        secret:
          defaultMode: 420
          secretName: rook-ceph-mons-keyring
      - name: rook-config-override
        configMap:
          name: rook-config-override
          defaultMode: 420
          items:
          - key: config
            mode: 292
            path: ceph.conf
      restartPolicy: OnFailure
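
(A hedged sketch of how such a job might be launched and inspected, assuming the placeholders above were substituted into a local file; the file name remove-osd-job.yaml is hypothetical:)

sed -e "s/OSD_ID_TO_REMOVE/1/g" \
    -e "s/ROOK_CLUSTER_NAMESPACE/openshift-storage/g" \
    -e "s|ROOK_IMAGE|<rook-ceph-image>|g" remove-osd-job.yaml | oc create -f -
oc -n openshift-storage logs -f job/remove-osd-1      # watch the ceph command output
oc -n openshift-storage delete job remove-osd-1       # clean up once it has completed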

Comment 25 Chris Blum 2020-04-17 09:36:37 UTC
The statement that a Job is only able to execute a single command is not true.

The "nicest" way to execute multiple commands is to create a bash script with a ConfigMap, then mount that inside of the Job-Pod and run it with bash. This gives you all the flexibility, you can in-line the bash script in your YAML manifest with the Job description and apply it all at once.

If you do not want to create a second object next to the Job, you can still execute multiple commands by chaining them with either '&&' or ';' (as sketched below).
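
For example, the Job's args could chain the two ceph commands in a single 'bash -c' invocation (a sketch; '<auth flags as above>' stands in for the --id/--mon-host/--keyring flags from the manifest in comment 24):

    ceph osd out osd.OSD_ID_TO_REMOVE <auth flags as above> && \
    ceph osd purge OSD_ID_TO_REMOVE --yes-i-really-mean-it <auth flags as above>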


This is not arguing whether 1) or 2) should be used, but I wanted to explain that you can run multiple commands in a single container ;)


PS: The advantage of using a Job vs. running a local bash script is that the Job is preserved with all the command output. This makes it easier to diagnose later whether something that went wrong was (or was not) caused by this maintenance operation.
PS2: I think the restartPolicy should be set to "Never"

Comment 26 Rohan CJ 2020-04-17 13:41:25 UTC
We could ship an Openshift Template that gets deployed with ocs-operator's initialization. Then the user just has to instantiate the template with osd-id as a parameter. That way, the process could be triggered with a single command.

Comment 27 Rohan CJ 2020-04-17 13:45:59 UTC
The template could contain a Job that runs a ConfigMap bash script.

Comment 28 Jose A. Rivera 2020-04-17 14:18:17 UTC
(In reply to Rohan CJ from comment #26)
> We could ship an Openshift Template that gets deployed with ocs-operator's
> initialization. Then the user just has to instantiate the template with
> osd-id as a parameter. That way, the process could be triggered with a
> single command.

Hmm... interesting. With this idea, OCS Initialization could create a Template for a Job, and if we take Chris' idea also create a ConfigMap with the required script. These are things that would be easy to implement in ocs-operator, and could be done early next week if we can converge on a design.

Comment 29 Travis Nielsen 2020-04-17 16:17:50 UTC
Updates have been made to Annette's doc so we can now avoid the toolbox. Servesha has also verified that these steps are working. The doc is currently using the approach of running the ceph commands in the operator. The PR is still going through final modifications before handing off to QE, but is certainly open to comments.
https://github.com/red-hat-storage/ocs-training/pull/155/files

Chris, agreed there are smarter ways to do things with the job. 

My argument for 4.4 would be to keep it simple and go with the document as written. It really doesn't feel like we would be simplifying things nearly enough to justify any code changes for 4.4. Let's save engineering efforts to improve this in 4.5. The idea of using the template for running a job is a very good one that we will be able to use for the 4.5 solution.

Comment 30 Travis Nielsen 2020-04-17 20:11:13 UTC
After discussing more with Jose, Annette, & JC we would recommend going ahead with the template solution for 4.4.
- The ugly oc commands with all the ceph parameters become completely hidden from the user, thus largely achieving our goal now to hide ceph commands for 4.4 instead of waiting for 4.5.
- The admin doesn't have to deal with any yaml directly. The template is created under the covers by the OCS operator. 
- The only detail of Ceph the admin will ever see is that they need to find the OSD ID
- This is not throwaway work. In 4.5 we will anyway need a template to start the rook job that does the more complete work to remove the OSD. The template will need to be updated, but we can re-use all the work.

Servesha, let's discuss in more detail what is needed to create the template for 4.4.

Comment 31 Yaniv Kaul 2020-04-19 11:48:11 UTC
(In reply to Travis Nielsen from comment #30)
> After discussing more with Jose, Annette, & JC we would recommend going
> ahead with the template solution for 4.4.
> - The ugly oc commands with all the ceph parameters become completely hidden
> from the user, thus largely achieving our goal now to hide ceph commands for
> 4.4 instead of waiting for 4.5.
> - The admin doesn't have to deal with any yaml directly. The template is
> created under the covers by the OCS operator. 
> - The only detail of Ceph the admin will ever see is that they need to find
> the OSD ID

How will the admin find the ID?

Comment 32 Servesha 2020-04-20 08:36:52 UTC
@Travis sure. Ack!

Comment 33 Chris Blum 2020-04-20 09:17:24 UTC
Created attachment 1680215 [details]
OSD failed Prometheus alert

Firing alert has OSD ID in its labels

Comment 34 Chris Blum 2020-04-20 09:18:05 UTC
> How will the admin find the ID?

I added a screenshot of the alert that is firing when an OSD is down... as you can see, the labels include the OSD ID.
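
(A hedged CLI cross-check in case the alert is not handy: Rook's OSD pods and deployments carry the ID in their names and in a ceph-osd-id label, so something like the following should show which OSD is down; the label name is an assumption here:)

    oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide          # the failing pod's name ends in the OSD ID
    oc -n openshift-storage get deployments -l app=rook-ceph-osd --show-labels | grep ceph-osd-id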

Comment 35 Yaniv Kaul 2020-04-20 09:28:01 UTC
(In reply to Chris Blum from comment #34)
> > How will the admin find the ID?
> 
> I added a screenshot of the alert that is firing when an OSD is down... as
> you can see, the labels include the OSD ID.

Thanks - looks reasonable enough for 4.4.

Comment 39 Servesha 2020-04-21 11:17:40 UTC
Hello, 

@Travis, I tested the example job template[1] manually to purge the failed osd. It works! I can make the respective changes in ocs-operator so the fix can be merged. 


Sample output:
'servesha$ oc logs -f rook-ceph-toolbox-job-0-b6b7g  -n openshift-storage
marked out osd.0. 
purged osd.0'

[1] https://github.com/travisn/rook/blob/toolbox-job-purge/cluster/examples/kubernetes/ceph/toolbox-job.yaml

Comment 41 Michael Adam 2020-04-22 16:10:33 UTC
I am a bit confused where we are with this BZ:

(1) we have an issue in rook with a design description: https://github.com/rook/rook/issues/5258
(2) we have a WIP PR in ocs operator with a template for the job: https://github.com/openshift/ocs-operator/pull/481
(3) do we also need a patch in rook?

@Servesha?

Comment 42 Michael Adam 2020-04-22 16:12:28 UTC
@Travis, maybe you can provide clarity as well.

Comment 43 Travis Nielsen 2020-04-22 17:04:53 UTC
(In reply to Michael Adam from comment #41)
> I am a bit confused where we are with this BZ:
> 
> (1) we have an issue in rook with a design description:
> https://github.com/rook/rook/issues/5258

This is the design for 4.5 where we will provide more functionality in a rook job to fully cleanup an OSD. This will be a separate work item from this BZ.

> (2) we have a WIP PR in ocs operator with a template for the job:
> https://github.com/openshift/ocs-operator/pull/481

Correct, we need to merge this. Servesha, Ashish, and I just discussed it and are hoping to have it merged tomorrow.

> (3) do we also need a patch in rook?

Yes, a small change was needed in rook to support this. It has been merged to the rook release-4.4 branch:
https://github.com/openshift/rook/pull/44

4) Once the OCS PR is merged we can make the final updates to the docs, which is drafted here:
https://github.com/red-hat-storage/ocs-training/pull/155/files

Comment 44 Servesha 2020-04-22 17:13:03 UTC
Opened the PR based on the proposal. The ocs operator will reconcile the template and make sure it is present. Then, in case of an osd failure, the admin can take control of deciding whether or not to remove the osd. The admin can use the `oc process` command with the failed osd.id as a parameter to the template. Looking forward to pushing all the changes by tomorrow. (A rough usage sketch follows below.)

Thanks
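
For illustration, a hedged sketch of how the admin might instantiate such a template once it ships; the template name (ocs-osd-removal), parameter name (FAILED_OSD_ID), and job name used here are assumptions, not confirmed names:

    failed_osd_id=0      # taken from the alert labels / osd pod name
    oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${failed_osd_id} | oc create -f -
    oc -n openshift-storage logs -f job/ocs-osd-removal-job     # watch the out/purge output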

Comment 45 Travis Nielsen 2020-04-22 17:20:17 UTC
Moving back to Assigned since the bot was too fast to make the change before the remaining work items are done.

Comment 46 Michael Adam 2020-04-23 00:14:57 UTC
Thanks Travis, your explanation makes the picture perfectly clear!

Servesha, thanks for the details! Let's try to move this patch forward tomorrow, as it was already expected today for the RC... :-)

Comment 47 Jose A. Rivera 2020-04-23 21:07:05 UTC
master PR is merged, backport PR created: https://github.com/openshift/ocs-operator/pull/484

Comment 48 Jose A. Rivera 2020-04-23 21:20:09 UTC
Backport PR is merged!

Comment 49 Neha Berry 2020-04-24 09:56:16 UTC
@michael @anjana please provide input on comment#40

Also, we now need a draft of the complete procedure for Disk Replacement in LSO, including the steps to incorporate the fix which this BZ provides.

 Some queries:
--------------------

1. Do we need a separate BZ for Documentation to track the Disk Replacement procedure? If yes, we need to raise it ASAP.

2. With the fix for this BZ, do we expect any change in code? If yes, do we need to use the OCS 4.4-rc2 build (not yet available) and beyond for Disk Replacement tests?


Currently, we have tried disk replacement using the draft doc shared by UAT, hence please confirm the official process.

Comment 50 Bipin Kunal 2020-04-24 10:15:35 UTC
(In reply to Servesha from comment #39)
> Hello, 
> 
> @Travis, I tested the example job template[1] manually to purge the failed
> osd. It works! I can make respective changes in ocs-operator to merge a fix. 
> 
> 
> Sample output:
> 'servesha$ oc logs -f rook-ceph-toolbox-job-0-b6b7g  -n openshift-storage
> marked out osd.0. 
> purged osd.0'
> 
> [1]
> https://github.com/travisn/rook/blob/toolbox-job-purge/cluster/examples/
> kubernetes/ceph/toolbox-job.yaml

Can you pass the correct link? I am not able to access this yaml file.

Comment 52 Michael Adam 2020-04-24 16:23:37 UTC
4.4.0-414.ci contains the fix; this is 4.4.0-rc2.

(In reply to Neha Berry from comment #49)
> @michael @anjana please provide an input on comment#40
> 
> Also, we now need a draft of the complete procedure for Disk Replacement in
> LSO including the steps to incorporate the fix which this BZ provides
> 
>  Some queries:
> --------------------
> 
> 1. Do we need a separate BZ for Documentation to track Disk Replacement
> procedure? If yes, we need to raise it ASAP.

Per Kusuma's reply, yes, and this exists.

 
> 2. With the fix for this BZ, do we expect any change in code? If yes, do we
> need to use OCS 4.4-rc2 build(not yet available) and beyond for Disk
> Replacement tests ? 

Yes. rc2 is now available with this fix.

> Currently, we have tried disk replacement using the draft doc shared by UAT
> , hence please confirm the official process.

Comment 53 Servesha 2020-04-27 09:32:48 UTC
(In reply to Bipin from comment #50)

> Can you pass the correct link? I am not able to access this yaml file.

Sure. I copied the contents to a Pastebin[1] since the link is not working.

[1] http://pastebin.test.redhat.com/859192

Comment 55 Neha Berry 2020-04-30 13:13:38 UTC
@pratik it is better to verify the fix once https://bugzilla.redhat.com/show_bug.cgi?id=1827978 is also ON_QA and a new RC build is available with the --force option for purging OSDs.

Comment 64 errata-xmlrpc 2020-06-04 12:54:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2393

Comment 65 Red Hat Bugzilla 2023-09-14 05:55:06 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

