Bug 1825911 - [Docs] [LSO] Document disk replacement procedure, which includes the fix for BZ 1821219 (OSD is not being removed upon disk failure)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: documentation
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.4.0
Assignee: Kusuma
QA Contact: Sidhant Agrawal
URL:
Whiteboard:
Depends On:
Blocks: 1826040
Reported: 2020-04-20 13:21 UTC by Raz Tamir
Modified: 2023-09-14 05:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-08 06:06:20 UTC
Embargoed:


Description Raz Tamir 2020-04-20 13:21:25 UTC
Description of problem (please be as detailed as possible and provide log snippets):
Documentation for bug https://bugzilla.redhat.com/show_bug.cgi?id=1821219 is needed and can be found here:
https://github.com/red-hat-storage/ocs-training/pull/155/files

Comment 3 Travis Nielsen 2020-04-24 17:29:23 UTC
Note that a change was just made to this doc for the new steps with "oc process" instead of running the ceph commands in the ocs operator. There is a separate commit in that PR to see what changed.
https://github.com/red-hat-storage/ocs-training/pull/155/files

Annette, can you comment here if any other changes are needed or if they are working for you? Thanks

Comment 4 Neha Berry 2020-04-24 17:39:19 UTC
@anjana @raz this BZ is a must-include for OCS 4.4 but still doesn't have the acks. Could you please provide them?

Comment 5 Jean-Charles Lopez 2020-04-24 23:34:13 UTC
Hi everyone,

I tested the new template.

Scenario 1 - 3 OSD in the cluster, 1 per failure domain - AWS based

When executing the template, this is what I get:
oc logs rook-ceph-toolbox-job-0-lntkk
marked out osd.0.
Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
You can proceed by passing --force, but be warned that this will likely mean real, permanent data loss.

Note that the job retries 6 times and fails to complete every time because the PGs are not active+clean.

Scenario 2 - 6 OSDs in the cluster, 2 per failure domain - AWS based
Note that I waited for the OSD to be marked out and for the cluster to rebalance. On my empty cluster that meant a wait of just over 10 minutes.

oc logs pod/rook-ceph-toolbox-job-0-rj5c5
osd.0 is already out.
purged osd.0

So the question is: do we want to add the --force option to the purge command for the following very specific scenarios:
- Single OSD per failure domain deployments
- Complete failure domain failure scenario whatever the number of OSDs deployed per failure domain

For the very specific 3 OSD deployment scenario, we could ship the template with a small refinement so that the only remaining special case is the complete failure domain failure.
e.g. ceph osd out osd.${FAILED_OSD_ID};num=$(ceph osd stat | cut -f1 -d' ');if (( num == 3 )); then forceopt='--force';else forceopt=''; fi;ceph osd purge osd.${FAILED_OSD_ID} $forceopt
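
The same refinement, split out so the decision can be checked without a cluster. This is only a sketch: the helper name purge_force_opt is hypothetical and not part of the template, and it assumes, as in the one-liner above, that an OSD count of exactly 3 means one OSD per failure domain.

```shell
#!/bin/sh
# Hypothetical helper (not part of the shipped template): print "--force"
# when the cluster has exactly 3 OSDs, otherwise print nothing.
# Assumption: 3 OSDs means one OSD per failure domain, so the PGs cannot
# return to active+clean before the failed OSD is purged.
purge_force_opt() {
    osd_count="$1"
    if [ "$osd_count" -eq 3 ]; then
        echo "--force"
    else
        echo ""
    fi
}

# Intended use inside the job (needs a Ceph cluster and FAILED_OSD_ID set,
# so it is left commented out here):
#   ceph osd out "osd.${FAILED_OSD_ID}"
#   num=$(ceph osd stat | cut -f1 -d' ')
#   ceph osd purge "osd.${FAILED_OSD_ID}" $(purge_force_opt "$num")
```

This does not cover the complete failure domain failure with more than one OSD per domain; that case would still need --force to be passed by hand.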

Let me know if you would like me to run any additional testing @travis

Comment 6 Travis Nielsen 2020-04-26 05:32:32 UTC
We really need the --force flag inside the job. Otherwise, we will have to document the ceph commands for the user to work around this issue. I've opened this BZ as a proposed blocker for 4.4.
https://bugzilla.redhat.com/show_bug.cgi?id=1827978

Comment 7 Raz Tamir 2020-04-26 05:37:02 UTC
Thanks Travis

Comment 13 Travis Nielsen 2020-04-30 18:48:19 UTC
With the merge of https://github.com/openshift/ocs-operator/pull/490, please update the docs with the name of the pod whose logs need to be checked to verify that the job succeeded.
https://github.com/red-hat-storage/ocs-training/pull/155/files#diff-f3d8f11d485ecb01b8aeb2702a073b06R39

Comment 29 Red Hat Bugzilla 2023-09-14 05:55:42 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.