What exactly needs to be checked here in this bug? Whether the documentation is clear? Do we need to verify that the steps work?
When trying to follow the steps in this doc https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-failed-storage-devices-on-vmware-and-bare-metal-infrastructures_rhocs I ran into an unexpected issue. Here are the steps I performed:

1. Go to the node 'compute-1' console and execute the "chroot /host" command when the prompt appears.
2. To simulate a disk failure as described in this doc https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/operations_guide/handling-a-disk-failure#simulating-a-disk-failure-ops, run the command:
$ echo 1 > /sys/block/sdb/device/delete

The warning appears as expected, but when I tried to navigate to the disks page, the 'sdb' disk had disappeared from the list. I added a screenshot. Two weeks ago, when performing these same steps, I did not hit this issue, so it may be a regression.
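For reference, here is the full sequence on the node as a minimal sketch (assuming node access via 'oc debug'; the node name compute-1 and device sdb are specific to my environment):

* Open a debug shell on the node and enter the host namespace
$ oc debug node/compute-1
$ chroot /host
* Ask the kernel to drop /dev/sdb to simulate the disk failure
$ echo 1 > /sys/block/sdb/device/delete
* Confirm the device is no longer visible to the node
$ lsblk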
Created attachment 1729598 [details] /dev/sdb disk does not appear in the list
As a continuation of comment https://bugzilla.redhat.com/show_bug.cgi?id=1881896#c10, here is additional information about the cluster I used:

Cluster conf: vSphere, LSO, OCP 4.6, OCS 4.6

OCP version:
Client Version: 4.3.8
Server Version: 4.6.0-0.nightly-2020-11-07-035509
Kubernetes Version: v1.19.0+9f84db3

OCS version:
ocs-operator.v4.6.0-156.ci   OpenShift Container Storage   4.6.0-156.ci   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-11-07-035509   True        False         3d4h    Cluster version is 4.6.0-0.nightly-2020-11-07-035509

Rook version:
rook: 4.6-73.15d47331.release_4.6
go: go1.15.2

Ceph version:
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)
Created attachment 1730149 [details] compute-1 disks
I added an attachment "compute-1 disks". Notice there is a new disk, "/dev/sdc", which I added after the failure of "/dev/sdb". Here is the output of the command:

# lsblk --bytes --pairs --output "NAME,ROTA,TYPE,SIZE,MODEL,VENDOR,RO,RM,STATE,FSTYPE,SERIAL,KNAME,PARTLABEL"
NAME="loop1" ROTA="1" TYPE="loop" SIZE="107374182400" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="" SERIAL="" KNAME="loop1" PARTLABEL=""
NAME="loop2" ROTA="1" TYPE="loop" SIZE="107374182400" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="" SERIAL="" KNAME="loop2" PARTLABEL=""
NAME="sda" ROTA="1" TYPE="disk" SIZE="128849018880" MODEL="Virtual disk " VENDOR="VMware " RO="0" RM="0" STATE="running" FSTYPE="" SERIAL="6000c2973b7830f56289cd807571abc6" KNAME="sda" PARTLABEL=""
NAME="sda1" ROTA="1" TYPE="part" SIZE="402653184" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="ext4" SERIAL="" KNAME="sda1" PARTLABEL="boot"
NAME="sda2" ROTA="1" TYPE="part" SIZE="133169152" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="vfat" SERIAL="" KNAME="sda2" PARTLABEL="EFI-SYSTEM"
NAME="sda3" ROTA="1" TYPE="part" SIZE="1048576" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="" SERIAL="" KNAME="sda3" PARTLABEL="BIOS-BOOT"
NAME="sda4" ROTA="1" TYPE="part" SIZE="128311082496" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="crypto_LUKS" SERIAL="" KNAME="sda4" PARTLABEL="luks_root"
NAME="sdc" ROTA="1" TYPE="disk" SIZE="107374182400" MODEL="Virtual disk " VENDOR="VMware " RO="0" RM="0" STATE="running" FSTYPE="" SERIAL="6000c29da0de41582e7c827948d6c337" KNAME="sdc" PARTLABEL=""
NAME="coreos-luks-root-nocrypt" ROTA="1" TYPE="dm" SIZE="128294305280" MODEL="" VENDOR="" RO="0" RM="0" STATE="running" FSTYPE="xfs" SERIAL="" KNAME="dm-0" PARTLABEL=""
(In reply to Itzhak from comment #17)
> I added an attachment "compute-1 disks". Notice there is a new disk,
> "/dev/sdc", which I added after the failure of "/dev/sdb".

If the disk "/dev/sdb" was removed, the removal triggers an update of the discovery results. The removed disk ("/dev/sdb") no longer shows up in the discovery results, and hence it does not appear under the `disks` tab (screenshot attached in comment 11). So comment 10 describes expected behavior: if a disk is deleted/removed, it won't show up under the `disks` tab in the UI.

Closing NeedInfo.
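To double-check this from the CLI, the discovery results can be inspected directly (a sketch only; it assumes the Local Storage Operator's LocalVolumeDiscoveryResult resources live in the openshift-local-storage namespace):

* List the per-node discovery results
$ oc get localvolumediscoveryresults -n openshift-local-storage
* Dump the results and look at the discovered devices for compute-1; the removed /dev/sdb should no longer be listed
$ oc get localvolumediscoveryresults -n openshift-local-storage -o yaml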
(In reply to Santosh Pillai from comment #18)
> If the disk "/dev/sdb" was removed, the removal triggers an update of the
> discovery results. The removed disk ("/dev/sdb") no longer shows up in the
> discovery results, and hence it does not appear under the `disks` tab
> (screenshot attached in comment 11). So comment 10 describes expected
> behavior: if a disk is deleted/removed, it won't show up under the `disks`
> tab in the UI.
>
> Closing NeedInfo.

If the disk is removed, how do we ensure the failed OSD is also removed? We can document running the removal job from the CLI in such cases.
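For illustration, running the removal from the CLI might look like the following (a sketch only; the ocs-osd-removal template, its FAILED_OSD_ID parameter, and the example OSD ID 0 are assumptions based on the OCS 4.6 device replacement procedure, not verified on this cluster):

* Scale down the deployment of the failed OSD (0 is an example ID)
$ osd_id_to_remove=0
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
* Run the OSD removal job from the template
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
* Verify that the removal job completed
$ oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage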
We need to fix the device replacement doc: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-operational-or-failed-storage-devices-on-vmware-user-provisioned-infrastructure_rhocs

In section 2.1, skip step 6 (delete PVC) and step 8 (add device), and add the following information to step 7 (dm-crypt deletion):

If the above command gets stuck due to insufficient privileges, run the following commands:

* Press CTRL+Z to suspend the command.
* Check the status of the command.
* Find the PID of the dmcrypt process
$ ps -ef | grep crypt
* Kill the process ID
$ kill -9 <PID>
* Verify that the device name is removed
$ dmsetup ls
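For context, the command in step 7 that can get stuck is the dm-crypt removal for the replaced device. A sketch of the full flow (the cryptsetup invocation and the mapper name ocs-deviceset-example are illustrative, not taken from this bug):

* Step 7 closes the dm-crypt mapping of the replaced device, e.g.
$ cryptsetup luksClose --debug --verbose ocs-deviceset-example
* If it hangs, suspend it with CTRL+Z and clean up as described above
$ ps -ef | grep crypt
$ kill -9 <PID>
$ dmsetup ls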
OCS 4.6.0 GA completed on 17 December 2020