Bug 1881896

Summary: [Doc RFE] Document simplified workflow on how to replace a failed local storage device
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Anjana Suparna Sriram <asriram>
Component: documentation
Assignee: Kusuma <kbg>
Status: CLOSED CURRENTRELEASE
QA Contact: Pratik Surve <prsurve>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.6
CC: afrahman, ebenahar, ikave, kbg, nberry, ocs-bugs, olakra, oviner, sabose, sapillai
Target Milestone: ---
Keywords: Documentation, FutureFeature
Target Release: OCS 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-12-18 11:53:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1880905
Attachments:
- /dev/sdb disk not appear in the list (no flags)
- compute-1 disks (no flags)

Comment 8 Itzhak 2020-10-21 12:29:57 UTC
What exactly needs to be checked here in this bug? Whether the documents are clear?
Do we need to check that the steps work?

Comment 10 Itzhak 2020-11-15 18:17:50 UTC
When trying to follow the steps in this doc https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-failed-storage-devices-on-vmware-and-bare-metal-infrastructures_rhocs, I ran into an unexpected issue.

Here are the steps I performed:

1. Go to the 'compute-1' node console and execute the "chroot /host" command when the prompt appears.

2. To simulate a disk failure as described in this doc https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/operations_guide/handling-a-disk-failure#simulating-a-disk-failure-ops
I ran the command:
$ echo 1 > /sys/block/sdb/device/delete

The warning appeared as expected, but when I navigated to the disks page,
the 'sdb' disk had disappeared from the list.
I added a screenshot.

Also, when I performed these steps 2 weeks ago I didn't hit this issue, so it may be a regression.
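For reference, a SCSI device removed this way can usually be brought back without a reboot by rescanning the SCSI hosts. This is general Linux sysfs behavior rather than anything taken from the OCS doc, so treat it as a hedged sketch (run as root on the node; host numbers vary per system):

```shell
# Undo the simulated failure: ask every SCSI host to rescan its bus.
# The "- - -" wildcard means "all channels, all targets, all LUNs".
for scan in /sys/class/scsi_host/host*/scan; do
    echo "- - -" > "$scan"
done

# The device should reappear, possibly under a new name (e.g. /dev/sdc).
lsblk
```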

Comment 11 Itzhak 2020-11-15 18:19:27 UTC
Created attachment 1729598 [details]
/dev/sdb disk not appear in the list

Comment 12 Itzhak 2020-11-16 09:23:35 UTC
As a continuation of comment https://bugzilla.redhat.com/show_bug.cgi?id=1881896#c10, here is additional information
about the cluster I used:

Cluster conf: vSphere, LSO, OCP 4.6, OCS 4.6.

Versions:

OCP version:
Client Version: 4.3.8
Server Version: 4.6.0-0.nightly-2020-11-07-035509
Kubernetes Version: v1.19.0+9f84db3

OCS version:
ocs-operator.v4.6.0-156.ci   OpenShift Container Storage   4.6.0-156.ci              Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-11-07-035509   True        False         3d4h    Cluster version is 4.6.0-0.nightly-2020-11-07-035509

Rook version
rook: 4.6-73.15d47331.release_4.6
go: go1.15.2

Ceph version
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)

Comment 16 Itzhak 2020-11-17 14:23:32 UTC
Created attachment 1730149 [details]
compute-1 disks

Comment 17 Itzhak 2020-11-17 14:27:22 UTC
I added an attachment "compute-1 disks". Notice there is a new disk I added "/dev/sdc" after the failure of "/dev/sdb".

And here is the output of the command:
# lsblk --bytes --pairs --output "NAME,ROTA,TYPE,SIZE,MODEL,VENDOR,RO,RM,STATE,FSTYPE,SERIAL,KNAME,PARTLABEL"

NAME="loop1" ROTA="1" TYPE="loop" SIZE="107374182400" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="" SERIAL="" KNAME="loop1" PARTLABEL=""
NAME="loop2" ROTA="1" TYPE="loop" SIZE="107374182400" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="" SERIAL="" KNAME="loop2" PARTLABEL=""
NAME="sda" ROTA="1" TYPE="disk" SIZE="128849018880" MODEL="Virtual disk    " VENDOR="VMware  " RO="0" RM="0" STATE="running" FSTYPE="" SERIAL="6000c2973b7830f56289cd807571abc6" KNAME="sda" PARTLABEL=""
NAME="sda1" ROTA="1" TYPE="part" SIZE="402653184" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="ext4" SERIAL="" KNAME="sda1" PARTLABEL="boot"
NAME="sda2" ROTA="1" TYPE="part" SIZE="133169152" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="vfat" SERIAL="" KNAME="sda2" PARTLABEL="EFI-SYSTEM"
NAME="sda3" ROTA="1" TYPE="part" SIZE="1048576" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="" SERIAL="" KNAME="sda3" PARTLABEL="BIOS-BOOT"
NAME="sda4" ROTA="1" TYPE="part" SIZE="128311082496" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="crypto_LUKS" SERIAL="" KNAME="sda4" PARTLABEL="luks_root"
NAME="sdc" ROTA="1" TYPE="disk" SIZE="107374182400" MODEL="Virtual disk    " VENDOR="VMware  " RO="0" RM="0" STATE="running" FSTYPE="" SERIAL="6000c29da0de41582e7c827948d6c337" KNAME="sdc" PARTLABEL=""
NAME="coreos-luks-root-nocrypt" ROTA="1" TYPE="dm" SIZE="128294305280" MODEL="" VENDOR="" RO="0" RM="0" STATE="running" FSTYPE="xfs" SERIAL="" KNAME="dm-0" PARTLABEL=""
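The `--pairs` output is easy to post-process. As a quick sketch (assuming the column order NAME,TYPE,SIZE,FSTYPE), the whole-disk entries can be pulled out like this, which confirms which disks the node still reports:

```shell
# List only TYPE="disk" entries (name and byte size), skipping loop
# devices and partitions. Field numbers come from splitting on '"':
# $2 = NAME value, $4 = TYPE value, $6 = SIZE value.
lsblk --bytes --pairs --output "NAME,TYPE,SIZE,FSTYPE" 2>/dev/null |
  awk -F'"' '$4 == "disk" {print $2, $6}'
```

Against the listing above, this would print only `sda` and `sdc` with their sizes; `sdb` is gone.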

Comment 18 Santosh Pillai 2020-11-18 06:48:19 UTC
(In reply to Itzhak from comment #17)
> I added an attachment "compute-1 disks". Notice there is a new disk I added
> "/dev/sdc" after the failure of "/dev/sdb".


If the disk "/dev/sdb" was removed, that triggers the discovery results to be updated. The removed disk ("/dev/sdb") won't show up in the discovery results anymore, and hence won't appear under the `disks` tab (screenshot attached in comment 11). So comment 10 describes expected behavior: if a disk is deleted/removed, it won't show up under the `disks` tab in the UI.

Closing NeedInfo.
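The discovery results mentioned here can also be checked from the CLI. A hedged sketch, assuming the local storage operator's `LocalVolumeDiscoveryResult` objects live in the `openshift-local-storage` namespace (object names are illustrative; copy the real ones from the first command's output):

```shell
# Each node with discovery enabled gets one LocalVolumeDiscoveryResult
# object; its status lists the devices the operator currently sees.
oc get localvolumediscoveryresults -n openshift-local-storage

# Inspect the per-device details reported for compute-1.
oc get localvolumediscoveryresults discovery-result-compute-1 \
    -n openshift-local-storage -o yaml
```

After the `echo 1 > /sys/block/sdb/device/delete`, sdb should no longer appear in that status, matching what the UI `disks` tab shows.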

Comment 19 Sahina Bose 2020-11-20 15:19:39 UTC
(In reply to Santosh Pillai from comment #18)
> (In reply to Itzhak from comment #17)
> > I added an attachment "compute-1 disks". Notice there is a new disk I added
> > "/dev/sdc" after the failure of "/dev/sdb".
> 
> 
> if the disk "/dev/sdb" was removed, then it would trigger the discover
> results and update it. The removed disk ("/dev/sdb") won't show up in the
> discovery results anymore and hence the disk won't appear under the `disks`
> tab (screenshot attached in comment 11). So comment 10 is an expected
> behavior. That is, if a disk is deleted/removed it won't show up under
> `disks` tab in the UI. 
> 
> Closing NeedInfo.

If the disk is removed, how do we ensure the failed OSD is removed? We can document to run the job from CLI in such cases.
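On the CLI job: in OCS 4.6 the documented way to remove a failed OSD is the `ocs-osd-removal` template in the `openshift-storage` namespace. A hedged sketch (the failed OSD ID must be looked up first, e.g. from `ceph osd tree`; `0` below is a placeholder):

```shell
# Create the OSD removal job from the template, passing the failed
# OSD's ID.
oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=0 | oc create -f -

# Watch the job's pod until it completes.
oc get pod -l job-name=ocs-osd-removal-0 -n openshift-storage
```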

Comment 28 Oded 2020-12-08 11:31:36 UTC
We need to fix the device replacement doc.
https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-operational-or-failed-storage-devices-on-vmware-user-provisioned-infrastructure_rhocs

In section 2.1, skip step 6 (delete_pvc) and step 8 (add device).

Add the relevant information to step 7 (dm-crypt deletion):


If the above command gets stuck due to insufficient privileges, run the following commands:

* Press CTRL+Z to suspend the stuck command.

* Find the PID of the dmcrypt process:
$ ps -ef | grep crypt

* Kill the process ID:
$ kill -9 <PID>

* Verify that the device name has been removed:
$ dmsetup ls

Comment 34 Rejy M Cyriac 2020-12-18 11:53:49 UTC
OCS 4.6.0 GA completed on 17 December 2020