What exactly needs to be checked here in this bug? Whether the documentation is clear? Do we need to verify that the steps work?
When trying to follow the steps in this doc https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-failed-storage-devices-on-vmware-and-bare-metal-infrastructures_rhocs I ran into an unexpected issue. Here are the steps I performed:

1. Go to the node 'compute-1' console and execute the "chroot /host" command when the prompt appears.
2. To simulate a disk failure as described in this doc https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/operations_guide/handling-a-disk-failure#simulating-a-disk-failure-ops, run the command:
$ echo 1 > /sys/block/sdb/device/delete

The warning appears as expected, but when I tried to navigate to the disks page, the 'sdb' disk had disappeared from the list. I added a screenshot. Two weeks ago, when performing these same steps, I did not hit this issue, so it may be a regression.
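For reference, here is the full sequence on the node as a minimal sketch (assuming node access via 'oc debug'; the node name compute-1 and device sdb are specific to my environment):

* Open a debug shell on the node and enter the host namespace
$ oc debug node/compute-1
$ chroot /host
* Ask the kernel to drop /dev/sdb to simulate the disk failure
$ echo 1 > /sys/block/sdb/device/delete
* Confirm the device is no longer visible to the node
$ lsblk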
Created attachment 1729598 [details] /dev/sdb disk does not appear in the list
As a continuation of comment https://bugzilla.redhat.com/show_bug.cgi?id=1881896#c10, here is additional information about the cluster I used:

Cluster conf: vSphere, LSO, OCP 4.6, OCS 4.6

OCP version:
Client Version: 4.3.8
Server Version: 4.6.0-0.nightly-2020-11-07-035509
Kubernetes Version: v1.19.0+9f84db3

OCS version:
ocs-operator.v4.6.0-156.ci   OpenShift Container Storage   4.6.0-156.ci   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-11-07-035509   True        False         3d4h    Cluster version is 4.6.0-0.nightly-2020-11-07-035509

Rook version:
rook: 4.6-73.15d47331.release_4.6
go: go1.15.2

Ceph version:
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)
Created attachment 1730149 [details] compute-1 disks
I added an attachment "compute-1 disks". Notice there is a new disk, "/dev/sdc", which I added after the failure of "/dev/sdb". Here is the output of the command:

# lsblk --bytes --pairs --output "NAME,ROTA,TYPE,SIZE,MODEL,VENDOR,RO,RM,STATE,FSTYPE,SERIAL,KNAME,PARTLABEL"
NAME="loop1" ROTA="1" TYPE="loop" SIZE="107374182400" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="" SERIAL="" KNAME="loop1" PARTLABEL=""
NAME="loop2" ROTA="1" TYPE="loop" SIZE="107374182400" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="" SERIAL="" KNAME="loop2" PARTLABEL=""
NAME="sda" ROTA="1" TYPE="disk" SIZE="128849018880" MODEL="Virtual disk " VENDOR="VMware " RO="0" RM="0" STATE="running" FSTYPE="" SERIAL="6000c2973b7830f56289cd807571abc6" KNAME="sda" PARTLABEL=""
NAME="sda1" ROTA="1" TYPE="part" SIZE="402653184" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="ext4" SERIAL="" KNAME="sda1" PARTLABEL="boot"
NAME="sda2" ROTA="1" TYPE="part" SIZE="133169152" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="vfat" SERIAL="" KNAME="sda2" PARTLABEL="EFI-SYSTEM"
NAME="sda3" ROTA="1" TYPE="part" SIZE="1048576" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="" SERIAL="" KNAME="sda3" PARTLABEL="BIOS-BOOT"
NAME="sda4" ROTA="1" TYPE="part" SIZE="128311082496" MODEL="" VENDOR="" RO="0" RM="0" STATE="" FSTYPE="crypto_LUKS" SERIAL="" KNAME="sda4" PARTLABEL="luks_root"
NAME="sdc" ROTA="1" TYPE="disk" SIZE="107374182400" MODEL="Virtual disk " VENDOR="VMware " RO="0" RM="0" STATE="running" FSTYPE="" SERIAL="6000c29da0de41582e7c827948d6c337" KNAME="sdc" PARTLABEL=""
NAME="coreos-luks-root-nocrypt" ROTA="1" TYPE="dm" SIZE="128294305280" MODEL="" VENDOR="" RO="0" RM="0" STATE="running" FSTYPE="xfs" SERIAL="" KNAME="dm-0" PARTLABEL=""
(In reply to Itzhak from comment #17)
> I added an attachment "compute-1 disks". Notice there is a new disk,
> "/dev/sdc", which I added after the failure of "/dev/sdb".

If the disk "/dev/sdb" was removed, the removal triggers an update of the discovery results. The removed disk ("/dev/sdb") no longer shows up in the discovery results, and hence it does not appear under the `disks` tab (screenshot attached in comment 11). So comment 10 describes expected behavior: if a disk is deleted/removed, it won't show up under the `disks` tab in the UI.

Closing NeedInfo.
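To double-check this from the CLI, the discovery results can be inspected directly (a sketch only; it assumes the Local Storage Operator's LocalVolumeDiscoveryResult resources live in the openshift-local-storage namespace):

* List the per-node discovery results
$ oc get localvolumediscoveryresults -n openshift-local-storage
* Dump the results and look at the discovered devices for compute-1; the removed /dev/sdb should no longer be listed
$ oc get localvolumediscoveryresults -n openshift-local-storage -o yaml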
(In reply to Santosh Pillai from comment #18)
> If the disk "/dev/sdb" was removed, the removal triggers an update of the
> discovery results. The removed disk ("/dev/sdb") no longer shows up in the
> discovery results, and hence it does not appear under the `disks` tab
> (screenshot attached in comment 11). So comment 10 describes expected
> behavior: if a disk is deleted/removed, it won't show up under the `disks`
> tab in the UI.
>
> Closing NeedInfo.

If the disk is removed, how do we ensure the failed OSD is also removed? We can document running the removal job from the CLI in such cases.
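For illustration, running the removal from the CLI might look like the following (a sketch only; the ocs-osd-removal template, its FAILED_OSD_ID parameter, and the example OSD ID 0 are assumptions based on the OCS 4.6 device replacement procedure, not verified on this cluster):

* Scale down the deployment of the failed OSD (0 is an example ID)
$ osd_id_to_remove=0
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
* Run the OSD removal job from the template
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
* Verify that the removal job completed
$ oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage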
We need to fix the device replacement doc: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-operational-or-failed-storage-devices-on-vmware-user-provisioned-infrastructure_rhocs

In section 2.1, skip step 6 (delete PVC) and step 8 (add device), and add the following information to step 7 (dm-crypt deletion):

If the above command gets stuck due to insufficient privileges, run the following commands:

* Press CTRL+Z to suspend the command.
* Check the status of the command.
* Find the PID of the dmcrypt process
$ ps -ef | grep crypt
* Kill the process ID
$ kill -9 <PID>
* Verify that the device name is removed
$ dmsetup ls
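For context, the command in step 7 that can get stuck is the dm-crypt removal for the replaced device. A sketch of the full flow (the cryptsetup invocation and the mapper name ocs-deviceset-example are illustrative, not taken from this bug):

* Step 7 closes the dm-crypt mapping of the replaced device, e.g.
$ cryptsetup luksClose --debug --verbose ocs-deviceset-example
* If it hangs, suspend it with CTRL+Z and clean up as described above
$ ps -ef | grep crypt
$ kill -9 <PID>
$ dmsetup ls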
OCS 4.6.0 GA completed on 17 December 2020