Bug 2042694 - [ODF Scale Up] Provide a method to rescan for new NVMe disks on deployed OCS/ODF nodes
Summary: [ODF Scale Up] Provide a method to rescan for new NVMe disks on deployed OCS/...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Mudit Agarwal
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-19 23:20 UTC by Alberto Rivera Laporte
Modified: 2023-12-08 04:27 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-09 13:54:43 UTC
Embargoed:




Links
Red Hat Knowledge Base (Solution) 5317381 (last updated 2022-01-19 23:20:20 UTC)

Description Alberto Rivera Laporte 2022-01-19 23:20:21 UTC
Description of problem: 

Additional NVMe disks are not visible in CoreOS (COS) until the ODF node is rebooted.



Additional info:

When an additional NVMe disk is added to a deployed ODF storage node as part of a scale-up procedure using the Local Storage Operator [0], the new disk does not appear in the list of block devices until the node is rebooted. This presents the following challenges:

1. The disk discovery feature does not detect newly added NVMe devices until the host is rebooted. 

2. Rebooting an ODF node can be disruptive in a production environment unless the storage node is properly cordoned and drained beforehand.

COS does have the rescan-scsi-bus.sh script from the sg3_utils RPM; however, it does not work for NVMe devices [1], so we're opening this BZ to see if there is an undocumented way to rescan for additional NVMe disks on a running ODF COS node without requiring a reboot, or if we should be pursuing an RFE to add the "nvme-cli" utility to COS.
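As a hedged sketch of what a rescan without nvme-cli might look like: the upstream Linux NVMe host driver exposes a write-only `rescan_controller` attribute for each controller under /sys/class/nvme/, and the PCI core exposes /sys/bus/pci/rescan. The function below simply writes "1" to those attributes. It assumes root access to the host's /sys (for example from a node debug shell after `chroot /host`) and a Python interpreter, which RHCOS may not ship. Whether either attribute makes the new vSphere disk visible on RHCOS 4.8 has not been verified in this bug, and the name `rescan_nvme` is hypothetical.

```python
from pathlib import Path
from typing import List


def rescan_nvme(sysfs_root: str = "/sys", pci_rescan: bool = False) -> List[str]:
    """Ask the kernel to rescan NVMe controllers; return the controllers touched.

    Hypothetical helper. Assumes the running kernel exposes the upstream
    per-controller `rescan_controller` attribute and, optionally, the PCI bus
    `rescan` attribute. Writing to the real /sys requires root on the node.
    """
    root = Path(sysfs_root)
    if pci_rescan:
        # Re-enumerate the PCI bus so that a hot-added NVMe *controller* can appear.
        (root / "bus" / "pci" / "rescan").write_text("1")
    touched = []
    # Ask each existing controller to re-read its namespace list; a disk added
    # to an existing virtual NVMe controller would show up as a new namespace.
    for attr in sorted((root / "class" / "nvme").glob("nvme*/rescan_controller")):
        attr.write_text("1")
        touched.append(attr.parent.name)
    return touched
```

The controller rescan is the relevant step for a disk attached to an existing virtual NVMe controller; `pci_rescan=True` would only matter if vSphere presents the disk on a new controller. Running lsblk afterwards shows whether the attempt had any effect.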



Version-Release number of selected component (if applicable):

OCP/OCS/ODF 4.8 
---
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="48.84.202112212304-0"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 48.84.202112212304-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.8"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.8"
OPENSHIFT_VERSION="4.8"
RHEL_VERSION="8.4"
OSTREE_VERSION='48.84.202112212304-0'
---

Infrastructure: VMware vSphere

How reproducible: Always


Steps to Reproduce:

1. Attach an NVMe device to an existing ODF node in vSphere
2. oc debug node/ or ssh to the node
3. Run lsblk
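Step 3 can also be scripted so the result is easy to compare before and after attaching the disk. This is a minimal sketch, assuming Python is available and that /sys is the host's sysfs; the helper name `nvme_block_devices` is hypothetical. It reads /sys/block, which lists whole disks only (partitions live in subdirectories), and each device's `size` attribute, which the kernel always reports in 512-byte sectors.

```python
from pathlib import Path
from typing import Dict


def nvme_block_devices(sysfs_root: str = "/sys") -> Dict[str, int]:
    """Return {device name: size in bytes} for NVMe block devices in sysfs.

    Hypothetical helper; roughly the NVMe subset of what lsblk would show.
    """
    devices = {}
    block = Path(sysfs_root) / "block"
    if not block.is_dir():
        return devices
    for dev in sorted(block.iterdir()):
        # NVMe namespaces are named nvme<ctrl>n<ns>; skip sd*, dm-*, loop*, etc.
        if not dev.name.startswith("nvme"):
            continue
        # The sysfs "size" attribute is in 512-byte sectors.
        sectors = int((dev / "size").read_text().strip())
        devices[dev.name] = sectors * 512
    return devices
```

On an affected node, the newly attached disk would be missing from the returned dictionary until the node is rebooted, matching the lsblk output described below.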

Actual results:

The NVMe device does not appear in the OS block device list until the VM is rebooted.


Expected results:

The added NVMe device appears in the OS without requiring a reboot.




[0]
https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/scaling_storage/scaling-up-storage-capacity_rhocs#scaling-up-storage-by-adding-capacity-to-your-openshift-container-storage-nodes-using-local-storage-devices_rhocs

[1]
https://access.redhat.com/solutions/5317381

Comment 5 Jose A. Rivera 2022-06-21 14:24:16 UTC
This does not seem like a bug, at least not on our end. That said, based on this line:

> COS does have the rescan-scsi-bus.sh script from the sg3_utils RPM; however, it does not work for NVMe devices[1], so we're opening this BZ to see if there is an undocumented way to rescan for additional NVMe disks on a running ODF COS node without requiring a reboot, or if we should be pursuing an RFE to add the "nvme-cli" utility to COS.

The answer to this would be "no". An RFE would make the most sense. If there is a regression or known issue, it would probably be in RHCOS itself or the LSO.

I don't know which component would be the appropriate target, so moving it out to ODF 4.12 now.

Comment 6 Darren Carpenter 2022-08-11 20:09:04 UTC
Hi All,

Just checking in to see if anyone else has had a chance to take a look at this and what the status is.

Comment 18 Nitin Goyal 2023-03-09 13:54:43 UTC
As Jose mentioned earlier, this should come from COS or LSO. We do not maintain scripts that rescan devices. I am closing this BZ. Please create a Jira issue against LSO or COS.

Comment 19 Red Hat Bugzilla 2023-12-08 04:27:22 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days


Note You need to log in before you can comment on or make changes to this bug.