Bug 2042694 - [ODF Scale Up] Provide a method to rescan for new NVMe disks on deployed OCS/ODF nodes [NEEDINFO]
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Mudit Agarwal
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-19 23:20 UTC by Alberto Rivera Laporte
Modified: 2023-08-09 17:00 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-09 13:54:43 UTC
Embargoed:
jrivera: needinfo? (ariveral)


Links
System                             ID       Private  Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)  5317381  0        None      None    None     2022-01-19 23:20:20 UTC

Description Alberto Rivera Laporte 2022-01-19 23:20:21 UTC
Description of problem: 

Additional NVMe disks are not visible in CoreOS (COS) until the ODF node is rebooted.



Additional info:

When an additional NVMe disk is added to a deployed ODF storage node as part of a scale-up procedure using the Local Storage Operator[0], the new disk does not appear in the list of block devices until the node has been rebooted. This presents the following challenges:

1. The disk discovery feature does not detect newly added NVMe devices until the host is rebooted. 

2. Rebooting an ODF node is disruptive in production environments unless the storage node is properly cordoned and drained before the reboot.
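
For reference, the cordon and drain sequence that point 2 refers to can be sketched as follows. This is only an illustration, not a procedure taken from this report: the helper name and the node name are hypothetical, and the drain flags shown are a common choice that may need adjusting for a given cluster. The function only builds the command lines and does not run anything.

```python
import shlex
from typing import List


def reboot_maintenance_commands(node: str) -> List[List[str]]:
    """Return the oc commands that bracket a node reboot, in order.

    Hypothetical helper for illustration; it only builds argument lists.
    """
    return [
        # Stop new pods from being scheduled onto the node.
        ["oc", "adm", "cordon", node],
        # Evict running pods; flag choice is an assumption, adjust as needed.
        ["oc", "adm", "drain", node,
         "--ignore-daemonsets", "--delete-emptydir-data"],
        # The reboot itself happens here, out of band, and the node must
        # report Ready again before it is made schedulable.
        ["oc", "adm", "uncordon", node],
    ]


if __name__ == "__main__":
    # "worker-0.example.com" is a placeholder node name.
    for cmd in reboot_maintenance_commands("worker-0.example.com"):
        print(shlex.join(cmd))
```

Skipping the first two commands is what makes the reboot disruptive: pods on the node, including storage pods, would be stopped without being evicted first.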

COS does include the rescan-scsi-bus.sh script from the sg3_utils RPM; however, that script does not work for NVMe devices[1]. We are opening this BZ to find out whether there is an undocumented way to rescan for additional NVMe disks on a running ODF COS node without requiring a reboot, or whether we should pursue an RFE to add the "nvme-cli" utility to COS.



Version-Release number of selected component (if applicable):

OCP/OCS/ODF 4.8 
---
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="48.84.202112212304-0"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 48.84.202112212304-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.8"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.8"
OPENSHIFT_VERSION="4.8"
RHEL_VERSION="8.4"
OSTREE_VERSION='48.84.202112212304-0'
---

Infrastructure: VMware vSphere

How reproducible: Always


Steps to Reproduce:

1. Attach an NVMe device to an existing ODF node in vSphere
2. Run oc debug node/<node-name>, or SSH to the node
3. Run lsblk
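
The check in step 3 can be made repeatable by capturing the output of "lsblk --json -o NAME,TYPE" before and after attaching the device and comparing the two snapshots. The sketch below assumes that JSON layout (a top-level "blockdevices" list with "name" and "type" keys); the function names and the sample snapshots are made up for illustration and are not output from the affected node.

```python
import json
from typing import Set


def disk_names(lsblk_json: str) -> Set[str]:
    """Return names of top-level 'disk' devices from `lsblk --json` output."""
    data = json.loads(lsblk_json)
    return {
        dev["name"]
        for dev in data.get("blockdevices", [])
        if dev.get("type") == "disk"
    }


def newly_visible_disks(before: str, after: str) -> Set[str]:
    """Return disks present in the 'after' snapshot but not in 'before'."""
    return disk_names(after) - disk_names(before)


# Made-up snapshots: the device names are illustrative assumptions.
BEFORE = json.dumps({"blockdevices": [
    {"name": "sda", "type": "disk"},
    {"name": "nvme0n1", "type": "disk"},
]})
AFTER_HOT_ADD = BEFORE  # behavior reported here: nothing new without a reboot
AFTER_REBOOT = json.dumps({"blockdevices": [
    {"name": "sda", "type": "disk"},
    {"name": "nvme0n1", "type": "disk"},
    {"name": "nvme0n2", "type": "disk"},
]})

if __name__ == "__main__":
    print(sorted(newly_visible_disks(BEFORE, AFTER_HOT_ADD)))
    print(sorted(newly_visible_disks(BEFORE, AFTER_REBOOT)))
```

An empty result after the hot add, followed by a non-empty result after the reboot, corresponds to the behavior described under "Actual results".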

Actual results:

NVMe device is not present in the OS block device list until the VM is rebooted.


Expected results:

The added NVMe device is visible to the OS without requiring a reboot.




[0]
https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/scaling_storage/scaling-up-storage-capacity_rhocs#scaling-up-storage-by-adding-capacity-to-your-openshift-container-storage-nodes-using-local-storage-devices_rhocs

[1]
https://access.redhat.com/solutions/5317381

Comment 5 Jose A. Rivera 2022-06-21 14:24:16 UTC
This does not seem like a bug, at least not on our end. That said, based on this line:

> COS does have the rescan-scsi-bus.sh script from sg3_utils RPM however that does not work for NVMe devices[1] so we're opening this BZ to see if there is an undocumented way to rescan for additional NVMe disks on a running ODF COS node without requiring a reboot or if we should be pursuing an RFE to add the "nvme-cli" utility to COS.

The answer to this would be "no". An RFE would make the most sense. If there is a regression or known issue, it would probably be in RHCOS itself or the LSO.

I don't know which component would be the appropriate target, so moving it out to ODF 4.12 now.

Comment 6 Darren Carpenter 2022-08-11 20:09:04 UTC
Hi All,

Just checking in to see if anyone else has had a chance to take a look at this and what the status is.

Comment 18 Nitin Goyal 2023-03-09 13:54:43 UTC
As Jose mentioned earlier, this should come from COS or LSO. We do not maintain scripts that rescan devices. I am closing this BZ. Please create a Jira issue with LSO or COS.

