Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2115328

Summary: multipath device not cleared when power cycling node
Product: OpenShift Container Platform
Reporter: Javier Coscia <jcoscia>
Component: Storage
Assignee: Hemant Kumar <hekumar>
Storage sub component: Kubernetes External Components
QA Contact: Wei Duan <wduan>
Status: CLOSED WONTFIX
Docs Contact:
Severity: high
Priority: high
CC: apanagio, bzvonar, cgaynor, ealcaniz, hekumar, jdobson, jsafrane, midu, openshift-bugs-escalate
Version: 4.8
Target Milestone: ---
Target Release: 4.12.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-11-02 14:36:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version: ---
Embargoed:

Description Javier Coscia 2022-08-04 12:33:09 UTC
Description of problem:

- Customer is testing a power cycling scenario with OCP + HPE CSI + multipath.

- After a power cycle event on an OCP worker node with a multipath device attached (provided by HPE CSI) and in use by PODs, the POD cannot start when the node comes back until a SCSI rescan operation takes place on the node, followed by a multipath service restart.



Version-Release number of selected component (if applicable):

- OpenShift 4.8.45
- RHCOS 48.84.202206172122-0
- HPE CSI plugin version 1.3

How reproducible:

- Randomly, when abruptly power cycling OCP nodes

Steps to Reproduce:
1. Have a workload (POD) using PVs provided by an external CSI driver (HPE in this case), with multipath configured on the nodes. The node boots from SAN with multipath enabled as well.
2. Abruptly power cycle the node where the POD is running (and has the volume attached).
3. The node comes back, but the POD cannot mount the volume until a SCSI rescan happens (which clears devices that are unused and no longer present on the storage array side), followed by a multipath service restart.
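For reference, the manual recovery in step 3 can be sketched as the following commands run on the affected node. This is a hedged sketch, not the exact procedure from the case: `rescan-scsi-bus.sh` comes from the sg3_utils package, and the exact flags and service names may vary by RHCOS/RHEL version.

```shell
# Inspect the current multipath topology; failed/faulty paths point at
# stale devices left over from before the power cycle
multipath -ll

# Rescan the SCSI bus; -r also removes devices that are no longer
# present on the storage array side (sg3_utils)
rescan-scsi-bus.sh -r

# Restart the multipath daemon so it re-evaluates the remaining paths
systemctl restart multipathd

# Verify the device maps are clean before letting the POD retry the mount
multipath -ll
```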

Actual results:

- When node comes back, POD cannot mount the volume.

Expected results:

- The POD should be able to mount its associated PV once the node comes back.

Additional info:

I'm providing the information we were able to collect and the location of the log files to inspect / analyze.

Comment 26 Colum Gaynor 2022-10-28 03:41:38 UTC
@bzvonar 

---> Case is being tracked and followed in the GChat group: https://mail.google.com/chat/u/0/#chat/space/AAAAFrtdLuE
     Bill - Yes, hope to close this case soon --> let's keep this on your tracker for a little bit longer.
     Trying now to conclude where we are with this one and to notify Nokia NOM that they need to contact
     HPE to sort out which CSI driver they should use.

Colum Gaynor - Senior Partner Success Manager, Nokia Global Account

- - - - - - - - - - - - - - - - - - 

@jcoscia 
 --> Javier, is it possible for you to post a summary and public update to the support case now, based on Hemant's work,
     and set the case status to "Waiting For Customer"?

     Clearly recommend to Nokia NOM that they need to clarify with HPE the correct version of the CSI driver they should be using.
     State clearly that Red Hat recommends the case be closed.

     Basically, we cannot reproduce the original condition, if I understand Hemant correctly?
     But we have also discovered evidence that they are on an old driver version?

 ---- From the GChat space -----

 Hemant Kumar, Yesterday 5:18 PM
 The way I see it:
 1. If the customer is running an older version of the CSI driver, they should try the latest version and report back.
 2. If this was the DNS error, then again the error is between the customer and HPE.
 We do not support this driver.
 We don't know anything about what version the customer should deploy.
 I am going to close this bug and ask you guys to work with HPE and NOM.
 If the bug comes up somewhere in the kube stack, then please loop us in.
 
 Ivan Bodunov, Yesterday 5:23 PM
 Ok. 
 So we don't need any call with HPE any more, the response was good enough for us?
 
 Hemant Kumar, Yesterday 5:24 PM
 Are we going to support the HPE driver?
 Why is Nokia not talking to HPE directly?
 It sounds like we are trying very hard to support a driver we know nothing about.

 Ivan Bodunov, Yesterday 5:25 PM
 I don't think we should support the HPE driver.
 
 Mihai Idu, Yesterday 5:26 PM
 My opinion is we should try to know both sides (HPE and RH) and build a joint framework on this topic.
 We never know when we will need HPE's side.

 Ivan Bodunov, Yesterday 5:26 PM
 But understanding problem would be nice..
 
 Hemant Kumar, Yesterday 5:26 PM
 In this case - the first line of contact should have been HPE.
 
 Hemant Kumar, Yesterday 5:28 PM
 The fact that HPE is pointing us to tools they built for dealing with non-graceful shutdown of nodes etc.
 is a clear indicator that HPE knows best how to deploy their driver.
 We should figure out why it took so long to talk to HPE.
 
 Hemant Kumar, Yesterday 6:04 PM
 @Colum Gaynor so tl;dr: I don't think we need to set up a call with HPE right now.
 I hear Nokia is setting up a new environment; if the bug surfaces again in that environment,
 IMO we should set up a call with HPE and Red Hat engineering.
 
 Hemant Kumar, Yesterday 6:10 PM
 @jcoscia 

 ---- From the GChat space -----