Bug 2115328 - multipath device not cleared when power cycling node
Summary: multipath device not cleared when power cycling node
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Hemant Kumar
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-04 12:33 UTC by Javier Coscia
Modified: 2022-11-02 14:36 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-02 14:36:19 UTC
Target Upstream Version:
Embargoed:


Attachments:

Description Javier Coscia 2022-08-04 12:33:09 UTC
Description of problem:

- Customer is testing a power cycling scenario with OCP + HPE CSI + multipath.

- After a power cycle event on an OCP worker node with a multipath device attached (provided by the HPE CSI driver) and in use by PODs, the POD cannot start when the node comes back until a SCSI rescan operation takes place on the node, followed by a multipath service restart (an illustrative check of the multipath state is sketched below).
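
The case data does not include command output, but on a node in this state one would expect 'multipath -ll' to report failed/faulty paths for the affected device. The sketch below is an illustration only, not taken from the case; it assumes root access on the node (e.g. via 'oc debug node/<node>') and that multipath-tools is installed.

#!/usr/bin/env python3
# Illustrative check only (not from the case data): list multipath path lines
# reported as failed/faulty, the state one would expect when the multipath
# device was not cleared after the power cycle.
import subprocess

def find_suspect_paths():
    out = subprocess.run(["multipath", "-ll"],
                         capture_output=True, text=True, check=True).stdout
    # 'multipath -ll' marks broken paths with states such as "failed" or "faulty".
    return [line.strip() for line in out.splitlines()
            if "failed" in line or "faulty" in line]

if __name__ == "__main__":
    for line in find_suspect_paths():
        print(line)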



Version-Release number of selected component (if applicable):

- OpenShift 4.8.45
- RHCOS 48.84.202206172122-0
- HPE CSI plugin version 1.3

How reproducible:

- Randomly, when abruptly power cycling OCP nodes

Steps to Reproduce:
1. Have a workload (POD) using PVs provided by an external CSI driver (HPE in this case), with multipath configured on the nodes. The node boots from SAN with multipath enabled as well.
2. Abruptly power cycle the node where the POD is running (and has the volume attached).
3. The node comes back, but the POD cannot mount the volume until a SCSI rescan happens (which clears devices that are unused and no longer present on the storage array side), followed by a multipath service restart (a sketch of this manual workaround follows below).
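
For reference, below is a minimal sketch of the manual workaround from step 3. The exact commands are an assumption; the case only states "SCSI rescan" and "multipath service restart". The sketch rescans every SCSI host via sysfs and then restarts multipathd, and would be run as root on the affected node.

#!/usr/bin/env python3
# Minimal sketch of the manual workaround from step 3 (exact commands are an
# assumption; the report only says "scsi rescan" + "multipath service restart").
# Run as root on the affected RHCOS node, e.g. from 'oc debug node/<node>'.
import glob
import subprocess

def rescan_scsi_hosts():
    # Writing "- - -" to /sys/class/scsi_host/hostN/scan asks the kernel to
    # rescan all channels/targets/LUNs on that SCSI host.
    for scan in glob.glob("/sys/class/scsi_host/host*/scan"):
        with open(scan, "w") as f:
            f.write("- - -")

def restart_multipathd():
    # Restart the multipath daemon so stale maps and paths are re-evaluated.
    subprocess.run(["systemctl", "restart", "multipathd"], check=True)

if __name__ == "__main__":
    rescan_scsi_hosts()
    restart_multipathd()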

Actual results:

- When the node comes back, the POD cannot mount the volume.

Expected results:

- The POD should be able to mount the PV associated with it once the node comes back.

Additional info:

I'm providing the information we were able to collect and the location of the log files to inspect/analyze.

Comment 26 Colum Gaynor 2022-10-28 03:41:38 UTC
@bzvonar 

---> Case is being tracked / followed on the GChat group: https://mail.google.com/chat/u/0/#chat/space/AAAAFrtdLuE
     Bill - Yes, we hope to close this case soon --> let's keep this on your tracker for a little bit longer. 
     Trying now to conclude where we are with this one and notify Nokia NOM that they need to contact
     HPE to sort out which CSI driver they should use.

Colum Gaynor - Senior Partner Success Manager, Nokia Global Account

- - - - - - - - - - - - - - - - - - 

@jcoscia 
 --> Javier, is it possible for you to make a summary and public update to the support case now, based on Hemant's work,
     and set the case status to "Waiting For Customer"?

     Clearly recommend to Nokia NOM that they need to clarify with HPE the correct version of the CSI driver they should be using.
     State clearly that Red Hat recommends the case be closed. 

     Basically, we cannot reproduce the original condition, if I understand Hemant correctly?
     But we have also discovered evidence that they are on an old driver version?

 ---- From the GChat space -----

 Hemant Kumar, Yesterday 5:18 PM
 The way I see it:
 1. If the customer is running an older version of the CSI driver, they should try the latest version and report back.
 2. If this was the DNS error, then again the error is between the customer and HPE.
 We do not support this driver. 
 We don't know anything about what version the customer should deploy.
 I am going to close this bug and ask you guys to work with HPE and NOM. 
 If the bug turns out to be somewhere in the kube stack, then please loop us in.
 
 Ivan Bodunov, Yesterday 5:23 PM
 Ok. 
 So we don't need any call with HPE anymore; the response was good enough for us?
 
 Hemant Kumar, Yesterday 5:24 PM
 Are we going to support the HPE driver? 
 Why is Nokia not talking to HPE directly?
 It sounds like we are trying very hard to support a driver we know nothing about.

 Ivan Bodunov, Yesterday 5:25 PM
 I don't think we should support the HPE driver.
 
 Mihai Idu, Yesterday 5:26 PM
 My opinion is we should try to know both sides (HPE and RH) and build a joint framework on this topic.
 We never know when we will need the HPE side.

 Ivan Bodunov, Yesterday 5:26 PM
 But understanding the problem would be nice.
 
 Hemant Kumar, Yesterday 5:26 PM
 In this case - the first line of contact should have been HPE.
 
 Hemant Kumar, Yesterday 5:28 PM
 The fact that HPE is pointing us to tools they built for dealing with non-graceful shutdown of nodes, etc., 
 is a clear indicator that HPE knows best how to deploy their driver.
 We should figure out why it took so long to talk to HPE.
 
 Hemant Kumar, Yesterday 6:04 PM
 @Colum Gaynor so tl;dr: I don't think we need to set up a call with HPE right now. 
 I hear Nokia is setting up a new environment; if the bug surfaces again in that environment, 
 IMO we should set up a call with HPE and Red Hat engineering.
 
 Hemant Kumar, Yesterday 6:10 PM
 @jcoscia 

 ---- From the GChat space -----

