Bug 2322671 - [Stretch cluster] RWX storage issue on surviving zone : MountVolume.MountDevice failed [NEEDINFO]
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.17
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Santosh Pillai
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-10-30 06:56 UTC by morstad
Modified: 2024-11-04 06:49 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
sapillai: needinfo? (vshankar)
sapillai: needinfo? (lflores)




Links:
Red Hat Issue Tracker OCSBZM-9455 (private: no; priority/status/summary: none; last updated 2024-10-30 06:58:10 UTC)

Description morstad 2024-10-30 06:56:13 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Stretch cluster testing with a "DR" scenario: the entire application was running on zone-1; we took zone-1 down and monitored recovery of the application to zone-2. The application relocated to zone-2 and a portion of it was up and running in approximately 20 minutes, but several pods requiring RWX storage were stuck in ContainerCreating with a message similar to the one below:


Events:
  Type     Reason       Age    From               Message
  ----     ------       ----   ----               -------
  Normal   Scheduled    3m48s  ibm-cpd-scheduler  Successfully assigned cpd-ins/asset-files-api-5f46c7f599-hvfdz to dahorak-ibmcloud-bwvbz-worker-2-6dzcw
  Warning  FailedMount  38s    kubelet            MountVolume.MountDevice failed for volume "pvc-a45db048-7f1e-4be6-8ca2-2f47ea09046e" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-a9a70a11-eca5-4377-aad4-7f276bfb1d46 already exists
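
For context, the "an operation with the given Volume ID ... already exists" text comes from Ceph-CSI's per-volume operation lock: a new NodeStageVolume call is aborted while an earlier operation on the same volume ID is still in flight (for example, one blocked against the failed zone). A possible way to inspect this, assuming a default ODF install where the CephFS node plugin runs as the csi-cephfsplugin DaemonSet in openshift-storage; the commands below are an illustrative sketch, not confirmed details of this cluster:

  # Which node still holds the attachment for the affected PV:
  oc get volumeattachment | grep pvc-a45db048-7f1e-4be6-8ca2-2f47ea09046e

  # Look for the blocked operation in the CephFS node plugin logs:
  oc -n openshift-storage logs ds/csi-cephfsplugin -c csi-cephfsplugin --tail=100

  # Last resort (disruptive, sketch only): restarting the plugin pod on the
  # surviving node drops the in-memory lock so the mount can be retried:
  oc -n openshift-storage delete pod -l app=csi-cephfsplugin \
    --field-selector spec.nodeName=dahorak-ibmcloud-bwvbz-worker-2-6dzcw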


Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)? Yes - unable to recover the application on the surviving zone


Is there any workaround available to the best of your knowledge? No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 4 - custom install of application


Is this issue reproducible? Likely; it may be timing based


Can this issue be reproduced from the UI? No


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an application with multiple pods using RWX storage on zone-1
2. Shut down zone-1 and force delete pods once they enter the Terminating state (a hypothetical helper is sketched after this list)
3. Monitor application relocation to the surviving zone
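
For step 2, a minimal sketch of the force deletion; the cpd-ins namespace is taken from the events above, and everything else is an assumption about a typical setup rather than the exact procedure used here:

  # Force delete application pods stuck in Terminating after zone-1 goes down
  # (namespace from the events above; adjust to the affected application):
  oc get pods -n cpd-ins -o wide | awk '$3 == "Terminating" {print $1}' |
    xargs -r oc delete pod -n cpd-ins --force --grace-period=0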


Actual results:
Pods with RWX storage were NOT able to mount their volumes

Expected results:
Pods with RWX storage are able to mount their volumes

Additional info:

