Bug 2211580

Summary: noobaa DB pod stuck in Init state with Multi-Attach error for volume after rescheduling to a new worker node on IBM Power
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: narayanspg <ngowda>
Component: csi-driver
Assignee: Rakshith <rar>
Status: CLOSED DUPLICATE
QA Contact: krishnaram Karthick <kramdoss>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.13
CC: ocs-bugs, odf-bz-bot, rar
Target Milestone: ---
Target Release: ---
Hardware: ppc64le
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-02 05:54:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description narayanspg 2023-06-01 06:21:53 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
The noobaa DB pod is stuck in the Init state with a Multi-Attach error for its volume after being rescheduled to a new worker node on IBM Power.

While verifying the epic "Fast recovery for NooBaa core and DB pods in case of node failure" - https://issues.redhat.com/browse/RHSTOR-3355

The worker node where the noobaa-db-pg-0 pod is scheduled is shut down; after the pod is rescheduled to another healthy worker node, it remains stuck in the Init state with the error message below.

Warning  FailedAttachVolume  115s  attachdetach-controller  Multi-Attach error for volume "pvc-b95e5212-39b0-40f8-9a6f-c642632ca966" Volume is already exclusively attached to one node and can't be attached to another

Note: the noobaa core pod is successfully rescheduled to another worker node and is working fine.
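
For reference, a minimal sketch of commands that can be used to confirm the stale attachment (this assumes the default openshift-storage namespace; the PVC name is taken from the event above):

# Check whether a VolumeAttachment still binds the PVC to the powered-off node
oc get volumeattachment | grep pvc-b95e5212-39b0-40f8-9a6f-c642632ca966

# Review the pod's events for the Multi-Attach / FailedAttachVolume warning
oc describe pod noobaa-db-pg-0 -n openshift-storage | grep -A 10 Events

# Confirm the old worker node is reported NotReady
oc get nodes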

Version of all relevant components (if applicable):
[root@nara1-nba-odf-c1f3-sao01-bastion-0 ~]# oc get clusterversion
NAME      VERSION                                      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-ppc64le-2023-05-02-182828   True        False         27h     Cluster version is 4.13.0-0.nightly-ppc64le-2023-05-02-182828
[root@nara1-nba-odf-c1f3-sao01-bastion-0 ~]# oc describe csv odf-operator.v4.13.0 -n openshift-storage | grep full
Labels:       full_version=4.13.0-207
          f:full_version:
[root@nara1-nba-odf-c1f3-sao01-bastion-0 ~]#

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Not able to verify the epic https://issues.redhat.com/browse/RHSTOR-3355

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Shut down the worker node where the noobaa DB pod is running.
2. The pod is rescheduled to another worker node but gets stuck in the Init state; see the sketch after this list.
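
A minimal sketch of the reproduction, assuming the default openshift-storage namespace (the worker can also be powered off from the hypervisor instead of from inside the host; <worker-node> is a placeholder):

# Simulate a node failure by powering off the worker from inside the host
oc debug node/<worker-node> -- chroot /host shutdown -h now

# Watch the noobaa-db-pg-0 pod get rescheduled and stall in the Init state
oc get pod noobaa-db-pg-0 -n openshift-storage -w

# Inspect the pod events for the Multi-Attach error
oc describe pod noobaa-db-pg-0 -n openshift-storage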


Actual results:
On node failure, the noobaa-db-pg-0 pod is rescheduled to another worker node but remains stuck in the Init state.

Expected results:
On node failure, the noobaa-db-pg-0 pod should be successfully rescheduled to another worker node and run.

Additional info:
We need to know how to reschedule the noobaa-db-pg-0 pod in order to verify the epic https://issues.redhat.com/browse/RHSTOR-3355

Comment 2 Rakshith 2023-06-02 05:54:58 UTC
This is a known bug (refer to https://bugzilla.redhat.com/show_bug.cgi?id=1795372#c21), and
https://issues.redhat.com/browse/RHSTOR-2500 is aimed at reducing the time taken for the remount.

The https://issues.redhat.com/browse/RHSTOR-3355 epic that you are trying to validate is about
verifying `Faster recovery for NooBaa core and DB pods in case of node failure` in 4.13 compared
to 4.12. It does not mean there will be no delay in the re-spin at all.
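
As a minimal sketch (assuming the default openshift-storage namespace), the recovery time can be compared between 4.12 and 4.13 by timing how long the pod takes to become Ready again after the node is shut down:

# After shutting down the node, time how long the DB pod takes to recover
time oc wait --for=condition=Ready pod/noobaa-db-pg-0 -n openshift-storage --timeout=30m

# Optionally watch the stale VolumeAttachment until it is cleaned up
oc get volumeattachment -w | grep pvc-b95e5212-39b0-40f8-9a6f-c642632ca966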

Please read the comments in the epic https://issues.redhat.com/browse/RHSTOR-3355
and the story https://issues.redhat.com/browse/RHSTOR-3972.

Contact the QE engineer who validated it on an AWS setup for more information.

*** This bug has been marked as a duplicate of bug 1795372 ***