Bug 2211580 - noobaa DB pod stuck init state with Multi-Attach error for volume after rescheduling to a new worker node on IBM Power
Summary: noobaa DB pod stuck init state with Multi-Attach error for volume after rescheduling to a new worker node on IBM Power
Keywords:
Status: CLOSED DUPLICATE of bug 1795372
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.13
Hardware: ppc64le
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Rakshith
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-01 06:21 UTC by narayanspg
Modified: 2023-08-09 16:37 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-02 05:54:58 UTC
Embargoed:



Description narayanspg 2023-06-01 06:21:53 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
The noobaa DB pod is stuck in the init state with a Multi-Attach error for its volume after being rescheduled to a new worker node on IBM Power.

This was found while verifying the epic "Fast recovery for NooBaa core and DB pods in case of node failure" - https://issues.redhat.com/browse/RHSTOR-3355

The worker node where the noobaa-db-pg-0 pod is scheduled is shut down. After the pod is rescheduled to another healthy worker node, it is stuck in the init state with the error message below.

Warning  FailedAttachVolume  115s  attachdetach-controller  Multi-Attach error for volume "pvc-b95e5212-39b0-40f8-9a6f-c642632ca966" Volume is already exclusively attached to one node and can't be attached to another
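
For reference, the stale attachment can be confirmed by listing the VolumeAttachment objects for the affected PV. This is only a minimal sketch, reusing the PV name from the event above:

# Find the VolumeAttachment that still references the PV from the event above
oc get volumeattachment | grep pvc-b95e5212-39b0-40f8-9a6f-c642632ca966

# Show which node each attachment points to and whether it is still attached
oc get volumeattachment -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached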

Note: the noobaa core pod is successfully rescheduled to another worker node and works fine.

Version of all relevant components (if applicable):
[root@nara1-nba-odf-c1f3-sao01-bastion-0 ~]# oc get clusterversion
NAME      VERSION                                      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-ppc64le-2023-05-02-182828   True        False         27h     Cluster version is 4.13.0-0.nightly-ppc64le-2023-05-02-182828
[root@nara1-nba-odf-c1f3-sao01-bastion-0 ~]# oc describe csv odf-operator.v4.13.0 -n openshift-storage | grep full
Labels:       full_version=4.13.0-207
          f:full_version:
[root@nara1-nba-odf-c1f3-sao01-bastion-0 ~]#

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Not able to verify the epic https://issues.redhat.com/browse/RHSTOR-3355.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Shut down the worker node where the noobaa DB pod is running.
2. The pod is rescheduled to another worker node but gets stuck in the init state (see the sketch below).
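
A minimal sketch of the reproduction using standard oc commands; the worker node name is a placeholder:

# 1. Find the node currently running the noobaa DB pod
oc get pod noobaa-db-pg-0 -n openshift-storage -o wide

# 2. Shut down that worker node (placeholder name; on IBM Power this can also
#    be done from the hypervisor instead)
oc debug node/<worker-node> -- chroot /host shutdown -h now

# 3. Watch the pod get rescheduled and stay in the init state
oc get pod noobaa-db-pg-0 -n openshift-storage -w
oc describe pod noobaa-db-pg-0 -n openshift-storage | grep -A1 FailedAttachVolume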


Actual results:
On node failure, the noobaa-db-pg-0 pod is rescheduled to another worker node but gets stuck in the init state.

Expected results:
On node failure, the noobaa-db-pg-0 pod should be successfully rescheduled to another worker node and run.

Additional info:
We need to know how to reschedule the noobaa-db-pg-0 pod in order to verify the epic - https://issues.redhat.com/browse/RHSTOR-3355
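
For what it's worth, the manual recovery usually applied in this situation is to clear the stale attachment once the old node is confirmed down. This is only a sketch, not an official ODF procedure, and the object names are placeholders:

# Find the VolumeAttachment that still points at the failed node
oc get volumeattachment -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName | grep <failed-node>

# Delete it so the attach/detach controller can attach the volume on the new node
# (only when the old node is confirmed powered off)
oc delete volumeattachment <volumeattachment-name>

# If an old copy of the pod is stuck in Terminating on the failed node, force-delete it
oc delete pod noobaa-db-pg-0 -n openshift-storage --force --grace-period=0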

Comment 2 Rakshith 2023-06-02 05:54:58 UTC
This is a known bug (refer to https://bugzilla.redhat.com/show_bug.cgi?id=1795372#c21) and
https://issues.redhat.com/browse/RHSTOR-2500 is aimed at reducing the time taken for remount. 

The epic you are trying to validate, https://issues.redhat.com/browse/RHSTOR-3355, is about
validating `Faster recovery for NooBaa core and DB pods in case of node failure` in 4.13 compared
to 4.12. It does not mean there is no delay in the re-spin at all.
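
A rough way to measure the recovery delay being compared here, assuming nothing beyond standard oc commands:

# Watch the pod until it leaves the init state and note the timestamps
oc get pod noobaa-db-pg-0 -n openshift-storage -w

# Follow the attach/detach events for the pod until the Multi-Attach error clears
oc get events -n openshift-storage --field-selector involvedObject.name=noobaa-db-pg-0 -w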

Please read the comments in the epic https://issues.redhat.com/browse/RHSTOR-3355
and story https://issues.redhat.com/browse/RHSTOR-3972

Contact the QE team that validated it on an AWS setup for more information.

*** This bug has been marked as a duplicate of bug 1795372 ***

