Bug 2116791

Summary: OSD is stuck in Init:0/9 after performing TestNodeReplacement proactive test case
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Pratik Surve <prsurve>
Component: rook
Assignee: Santosh Pillai <sapillai>
Status: CLOSED NOTABUG
QA Contact: Neha Berry <nberry>
Severity: urgent
Priority: unspecified
Version: 4.11
CC: amagrawa, kramdoss, madam, muagarwa, ocs-bugs, odf-bz-bot, rgeorge, sapillai
Target Milestone: ---
Keywords: Regression
Target Release: ---
Flags: ykaul: needinfo? (prsurve)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-08-10 08:07:32 UTC
Type: Bug

Description Pratik Surve 2022-08-09 10:12:01 UTC
Description of problem (please be as detailed as possible and provide log snippets):

OSD is stuck in Init:0/9 after performing TestNodeReplacement proactive test case


Version of all relevant components (if applicable):


OCS operator	4.11.0-131
Cluster Version	4.11.0-0.nightly-2022-08-04-081314
Ceph Version	16.2.8-84.el8cp (c2980f2fd700e979d41b4bad2939bb90f0fe435c) pacific (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a cluster with KMS encryption (Vault, KV v2)
2. Run IO
3. Perform the node replacement test case (see the sketch below)
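
For reference, a minimal sketch of how the stuck state can be observed after step 3, assuming the openshift-storage namespace and the pod/PVC names taken from the events below; adjust for your cluster:

```
# Sketch only; names are taken from the events in this report.
oc get pods -n openshift-storage -l app=rook-ceph-osd          # OSD pod stuck in Init:0/9
oc describe pod rook-ceph-osd-0-6cc5ff4c5c-8vt7g -n openshift-storage
oc get pvc ocs-deviceset-1-data-0k5qzf -n openshift-storage    # data PVC backing the OSD
```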


Actual results:

Events:
  Type     Reason              Age                 From                     Message
  ----     ------              ----                ----                     -------
  Warning  FailedScheduling    50m (x4 over 50m)   default-scheduler        0/6 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 1 node(s) were unschedulable, 3 Insufficient cpu, 3 node(s) didn't match pod topology spread constraints (missing required label), 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
  Warning  FailedScheduling    44m (x3 over 45m)   default-scheduler        0/5 nodes are available: 3 Insufficient cpu, 3 node(s) didn't match pod topology spread constraints (missing required label), 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
  Normal   Scheduled           43m                 default-scheduler        Successfully assigned openshift-storage/rook-ceph-osd-0-6cc5ff4c5c-8vt7g to compute-3 by control-plane-1
  Warning  FailedMount         41m                 kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0k5qzf], unattached volumes=[run-udev ocs-deviceset-1-data-0k5qzf-bridge kube-api-access-tnm5p dev-mapper osd-encryption-key vault rook-ceph-log ocs-deviceset-1-data-0k5qzf rook-config-override rook-ceph-crash rook-data]: timed out waiting for the condition
  Warning  FailedMount         38m                 kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0k5qzf], unattached volumes=[rook-ceph-crash rook-data rook-config-override run-udev ocs-deviceset-1-data-0k5qzf-bridge vault rook-ceph-log kube-api-access-tnm5p osd-encryption-key dev-mapper ocs-deviceset-1-data-0k5qzf]: timed out waiting for the condition
  Warning  FailedMount         36m                 kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0k5qzf], unattached volumes=[dev-mapper rook-config-override osd-encryption-key ocs-deviceset-1-data-0k5qzf-bridge kube-api-access-tnm5p ocs-deviceset-1-data-0k5qzf rook-data rook-ceph-log rook-ceph-crash vault run-udev]: timed out waiting for the condition
  Warning  FailedMount         33m                 kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0k5qzf], unattached volumes=[ocs-deviceset-1-data-0k5qzf-bridge kube-api-access-tnm5p rook-ceph-log run-udev vault rook-config-override rook-ceph-crash osd-encryption-key dev-mapper rook-data ocs-deviceset-1-data-0k5qzf]: timed out waiting for the condition
  Warning  FailedMount         31m                 kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0k5qzf], unattached volumes=[ocs-deviceset-1-data-0k5qzf kube-api-access-tnm5p dev-mapper rook-config-override vault rook-data run-udev ocs-deviceset-1-data-0k5qzf-bridge rook-ceph-log osd-encryption-key rook-ceph-crash]: timed out waiting for the condition
  Warning  FailedMount         29m                 kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0k5qzf], unattached volumes=[vault ocs-deviceset-1-data-0k5qzf rook-ceph-crash run-udev kube-api-access-tnm5p rook-ceph-log ocs-deviceset-1-data-0k5qzf-bridge rook-data rook-config-override osd-encryption-key dev-mapper]: timed out waiting for the condition
  Warning  FailedMount         26m                 kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0k5qzf], unattached volumes=[ocs-deviceset-1-data-0k5qzf-bridge rook-data ocs-deviceset-1-data-0k5qzf kube-api-access-tnm5p vault rook-ceph-crash run-udev rook-ceph-log rook-config-override osd-encryption-key dev-mapper]: timed out waiting for the condition
  Warning  FailedMount         24m                 kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0k5qzf], unattached volumes=[dev-mapper kube-api-access-tnm5p ocs-deviceset-1-data-0k5qzf rook-ceph-log vault osd-encryption-key rook-data rook-config-override rook-ceph-crash run-udev ocs-deviceset-1-data-0k5qzf-bridge]: timed out waiting for the condition
  Warning  FailedMount         21m                 kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0k5qzf], unattached volumes=[rook-ceph-crash kube-api-access-tnm5p rook-data rook-config-override ocs-deviceset-1-data-0k5qzf-bridge rook-ceph-log run-udev osd-encryption-key vault dev-mapper ocs-deviceset-1-data-0k5qzf]: timed out waiting for the condition
  Warning  FailedAttachVolume  75s (x22 over 43m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-5dfe54c2-1c36-4bf2-bbce-d18b4f615af6" : Failed to add disk 'scsi0:2'.
  Warning  FailedMount         66s (x9 over 19m)   kubelet                  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0k5qzf], unattached volumes=[rook-data osd-encryption-key rook-config-override rook-ceph-log rook-ceph-crash run-udev vault ocs-deviceset-1-data-0k5qzf-bridge kube-api-access-tnm5p dev-mapper ocs-deviceset-1-data-0k5qzf]: timed out waiting for the condition
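
The repeating FailedMount events are a downstream symptom: the kubelet cannot mount the OSD's data volume because the attachdetach-controller never succeeds in attaching the PV to the replacement node. A minimal sketch of how that can be confirmed, assuming the PV name from the FailedAttachVolume event (the VolumeAttachment name is cluster-generated, so a hypothetical placeholder is used here):

```
# The ATTACHED column should read true; here it stays false while the attach error repeats.
oc get volumeattachment | grep pvc-5dfe54c2-1c36-4bf2-bbce-d18b4f615af6
oc describe volumeattachment <volumeattachment-name>   # hypothetical name; shows the attach error
# The scsi0:2 wording suggests the vSphere volume plugin owns this PV.
oc get pv pvc-5dfe54c2-1c36-4bf2-bbce-d18b4f615af6 -o yaml
```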



Expected results:

The OSD pod should reach the Running state.

Additional info:

Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/4969/consoleFull

must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-015vukv21cs33-t4a/j-015vukv21cs33-t4a_20220808T045215/

Jenkins job rerun: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/4998/

must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-016vukv21cs33-t4a/j-016vukv21cs33-t4a_20220809T051710/

Comment 3 Santosh Pillai 2022-08-09 13:27:24 UTC
```
Warning  FailedAttachVolume  75s (x22 over 43m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-5dfe54c2-1c36-4bf2-bbce-d18b4f615af6" : Failed to add disk 'scsi0:2'.
```

Prima facie, this looks like an environment issue related to the pod being rescheduled onto a new node.

This is the closest resolution I could find for this error: https://access.redhat.com/solutions/5917391
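
Since scsi0:2 is a vSphere SCSI unit number, one way to check whether the backing VMDK is still attached to the replaced worker VM is from the vSphere side. A minimal sketch using govc, with hypothetical VM names:

```
# Hypothetical VM names; list the virtual disks on the old and new workers
# to see whether the PV's VMDK is still held by the replaced node.
govc device.ls -vm compute-old     # replaced worker (hypothetical name)
govc device.ls -vm compute-3       # node the OSD pod was scheduled to
govc device.info -vm compute-3 disk-*
```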

Comment 4 Yaniv Kaul 2022-08-09 14:59:26 UTC
(In reply to Santosh Pillai from comment #3)
> ```
> Warning  FailedAttachVolume  75s (x22 over 43m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-5dfe54c2-1c36-4bf2-bbce-d18b4f615af6" : Failed to add disk 'scsi0:2'.
> ```
> 
> Prima facie, this looks like an environment issue related to the pod being
> rescheduled onto a new node.
> 
> This is the closest resolution I could find for this error:
> https://access.redhat.com/solutions/5917391

Pratik?