Bug 2109480

Summary: rook-ceph-mon pods are getting into pending state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Chetna <cchetna>
Component: rook    Assignee: Subham Rai <srai>
Status: CLOSED NOTABUG QA Contact: Neha Berry <nberry>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.10    CC: atandale, ipinto, jrivera, madam, muagarwa, ocs-bugs, odf-bz-bot, sostapov, tnielsen
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-09-26 15:25:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Chetna 2022-07-21 10:55:34 UTC
Description of problem (please be as detailed as possible and provide log snippets):
While installing ODF from operator hub on ROSA cluster, storage cluster is getting into error state with error: "Error while reconciling: some StorageClasses [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd] were skipped while waiting for pre-requisites to be met"

rook-ceph-mon pods are getting into pending state with error: "0/8 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
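
For reference, the scheduling events behind the Pending mons can be pulled like this (a sketch only; it assumes ODF is installed in the default openshift-storage namespace):

```
# Assumes the default openshift-storage namespace.
oc get pods -n openshift-storage -l app=rook-ceph-mon -o wide
oc describe pods -n openshift-storage -l app=rook-ceph-mon | grep -A5 'Events:'
```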

Logs of failed installation: http://pastebin.test.redhat.com/1066592

Version of all relevant components (if applicable):
ROSA cluster version: 4.10
Rosa cluster specifications: machine_type: m5.4xlarge, nodes: 3
ODF version: 4.10

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
We need ODF to create storage classes for CNV storage. The storage cluster is going into ERROR state, so the storage classes are not being created.
This blocks our team from testing CNV+ODF on ROSA clusters.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?
We are not facing this issue when the storage system is created from the UI; the storage cluster gets created and reaches the "PROGRESSING" state.

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
Install ODF, the storage system, and the storage cluster using Ansible automation on a ROSA cluster: http://pastebin.test.redhat.com/1065816

Actual results:
1. Storage Cluster goes into error state.
2. rook-ceph-mon pods go into Pending state.
3. Storage Classes "ocs-storagecluster-ceph-rbd" and "ocs-storagecluster-cephfs" are not created.

Expected results:
1. Storage Cluster in READY state
2. rook-ceph-mon pods in "Running" state
3. Storage Classes "ocs-storagecluster-ceph-rbd" and "ocs-storagecluster-cephfs" created

Additional info: The provided Ansible script installs ODF from the operator hub, but the installation does not succeed every time.
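
For reference, a quick way to observe the states above (a sketch only; assumes the default openshift-storage namespace and the default storage cluster name):

```
oc get storagecluster -n openshift-storage   # PHASE shows Error instead of Ready
oc get sc ocs-storagecluster-ceph-rbd ocs-storagecluster-cephfs   # not found until the cluster is healthy
```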

Comment 5 Travis Nielsen 2022-09-12 15:25:50 UTC
Is this still an issue or did you fix the cluster? Can we close this issue?

Comment 6 Travis Nielsen 2022-09-26 15:25:23 UTC
Please reopen if still an issue.

Comment 7 Anandprakash Tandale 2022-09-28 13:29:32 UTC
It seems the bug is reproducible, so opening it again.

Comment 9 Subham Rai 2022-09-29 16:49:34 UTC
The infra nodes have taints that prevented the pods from being scheduled on them:
```
{
  "providerID": "aws:///us-east-1a/i-038b6e10991928fca",
  "taints": [
    {
      "effect": "NoSchedule",
      "key": "node-role.kubernetes.io/master"
    }
  ]
}
```
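
Taints across the nodes can be listed with something like this (sketch only):

```
oc get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```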
Once I added tolerations for these taints to the storage cluster spec, the cluster was in a good state; the mon and osd pods came up and were running:

```
rook-ceph-mgr-a-54cd6f947d-jfp6f                                  2/2     Running     0          7m43s   10.129.2.17    ip-10-0-208-187.ec2.internal   <none>           <none>
rook-ceph-mon-a-8db947566-z44p2                                   2/2     Running     0          9m57s   10.129.2.16    ip-10-0-208-187.ec2.internal   <none>           <none>
rook-ceph-mon-b-59fbdb949d-kbgbp                                  2/2     Running     0          9m29s   10.131.0.77    ip-10-0-211-138.ec2.internal   <none>           <none>
rook-ceph-mon-c-76bf5d5cb8-x9trg                                  2/2     Running     0          9m12s   10.128.2.72    ip-10-0-212-154.ec2.internal   <none>           <none>
rook-ceph-operator-848dd966fd-8pvc4                               1/1     Running     0          3m43s   10.131.0.80    ip-10-0-211-138.ec2.internal   <none>           <none>
rook-ceph-osd-0-99ffb9d6f-n82tj                                   2/2     Running     0          4m57s   10.131.0.81    ip-10-0-211-138.ec2.internal   <none>           <none>
rook-ceph-osd-1-9b8565ccc-b4jsl                                   2/2     Running     0          7m3s    10.128.2.83    ip-10-0-212-154.ec2.internal   <none>           <none>
rook-ceph-osd-2-6c669479f7-lbnv6                                  2/2     Running     0          6m32s   10.130.2.59    ip-10-0-226-41.ec2.internal    <none>           <none>
```

```
placement:
    all:
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
```
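
The same tolerations can also be applied to an existing StorageCluster with a merge patch along these lines (a sketch only; assumes the default ocs-storagecluster name in the openshift-storage namespace):

```
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge \
  -p '{"spec":{"placement":{"all":{"tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/infra"}]}}}}'
```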