Bug 2109480
| Summary: | rook-ceph-mon pods are getting into pending state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Chetna <cchetna> |
| Component: | rook | Assignee: | Subham Rai <srai> |
| Status: | CLOSED NOTABUG | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | atandale, ipinto, jrivera, madam, muagarwa, ocs-bugs, odf-bz-bot, sostapov, tnielsen |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-26 15:25:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Is this still an issue, or did you fix the cluster? Can we close this issue? Please reopen if it is still an issue.

It seems the bug is reproducible, so reopening it. The infra nodes have taints that prevented the pods from being scheduled:
```
{
"providerID": "aws:///us-east-1a/i-038b6e10991928fca",
"taints": [
{
"effect": "NoSchedule",
"key": "node-role.kubernetes.io/master"
}
]
}
```
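To confirm which nodes carry such taints, something like the following can be used (a hedged sketch using standard oc/kubectl JSONPath; the output columns are illustrative):
```
# List each node together with the keys of its taints (format is illustrative).
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
```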
Once I added tolerations for these taints to the StorageCluster spec, the cluster reached a healthy state; the mon and OSD pods came up and are running:
```
rook-ceph-mgr-a-54cd6f947d-jfp6f 2/2 Running 0 7m43s 10.129.2.17 ip-10-0-208-187.ec2.internal <none> <none>
rook-ceph-mon-a-8db947566-z44p2 2/2 Running 0 9m57s 10.129.2.16 ip-10-0-208-187.ec2.internal <none> <none>
rook-ceph-mon-b-59fbdb949d-kbgbp 2/2 Running 0 9m29s 10.131.0.77 ip-10-0-211-138.ec2.internal <none> <none>
rook-ceph-mon-c-76bf5d5cb8-x9trg 2/2 Running 0 9m12s 10.128.2.72 ip-10-0-212-154.ec2.internal <none> <none>
rook-ceph-operator-848dd966fd-8pvc4 1/1 Running 0 3m43s 10.131.0.80 ip-10-0-211-138.ec2.internal <none> <none>
rook-ceph-osd-0-99ffb9d6f-n82tj 2/2 Running 0 4m57s 10.131.0.81 ip-10-0-211-138.ec2.internal <none> <none>
rook-ceph-osd-1-9b8565ccc-b4jsl 2/2 Running 0 7m3s 10.128.2.83 ip-10-0-212-154.ec2.internal <none> <none>
rook-ceph-osd-2-6c669479f7-lbnv6 2/2 Running 0 6m32s 10.130.2.59 ip-10-0-226-41.ec2.internal <none> <none>
```
The tolerations were added under the placement section of the StorageCluster spec:
```
placement:
all:
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/infra
```
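For context, a minimal sketch of where this placement block sits in the StorageCluster custom resource; the metadata names are illustrative and assume a default ODF install in openshift-storage:
```
# Illustrative sketch only: shows where placement.all.tolerations is nested
# in a StorageCluster CR; metadata names and surrounding fields are assumptions.
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  placement:
    all:
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        operator: Exists
```
With tolerations like this in place, the mon and OSD pods tolerate the infra taint and can be scheduled, which matches the pod listing above.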
Description of problem (please be detailed as possible and provide log snippets):

While installing ODF from OperatorHub on a ROSA cluster, the storage cluster gets into an error state with the error:

"Error while reconciling: some StorageClasses [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd] were skipped while waiting for pre-requisites to be met"

The rook-ceph-mon pods get into a pending state with the error:

"0/8 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."

Logs of the failed installation: http://pastebin.test.redhat.com/1066592

Version of all relevant components (if applicable):
- ROSA cluster version: 4.10
- ROSA cluster specifications: machine_type: m5.4xlarge, nodes: 3
- ODF version: 4.10

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
We need ODF to create the storage classes for CNV storage. The storage cluster gets into an ERROR state, so the storage classes are not created. This blocks our team from testing CNV+ODF on ROSA clusters.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?
We do not face this issue when the storage system is created from the UI; the storage cluster is created and reaches the "PROGRESSING" state.

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install ODF, the storage system, and the storage cluster using Ansible automation on a ROSA cluster: http://pastebin.test.redhat.com/1065816

Actual results:
1. Storage cluster gets into an error state.
2. The rook-ceph-mon pods get into a pending state.
3. The storage classes "ocs-storagecluster-ceph-rbd" and "ocs-storagecluster-cephfs" are not created.

Expected results:
1. Storage cluster in READY state.
2. rook-ceph-mon pods in "Running" state.
3. Storage classes "ocs-storagecluster-ceph-rbd" and "ocs-storagecluster-cephfs" created.

Additional info:
The provided Ansible script installs ODF from OperatorHub, but not every time.
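For verification, roughly these commands can be used to check the StorageCluster phase and the pending mon pods (a hedged sketch; resource names and labels assume a default ODF install in openshift-storage):
```
# Illustrative checks; names/labels assume a default ODF install.
oc get storagecluster -n openshift-storage
oc describe pod -n openshift-storage -l app=rook-ceph-mon
```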