Bug 2109480
| Summary: | rook-ceph-mon pods are getting into pending state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Chetna <cchetna> |
| Component: | rook | Assignee: | Subham Rai <srai> |
| Status: | CLOSED NOTABUG | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | atandale, ipinto, jrivera, madam, muagarwa, ocs-bugs, odf-bz-bot, sostapov, tnielsen |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-26 15:25:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Is this still an issue, or did you fix the cluster? Can we close this issue? Please reopen if it is still an issue.

It seems the bug is reproducible, so reopening it. The infra nodes have taints that prevented the pods from being scheduled:
```
{
"providerID": "aws:///us-east-1a/i-038b6e10991928fca",
"taints": [
{
"effect": "NoSchedule",
"key": "node-role.kubernetes.io/master"
}
]
}
```
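To confirm which nodes carry such taints, something like the following can be used (a hedged sketch using standard oc/kubectl JSONPath; the output columns are illustrative):
```
# List each node together with the keys of its taints (format is illustrative).
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
```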
Once I added tolerations for these taints to the StorageCluster spec, the cluster reached a healthy state; the mon and OSD pods came up and are running:
```
rook-ceph-mgr-a-54cd6f947d-jfp6f 2/2 Running 0 7m43s 10.129.2.17 ip-10-0-208-187.ec2.internal <none> <none>
rook-ceph-mon-a-8db947566-z44p2 2/2 Running 0 9m57s 10.129.2.16 ip-10-0-208-187.ec2.internal <none> <none>
rook-ceph-mon-b-59fbdb949d-kbgbp 2/2 Running 0 9m29s 10.131.0.77 ip-10-0-211-138.ec2.internal <none> <none>
rook-ceph-mon-c-76bf5d5cb8-x9trg 2/2 Running 0 9m12s 10.128.2.72 ip-10-0-212-154.ec2.internal <none> <none>
rook-ceph-operator-848dd966fd-8pvc4 1/1 Running 0 3m43s 10.131.0.80 ip-10-0-211-138.ec2.internal <none> <none>
rook-ceph-osd-0-99ffb9d6f-n82tj 2/2 Running 0 4m57s 10.131.0.81 ip-10-0-211-138.ec2.internal <none> <none>
rook-ceph-osd-1-9b8565ccc-b4jsl 2/2 Running 0 7m3s 10.128.2.83 ip-10-0-212-154.ec2.internal <none> <none>
rook-ceph-osd-2-6c669479f7-lbnv6 2/2 Running 0 6m32s 10.130.2.59 ip-10-0-226-41.ec2.internal <none> <none>
```
The tolerations were added under the placement section of the StorageCluster spec:
```
placement:
all:
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/infra
```
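For context, a minimal sketch of where this placement block sits in the StorageCluster custom resource; the metadata names are illustrative and assume a default ODF install in openshift-storage:
```
# Illustrative sketch only: shows where placement.all.tolerations is nested
# in a StorageCluster CR; metadata names and surrounding fields are assumptions.
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  placement:
    all:
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        operator: Exists
```
With tolerations like this in place, the mon and OSD pods tolerate the infra taint and can be scheduled, which matches the pod listing above.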
Description of problem (please be detailed as possible and provide log snippets):

While installing ODF from OperatorHub on a ROSA cluster, the storage cluster gets into an error state with the error:

"Error while reconciling: some StorageClasses [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd] were skipped while waiting for pre-requisites to be met"

The rook-ceph-mon pods get into a pending state with the error:

"0/8 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."

Logs of the failed installation: http://pastebin.test.redhat.com/1066592

Version of all relevant components (if applicable):
- ROSA cluster version: 4.10
- ROSA cluster specifications: machine_type: m5.4xlarge, nodes: 3
- ODF version: 4.10

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
We need ODF to create the storage classes for CNV storage. The storage cluster gets into an ERROR state, so the storage classes are not created. This blocks our team from testing CNV+ODF on ROSA clusters.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?
We do not face this issue when the storage system is created from the UI; the storage cluster is created and reaches the "PROGRESSING" state.

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install ODF, the storage system, and the storage cluster using Ansible automation on a ROSA cluster: http://pastebin.test.redhat.com/1065816

Actual results:
1. Storage cluster gets into an error state.
2. The rook-ceph-mon pods get into a pending state.
3. The storage classes "ocs-storagecluster-ceph-rbd" and "ocs-storagecluster-cephfs" are not created.

Expected results:
1. Storage cluster in READY state.
2. rook-ceph-mon pods in "Running" state.
3. Storage classes "ocs-storagecluster-ceph-rbd" and "ocs-storagecluster-cephfs" created.

Additional info:
The provided Ansible script installs ODF from OperatorHub, but not every time.
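For verification, roughly these commands can be used to check the StorageCluster phase and the pending mon pods (a hedged sketch; resource names and labels assume a default ODF install in openshift-storage):
```
# Illustrative checks; names/labels assume a default ODF install.
oc get storagecluster -n openshift-storage
oc describe pod -n openshift-storage -l app=rook-ceph-mon
```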