Bug 2166915
| Summary: | rook-ceph-crashcollector pods on a provider cluster with size 20 in Pending state: Insufficient cpu and memory | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Filip Balák <fbalak> |
| Component: | odf-managed-service | Assignee: | Ohad <omitrani> |
| Status: | VERIFIED --- | QA Contact: | Filip Balák <fbalak> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | cblum, jijoy, nberry, odf-bz-bot, rohgupta |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
In a recent deployment it was observed that the issue is not only insufficient cpu but also insufficient memory. Two crashcollector pods were in Pending state due to insufficient cpu and memory.
$ oc get pods -o wide | grep "Pending"
rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c 0/1 Pending 0 150m <none> <none> <none> <none>
rook-ceph-crashcollector-f60865df61f4dc2103724c18d5f5b65e-qx84n 0/1 Pending 0 148m <none> <none> <none> <none>
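To see how much cpu and memory the Pending crashcollector pod is actually requesting (the requests the scheduler could not satisfy), the pod spec can be inspected directly. This is an illustrative command, not output captured from the affected cluster:
$ oc get pod rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'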
$ oc describe pod rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c | grep "Events:" -A 30
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 147m default-scheduler 0/12 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
Warning FailedScheduling 147m default-scheduler 0/12 nodes are available: 1 Insufficient memory, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
Warning FailedScheduling 147m (x2 over 147m) default-scheduler 0/12 nodes are available: 1 Insufficient memory, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
Warning FailedScheduling 146m (x4 over 147m) default-scheduler 0/12 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
Warning FailedScheduling 143m (x10 over 146m) default-scheduler 0/12 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector.
Warning FailedScheduling 141m (x3 over 143m) default-scheduler 0/12 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
Warning FailedScheduling 30m (x140 over 141m) default-scheduler 0/12 nodes are available: 1 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector.
$ oc describe pod rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c | grep node_name
node_name=ip-10-0-14-194.us-east-2.compute.internal
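The crashcollector pod is tied to this specific node by its node selector (Rook runs one crashcollector per node hosting Ceph daemons), so the next useful check is how much of the node's allocatable cpu and memory is already requested. An illustrative command, not output from the affected cluster:
$ oc describe node ip-10-0-14-194.us-east-2.compute.internal | grep -A 10 "Allocated resources"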
Pods running on the node where the pod rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c is trying to come up:
$ oc get pods -o wide | grep ip-10-0-14-194.us-east-2.compute.internal
rook-ceph-mgr-a-578f57f87d-d6cgb 2/2 Running 0 150m 10.0.14.194 ip-10-0-14-194.us-east-2.compute.internal <none> <none>
rook-ceph-mon-c-54564cd9cc-sdgsw 2/2 Running 0 152m 10.0.14.194 ip-10-0-14-194.us-east-2.compute.internal <none> <none>
rook-ceph-osd-12-85f55945d6-cpdps 2/2 Running 0 148m 10.0.14.194 ip-10-0-14-194.us-east-2.compute.internal <none> <none>
rook-ceph-osd-3-c5985f568-xs8q7 2/2 Running 0 149m 10.0.14.194 ip-10-0-14-194.us-east-2.compute.internal <none> <none>
rook-ceph-osd-4-667f88675d-tbjrz 2/2 Running 0 149m 10.0.14.194 ip-10-0-14-194.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-0-data-0zb589-wpbjw 0/1 Completed 0 150m 10.0.14.194 ip-10-0-14-194.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-1-data-3ccnsk-pw4zm 0/1 Completed 0 149m 10.0.14.194 ip-10-0-14-194.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-2-data-4drz6r-lzlfj 0/1 Completed 0 149m 10.0.14.194 ip-10-0-14-194.us-east-2.compute.internal <none> <none>
All the nodes are in Ready state:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-12-93.us-east-2.compute.internal Ready master 3h5m v1.23.12+8a6bfe4
ip-10-0-14-194.us-east-2.compute.internal Ready worker 178m v1.23.12+8a6bfe4
ip-10-0-14-24.us-east-2.compute.internal Ready worker 178m v1.23.12+8a6bfe4
ip-10-0-15-163.us-east-2.compute.internal Ready infra,worker 162m v1.23.12+8a6bfe4
ip-10-0-17-135.us-east-2.compute.internal Ready master 3h5m v1.23.12+8a6bfe4
ip-10-0-17-34.us-east-2.compute.internal Ready worker 178m v1.23.12+8a6bfe4
ip-10-0-18-145.us-east-2.compute.internal Ready infra,worker 162m v1.23.12+8a6bfe4
ip-10-0-19-38.us-east-2.compute.internal Ready worker 178m v1.23.12+8a6bfe4
ip-10-0-21-244.us-east-2.compute.internal Ready worker 178m v1.23.12+8a6bfe4
ip-10-0-21-42.us-east-2.compute.internal Ready master 3h5m v1.23.12+8a6bfe4
ip-10-0-21-90.us-east-2.compute.internal Ready infra,worker 162m v1.23.12+8a6bfe4
ip-10-0-22-40.us-east-2.compute.internal Ready worker 178m v1.23.12+8a6bfe4
must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-f6-s20-pr/jijoy-f6-s20-pr_20230206T073026/logs/deployment_1675674405/
Can this be closed now with the new image that engineering provided? This issue was fixed by https://github.com/red-hat-storage/ocs-osd-deployer/pull/280.
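As a verification sketch for the new deployer image, the resource requests that end up on the crashcollector deployments can be read back directly. The label selector below is an assumption about how Rook labels these deployments, and the expected request values are not stated in this bug:
$ oc get deployments -l app=rook-ceph-crashcollector -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].resources}{"\n"}{end}'
$ oc get pods | grep crashcollector | grep -v Running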
Description of problem:
rook-ceph-crashcollector pod is in Pending state with event: 0/12 nodes are available: 1 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector.
Version-Release number of selected component (if applicable):
ocs-operator.v4.10.8
ocs-osd-deployer.v2.0.11
How reproducible:
1/1
Steps to Reproduce:
1. Install a setup with size 20 TiB.
2. Check that all pods in openshift-storage are in a healthy state.
Actual results:
rook-ceph-crashcollector pod is in Pending state with an event informing that there are insufficient resources.
Expected results:
All pods should be Ready or Completed.
Additional info:
must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-20s-qe-pr/jijoy-20s-qe-pr_20230203T045602/logs/testcases_1675421592/
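For step 2 of the reproduction, a quick way to confirm that nothing in openshift-storage is stuck is to filter out the healthy pod phases. This is an illustrative command, not part of the original report:
$ oc -n openshift-storage get pods --no-headers | grep -Ev "Running|Completed"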