Description of problem:

rook-ceph-crashcollector pod is in Pending state with event:

  0/12 nodes are available: 1 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector.

Version-Release number of selected component (if applicable):
ocs-operator.v4.10.8
ocs-osd-deployer.v2.0.11

How reproducible:
1/1

Steps to Reproduce:
1. Install a setup with size 20 TiB.
2. Check that all pods in openshift-storage are in a healthy state.

Actual results:
rook-ceph-crashcollector pod is in Pending state with an event informing that there are insufficient resources.

Expected results:
All pods should be Ready or Completed.

Additional info:
must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-20s-qe-pr/jijoy-20s-qe-pr_20230203T045602/logs/testcases_1675421592/
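For triage, the scheduler's "Insufficient cpu" claim can be cross-checked against the candidate node. A minimal sketch, assuming the standard openshift-storage namespace; <node> and <crashcollector-pod> are placeholders for the names in your own events:

# How much CPU/memory is already committed on the candidate node
$ oc describe node <node> | grep -A 8 "Allocated resources"

# What the pending crashcollector pod is requesting per container
$ oc -n openshift-storage get pod <crashcollector-pod> \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\n"}{end}'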
In a recent deployment it was observed that the issue is not only insufficient CPU but insufficient memory as well. Two crashcollector pods were in Pending state due to insufficient CPU and memory.

$ oc get pods -o wide | grep "Pending"
rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c   0/1   Pending   0   150m   <none>   <none>   <none>   <none>
rook-ceph-crashcollector-f60865df61f4dc2103724c18d5f5b65e-qx84n   0/1   Pending   0   148m   <none>   <none>   <none>   <none>

$ oc describe pod rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c | grep "Events:" -A 30
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  147m                  default-scheduler  0/12 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  147m                  default-scheduler  0/12 nodes are available: 1 Insufficient memory, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  147m (x2 over 147m)   default-scheduler  0/12 nodes are available: 1 Insufficient memory, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  146m (x4 over 147m)   default-scheduler  0/12 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  143m (x10 over 146m)  default-scheduler  0/12 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  141m (x3 over 143m)   default-scheduler  0/12 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  30m (x140 over 141m)  default-scheduler  0/12 nodes are available: 1 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector.

$ oc describe pod rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c | grep node_name
node_name=ip-10-0-14-194.us-east-2.compute.internal

Pods running on the node where the pod rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c is trying to come up:
$ oc get pods -o wide | grep ip-10-0-14-194.us-east-2.compute.internal
rook-ceph-mgr-a-578f57f87d-d6cgb                    2/2   Running     0   150m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-54564cd9cc-sdgsw                    2/2   Running     0   152m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-12-85f55945d6-cpdps                   2/2   Running     0   148m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-3-c5985f568-xs8q7                     2/2   Running     0   149m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-4-667f88675d-tbjrz                    2/2   Running     0   149m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-prepare-default-0-data-0zb589-wpbjw   0/1   Completed   0   150m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-prepare-default-1-data-3ccnsk-pw4zm   0/1   Completed   0   149m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-prepare-default-2-data-4drz6r-lzlfj   0/1   Completed   0   149m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>   <none>

All the nodes are in Ready state.

$ oc get nodes
NAME                                        STATUS   ROLES          AGE    VERSION
ip-10-0-12-93.us-east-2.compute.internal    Ready    master         3h5m   v1.23.12+8a6bfe4
ip-10-0-14-194.us-east-2.compute.internal   Ready    worker         178m   v1.23.12+8a6bfe4
ip-10-0-14-24.us-east-2.compute.internal    Ready    worker         178m   v1.23.12+8a6bfe4
ip-10-0-15-163.us-east-2.compute.internal   Ready    infra,worker   162m   v1.23.12+8a6bfe4
ip-10-0-17-135.us-east-2.compute.internal   Ready    master         3h5m   v1.23.12+8a6bfe4
ip-10-0-17-34.us-east-2.compute.internal    Ready    worker         178m   v1.23.12+8a6bfe4
ip-10-0-18-145.us-east-2.compute.internal   Ready    infra,worker   162m   v1.23.12+8a6bfe4
ip-10-0-19-38.us-east-2.compute.internal    Ready    worker         178m   v1.23.12+8a6bfe4
ip-10-0-21-244.us-east-2.compute.internal   Ready    worker         178m   v1.23.12+8a6bfe4
ip-10-0-21-42.us-east-2.compute.internal    Ready    master         3h5m   v1.23.12+8a6bfe4
ip-10-0-21-90.us-east-2.compute.internal    Ready    infra,worker   162m   v1.23.12+8a6bfe4
ip-10-0-22-40.us-east-2.compute.internal    Ready    worker         178m   v1.23.12+8a6bfe4

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-f6-s20-pr/jijoy-f6-s20-pr_20230206T073026/logs/deployment_1675674405/
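A note on why the pod cannot simply land on another worker: Rook creates one crashcollector deployment per node that runs a Ceph daemon and pins it to that node with a kubernetes.io/hostname nodeSelector, so the pod stays Pending until the pinned node has enough unreserved CPU/memory or the crashcollector's requests are reduced. For reference, the crash daemon's requests/limits are tunable on the Rook CephCluster CR; a minimal sketch, assuming the typical ODF CephCluster name (in ODF/managed deployments the CephCluster is reconciled by ocs-operator, so in practice these values have to come from the operator/deployer rather than a manual edit, and the numbers below are illustrative only):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster   # typical ODF name; assumed here
  namespace: openshift-storage
spec:
  resources:
    crashcollector:                      # resource profile for the crash daemon pods
      requests:
        cpu: "50m"                       # illustrative values only
        memory: "80Mi"
      limits:
        cpu: "100m"
        memory: "160Mi"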
Can this be closed now with the new image that engineering provided?
This issue was fixed by https://github.com/red-hat-storage/ocs-osd-deployer/pull/280
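After rolling out a build that contains the fix, the crashcollector pods should all schedule and report Running; a quick sanity check (assuming the standard app=rook-ceph-crashcollector label):

$ oc -n openshift-storage get pods -l app=rook-ceph-crashcollector -o wide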