Bug 2166915 - rook-ceph-crashcollector pods on a provider cluster with size 20 in Pending state: Insufficient cpu and memory
Summary: rook-ceph-crashcollector pods on a provider cluster with size 20 in Pending state: Insufficient cpu and memory
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ohad
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-03 13:45 UTC by Filip Balák
Modified: 2023-08-09 17:00 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
System ID: Red Hat Bugzilla 2166900
Private: 0
Priority: unspecified
Status: CLOSED
Summary: RBD PVCs are not working with 8 TiB and 20 TiB clusters
Last Updated: 2023-08-09 17:00:26 UTC

Description Filip Balák 2023-02-03 13:45:14 UTC
Description of problem:
A rook-ceph-crashcollector pod is in Pending state with the following event:

0/12 nodes are available: 1 Insufficient cpu, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector.
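
To narrow down the failure, the Pending pods and their scheduling events can be listed directly. A minimal sketch, assuming the openshift-storage namespace and using a placeholder pod name:

$ oc get pods -n openshift-storage --field-selector=status.phase=Pending -o wide
$ oc describe pod <pending-pod-name> -n openshift-storage | grep -A 15 "Events:"
# The FailedScheduling events show why each group of nodes was rejected
# (insufficient resources, taints, or node affinity/selector mismatch).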

Version-Release number of selected component (if applicable):
ocs-operator.v4.10.8
ocs-osd-deployer.v2.0.11

How reproducible:
1/1

Steps to Reproduce:
1. Install a provider cluster with size 20 TiB.
2. Check that all pods in openshift-storage are in a healthy state (see the check sketched below).
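
A quick way to perform the check in step 2, sketched under the assumption that healthy pods report either Running or Completed:

$ oc get pods -n openshift-storage --no-headers | grep -Ev 'Running|Completed'
# No output means every pod is healthy; any printed lines (Pending, CrashLoopBackOff, Error, ...) need investigation.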

Actual results:
The rook-ceph-crashcollector pod is in Pending state with an event indicating that there are insufficient resources.

Expected results:
All pods should be in Ready or Completed state.

Additional info:
must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-20s-qe-pr/jijoy-20s-qe-pr_20230203T045602/logs/testcases_1675421592/

Comment 2 Jilju Joy 2023-02-06 10:59:11 UTC
In a recent deployment it was observed that the issue is not only insufficient CPU but also insufficient memory. Two pods were in Pending state due to insufficient CPU and memory.

$ oc get pods -o wide | grep "Pending"
rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c   0/1     Pending     0             150m   <none>        <none>                                      <none>           <none>
rook-ceph-crashcollector-f60865df61f4dc2103724c18d5f5b65e-qx84n   0/1     Pending     0             148m   <none>        <none>                                      <none>           <none>



$ oc describe pod rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c | grep "Events:" -A 30
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  147m                  default-scheduler  0/12 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  147m                  default-scheduler  0/12 nodes are available: 1 Insufficient memory, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  147m (x2 over 147m)   default-scheduler  0/12 nodes are available: 1 Insufficient memory, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  146m (x4 over 147m)   default-scheduler  0/12 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  143m (x10 over 146m)  default-scheduler  0/12 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 2 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  141m (x3 over 143m)   default-scheduler  0/12 nodes are available: 1 Insufficient memory, 1 node(s) were unschedulable, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  30m (x140 over 141m)  default-scheduler  0/12 nodes are available: 1 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector.
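
To relate these events to the crash collector's own resource requests, the requests can be dumped from the deployment spec. A rough sketch, assuming Rook labels the crash collector deployments with app=rook-ceph-crashcollector:

$ oc get deployment -n openshift-storage -l app=rook-ceph-crashcollector \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].resources.requests}{"\n"}{end}'
# Prints each crash collector deployment with its CPU/memory requests, which the
# scheduler compares against each node's remaining allocatable capacity.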




$ oc describe pod rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c | grep node_name
                node_name=ip-10-0-14-194.us-east-2.compute.internal




Pods running on the node where the pod rook-ceph-crashcollector-3029f20177d9a592b0c4f30821ac602f-mks6c is trying to come up:

$ oc get pods -o wide | grep ip-10-0-14-194.us-east-2.compute.internal
rook-ceph-mgr-a-578f57f87d-d6cgb                                  2/2     Running     0             150m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-54564cd9cc-sdgsw                                  2/2     Running     0             152m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-12-85f55945d6-cpdps                                 2/2     Running     0             148m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-3-c5985f568-xs8q7                                   2/2     Running     0             149m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-4-667f88675d-tbjrz                                  2/2     Running     0             149m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-0-data-0zb589-wpbjw                 0/1     Completed   0             150m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-1-data-3ccnsk-pw4zm                 0/1     Completed   0             149m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-2-data-4drz6r-lzlfj                 0/1     Completed   0             149m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>           <none>
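
To see how much allocatable CPU and memory is actually left on that node after the mgr, mon and OSD pods above, its allocated-resources summary can be inspected. A small sketch using the node name from the output above (oc adm top requires the metrics API to be available):

$ oc describe node ip-10-0-14-194.us-east-2.compute.internal | grep -A 12 "Allocated resources"
$ oc adm top node ip-10-0-14-194.us-east-2.compute.internal
# "Allocated resources" shows requests vs. allocatable; if CPU or memory requests are
# already close to 100%, the crash collector pod cannot be scheduled on this node.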


All the nodes are in Ready state.
$ oc get nodes
NAME                                        STATUS   ROLES          AGE    VERSION
ip-10-0-12-93.us-east-2.compute.internal    Ready    master         3h5m   v1.23.12+8a6bfe4
ip-10-0-14-194.us-east-2.compute.internal   Ready    worker         178m   v1.23.12+8a6bfe4
ip-10-0-14-24.us-east-2.compute.internal    Ready    worker         178m   v1.23.12+8a6bfe4
ip-10-0-15-163.us-east-2.compute.internal   Ready    infra,worker   162m   v1.23.12+8a6bfe4
ip-10-0-17-135.us-east-2.compute.internal   Ready    master         3h5m   v1.23.12+8a6bfe4
ip-10-0-17-34.us-east-2.compute.internal    Ready    worker         178m   v1.23.12+8a6bfe4
ip-10-0-18-145.us-east-2.compute.internal   Ready    infra,worker   162m   v1.23.12+8a6bfe4
ip-10-0-19-38.us-east-2.compute.internal    Ready    worker         178m   v1.23.12+8a6bfe4
ip-10-0-21-244.us-east-2.compute.internal   Ready    worker         178m   v1.23.12+8a6bfe4
ip-10-0-21-42.us-east-2.compute.internal    Ready    master         3h5m   v1.23.12+8a6bfe4
ip-10-0-21-90.us-east-2.compute.internal    Ready    infra,worker   162m   v1.23.12+8a6bfe4
ip-10-0-22-40.us-east-2.compute.internal    Ready    worker         178m   v1.23.12+8a6bfe4
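
The scheduler messages also mention taints on the master and infra nodes; those can be listed per node to confirm which nodes the crash collector is allowed to land on. A sketch using standard oc output options:

$ oc get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
$ oc describe node ip-10-0-15-163.us-east-2.compute.internal | grep -i -A 2 "Taints"
# Worker nodes should show no taints (<none>), while master and infra nodes carry the
# node-role taints reported in the FailedScheduling events.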


must-gather logs : http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-f6-s20-pr/jijoy-f6-s20-pr_20230206T073026/logs/deployment_1675674405/

Comment 3 Chris Blum 2023-02-10 10:23:21 UTC
Can this be closed now with the new image that engineering provided?

Comment 4 Rohan Gupta 2023-03-27 12:29:32 UTC
This issue was fixed by https://github.com/red-hat-storage/ocs-osd-deployer/pull/280
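
Once a build containing that deployer change is installed, the fix can be verified by checking that the crash collector pods schedule and run; a sketch, again assuming the app=rook-ceph-crashcollector label:

$ oc get pods -n openshift-storage -l app=rook-ceph-crashcollector -o wide
# Every pod should be Running with a node assigned, and oc describe should show no
# remaining FailedScheduling events.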

