Bug 2133041 - Pod rook-ceph-crashcollector did not reach Running state due to Insufficient memory.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Leela Venkaiah Gangavarapu
QA Contact: Jilju Joy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-10-07 15:50 UTC by Jilju Joy
Modified: 2023-08-09 17:00 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-14 15:35:53 UTC
Embargoed:



Description Jilju Joy 2022-10-07 15:50:03 UTC
Description of problem:
Pod rook-ceph-crashcollector did not reach Running state due to Insufficient memory.
Testing was done using the addon 'ocs-provider-dev', which contains changes related to ODFMS-55. The size parameter was set to 4.


$ oc describe pod rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b8b9l6f
Name:           rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b8b9l6f
Namespace:      openshift-storage
Priority:       0
Node:           <none>
Labels:         app=rook-ceph-crashcollector
                ceph-version=16.2.7-126
                ceph_daemon_id=crash
                crashcollector=crash
                kubernetes.io/hostname=ip-10-0-150-29.ec2.internal
                node_name=ip-10-0-150-29.ec2.internal
                pod-template-hash=5b7b8876b4
                rook-version=v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
                rook_cluster=openshift-storage
Annotations:    openshift.io/scc: rook-ceph
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b8876b4
Init Containers:
  make-container-crash-dir:
    Image:      registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
    Port:       <none>
    Host Port:  <none>
    Command:
      mkdir
      -p
    Args:
      /var/lib/ceph/crash/posted
    Limits:
      cpu:     50m
      memory:  60Mi
    Requests:
      cpu:        50m
      memory:     60Mi
    Environment:  <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qfss5 (ro)
  chown-container-data-dir:
    Image:      registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
    Port:       <none>
    Host Port:  <none>
    Command:
      chown
    Args:
      --verbose
      --recursive
      ceph:ceph
      /var/log/ceph
      /var/lib/ceph/crash
    Limits:
      cpu:     50m
      memory:  60Mi
    Requests:
      cpu:        50m
      memory:     60Mi
    Environment:  <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qfss5 (ro)
Containers:
  ceph-crash:
    Image:      registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
    Port:       <none>
    Host Port:  <none>
    Command:
      ceph-crash
    Limits:
      cpu:     50m
      memory:  60Mi
    Requests:
      cpu:     50m
      memory:  60Mi
    Environment:
      CONTAINER_IMAGE:                registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
      POD_NAME:                       rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b8b9l6f (v1:metadata.name)
      POD_NAMESPACE:                  openshift-storage (v1:metadata.namespace)
      NODE_NAME:                       (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               62914560 (limits.memory)
      POD_MEMORY_REQUEST:             62914560 (requests.memory)
      POD_CPU_LIMIT:                  1 (limits.cpu)
      POD_CPU_REQUEST:                1 (requests.cpu)
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>             Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>  Optional: false
      CEPH_ARGS:                      -m $(ROOK_CEPH_MON_HOST) -k /etc/ceph/crash-collector-keyring-store/keyring
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/crash-collector-keyring-store/ from rook-ceph-crash-collector-keyring (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qfss5 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  rook-config-override:
    Type:               Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:      rook-config-override
    ConfigMapOptional:  <nil>
  rook-ceph-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/openshift-storage/log
    HostPathType:  
  rook-ceph-crash:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/openshift-storage/crash
    HostPathType:  
  rook-ceph-crash-collector-keyring:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-crash-collector-keyring
    Optional:    false
  kube-api-access-qfss5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Guaranteed
Node-Selectors:              kubernetes.io/hostname=ip-10-0-150-29.ec2.internal
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
                             node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  27m (x839 over 11h)  default-scheduler  0/12 nodes are available: 1 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate.
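
For reference, one way to quantify the shortfall is to compare the node's allocatable resources with its currently allocated requests (a sketch; the node name is taken from the pod's node selector above, and the section names are those printed by oc/kubectl describe node):

$ oc describe node ip-10-0-150-29.ec2.internal | sed -n '/Allocatable:/,/System Info:/p'
$ oc describe node ip-10-0-150-29.ec2.internal | sed -n '/Allocated resources:/,/Events:/p'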


Pod list (taken after deleting the pod rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b8b9l6f while trying out a workaround):

$ oc get pods -o wide
NAME                                                              READY   STATUS      RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
addon-ocs-provider-dev-catalog-whqhc                              1/1     Running     0          12h   10.128.2.20    ip-10-0-136-1.ec2.internal     <none>           <none>
alertmanager-managed-ocs-alertmanager-0                           2/2     Running     0          12h   10.128.2.15    ip-10-0-136-1.ec2.internal     <none>           <none>
alertmanager-managed-ocs-alertmanager-1                           2/2     Running     0          12h   10.131.0.16    ip-10-0-150-29.ec2.internal    <none>           <none>
alertmanager-managed-ocs-alertmanager-2                           2/2     Running     0          12h   10.128.2.16    ip-10-0-136-1.ec2.internal     <none>           <none>
csi-addons-controller-manager-b8b965868-szb6r                     2/2     Running     0          12h   10.128.2.19    ip-10-0-136-1.ec2.internal     <none>           <none>
ocs-metrics-exporter-577574796b-rfgs8                             1/1     Running     0          12h   10.128.2.14    ip-10-0-136-1.ec2.internal     <none>           <none>
ocs-operator-5c77756ddd-mfhrj                                     1/1     Running     0          12h   10.131.0.10    ip-10-0-150-29.ec2.internal    <none>           <none>
ocs-osd-aws-data-gather-54b4bd7d6c-z9f4g                          1/1     Running     0          12h   10.0.150.29    ip-10-0-150-29.ec2.internal    <none>           <none>
ocs-osd-controller-manager-66dc698b5b-4gjkk                       3/3     Running     0          12h   10.131.0.8     ip-10-0-150-29.ec2.internal    <none>           <none>
ocs-provider-server-6f888bbffb-d8wv6                              1/1     Running     0          12h   10.131.0.31    ip-10-0-150-29.ec2.internal    <none>           <none>
odf-console-585db6ddb-q7pdn                                       1/1     Running     0          12h   10.128.2.17    ip-10-0-136-1.ec2.internal     <none>           <none>
odf-operator-controller-manager-7866b5fdbb-5gg4b                  2/2     Running     0          12h   10.128.2.12    ip-10-0-136-1.ec2.internal     <none>           <none>
prometheus-managed-ocs-prometheus-0                               3/3     Running     0          12h   10.129.2.8     ip-10-0-170-254.ec2.internal   <none>           <none>
prometheus-operator-8547cc9f89-sq584                              1/1     Running     0          12h   10.128.2.18    ip-10-0-136-1.ec2.internal     <none>           <none>
rook-ceph-crashcollector-ip-10-0-136-1.ec2.internal-5cc5bdtq6fw   1/1     Running     0          12h   10.0.136.1     ip-10-0-136-1.ec2.internal     <none>           <none>
rook-ceph-crashcollector-ip-10-0-142-131.ec2.internal-7b9dhr7nr   1/1     Running     0          11h   10.0.142.131   ip-10-0-142-131.ec2.internal   <none>           <none>
rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b88bzv7   0/1     Pending     0          19s   <none>         <none>                         <none>           <none>
rook-ceph-crashcollector-ip-10-0-152-20.ec2.internal-8679b7k67v   1/1     Running     0          12h   10.0.152.20    ip-10-0-152-20.ec2.internal    <none>           <none>
rook-ceph-crashcollector-ip-10-0-170-254.ec2.internal-76c7hpqwj   1/1     Running     0          12h   10.0.170.254   ip-10-0-170-254.ec2.internal   <none>           <none>
rook-ceph-crashcollector-ip-10-0-171-253.ec2.internal-65c66d8wf   1/1     Running     0          12h   10.0.171.253   ip-10-0-171-253.ec2.internal   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58cfc967g8r5z   2/2     Running     0          12h   10.0.150.29    ip-10-0-150-29.ec2.internal    <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-b574cd94g4cdz   2/2     Running     0          12h   10.0.170.254   ip-10-0-170-254.ec2.internal   <none>           <none>
rook-ceph-mgr-a-c999476d6-qz958                                   2/2     Running     0          12h   10.0.136.1     ip-10-0-136-1.ec2.internal     <none>           <none>
rook-ceph-mon-a-5fddc4774-mb9g7                                   2/2     Running     0          12h   10.0.170.254   ip-10-0-170-254.ec2.internal   <none>           <none>
rook-ceph-mon-b-7846f47ddb-xcp5z                                  2/2     Running     0          12h   10.0.136.1     ip-10-0-136-1.ec2.internal     <none>           <none>
rook-ceph-mon-c-6f4c85457b-dncgl                                  2/2     Running     0          12h   10.0.150.29    ip-10-0-150-29.ec2.internal    <none>           <none>
rook-ceph-operator-564cb5cb98-gcj8n                               1/1     Running     0          12h   10.128.2.13    ip-10-0-136-1.ec2.internal     <none>           <none>
rook-ceph-osd-0-9f8488c5-w72gx                                    2/2     Running     0          12h   10.0.142.131   ip-10-0-142-131.ec2.internal   <none>           <none>
rook-ceph-osd-2-564c6d49c8-6ffmd                                  2/2     Running     0          12h   10.0.171.253   ip-10-0-171-253.ec2.internal   <none>           <none>
rook-ceph-osd-3-fd4cf4999-24n56                                   2/2     Running     0          12h   10.0.152.20    ip-10-0-152-20.ec2.internal    <none>           <none>
rook-ceph-osd-prepare-default-1-data-0p86j8-6k785                 0/1     Completed   0          12h   10.0.152.20    ip-10-0-152-20.ec2.internal    <none>           <none>
rook-ceph-tools-787676bdbd-z25ps                                  1/1     Running     0          82s   10.0.170.254   ip-10-0-170-254.ec2.internal   <none>           <none>


As a workaround, the ocs-operator pod running on node ip-10-0-150-29.ec2.internal was deleted. A new ocs-operator pod was created on a different node, and the pod rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal then reached Running state.
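
A minimal sketch of that workaround, assuming the ocs-operator pods carry the name=ocs-operator label (verify with --show-labels first before deleting anything):

$ oc -n openshift-storage get pods -l name=ocs-operator --show-labels
$ oc -n openshift-storage delete pod -l name=ocs-operator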

$ oc get  nodes
NAME                           STATUS   ROLES          AGE   VERSION
ip-10-0-130-155.ec2.internal   Ready    worker         9h    v1.23.5+8471591
ip-10-0-136-1.ec2.internal     Ready    worker         32h   v1.23.5+8471591
ip-10-0-138-49.ec2.internal    Ready    infra,worker   32h   v1.23.5+8471591
ip-10-0-140-159.ec2.internal   Ready    master         32h   v1.23.5+8471591
ip-10-0-150-29.ec2.internal    Ready    worker         32h   v1.23.5+8471591
ip-10-0-152-20.ec2.internal    Ready    worker         32h   v1.23.5+8471591
ip-10-0-155-226.ec2.internal   Ready    master         32h   v1.23.5+8471591
ip-10-0-157-194.ec2.internal   Ready    infra,worker   32h   v1.23.5+8471591
ip-10-0-169-109.ec2.internal   Ready    master         32h   v1.23.5+8471591
ip-10-0-170-254.ec2.internal   Ready    worker         32h   v1.23.5+8471591
ip-10-0-171-253.ec2.internal   Ready    worker         32h   v1.23.5+8471591
ip-10-0-175-139.ec2.internal   Ready    infra,worker   32h   v1.23.5+8471591



must-gather logs : http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-4ti4b-pr/jijoy-4ti4b-pr_20221006T043953/logs/failed_testcase_ocs_logs_1665073038/test_rolling_nodes_restart%5bworker%5d_ocs_logs/ocs_must_gather/
=========================================================================================

Version-Release number of selected component (if applicable):
$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.6                      NooBaa Operator               4.10.6            mcg-operator.v4.10.5                      Succeeded
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded
ocs-osd-deployer.v2.0.8                   OCS OSD Deployer              2.0.8                                                       Succeeded
odf-csi-addons-operator.v4.10.5           CSI Addons                    4.10.5            odf-csi-addons-operator.v4.10.4           Succeeded
odf-operator.v4.10.5                      OpenShift Data Foundation     4.10.5            odf-operator.v4.10.4                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.422-151be96   Route Monitor Operator        0.1.422-151be96   route-monitor-operator.v0.1.420-b65f47e   Succeeded

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.34   True        False         32h     Cluster version is 4.10.34
======================================================================================

How reproducible:
Observed 2 times in 3 deployment attempts.

Steps to Reproduce:
1. Install provider cluster with dev addon:
e.g. rosa create service --type ocs-provider-dev --name jijoy-4ti4b-pr --machine-cidr 10.0.0.0/16 --size 4 --onboarding-validation-key <key> --subnet-ids <subnet-ids> --region us-east-1
2. Verify that all rook-ceph-crashcollector pods are in Running state (a quick check is sketched below).
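
A quick check for step 2, using the app=rook-ceph-crashcollector label visible in the describe output above (a sketch):

$ oc -n openshift-storage get pods -l app=rook-ceph-crashcollector -o wide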

=========================================================================================
Actual results:
One of the rook-ceph-crashcollector pods is in Pending state.

Expected results:
All rook-ceph-crashcollector pods should be running.


Additional info:

Comment 1 Leela Venkaiah Gangavarapu 2022-10-10 11:33:58 UTC
hi,

- Generally, addon-operator workloads shouldn't be running on worker nodes.
- An issue was raised with MT-SRE to change their scheduling so that it does not consume resources from ODF-specific worker nodes; it is being tracked at https://issues.redhat.com/browse/MTSRE-714
- Until that Jira issue is completed, we'll keep seeing the crashcollector (or any other pod) going into Pending state intermittently on ODF Control Plane nodes. If you see the same issue on ODF Data Plane nodes, that would be a blocker.

workaround:
- run "kubectl get po -l 'app.kubernetes.io/name in (addon-operator,addon-operator-webhook-server)' --no-headers -nopenshift-addon-operator -oname | xargs -I {} kubectl delete {} -nopenshift-addon-operator"
- this command bounces the addon-operator pods to infra nodes, where resources will definitely be available; it is not harmful
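- to confirm where the bounced pods land afterwards, a quick check (sketch) is: "kubectl get pods -nopenshift-addon-operator -owide"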


thanks,
leela.

Comment 2 Leela Venkaiah Gangavarapu 2022-10-13 03:06:09 UTC
hi,

- The work for the referenced Jira issue has been pushed to stage, so there shouldn't be a resource crunch on the ODF Control Plane.
- The output of the commands below should be empty:

worker=($(kubectl get machines -nopenshift-machine-api -l machine.openshift.io/cluster-api-machine-role=worker -ojsonpath='{.items[?(@.status.phase=="Running")].status.nodeRef.name}'))

for node in ${worker[@]}; do echo $node; kubectl get pods --field-selector=spec.nodeName=$node,status.phase=Running -nopenshift-addon-operator --no-headers -oname; done;

thanks,
leela.

Comment 3 Jilju Joy 2022-10-14 14:36:52 UTC
Hi Leela,

Which version contains the fix?

The output of the commands given in comment #2 is not empty.

$ worker=($(kubectl get machines -nopenshift-machine-api -l machine.openshift.io/cluster-api-machine-role=worker -ojsonpath='{.items[?(@.status.phase=="Running")].status.nodeRef.name}'))

$ for node in ${worker[@]}; do echo $node; kubectl get pods --field-selector=spec.nodeName=$node,status.phase=Running -nopenshift-addon-operator --no-headers -oname; done;
ip-10-0-133-96.ec2.internal
pod/addon-operator-catalog-bw6zl
ip-10-0-159-32.ec2.internal
pod/addon-operator-webhooks-7bdc97545-xdjw8
ip-10-0-171-179.ec2.internal



Pod rook-ceph-crashcollector-ip-10-0-159-32.ec2.internal-57f9dgk2mm is in Pending state due to insufficient cpu.

$ oc describe pod rook-ceph-crashcollector-ip-10-0-159-32.ec2.internal-57f9dgk2mm
Name:           rook-ceph-crashcollector-ip-10-0-159-32.ec2.internal-57f9dgk2mm
Namespace:      openshift-storage
Priority:       0
Node:           <none>
Labels:         app=rook-ceph-crashcollector
                ceph-version=16.2.7-126
                ceph_daemon_id=crash
                crashcollector=crash
                kubernetes.io/hostname=ip-10-0-159-32.ec2.internal
                node_name=ip-10-0-159-32.ec2.internal
                pod-template-hash=57f9d6748f
                rook-version=v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63
                rook_cluster=openshift-storage
Annotations:    openshift.io/scc: rook-ceph
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/rook-ceph-crashcollector-ip-10-0-159-32.ec2.internal-57f9d6748f
Init Containers:
  make-container-crash-dir:
    Image:      registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
    Port:       <none>
    Host Port:  <none>
    Command:
      mkdir
      -p
    Args:
      /var/lib/ceph/crash/posted
    Limits:
      cpu:     50m
      memory:  60Mi
    Requests:
      cpu:        50m
      memory:     60Mi
    Environment:  <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vmlzs (ro)
  chown-container-data-dir:
    Image:      registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
    Port:       <none>
    Host Port:  <none>
    Command:
      chown
    Args:
      --verbose
      --recursive
      ceph:ceph
      /var/log/ceph
      /var/lib/ceph/crash
    Limits:
      cpu:     50m
      memory:  60Mi
    Requests:
      cpu:        50m
      memory:     60Mi
    Environment:  <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vmlzs (ro)
Containers:
  ceph-crash:
    Image:      registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
    Port:       <none>
    Host Port:  <none>
    Command:
      ceph-crash
    Limits:
      cpu:     50m
      memory:  60Mi
    Requests:
      cpu:     50m
      memory:  60Mi
    Environment:
      CONTAINER_IMAGE:                registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
      POD_NAME:                       rook-ceph-crashcollector-ip-10-0-159-32.ec2.internal-57f9dgk2mm (v1:metadata.name)
      POD_NAMESPACE:                  openshift-storage (v1:metadata.namespace)
      NODE_NAME:                       (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               62914560 (limits.memory)
      POD_MEMORY_REQUEST:             62914560 (requests.memory)
      POD_CPU_LIMIT:                  1 (limits.cpu)
      POD_CPU_REQUEST:                1 (requests.cpu)
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>             Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>  Optional: false
      CEPH_ARGS:                      -m $(ROOK_CEPH_MON_HOST) -k /etc/ceph/crash-collector-keyring-store/keyring
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/crash-collector-keyring-store/ from rook-ceph-crash-collector-keyring (ro)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vmlzs (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  rook-config-override:
    Type:               Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:      rook-config-override
    ConfigMapOptional:  <nil>
  rook-ceph-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/openshift-storage/log
    HostPathType:  
  rook-ceph-crash:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/openshift-storage/crash
    HostPathType:  
  rook-ceph-crash-collector-keyring:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-crash-collector-keyring
    Optional:    false
  kube-api-access-vmlzs:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Guaranteed
Node-Selectors:              kubernetes.io/hostname=ip-10-0-159-32.ec2.internal
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
                             node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  5m24s (x627 over 8h)  default-scheduler  0/12 nodes are available: 1 Insufficient cpu, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate.
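
To see which pods are holding CPU requests on that node, the node's non-terminated pods table can be listed (a sketch; node name taken from the pod's node selector above):

$ oc describe node ip-10-0-159-32.ec2.internal | sed -n '/Non-terminated Pods:/,/Allocated resources:/p'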



must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-o14-pr/jijoy-o14-pr_20221014T041828/logs/testcases_1665757256/


============================================================

Version:
$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.6                      NooBaa Operator               4.10.6            mcg-operator.v4.10.5                      Succeeded
ocs-operator.v4.10.5                      OpenShift Container Storage   4.10.5            ocs-operator.v4.10.4                      Succeeded
ocs-osd-deployer.v2.0.8                   OCS OSD Deployer              2.0.8                                                       Succeeded
odf-csi-addons-operator.v4.10.5           CSI Addons                    4.10.5            odf-csi-addons-operator.v4.10.4           Succeeded
odf-operator.v4.10.5                      OpenShift Data Foundation     4.10.5            odf-operator.v4.10.4                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.450-6e98c37   Route Monitor Operator        0.1.450-6e98c37   route-monitor-operator.v0.1.448-b25b8ee   Succeeded

$ oc get csv ocs-osd-deployer.v2.0.8 -o yaml | grep image:
                image: quay.io/osd-addons/ocs-osd-deployer:5ca1ab1e
                image: registry.redhat.io/openshift4/ose-kube-rbac-proxy:v4.11.0-202209130958.p0.ga805ba5.assembly.stream
                image: quay.io/osd-addons/ocs-osd-deployer:5ca1ab1e
                image: quay.io/osd-addons/ocs-osd-deployer:5ca1ab1e


$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.35   True        False         9h      Cluster version is 4.10.35

Comment 4 Leela Venkaiah Gangavarapu 2022-10-27 13:36:59 UTC
- As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2133041#c1, the referenced Jira issue https://issues.redhat.com/browse/MTSRE-714 was worked on by MT-SRE and has been moved to Platform SRE for changes from their side.
- Until SRE-P completes the actions on their side, this issue will be hit intermittently; a (tricky) workaround exists if it is hit, so this shouldn't be a blocker for testing AFAIK.

Comment 5 Leela Venkaiah Gangavarapu 2022-11-21 09:21:12 UTC
- By 2022-11-08 the build with the fix was delivered to QE

Comment 6 Jilju Joy 2022-12-23 10:21:39 UTC
Tested in a 4 TiB cluster. One rook-ceph-crashcollector pod is in Pending state due to insufficient CPU and memory.


$ oc get pods -o wide | grep -v 'Running\|Completed'
NAME                                                              READY   STATUS      RESTARTS   AGE     IP             NODE                           NOMINATED NODE   READINESS GATES
rook-ceph-crashcollector-ip-10-0-148-170.ec2.internal-7d4486f6b   0/1     Pending     0          4h45m   <none>         <none>                         <none>           <none>


Events:
  Type     Reason            Age                      From               Message
  ----     ------            ----                     ----               -------
  Warning  FailedScheduling  5m52s (x318 over 4h21m)  default-scheduler  0/12 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate.


The output of the commands given in comment #2 is not empty.
$ worker=($(kubectl get machines -nopenshift-machine-api -l machine.openshift.io/cluster-api-machine-role=worker -ojsonpath='{.items[?(@.status.phase=="Running")].status.nodeRef.name}'))
$ for node in ${worker[@]}; do echo $node; kubectl get pods --field-selector=spec.nodeName=$node,status.phase=Running -nopenshift-addon-operator --no-headers -oname; done;
ip-10-0-136-247.ec2.internal
ip-10-0-148-170.ec2.internal
ip-10-0-171-222.ec2.internal
pod/addon-operator-catalog-hzqzd

--------------------------------------------------------


Tested in version:
$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.17            Observability Operator        0.0.17            observability-operator.v0.0.17-rc         Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11            ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.451-3df1ed1   Route Monitor Operator        0.1.451-3df1ed1   route-monitor-operator.v0.1.450-6e98c37   Succeeded


$ oc get storagecluster
NAME                 AGE     PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   4h57m   Ready              2022-12-23T05:14:14Z   

@Leela

If the cause is https://issues.redhat.com/browse/MTSRE-714, can we move this bug back to the ASSIGNED state and test after the Jira issue is fixed?

Comment 7 Leela Venkaiah Gangavarapu 2022-12-27 08:13:03 UTC
> MTSRE-714
- Not related to this, as we asked MT-SRE not to run addon-operators on worker nodes.
- However, from the listing we can see the addon catalog operator is running on a worker node (cpu: 10m, memory: 50Mi).
- Our crash-collector requires (cpu: 50m, memory: 80Mi), and we may be left with only (cpu < 50m and memory < 80Mi) on that node because of the addon catalog operator.
- In the worst case we would need to reach out to MT-SRE again to not have anything running on worker nodes, and at the same time that may not be a straightforward fix from their side.
- Before reaching out, I also need confirmation that all nodes were schedulable when the above issue was observed.


> 1 Insufficient cpu, 1 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector
- The crash-collector is supposed to be scheduled on one specific node, which has too few resources available; we also need to find out how much we are short by (a rough check is sketched below).

We can only proceed after looking at a live cluster; must-gather might work as a last resort, but that would be complicated.
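
A rough way to quantify the per-node shortfall, reusing the worker array from comment #2 (a sketch; the Allocated resources section of describe node shows total requests vs. allocatable):

$ for node in ${worker[@]}; do echo "== $node"; oc describe node $node | sed -n '/Allocated resources:/,/Events:/p'; done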

Comment 17 Ritesh Chikatwar 2023-03-14 15:29:51 UTC
Closing this bug as fixed in v2.0.11 and tested by QE.

