Bug 2133041
| Summary: | Pod rook-ceph-crashcollector did not reach Running state due to Insufficient memory. | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Jilju Joy <jijoy> |
| Component: | odf-managed-service | Assignee: | Leela Venkaiah Gangavarapu <lgangava> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Jilju Joy <jijoy> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | aeyal, ebenahar, fbalak, lgangava, muagarwa, nberry, ocs-bugs, odf-bz-bot, rchikatw |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-14 15:35:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
hi,

- generally, addon-operator workloads shouldn't be running on worker nodes
- an issue was raised with MT-SRE to change their scheduling so that it doesn't consume resources from ODF-specific worker nodes; it is being tracked at https://issues.redhat.com/browse/MTSRE-714
- until the Jira issue is completed we'll keep seeing the crashcollector (or any other pod) going into Pending state intermittently on ODF Control Plane nodes; if you see the same issue on ODF Data Plane nodes, that would be a blocker

workaround:
- run:
  kubectl get po -l 'app.kubernetes.io/name in (addon-operator,addon-operator-webhook-server)' --no-headers -nopenshift-addon-operator -oname | xargs -I {} kubectl delete {} -nopenshift-addon-operator
- this command bounces the addon-operator pods to Infra nodes, where there'll definitely be resources available, and it's not harmful

thanks,
leela.
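To confirm the bounce landed the pods on Infra nodes, one option (a sketch, assuming the same labels as in the workaround command) is to list them with their node placement and check the NODE column:

$ kubectl get po -l 'app.kubernetes.io/name in (addon-operator,addon-operator-webhook-server)' -nopenshift-addon-operator -owide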
hi,

- the work w.r.t. the referenced Jira issue is pushed to stage, so there shouldn't be a resource crunch on the ODF Control Plane
- the output of the command below should be empty:
worker=($(kubectl get machines -nopenshift-machine-api -l machine.openshift.io/cluster-api-machine-role=worker -ojsonpath='{.items[?(@.status.phase=="Running")].status.nodeRef.name}'))
for node in ${worker[@]}; do echo $node; kubectl get pods --field-selector=spec.nodeName=$node,status.phase=Running -nopenshift-addon-operator --no-headers -oname; done;
thanks,
leela.
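If the output is not empty, a follow-up check (a sketch; it reuses the worker array defined above) is to look at each worker node's allocated resources to see how much cpu/memory headroom is left:

$ for node in ${worker[@]}; do echo "== $node =="; kubectl describe node $node | sed -n '/Allocated resources:/,/Events:/p'; done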
Hi Leela,

Which version contains the fix? The output of the command given in comment #2 is not empty.

$ worker=($(kubectl get machines -nopenshift-machine-api -l machine.openshift.io/cluster-api-machine-role=worker -ojsonpath='{.items[?(@.status.phase=="Running")].status.nodeRef.name}'))
$ for node in ${worker[@]}; do echo $node; kubectl get pods --field-selector=spec.nodeName=$node,status.phase=Running -nopenshift-addon-operator --no-headers -oname; done;
ip-10-0-133-96.ec2.internal
pod/addon-operator-catalog-bw6zl
ip-10-0-159-32.ec2.internal
pod/addon-operator-webhooks-7bdc97545-xdjw8
ip-10-0-171-179.ec2.internal

Pod rook-ceph-crashcollector-ip-10-0-159-32.ec2.internal-57f9dgk2mm is in Pending state due to insufficient cpu.

$ oc describe pod rook-ceph-crashcollector-ip-10-0-159-32.ec2.internal-57f9dgk2mm
Name: rook-ceph-crashcollector-ip-10-0-159-32.ec2.internal-57f9dgk2mm
Namespace: openshift-storage
Priority: 0
Node: <none>
Labels: app=rook-ceph-crashcollector ceph-version=16.2.7-126 ceph_daemon_id=crash crashcollector=crash kubernetes.io/hostname=ip-10-0-159-32.ec2.internal node_name=ip-10-0-159-32.ec2.internal pod-template-hash=57f9d6748f rook-version=v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63 rook_cluster=openshift-storage
Annotations: openshift.io/scc: rook-ceph
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/rook-ceph-crashcollector-ip-10-0-159-32.ec2.internal-57f9d6748f
Init Containers:
  make-container-crash-dir: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6; Port: <none>; Host Port: <none>; Command: mkdir -p; Args: /var/lib/ceph/crash/posted; Limits: cpu: 50m, memory: 60Mi; Requests: cpu: 50m, memory: 60Mi; Environment: <none>; Mounts: /etc/ceph from rook-config-override (ro), /var/lib/ceph/crash from rook-ceph-crash (rw), /var/log/ceph from rook-ceph-log (rw), /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vmlzs (ro)
  chown-container-data-dir: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6; Port: <none>; Host Port: <none>; Command: chown; Args: --verbose --recursive ceph:ceph /var/log/ceph /var/lib/ceph/crash; Limits: cpu: 50m, memory: 60Mi; Requests: cpu: 50m, memory: 60Mi; Environment: <none>; Mounts: /etc/ceph from rook-config-override (ro), /var/lib/ceph/crash from rook-ceph-crash (rw), /var/log/ceph from rook-ceph-log (rw), /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vmlzs (ro)
Containers:
  ceph-crash: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6; Port: <none>; Host Port: <none>; Command: ceph-crash; Limits: cpu: 50m, memory: 60Mi; Requests: cpu: 50m, memory: 60Mi; Environment: CONTAINER_IMAGE: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6, POD_NAME: rook-ceph-crashcollector-ip-10-0-159-32.ec2.internal-57f9dgk2mm (v1:metadata.name), POD_NAMESPACE: openshift-storage (v1:metadata.namespace), NODE_NAME: (v1:spec.nodeName), POD_MEMORY_LIMIT: 62914560 (limits.memory), POD_MEMORY_REQUEST: 62914560 (requests.memory), POD_CPU_LIMIT: 1 (limits.cpu), POD_CPU_REQUEST: 1 (requests.cpu), ROOK_CEPH_MON_HOST: <set to the key 'mon_host' in secret 'rook-ceph-config'> Optional: false, ROOK_CEPH_MON_INITIAL_MEMBERS: <set to the key 'mon_initial_members' in secret 'rook-ceph-config'> Optional: false, CEPH_ARGS: -m $(ROOK_CEPH_MON_HOST) -k /etc/ceph/crash-collector-keyring-store/keyring; Mounts: /etc/ceph from rook-config-override (ro), /etc/ceph/crash-collector-keyring-store/ from rook-ceph-crash-collector-keyring (ro), /var/lib/ceph/crash from rook-ceph-crash (rw), /var/log/ceph from rook-ceph-log (rw), /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vmlzs (ro)
Conditions: PodScheduled False
Volumes:
  rook-config-override: Type: Projected; ConfigMapName: rook-config-override; ConfigMapOptional: <nil>
  rook-ceph-log: Type: HostPath; Path: /var/lib/rook/openshift-storage/log; HostPathType:
  rook-ceph-crash: Type: HostPath; Path: /var/lib/rook/openshift-storage/crash; HostPathType:
  rook-ceph-crash-collector-keyring: Type: Secret; SecretName: rook-ceph-crash-collector-keyring; Optional: false
  kube-api-access-vmlzs: Type: Projected; TokenExpirationSeconds: 3607; ConfigMapName: kube-root-ca.crt; ConfigMapOptional: <nil>; DownwardAPI: true; ConfigMapName: openshift-service-ca.crt; ConfigMapOptional: <nil>
QoS Class: Guaranteed
Node-Selectors: kubernetes.io/hostname=ip-10-0-159-32.ec2.internal
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists, node.kubernetes.io/not-ready:NoExecute op=Exists for 300s, node.kubernetes.io/unreachable:NoExecute op=Exists for 5s, node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Warning FailedScheduling 5m24s (x627 over 8h) default-scheduler: 0/12 nodes are available: 1 Insufficient cpu, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate.
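To see what is actually consuming cpu on the node the crashcollector is pinned to, one option (a sketch; the node name is taken from the node selector above) is to list every pod scheduled there across all namespaces:

$ kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-0-159-32.ec2.internal -o wide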
must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-o14-pr/jijoy-o14-pr_20221014T041828/logs/testcases_1665757256/

============================================================
Version:

$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
mcg-operator.v4.10.6 NooBaa Operator 4.10.6 mcg-operator.v4.10.5 Succeeded
ocs-operator.v4.10.5 OpenShift Container Storage 4.10.5 ocs-operator.v4.10.4 Succeeded
ocs-osd-deployer.v2.0.8 OCS OSD Deployer 2.0.8 Succeeded
odf-csi-addons-operator.v4.10.5 CSI Addons 4.10.5 odf-csi-addons-operator.v4.10.4 Succeeded
odf-operator.v4.10.5 OpenShift Data Foundation 4.10.5 odf-operator.v4.10.4 Succeeded
ose-prometheus-operator.4.10.0 Prometheus Operator 4.10.0 ose-prometheus-operator.4.8.0 Succeeded
route-monitor-operator.v0.1.450-6e98c37 Route Monitor Operator 0.1.450-6e98c37 route-monitor-operator.v0.1.448-b25b8ee Succeeded

$ oc get csv ocs-osd-deployer.v2.0.8 -o yaml | grep image:
image: quay.io/osd-addons/ocs-osd-deployer:5ca1ab1e
image: registry.redhat.io/openshift4/ose-kube-rbac-proxy:v4.11.0-202209130958.p0.ga805ba5.assembly.stream
image: quay.io/osd-addons/ocs-osd-deployer:5ca1ab1e
image: quay.io/osd-addons/ocs-osd-deployer:5ca1ab1e

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.35 True False 9h Cluster version is 4.10.35

- As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2133041#c1, the referenced Jira issue https://issues.redhat.com/browse/MTSRE-714 was worked on by MT-SRE and moved to Platform SRE for changes from their side.
- Until SRE-P completes the actions on their side, this issue will be hit intermittently; a tricky workaround exists if it is hit, so this shouldn't be a blocker for testing AFAIK.
- By 2022-11-08 the build with the fix was delivered to QE.

Tested in a 4TiB cluster. One "rook-ceph-crashcollector" pod is in Pending state due to insufficient cpu and memory.
$ oc get pods -o wide | grep -v 'Running\|Completed'
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rook-ceph-crashcollector-ip-10-0-148-170.ec2.internal-7d4486f6b 0/1 Pending 0 4h45m <none> <none> <none> <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5m52s (x318 over 4h21m) default-scheduler 0/12 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate.
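As a side note, the taints mentioned in the scheduler message can be listed per node with something like the following (a sketch, not from the original report):

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key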
The output of the command given in the comment #2 is not empty.
$ worker=($(kubectl get machines -nopenshift-machine-api -l machine.openshift.io/cluster-api-machine-role=worker -ojsonpath='{.items[?(@.status.phase=="Running")].status.nodeRef.name}'))
$ for node in ${worker[@]}; do echo $node; kubectl get pods --field-selector=spec.nodeName=$node,status.phase=Running -nopenshift-addon-operator --no-headers -oname; done;
ip-10-0-136-247.ec2.internal
ip-10-0-148-170.ec2.internal
ip-10-0-171-222.ec2.internal
pod/addon-operator-catalog-hzqzd
--------------------------------------------------------
Tested in version:
$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
mcg-operator.v4.10.9 NooBaa Operator 4.10.9 mcg-operator.v4.10.8 Succeeded
observability-operator.v0.0.17 Observability Operator 0.0.17 observability-operator.v0.0.17-rc Succeeded
ocs-operator.v4.10.9 OpenShift Container Storage 4.10.9 ocs-operator.v4.10.8 Succeeded
ocs-osd-deployer.v2.0.11 OCS OSD Deployer 2.0.11 ocs-osd-deployer.v2.0.10 Succeeded
odf-csi-addons-operator.v4.10.9 CSI Addons 4.10.9 odf-csi-addons-operator.v4.10.8 Succeeded
odf-operator.v4.10.9 OpenShift Data Foundation 4.10.9 odf-operator.v4.10.8 Succeeded
ose-prometheus-operator.4.10.0 Prometheus Operator 4.10.0 ose-prometheus-operator.4.8.0 Succeeded
route-monitor-operator.v0.1.451-3df1ed1 Route Monitor Operator 0.1.451-3df1ed1 route-monitor-operator.v0.1.450-6e98c37 Succeeded
$ oc get storagecluster
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 4h57m Ready 2022-12-23T05:14:14Z
@Leela
If the issue is the one tracked in https://issues.redhat.com/browse/MTSRE-714, can we move this bug back to the ASSIGNED state and test after the Jira issue gets fixed?
> MTSRE-714

- Not related to this, as we asked not to run addon-operators on worker nodes.
- However, from the listing we can see the catalog operator is running on a worker node (cpu: 10m, memory: 50Mi).
- Our crash-collector requires (cpu: 50m, memory: 80Mi), and maybe we are left with only (cpu < 50m and memory < 80Mi) because of the addon catalog operator.
- In the worst case we need to reach out to MT-SRE again to not have anything running on worker nodes, and at the same time that may not be a straightforward fix from their side.
- Before reaching out, I also need confirmation that all nodes were schedulable when the above issue was observed.

> 1 Insufficient cpu, 1 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector

- The crash-collector is supposed to be scheduled on the one node where fewer resources are available; we also need to find out how much we are short by. I can only proceed after looking at a live cluster; a must-gather might work only as a last resort, but that is complicated.

Verified in version:
OCP 4.10.50
ODF 4.10.9-7
ocs-osd-deployer.v2.0.11

Verified in size 4, 8 and 20.

Pods list in size 4 cluster: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy144-pr/jijoy144-pr_20230214T072156/logs/deployment_1676364181/ocs_must_gather/quay-io-ocs-dev-ocs-must-gather-sha256-3d57a983f6c2b53ef2dd4ee3a5a4df7bc86daf53abb703e9a7872d07c00ed3c7/namespaces/openshift-storage/oc_output/all_-o_wide

Closing this bug as fixed in v2.0.11 and tested by QE.
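As a side note on the schedulability confirmation requested above, one quick check (a sketch, not part of the original comments) is to print each node's unschedulable flag, which stays empty unless the node was cordoned:

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.unschedulable}{"\n"}{end}'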
Description of problem:

Pod rook-ceph-crashcollector did not reach Running state due to Insufficient memory.

Testing was done using the addon 'ocs-provider-dev', which contains changes related to ODFMS-55. The size parameter was set to 4.

$ oc describe pod rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b8b9l6f
Name: rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b8b9l6f
Namespace: openshift-storage
Priority: 0
Node: <none>
Labels: app=rook-ceph-crashcollector ceph-version=16.2.7-126 ceph_daemon_id=crash crashcollector=crash kubernetes.io/hostname=ip-10-0-150-29.ec2.internal node_name=ip-10-0-150-29.ec2.internal pod-template-hash=5b7b8876b4 rook-version=v4.10.5-0.985405daeba3b29a178cb19aa864324e65548a63 rook_cluster=openshift-storage
Annotations: openshift.io/scc: rook-ceph
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b8876b4
Init Containers:
  make-container-crash-dir: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6; Port: <none>; Host Port: <none>; Command: mkdir -p; Args: /var/lib/ceph/crash/posted; Limits: cpu: 50m, memory: 60Mi; Requests: cpu: 50m, memory: 60Mi; Environment: <none>; Mounts: /etc/ceph from rook-config-override (ro), /var/lib/ceph/crash from rook-ceph-crash (rw), /var/log/ceph from rook-ceph-log (rw), /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qfss5 (ro)
  chown-container-data-dir: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6; Port: <none>; Host Port: <none>; Command: chown; Args: --verbose --recursive ceph:ceph /var/log/ceph /var/lib/ceph/crash; Limits: cpu: 50m, memory: 60Mi; Requests: cpu: 50m, memory: 60Mi; Environment: <none>; Mounts: /etc/ceph from rook-config-override (ro), /var/lib/ceph/crash from rook-ceph-crash (rw), /var/log/ceph from rook-ceph-log (rw), /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qfss5 (ro)
Containers:
  ceph-crash: Image: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6; Port: <none>; Host Port: <none>; Command: ceph-crash; Limits: cpu: 50m, memory: 60Mi; Requests: cpu: 50m, memory: 60Mi; Environment: CONTAINER_IMAGE: registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6, POD_NAME: rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b8b9l6f (v1:metadata.name), POD_NAMESPACE: openshift-storage (v1:metadata.namespace), NODE_NAME: (v1:spec.nodeName), POD_MEMORY_LIMIT: 62914560 (limits.memory), POD_MEMORY_REQUEST: 62914560 (requests.memory), POD_CPU_LIMIT: 1 (limits.cpu), POD_CPU_REQUEST: 1 (requests.cpu), ROOK_CEPH_MON_HOST: <set to the key 'mon_host' in secret 'rook-ceph-config'> Optional: false, ROOK_CEPH_MON_INITIAL_MEMBERS: <set to the key 'mon_initial_members' in secret 'rook-ceph-config'> Optional: false, CEPH_ARGS: -m $(ROOK_CEPH_MON_HOST) -k /etc/ceph/crash-collector-keyring-store/keyring; Mounts: /etc/ceph from rook-config-override (ro), /etc/ceph/crash-collector-keyring-store/ from rook-ceph-crash-collector-keyring (ro), /var/lib/ceph/crash from rook-ceph-crash (rw), /var/log/ceph from rook-ceph-log (rw), /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qfss5 (ro)
Conditions: PodScheduled False
Volumes:
  rook-config-override: Type: Projected; ConfigMapName: rook-config-override; ConfigMapOptional: <nil>
  rook-ceph-log: Type: HostPath; Path: /var/lib/rook/openshift-storage/log; HostPathType:
  rook-ceph-crash: Type: HostPath; Path: /var/lib/rook/openshift-storage/crash; HostPathType:
  rook-ceph-crash-collector-keyring: Type: Secret; SecretName: rook-ceph-crash-collector-keyring; Optional: false
  kube-api-access-qfss5: Type: Projected; TokenExpirationSeconds: 3607; ConfigMapName: kube-root-ca.crt; ConfigMapOptional: <nil>; DownwardAPI: true; ConfigMapName: openshift-service-ca.crt; ConfigMapOptional: <nil>
QoS Class: Guaranteed
Node-Selectors: kubernetes.io/hostname=ip-10-0-150-29.ec2.internal
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists, node.kubernetes.io/not-ready:NoExecute op=Exists for 300s, node.kubernetes.io/unreachable:NoExecute op=Exists for 5s, node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Warning FailedScheduling 27m (x839 over 11h) default-scheduler: 0/12 nodes are available: 1 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.ocs.openshift.io/osd: true}, that the pod didn't tolerate.

Pods list (this was taken after deleting the pod rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b8b9l6f while trying out some workaround):

$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
addon-ocs-provider-dev-catalog-whqhc 1/1 Running 0 12h 10.128.2.20 ip-10-0-136-1.ec2.internal <none> <none>
alertmanager-managed-ocs-alertmanager-0 2/2 Running 0 12h 10.128.2.15 ip-10-0-136-1.ec2.internal <none> <none>
alertmanager-managed-ocs-alertmanager-1 2/2 Running 0 12h 10.131.0.16 ip-10-0-150-29.ec2.internal <none> <none>
alertmanager-managed-ocs-alertmanager-2 2/2 Running 0 12h 10.128.2.16 ip-10-0-136-1.ec2.internal <none> <none>
csi-addons-controller-manager-b8b965868-szb6r 2/2 Running 0 12h 10.128.2.19 ip-10-0-136-1.ec2.internal <none> <none>
ocs-metrics-exporter-577574796b-rfgs8 1/1 Running 0 12h 10.128.2.14 ip-10-0-136-1.ec2.internal <none> <none>
ocs-operator-5c77756ddd-mfhrj 1/1 Running 0 12h 10.131.0.10 ip-10-0-150-29.ec2.internal <none> <none>
ocs-osd-aws-data-gather-54b4bd7d6c-z9f4g 1/1 Running 0 12h 10.0.150.29 ip-10-0-150-29.ec2.internal <none> <none>
ocs-osd-controller-manager-66dc698b5b-4gjkk 3/3 Running 0 12h 10.131.0.8 ip-10-0-150-29.ec2.internal <none> <none>
ocs-provider-server-6f888bbffb-d8wv6 1/1 Running 0 12h 10.131.0.31 ip-10-0-150-29.ec2.internal <none> <none>
odf-console-585db6ddb-q7pdn 1/1 Running 0 12h 10.128.2.17 ip-10-0-136-1.ec2.internal <none> <none>
odf-operator-controller-manager-7866b5fdbb-5gg4b 2/2 Running 0 12h 10.128.2.12 ip-10-0-136-1.ec2.internal <none> <none>
prometheus-managed-ocs-prometheus-0 3/3 Running 0 12h 10.129.2.8 ip-10-0-170-254.ec2.internal <none> <none>
prometheus-operator-8547cc9f89-sq584 1/1 Running 0 12h 10.128.2.18 ip-10-0-136-1.ec2.internal <none> <none>
rook-ceph-crashcollector-ip-10-0-136-1.ec2.internal-5cc5bdtq6fw 1/1 Running 0 12h 10.0.136.1 ip-10-0-136-1.ec2.internal <none> <none>
rook-ceph-crashcollector-ip-10-0-142-131.ec2.internal-7b9dhr7nr 1/1 Running 0 11h 10.0.142.131 ip-10-0-142-131.ec2.internal <none> <none>
rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b88bzv7 0/1 Pending 0 19s <none> <none> <none> <none>
rook-ceph-crashcollector-ip-10-0-152-20.ec2.internal-8679b7k67v 1/1 Running 0 12h 10.0.152.20 ip-10-0-152-20.ec2.internal <none> <none>
rook-ceph-crashcollector-ip-10-0-170-254.ec2.internal-76c7hpqwj 1/1 Running 0 12h 10.0.170.254 ip-10-0-170-254.ec2.internal <none> <none>
rook-ceph-crashcollector-ip-10-0-171-253.ec2.internal-65c66d8wf 1/1 Running 0 12h 10.0.171.253 ip-10-0-171-253.ec2.internal <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58cfc967g8r5z 2/2 Running 0 12h 10.0.150.29 ip-10-0-150-29.ec2.internal <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-b574cd94g4cdz 2/2 Running 0 12h 10.0.170.254 ip-10-0-170-254.ec2.internal <none> <none>
rook-ceph-mgr-a-c999476d6-qz958 2/2 Running 0 12h 10.0.136.1 ip-10-0-136-1.ec2.internal <none> <none>
rook-ceph-mon-a-5fddc4774-mb9g7 2/2 Running 0 12h 10.0.170.254 ip-10-0-170-254.ec2.internal <none> <none>
rook-ceph-mon-b-7846f47ddb-xcp5z 2/2 Running 0 12h 10.0.136.1 ip-10-0-136-1.ec2.internal <none> <none>
rook-ceph-mon-c-6f4c85457b-dncgl 2/2 Running 0 12h 10.0.150.29 ip-10-0-150-29.ec2.internal <none> <none>
rook-ceph-operator-564cb5cb98-gcj8n 1/1 Running 0 12h 10.128.2.13 ip-10-0-136-1.ec2.internal <none> <none>
rook-ceph-osd-0-9f8488c5-w72gx 2/2 Running 0 12h 10.0.142.131 ip-10-0-142-131.ec2.internal <none> <none>
rook-ceph-osd-2-564c6d49c8-6ffmd 2/2 Running 0 12h 10.0.171.253 ip-10-0-171-253.ec2.internal <none> <none>
rook-ceph-osd-3-fd4cf4999-24n56 2/2 Running 0 12h 10.0.152.20 ip-10-0-152-20.ec2.internal <none> <none>
rook-ceph-osd-prepare-default-1-data-0p86j8-6k785 0/1 Completed 0 12h 10.0.152.20 ip-10-0-152-20.ec2.internal <none> <none>
rook-ceph-tools-787676bdbd-z25ps 1/1 Running 0 82s 10.0.170.254 ip-10-0-170-254.ec2.internal <none> <none>

As a workaround, deleted the ocs-operator pod which was running on the node ip-10-0-150-29.ec2.internal. A new ocs-operator pod was created on a different node, and the pod rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal reached Running state.
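The workaround amounts to deleting the ocs-operator pod so that its ReplicaSet recreates it on another node and frees resources for the crashcollector; roughly (a sketch using the pod name from the listing above):

$ oc delete pod ocs-operator-5c77756ddd-mfhrj -n openshift-storage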
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-130-155.ec2.internal Ready worker 9h v1.23.5+8471591
ip-10-0-136-1.ec2.internal Ready worker 32h v1.23.5+8471591
ip-10-0-138-49.ec2.internal Ready infra,worker 32h v1.23.5+8471591
ip-10-0-140-159.ec2.internal Ready master 32h v1.23.5+8471591
ip-10-0-150-29.ec2.internal Ready worker 32h v1.23.5+8471591
ip-10-0-152-20.ec2.internal Ready worker 32h v1.23.5+8471591
ip-10-0-155-226.ec2.internal Ready master 32h v1.23.5+8471591
ip-10-0-157-194.ec2.internal Ready infra,worker 32h v1.23.5+8471591
ip-10-0-169-109.ec2.internal Ready master 32h v1.23.5+8471591
ip-10-0-170-254.ec2.internal Ready worker 32h v1.23.5+8471591
ip-10-0-171-253.ec2.internal Ready worker 32h v1.23.5+8471591
ip-10-0-175-139.ec2.internal Ready infra,worker 32h v1.23.5+8471591

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-4ti4b-pr/jijoy-4ti4b-pr_20221006T043953/logs/failed_testcase_ocs_logs_1665073038/test_rolling_nodes_restart%5bworker%5d_ocs_logs/ocs_must_gather/

=========================================================================================
Version-Release number of selected component (if applicable):

$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
mcg-operator.v4.10.6 NooBaa Operator 4.10.6 mcg-operator.v4.10.5 Succeeded
ocs-operator.v4.10.5 OpenShift Container Storage 4.10.5 ocs-operator.v4.10.4 Succeeded
ocs-osd-deployer.v2.0.8 OCS OSD Deployer 2.0.8 Succeeded
odf-csi-addons-operator.v4.10.5 CSI Addons 4.10.5 odf-csi-addons-operator.v4.10.4 Succeeded
odf-operator.v4.10.5 OpenShift Data Foundation 4.10.5 odf-operator.v4.10.4 Succeeded
ose-prometheus-operator.4.10.0 Prometheus Operator 4.10.0 ose-prometheus-operator.4.8.0 Succeeded
route-monitor-operator.v0.1.422-151be96 Route Monitor Operator 0.1.422-151be96 route-monitor-operator.v0.1.420-b65f47e Succeeded

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.34 True False 32h Cluster version is 4.10.34

======================================================================================
How reproducible:
Observed 2 times in 3 deployment attempts.

Steps to Reproduce:
1. Install the provider cluster with the dev addon, e.g.:
   rosa create service --type ocs-provider-dev --name jijoy-4ti4b-pr --machine-cidr 10.0.0.0/16 --size 4 --onboarding-validation-key <key> --subnet-ids <subnet-ids> --region us-east-1
2. Verify that the rook-ceph-crashcollector pods are in Running state (see the check sketched after this description).

=========================================================================================
Actual results:
One pod among the rook-ceph-crashcollector pods is in Pending state.

Expected results:
All rook-ceph-crashcollector pods should be running.

Additional info:
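A quick way to do the check in step 2 of the reproduction steps, as a sketch that assumes the app=rook-ceph-crashcollector label shown in the describe output above:

$ oc get pods -n openshift-storage -l app=rook-ceph-crashcollector -o wide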