Description of problem:

rook-ceph-crashcollector pods are coming up on nodes on which no OSD is running. Sometimes (observed in more than two deployments) one or more of the pods trying to come up on non-OSD nodes remain in Pending state due to unavailable resources.

Testing was done using the addon 'ocs-provider-dev', which contains the changes related to ODFMS-55. The size parameter was set to 4.

$ oc get pods -o wide
NAME   READY   STATUS   RESTARTS   AGE   IP   NODE   NOMINATED NODE   READINESS GATES
addon-ocs-provider-dev-catalog-whqhc   1/1   Running   0   12h   10.128.2.20   ip-10-0-136-1.ec2.internal   <none>   <none>
alertmanager-managed-ocs-alertmanager-0   2/2   Running   0   12h   10.128.2.15   ip-10-0-136-1.ec2.internal   <none>   <none>
alertmanager-managed-ocs-alertmanager-1   2/2   Running   0   12h   10.131.0.16   ip-10-0-150-29.ec2.internal   <none>   <none>
alertmanager-managed-ocs-alertmanager-2   2/2   Running   0   12h   10.128.2.16   ip-10-0-136-1.ec2.internal   <none>   <none>
csi-addons-controller-manager-b8b965868-szb6r   2/2   Running   0   12h   10.128.2.19   ip-10-0-136-1.ec2.internal   <none>   <none>
ocs-metrics-exporter-577574796b-rfgs8   1/1   Running   0   12h   10.128.2.14   ip-10-0-136-1.ec2.internal   <none>   <none>
ocs-operator-5c77756ddd-mfhrj   1/1   Running   0   12h   10.131.0.10   ip-10-0-150-29.ec2.internal   <none>   <none>
ocs-osd-aws-data-gather-54b4bd7d6c-z9f4g   1/1   Running   0   12h   10.0.150.29   ip-10-0-150-29.ec2.internal   <none>   <none>
ocs-osd-controller-manager-66dc698b5b-4gjkk   3/3   Running   0   12h   10.131.0.8   ip-10-0-150-29.ec2.internal   <none>   <none>
ocs-provider-server-6f888bbffb-d8wv6   1/1   Running   0   12h   10.131.0.31   ip-10-0-150-29.ec2.internal   <none>   <none>
odf-console-585db6ddb-q7pdn   1/1   Running   0   12h   10.128.2.17   ip-10-0-136-1.ec2.internal   <none>   <none>
odf-operator-controller-manager-7866b5fdbb-5gg4b   2/2   Running   0   12h   10.128.2.12   ip-10-0-136-1.ec2.internal   <none>   <none>
prometheus-managed-ocs-prometheus-0   3/3   Running   0   12h   10.129.2.8   ip-10-0-170-254.ec2.internal   <none>   <none>
prometheus-operator-8547cc9f89-sq584   1/1   Running   0   12h   10.128.2.18   ip-10-0-136-1.ec2.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-136-1.ec2.internal-5cc5bdtq6fw   1/1   Running   0   12h   10.0.136.1   ip-10-0-136-1.ec2.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-142-131.ec2.internal-7b9dhr7nr   1/1   Running   0   11h   10.0.142.131   ip-10-0-142-131.ec2.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b88bzv7   0/1   Pending   0   19s   <none>   <none>   <none>   <none>
rook-ceph-crashcollector-ip-10-0-152-20.ec2.internal-8679b7k67v   1/1   Running   0   12h   10.0.152.20   ip-10-0-152-20.ec2.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-170-254.ec2.internal-76c7hpqwj   1/1   Running   0   12h   10.0.170.254   ip-10-0-170-254.ec2.internal   <none>   <none>
rook-ceph-crashcollector-ip-10-0-171-253.ec2.internal-65c66d8wf   1/1   Running   0   12h   10.0.171.253   ip-10-0-171-253.ec2.internal   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58cfc967g8r5z   2/2   Running   0   12h   10.0.150.29   ip-10-0-150-29.ec2.internal   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-b574cd94g4cdz   2/2   Running   0   12h   10.0.170.254   ip-10-0-170-254.ec2.internal   <none>   <none>
rook-ceph-mgr-a-c999476d6-qz958   2/2   Running   0   12h   10.0.136.1   ip-10-0-136-1.ec2.internal   <none>   <none>
rook-ceph-mon-a-5fddc4774-mb9g7   2/2   Running   0   12h   10.0.170.254   ip-10-0-170-254.ec2.internal   <none>   <none>
rook-ceph-mon-b-7846f47ddb-xcp5z   2/2   Running   0   12h   10.0.136.1   ip-10-0-136-1.ec2.internal   <none>   <none>
rook-ceph-mon-c-6f4c85457b-dncgl   2/2   Running   0   12h   10.0.150.29   ip-10-0-150-29.ec2.internal   <none>   <none>
rook-ceph-operator-564cb5cb98-gcj8n   1/1   Running   0   12h   10.128.2.13   ip-10-0-136-1.ec2.internal   <none>   <none>
rook-ceph-osd-0-9f8488c5-w72gx   2/2   Running   0   12h   10.0.142.131   ip-10-0-142-131.ec2.internal   <none>   <none>
rook-ceph-osd-2-564c6d49c8-6ffmd   2/2   Running   0   12h   10.0.171.253   ip-10-0-171-253.ec2.internal   <none>   <none>
rook-ceph-osd-3-fd4cf4999-24n56   2/2   Running   0   12h   10.0.152.20   ip-10-0-152-20.ec2.internal   <none>   <none>
rook-ceph-osd-prepare-default-1-data-0p86j8-6k785   0/1   Completed   0   12h   10.0.152.20   ip-10-0-152-20.ec2.internal   <none>   <none>
rook-ceph-tools-787676bdbd-z25ps   1/1   Running   0   82s   10.0.170.254   ip-10-0-170-254.ec2.internal   <none>   <none>

$ oc get nodes
NAME   STATUS   ROLES   AGE   VERSION
ip-10-0-130-155.ec2.internal   Ready   worker   9h    v1.23.5+8471591
ip-10-0-136-1.ec2.internal     Ready   worker   32h   v1.23.5+8471591
ip-10-0-138-49.ec2.internal    Ready   infra,worker   32h   v1.23.5+8471591
ip-10-0-140-159.ec2.internal   Ready   master   32h   v1.23.5+8471591
ip-10-0-150-29.ec2.internal    Ready   worker   32h   v1.23.5+8471591
ip-10-0-152-20.ec2.internal    Ready   worker   32h   v1.23.5+8471591
ip-10-0-155-226.ec2.internal   Ready   master   32h   v1.23.5+8471591
ip-10-0-157-194.ec2.internal   Ready   infra,worker   32h   v1.23.5+8471591
ip-10-0-169-109.ec2.internal   Ready   master   32h   v1.23.5+8471591
ip-10-0-170-254.ec2.internal   Ready   worker   32h   v1.23.5+8471591
ip-10-0-171-253.ec2.internal   Ready   worker   32h   v1.23.5+8471591
ip-10-0-175-139.ec2.internal   Ready   infra,worker   32h   v1.23.5+8471591

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-4ti4b-pr/jijoy-4ti4b-pr_20221006T043953/logs/failed_testcase_ocs_logs_1665073038/test_rolling_nodes_restart%5bworker%5d_ocs_logs/ocs_must_gather/

===================================================================================

Version-Release number of selected component (if applicable):

$ oc get csv
NAME   DISPLAY   VERSION   REPLACES   PHASE
mcg-operator.v4.10.6   NooBaa Operator   4.10.6   mcg-operator.v4.10.5   Succeeded
ocs-operator.v4.10.5   OpenShift Container Storage   4.10.5   ocs-operator.v4.10.4   Succeeded
ocs-osd-deployer.v2.0.8   OCS OSD Deployer   2.0.8      Succeeded
odf-csi-addons-operator.v4.10.5   CSI Addons   4.10.5   odf-csi-addons-operator.v4.10.4   Succeeded
odf-operator.v4.10.5   OpenShift Data Foundation   4.10.5   odf-operator.v4.10.4   Succeeded
ose-prometheus-operator.4.10.0   Prometheus Operator   4.10.0   ose-prometheus-operator.4.8.0   Succeeded
route-monitor-operator.v0.1.422-151be96   Route Monitor Operator   0.1.422-151be96   route-monitor-operator.v0.1.420-b65f47e   Succeeded

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.34   True        False         32h     Cluster version is 4.10.34

How reproducible:
100%

Steps to Reproduce:
1. Install a provider cluster, e.g.:
   rosa create service --type ocs-provider-dev --name jijoy-4ti4b-pr --machine-cidr 10.0.0.0/16 --size 4 --onboarding-validation-key <key> --subnet-ids <subnet-ids> --region us-east-1
2. Check which nodes the "rook-ceph-crashcollector" pods are running on, and check whether these pods are actually in the 'Running' state (see the verification sketch under Additional info below).

Actual results:
"rook-ceph-crashcollector" pods are scheduled on all worker nodes. One or more of the pods trying to come up on non-OSD nodes remained in Pending state due to unavailable resources.

Expected results:
"rook-ceph-crashcollector" pods should be scheduled only on OSD nodes and should be in the Running state.

Additional info:
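A minimal verification sketch for step 2, assuming the default openshift-storage namespace and the standard Rook pod labels app=rook-ceph-crashcollector and app=rook-ceph-osd (adjust the namespace and the Pending pod name to match the deployment under test):

# Compare the set of nodes hosting crashcollector pods with the set hosting OSD pods.
$ oc get pods -n openshift-storage -l app=rook-ceph-crashcollector -o custom-columns=NODE:.spec.nodeName --no-headers | sort -u
$ oc get pods -n openshift-storage -l app=rook-ceph-osd -o custom-columns=NODE:.spec.nodeName --no-headers | sort -u

# Confirm the status of every crashcollector pod and inspect the scheduling
# events of any pod stuck in Pending (pod name taken from the listing above).
$ oc get pods -n openshift-storage -l app=rook-ceph-crashcollector -o wide
$ oc describe pod -n openshift-storage rook-ceph-crashcollector-ip-10-0-150-29.ec2.internal-5b7b88bzv7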
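When triaging, it may also help to record the crashCollector settings from the CephCluster CR; again only a sketch, assuming the CR lives in the openshift-storage namespace:

# Dump the crashCollector section of the CephCluster spec for reference.
$ oc get cephcluster -n openshift-storage -o jsonpath='{.items[0].spec.crashCollector}{"\n"}'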