Bug 2167045
| Summary: | All OSDs were not created in MS provider cluster of size 20 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Jilju Joy <jijoy> |
| Component: | odf-managed-service | Assignee: | Ohad <omitrani> |
| Status: | CLOSED WORKSFORME | QA Contact: | Jilju Joy <jijoy> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | cblum, fbalak, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-27 10:48:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Is this resolved by the new image?
Description of problem:
2/15 OSD pods were not created in the Managed Services provider cluster of size 20. Installation was done with the ocs-provider-qe addon.

$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
addon-ocs-provider-qe-catalog-6hlvd 1/1 Running 0 11h 10.130.2.22 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
alertmanager-managed-ocs-alertmanager-0 2/2 Running 0 11h 10.131.2.17 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
csi-addons-controller-manager-759b488df-xrhx4 2/2 Running 0 11h 10.131.2.18 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
ocs-metrics-exporter-5dd96c885b-x9ls7 1/1 Running 0 11h 10.131.2.14 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
ocs-operator-6888799d6b-qn9b7 1/1 Running 0 11h 10.129.2.9 ip-10-0-23-98.us-east-2.compute.internal <none> <none>
ocs-osd-aws-data-gather-87db84b8b-rh452 1/1 Running 0 11h 10.0.14.9 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
ocs-osd-controller-manager-8d55ffccd-kzwmr 3/3 Running 0 11h 10.130.2.28 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
ocs-provider-server-6c47b6c7c9-c65n4 1/1 Running 0 11h 10.131.2.12 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
odf-console-57b8476cd4-fkmwg 1/1 Running 0 11h 10.130.2.29 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
odf-operator-controller-manager-6f44676f4f-p48b2 2/2 Running 0 11h 10.130.2.26 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
prometheus-managed-ocs-prometheus-0 3/3 Running 0 11h 10.131.2.16 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
prometheus-operator-8547cc9f89-lgjqz 1/1 Running 0 11h 10.130.2.24 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-04c17f9d1a57254b9b8f55072ae1557b-24ll5 1/1 Running 0 11h 10.0.19.29 ip-10-0-19-29.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-0a0d3c68c7ab5b2c0e551505cd3d86fc-jp5x9 1/1 Running 0 11h 10.0.17.148 ip-10-0-17-148.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-b319c5fd4e7a6a3619e66719a3d16180-t4gzx 1/1 Running 0 11h 10.0.23.98 ip-10-0-23-98.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-c0db9449293af286deacfef5500c908c-59j59 1/1 Running 0 11h 10.0.23.153 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-c5123a40ecb6868b9caaab671f018de4-hb49r 1/1 Running 0 11h 10.0.14.9 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-cbdb47f615d01227c71b30b63707e135-p86c7 1/1 Running 0 11h 10.0.14.108 ip-10-0-14-108.us-east-2.compute.internal <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-79b6b894rf5kg 2/2 Running 0 11h 10.0.23.98 ip-10-0-23-98.us-east-2.compute.internal <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-d9876ff7vkt4x 2/2 Running 0 11h 10.0.14.108 ip-10-0-14-108.us-east-2.compute.internal <none> <none>
rook-ceph-mgr-a-f8895b454-fgj55 2/2 Running 0 11h 10.0.17.148 ip-10-0-17-148.us-east-2.compute.internal <none> <none>
rook-ceph-mon-a-6bd7ccb97c-qk6gb 2/2 Running 0 11h 10.0.17.148 ip-10-0-17-148.us-east-2.compute.internal <none> <none>
rook-ceph-mon-b-79d9cb59f6-j92bf 2/2 Running 0 11h 10.0.23.153 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
rook-ceph-mon-c-c9fbf66fb-fdqd9 2/2 Running 0 11h 10.0.14.9 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
rook-ceph-operator-548b87d44b-mncjg 1/1 Running 0 11h 10.129.2.8 ip-10-0-23-98.us-east-2.compute.internal <none> <none>
rook-ceph-osd-0-7b87b47858-kpfv6 2/2 Running 0 11h 10.0.23.153 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
rook-ceph-osd-1-88c6867fb-wrfgs 2/2 Running 0 11h 10.0.17.148 ip-10-0-17-148.us-east-2.compute.internal <none> <none>
rook-ceph-osd-11-7d7458f76-wql6k 2/2 Running 0 11h 10.0.23.98 ip-10-0-23-98.us-east-2.compute.internal <none> <none>
rook-ceph-osd-13-846c6fcd8c-6gl52 2/2 Running 0 11h 10.0.19.29 ip-10-0-19-29.us-east-2.compute.internal <none> <none>
rook-ceph-osd-14-5445548848-nkllt 2/2 Running 0 11h 10.0.19.29 ip-10-0-19-29.us-east-2.compute.internal <none> <none>
rook-ceph-osd-2-6c5466b9d7-qpr8n 2/2 Running 0 11h 10.0.14.9 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
rook-ceph-osd-3-5df7c9f96c-rxjp4 2/2 Running 0 11h 10.0.14.108 ip-10-0-14-108.us-east-2.compute.internal <none> <none>
rook-ceph-osd-4-556c86bc54-kw94j 2/2 Running 0 11h 10.0.17.148 ip-10-0-17-148.us-east-2.compute.internal <none> <none>
rook-ceph-osd-5-7ff7958f64-kldhm 2/2 Running 0 11h 10.0.14.9 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
rook-ceph-osd-6-c67b4f987-rdw22 2/2 Running 0 11h 10.0.23.98 ip-10-0-23-98.us-east-2.compute.internal <none> <none>
rook-ceph-osd-7-76cf788868-nvbsb 2/2 Running 0 11h 10.0.23.153 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
rook-ceph-osd-8-7d5f78c798-4dn5w 2/2 Running 0 11h 10.0.14.108 ip-10-0-14-108.us-east-2.compute.internal <none> <none>
rook-ceph-osd-9-99977b6dd-p9cpx 2/2 Running 0 11h 10.0.23.153 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-0-data-02pt7c-7chxv 0/1 Completed 0 11h 10.0.17.148 ip-10-0-17-148.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-0-data-16zc9r-4k45d 0/1 Completed 0 11h 10.0.14.9 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-0-data-24q4cg-mlsql 0/1 Completed 0 11h 10.0.23.153 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-1-data-0dstck-mbj9v 0/1 Completed 0 11h 10.0.23.153 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-1-data-1d6k26-cnhjl 0/1 Pending 0 11h <none> <none> <none> <none>
rook-ceph-osd-prepare-default-1-data-3hbbtb-6wj6m 0/1 Completed 0 11h 10.0.23.98 ip-10-0-23-98.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-1-data-4npcdn-sbl84 0/1 Completed 0 11h 10.0.23.153 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-2-data-0kdf28-t54mb 0/1 Completed 0 11h 10.0.14.9 ip-10-0-14-9.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-2-data-1mjk7w-nss5f 0/1 Completed 0 11h 10.0.23.153 ip-10-0-23-153.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-2-data-4s9snn-ntr9j 0/1 Pending 0 11h <none> <none> <none> <none>
rook-ceph-tools-7c8c77bd96-9rtnx 1/1 Running 0 11h 10.0.23.153 ip-10-0-23-153.us-east-2.compute.internal <none> <none>

$ oc describe pod rook-ceph-osd-prepare-default-1-data-1d6k26-cnhjl | grep "Events:" -A 20
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  29m (x806 over 11h)  default-scheduler  0/12 nodes are available: 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had volume node affinity conflict.
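The FailedScheduling message combines two effects: the prepare pod's PVC is bound to a volume whose node affinity (on AWS, typically a single availability zone) excludes some nodes, and the remaining candidates are excluded by the OSD topology spread constraints or by infra/master taints. A minimal sketch for cross-checking this, assuming the PVC is named after the prepare pod (default-1-data-1d6k26) and the default openshift-storage namespace:

# Find the PV backing the stuck OSD PVC (PVC name assumed from the prepare pod name)
$ oc get pvc default-1-data-1d6k26 -n openshift-storage -o jsonpath='{.spec.volumeName}{"\n"}'

# Inspect the node affinity (zone) the PV is pinned to; substitute the PV name printed above
$ oc get pv <pv-name> -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}{"\n"}'

# List worker nodes and their zones to see whether any schedulable node matches that zone
$ oc get nodes -l node-role.kubernetes.io/worker -L topology.kubernetes.io/zone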
$ oc rsh rook-ceph-tools-7c8c77bd96-9rtnx ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                            STATUS  REWEIGHT  PRI-AFF
 -1         52.00000  root default
 -5         52.00000      region us-east-2
-10         16.00000          zone us-east-2a
 -9          4.00000              host default-0-data-16zc9r
  2    ssd   4.00000                  osd.2                 up     1.00000   1.00000
-13          4.00000              host default-0-data-4rw242
  3    ssd   4.00000                  osd.3                 up     1.00000   1.00000
-15          4.00000              host default-2-data-0kdf28
  5    ssd   4.00000                  osd.5                 up     1.00000   1.00000
-27          4.00000              host default-2-data-3w45nk
  8    ssd   4.00000                  osd.8                 up     1.00000   1.00000
 -4         16.00000          zone us-east-2b
 -3          4.00000              host default-0-data-02pt7c
  1    ssd   4.00000                  osd.1                 up     1.00000   1.00000
-17          4.00000              host default-0-data-3h8zxb
  4    ssd   4.00000                  osd.4                 up     1.00000   1.00000
-31          4.00000              host default-1-data-28gzn7
 13    ssd   4.00000                  osd.13                up     1.00000   1.00000
-33          4.00000              host default-2-data-2npbmj
 14    ssd   4.00000                  osd.14                up     1.00000   1.00000
-20         20.00000          zone us-east-2c
-23          4.00000              host default-0-data-24q4cg
  0    ssd   4.00000                  osd.0                 up     1.00000   1.00000
-19          4.00000              host default-1-data-0dstck
  7    ssd   4.00000                  osd.7                 up     1.00000   1.00000
-35          4.00000              host default-1-data-3hbbtb
 11    ssd   4.00000                  osd.11                up     1.00000   1.00000
-25          4.00000              host default-1-data-4npcdn
  6    ssd   4.00000                  osd.6                 up     1.00000   1.00000
-29          4.00000              host default-2-data-1mjk7w
  9    ssd   4.00000                  osd.9                 up     1.00000   1.00000
 10                0  osd.10                                down         0   1.00000
 12                0  osd.12                                down         0   1.00000

$ oc rsh rook-ceph-tools-7c8c77bd96-9rtnx ceph status
  cluster:
    id:     b271d171-d9c5-4b32-ba39-c095e12f4d28
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 12h)
    mgr: a(active, since 12h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 15 osds: 13 up (since 11h), 13 in (since 11h)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 545 pgs
    objects: 38 objects, 3.3 KiB
    usage:   112 MiB used, 52 TiB / 52 TiB avail
    pgs:     545 active+clean

  io:
    client: 852 B/s rd, 1 op/s rd, 0 op/s wr

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-20tb-pr/jijoy-20tb-pr_20230203T170743/logs/failed_testcase_ocs_logs_1675448890/test_deployment_ocs_logs/

=================================================================================
Version-Release number of selected component (if applicable):

$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11            ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.456-02ea942   Route Monitor Operator        0.1.456-02ea942   route-monitor-operator.v0.1.454-494fffd   Succeeded

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         12h     Cluster version is 4.10.50

=================================================================================
How reproducible:
Reporting the first occurrence.

Steps to Reproduce:
1. Deploy MS provider cluster of size 20 with ocs-provider-qe addon.
Deployment example: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/20290/

=================================================================================
Actual results:
Cluster verification after the addon installation failed because 2 OSD pods were missing.

Expected results:
All required OSD pods should be running.

Additional info:
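For the expected result above, a size-20 provider cluster should end up with 15 running OSD pods (the ceph osd tree shows 15 OSD entries, with osd.10 and osd.12 down and their pods absent). A minimal verification sketch, assuming the default openshift-storage namespace and the standard Rook labels app=rook-ceph-osd / app=rook-ceph-osd-prepare:

# Count running OSD pods; for this cluster size the expected value is 15
$ oc get pods -n openshift-storage -l app=rook-ceph-osd --field-selector=status.phase=Running --no-headers | wc -l

# Any prepare pod that is not Completed points at an OSD that was never created
$ oc get pods -n openshift-storage -l app=rook-ceph-osd-prepare --no-headers | grep -v Completed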