Description of problem:

osd.9 and osd.10 are marked as down in the "ceph osd tree" output. The osd-9 and osd-10 pods are missing; osd-15 and osd-16 pods are running instead.

$ oc get pods -o wide -l osd
NAME                                READY   STATUS    RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-5fb844c65b-kvzzg    2/2     Running   0          157m   10.0.17.34    ip-10-0-17-34.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-1-67d56959d8-l49bq    2/2     Running   0          157m   10.0.14.24    ip-10-0-14-24.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-11-698f94f8f4-vqvp7   2/2     Running   0          152m   10.0.19.38    ip-10-0-19-38.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-12-85f55945d6-cpdps   2/2     Running   0          157m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-13-7f4c5dc66d-dh7q9   2/2     Running   0          157m   10.0.14.24    ip-10-0-14-24.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-14-7b9ddd9967-qzj25   2/2     Running   0          157m   10.0.14.24    ip-10-0-14-24.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-15-65b76d4467-gctxj   2/2     Running   0          155m   10.0.21.244   ip-10-0-21-244.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-16-5bc8c589df-gszfq   2/2     Running   0          155m   10.0.22.40    ip-10-0-22-40.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-2-76479488bb-vlbz8    2/2     Running   0          157m   10.0.22.40    ip-10-0-22-40.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-3-c5985f568-xs8q7     2/2     Running   0          157m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-4-667f88675d-tbjrz    2/2     Running   0          157m   10.0.14.194   ip-10-0-14-194.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-5-5cc6866895-q96s6    2/2     Running   0          157m   10.0.17.34    ip-10-0-17-34.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-6-58b5b97594-m6txx    2/2     Running   0          152m   10.0.19.38    ip-10-0-19-38.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-7-7c688846cb-pmfjv    2/2     Running   0          157m   10.0.22.40    ip-10-0-22-40.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-8-65ff789554-vf6x5    2/2     Running   0          157m   10.0.21.244   ip-10-0-21-244.us-east-2.compute.internal   <none>           <none>

The CRUSH weight is not distributed correctly across zones.
$ oc exec rook-ceph-tools-7c8c77bd96-g9r2v -- ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                              STATUS  REWEIGHT  PRI-AFF
 -1         60.00000  root default
 -5         60.00000      region us-east-2
-10         24.00000          zone us-east-2a
-19          4.00000              host default-0-data-0zb589
  1    ssd   4.00000                  osd.1                  up       1.00000  1.00000
-27          4.00000              host default-1-data-0dgqxh
 13    ssd   4.00000                  osd.13                 up       1.00000  1.00000
-29          4.00000              host default-1-data-18p4w9
 14    ssd   4.00000                  osd.14                 up       1.00000  1.00000
-13          4.00000              host default-1-data-3ccnsk
  4    ssd   4.00000                  osd.4                  up       1.00000  1.00000
-31          4.00000              host default-2-data-1crwhx
 12    ssd   4.00000                  osd.12                 up       1.00000  1.00000
 -9          4.00000              host default-2-data-4drz6r
  3    ssd   4.00000                  osd.3                  up       1.00000  1.00000
 -4         16.00000          zone us-east-2b
 -3          4.00000              host default-0-data-1lmf4s
  0    ssd   4.00000                  osd.0                  up       1.00000  1.00000
-21          4.00000              host default-0-data-4c6c6c
  6    ssd   4.00000                  osd.6                  up       1.00000  1.00000
-25          4.00000              host default-1-data-4qj4rs
  5    ssd   4.00000                  osd.5                  up       1.00000  1.00000
-23          4.00000              host default-2-data-3cwl4r
 11    ssd   4.00000                  osd.11                 up       1.00000  1.00000
-16         20.00000          zone us-east-2c
-35          4.00000              host default-0-data-2w7jjk
  2    ssd   4.00000                  osd.2                  up       1.00000  1.00000
-37          4.00000              host default-0-data-32lmdw
 15    ssd   4.00000                  osd.15                 up       1.00000  1.00000
-33          4.00000              host default-1-data-28l6lr
  7    ssd   4.00000                  osd.7                  up       1.00000  1.00000
-39          4.00000              host default-2-data-0b5gpt
 16    ssd   4.00000                  osd.16                 up       1.00000  1.00000
-15          4.00000              host default-2-data-26pmj5
  8    ssd   4.00000                  osd.8                  up       1.00000  1.00000
  9                0  osd.9                                  down           0  1.00000
 10                0  osd.10                                 down           0  1.00000

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-f6-s20-pr/jijoy-f6-s20-pr_20230206T073026/logs/deployment_1675674405/

=================================================================================================================

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.50   True        False         176m    Cluster version is 4.10.50

$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11            ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.456-02ea942   Route Monitor Operator        0.1.456-02ea942   route-monitor-operator.v0.1.454-494fffd   Succeeded

=================================================================================================================

How reproducible:
Reporting the first occurrence of the issue with the revised topology changes.

Steps to Reproduce:
1. Install a Managed Services provider cluster with size 20.
2. Verify the "ceph osd tree" output and the list of OSD pods (a parsing sketch is included after "Additional info" below).

Actual results:
Some OSDs are marked as down, and the number of OSDs per zone is incorrect.

Expected results:
All OSDs should be up and evenly distributed across zones in the "ceph osd tree" output.

Additional info:
Bug #2166915 was also seen in this cluster. There is a similar bug, #2136378 (closed as not a bug after discussion), where Ceph health was not okay; Ceph health is HEALTH_OK in this case.
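Illustrative sketch (not from the original run): one way to check the per-zone distribution in step 2 is to parse the JSON form of the tree, e.g. "ceph osd tree -f json" piped from the tools pod. The Python sketch below counts the OSDs attached to each zone bucket and lists OSDs that are not attached to any bucket; the script name and the tools-pod placeholder are assumptions, not taken from this cluster.

#!/usr/bin/env python3
# Sketch: count OSDs per CRUSH zone from the JSON form of the tree, e.g.
#   oc exec <rook-ceph-tools-pod> -- ceph osd tree -f json | python3 osds_per_zone.py
# (<rook-ceph-tools-pod> and the script name are placeholders.)
import json
import sys

tree = json.load(sys.stdin)
nodes = {n["id"]: n for n in tree["nodes"]}

def leaf_osds(node_id):
    """Recursively collect the OSD ids under a CRUSH bucket."""
    node = nodes[node_id]
    if node["type"] == "osd":
        return [node_id]
    osds = []
    for child_id in node.get("children", []):
        osds.extend(leaf_osds(child_id))
    return osds

for node in nodes.values():
    if node["type"] == "zone":
        osds = leaf_osds(node["id"])
        down = [i for i in osds if nodes[i].get("status") != "up"]
        print(f'{node["name"]}: {len(osds)} OSDs, {len(down)} down')

# OSDs reported under "stray" are not attached to any CRUSH bucket.
for osd in tree.get("stray", []):
    print(f'stray: {osd["name"]} ({osd.get("status", "unknown")})')

For the tree pasted above, this should report 6 OSDs in us-east-2a, 4 in us-east-2b and 5 in us-east-2c, with osd.9 and osd.10 expected to show up as strays.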
Duplicate of bug #2167045, or the same root cause? Did we see this with the latest image again?
OSDs are distributed equally across zones after https://github.com/red-hat-storage/ocs-osd-deployer/pull/281 was merged.
Moving this to Verified: the bug was not reproduced during testing on a build that includes the PR mentioned in comment #c3.
Closing this as it has been verified by QE and fixed in v2.0.11.