Bug 2072900
| Summary: | ceph osd tree shows some OSDs as up when all the OSDs in the cluster are scaled down | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Rachael <rgeorge> |
| Component: | odf-managed-service | Assignee: | Nobody <nobody> |
| Status: | CLOSED NOTABUG | QA Contact: | Filip Balák <fbalak> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | aeyal, dbindra, fbalak, mmuench, nberry, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | Keywords: | Tracking |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2125123 (view as bug list) | Environment: | |
| Last Closed: | 2023-04-13 12:41:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2125123 | | |
@rgeorge Can you please open a tracker bug in the product?
Description of problem:
-----------------------
When all the OSD deployments on the provider cluster were scaled down, the `ceph osd tree` output showed only 11/15 OSDs marked as down. 4 OSDs, all belonging to the same zone, were still marked as up. The storagecluster was also in Ready state.

```
$ oc get pods | grep rook-ceph-osd
$

$ oc get deployment | grep rook-ceph-osd
rook-ceph-osd-0    0/0   0   0   24h
rook-ceph-osd-1    0/0   0   0   24h
rook-ceph-osd-10   0/0   0   0   24h
rook-ceph-osd-11   0/0   0   0   24h
rook-ceph-osd-12   0/0   0   0   24h
rook-ceph-osd-13   0/0   0   0   24h
rook-ceph-osd-14   0/0   0   0   24h
rook-ceph-osd-2    0/0   0   0   24h
rook-ceph-osd-3    0/0   0   0   24h
rook-ceph-osd-4    0/0   0   0   24h
rook-ceph-osd-5    0/0   0   0   24h
rook-ceph-osd-6    0/0   0   0   24h
rook-ceph-osd-7    0/0   0   0   24h
rook-ceph-osd-8    0/0   0   0   24h
rook-ceph-osd-9    0/0   0   0   24h
```

====== ceph osd tree ======

```
ID   CLASS  WEIGHT    TYPE NAME                          STATUS  REWEIGHT  PRI-AFF
 -1         60.00000  root default
 -5         60.00000      region us-east-1
 -4         20.00000          zone us-east-1a
-13          4.00000              host default-0-data-1zdggq
  7    ssd   4.00000                  osd.7              down    1.00000   1.00000
-37          4.00000              host default-0-data-3gwklp
 14    ssd   4.00000                  osd.14             down    1.00000   1.00000
 -3          4.00000              host default-1-data-2rczfm
  1    ssd   4.00000                  osd.1              down    1.00000   1.00000
-33          4.00000              host default-1-data-4zs69q
 13    ssd   4.00000                  osd.13             down    1.00000   1.00000
-23          4.00000              host default-2-data-375ffh
  0    ssd   4.00000                  osd.0              down    1.00000   1.00000
-16         20.00000          zone us-east-1b
-15          4.00000              host default-0-data-0csszt
  5    ssd   4.00000                  osd.5              down    1.00000   1.00000
-31          4.00000              host default-0-data-4bqv28
 10    ssd   4.00000                  osd.10             down    1.00000   1.00000
-25          4.00000              host default-1-data-1jwj6h
  4    ssd   4.00000                  osd.4              down    1.00000   1.00000
-29          4.00000              host default-2-data-0h96sb
  9    ssd   4.00000                  osd.9              down    1.00000   1.00000
-27          4.00000              host default-2-data-2b5gf2
  6    ssd   4.00000                  osd.6              down    1.00000   1.00000
-10         20.00000          zone us-east-1c
-19          4.00000              host default-0-data-26nqmh
  8    ssd   4.00000                  osd.8              down    1.00000   1.00000
-39          4.00000              host default-1-data-08dt65
 12    ssd   4.00000                  osd.12             up      1.00000   1.00000
-21          4.00000              host default-1-data-37fw5n
  2    ssd   4.00000                  osd.2              up      1.00000   1.00000
-35          4.00000              host default-2-data-1cf9tb
 11    ssd   4.00000                  osd.11             up      1.00000   1.00000
 -9          4.00000              host default-2-data-4chbq2
  3    ssd   4.00000                  osd.3              up      1.00000   1.00000

$ oc get storagecluster
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   25h   Ready              2022-04-06T06:05:37Z
```

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

```
NAME                                      DISPLAY                       VERSION                            REPLACES                                  PHASE
mcg-operator.v4.10.0                      NooBaa Operator               4.10.0                                                                       Succeeded
ocs-operator.v4.10.0                      OpenShift Container Storage   4.10.0                                                                       Succeeded
ocs-osd-deployer.v2.0.0                   OCS OSD Deployer              2.0.0                                                                        Succeeded
odf-operator.v4.10.0                      OpenShift Data Foundation     4.10.0 (full_version=4.10.0-219)                                             Succeeded
ose-prometheus-operator.4.8.0             Prometheus Operator           4.8.0                                                                        Succeeded
route-monitor-operator.v0.1.408-c2256a2   Route Monitor Operator        0.1.408-c2256a2                    route-monitor-operator.v0.1.406-54ff884   Succeeded
```

How reproducible: 2/2

Steps to Reproduce:
-------------------
1. Scale down all the OSD deployments in the provider cluster:
   oc scale deployment rook-ceph-osd-0 --replicas=0
2. Check the status of the OSD pods
3. Check the output of ceph osd tree

Actual results:
---------------
Even though all the OSD pods are down, ceph osd tree reports 4/15 OSDs as up.

Expected results:
-----------------
The number of OSDs marked as up in ceph osd tree should match the number of OSD pods running in the cluster.

Additional info:
----------------
>> When all the OSDs came back up after reconcile, ~15 mins after they were scaled down, one of the OSDs was marked as down even though the corresponding OSD pod was up and running.

```
sh-4.4$ ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                          STATUS  REWEIGHT  PRI-AFF
 -1         60.00000  root default
[...]
 -5         60.00000      region us-east-1
-10         20.00000          zone us-east-1c
-35          4.00000              host default-2-data-1cf9tb
 11    ssd   4.00000                  osd.11             down    0         1.00000

$ oc get pods | grep rook-ceph-osd-11
rook-ceph-osd-11-7d758bc689-7gwdl   2/2   Running   0   98m

$ oc get deployment | grep rook-ceph-osd-11
rook-ceph-osd-11   1/1   1   1   26h
```
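The mismatch described above can be checked mechanically rather than by eyeballing the tree. As a hypothetical diagnostic helper (not part of the product or of Rook), the sketch below counts OSD up/down states from `ceph osd tree -f json` output; it assumes the standard `nodes` array with per-OSD `status` fields that the JSON formatter emits. Comparing the "up" count against the number of Running rook-ceph-osd pods would flag the discrepancy reported here.

```python
import json

def count_osds_by_status(tree_json: str) -> dict:
    """Count OSDs by status in `ceph osd tree -f json` output.

    Hypothetical diagnostic helper: walks the `nodes` array and tallies
    the `status` field of every node of type `osd`, ignoring buckets
    (root/region/zone/host), which carry no status.
    """
    tree = json.loads(tree_json)
    counts = {"up": 0, "down": 0}
    for node in tree.get("nodes", []):
        if node.get("type") == "osd":
            counts[node.get("status", "down")] += 1
    return counts

# Trimmed sample mirroring the report (the full cluster had 11 down, 4 up).
sample = json.dumps({
    "nodes": [
        {"id": -4, "type": "zone", "name": "us-east-1c"},
        {"id": 7,  "type": "osd",  "name": "osd.7",  "status": "down"},
        {"id": 12, "type": "osd",  "name": "osd.12", "status": "up"},
        {"id": 2,  "type": "osd",  "name": "osd.2",  "status": "up"},
    ]
})
print(count_osds_by_status(sample))  # {'up': 2, 'down': 1}
```

On the provider cluster the input would come from `oc exec` into the rook-ceph toolbox pod running `ceph osd tree -f json`, and the "up" count would be compared with `oc get pods | grep rook-ceph-osd | grep -c Running`.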