Bug 2033607
| Summary: | [IBM ROKS] OSDs are not equally spread over ODF nodes | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Elvir Kuric <ekuric> |
| Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida> |
| Status: | ASSIGNED | QA Contact: | Elad <ebenahar> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | mbukatov, MCGINNES, mmuench, mparida, muagarwa, nigoyal, odf-bz-bot, sbose, sostapov |
| Target Milestone: | --- | Keywords: | Performance |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
# oc adm cordon 10.240.64.10
# oc delete pod rook-ceph-osd-11-68cdb6b48b-6svzv

The pod now starts on a node with fewer OSDs.

# oc get pods -o wide | grep osd | grep -v Com | grep -v drop
rook-ceph-osd-0-5fd67b877-bhj5z     2/2  Running  0  27h    172.17.113.166  10.240.128.6  <none>  <none>
rook-ceph-osd-1-76b4648f55-hnxsd    2/2  Running  0  27h    172.17.83.157   10.240.0.23   <none>  <none>
rook-ceph-osd-10-77786c9b7d-459bf   2/2  Running  0  55m    172.17.83.185   10.240.0.23   <none>  <none>
rook-ceph-osd-11-68cdb6b48b-4f7n9   2/2  Running  0  5m40s  172.17.88.67    10.240.64.9   <none>  <none>
rook-ceph-osd-2-784ff6bb85-hk47l    2/2  Running  3  27h    172.17.88.69    10.240.64.9   <none>  <none>
rook-ceph-osd-3-696956c7c9-gz7f7    2/2  Running  0  27h    172.17.116.5    10.240.128.7  <none>  <none>
rook-ceph-osd-4-75c65759bb-rfl78    2/2  Running  0  27h    172.17.99.132   10.240.64.10  <none>  <none>
rook-ceph-osd-5-5557f5b757-qlrnm    2/2  Running  0  27h    172.17.112.217  10.240.0.24   <none>  <none>
rook-ceph-osd-6-85675f89fd-tj6bn    2/2  Running  0  74m    172.17.116.18   10.240.128.7  <none>  <none>
rook-ceph-osd-7-7dc69c6f6d-5vgdt    2/2  Running  0  74m    172.17.99.160   10.240.64.10  <none>  <none>
rook-ceph-osd-8-674486cbf5-7xbg7    2/2  Running  0  74m    172.17.112.211  10.240.0.24   <none>  <none>
rook-ceph-osd-9-76d976f4df-f8kb8    2/2  Running  0  56m    172.17.113.164  10.240.128.6  <none>  <none>
# oc adm uncordon 10.240.64.10

Moving it to Rook. If you (Rook maintainers) think this should go to ocs-operator, please move it accordingly.

Are the OSDs expected to be portable? In the cephcluster CR [1] I see "portable: true". If the OSDs are really expected to be spread evenly across the nodes, the OSDs should not be portable.
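For context, the portable flag discussed above sits on the device set in the Rook CephCluster CR. The following is a minimal sketch of that stanza, not an excerpt from the affected cluster; the device-set name, count, and PVC size are illustrative placeholders.

```yaml
# Sketch of where "portable" lives in a Rook CephCluster CR.
# Name, count, and storage size are placeholders, not values from this cluster.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster
  namespace: openshift-storage
spec:
  storage:
    storageClassDeviceSets:
    - name: ocs-deviceset-0        # placeholder name
      count: 4
      portable: true               # portable=true lets an OSD follow its PVC to another node
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi       # placeholder size
          volumeMode: Block
```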
In the CephCluster CR [1] the topology spread constraints look as expected, with DoNotSchedule to ensure the OSDs are evenly spread across zones and ScheduleAnyway for the host spread. Even with that soft preference for host spread, I'm surprised the OSDs aren't getting spread more evenly. For reference, the constraints from the CR are:
topologySpreadConstraints:
- labelSelector:
matchExpressions:
- key: ceph.rook.io/pvc
operator: Exists
maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
- labelSelector:
matchExpressions:
- key: ceph.rook.io/pvc
operator: Exists
maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
When portable=false, I would expect the OCS operator to set "whenUnsatisfiable: DoNotSchedule" for the hostname constraint (a sketch of that follows below). If that is not the case, let's move this to the OCS operator. But if portable=true is intended, we need to understand why the topology spread constraints aren't working as configured.
[1] http://perf148b.perf.lab.eng.bos.redhat.com/bz_osd_ibm/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-bfb5c6e78f74c584cf169e1f431d687314ab48472dddc46fe6767a836ea4bb3e/ceph/namespaces/openshift-storage/ceph.rook.io/cephclusters/ocs-storagecluster-cephcluster.yaml
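For illustration, this is roughly the hostname constraint one would expect for non-portable OSDs. It is the same constraint quoted from the CR above with only whenUnsatisfiable changed; treat it as a sketch, not actual operator output.

```yaml
# Sketch: hard host spread instead of the soft ScheduleAnyway preference.
topologySpreadConstraints:
- labelSelector:
    matchExpressions:
    - key: ceph.rook.io/pvc
      operator: Exists
  maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule   # hard requirement instead of ScheduleAnyway
```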
I am not sure why "portable=true" is set in the cluster configuration; I will check with the IBM team why this is and whether it is IBM ROKS specific for ODF.

(In reply to Elvir Kuric from comment #7)
> I am not sure why "portable=true" is set in the cluster configuration; I will check with
> the IBM team why this is and whether it is IBM ROKS specific for ODF.

Any updates?

Due to lack of priority, and because we are approaching Dev Freeze, moving this to ODF 4.11.

Due to lack of priority, and because we are approaching Dev Freeze (again), moving this to ODF 4.12.

Are you still able to reproduce the problem? Is this happening anywhere outside of ROKS?
Description of problem (please be detailed as possible and provide log snippets):

On IBM Cloud (ROKS), we have 6 OCP nodes and all of them are labeled to serve as ODF nodes too:

# oc get nodes
NAME           STATUS   ROLES           AGE   VERSION
10.240.0.23    Ready    master,worker   14d   v1.21.4+6438632
10.240.0.24    Ready    master,worker   14d   v1.21.4+6438632
10.240.128.6   Ready    master,worker   14d   v1.21.4+6438632
10.240.128.7   Ready    master,worker   14d   v1.21.4+6438632
10.240.64.10   Ready    master,worker   14d   v1.21.4+6438632
10.240.64.9    Ready    master,worker   14d   v1.21.4+6438632

[root@perf148b ~]# oc get nodes --show-labels | grep storage | awk '{print $1}'
10.240.0.23
10.240.0.24
10.240.128.6
10.240.128.7
10.240.64.10
10.240.64.9

When an ODF cluster is created on this setup, the following happens:

-> Adding the first three OSDs results in 1 OSD per node.
-> Adding three more OSDs leads to them landing on the same nodes as the first 3 OSDs. To ensure that every node gets one OSD, we have to cordon a node and delete one OSD pod, which then starts on a node where no OSD runs.
-> If we add 3 more OSDs (6 -> 9 OSDs), they spread one per node.
-> If we add 3 more OSDs (9 -> 12 OSDs), some nodes end up with 3 OSDs while others stay at 1 OSD - e.g. node 10.240.64.10 has 3 OSDs and node 10.240.64.9 has 1 OSD in the output below.

# oc get pods -o wide | grep osd | grep -v Com | grep -v drop
rook-ceph-osd-0-5fd67b877-bhj5z     2/2  Running  0  26h  172.17.113.166  10.240.128.6  <none>  <none>
rook-ceph-osd-1-76b4648f55-hnxsd    2/2  Running  0  26h  172.17.83.157   10.240.0.23   <none>  <none>
rook-ceph-osd-10-77786c9b7d-459bf   2/2  Running  0  24m  172.17.83.185   10.240.0.23   <none>  <none>
rook-ceph-osd-11-68cdb6b48b-6svzv   2/2  Running  0  23m  172.17.99.176   10.240.64.10  <none>  <none>
rook-ceph-osd-2-784ff6bb85-hk47l    2/2  Running  3  26h  172.17.88.69    10.240.64.9   <none>  <none>
rook-ceph-osd-3-696956c7c9-gz7f7    2/2  Running  0  26h  172.17.116.5    10.240.128.7  <none>  <none>
rook-ceph-osd-4-75c65759bb-rfl78    2/2  Running  0  26h  172.17.99.132   10.240.64.10  <none>  <none>
rook-ceph-osd-5-5557f5b757-qlrnm    2/2  Running  0  26h  172.17.112.217  10.240.0.24   <none>  <none>
rook-ceph-osd-6-85675f89fd-tj6bn    2/2  Running  0  42m  172.17.116.18   10.240.128.7  <none>  <none>
rook-ceph-osd-7-7dc69c6f6d-5vgdt    2/2  Running  0  42m  172.17.99.160   10.240.64.10  <none>  <none>
rook-ceph-osd-8-674486cbf5-7xbg7    2/2  Running  0  42m  172.17.112.211  10.240.0.24   <none>  <none>
rook-ceph-osd-9-76d976f4df-f8kb8    2/2  Running  0  24m  172.17.113.164  10.240.128.6  <none>  <none>

From a performance/scale perspective this is not ideal.

Version of all relevant components (if applicable):
OCP v4.8 / ODF v4.8, installed on IBM Cloud

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
Yes

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

Steps to Reproduce:
1. Install a 6-node OCP/ODF cluster and start adding OSDs to the cluster

Actual results:
OSD pods are not spread equally across ODF nodes

Expected results:
Every node gets the same number of OSDs

Additional info:
We expand the cluster with:

# oc edit storagecluster -n openshift-storage

and increase the count for storageDeviceSets:

storageDeviceSets:
- config: {}
  count: 4
  dataPVCTemplate:
    metadata: {}

Attached will be storagecluster.yml and must-gather from the affected cluster.
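For reference, a minimal sketch of the storageDeviceSets stanza being edited above. Only "count: 4" and the fields shown in the description come from this report; the name, replica, PVC size, and storage class are illustrative placeholders.

```yaml
# Sketch of the StorageCluster stanza edited via "oc edit storagecluster -n openshift-storage".
# Name, replica, storage size, and storage class are placeholders, not values from this cluster.
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  storageDeviceSets:
  - name: ocs-deviceset          # placeholder name
    config: {}
    count: 4                     # increasing this adds OSDs (count x replica in total)
    replica: 3                   # placeholder; typically 3 for a 3-zone layout
    portable: true               # see the discussion of portability in the comments above
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 512Gi       # placeholder size
        storageClassName: ibmc-vpc-block-metro-10iops-tier   # placeholder storage class
        volumeMode: Block
```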