Bug 2033607 - [IBM ROKS] OSDs are not equally spread over ODF nodes
Summary: [IBM ROKS] OSDs are not equally spread over ODF nodes
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Malay Kumar parida
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-17 11:54 UTC by Elvir Kuric
Modified: 2023-08-09 17:00 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links:
Red Hat Bugzilla 2004801 (unspecified priority, ASSIGNED): OSDs not distributed uniformly across OCS Worker nodes - last updated 2023-08-09 17:00:43 UTC

Internal Links: 2004801

Description Elvir Kuric 2021-12-17 11:54:05 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

On IBM Cloud (ROKS), if we have 6 OCP nodes and all of them are labeled to serve as ODF nodes too:

# oc get nodes
NAME           STATUS   ROLES           AGE   VERSION
10.240.0.23    Ready    master,worker   14d   v1.21.4+6438632
10.240.0.24    Ready    master,worker   14d   v1.21.4+6438632
10.240.128.6   Ready    master,worker   14d   v1.21.4+6438632
10.240.128.7   Ready    master,worker   14d   v1.21.4+6438632
10.240.64.10   Ready    master,worker   14d   v1.21.4+6438632
10.240.64.9    Ready    master,worker   14d   v1.21.4+6438632

[root@perf148b ~]# oc get nodes --show-labels |grep storage |awk '{print $1}'
10.240.0.23
10.240.0.24
10.240.128.6
10.240.128.7
10.240.64.10
10.240.64.9

Then, if an ODF cluster is created on this setup, the following happens:

-> adding the first three OSDs ensures that we get 1 OSD per node
-> adding three more leads to them landing on the same nodes as the first 3 OSDs.
To ensure that every node gets one OSD, we have to "cordon" a node and delete one OSD pod; it then starts on a node where
no OSD runs (see comment 4).

-> if we add 3 more OSDs (6 -> 9 OSDs) they spread one per node
-> if we add 3 more OSDs (9 -> 12 OSDs) some nodes end up with 3 OSDs while others stay at 1 OSD - e.g. node "10.240.64.10" has 3 OSDs and node "10.240.64.9" has 1 OSD in the output below


# oc get pods  -o wide|grep osd |grep -v Com |grep -v drop
rook-ceph-osd-0-5fd67b877-bhj5z                                   2/2     Running     0          26h    172.17.113.166   10.240.128.6   <none>           <none>
rook-ceph-osd-1-76b4648f55-hnxsd                                  2/2     Running     0          26h    172.17.83.157    10.240.0.23    <none>           <none>
rook-ceph-osd-10-77786c9b7d-459bf                                 2/2     Running     0          24m    172.17.83.185    10.240.0.23    <none>           <none>
rook-ceph-osd-11-68cdb6b48b-6svzv                                 2/2     Running     0          23m    172.17.99.176    10.240.64.10   <none>           <none>
rook-ceph-osd-2-784ff6bb85-hk47l                                  2/2     Running     3          26h    172.17.88.69     10.240.64.9    <none>           <none>
rook-ceph-osd-3-696956c7c9-gz7f7                                  2/2     Running     0          26h    172.17.116.5     10.240.128.7   <none>           <none>
rook-ceph-osd-4-75c65759bb-rfl78                                  2/2     Running     0          26h    172.17.99.132    10.240.64.10   <none>           <none>
rook-ceph-osd-5-5557f5b757-qlrnm                                  2/2     Running     0          26h    172.17.112.217   10.240.0.24    <none>           <none>
rook-ceph-osd-6-85675f89fd-tj6bn                                  2/2     Running     0          42m    172.17.116.18    10.240.128.7   <none>           <none>
rook-ceph-osd-7-7dc69c6f6d-5vgdt                                  2/2     Running     0          42m    172.17.99.160    10.240.64.10   <none>           <none>
rook-ceph-osd-8-674486cbf5-7xbg7                                  2/2     Running     0          42m    172.17.112.211   10.240.0.24    <none>           <none>
rook-ceph-osd-9-76d976f4df-f8kb8                                  2/2     Running     0          24m    172.17.113.164   10.240.128.6   <none>           <none>
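
To summarize the distribution per node, a one-liner like this can be used (a sketch; it assumes the openshift-storage namespace and that NODE is the 7th column of the wide output):

# oc get pods -n openshift-storage -o wide | grep rook-ceph-osd- | grep -v prepare | awk '{print $7}' | sort | uniq -c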


From a performance / scale perspective this is not ideal.


Version of all relevant components (if applicable):

OCP v4.8 / ODF v4.8 installed on IBM Cloud

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
No

Is there any workaround available to the best of your knowledge?
Yes

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
N/A

Steps to Reproduce:
1. Install a 6-node OCP/ODF cluster and start adding OSDs to the cluster


Actual results:
OSD pods are not spread equally across ODF nodes


Expected results:
Every node gets the same number of OSDs


Additional info:

We expand the cluster with:

# oc edit storagecluster -n openshift-storage

and increase the count for storageDeviceSets:

storageDeviceSets:
  - config: {}
    count: 4
    dataPVCTemplate:
      metadata: {}
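
For reference, a fuller sketch of the stanza being edited. Only the count field comes from this bug; the device set name, PVC size, and storage class below are illustrative placeholders that vary per cluster:

storageDeviceSets:
  - name: ocs-deviceset            # illustrative; the actual name varies per cluster
    config: {}
    count: 4                       # count x replica = total OSDs (here 4 x 3 = 12)
    replica: 3
    portable: true                 # observed in this cluster, see comment 6
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 512Gi         # illustrative size
        storageClassName: ibmc-vpc-block-metro-10iops-tier   # illustrative storage class
        volumeMode: Block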


The storagecluster.yml and a must-gather from the affected cluster will be attached.

Comment 4 Elvir Kuric 2021-12-17 12:18:17 UTC
# oc adm cordon 10.240.64.10
# oc delete pod rook-ceph-osd-11-68cdb6b48b-6svzv

The pod now starts on the node with fewer OSDs.

# oc get pods  -o wide|grep osd |grep -v Com |grep -v drop

rook-ceph-osd-0-5fd67b877-bhj5z                                   2/2     Running     0          27h     172.17.113.166   10.240.128.6   <none>           <none>
rook-ceph-osd-1-76b4648f55-hnxsd                                  2/2     Running     0          27h     172.17.83.157    10.240.0.23    <none>           <none>
rook-ceph-osd-10-77786c9b7d-459bf                                 2/2     Running     0          55m     172.17.83.185    10.240.0.23    <none>           <none>
rook-ceph-osd-11-68cdb6b48b-4f7n9                                 2/2     Running     0          5m40s   172.17.88.67     10.240.64.9    <none>           <none>
rook-ceph-osd-2-784ff6bb85-hk47l                                  2/2     Running     3          27h     172.17.88.69     10.240.64.9    <none>           <none>
rook-ceph-osd-3-696956c7c9-gz7f7                                  2/2     Running     0          27h     172.17.116.5     10.240.128.7   <none>           <none>
rook-ceph-osd-4-75c65759bb-rfl78                                  2/2     Running     0          27h     172.17.99.132    10.240.64.10   <none>           <none>
rook-ceph-osd-5-5557f5b757-qlrnm                                  2/2     Running     0          27h     172.17.112.217   10.240.0.24    <none>           <none>
rook-ceph-osd-6-85675f89fd-tj6bn                                  2/2     Running     0          74m     172.17.116.18    10.240.128.7   <none>           <none>
rook-ceph-osd-7-7dc69c6f6d-5vgdt                                  2/2     Running     0          74m     172.17.99.160    10.240.64.10   <none>           <none>
rook-ceph-osd-8-674486cbf5-7xbg7                                  2/2     Running     0          74m     172.17.112.211   10.240.0.24    <none>           <none>
rook-ceph-osd-9-76d976f4df-f8kb8                                  2/2     Running     0          56m     172.17.113.164   10.240.128.6   <none>           <none>

# oc adm uncordon 10.240.64.10

Comment 5 Nitin Goyal 2021-12-17 13:24:36 UTC
Moving it to Rook. If you (Rook maintainers) think this should go to ocs-operator, please move accordingly.

Comment 6 Travis Nielsen 2021-12-17 17:16:21 UTC
Are the OSDs expected to be portable? In the cephcluster CR [1] I see that "portable: true". If the OSDs are really expected to be spread evenly across the nodes, the OSDs should not be portable. 

In the CephCluster CR [1] the topology spread constraints look as expected, with DoNotSchedule to ensure the OSDs are evenly spread across zones and ScheduleAnyway for the host spread. With that preference for host spread, I'm surprised the OSDs aren't getting spread more evenly.

        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway

When portable=false, I would expect the OCS operator to set "whenUnsatisfiable: DoNotSchedule". If this is not the case, let's move it to the OCS operator. But if portable=true is intended, we need to understand why the TSCs aren't working as configured.
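
For illustration, a sketch of how the host-level constraint above would look with the stricter setting (this is the suggested change, not what the operator currently generates here):

        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule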


[1] http://perf148b.perf.lab.eng.bos.redhat.com/bz_osd_ibm/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-bfb5c6e78f74c584cf169e1f431d687314ab48472dddc46fe6767a836ea4bb3e/ceph/namespaces/openshift-storage/ceph.rook.io/cephclusters/ocs-storagecluster-cephcluster.yaml

Comment 7 Elvir Kuric 2021-12-20 11:32:52 UTC
I am not sure why "portable=true" is set in the cluster configuration; I will check with the IBM team why that is and whether it is IBM ROKS specific for ODF.

Comment 8 Yaniv Kaul 2022-01-16 09:42:20 UTC
(In reply to Elvir Kuric from comment #7)
> I am not sure why "portable=true" is set in the cluster configuration; I will
> check with the IBM team why that is and whether it is IBM ROKS specific for ODF.

Any updates?

Comment 13 Jose A. Rivera 2022-02-22 15:27:39 UTC
Due to lack of priority, and since we're approaching Dev Freeze, moving this to ODF 4.11.

Comment 14 Jose A. Rivera 2022-06-21 14:12:27 UTC
Due to lack of priority, and since we're approaching Dev Freeze (again), moving this to ODF 4.12.

Are you still able to reproduce the problem? Is this happening anywhere outside of ROKS?

