Bug 2254035 - OSD pods scheduling is inconsistent during multiple device sets scenarios
Summary: OSD pods scheduling is inconsistent during multiple device sets scenarios
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.13
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Malay Kumar parida
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-12-11 16:25 UTC by T K Chandra Hasan
Modified: 2024-09-02 13:15 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-19 09:45:41 UTC
Embargoed:



Description T K Chandra Hasan 2023-12-11 16:25:46 UTC
Description of problem (please be detailed as possible and provide log
snippets):
In an IBM Cloud ROKS cluster, we were validating the multiple-device-set feature and observed inconsistent OSD pod scheduling. We are following this article to create the device sets:

https://access.redhat.com/articles/6214381

This issue has been observed on both 4.13 and 4.14 ROKS clusters with the respective latest ODF versions.

We created a ROKS cluster with 3 workers of flavor 16x64G from IBM Cloud and, after cluster creation, installed our add-on to deploy ODF. By default this installs ODF with a single device set named "ocs-deviceset" and the "ibmc-vpc-block-metro-10iops-tier" storage class, and all the OSD pods are spread evenly across the available workers.
##########################################
    - config: {}
      count: 1
      dataPVCTemplate:
        metadata: {}
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-10iops-tier
          volumeMode: Block
        status: {}
      name: ocs-deviceset
      placement: {}
      portable: true
      preparePlacement: {}
      replica: 3
      resources: {}
###########################################
Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=default
NAME           STATUS   ROLES           AGE   VERSION           INTERNAL-IP    EXTERNAL-IP    OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
10.241.0.7     Ready    master,worker   8h    v1.26.9+aa37255   10.241.0.7     10.241.0.7     Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.128.7   Ready    master,worker   8h    v1.26.9+aa37255   10.241.128.7   10.241.128.7   Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.64.6    Ready    master,worker   8h    v1.26.9+aa37255   10.241.64.6    10.241.64.6    Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
##########################################
rook-ceph-osd-0-794885b46f-c2dx8                                  2/2     Running     0          7h39m   172.17.89.250    10.241.64.6     <none>           <none>
rook-ceph-osd-1-8699d65d57-88z2g                                  2/2     Running     0          7h39m   172.17.66.223    10.241.128.7    <none>           <none>
rook-ceph-osd-2-6b48c9b99-k8tb6                                   2/2     Running     0          7h38m   172.17.68.230    10.241.0.7      <none>           <none>
##########################################

Let's add another device set by editing the StorageCluster CR as per the above article, except that the deviceClass parameter is not added; the storage class is "ibmc-vpc-block-metro-5iops-tier". In this case, the OSD pods get scheduled on the nodes listed above and are spread across the zones.
##########################################
  - config: {}
    count: 1
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 512Gi
        storageClassName: ibmc-vpc-block-metro-5iops-tier
        volumeMode: Block
      status: {}
    name: ocs-deviceset-2
    placement: {}
    portable: true
    preparePlacement: {}
    replica: 3
    resources: {}
##########################################
rook-ceph-osd-3-549df4f77d-l7w5s                                  2/2     Running     0          7h8m    172.17.89.249    10.241.64.6     <none>           <none>
rook-ceph-osd-4-56464899-qk2bl                                    2/2     Running     0          7h8m    172.17.66.232    10.241.128.7    <none>           <none>
rook-ceph-osd-5-7bb8c4b8c4-zszfr                                  2/2     Running     0          7h7m    172.17.68.238    10.241.0.7      <none>           <none>
##########################################

Now create a worker pool of 3 workers from the IBM Cloud UI with the name "deviceset-3" and add the following labels to its nodes (a sketch of the resulting node metadata follows the label block). Then create another device set with deviceClass "deviceset-3", storage class "ibmc-vpc-block-metro-5iops-tier", and placement policies as well. In this case, the OSD pods get scheduled either across any 2 zones or all on a single worker, depending on the affinity condition.
##########################################
cluster.ocs.openshift.io/openshift-storage: ""
cluster.ocs.openshift.io/openshift-storage-device-class: deviceset-3
##########################################
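Applied to a node from that pool, these labels sit under metadata.labels. A minimal sketch of the resulting node metadata (the node name is taken from the pool listing below; only the relevant labels are shown):
##########################################
apiVersion: v1
kind: Node
metadata:
  name: 10.241.0.9
  labels:
    cluster.ocs.openshift.io/openshift-storage: ""
    cluster.ocs.openshift.io/openshift-storage-device-class: deviceset-3
##########################################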
  - config: {}
    count: 1
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 512Gi
        storageClassName: ibmc-vpc-block-metro-5iops-tier
        volumeMode: Block
      status: {}
    deviceClass: deviceset-3
    name: ocs-deviceset-3
    placement:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: cluster.ocs.openshift.io/openshift-storage-device-class
              operator: In
              values:
              - deviceset-3
##########################################
Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=deviceset-3
NAME            STATUS   ROLES           AGE     VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
10.241.0.9      Ready    master,worker   7h23m   v1.26.9+aa37255   10.241.0.9      10.241.0.9      Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.128.12   Ready    master,worker   7h23m   v1.26.9+aa37255   10.241.128.12   10.241.128.12   Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.64.11    Ready    master,worker   7h23m   v1.26.9+aa37255   10.241.64.11    10.241.64.11    Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
##########################################
rook-ceph-osd-6-6b456f7844-jp4x2                                  2/2     Running     0          6h14m   172.17.110.209   10.241.64.11    <none>           <none>
rook-ceph-osd-7-55b98ff548-v4rsh                                  2/2     Running     0          6h14m   172.17.110.212   10.241.64.11    <none>           <none>
rook-ceph-osd-8-b45474c5f-6vnqv                                   2/2     Running     0          6h13m   172.17.110.214   10.241.64.11    <none>           <none>
##########################################

Same steps as the previous scenario, but with a different storage class, "ibmc-vpc-block-metro-general-purpose". In this case, the OSD pods are all distributed across zones as expected.
##########################################
  - config: {}
    count: 1
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 512Gi
        storageClassName: ibmc-vpc-block-metro-general-purpose
        volumeMode: Block
      status: {}
    deviceClass: deviceset-4
    name: ocs-deviceset-4
    placement:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: cluster.ocs.openshift.io/openshift-storage-device-class
              operator: In
              values:
              - deviceset-4
    portable: true
    preparePlacement: {}
    replica: 3
    resources: {}
##########################################
Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=deviceset-4
NAME            STATUS   ROLES           AGE     VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
10.241.0.10     Ready    master,worker   7h57m   v1.26.9+aa37255   10.241.0.10     10.241.0.10     Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.128.13   Ready    master,worker   7h56m   v1.26.9+aa37255   10.241.128.13   10.241.128.13   Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.64.12    Ready    master,worker   7h57m   v1.26.9+aa37255   10.241.64.12    10.241.64.12    Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
##########################################
rook-ceph-osd-9-67dd868dc8-jhw4q                                  2/2     Running     0          4h56m   172.17.116.72    10.241.128.13   <none>           <none>
rook-ceph-osd-10-54d5b69df5-mvvzj                                 2/2     Running     0          4h56m   172.17.125.8     10.241.64.12    <none>           <none>
rook-ceph-osd-11-548ff94bdb-sp7cv                                 2/2     Running     0          4h56m   172.17.75.137    10.241.0.10     <none>           <none>
##########################################

Same steps as the previous scenario, with the same storage class, "ibmc-vpc-block-metro-general-purpose". In this case, the OSD pods are distributed unevenly, with 2 OSDs in the same zone.
##########################################
  - config: {}
    count: 1
    dataPVCTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 512Gi
        storageClassName: ibmc-vpc-block-metro-general-purpose
        volumeMode: Block
      status: {}
    deviceClass: deviceset-5
    name: ocs-deviceset-5
    placement:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: cluster.ocs.openshift.io/openshift-storage-device-class
              operator: In
              values:
              - deviceset-5
    portable: true
    preparePlacement: {}
    replica: 3
    resources: {}
##########################################
Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=deviceset-5
NAME            STATUS   ROLES           AGE     VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
10.241.0.11     Ready    master,worker   6h30m   v1.26.9+aa37255   10.241.0.11     10.241.0.11     Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.128.14   Ready    master,worker   6h30m   v1.26.9+aa37255   10.241.128.14   10.241.128.14   Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.64.13    Ready    master,worker   6h30m   v1.26.9+aa37255   10.241.64.13    10.241.64.13    Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
##########################################
rook-ceph-osd-12-6fc6c68645-cwdwz                                 2/2     Running     0          4h1m    172.17.91.201    10.241.64.13    <none>           <none>
rook-ceph-osd-13-6f6cb46d4f-55xsz                                 2/2     Running     0          4h1m    172.17.91.203    10.241.64.13    <none>           <none>
rook-ceph-osd-14-7988b69947-csrkl                                 2/2     Running     0          4h      172.17.103.72    10.241.0.11     <none>           <none>
##########################################


Version of all relevant components (if applicable):
Latest ODF 4.13 & 4.14

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
We are assessing the multiple-device-set feature of ODF on IBM Cloud for customers.


Is there any workaround available to the best of your knowledge?
If we include podAntiAffinity rules on the OSD prepare jobs, the OSD pods are scheduled as expected:
##############################################
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-8
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd
                  - rook-ceph-osd-prepare
              topologyKey: topology.kubernetes.io/zone
            weight: 100
##############################################

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3


Is this issue reproducible?
Yes, tried on 2 clusters in 2 different environments:
4.13 on the production environment
4.14 on the internal stage environment

Can this issue be reproduced from the UI?
NA


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
All scenarios are detailed in the description


Actual results:
OSD pods are spread across zones only when the device sets use different storage classes; with the same storage class the distribution is uneven.

Expected results:
OSD pods should be scheduled across nodes from different zones in all scenarios.


Additional info:
NA

Comment 2 T K Chandra Hasan 2023-12-11 16:37:35 UTC
I have collected the ODF must-gather but am not able to attach it here as it exceeds the size limit.
Is there another way to upload the logs?

Comment 6 Santosh Pillai 2023-12-14 07:31:51 UTC
(In reply to T K Chandra Hasan from comment #0)
> Description of problem (please be detailed as possible and provide log
> snippets):
> In an IBM Cloud ROKS cluster, we were validating the multiple-device-set
> feature and observed inconsistent OSD pod scheduling. We are following this
> article to create the device sets:
> 
> https://access.redhat.com/articles/6214381

If I'm reading it correctly, the article helps to segregate nodes based on the labels, and then customers can run workloads on those particular nodes. It does not say that the workload will be evenly distributed on the labeled nodes. Also, this article is from 4.8. We now use `topologySpreadConstraints` to evenly distribute the OSD prepare pods and OSD pods across nodes. You can refer to the `storageClassDeviceSets` in the `cephCluster` YAML below. Only ocs-deviceset (ocs-deviceset-0, ocs-deviceset-1, ocs-deviceset-2) seems to have the topologySpreadConstraints set.

```
  storage:
    storageClassDeviceSets:
    - count: 1
      name: ocs-deviceset-0
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: ""
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-10iops-tier
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-1
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: ""
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-10iops-tier
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-2
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: ""
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-10iops-tier
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-2-0
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: ""
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-5iops-tier
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-2-1
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: ""
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-5iops-tier
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-2-2
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage
                operator: Exists
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: ""
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-5iops-tier
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-3-0
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-3
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-3
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: deviceset-3
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-5iops-tier
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-3-1
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-3
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-3
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: deviceset-3
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-5iops-tier
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-3-2
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-3
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-3
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: deviceset-3
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-5iops-tier
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-4-0
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-4
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-4
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: deviceset-4
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-general-purpose
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-4-1
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-4
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-4
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: deviceset-4
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-general-purpose
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-4-2
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-4
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-4
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: deviceset-4
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-general-purpose
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-5-0
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-5
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-5
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: deviceset-5
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-general-purpose
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-5-1
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-5
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-5
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: deviceset-5
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-general-purpose
          volumeMode: Block
        status: {}
    - count: 1
      name: ocs-deviceset-5-2
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-5
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-5
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "2"
          memory: 5Gi
      tuneFastDeviceClass: true
      volumeClaimTemplates:
      - metadata:
          annotations:
            crushDeviceClass: deviceset-5
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-general-purpose
          volumeMode: Block
        status: {}

Comment 7 T K Chandra Hasan 2023-12-14 10:30:35 UTC
(In reply to Santosh Pillai from comment #6)
> (In reply to T K Chandra Hasan from comment #0)
> > Description of problem (please be detailed as possible and provide log
> > snippets):
> > In an IBM Cloud ROKS cluster, we were validating the multiple-device-set
> > feature and observed inconsistent OSD pod scheduling. We are following
> > this article to create the device sets:
> > 
> > https://access.redhat.com/articles/6214381
> 
> If I'm reading it correctly, the article helps to segregate nodes based on
> the labels, and then customers can run workloads on those particular nodes.
> It does not say that the workload will be evenly distributed on the labeled
> nodes. Also, this article is from 4.8. We now use `topologySpreadConstraints`
> to evenly distribute the OSD prepare pods and OSD pods across nodes. You can
> refer to the `storageClassDeviceSets` in the `cephCluster` YAML below. Only
> ocs-deviceset (ocs-deviceset-0, ocs-deviceset-1, ocs-deviceset-2) seems to
> have the topologySpreadConstraints set.

Thank you Santosh. I understand the article isn't up to date, but how do the OSD pods get distributed evenly when we use a different storage class name? All the worker nodes are newly configured with the same flavor. Could you provide a sample device set snippet with topologySpreadConstraints that I can try and check?

Comment 8 Santosh Pillai 2023-12-14 12:29:40 UTC
(In reply to T K Chandra Hasan from comment #7)
> (In reply to Santosh Pillai from comment #6)
> > (In reply to T K Chandra Hasan from comment #0)
> > > Description of problem (please be detailed as possible and provide log
> > > snippets):
> > > In an IBM Cloud ROKS cluster, we were validating the multiple-device-set
> > > feature and observed inconsistent OSD pod scheduling. We are following
> > > this article to create the device sets:
> > > 
> > > https://access.redhat.com/articles/6214381
> > 
> > If I'm reading it correctly, the article helps to segregate nodes based on
> > the labels, and then customers can run workloads on those particular nodes.
> > It does not say that the workload will be evenly distributed on the labeled
> > nodes. Also, this article is from 4.8. We now use `topologySpreadConstraints`
> > to evenly distribute the OSD prepare pods and OSD pods across nodes. You can
> > refer to the `storageClassDeviceSets` in the `cephCluster` YAML below. Only
> > ocs-deviceset (ocs-deviceset-0, ocs-deviceset-1, ocs-deviceset-2) seems to
> > have the topologySpreadConstraints set.
> 
> Thank you Santosh. I understand the article isn't up to date, but how do
> the OSD pods get distributed evenly when we use a different storage class
> name?

The article states that all the OSDs with deviceClass `set1` will be created on nodes with the `cluster.ocs.openshift.io/openshift-storage-device-class: set1` label.
The user can then create a block pool using this device class `set1`:

```
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: set1-pool
  namespace: openshift-storage
spec:
  deviceClass: set1
  # remaining pool parameters omitted, as in the article
```

And then create a StorageClass using this block pool name:
```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: set1-sc
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  pool: set1-pool
```
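
For completeness, a PVC that consumes this storage class would look roughly like the following (a minimal sketch; the claim name and namespace are illustrative):
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: set1-pvc        # illustrative name
  namespace: my-app     # illustrative namespace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: set1-sc
```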

So all the PVs using the `set1-sc` storageClass will have their data restricted to nodes with the label `cluster.ocs.openshift.io/openshift-storage-device-class: set1`.

So I don't think the article is suggesting that OSDs will be evenly distributed on each node, just that the workload will be spread across specific storage nodes.

> All the worker nodes are newly configured with the same flavor.
> Could you provide a sample device set snippet with
> topologySpreadConstraints that I can try and check?

Currently I don't have it. I'll get back to you.

Comment 9 T K Chandra Hasan 2023-12-15 07:08:10 UTC
Yesterday I spent some time going through the ocs-operator and Ceph operator code on GitHub, and I have the following concern:

https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/cephcluster.go#L750:L756

As per the above lines, the default placement is applied only when no placement parameters at all are specified in the device set, which includes topologySpreadConstraints as well. Ideally, if the user hasn't specified topologySpreadConstraints, the default should still be picked even though other parameters are specified. This would probably fix the pod scheduling issue in this case.

Comment 10 T K Chandra Hasan 2023-12-15 08:33:51 UTC
Tried the following topologySpreadConstraints, which work fine. It would be great if the operator handled this as well when they are not specified.
```
    - config: {}
      count: 1
      dataPVCTemplate:
        metadata: {}
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: ibmc-vpc-block-metro-10iops-tier
          volumeMode: Block
        status: {}
      deviceClass: deviceset-2
      name: ocs-deviceset-2
      placement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-2
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      portable: true
      preparePlacement:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cluster.ocs.openshift.io/openshift-storage-device-class
                operator: In
                values:
                - deviceset-2
        tolerations:
        - effect: NoSchedule
          key: node.ocs.openshift.io/storage
          operator: Equal
          value: "true"
        topologySpreadConstraints:
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchExpressions:
            - key: ceph.rook.io/pvc
              operator: Exists
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      replica: 3
      resources: {}
```

Comment 11 Santosh Pillai 2023-12-18 04:07:20 UTC
Good to know it's working, based on comment #10.

(In reply to T K Chandra Hasan from comment #9)
> Yesterday I spent some time going through the ocs-operator and Ceph
> operator code on GitHub, and I have the following concern:
> 
> https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/
> storagecluster/cephcluster.go#L750:L756
> 
> As per the above lines, the default placement is applied only when no
> placement parameters at all are specified in the device set, which includes
> topologySpreadConstraints as well. Ideally, if the user hasn't specified
> topologySpreadConstraints, the default should still be picked even though
> other parameters are specified. This would probably fix the pod scheduling
> issue in this case.

TopologySpreadConstraints (supportTSC) will always be enabled based on your current k8s server version.

So if you do not provide anything in deviceSet.Placement (that is, `ds.Placement.NodeAffinity`, `ds.Placement.PodAffinity`, `ds.Placement.PodAntiAffinity`, and `ds.Placement.TopologySpreadConstraints` are all nil), then the OCS operator will use the defaults mentioned here: https://github.com/red-hat-storage/ocs-operator/blob/442ac957f5606c46c6f1c8401eb22b4e57e65ef0/controllers/defaults/placements.go#L58-L73.
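
For reference, the effective default OSD placement (as visible on ocs-deviceset-0 in the CephCluster YAML from comment #6) looks roughly like this:

```
placement:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cluster.ocs.openshift.io/openshift-storage
          operator: Exists
  tolerations:
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    operator: Equal
    value: "true"
  topologySpreadConstraints:
  - labelSelector:
      matchExpressions:
      - key: ceph.rook.io/pvc
        operator: Exists
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
```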

But since you have already provided placement.NodeAffinity (for example in deviceset-3), the operator is not using the defaults and is using what you have provided in the CR.

So IMO the operator is working as expected. You should edit the StorageCluster device set CR like you did in comment #10, providing both the nodeAffinity (to restrict the OSDs to nodes based on the `openshift-storage-device-class` labels) and the topologySpreadConstraints (to ensure the OSDs are evenly distributed among those nodes).

Comment 12 Santosh Pillai 2023-12-19 07:27:08 UTC
Moving it to the ODF operator team for analysis as they control the creation of storageClassDeviceSets and CephCluster CR.

Comment 13 Nitin Goyal 2023-12-19 09:45:41 UTC
Closing it as per comment #10.

Comment 14 T K Chandra Hasan 2023-12-20 11:16:16 UTC
This configuration setup needs to be documented before closing this bug. I had a hard time understanding the behavior even after going through the code.

Comment 15 Malay Kumar parida 2024-01-18 08:06:08 UTC
Hi Travis, can you please take a look at this BZ, specifically https://bugzilla.redhat.com/show_bug.cgi?id=2254035#c11, and give your views on whether we should always enforce the TSC even when some placement specs are present, or whether we need a doc mention instead?

Comment 16 Travis Nielsen 2024-01-18 21:52:50 UTC
(In reply to Malay Kumar parida from comment #15)
> Hi Travis, can you please take a look at this BZ, specifically
> https://bugzilla.redhat.com/show_bug.cgi?id=2254035#c11, and give your
> views on whether we should always enforce the TSC even when some placement
> specs are present, or whether we need a doc mention instead?

Yes, let's always enforce the TSCs even if the placement is specified by the user. Without the TSCs, the OSDs will rarely be balanced.

Comment 17 Malay Kumar parida 2024-01-22 05:52:10 UTC
Not a blocker for 4.15, moving out to 4.16

