Description of problem (please be as detailed as possible and provide log snippets):

On an IBM Cloud ROKS cluster, we were validating the multiple device set feature and are observing inconsistent OSD pod scheduling. We are following this article to create device sets:

https://access.redhat.com/articles/6214381

This issue has been observed on both 4.13 and 4.14 ROKS clusters with the respective latest version of ODF.

We created a ROKS cluster with 3 workers of flavor 16x64G from IBM Cloud and, after the cluster creation, installed our add-on to deploy ODF. By default this creates a single device set named "ocs-deviceset" with the "ibmc-vpc-block-metro-10iops-tier" storage class, and all the OSD pods are evenly spread across the available workers.

##########################################
- config: {}
  count: 1
  dataPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 512Gi
      storageClassName: ibmc-vpc-block-metro-10iops-tier
      volumeMode: Block
    status: {}
  name: ocs-deviceset
  placement: {}
  portable: true
  preparePlacement: {}
  replica: 3
  resources: {}
##########################################
Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=default
NAME           STATUS   ROLES           AGE   VERSION           INTERNAL-IP    EXTERNAL-IP    OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
10.241.0.7     Ready    master,worker   8h    v1.26.9+aa37255   10.241.0.7     10.241.0.7     Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.128.7   Ready    master,worker   8h    v1.26.9+aa37255   10.241.128.7   10.241.128.7   Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.64.6    Ready    master,worker   8h    v1.26.9+aa37255   10.241.64.6    10.241.64.6    Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
##########################################
rook-ceph-osd-0-794885b46f-c2dx8   2/2   Running   0   7h39m   172.17.89.250   10.241.64.6    <none>   <none>
rook-ceph-osd-1-8699d65d57-88z2g   2/2   Running   0   7h39m   172.17.66.223   10.241.128.7   <none>   <none>
rook-ceph-osd-2-6b48c9b99-k8tb6    2/2   Running   0   7h38m   172.17.68.230   10.241.0.7     <none>   <none>
##########################################

Let's add another device set by editing the StorageCluster CR as per the above article, except that the deviceClass parameter is not added, using the "ibmc-vpc-block-metro-5iops-tier" storage class. In this case, the OSD pods are scheduled on the nodes listed above and are spread across the zones.

##########################################
- config: {}
  count: 1
  dataPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 512Gi
      storageClassName: ibmc-vpc-block-metro-5iops-tier
      volumeMode: Block
    status: {}
  name: ocs-deviceset-2
  placement: {}
  portable: true
  preparePlacement: {}
  replica: 3
  resources: {}
##########################################
rook-ceph-osd-3-549df4f77d-l7w5s   2/2   Running   0   7h8m   172.17.89.249   10.241.64.6    <none>   <none>
rook-ceph-osd-4-56464899-qk2bl     2/2   Running   0   7h8m   172.17.66.232   10.241.128.7   <none>   <none>
rook-ceph-osd-5-7bb8c4b8c4-zszfr   2/2   Running   0   7h7m   172.17.68.238   10.241.0.7     <none>   <none>
##########################################

Now create a worker pool of 3 workers named "deviceset-3" from the IBM Cloud UI and add the labels below to its nodes. Then create another device set with deviceClass "deviceset-3", storage class "ibmc-vpc-block-metro-5iops-tier", and placement policies as well.
In this case, the OSD pods either get scheduled across only 2 of the zones or all on a single worker, depending on the affinity condition.

##########################################
cluster.ocs.openshift.io/openshift-storage: ""
cluster.ocs.openshift.io/openshift-storage-device-class: deviceset-3
##########################################
- config: {}
  count: 1
  dataPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 512Gi
      storageClassName: ibmc-vpc-block-metro-5iops-tier
      volumeMode: Block
    status: {}
  deviceClass: deviceset-3
  name: ocs-deviceset-3
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cluster.ocs.openshift.io/openshift-storage-device-class
            operator: In
            values:
            - deviceset-3
##########################################
Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=deviceset-3
NAME            STATUS   ROLES           AGE     VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
10.241.0.9      Ready    master,worker   7h23m   v1.26.9+aa37255   10.241.0.9      10.241.0.9      Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.128.12   Ready    master,worker   7h23m   v1.26.9+aa37255   10.241.128.12   10.241.128.12   Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.64.11    Ready    master,worker   7h23m   v1.26.9+aa37255   10.241.64.11    10.241.64.11    Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
##########################################
rook-ceph-osd-6-6b456f7844-jp4x2   2/2   Running   0   6h14m   172.17.110.209   10.241.64.11   <none>   <none>
rook-ceph-osd-7-55b98ff548-v4rsh   2/2   Running   0   6h14m   172.17.110.212   10.241.64.11   <none>   <none>
rook-ceph-osd-8-b45474c5f-6vnqv    2/2   Running   0   6h13m   172.17.110.214   10.241.64.11   <none>   <none>
##########################################

Same steps as the previous scenario, but with a different storage class, "ibmc-vpc-block-metro-general-purpose".
In this case, the OSD pods are all distributed across zones as expected.

##########################################
- config: {}
  count: 1
  dataPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 512Gi
      storageClassName: ibmc-vpc-block-metro-general-purpose
      volumeMode: Block
    status: {}
  deviceClass: deviceset-4
  name: ocs-deviceset-4
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cluster.ocs.openshift.io/openshift-storage-device-class
            operator: In
            values:
            - deviceset-4
  portable: true
  preparePlacement: {}
  replica: 3
  resources: {}
##########################################
Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=deviceset-4
NAME            STATUS   ROLES           AGE     VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
10.241.0.10     Ready    master,worker   7h57m   v1.26.9+aa37255   10.241.0.10     10.241.0.10     Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.128.13   Ready    master,worker   7h56m   v1.26.9+aa37255   10.241.128.13   10.241.128.13   Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.64.12    Ready    master,worker   7h57m   v1.26.9+aa37255   10.241.64.12    10.241.64.12    Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
##########################################
rook-ceph-osd-9-67dd868dc8-jhw4q    2/2   Running   0   4h56m   172.17.116.72   10.241.128.13   <none>   <none>
rook-ceph-osd-10-54d5b69df5-mvvzj   2/2   Running   0   4h56m   172.17.125.8    10.241.64.12    <none>   <none>
rook-ceph-osd-11-548ff94bdb-sp7cv   2/2   Running   0   4h56m   172.17.75.137   10.241.0.10     <none>   <none>
##########################################

Same steps as the previous scenario, with the same storage class, "ibmc-vpc-block-metro-general-purpose". In this case, the OSD pods are distributed unevenly, with 2 OSDs in the same zone.
##########################################
- config: {}
  count: 1
  dataPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 512Gi
      storageClassName: ibmc-vpc-block-metro-general-purpose
      volumeMode: Block
    status: {}
  deviceClass: deviceset-5
  name: ocs-deviceset-5
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cluster.ocs.openshift.io/openshift-storage-device-class
            operator: In
            values:
            - deviceset-5
  portable: true
  preparePlacement: {}
  replica: 3
  resources: {}
##########################################
Work> oc get no -owide -l ibm-cloud.kubernetes.io/worker-pool-name=deviceset-5
NAME            STATUS   ROLES           AGE     VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
10.241.0.11     Ready    master,worker   6h30m   v1.26.9+aa37255   10.241.0.11     10.241.0.11     Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.128.14   Ready    master,worker   6h30m   v1.26.9+aa37255   10.241.128.14   10.241.128.14   Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
10.241.64.13    Ready    master,worker   6h30m   v1.26.9+aa37255   10.241.64.13    10.241.64.13    Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.27.1.el8_8.x86_64   cri-o://1.26.4-5.1.rhaos4.13.git969e013.el8
##########################################
rook-ceph-osd-12-6fc6c68645-cwdwz   2/2   Running   0   4h1m   172.17.91.201   10.241.64.13   <none>   <none>
rook-ceph-osd-13-6f6cb46d4f-55xsz   2/2   Running   0   4h1m   172.17.91.203   10.241.64.13   <none>   <none>
rook-ceph-osd-14-7988b69947-csrkl   2/2   Running   0   4h     172.17.103.72   10.241.0.11    <none>   <none>
##########################################

Version of all relevant components (if applicable):
Latest ODF 4.13 & 4.14

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
We are assessing the multiple device set feature for customers of ODF on IBM Cloud.

Is there any workaround available to the best of your knowledge?
If we include podAntiAffinity rules on the OSD prepare jobs, the OSD pods are scheduled as expected (a sketch showing where this stanza fits in a full device set follows at the end of this comment):

##############################################
placement:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cluster.ocs.openshift.io/openshift-storage-device-class
          operator: In
          values:
          - deviceset-8
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - rook-ceph-osd
            - rook-ceph-osd-prepare
        topologyKey: topology.kubernetes.io/zone
      weight: 100
##############################################

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue be reproduced?
Yes, tried in 2 clusters in 2 different environments:
4.13 on a production environment
4.14 on an internal staging environment

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
All scenarios are detailed in the description.

Actual results:
The OSD pods are spread across zones only when the device sets use different storage classes; otherwise they can land in the same zone or even on the same node.

Expected results:
OSD pods should be scheduled across nodes in different zones.

Additional info:
NA
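For reference, below is a sketch of how the workaround stanza above could sit inside a complete device set entry in the StorageCluster CR. The device set name ("ocs-deviceset-8"), the deviceClass/label value ("deviceset-8") and the storage class are illustrative assumptions following the pattern of the snippets above, not values taken from this cluster:

```
# Hypothetical device set combining the node affinity on the device-class label
# with the podAntiAffinity workaround, so OSD and OSD-prepare pods prefer to
# spread across zones. The same stanza may also be needed under preparePlacement
# if the prepare jobs should follow the same rule.
- config: {}
  count: 1
  dataPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 512Gi
      storageClassName: ibmc-vpc-block-metro-10iops-tier  # assumed storage class
      volumeMode: Block
    status: {}
  deviceClass: deviceset-8
  name: ocs-deviceset-8
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cluster.ocs.openshift.io/openshift-storage-device-class
            operator: In
            values:
            - deviceset-8
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-osd
              - rook-ceph-osd-prepare
          topologyKey: topology.kubernetes.io/zone
        weight: 100
  portable: true
  replica: 3
  resources: {}
```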
I have collected the ODF must-gather, but I am not able to attach it here as it exceeds the size limit. Is there another way to upload the logs?
(In reply to T K Chandra Hasan from comment #0) > Description of problem (please be detailed as possible and provide log > snippests): > In IBMCloud ROKS cluster, we were validating the multiple deviceset features > and are observing inconsistency in OSD pod scheduling. We are following this > article to create devicesets > > https://access.redhat.com/articles/6214381 If I'm reading it correctly, the article helps to segregate nodes based on the labels and then customers can run workloads on those particular nodes. It does not mention that workload will be evenly distributed on the labeled nodes. Also this article is from 4.8. We now use `topologySpreadConstraints` to evenly distribute the OSD prepare pod and OSD pods across nodes. You can refer the `storageClassDeviceSets` in the `cephCluster` yaml below. Only ocs-deviceset (ocs-deviceset-0, ocs-deviceset-1, ocs-deviceset-2) seems to have the topologyspreadConstraint set. ``` storage: storageClassDeviceSets: - count: 1 name: ocs-deviceset-0 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: "" spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-10iops-tier volumeMode: Block status: {} - count: 1 name: ocs-deviceset-1 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: 
ScheduleAnyway resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: "" spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-10iops-tier volumeMode: Block status: {} - count: 1 name: ocs-deviceset-2 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: "" spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-10iops-tier volumeMode: Block status: {} - count: 1 name: ocs-deviceset-2-0 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: "" spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-5iops-tier volumeMode: Block status: {} - count: 1 name: ocs-deviceset-2-1 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: 
"true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: "" spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-5iops-tier volumeMode: Block status: {} - count: 1 name: ocs-deviceset-2-2 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage operator: Exists tolerations: - effect: NoSchedule key: node.ocs.openshift.io/storage operator: Equal value: "true" topologySpreadConstraints: - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule - labelSelector: matchExpressions: - key: ceph.rook.io/pvc operator: Exists maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: ScheduleAnyway resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: "" spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-5iops-tier volumeMode: Block status: {} - count: 1 name: ocs-deviceset-3-0 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-3 portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-3 resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: deviceset-3 spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-5iops-tier volumeMode: Block status: {} - count: 1 name: ocs-deviceset-3-1 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - 
matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-3 portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-3 resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: deviceset-3 spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-5iops-tier volumeMode: Block status: {} - count: 1 name: ocs-deviceset-3-2 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-3 portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-3 resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: deviceset-3 spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-5iops-tier volumeMode: Block status: {} - count: 1 name: ocs-deviceset-4-0 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-4 portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-4 resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: deviceset-4 spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-general-purpose volumeMode: Block status: {} - count: 1 name: ocs-deviceset-4-1 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-4 portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-4 resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: deviceset-4 spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-general-purpose volumeMode: Block status: {} - count: 1 name: ocs-deviceset-4-2 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-4 portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - 
deviceset-4 resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: deviceset-4 spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-general-purpose volumeMode: Block status: {} - count: 1 name: ocs-deviceset-5-0 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-5 portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-5 resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: deviceset-5 spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-general-purpose volumeMode: Block status: {} - count: 1 name: ocs-deviceset-5-1 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-5 portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-5 resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: deviceset-5 spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-general-purpose volumeMode: Block status: {} - count: 1 name: ocs-deviceset-5-2 placement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-5 portable: true preparePlacement: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: cluster.ocs.openshift.io/openshift-storage-device-class operator: In values: - deviceset-5 resources: limits: cpu: "2" memory: 5Gi requests: cpu: "2" memory: 5Gi tuneFastDeviceClass: true volumeClaimTemplates: - metadata: annotations: crushDeviceClass: deviceset-5 spec: accessModes: - ReadWriteOnce resources: requests: storage: 512Gi storageClassName: ibmc-vpc-block-metro-general-purpose volumeMode: Block status: {}
(In reply to Santosh Pillai from comment #6)
> (In reply to T K Chandra Hasan from comment #0)
> > Description of problem (please be as detailed as possible and provide log snippets):
> > On an IBM Cloud ROKS cluster, we were validating the multiple device set feature
> > and are observing inconsistent OSD pod scheduling. We are following this
> > article to create device sets:
> >
> > https://access.redhat.com/articles/6214381
>
> If I'm reading it correctly, the article helps to segregate nodes based on
> the labels and then customers can run workloads on those particular nodes.
> It does not mention that the workload will be evenly distributed on the labeled
> nodes. Also, this article is from 4.8. We now use `topologySpreadConstraints`
> to evenly distribute the OSD prepare pods and OSD pods across nodes. You can
> refer to the `storageClassDeviceSets` in the `cephCluster` YAML below. Only
> ocs-deviceset (ocs-deviceset-0, ocs-deviceset-1, ocs-deviceset-2) seems to
> have the topologySpreadConstraints set.

Thank you Santosh. I understand the article isn't up to date, but how do the OSD pods get distributed evenly when we use a different storage class name? All the worker nodes are newly configured with the same flavor. Could you provide a sample device set snippet with topologySpreadConstraints that I can try out?
(In reply to T K Chandra Hasan from comment #7)
> (In reply to Santosh Pillai from comment #6)
> > (In reply to T K Chandra Hasan from comment #0)
> > > Description of problem (please be as detailed as possible and provide log snippets):
> > > On an IBM Cloud ROKS cluster, we were validating the multiple device set feature
> > > and are observing inconsistent OSD pod scheduling. We are following this
> > > article to create device sets:
> > >
> > > https://access.redhat.com/articles/6214381
> >
> > If I'm reading it correctly, the article helps to segregate nodes based on
> > the labels and then customers can run workloads on those particular nodes.
> > It does not mention that the workload will be evenly distributed on the labeled
> > nodes. Also, this article is from 4.8. We now use `topologySpreadConstraints`
> > to evenly distribute the OSD prepare pods and OSD pods across nodes. You can
> > refer to the `storageClassDeviceSets` in the `cephCluster` YAML below. Only
> > ocs-deviceset (ocs-deviceset-0, ocs-deviceset-1, ocs-deviceset-2) seems to
> > have the topologySpreadConstraints set.
>
> Thank you Santosh. I understand the article isn't up to date, but how do the
> OSD pods get distributed evenly when we use a different storage class name?

The article states that all the OSDs with deviceClass `set1` will be created on nodes carrying the `cluster.ocs.openshift.io/openshift-storage-device-class: set1` label. The user can then create a block pool using this deviceClass `set1`:

```
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: set1-pool
  namespace: openshift-storage
spec:
  deviceClass: set1
  parameters:
```

And then create a StorageClass using this block pool name:

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: set1-sc
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  pool: set1-pool
```

So all the PVs using the `set1-sc` StorageClass will have their data restricted to nodes labeled `cluster.ocs.openshift.io/openshift-storage-device-class: set1`. I don't think the article is suggesting that the OSDs will be evenly distributed across the nodes, only that the workload will be placed on those specific storage nodes.

> All the worker nodes are newly configured with the same flavor.
> Could you provide a sample device set snippet with topologySpreadConstraints
> that I can try out?

Currently I don't have it. I'll get back to you.
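For completeness, here is a minimal sketch of how a workload would consume this chain, assuming the `set1-sc` StorageClass above has been created; the PVC name, namespace and size are illustrative:

```
# Hypothetical PVC bound to the set1-sc StorageClass, so its data is served
# only by OSDs whose deviceClass is set1 (i.e. the labeled nodes).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: set1-app-data            # illustrative name
  namespace: my-app-namespace    # any namespace where the application runs
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: set1-sc
  resources:
    requests:
      storage: 10Gi              # illustrative size
```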
Yesterday I spent some time going through the ocs-operator and ceph operator code on GitHub and have the following concern:

https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/cephcluster.go#L750:L756

As per the above lines, the flag that enables the default placement is set only when no placement parameters are specified in the device set, which includes topologySpreadConstraints as well. Ideally, if the user hasn't specified topologySpreadConstraints, the defaults should still be picked up even when other placement parameters are specified. This would probably fix the pod scheduling issue in this case.
Tried the following TSC, which works fine. It would be great if the operator handled this as well when it is not specified.

```
- config: {}
  count: 1
  dataPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 512Gi
      storageClassName: ibmc-vpc-block-metro-10iops-tier
      volumeMode: Block
    status: {}
  deviceClass: deviceset-2
  name: ocs-deviceset-2
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cluster.ocs.openshift.io/openshift-storage-device-class
            operator: In
            values:
            - deviceset-2
    tolerations:
    - effect: NoSchedule
      key: node.ocs.openshift.io/storage
      operator: Equal
      value: "true"
    topologySpreadConstraints:
    - labelSelector:
        matchExpressions:
        - key: ceph.rook.io/pvc
          operator: Exists
      maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
  portable: true
  preparePlacement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cluster.ocs.openshift.io/openshift-storage-device-class
            operator: In
            values:
            - deviceset-2
    tolerations:
    - effect: NoSchedule
      key: node.ocs.openshift.io/storage
      operator: Equal
      value: "true"
    topologySpreadConstraints:
    - labelSelector:
        matchExpressions:
        - key: ceph.rook.io/pvc
          operator: Exists
      maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
    - labelSelector:
        matchExpressions:
        - key: ceph.rook.io/pvc
          operator: Exists
      maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
  replica: 3
  resources: {}
```
Good to know it's working, based on comment #10.

(In reply to T K Chandra Hasan from comment #9)
> Yesterday I spent some time going through the ocs-operator and ceph operator
> code on GitHub and have the following concern:
>
> https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/cephcluster.go#L750:L756
>
> As per the above lines, the flag that enables the default placement is set
> only when no placement parameters are specified in the device set, which
> includes topologySpreadConstraints as well. Ideally, if the user hasn't
> specified topologySpreadConstraints, the defaults should still be picked up
> even when other placement parameters are specified. This would probably fix
> the pod scheduling issue in this case.

TopologySpreadConstraints support (supportTSC) is always enabled based on your current Kubernetes server version. If you do not provide anything in the device set's Placement, that is, `ds.Placement.NodeAffinity`, `ds.Placement.PodAffinity`, `ds.Placement.PodAntiAffinity` and `ds.Placement.TopologySpreadConstraints` are all nil, then the OCS operator will use the defaults mentioned here: https://github.com/red-hat-storage/ocs-operator/blob/442ac957f5606c46c6f1c8401eb22b4e57e65ef0/controllers/defaults/placements.go#L58-L73. But since you have already provided placement.NodeAffinity (for example in deviceset-3), the operator does not apply the defaults and uses what you have provided in the CR.

So IMO the operator is working as expected. You should edit the StorageCluster device set as you did in comment #10, providing both the nodeAffinity (to restrict the OSDs to nodes with the `openshift-storage-device-class` label) and the topologySpreadConstraints (to ensure the OSDs are equally distributed among those nodes).
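For reference, this is roughly what those defaults expand to when a device set's placement and preparePlacement are left empty, as reflected in the generated CephCluster CR in comment #6. It is a sketch for orientation, not the authoritative definition; see the linked placements.go for the exact values:

```
# Approximate defaults applied by ocs-operator when the device set placement is
# empty (the generated CR also adds a nodeAffinity on the
# cluster.ocs.openshift.io/openshift-storage label, omitted here for brevity).
placement:
  tolerations:
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    operator: Equal
    value: "true"
  topologySpreadConstraints:
  - labelSelector:
      matchExpressions:
      - key: ceph.rook.io/pvc
        operator: Exists
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
preparePlacement:
  tolerations:
  - effect: NoSchedule
    key: node.ocs.openshift.io/storage
    operator: Equal
    value: "true"
  topologySpreadConstraints:
  - labelSelector:
      matchExpressions:
      - key: ceph.rook.io/pvc
        operator: Exists
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
  - labelSelector:
      matchExpressions:
      - key: ceph.rook.io/pvc
        operator: Exists
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
```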
Moving it to the ODF operator team for analysis as they control the creation of storageClassDeviceSets and CephCluster CR.
Closing it as per comment #10.
This configuration setup needs to be documented before closing this bug. I had a hard time understanding the behavior after going through the code.
Hi Travis, can you please take a look at this BZ, specifically https://bugzilla.redhat.com/show_bug.cgi?id=2254035#c11, and give your view on whether we should always enforce the TSC even when some placement specs are present, or whether a mention in the docs is enough?
(In reply to Malay Kumar parida from comment #15)
> Hi Travis, can you please take a look at this BZ, specifically
> https://bugzilla.redhat.com/show_bug.cgi?id=2254035#c11, and give your view
> on whether we should always enforce the TSC even when some placement specs
> are present, or whether a mention in the docs is enough?

Yes, let's always enforce the TSCs even if the placement is specified by the user. Without the TSCs, the OSDs will rarely be balanced.
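To illustrate the intended outcome, below is a sketch of what the generated placement for a device set such as ocs-deviceset-3 could look like if the operator appended the default topologySpreadConstraints to a user-supplied nodeAffinity. This is an assumption about the proposed behavior, not the operator's current output:

```
# Hypothetical merged placement: user-provided nodeAffinity on the
# device-class label plus the default hostname topologySpreadConstraint
# appended by the operator.
placement:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cluster.ocs.openshift.io/openshift-storage-device-class
          operator: In
          values:
          - deviceset-3
  topologySpreadConstraints:
  - labelSelector:
      matchExpressions:
      - key: ceph.rook.io/pvc
        operator: Exists
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
```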
Not a blocker for 4.15, moving out to 4.16