Description of problem (please be as detailed as possible and provide log snippets):

MDS pod scheduling is blocked on hybrid clusters with two CephFS instances. Example:

$ omc get pods | grep mds
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58456bbfsstst   0/2   Pending   0   2d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-86596766ksndh   2/2   Running   0   2d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-gcp-a-79652gmdj   2/2   Running   0   2d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-gcp-b-64fcz527z   2/2   Running   0   2d

$ omc get pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58456bbfsstst -o json | jq '.status.conditions[].message'
"0/32 nodes are available: 14 node(s) had untolerated taint {XXXXX.XX.XXX/gcp-poc: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 7 node(s) didn't match pod anti-affinity rules, 8 node(s) had untolerated taint {node.kubernetes.io/role: XXXXXXXXX}. preemption: 0/32 nodes are available: 25 Preemption is not helpful for scheduling, 7 node(s) didn't match pod anti-affinity rules."

This appears to be the result of a rack-based failure domain with only 3 racks available in the Ceph CRUSH map and Rook topology.

Version of all relevant components (if applicable):

$ omc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.13   True        False         6h18m   Cluster version is 4.14.13

$ omc get csv
NAME                                    DISPLAY                       VERSION        REPLACES                                PHASE
loki-operator.v5.6.16                   Loki Operator                 5.6.16         loki-operator.v5.6.11                   Succeeded
mcg-operator.v4.14.6-rhodf              NooBaa Operator               4.14.6-rhodf   mcg-operator.v4.14.5-rhodf              Succeeded
network-observability-operator.v1.3.0   Network Observability         1.3.0          network-observability-operator.v1.2.0   Succeeded
node-maintenance-operator.v5.3.0        Node Maintenance Operator     5.3.0          node-maintenance-operator.v5.2.0        Succeeded
ocs-operator.v4.14.6-rhodf              OpenShift Container Storage   4.14.6-rhodf   ocs-operator.v4.14.5-rhodf              Succeeded
odf-csi-addons-operator.v4.14.6-rhodf   CSI Addons                    4.14.6-rhodf   odf-csi-addons-operator.v4.14.5-rhodf   Succeeded
odf-operator.v4.14.6-rhodf              OpenShift Data Foundation     4.14.6-rhodf   odf-operator.v4.14.5-rhodf              Succeeded
vault-secrets-operator.v0.5.1           Vault Secrets Operator        0.5.1          vault-secrets-operator.v0.5.0           Succeeded

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Causes potential CephFS availability issues during a possible MDS failover.

Is there any workaround available to the best of your knowledge?

Unknown at this time. Adding a fourth rack to the topology seems like a possible fix.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes.

Can this issue reproduce from the UI?

N/A

If this is a regression, please provide more details to justify this:

Unknown

Steps to Reproduce:
1. Set up a hybrid cluster with multiple providers and networks.
2. Host the ODF workload across nodes on two or more of the providers.
3. Verify only 3 racks exist across all nodes.
4. Create a second Ceph filesystem.

Actual results:

Only 3 of the 4 MDS pods can find a node to schedule on, despite numerous nodes being available.

Expected results:

Not totally sure what the expected behavior should be. I'd likely expect that 2 racks could share all 4 MDS pods as long as two active MDS pods don't fall into a common failure domain.

Additional info:
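One way to check the rack spread mentioned in step 3, assuming the storage nodes carry the cluster.ocs.openshift.io/openshift-storage label (they do in the node label dumps later in this bug), is to list them with the topology labels Rook consumes:

# List ODF storage nodes with their rack/zone labels to confirm only three racks exist.
$ oc get nodes -l cluster.ocs.openshift.io/openshift-storage \
    -L topology.rook.io/rack -L topology.kubernetes.io/zone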
The customer applied the change with mixed results. See below:

I've applied it to the pre-prod cluster and it does resolve the problem, in that the secondary MDS is running, but it ended up running on the same node:

rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-677655dcqtjhk   2/2   Running   0   8m55s   11.18.17.136   k8sbm-1494886.ny.fw.gs.com     <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-599cbd6dv44qg   2/2   Running   0   4m10s   11.18.17.137   k8sbm-1494886.ny.fw.gs.com     <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-gcp-a-74f97w9vz   2/2   Running   0   30d     11.18.92.46    d158815-gcp-k8s27.sky.gs.com   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-gcp-b-6579524gf   2/2   Running   0   30d     11.18.94.52    d158815-gcp-k8s29.sky.gs.com   <none>   <none>

Obviously this isn't much redundancy, and it seemed like the anti-affinity didn't actually work. Deleting one pod and letting it restart elsewhere resulted in a different node, which is good:

~> kca-1018 -n openshift-storage delete pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-599cbd6dv44qg
pod "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-599cbd6dv44qg" deleted

~> kc-1018 -n openshift-storage get pods -o wide | grep mds
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-677655dcqtjhk   2/2   Running   0   11m   11.18.17.136   k8sbm-1494886.ny.fw.gs.com     <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-599cbd6d7rqwq   2/2   Running   0   19s   11.18.13.186   k8sbm-1494920.ny.fw.gs.com     <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-gcp-a-74f97w9vz   2/2   Running   0   30d   11.18.92.46    d158815-gcp-k8s27.sky.gs.com   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-gcp-b-6579524gf   2/2   Running   0   30d   11.18.94.52    d158815-gcp-k8s29.sky.gs.com   <none>   <none>

But they're in the same rack, so an upgrade or machine config change will hit both:

~> kc-get-node-labels 1018 k8sbm-1494886.ny.fw.gs.com
─────────────────────────────────────────────────┬────────────────────────────
admin.gs.com/firewall-policy                      │
admin.gs.com/legacy-hostname                      │
beta.kubernetes.io/arch                           │ amd64
beta.kubernetes.io/os                             │ linux
cluster.ocs.openshift.io/openshift-storage        │
firewall.gs.com/all                               │ true
gs.com/location_Building                          │ 1300FED
gs.com/requires-node-check                        │ 0
kubernetes.io/arch                                │ amd64
kubernetes.io/hostname                            │ k8sbm-1494886.ny.fw.gs.com
kubernetes.io/os                                  │ linux
node-role.kubernetes.io/worker                    │
node.kubernetes.io/ingress                        │ contour-auth
node.openshift.io/os_id                           │ rhcos
policy.gs.com/train-enabled                       │
remediation.medik8s.io/exclude-from-remediation   │ true
topology.kubernetes.io/zone                       │ compute0
topology.rook.io/rack                             │ rack0
─────────────────────────────────────────────────┴────────────────────────────

~> kc-get-node-labels 1018 k8sbm-1494920.ny.fw.gs.com
─────────────────────────────────────────────────┬────────────────────────────
admin.gs.com/firewall-policy                      │
admin.gs.com/legacy-hostname                      │
beta.kubernetes.io/arch                           │ amd64
beta.kubernetes.io/os                             │ linux
cluster.ocs.openshift.io/openshift-storage        │
firewall.gs.com/all                               │ true
gs.com/location_Building                          │ 1300FED
gs.com/requires-node-check                        │ 0
gs.com/vip-hostname                               │ k8sbm-1494920.ny.fw.gs.com
kubernetes.io/arch                                │ amd64
kubernetes.io/hostname                            │ k8sbm-1494920.ny.fw.gs.com
kubernetes.io/os                                  │ linux
node-role.kubernetes.io/worker                    │
node.kubernetes.io/ingress                        │ contour
node.kubernetes.io/role                           │ ingress
node.openshift.io/os_id                           │ rhcos
policy.gs.com/train-enabled                       │
remediation.medik8s.io/exclude-from-remediation   │ true
topology.kubernetes.io/zone                       │ compute0
topology.rook.io/rack                             │ rack0
─────────────────────────────────────────────────┴────────────────────────────

Which is less than ideal.
OK, what we really want is required anti-affinity, but only between the two MDS instances of the same filesystem. To be clear, you have two CephFS instances, correct? This would show two instances:

oc get cephfilesystem

So you need to define the anti-affinity differently for each of those two instances. The placement for the first instance would be controlled by the StorageCluster. But for the second CephFS instance, you created the CephFilesystem CR directly, right? In that case, please try:

- Edit the StorageCluster placement to use required anti-affinity for the label "app.kubernetes.io/part-of=ocs-storagecluster-cephfilesystem"
- Add placement to the 2nd CephFilesystem CR (since it's not controlled by the StorageCluster CR) to use required anti-affinity for the label "app.kubernetes.io/part-of=ocs-storagecluster-cephfilesystem-gcp"
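For reference, a minimal sketch of what that could look like on the second CephFilesystem CR, assuming the MDS pods carry the app.kubernetes.io/part-of label quoted above and that node-level separation is the goal (swap the topologyKey for topology.rook.io/rack if rack-level separation is wanted). Only the placement stanza is the point here; keep whatever the existing CR already sets elsewhere:

```
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: ocs-storagecluster-cephfilesystem-gcp
  namespace: openshift-storage
spec:
  metadataServer:
    # Keep the existing activeCount/activeStandby values; shown here only so
    # the snippet is a complete metadataServer block.
    activeCount: 1
    activeStandby: true
    placement:
      podAntiAffinity:
        # Hard requirement: the two MDS pods of this filesystem must never
        # land in the same topology domain (here, the same node).
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/part-of
              operator: In
              values:
              - ocs-storagecluster-cephfilesystem-gcp
          topologyKey: kubernetes.io/hostname
```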
They have two instances of CephFS:

```
ocs-storagecluster-cephfilesystem       1   293d   Ready
ocs-storagecluster-cephfilesystem-gcp   1   49d    Ready
```

To clarify, you're suggesting they have the StorageCluster CR handle the placement of MDS for ocs-storagecluster-cephfilesystem and the CephFilesystem CR handle placement for ocs-storagecluster-cephfilesystem-gcp?

Does it matter which? Could the StorageCluster CR handle either of the filesystems while the CephFilesystem CR handles the other?

The current CephFilesystem CR is attached to the case in supportshell.
(In reply to Matt See from comment #18)
> They have two instances of CephFS:
> ```
> ocs-storagecluster-cephfilesystem       1   293d   Ready
> ocs-storagecluster-cephfilesystem-gcp   1   49d    Ready
> ```
>
> To clarify, you're suggesting they have the StorageCluster CR handle the
> placement of MDS for ocs-storagecluster-cephfilesystem and the CephFilesystem
> CR handle placement for ocs-storagecluster-cephfilesystem-gcp?
>
> Does it matter which? Could the StorageCluster CR handle either of the
> filesystems while the CephFilesystem CR handles the other?

Whichever CR owns creation of the filesystem would own specifying its placement. So I would expect:

1. ocs-storagecluster-cephfilesystem was created by default by ODF, and its placement is owned by the StorageCluster CR.
2. ocs-storagecluster-cephfilesystem-gcp was created directly with a CephFilesystem CR (and not controlled by any setting in the StorageCluster CR), therefore its placement needs to be specified in the CephFilesystem CR.
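And for the ODF-managed filesystem, a sketch of the equivalent StorageCluster edit, assuming the ocs-operator accepts an "mds" key under spec.placement and that the default MDS pods carry app.kubernetes.io/part-of=ocs-storagecluster-cephfilesystem; treat this as illustrative rather than a verified configuration:

```
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  placement:
    mds:
      podAntiAffinity:
        # Require the two MDS pods of the default filesystem to run on
        # different nodes (use topology.rook.io/rack as the topologyKey
        # instead if rack-level separation is the goal).
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/part-of
              operator: In
              values:
              - ocs-storagecluster-cephfilesystem
          topologyKey: kubernetes.io/hostname
```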
Please update the RDT flag/text appropriately.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.17.0 Security, Enhancement, & Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:8676
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days