Bug 2279876
Summary: | [cee/sd][ODF] MDS pod scheduling blocked on hybrid clusters with two CephFS instances | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Anton Mark <amark> |
Component: | ocs-operator | Assignee: | Parth Arora <paarora> |
Status: | CLOSED ERRATA | QA Contact: | Nagendra Reddy <nagreddy> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.14 | CC: | bkunal, edonnell, etamir, hnallurv, msee, muagarwa, nagreddy, nberry, nigoyal, odf-bz-bot, paarora, tdesala, tnielsen |
Target Milestone: | --- | ||
Target Release: | ODF 4.17.0 | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | 4.17.0-92 | Doc Type: | Enhancement |
Doc Text: |
.Support for creating multiple filesystems
This enhancement allows users to create multiple filesystems on the same cluster, for hybrid clusters or other use cases.
|
Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2024-10-30 14:27:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2281703 |
Description
Anton Mark
2024-05-09 13:41:59 UTC
The customer applied the change with mixed results. See below:

I've applied it to the pre-Prod cluster and it does resolve the problem in that the secondary MDS is running, but it ended up running on the same node:

```
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-677655dcqtjhk   2/2   Running   0   8m55s   11.18.17.136   k8sbm-1494886.ny.fw.gs.com     <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-599cbd6dv44qg   2/2   Running   0   4m10s   11.18.17.137   k8sbm-1494886.ny.fw.gs.com     <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-gcp-a-74f97w9vz   2/2   Running   0   30d     11.18.92.46    d158815-gcp-k8s27.sky.gs.com   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-gcp-b-6579524gf   2/2   Running   0   30d     11.18.94.52    d158815-gcp-k8s29.sky.gs.com   <none>   <none>
```

Obviously, this isn't much redundancy, and it looks like the anti-affinity didn't actually take effect. Deleting one pod and letting it restart resulted in it landing on a different node, which is good:

```
~> kca-1018 -n openshift-storage delete pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-599cbd6dv44qg
pod "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-599cbd6dv44qg" deleted
~> kc-1018 -n openshift-storage get pods -o wide | grep mds
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-677655dcqtjhk   2/2   Running   0   11m   11.18.17.136   k8sbm-1494886.ny.fw.gs.com     <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-599cbd6d7rqwq   2/2   Running   0   19s   11.18.13.186   k8sbm-1494920.ny.fw.gs.com     <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-gcp-a-74f97w9vz   2/2   Running   0   30d   11.18.92.46    d158815-gcp-k8s27.sky.gs.com   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-gcp-b-6579524gf   2/2   Running   0   30d   11.18.94.52    d158815-gcp-k8s29.sky.gs.com   <none>   <none>
```

But the two nodes are in the same rack, so an upgrade or machine config change will hit both:

```
~> kc-get-node-labels 1018 k8sbm-1494886.ny.fw.gs.com
─────────────────────────────────────────────────┬────────────────────────────
admin.gs.com/firewall-policy                      │
admin.gs.com/legacy-hostname                      │
beta.kubernetes.io/arch                           │ amd64
beta.kubernetes.io/os                             │ linux
cluster.ocs.openshift.io/openshift-storage        │
firewall.gs.com/all                               │ true
gs.com/location_Building                          │ 1300FED
gs.com/requires-node-check                        │ 0
kubernetes.io/arch                                │ amd64
kubernetes.io/hostname                            │ k8sbm-1494886.ny.fw.gs.com
kubernetes.io/os                                  │ linux
node-role.kubernetes.io/worker                    │
node.kubernetes.io/ingress                        │ contour-auth
node.openshift.io/os_id                           │ rhcos
policy.gs.com/train-enabled                       │
remediation.medik8s.io/exclude-from-remediation   │ true
topology.kubernetes.io/zone                       │ compute0
topology.rook.io/rack                             │ rack0
─────────────────────────────────────────────────┴────────────────────────────

~> kc-get-node-labels 1018 k8sbm-1494920.ny.fw.gs.com
─────────────────────────────────────────────────┬────────────────────────────
admin.gs.com/firewall-policy                      │
admin.gs.com/legacy-hostname                      │
beta.kubernetes.io/arch                           │ amd64
beta.kubernetes.io/os                             │ linux
cluster.ocs.openshift.io/openshift-storage        │
firewall.gs.com/all                               │ true
gs.com/location_Building                          │ 1300FED
gs.com/requires-node-check                        │ 0
gs.com/vip-hostname                               │ k8sbm-1494920.ny.fw.gs.com
kubernetes.io/arch                                │ amd64
kubernetes.io/hostname                            │ k8sbm-1494920.ny.fw.gs.com
kubernetes.io/os                                  │ linux
node-role.kubernetes.io/worker                    │
node.kubernetes.io/ingress                        │ contour
node.kubernetes.io/role                           │ ingress
node.openshift.io/os_id                           │ rhcos
policy.gs.com/train-enabled                       │
remediation.medik8s.io/exclude-from-remediation   │ true
topology.kubernetes.io/zone                       │ compute0
topology.rook.io/rack                             │ rack0
─────────────────────────────────────────────────┴────────────────────────────
```

Which is less than ideal.

OK, what we really want is required anti-affinity, but only between the two MDS instances of the same filesystem. To be clear, you have two CephFS instances, correct? This would show the two instances:

```
oc get cephfilesystem
```

So you need to define the anti-affinity differently for each of those two instances. The placement for the first instance is controlled by the StorageCluster. But for the second CephFS instance, you created the CephFilesystem CR directly, right? In that case, please try:

- Edit the StorageCluster placement to use required anti-affinity for the label "app.kubernetes.io/part-of=ocs-storagecluster-cephfilesystem"
- Add placement to the 2nd CephFilesystem CR (since it's not controlled by the StorageCluster CR) to use required anti-affinity for the label "app.kubernetes.io/part-of=ocs-storagecluster-cephfilesystem-gcp"

They have two instances of CephFS:

```
ocs-storagecluster-cephfilesystem       1   293d   Ready
ocs-storagecluster-cephfilesystem-gcp   1   49d    Ready
```

To clarify, you're suggesting they have the StorageCluster CR handle the placement of MDS for ocs-storagecluster-cephfilesystem and the CephFilesystem CR handle placement for ocs-storagecluster-cephfilesystem-gcp?

Does it matter which? Could the StorageCluster CR handle either of the filesystems while the CephFilesystem CR handles the other?

The current CephFilesystem CR is attached to the case in supportshell.

(In reply to Matt See from comment #18)
> They have two instances of CephFS:
> ```
> ocs-storagecluster-cephfilesystem       1   293d   Ready
> ocs-storagecluster-cephfilesystem-gcp   1   49d    Ready
> ```
>
> To clarify, you're suggesting they have the storagecluster CR handle the
> placement of mds for ocs-storagecluster-cephfilesystem and the
> CephFilesystem CR handle placement for ocs-storagecluster-cephfilesystem-gcp?
>
> Does it matter which? Could the storagecluster CR handle either of the
> filesystems while the cephfilesystem CR handles the other?

Whichever CR owns creation of the filesystem also owns specifying its placement. So I would expect:

1. ocs-storagecluster-cephfilesystem was created by default by ODF, and its placement is owned by the StorageCluster CR.
2. ocs-storagecluster-cephfilesystem-gcp was created directly with a CephFilesystem CR (and is not controlled by any setting in the StorageCluster CR), therefore its placement needs to be specified in the CephFilesystem CR.

Please update the RDT flag/text appropriately.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.17.0 Security, Enhancement, & Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:8676

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
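For reference, a minimal sketch of the placement change suggested above, assuming the standard `spec.placement.mds` block of the StorageCluster CR and `spec.metadataServer.placement` of the CephFilesystem CR. The metadata names, the `activeCount`/`activeStandby` values, and the `topologyKey` are illustrative only and should be adapted to the actual CRs on the cluster (this is not the CR attached to the support case):

```yaml
# Sketch only: required MDS pod anti-affinity per filesystem, keyed on the
# app.kubernetes.io/part-of label mentioned in the comments above.
# 1) Default filesystem: placement is owned by the StorageCluster CR.
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  placement:
    mds:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/part-of
              operator: In
              values:
              - ocs-storagecluster-cephfilesystem
          # topology.rook.io/rack would additionally spread the two MDS pods
          # across racks (both nodes above were in rack0).
          topologyKey: kubernetes.io/hostname
---
# 2) Second filesystem: created directly, so placement lives in its own CR.
#    Keep the existing metadataPool/dataPools and metadataServer settings;
#    only the placement block is new here.
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: ocs-storagecluster-cephfilesystem-gcp
  namespace: openshift-storage
spec:
  metadataServer:
    activeCount: 1
    activeStandby: true
    placement:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/part-of
              operator: In
              values:
              - ocs-storagecluster-cephfilesystem-gcp
          topologyKey: kubernetes.io/hostname
```

With required (rather than preferred) anti-affinity, the scheduler refuses to co-locate the two MDS pods of a filesystem instead of merely preferring not to, which also means a pod will stay Pending if no second eligible node exists.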