Bug 2228380
| Summary: | [GSS]rook failed to reconcile ceph version, causing an osd running older version | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Sonal <sarora> |
| Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Elad <ebenahar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11 | CC: | mparida, odf-bz-bot, srai |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-11 07:16:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Sonal
2023-08-02 09:28:47 UTC
Hi Travis, BZ https://bugzilla.redhat.com/show_bug.cgi?id=2193220 was backported only to 4.13, as the issue arose only in 4.13 while we were trying to fix another BZ, https://bugzilla.redhat.com/show_bug.cgi?id=2102304. To explain briefly:

BZ 2102304: We never removed the labels of deleted nodes; instead, we just added any new information to the existing topology map. As a result, even after a node had been deleted, its labels would still appear in the topology map. The fix was to generate a fresh topology map on each reconcile so that it is always in sync with the cluster's current state.

BZ 2193220: Because we construct the topology map on each reconcile, the order of the label values is not consistent. This caused the nodeTopologies to be updated frequently, which in turn caused reconciles to be cancelled mid-way and created issues with rook-ceph-operator in stretched clusters.

So that whole problem was both introduced and resolved in 4.13 itself, but the customer here is on ODF 4.11, so we can rule out any relation to BZ 2193220. I am also looking into the case; let me see if I can find some other issue.

From the logs before the rook-ceph-operator pod restart I can see two problems:

"CR has changed for "ocs-storagecluster-cephcluster" 132 times." In the CephCluster CR, the placement spec for the mon, specifically the mon podAntiAffinity topologyKey, keeps changing back and forth between topology.kubernetes.io/hostname and kubernetes.io/hostname. The PreparePlacement.TopologySpreadConstraints inside storage.storageClassDeviceSets also keeps changing back and forth.

"CR has changed for "ocs-storagecluster-cephfilesystem" 83 times." In the CephFilesystem CR, the topologyKey in MetadataServerSpec.Placement.PodAntiAffinity likewise keeps changing back and forth between topology.kubernetes.io/hostname and kubernetes.io/hostname.

Because the placement sections of the CephCluster and CephFilesystem CRs change so frequently, the reconcile never gets ahead, and the rook operator fails with the message "rookcmd: failed to run operator: gave up to run the operator manager: failed to set up overall controller-runtime manager: error listening on :8080: listen tcp :8080: bind: address already in use". Since ocs-operator is the component making these changes to the CRs, I will concentrate the investigation in that direction.

Found the root cause. While looking at the nodes in the must-gather, I saw that they carry duplicate labels:
All the nodes have both labels, kubernetes.io/hostname and topology.kubernetes.io/hostname.
Then I looked at our storagecluster.yaml and saw that it has picked up both labels:
Node Topologies:
Labels:
kubernetes.io/hostname:
storage-0.nicqa.tahakom.com
storage-2.nicqa.tahakom.com
storage-1.nicqa.tahakom.com
topology.kubernetes.io/hostname:
storage-0.nicqa.tahakom.com
storage-2.nicqa.tahakom.com
storage-1.nicqa.tahakom.com
When our code determines the placement, it picks a key from the node topologies based on the failure domain, which is kubernetes.io/hostname.
Both kubernetes.io/hostname and topology.kubernetes.io/hostname contain that string, so sometimes it picks the first one and sometimes the second.
This causes the frequent changes in the placement sections of the CRs, because the topology key is constantly flipping.
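As a quick cross-check on a live cluster (an illustrative command, assuming oc and jq are available; the jq filter is just one way to write it), listing every node label key that contains the failure-domain string shows both labels on every node:
oc get nodes -o json | jq '.items[] | {node: .metadata.name, hostnameLabels: (.metadata.labels | with_entries(select(.key | contains("kubernetes.io/hostname"))))}'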
How to get out of this situation?
* First, the customer has to relabel their nodes to remove the topology.kubernetes.io/hostname labels. kubernetes.io/hostname is the standard well-known label and is already the determined failure domain, so all topology.kubernetes.io/hostname labels should be removed from the nodes (see the example commands after these steps).
* Scale down the ocs-operator:
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 0 }]'
* Patch the StorageCluster to clear the nodeTopologies labels so that the map is reconstructed freshly when the ocs-operator comes back:
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --subresource status --patch '[{ "op": "replace", "path": "/status/nodeTopologies/labels", "value": null }]'
(if this patch command doesn't work, upgrade your oc CLI to 4.11, which supports the --subresource flag)
* Now scale the ocs-operator back up:
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 1 }]'
* Check that the new node topology map no longer has the topology.kubernetes.io/hostname labels:
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq
* Delete the existing rook-ceph-operator pod; a new one should come up.
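For reference, hedged example commands for the relabel step and the final pod restart. The node names below are taken from the topology map above, and app=rook-ceph-operator is the usual label selector for the rook operator pod, so verify both against the actual cluster before running:
oc label node storage-0.nicqa.tahakom.com storage-1.nicqa.tahakom.com storage-2.nicqa.tahakom.com topology.kubernetes.io/hostname-
oc delete pod -n openshift-storage -l app=rook-ceph-operator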
Before running these steps, please confirm with the customer why the topology.kubernetes.io/hostname label was added to their nodes when kubernetes.io/hostname already existed. Is it critical for some other workload on the cluster? Please confirm this before running the steps.

Hi Sonal, both the CephCluster and CephFilesystem CRs are created and kept updated by ocs-operator.

I can see from the attached case:
```
The cluster looks good. All osds and ceph components are on same version now i.e 16.2.10-138.el8cp
ceph cluster is HEALTH_OK and storagecluster is in Ready phase. Noobaa is in Ready phase.
```
As the underlying issue for which this BZ was raised is now solved, I feel it's correct to close it. If any further issues related to this cause come up, the BZ can be reopened.

Probably yes, but this was the first time I saw two labels added where one is a substring of the other. I will talk to the team about whether we should fix this case.
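For completeness, one way to re-check the daemon versions quoted above (assuming the rook-ceph-tools toolbox deployment is enabled in the openshift-storage namespace):
oc rsh -n openshift-storage deploy/rook-ceph-tools ceph versions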