Bug 2228380

Summary: [GSS]rook failed to reconcile ceph version, causing an osd running older version
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ocs-operator
Version: 4.11
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED CURRENTRELEASE
Reporter: Sonal <sarora>
Assignee: Malay Kumar parida <mparida>
QA Contact: Elad <ebenahar>
CC: mparida, odf-bz-bot, srai
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2023-08-11 07:16:14 UTC

Description Sonal 2023-08-02 09:28:47 UTC
Description of problem (please be detailed as possible and provide log
snippests):

After upgrading from ODF 4.9.2 to 4.11.3, the storagecluster is in an error state due to an error reconciling the storageclass. The rook-ceph-operator failed to reconcile the ceph version, throwing the error below:

~~~
2023-08-01T12:42:29.044242502Z 2023-08-01 12:42:29.044198 E | ceph-file-controller: failed to reconcile failed to detect running and desired ceph version: failed to detect ceph image version: failed to complete ceph version job: failed to run CmdReporter ceph-file-controller-detect-version successfully. failed to run job. context canceled
2023-08-01T12:42:29.044269309Z 2023-08-01 12:42:29.044242 E | ceph-file-controller: failed to reconcile CephFilesystem "openshift-storage/ocs-storagecluster-cephfilesystem". failed to detect running and desired ceph version: failed to detect ceph image version: failed to complete ceph version job: failed to run CmdReporter ceph-file-controller-detect-version successfully. failed to run job. context canceled
~~~

As a result, one of the OSDs is running the older version 16.2.0-152 while the rest of the Ceph components are on 16.2.10-138.
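For reference, the per-daemon Ceph versions can be checked with "ceph versions" from the toolbox, assuming the rook-ceph-tools deployment is enabled on the cluster:
oc rsh -n openshift-storage deploy/rook-ceph-tools ceph versions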

Version of all relevant components (if applicable):
ODF 4.11.9

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No, there is no impact on using applications.

Is there any workaround available to the best of your knowledge?
Tried restarting the rook-ceph-operator pod, but it didn't help.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Additional info:
In next private comment

Comment 9 Malay Kumar parida 2023-08-03 08:23:20 UTC
Hi Travis, the fix for BZ https://bugzilla.redhat.com/show_bug.cgi?id=2193220 was backported only to 4.13, as that issue arose only in 4.13 while we were trying to fix another BZ, https://bugzilla.redhat.com/show_bug.cgi?id=2102304.
To explain briefly:
BZ 2102304
We never removed the labels of deleted nodes; we only added new information to the existing topology map. As a result, even after a node had been deleted, its labels still appeared in the topology map. The fix was to generate a fresh topology map on each reconcile so that it always stays in sync with the cluster's current state.

BZ 2193220
Because the topology map is now constructed on each reconcile, the order of the label values was not consistent. This caused the node topologies to be updated frequently, which cancelled reconciles midway and caused issues with rook-ceph-operator in stretched clusters.

So that whole problem was both introduced and fixed in 4.13 itself, but the customer here is on ODF 4.11, so we can rule out a relation with BZ 2193220.

I am also looking into the case; let me see if I can find some other issue.

Comment 11 Malay Kumar parida 2023-08-03 09:25:14 UTC
From the logs before the rook-ceph-operator pod restart I can see two problems:

"CR has changed for "ocs-storagecluster-cephcluster" 132 times.
Here in the cephcluster CR the placement spec for the mon, specifically the mon podAntiAffinity topologyKey keeps changing from topology.kubernetes.io/hostname to kubernetes.io/hostname back & forth.
And also inside the storage.storageclassdevicesets PreparePlacement.TopologySpreadConstraints keeps changing back & forth.


"CR has changed for "ocs-storagecluster-cephfilesystem" 83 times.
Here in the cephfilesystem CR in the MetadataServerSpec.Placement.PodAntiAffinity topologyKey keeps changing from topology.kubernetes.io/hostname to kubernetes.io/hostname back & forth.

Because the placement sections of the CephCluster and CephFilesystem CRs change so frequently, the reconcile never makes progress, and the rook operator fails with the message
"rookcmd: failed to run operator: gave up to run the operator manager: failed to set up overall controller-runtime manager: error listening on :8080: listen tcp :8080: bind: address already in use".

I understand the ocs-operator is the one making these changes to the CRs, so I will concentrate the investigation in that direction.

Comment 12 Malay Kumar parida 2023-08-03 10:02:24 UTC
Found the root cause. While looking at the nodes in the must-gather I saw that the nodes have duplicate labels: every node carries both kubernetes.io/hostname and topology.kubernetes.io/hostname.

Then I looked at our storagecluster.yaml, and it has picked up both labels:
 Node Topologies:
    Labels:
      kubernetes.io/hostname:
        storage-0.nicqa.tahakom.com
        storage-2.nicqa.tahakom.com
        storage-1.nicqa.tahakom.com
      topology.kubernetes.io/hostname:
        storage-0.nicqa.tahakom.com
        storage-2.nicqa.tahakom.com
        storage-1.nicqa.tahakom.com
When our code determines the placement, it picks the topology key from the Node Topologies map based on the failure domain, which is kubernetes.io/hostname. Since both kubernetes.io/hostname and topology.kubernetes.io/hostname contain that string, it sometimes picks the first one and sometimes the second.

This causes frequent changes in the placement section of the CRs as the topology key is constantly changing.
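For reference, the duplicate labels can be confirmed directly on the nodes, e.g. with one of the node names from the topology map above:
oc get node storage-0.nicqa.tahakom.com --show-labels | tr ',' '\n' | grep hostname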

Comment 14 Malay Kumar parida 2023-08-03 11:55:32 UTC
How to get out of this situation? 

* First, the customer has to relabel their nodes to remove the topology.kubernetes.io/hostname label. kubernetes.io/hostname is the standard well-known label and is already determined as the failure domain, so all topology.kubernetes.io/hostname labels should be removed from the nodes.
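For example, something like this should remove the unwanted label from all three nodes (node names taken from the topology map above; the trailing dash removes a label):
for n in storage-0.nicqa.tahakom.com storage-1.nicqa.tahakom.com storage-2.nicqa.tahakom.com; do oc label node "$n" topology.kubernetes.io/hostname-; done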

* Scale Down ocs operator
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 0 }]'

* Patch the storagecluster to clear the nodeTopologies labels so that the map is rebuilt freshly when the ocs-operator comes back up
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --subresource status --patch '[{ "op": "replace", "path": "/status/nodeTopologies/labels", "value": {} }]'
(if this patch command doesn't work, please upgrade your oc cli to 4.11)

* Now Scale Up ocs operator
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 1 }]'

* Check that the new nodeTopologies map no longer has the topology.kubernetes.io/hostname labels
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq

* Finally, delete the existing rook-ceph-operator pod; a new one should come up.
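For example (assuming the operator pod carries the standard app=rook-ceph-operator label):
oc delete pod -n openshift-storage -l app=rook-ceph-operator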

Comment 15 Malay Kumar parida 2023-08-03 11:59:31 UTC
Before running these steps, please confirm with the customer why the label topology.kubernetes.io/hostname was added to their nodes even though kubernetes.io/hostname already existed, and whether it is critical for any of their other workloads on the cluster.

Comment 17 Malay Kumar parida 2023-08-03 12:49:22 UTC
Hi Sonal, both CRs, CephCluster & CephFilesystem, are created and kept updated by the ocs-operator.

Comment 18 Malay Kumar parida 2023-08-11 07:13:28 UTC
I can see from the attached case

```
The cluster looks good.

All osds and ceph components are on same version now i.e 16.2.10-138.el8cp

ceph cluster is HEALTH_OK and storagecluster is in Ready phase.

Noobaa is in Ready phase.

```

As the underlying issue for which this BZ was raised is now solved, I feel it is correct to close this BZ.
If any further issues related to this cause come up, the BZ can be reopened.

Comment 20 Malay Kumar parida 2023-08-11 09:35:35 UTC
Probably yes, but this was the first time I have seen two labels added where one is a substring of the other. I will talk to the team about whether we should fix this case.