Description of problem (please be as detailed as possible and provide log snippets):

After upgrading from ODF 4.9.2 to 4.11.3, the storagecluster is in an error state due to an error while reconciling the storageclass. The rook-ceph-operator failed to reconcile the ceph version, throwing the error below:

~~~
2023-08-01T12:42:29.044242502Z 2023-08-01 12:42:29.044198 E | ceph-file-controller: failed to reconcile failed to detect running and desired ceph version: failed to detect ceph image version: failed to complete ceph version job: failed to run CmdReporter ceph-file-controller-detect-version successfully. failed to run job. context canceled
2023-08-01T12:42:29.044269309Z 2023-08-01 12:42:29.044242 E | ceph-file-controller: failed to reconcile CephFilesystem "openshift-storage/ocs-storagecluster-cephfilesystem". failed to detect running and desired ceph version: failed to detect ceph image version: failed to complete ceph version job: failed to run CmdReporter ceph-file-controller-detect-version successfully. failed to run job. context canceled
~~~

As a result, one of the OSDs is running on the older version 16.2.0-152 while the rest of the ceph components are on 16.2.10-138.

Version of all relevant components (if applicable): ODF 4.11.9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)? No, there is no impact on using applications.

Is there any workaround available to the best of your knowledge? Tried restarting the rook-ceph-operator pod; it didn't help.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 2

Additional info: In next private comment
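For reference, a minimal sketch of how the per-daemon version skew can be confirmed from the Rook toolbox (assuming the rook-ceph-tools deployment is running in openshift-storage with the usual app=rook-ceph-tools label):

```bash
# Find the toolbox pod (label is assumed to be the default app=rook-ceph-tools)
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)

# 'ceph versions' lists the running version of every mon/mgr/osd/mds daemon,
# which should show one OSD still on 16.2.0 while the rest report 16.2.10
oc -n openshift-storage exec "$TOOLS_POD" -- ceph versions
```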
Hi Travis, the BZ https://bugzilla.redhat.com/show_bug.cgi?id=2193220 was backported to 4.13 only, as the issue arose only in 4.13 while we were trying to fix another BZ, https://bugzilla.redhat.com/show_bug.cgi?id=2102304. To explain briefly:

BZ 2102304: We never removed the labels for deleted nodes; instead, we just added any new information to the existing topology map. As a result, even after a node had been deleted, its labels would still appear in the topology map. It had to be fixed by generating a fresh topology map on each reconcile so that the map is always in sync with the cluster's current state.

BZ 2193220: Because we now construct the topology map on each reconcile, the order of the label values was not consistent. This caused the nodeTopologies to be updated frequently, which in turn caused reconciles to be cancelled mid-way and broke the rook-ceph-operator in stretched clusters.

So this whole problem was introduced and fixed in 4.13 itself, but the customer here is on ODF 4.11, so we can rule out a relation with BZ 2193220. I am looking into the case as well; let me see if I can spot some other issue.
From the logs before the rook-ceph-operator pod restart I can see two problems:

1. "CR has changed for "ocs-storagecluster-cephcluster"" 132 times. Here, in the CephCluster CR, the placement spec for the mon, specifically the mon podAntiAffinity topologyKey, keeps flipping back and forth between topology.kubernetes.io/hostname and kubernetes.io/hostname. The PreparePlacement.TopologySpreadConstraints inside storage.storageClassDeviceSets also keeps flipping back and forth.

2. "CR has changed for "ocs-storagecluster-cephfilesystem"" 83 times. Here, in the CephFilesystem CR, the MetadataServerSpec.Placement.PodAntiAffinity topologyKey keeps flipping between topology.kubernetes.io/hostname and kubernetes.io/hostname in the same way.

Because the placement sections of the CephCluster CR and the CephFilesystem CR change so frequently, the reconcile never makes progress, and the rook operator fails with the message "rookcmd: failed to run operator: gave up to run the operator manager: failed to set up overall controller-runtime manager: error listening on :8080: listen tcp :8080: bind: address already in use". I understand the ocs-operator is the one making these changes to the CRs, so I will concentrate the investigation in that direction.
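A quick way to see which topologyKey the operator last wrote into the CRs (a hedged sketch; the placement fields follow the standard CephCluster/CephFilesystem specs, so grepping the rendered YAML is the simplest check):

```bash
# Repeating these a few seconds apart shows the value flipping between
# kubernetes.io/hostname and topology.kubernetes.io/hostname
oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster -o yaml | grep topologyKey
oc -n openshift-storage get cephfilesystem ocs-storagecluster-cephfilesystem -o yaml | grep topologyKey
```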
Found the root cause. While looking at the nodes in the must-gather I saw that the nodes carry duplicate kinds of labels: all the nodes have both kubernetes.io/hostname and topology.kubernetes.io/hostname. Then I looked at our storagecluster.yaml and it has picked up both labels:

  Node Topologies:
    Labels:
      kubernetes.io/hostname:
        storage-0.nicqa.tahakom.com
        storage-2.nicqa.tahakom.com
        storage-1.nicqa.tahakom.com
      topology.kubernetes.io/hostname:
        storage-0.nicqa.tahakom.com
        storage-2.nicqa.tahakom.com
        storage-1.nicqa.tahakom.com

When our code determines the placement, it picks from the Node Topologies by matching the failure domain, which is kubernetes.io/hostname, and both labels kubernetes.io/hostname and topology.kubernetes.io/hostname contain that string. So sometimes it picks the first one and sometimes the second. This causes the frequent changes in the placement sections of the CRs, as the topology key is constantly changing.
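To confirm the duplicate labelling on the nodes themselves, a small sketch (-L simply adds one column per label key):

```bash
# On the affected cluster every node shows the same hostname value under both
# columns, i.e. the two labels are duplicates of each other
oc get nodes -L kubernetes.io/hostname -L topology.kubernetes.io/hostname
```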
How to get out of this situation?

* First, the customer has to relabel their nodes to remove the topology.kubernetes.io/hostname labels. kubernetes.io/hostname is the standard well-known label and is already determined as the failure domain, so all topology.kubernetes.io/hostname labels should be removed from the nodes (a relabel sketch is shown after this list).

* Scale down the ocs operator:
  oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 0 }]'

* Patch the StorageCluster to remove the nodeTopologies field so that it is reconstructed freshly when the ocs operator comes back:
  oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --subresource status --patch '[{ "op": "replace", "path": "/status/nodeTopologies/labels", "value": null }]'
  (if this patch command doesn't work, please upgrade your oc cli to 4.11)

* Now scale up the ocs operator:
  oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 1 }]'

* Check that the new node topology map no longer has the topology.kubernetes.io/hostname labels:
  oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq

* Delete the existing rook-ceph-operator pod & a new one should come up.
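As referenced in the first step, a minimal relabel sketch (assuming the label should simply be dropped from every node; <node-name> is a placeholder):

```bash
# A trailing '-' after the label key removes that label from the node
oc label node <node-name> topology.kubernetes.io/hostname-

# Or, to remove it from all nodes at once:
oc label nodes --all topology.kubernetes.io/hostname-
```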
Before running these steps, please confirm with the customer the reason for adding the label topology.kubernetes.io/hostname to their nodes even though kubernetes.io/hostname already existed: is it critical for some other workload on their cluster? Please confirm this before running the steps.
Hi Sonal, both CRs, CephCluster & CephFilesystem, are created & kept updated by the ocs-operator.
I can see from the attached case:
```
The cluster looks good. All osds and ceph components are on same version now i.e 16.2.10-138.el8cp
ceph cluster is HEALTH_OK and storagecluster is in Ready phase. Noobaa is in Ready phase.
```
As the underlying issue for which this BZ was raised is now solved, I feel it is correct to close this BZ. If any further issues related to this cause come up, the BZ can be reopened.
Probably yes, but this was the first time I saw a case where two labels were added and one is a substring of the other. I will talk to the team about whether we should fix this case.