Bug 2102304 - [GSS] Remove the entry of removed node from Storagecluster under Node Topology
Summary: [GSS] Remove the entry of removed node from Storagecluster under Node Topology
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Malay Kumar parida
QA Contact: Oded
URL:
Whiteboard:
Depends On:
Blocks: 2252940
 
Reported: 2022-06-29 15:38 UTC by Priya Pandey
Modified: 2023-12-05 11:17 UTC
CC List: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2252940 (view as bug list)
Environment:
Last Closed: 2023-06-21 15:22:14 UTC
Embargoed:




Links
- GitHub red-hat-storage/ocs-ci pull 7920 (Merged): GSS Automation, Remove the entry of removed node from Storagecluster under Node Topology (last updated 2023-07-26 06:28:24 UTC)
- GitHub red-hat-storage/ocs-operator pull 1973 (Merged): Construct the topology map instead of just adding to the map (last updated 2023-12-19 07:29:33 UTC)
- GitHub red-hat-storage/ocs-operator pull 1990 (Merged): Bug 2102304: [release-4.13] Construct the topology map instead of just adding to the map (last updated 2023-12-19 07:29:36 UTC)
- Red Hat Product Errata RHBA-2023:3742 (last updated 2023-06-21 15:23:07 UTC)

Description Priya Pandey 2022-06-29 15:38:05 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- The customer removed an OCS node from the cluster and performed the steps to
  replace the node.

- The entry of the removed node is still visible in the storagecluster.

------------------------------------------

oc get StorageCluster -o yaml | grep nodeTopologies -A 11
    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - dk1osn1019.eva.danskenet.com
        - dk1osn101a.eva.danskenet.com
        - dk1osn1018.eva.danskenet.com   -------> Removed node
        - dk1osn1010.eva.danskenet.com
------------------------------------------

- We tried to remove the entries from the storagecluster manually, but they were reconciled back.

- We need assistance on how to remove the entry from the storagecluster; a sketch of the attempted edit follows below.
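
For reference, a minimal sketch of the kind of manual edit that gets reconciled away, assuming the default StorageCluster name ocs-storagecluster in the openshift-storage namespace (adjust both for your environment):

# Open the StorageCluster and try to delete the stale hostname from
# status.nodeTopologies.labels by hand.
oc edit storagecluster ocs-storagecluster -n openshift-storage

# The edit does not stick: on its next pass the ocs-operator reconciler
# rebuilds the status, and before the 4.13 fix it only ever added entries
# to the topology map, so the removed node's entry reappears.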

Version of all relevant components (if applicable):

v4.8


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

- No Impact in the storagecluster.


Is there any workaround available to the best of your knowledge?

N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Can this issue reproducible?

Yes

Can this issue reproduce from the UI?
No

If this is a regression, please provide more details to justify this:
N/A


Steps to Reproduce:
1. Remove an existing node and add a new node to the cluster.
2. Follow the node replacement procedure.
3. Try to remove the old node's entry from the storagecluster (a command sketch follows below).
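
A minimal command sketch of these steps, assuming node names compute-0 (old) and compute-3 (new) as in the verification run in comment 19:

# Steps 1-2: drain and delete the old node, then label the replacement node.
oc adm cordon compute-0
oc adm drain compute-0 --force --delete-emptydir-data=true --ignore-daemonsets
oc delete node compute-0
oc label node compute-3 cluster.ocs.openshift.io/openshift-storage=""

# Step 3: inspect the topology map; before the fix, any manual removal of the
# old node's entry is reconciled back.
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq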


Actual results:

- The removed node's topology entry is reconciled back into the storagecluster after the change.


Expected results:

- The removed node's topology entry should not be reconciled back after the node is replaced; it should be dropped from the storagecluster.

Additional info:
In the next steps

Comment 17 avdhoot 2023-05-29 09:33:11 UTC
@mparida 

Can you help me out with which platform I should test this BZ on: AWS, VMware, or bare metal?

If it is platform-independent, is it OK to test it on VMware?

Comment 18 Malay Kumar parida 2023-05-29 10:04:28 UTC
This is platform-independent; it can be tested on any platform.

Comment 19 Oded 2023-06-13 09:40:17 UTC
Hi,
I have 2 questions:
1. Is the rack label created automatically when a new node is labeled with OCS? [based on the replaced node]
2. The replaced node is automatically deleted from the storagecluster. Is this the expected behavior, or do we need to delete it manually?

SetUp:
ODF Version: 4.13.0-218
OCP Version: 4.13.0-0.nightly-2023-06-09-152551
Platform: vSphere


Test Process:
1. Install OCP 4.13.
2. Install ODF 4.13.
3. Install the storagecluster and label 3 nodes with OCS.
4. Check Ceph status:
sh-5.1$ ceph -s
  cluster:
    id:     00cabf1d-0951-459f-8388-2cac249d5851
    health: HEALTH_OK
5. Check the storagecluster:
$ oc get storageclusters.ocs.openshift.io 
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   11m   Ready              2023-06-13T08:57:28Z   4.13.0

    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - compute-0
        - compute-1
        - compute-2
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack2


6. Delete compute-0:
$ oc get nodes --show-labels
NAME              STATUS   ROLES                  AGE   VERSION           LABELS
compute-0         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0
compute-1         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack1
compute-2         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack2
compute-3         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-3,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos
control-plane-0   Ready    control-plane,master   18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=control-plane-0,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,node.openshift.io/os_id=rhcos


oviner:auth$ oc adm cordon compute-0
oviner:auth$ oc adm drain compute-0 --force --delete-emptydir-data=true --ignore-daemonsets
oviner:auth$ oc delete nodes compute-0
node "compute-0" deleted

7. Apply the OpenShift Data Foundation label to the new node:
$ oc label node compute-3 cluster.ocs.openshift.io/openshift-storage=""
node/compute-3 labeled

$ oc get nodes compute-3 --show-labels 
NAME        STATUS   ROLES    AGE   VERSION           LABELS
compute-3   Ready    worker   18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-3,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0

8. Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide | grep compute-3
csi-addons-controller-manager-b6f965bdb-hn826                     2/2     Running     1 (14m ago)   25m     10.128.2.49   compute-3   <none>           <none>
csi-cephfsplugin-fsd87                                            2/2     Running     0             23m     10.1.161.91   compute-3   <none>           <none>
csi-cephfsplugin-provisioner-886476949-xddqv                      5/5     Running     0             23m     10.128.2.51   compute-3   <none>           <none>
csi-rbdplugin-fpnwn                                               3/3     Running     0             23m     10.1.161.91   compute-3   <none>           <none>
noobaa-operator-6bbc975866-84x4f                                  1/1     Running     0             5m5s    10.128.2.57   compute-3   <none>           <none>
ocs-operator-655d6b4c4c-bxhfx                                     1/1     Running     1 (14m ago)   27m     10.128.2.47   compute-3   <none>           <none>
odf-operator-controller-manager-79dc8569db-zb2vs                  2/2     Running     0             27m     10.128.2.48   compute-3   <none>           <none>
rook-ceph-crashcollector-compute-3-84fd6c5847-nggb2               1/1     Running     0             2m26s   10.128.2.65   compute-3   <none>           <none>
rook-ceph-exporter-compute-3-69bc7869c9-wg8sm                     1/1     Running     0             2m26s   10.128.2.64   compute-3   <none>           <none>
rook-ceph-mon-c-6cf8fb4ff5-28k74                                  2/2     Running     0             5m5s    10.128.2.63   compute-3   <none>           <none>
rook-ceph-operator-b99c67644-flbtv                                1/1     Running     0             23m     10.128.2.50   compute-3   <none>           <none>
rook-ceph-osd-0-85649955dd-pct9p                                  2/2     Running     0             5m5s    10.128.2.66   compute-3   <none>           <none>
rook-ceph-tools-75bc769bdd-jwc6p                                  1/1     Running     0             13m     10.128.2.55   compute-3   <none>           <none>

9. Check Ceph status:
sh-5.1$ ceph -s
  cluster:
    id:     00cabf1d-0951-459f-8388-2cac249d5851
    health: HEALTH_OK

10. Check the storagecluster:
$ oc get storageclusters.ocs.openshift.io 
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   25m   Ready              2023-06-13T08:57:28Z   4.13.0


    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - compute-1
        - compute-2
        - compute-3
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack2

For more info: https://docs.google.com/document/d/1U9bxOpkCcQdFqwD_C1Vea921r0B72-q3I8Q119x0eso/edit

Comment 20 Malay Kumar parida 2023-06-13 09:52:25 UTC
Hi,
1. Yes, the rack labels are created automatically by OCS, so the behavior you saw, where the new node automatically got a rack label after you applied the OCS label, is expected.

2. Yes, that is the expected behavior after the fix, i.e. the OCS operator should automatically remove the deleted node from the node topology. Based on the behavior you saw, the BZ is good to mark as verified.
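
For a quick check of the automatic rack labeling, one option is to read the label straight off the replacement node (the jsonpath invocation here is an illustrative sketch; the label key topology.rook.io/rack is taken from the node listing in comment 19):

# Print the rack label that OCS assigned to the replacement node.
oc get node compute-3 -o jsonpath='{.metadata.labels.topology\.rook\.io/rack}'
# From step 7 above this prints: rack0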

Comment 21 Oded 2023-06-13 10:20:45 UTC
Moving to the verified state based on https://bugzilla.redhat.com/show_bug.cgi?id=2102304#c20

Comment 22 errata-xmlrpc 2023-06-21 15:22:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

Comment 23 Malay Kumar parida 2023-08-03 20:34:19 UTC
Since the BZ has been fixed only in ODF 4.13, people can still encounter this issue on earlier releases. While dealing with another case I found a workaround for this BZ, so I am pasting it here for reference. The workaround can be used if a customer is on an earlier ODF release.

* Scale down the ocs-operator:
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 0 }]'

* Patch the storagecluster to clear the nodeTopologies labels, so that the map is reconstructed freshly when the ocs-operator comes back (the empty object {} resets the labels map):
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --subresource status --patch '[{ "op": "replace", "path": "/status/nodeTopologies/labels", "value": {} }]'
(if this patch command doesn't work, please upgrade your oc CLI to 4.11 or later)

* Now scale the ocs-operator back up:
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 1 }]'

* Check whether the new node topology map is now the desired one:
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq
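
To double-check that the stale node is gone after the workaround, one option is to list the hostnames in the rebuilt map directly (the jq filter here is an illustrative sketch; dk1osn1018.eva.danskenet.com is the removed node from the original report):

# List the hostnames currently in the topology map; the removed node should
# no longer appear.
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq -r '.labels["kubernetes.io/hostname"][]'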

