Bug 2102304
| Summary: | [GSS] Remove the entry of removed node from Storagecluster under Node Topology | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Priya Pandey <prpandey> |
| Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida> |
| Status: | CLOSED ERRATA | QA Contact: | Oded <oviner> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.8 | CC: | aivaras.laimikis, ebenahar, hnallurv, mhackett, mparida, muagarwa, ocs-bugs, odf-bz-bot, oviner, sarora, sostapov, tdesala |
| Target Milestone: | --- | ||
| Target Release: | ODF 4.13.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-06-21 15:22:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
If it's LSO, did the customer delete the old node from localvolumediscovery and localvolumeset? https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/replacing_nodes/openshift_container_storage_deployed_using_local_storage_devices#replacing_storage_nodes_on_vmware_infrastructure

@mparida Can you help me out with which platform I should test this BZ on (AWS, VMware, bare metal)? If it is independent of the platform, is it OK to test it on VMware?

This is platform-independent; it can be tested on any platform.

Hi,
I have 2 questions:
1. Is the rack label created automatically when labeling a new node with OCS? [based on the replaced node]
2. The replaced node is automatically deleted from the storagecluster. Is this the expected behavior, or do we need to delete it manually?
SetUp:
ODF Version: 4.13.0-218
OCP Version: 4.13.0-0.nightly-2023-06-09-152551
Platform: Vsphere
Test Process:
1. Install OCP 4.13
2. Install ODF 4.13
3. Install StorageCluster and label 3 nodes with OCS
4. Check ceph status:
sh-5.1$ ceph -s
cluster:
id: 00cabf1d-0951-459f-8388-2cac249d5851
health: HEALTH_OK
5. Check storagecluster:
$ oc get storageclusters.ocs.openshift.io
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 11m Ready 2023-06-13T08:57:28Z 4.13.0
nodeTopologies:
labels:
kubernetes.io/hostname:
- compute-0
- compute-1
- compute-2
topology.rook.io/rack:
- rack0
- rack1
- rack2
6. Delete compute-0:
$ oc get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
compute-0 Ready worker 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0
compute-1 Ready worker 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack1
compute-2 Ready worker 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack2
compute-3 Ready worker 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-3,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos
control-plane-0 Ready control-plane,master 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=control-plane-0,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,node.openshift.io/os_id=rhcos
oviner:auth$ oc adm cordon compute-0
oviner:auth$ oc adm drain compute-0 --force --delete-emptydir-data=true --ignore-daemonsets
oviner:auth$ oc delete nodes compute-0
node "compute-0" deleted
7. Apply the OpenShift Data Foundation label to the new node:
$ oc label node compute-3 cluster.ocs.openshift.io/openshift-storage=""
node/compute-3 labeled
$ oc get nodes compute-3 --show-labels
NAME STATUS ROLES AGE VERSION LABELS
compute-3 Ready worker 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-3,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0
8. Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide | grep compute-3
csi-addons-controller-manager-b6f965bdb-hn826 2/2 Running 1 (14m ago) 25m 10.128.2.49 compute-3 <none> <none>
csi-cephfsplugin-fsd87 2/2 Running 0 23m 10.1.161.91 compute-3 <none> <none>
csi-cephfsplugin-provisioner-886476949-xddqv 5/5 Running 0 23m 10.128.2.51 compute-3 <none> <none>
csi-rbdplugin-fpnwn 3/3 Running 0 23m 10.1.161.91 compute-3 <none> <none>
noobaa-operator-6bbc975866-84x4f 1/1 Running 0 5m5s 10.128.2.57 compute-3 <none> <none>
ocs-operator-655d6b4c4c-bxhfx 1/1 Running 1 (14m ago) 27m 10.128.2.47 compute-3 <none> <none>
odf-operator-controller-manager-79dc8569db-zb2vs 2/2 Running 0 27m 10.128.2.48 compute-3 <none> <none>
rook-ceph-crashcollector-compute-3-84fd6c5847-nggb2 1/1 Running 0 2m26s 10.128.2.65 compute-3 <none> <none>
rook-ceph-exporter-compute-3-69bc7869c9-wg8sm 1/1 Running 0 2m26s 10.128.2.64 compute-3 <none> <none>
rook-ceph-mon-c-6cf8fb4ff5-28k74 2/2 Running 0 5m5s 10.128.2.63 compute-3 <none> <none>
rook-ceph-operator-b99c67644-flbtv 1/1 Running 0 23m 10.128.2.50 compute-3 <none> <none>
rook-ceph-osd-0-85649955dd-pct9p 2/2 Running 0 5m5s 10.128.2.66 compute-3 <none> <none>
rook-ceph-tools-75bc769bdd-jwc6p 1/1 Running 0 13m 10.128.2.55 compute-3 <none> <none>
9. Check ceph status:
sh-5.1$ ceph -s
cluster:
id: 00cabf1d-0951-459f-8388-2cac249d5851
health: HEALTH_OK
10. Check storagecluster:
$ oc get storageclusters.ocs.openshift.io
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 25m Ready 2023-06-13T08:57:28Z 4.13.0
nodeTopologies:
labels:
kubernetes.io/hostname:
- compute-1
- compute-2
- compute-3
topology.rook.io/rack:
- rack0
- rack1
- rack2
For more info: https://docs.google.com/document/d/1U9bxOpkCcQdFqwD_C1Vea921r0B72-q3I8Q119x0eso/edit
Hi,
1. Yes, the rack labels are created automatically by OCS, so the behavior you saw, where the new node was automatically given a rack label after you applied the OCS label, is expected.
2. Yes, that is the expected behavior after the fix, i.e. the OCS operator should automatically remove the removed node from the node topology. The behavior you saw is good to mark the BZ as verified.

Moving to the verified state based on https://bugzilla.redhat.com/show_bug.cgi?id=2102304#c20

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742

Since the BZ has been fixed only in ODF 4.13, people can still encounter this issue on earlier releases. While dealing with another case I found a workaround for the BZ, so I am pasting it here for reference. The workaround can be used if a customer is on an earlier ODF release.
* Scale Down ocs operator
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 0 }]'
* Patch the storagecluster status to clear the nodeTopologies labels, so that the map is reconstructed freshly when the ocs operator comes back
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --subresource status --patch '[{ "op": "replace", "path": "/status/nodeTopologies/labels", "value": {} }]'
(if this patch command doesn't work, please upgrade your oc cli to 4.11)
* Now Scale Up ocs operator
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 1 }]'
* Check that the new nodeTopologies map is now the desired one
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq
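Why clearing the labels map works: once the map is emptied and the operator is scaled back up, the next reconcile repopulates it only from nodes that currently carry the OCS label, so the stale entry never comes back. A simplified Python model of that fresh reconcile (this is an illustration, not the actual ocs-operator code; hostnames and racks are taken from the test run above):

```python
def rebuild_topology(labeled_nodes):
    """Model of the fresh reconcile after the workaround: the map is
    repopulated only from nodes that currently carry the OCS label."""
    return {
        "kubernetes.io/hostname": sorted(n["hostname"] for n in labeled_nodes),
        "topology.rook.io/rack": sorted({n["rack"] for n in labeled_nodes}),
    }

# Nodes as they exist after the replacement (compute-0 is gone,
# compute-3 inherited rack0).
nodes = [
    {"hostname": "compute-1", "rack": "rack1"},
    {"hostname": "compute-2", "rack": "rack2"},
    {"hostname": "compute-3", "rack": "rack0"},
]
print(rebuild_topology(nodes))
```

Because the rebuilt map is derived purely from the live node labels, the removed node cannot reappear in it.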
Description of problem (please be detailed as possible and provide log snippets):

- The customer has removed an OCS node from the cluster and performed the steps to replace the node.
- The entry of the removed node is still visible in the storagecluster:

------------------------------------------
oc get StorageCluster -o yaml | grep nodeTopologies -A 11
nodeTopologies:
  labels:
    kubernetes.io/hostname:
    - dk1osn1019.eva.danskenet.com
    - dk1osn101a.eva.danskenet.com
    - dk1osn1018.eva.danskenet.com   -------> Removed node
    - dk1osn1010.eva.danskenet.com
------------------------------------------

- We tried to remove the entries from the storagecluster, but they got reconciled back.
- Need assistance on how to remove the entry from the storagecluster.

Version of all relevant components (if applicable): v4.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
- No impact on the storagecluster.

Is there any workaround available to the best of your knowledge? N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 1

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI? No

If this is a regression, please provide more details to justify this: N/A

Steps to Reproduce:
1. Remove an existing node and add a new node to the cluster.
2. Follow the node replacement procedure.
3. Try to remove the entry from the storagecluster.

Actual results:
- The node topology entry is reconciled back after changes.

Expected results:
- The node topology entry should not be reconciled back after changes.

Additional info: In the next steps
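The underlying behavior can be sketched as follows (a simplified Python model under the assumption stated in this BZ, not the actual ocs-operator code): before the fix, the reconcile only ever merged newly labeled nodes into the node topology map, so a deleted node's entry was never dropped; the fix rebuilds the map from the currently labeled nodes.

```python
def reconcile_buggy(current_hosts, labeled_hosts):
    """Pre-fix behavior (simplified): entries are only added, never removed."""
    return sorted(set(current_hosts) | set(labeled_hosts))

def reconcile_fixed(current_hosts, labeled_hosts):
    """Post-fix behavior (simplified): the map is rebuilt from live nodes."""
    return sorted(labeled_hosts)

# A node replacement: compute-0 removed, compute-3 added and labeled.
old_map = ["compute-0", "compute-1", "compute-2"]
live = ["compute-1", "compute-2", "compute-3"]

print(reconcile_buggy(old_map, live))  # the stale compute-0 entry survives
print(reconcile_fixed(old_map, live))  # the stale entry is gone
```

This is why manually deleting the entry on 4.8 did not help: the accumulate-only reconcile reinstated it on the next pass.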