Bug 2102304 - [GSS] Remove the entry of removed node from Storagecluster under Node Topology
Summary: [GSS] Remove the entry of removed node from Storagecluster under Node Topology
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: Malay Kumar parida
QA Contact: Oded
URL:
Whiteboard:
Depends On:
Blocks: 2252940
 
Reported: 2022-06-29 15:38 UTC by Priya Pandey
Modified: 2023-12-05 11:17 UTC
CC List: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2252940 (view as bug list)
Environment:
Last Closed: 2023-06-21 15:22:14 UTC
Embargoed:




Links
- GitHub red-hat-storage/ocs-ci pull 7920 (Merged): GSS Automation, Remove the entry of removed node from Storagecluster under Node Topology (last updated 2023-07-26 06:28:24 UTC)
- GitHub red-hat-storage/ocs-operator pull 1973 (Merged): Construct the topology map instead of just adding to the map (last updated 2023-12-19 07:29:33 UTC)
- GitHub red-hat-storage/ocs-operator pull 1990 (Merged): Bug 2102304: [release-4.13] Construct the topology map instead of just adding to the map (last updated 2023-12-19 07:29:36 UTC)
- Red Hat Product Errata RHBA-2023:3742 (last updated 2023-06-21 15:23:07 UTC)

Description Priya Pandey 2022-06-29 15:38:05 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- The customer removed an OCS node from the cluster and performed the steps to
  replace the node.

- The entry of the removed node is still visible in the storagecluster.

------------------------------------------

oc get StorageCluster -o yaml | grep nodeTopologies -A 11
    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - dk1osn1019.eva.danskenet.com
        - dk1osn101a.eva.danskenet.com
        - dk1osn1018.eva.danskenet.com   -------> Removed node
        - dk1osn1010.eva.danskenet.com
------------------------------------------

- We tried to remove the entries from the storagecluster manually, but they were reconciled back.

- We need assistance on how to remove the entry from the storagecluster; a sketch of the attempted edit follows below.
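
For reference, a minimal sketch of the kind of manual edit that gets reconciled away, assuming the default StorageCluster name ocs-storagecluster in the openshift-storage namespace (adjust both for your environment):

# Open the StorageCluster and try to delete the stale hostname from
# status.nodeTopologies.labels by hand.
oc edit storagecluster ocs-storagecluster -n openshift-storage

# The edit does not stick: on its next pass the ocs-operator reconciler
# rebuilds the status, and before the 4.13 fix it only ever added entries
# to the topology map, so the removed node's entry reappears.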

Version of all relevant components (if applicable):

v4.8


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

- No Impact in the storagecluster.


Is there any workaround available to the best of your knowledge?

N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Can this issue reproducible?

Yes

Can this issue reproduce from the UI?
No

If this is a regression, please provide more details to justify this:
N/A


Steps to Reproduce:
1. Remove an existing node and add a new node to the cluster.
2. Follow the node replacement procedure.
3. Try to remove the old node's entry from the storagecluster (a command sketch follows below).
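
A minimal command sketch of these steps, assuming node names compute-0 (old) and compute-3 (new) as in the verification run in comment 19:

# Steps 1-2: drain and delete the old node, then label the replacement node.
oc adm cordon compute-0
oc adm drain compute-0 --force --delete-emptydir-data=true --ignore-daemonsets
oc delete node compute-0
oc label node compute-3 cluster.ocs.openshift.io/openshift-storage=""

# Step 3: inspect the topology map; before the fix, any manual removal of the
# old node's entry is reconciled back.
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq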


Actual results:

- The removed node's topology entry is reconciled back into the storagecluster after the change.


Expected results:

- The removed node's topology entry should not be reconciled back after the node is replaced; it should be dropped from the storagecluster.

Additional info:
In the next steps

Comment 17 avdhoot 2023-05-29 09:33:11 UTC
@mparida 

Can you help me out with which platform I should test this BZ on: AWS, VMware, or bare metal?

If it is platform-independent, is it OK to test it on VMware?

Comment 18 Malay Kumar parida 2023-05-29 10:04:28 UTC
This is platform-independent; it can be tested on any platform.

Comment 19 Oded 2023-06-13 09:40:17 UTC
Hi,
I have 2 questions:
1. Is the rack label created automatically when a new node is labeled with OCS? [based on the replaced node]
2. The replaced node is automatically deleted from the storagecluster. Is this the expected behavior, or do we need to delete it manually?

SetUp:
ODF Version: 4.13.0-218
OCP Version: 4.13.0-0.nightly-2023-06-09-152551
Platform: vSphere


Test Process:
1. Install OCP 4.13.
2. Install ODF 4.13.
3. Install the storagecluster and label 3 nodes with OCS.
4. Check Ceph status:
sh-5.1$ ceph -s
  cluster:
    id:     00cabf1d-0951-459f-8388-2cac249d5851
    health: HEALTH_OK
5. Check the storagecluster:
$ oc get storageclusters.ocs.openshift.io 
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   11m   Ready              2023-06-13T08:57:28Z   4.13.0

    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - compute-0
        - compute-1
        - compute-2
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack2


6. Delete compute-0:
$ oc get nodes --show-labels
NAME              STATUS   ROLES                  AGE   VERSION           LABELS
compute-0         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0
compute-1         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack1
compute-2         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack2
compute-3         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-3,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos
control-plane-0   Ready    control-plane,master   18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=control-plane-0,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,node.openshift.io/os_id=rhcos


oviner:auth$ oc adm cordon compute-0
oviner:auth$ oc adm drain compute-0 --force --delete-emptydir-data=true --ignore-daemonsets
oviner:auth$ oc delete nodes compute-0
node "compute-0" deleted

7. Apply the OpenShift Data Foundation label to the new node:
$ oc label node compute-3 cluster.ocs.openshift.io/openshift-storage=""
node/compute-3 labeled

$ oc get nodes compute-3 --show-labels 
NAME        STATUS   ROLES    AGE   VERSION           LABELS
compute-3   Ready    worker   18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-3,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0

8. Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide | grep compute-3
csi-addons-controller-manager-b6f965bdb-hn826                     2/2     Running     1 (14m ago)   25m     10.128.2.49   compute-3   <none>           <none>
csi-cephfsplugin-fsd87                                            2/2     Running     0             23m     10.1.161.91   compute-3   <none>           <none>
csi-cephfsplugin-provisioner-886476949-xddqv                      5/5     Running     0             23m     10.128.2.51   compute-3   <none>           <none>
csi-rbdplugin-fpnwn                                               3/3     Running     0             23m     10.1.161.91   compute-3   <none>           <none>
noobaa-operator-6bbc975866-84x4f                                  1/1     Running     0             5m5s    10.128.2.57   compute-3   <none>           <none>
ocs-operator-655d6b4c4c-bxhfx                                     1/1     Running     1 (14m ago)   27m     10.128.2.47   compute-3   <none>           <none>
odf-operator-controller-manager-79dc8569db-zb2vs                  2/2     Running     0             27m     10.128.2.48   compute-3   <none>           <none>
rook-ceph-crashcollector-compute-3-84fd6c5847-nggb2               1/1     Running     0             2m26s   10.128.2.65   compute-3   <none>           <none>
rook-ceph-exporter-compute-3-69bc7869c9-wg8sm                     1/1     Running     0             2m26s   10.128.2.64   compute-3   <none>           <none>
rook-ceph-mon-c-6cf8fb4ff5-28k74                                  2/2     Running     0             5m5s    10.128.2.63   compute-3   <none>           <none>
rook-ceph-operator-b99c67644-flbtv                                1/1     Running     0             23m     10.128.2.50   compute-3   <none>           <none>
rook-ceph-osd-0-85649955dd-pct9p                                  2/2     Running     0             5m5s    10.128.2.66   compute-3   <none>           <none>
rook-ceph-tools-75bc769bdd-jwc6p                                  1/1     Running     0             13m     10.128.2.55   compute-3   <none>           <none>

9. Check Ceph status:
sh-5.1$ ceph -s
  cluster:
    id:     00cabf1d-0951-459f-8388-2cac249d5851
    health: HEALTH_OK

10. Check the storagecluster:
$ oc get storageclusters.ocs.openshift.io 
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   25m   Ready              2023-06-13T08:57:28Z   4.13.0


    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - compute-1
        - compute-2
        - compute-3
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack2

For more info: https://docs.google.com/document/d/1U9bxOpkCcQdFqwD_C1Vea921r0B72-q3I8Q119x0eso/edit

Comment 20 Malay Kumar parida 2023-06-13 09:52:25 UTC
Hi,
1. Yes, the rack labels are created automatically by OCS, so the behavior you saw, where the new node automatically got a rack label after you applied the OCS label, is expected.

2. Yes, that is the expected behavior after the fix, i.e. the OCS operator should automatically remove the deleted node from the node topology. Based on the behavior you saw, the BZ is good to mark as verified.
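
For a quick check of the automatic rack labeling, one option is to read the label straight off the replacement node (the jsonpath invocation here is an illustrative sketch; the label key topology.rook.io/rack is taken from the node listing in comment 19):

# Print the rack label that OCS assigned to the replacement node.
oc get node compute-3 -o jsonpath='{.metadata.labels.topology\.rook\.io/rack}'
# From step 7 above this prints: rack0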

Comment 21 Oded 2023-06-13 10:20:45 UTC
Moving to the verified state based on https://bugzilla.redhat.com/show_bug.cgi?id=2102304#c20

Comment 22 errata-xmlrpc 2023-06-21 15:22:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

Comment 23 Malay Kumar parida 2023-08-03 20:34:19 UTC
Since the BZ has been fixed only in ODF 4.13, people can still encounter this issue on earlier releases. While dealing with another case I found a workaround for this BZ, so I am pasting it here for reference. The workaround can be used if a customer is on an earlier ODF release.

* Scale down the ocs-operator:
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 0 }]'

* Patch the storagecluster to clear the nodeTopologies labels, so that the map is reconstructed freshly when the ocs-operator comes back (the empty object {} resets the labels map):
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --subresource status --patch '[{ "op": "replace", "path": "/status/nodeTopologies/labels", "value": {} }]'
(if this patch command doesn't work, please upgrade your oc CLI to 4.11 or later)

* Now scale the ocs-operator back up:
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 1 }]'

* Check whether the new node topology map is now the desired one:
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq
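
To double-check that the stale node is gone after the workaround, one option is to list the hostnames in the rebuilt map directly (the jq filter here is an illustrative sketch; dk1osn1018.eva.danskenet.com is the removed node from the original report):

# List the hostnames currently in the topology map; the removed node should
# no longer appear.
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq -r '.labels["kubernetes.io/hostname"][]'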

