Bug 2102304
| Summary: | [GSS] Remove the entry of removed node from Storagecluster under Node Topology | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Priya Pandey <prpandey> |
| Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida> |
| Status: | CLOSED ERRATA | QA Contact: | Oded <oviner> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.8 | CC: | aivaras.laimikis, ebenahar, hnallurv, mhackett, mparida, muagarwa, ocs-bugs, odf-bz-bot, oviner, sarora, sostapov, tdesala |
| Target Milestone: | --- | ||
| Target Release: | ODF 4.13.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-06-21 15:22:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
If it's LSO, did the customer delete the old node from localvolumediscovery and localvolumeset? https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/replacing_nodes/openshift_container_storage_deployed_using_local_storage_devices#replacing_storage_nodes_on_vmware_infrastructure

@mparida Can you help me out with which platform I should test this BZ on (AWS, VMware, bare metal)? If it is independent of the platform, is it OK to test it on VMware?

This is platform-independent; it can be tested on any platform.

Hi,
I have 2 questions:
1. Is the rack label created automatically when labeling a new node with OCS? [based on the replaced node]
2. The replaced node is automatically deleted from the storagecluster. Is this the expected behavior, or do we need to delete it manually?
SetUp:
ODF Version: 4.13.0-218
OCP Version: 4.13.0-0.nightly-2023-06-09-152551
Platform: Vsphere
Test Process:
1. Install OCP 4.13
2. Install ODF 4.13
3. Install StorageCluster and label 3 nodes with OCS
4. Check ceph status:
sh-5.1$ ceph -s
cluster:
id: 00cabf1d-0951-459f-8388-2cac249d5851
health: HEALTH_OK
5. Check storagecluster:
$ oc get storageclusters.ocs.openshift.io
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 11m Ready 2023-06-13T08:57:28Z 4.13.0
nodeTopologies:
labels:
kubernetes.io/hostname:
- compute-0
- compute-1
- compute-2
topology.rook.io/rack:
- rack0
- rack1
- rack2
6. Delete compute-0:
$ oc get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
compute-0 Ready worker 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0
compute-1 Ready worker 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack1
compute-2 Ready worker 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack2
compute-3 Ready worker 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-3,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos
control-plane-0 Ready control-plane,master 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=control-plane-0,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,node.openshift.io/os_id=rhcos
oviner:auth$ oc adm cordon compute-0
oviner:auth$ oc adm drain compute-0 --force --delete-emptydir-data=true --ignore-daemonsets
oviner:auth$ oc delete nodes compute-0
node "compute-0" deleted
7. Apply the OpenShift Data Foundation label to the new node:
$ oc label node compute-3 cluster.ocs.openshift.io/openshift-storage=""
node/compute-3 labeled
$ oc get nodes compute-3 --show-labels
NAME STATUS ROLES AGE VERSION LABELS
compute-3 Ready worker 18h v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-3,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0
8. Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide | grep compute-3
csi-addons-controller-manager-b6f965bdb-hn826 2/2 Running 1 (14m ago) 25m 10.128.2.49 compute-3 <none> <none>
csi-cephfsplugin-fsd87 2/2 Running 0 23m 10.1.161.91 compute-3 <none> <none>
csi-cephfsplugin-provisioner-886476949-xddqv 5/5 Running 0 23m 10.128.2.51 compute-3 <none> <none>
csi-rbdplugin-fpnwn 3/3 Running 0 23m 10.1.161.91 compute-3 <none> <none>
noobaa-operator-6bbc975866-84x4f 1/1 Running 0 5m5s 10.128.2.57 compute-3 <none> <none>
ocs-operator-655d6b4c4c-bxhfx 1/1 Running 1 (14m ago) 27m 10.128.2.47 compute-3 <none> <none>
odf-operator-controller-manager-79dc8569db-zb2vs 2/2 Running 0 27m 10.128.2.48 compute-3 <none> <none>
rook-ceph-crashcollector-compute-3-84fd6c5847-nggb2 1/1 Running 0 2m26s 10.128.2.65 compute-3 <none> <none>
rook-ceph-exporter-compute-3-69bc7869c9-wg8sm 1/1 Running 0 2m26s 10.128.2.64 compute-3 <none> <none>
rook-ceph-mon-c-6cf8fb4ff5-28k74 2/2 Running 0 5m5s 10.128.2.63 compute-3 <none> <none>
rook-ceph-operator-b99c67644-flbtv 1/1 Running 0 23m 10.128.2.50 compute-3 <none> <none>
rook-ceph-osd-0-85649955dd-pct9p 2/2 Running 0 5m5s 10.128.2.66 compute-3 <none> <none>
rook-ceph-tools-75bc769bdd-jwc6p 1/1 Running 0 13m 10.128.2.55 compute-3 <none> <none>
9. Check ceph status:
sh-5.1$ ceph -s
cluster:
id: 00cabf1d-0951-459f-8388-2cac249d5851
health: HEALTH_OK
10. Check storagecluster:
$ oc get storageclusters.ocs.openshift.io
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 25m Ready 2023-06-13T08:57:28Z 4.13.0
nodeTopologies:
labels:
kubernetes.io/hostname:
- compute-1
- compute-2
- compute-3
topology.rook.io/rack:
- rack0
- rack1
- rack2
For more info: https://docs.google.com/document/d/1U9bxOpkCcQdFqwD_C1Vea921r0B72-q3I8Q119x0eso/edit
Hi,
1. Yes, the rack labels are created automatically by OCS, so the behavior you saw, where the new node was automatically given a rack label after you applied the OCS label, is expected.
2. Yes, that is the expected behavior after the fix, i.e. the OCS operator should automatically remove the removed node from the node topology. The behavior you saw is good to mark the BZ as verified.

Moving to the verified state based on https://bugzilla.redhat.com/show_bug.cgi?id=2102304#c20

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742

Since the BZ has been fixed only in ODF 4.13, people can still encounter this issue on earlier releases. While dealing with another case I found a workaround for the BZ, so I am pasting it here for reference. The workaround can be used if a customer is on an earlier ODF release.
* Scale Down ocs operator
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 0 }]'
* Patch the storagecluster status to clear the nodeTopologies labels, so that the map is reconstructed freshly when the ocs operator comes back
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --subresource status --patch '[{ "op": "replace", "path": "/status/nodeTopologies/labels", "value": {} }]'
(if this patch command doesn't work, please upgrade your oc cli to 4.11)
* Now Scale Up ocs operator
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 1 }]'
* Check that the new nodeTopologies map is now the desired one
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq
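Why clearing the labels map works: once the map is emptied and the operator is scaled back up, the next reconcile repopulates it only from nodes that currently carry the OCS label, so the stale entry never comes back. A simplified Python model of that fresh reconcile (this is an illustration, not the actual ocs-operator code; hostnames and racks are taken from the test run above):

```python
def rebuild_topology(labeled_nodes):
    """Model of the fresh reconcile after the workaround: the map is
    repopulated only from nodes that currently carry the OCS label."""
    return {
        "kubernetes.io/hostname": sorted(n["hostname"] for n in labeled_nodes),
        "topology.rook.io/rack": sorted({n["rack"] for n in labeled_nodes}),
    }

# Nodes as they exist after the replacement (compute-0 is gone,
# compute-3 inherited rack0).
nodes = [
    {"hostname": "compute-1", "rack": "rack1"},
    {"hostname": "compute-2", "rack": "rack2"},
    {"hostname": "compute-3", "rack": "rack0"},
]
print(rebuild_topology(nodes))
```

Because the rebuilt map is derived purely from the live node labels, the removed node cannot reappear in it.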
Description of problem (please be detailed as possible and provide log snippets):

- The customer has removed an OCS node from the cluster and performed the steps to replace the node.
- The entry of the removed node is still visible in the storagecluster:

------------------------------------------
oc get StorageCluster -o yaml | grep nodeTopologies -A 11
nodeTopologies:
  labels:
    kubernetes.io/hostname:
    - dk1osn1019.eva.danskenet.com
    - dk1osn101a.eva.danskenet.com
    - dk1osn1018.eva.danskenet.com   -------> Removed node
    - dk1osn1010.eva.danskenet.com
------------------------------------------

- We tried to remove the entries from the storagecluster, but they got reconciled back.
- Need assistance on how to remove the entry from the storagecluster.

Version of all relevant components (if applicable): v4.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
- No impact on the storagecluster.

Is there any workaround available to the best of your knowledge? N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 1

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI? No

If this is a regression, please provide more details to justify this: N/A

Steps to Reproduce:
1. Remove an existing node and add a new node to the cluster.
2. Follow the node replacement procedure.
3. Try to remove the entry from the storagecluster.

Actual results:
- The node topology entry is reconciled back after changes.

Expected results:
- The node topology entry should not be reconciled back after changes.

Additional info: In the next steps
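The underlying behavior can be sketched as follows (a simplified Python model under the assumption stated in this BZ, not the actual ocs-operator code): before the fix, the reconcile only ever merged newly labeled nodes into the node topology map, so a deleted node's entry was never dropped; the fix rebuilds the map from the currently labeled nodes.

```python
def reconcile_buggy(current_hosts, labeled_hosts):
    """Pre-fix behavior (simplified): entries are only added, never removed."""
    return sorted(set(current_hosts) | set(labeled_hosts))

def reconcile_fixed(current_hosts, labeled_hosts):
    """Post-fix behavior (simplified): the map is rebuilt from live nodes."""
    return sorted(labeled_hosts)

# A node replacement: compute-0 removed, compute-3 added and labeled.
old_map = ["compute-0", "compute-1", "compute-2"]
live = ["compute-1", "compute-2", "compute-3"]

print(reconcile_buggy(old_map, live))  # the stale compute-0 entry survives
print(reconcile_fixed(old_map, live))  # the stale entry is gone
```

This is why manually deleting the entry on 4.8 did not help: the accumulate-only reconcile reinstated it on the next pass.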