Description of problem (please be as detailed as possible and provide log snippets):

- The customer removed an OCS node from the cluster and performed the node replacement steps.
- The entry for the removed node is still visible in the StorageCluster:

------------------------------------------
oc get StorageCluster -o yaml | grep nodeTopologies -A 11
nodeTopologies:
  labels:
    kubernetes.io/hostname:
    - dk1osn1019.eva.danskenet.com
    - dk1osn101a.eva.danskenet.com
    - dk1osn1018.eva.danskenet.com   -------> Removed node
    - dk1osn1010.eva.danskenet.com
------------------------------------------

- We tried to remove the entry from the StorageCluster, but it got reconciled back.
- We need assistance on how to remove the entry from the StorageCluster.

Version of all relevant components (if applicable):
v4.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
- No impact on the storage cluster.

Is there any workaround available to the best of your knowledge?
N/A

Rate from 1-5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
1. Remove an existing node and add a new node to the cluster.
2. Follow the node replacement procedure.
3. Try to remove the entry from the StorageCluster.

Actual results:
- The node topology entry gets reconciled back after the change.

Expected results:
- The node topology entry should not be reconciled back after the change.

Additional info:
In the next steps
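As a quick offline way to spot the stale entry, the hostnames in the dumped nodeTopologies section can be diffed against the current node list. This is a small sketch; the file path, the sample YAML, and the live-node list are assumptions for illustration (on a live cluster you would feed in `oc get storagecluster -o yaml` and `oc get nodes` output instead):

```shell
# Assumed sample of the nodeTopologies section, as dumped from the StorageCluster
cat > /tmp/nodeTopologies.yaml <<'EOF'
nodeTopologies:
  labels:
    kubernetes.io/hostname:
    - dk1osn1019.eva.danskenet.com
    - dk1osn101a.eva.danskenet.com
    - dk1osn1018.eva.danskenet.com
    - dk1osn1010.eva.danskenet.com
EOF

# Hypothetical list of nodes that are still in the cluster (dk1osn1018 was removed)
live_nodes="dk1osn1019.eva.danskenet.com dk1osn101a.eva.danskenet.com dk1osn1010.eva.danskenet.com"

# Print every hostname in the topology that no longer has a matching live node
grep -oE '[a-z0-9.]+\.eva\.danskenet\.com' /tmp/nodeTopologies.yaml | while read -r host; do
  case " $live_nodes " in
    *" $host "*) ;;                       # still present, nothing to report
    *) echo "stale entry: $host" ;;       # left behind in the StorageCluster
  esac
done
```

With the sample data above this flags `dk1osn1018.eva.danskenet.com`, matching the removed node in the report.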
If it's LSO, did the customer delete the old node from localvolumediscovery and localvolumeset? https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/replacing_nodes/openshift_container_storage_deployed_using_local_storage_devices#replacing_storage_nodes_on_vmware_infrastructure
@mparida Can you help me out with which platform I should test this BZ on (aws, vmware, baremetal)? If it is independent of the platform, is it OK to test it on vmware?
This is platform-independent; it can be tested on any platform.
Hi, I have 2 questions:

1. Is the rack label created automatically when labeling a new node with OCS (based on the replaced node)?
2. The replaced node is automatically deleted from the StorageCluster. Is this the expected behavior, or do we need to delete it manually?

Setup:
ODF Version: 4.13.0-218
OCP Version: 4.13.0-0.nightly-2023-06-09-152551
Platform: vSphere

Test Process:

1. Install OCP 4.13.

2. Install ODF 4.13.

3. Install the StorageCluster and label 3 nodes with OCS.

4. Check Ceph status:

sh-5.1$ ceph -s
  cluster:
    id:     00cabf1d-0951-459f-8388-2cac249d5851
    health: HEALTH_OK

5. Check the StorageCluster:

$ oc get storageclusters.ocs.openshift.io
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   11m   Ready              2023-06-13T08:57:28Z   4.13.0

nodeTopologies:
  labels:
    kubernetes.io/hostname:
    - compute-0
    - compute-1
    - compute-2
    topology.rook.io/rack:
    - rack0
    - rack1
    - rack2

6. Delete compute-0:

$ oc get nodes --show-labels
NAME              STATUS   ROLES                  AGE   VERSION           LABELS
compute-0         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0
compute-1         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack1
compute-2         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack2
compute-3         Ready    worker                 18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-3,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos
control-plane-0   Ready    control-plane,master   18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=control-plane-0,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-16gb.os-unknown,node.openshift.io/os_id=rhcos

oviner:auth$ oc adm cordon compute-0
oviner:auth$ oc adm drain compute-0 --force --delete-emptydir-data=true --ignore-daemonsets
oviner:auth$ oc delete nodes compute-0
node "compute-0" deleted

7. Apply the OpenShift Data Foundation label to the new node:

$ oc label node compute-3 cluster.ocs.openshift.io/openshift-storage=""
node/compute-3 labeled
$ oc get nodes compute-3 --show-labels
NAME        STATUS   ROLES    AGE   VERSION           LABELS
compute-3   Ready    worker   18h   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-3,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=vsphere-vm.cpu-16.mem-64gb.os-unknown,node.openshift.io/os_id=rhcos,topology.rook.io/rack=rack0

8. Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide | grep compute-3
csi-addons-controller-manager-b6f965bdb-hn826         2/2   Running   1 (14m ago)   25m     10.128.2.49   compute-3   <none>   <none>
csi-cephfsplugin-fsd87                                2/2   Running   0             23m     10.1.161.91   compute-3   <none>   <none>
csi-cephfsplugin-provisioner-886476949-xddqv          5/5   Running   0             23m     10.128.2.51   compute-3   <none>   <none>
csi-rbdplugin-fpnwn                                   3/3   Running   0             23m     10.1.161.91   compute-3   <none>   <none>
noobaa-operator-6bbc975866-84x4f                      1/1   Running   0             5m5s    10.128.2.57   compute-3   <none>   <none>
ocs-operator-655d6b4c4c-bxhfx                         1/1   Running   1 (14m ago)   27m     10.128.2.47   compute-3   <none>   <none>
odf-operator-controller-manager-79dc8569db-zb2vs      2/2   Running   0             27m     10.128.2.48   compute-3   <none>   <none>
rook-ceph-crashcollector-compute-3-84fd6c5847-nggb2   1/1   Running   0             2m26s   10.128.2.65   compute-3   <none>   <none>
rook-ceph-exporter-compute-3-69bc7869c9-wg8sm         1/1   Running   0             2m26s   10.128.2.64   compute-3   <none>   <none>
rook-ceph-mon-c-6cf8fb4ff5-28k74                      2/2   Running   0             5m5s    10.128.2.63   compute-3   <none>   <none>
rook-ceph-operator-b99c67644-flbtv                    1/1   Running   0             23m     10.128.2.50   compute-3   <none>   <none>
rook-ceph-osd-0-85649955dd-pct9p                      2/2   Running   0             5m5s    10.128.2.66   compute-3   <none>   <none>
rook-ceph-tools-75bc769bdd-jwc6p                      1/1   Running   0             13m     10.128.2.55   compute-3   <none>   <none>

9. Check Ceph status:

sh-5.1$ ceph -s
  cluster:
    id:     00cabf1d-0951-459f-8388-2cac249d5851
    health: HEALTH_OK

10. Check the StorageCluster:

$ oc get storageclusters.ocs.openshift.io
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   25m   Ready              2023-06-13T08:57:28Z   4.13.0

nodeTopologies:
  labels:
    kubernetes.io/hostname:
    - compute-1
    - compute-2
    - compute-3
    topology.rook.io/rack:
    - rack0
    - rack1
    - rack2

For more info: https://docs.google.com/document/d/1U9bxOpkCcQdFqwD_C1Vea921r0B72-q3I8Q119x0eso/edit
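To make the swap explicit, the before/after hostname lists from steps 5 and 10 can be diffed locally. This is a minimal sketch using the lists copied from the output above (the /tmp paths are illustrative); `comm` requires sorted input, which these lists already are:

```shell
# Hostname lists from step 5 (before replacement) and step 10 (after)
printf '%s\n' compute-0 compute-1 compute-2 > /tmp/before.txt
printf '%s\n' compute-1 compute-2 compute-3 > /tmp/after.txt

# Lines only in "before" = node that was removed
comm -23 /tmp/before.txt /tmp/after.txt   # prints: compute-0

# Lines only in "after" = node that was added
comm -13 /tmp/before.txt /tmp/after.txt   # prints: compute-3
```

This confirms that compute-0 was removed from nodeTopologies and compute-3 was added, i.e. the removed node was not left behind.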
Hi,

1. Yes, the rack labels are created automatically by OCS, so the behavior you saw, where the new node was automatically given a rack label after you labeled it with OCS, is expected.
2. Yes, that is the expected behavior after the fix, i.e. the OCS operator should automatically remove the deleted node from the node topology.

So based on the behavior you saw, it is good to mark the BZ as verified.
Moving to the verified state based on https://bugzilla.redhat.com/show_bug.cgi?id=2102304#c20
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742
Since the BZ has been fixed only in ODF 4.13, people can still encounter this issue on earlier releases. While dealing with another case I found a workaround for the BZ, so I am pasting it here for reference. The workaround can be used if a customer is on an earlier ODF release.

* Scale down the ocs-operator:

oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 0 }]'

* Patch the StorageCluster to clear the nodeTopologies field so that it is reconstructed freshly when the ocs-operator comes back:

oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --subresource status --patch '[{ "op": "replace", "path": "/status/nodeTopologies/labels", "value": {} }]'

(If this patch command does not work, please upgrade your oc CLI to 4.11.)

* Now scale the ocs-operator back up:

oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 1 }]'

* Check whether the new nodeTopologyMap is now the desired one:

oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq
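Since a malformed patch payload is an easy way for the workaround to fail silently, the JSON documents can be sanity-checked locally before being handed to `oc patch`. A small sketch (the /tmp file names are illustrative, the payloads are the ones used in the workaround above):

```shell
# Write the two JSON Patch payloads to files so they can be validated
cat > /tmp/scale-down.json <<'EOF'
[{ "op": "replace", "path": "/spec/replicas", "value": 0 }]
EOF
cat > /tmp/clear-topologies.json <<'EOF'
[{ "op": "replace", "path": "/status/nodeTopologies/labels", "value": {} }]
EOF

# json.tool exits non-zero on malformed JSON, so only valid files are reported;
# note the empty object {} as the value, which clears the labels map
for f in /tmp/scale-down.json /tmp/clear-topologies.json; do
  python3 -m json.tool "$f" > /dev/null && echo "valid: $f"
done
```

The same files can then be passed to `oc patch` with `--patch-file` instead of an inline `--patch` string, which avoids shell quoting mistakes.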