Bug 2208371 - Node That No Longer Exists Still in storagecluster.yaml Cannot Edit/Remove
Summary: Node That No Longer Exists Still in storagecluster.yaml Cannot Edit/Remove
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Malay Kumar parida
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-18 18:11 UTC by Craig Wayman
Modified: 2023-08-09 17:00 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-05 05:28:21 UTC
Embargoed:



Description Craig Wayman 2023-05-18 18:11:13 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

  This case has been open for quite a while, so some background context: the customer forcefully removed an ODF node instead of performing a node replacement in accordance with the ODF product documentation. This caused issues in ODF, and a node needed to be added back and labeled. Long story short, everything is back to normal/healthy and functioning properly.

  This bug is being opened because the node the customer forcefully removed is still reported in the storagecluster.yaml. Although the entry sits under status:, if this case runs into issues in the future, the stale entry could lead support to treat the node that no longer exists as a possible delta and go down a troubleshooting path they shouldn't be going down.

Version of all relevant components (if applicable):

OCP                              4.10.55
NooBaa Operator                  4.9.14
OpenShift Container Storage      4.9.14
OpenShift Data Foundation        4.9.14

Ceph Versions:

{
    "mon": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 10
    }
}


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

  No, this is a non-urgent BZ.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Can this issue be reproducible?
Yes. Forcefully removing an ODF node without performing the steps in the node replacement ODF doc will cause this issue.


Additional info:

  A couple of ocs-tech-list emails went out about this issue, and there were also many sync sessions with OCS team members, including Ashish. The result is that nobody has yet answered the question of how to remove this node from the storagecluster.yaml. Editing the storagecluster will not save; the edit is discarded. So it was determined that we'd open a BZ to address this issue.

  It was mentioned to check LSO. We checked LSO, and node newprod-n6z2g-storage-0-8vc9h is completely gone from LSO. It does come up in the LocalVolumeDiscoveryResults entry of an LSO must-gather, but so do all other nodes that previously existed, so LSO is most likely not the issue.

  We tried to remove the labels, but that didn't work because the node no longer exists.

  FYI, the customer did not follow this documentation when they removed the node:
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.9/html-single/replacing_nodes/index#replacing-an-operational-node-using-local-storage-devices_bm-upi-failed

  All logs/must-gathers have been removed from supportshell. However, if you'd like, I can upload any specific logs/must-gathers to gDrive and share them to make it easier; just let me know.

Comment 6 Malay Kumar parida 2023-05-31 06:28:27 UTC
Hi Craig, 
As it's a field in the status, we cannot edit it, even after scaling down the operator. The root cause is that while constructing the node topology field, the operator just appends any new nodes to the existing list in the status field. So even after a node has been removed, its entry continues to be there. In 4.13 we changed this behavior to fix it, but I couldn't think of a workaround for past versions, so as of now we really don't have a workaround for this issue.
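
The append-only behavior described above can be illustrated with a quick jq sketch. This is not the operator's actual code, and the node names are made up; it just mimics the same set-union logic applied to the hostname list in status:

```shell
# Illustration only: the status list behaves like a union of what was
# already recorded and the nodes that exist now, so a deleted node's
# hostname never drops out of the result.
old='["node-a","node-b","node-gone"]'   # hostnames already in status
cur='["node-a","node-b","node-new"]'    # nodes that exist in the cluster now
jq -n --argjson old "$old" --argjson cur "$cur" '$old + $cur | unique'
# "node-gone" stays in the merged list even though the node is gone
```

Because the merge never subtracts, only a fresh rebuild of the field (as 4.13 does, or as the workaround below forces) can drop the stale hostname.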

Comment 8 Malay Kumar parida 2023-06-05 05:28:21 UTC
Hi Craig, Yes. This is correct. I am closing the Bug too. In case of any more queries in this regard feel free to reopen the BZ or directly reach out to me.
Thanks.

Comment 9 Malay Kumar parida 2023-08-03 20:31:24 UTC
Just found a workaround for this BZ while dealing with another case. Pasting it here for reference in case someone needs it.

* Scale Down ocs operator
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 0 }]'

* Patch the StorageCluster to clear the nodeTopologies labels field so that it's reconstructed fresh when the ocs operator comes back
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --subresource status --patch '[{ "op": "replace", "path": "/status/nodeTopologies/labels", "value": {} }]'
(if this patch command doesn't work, please upgrade your oc CLI to 4.11 or later; the --subresource flag requires it)

* Now scale the ocs operator back up
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 1 }]'

* Check whether the new nodeTopologyMap is now the desired one
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq
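
As an extra check that the stale hostname is really gone, the map can be filtered with jq. A minimal sketch, assuming the nodeTopologies labels field has the shape shown below; the sample JSON stands in for the live `oc get storagecluster ... -o jsonpath='{.status.nodeTopologies}'` output, and the hostname is the one from this case:

```shell
# Sample stands in for the live output of:
#   oc get storagecluster ocs-storagecluster -n openshift-storage \
#     -o jsonpath='{.status.nodeTopologies}'
topo='{"labels":{"kubernetes.io/hostname":["node-a","node-b","node-new"]}}'

# jq -e exits 0 only when the expression is true, i.e. the stale
# hostname is absent from the hostname list.
echo "$topo" | jq -e '.labels["kubernetes.io/hostname"] | index("newprod-n6z2g-storage-0-8vc9h") == null' \
  && echo "stale entry gone"
```

If the stale hostname were still present, `index()` would return its position instead of null and the check would fail.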

