Bug 2208371

Summary: Node That No Longer Exists Still in storagecluster.yaml Cannot Edit/Remove
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Craig Wayman <crwayman>
Component: ocs-operator
Assignee: Malay Kumar parida <mparida>
Status: CLOSED CURRENTRELEASE
QA Contact: Elad <ebenahar>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.9
CC: hnallurv, mparida, muagarwa, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-05 05:28:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Craig Wayman 2023-05-18 18:11:13 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

  This case has been open for quite a while, so to provide some background context: the customer forcefully removed an ODF node instead of performing a node replacement in accordance with the ODF product documentation. This caused issues in ODF, and a node needed to be added back and labeled. Long story short, everything is back to normal/healthy and functioning properly.

  This bug is being opened because the node the customer forcefully removed is still being reported in the storagecluster.yaml. Although the entry is only under status:, if this case has issues in the future, the stale entry could lead support to investigate a node that no longer exists as a possible delta, sending them down a troubleshooting path they shouldn't be going down.

Version of all relevant components (if applicable):

OCP                              4.10.55
NooBaa Operator                  4.9.14
OpenShift Container Storage      4.9.14
OpenShift Data Foundation        4.9.14

Ceph Versions:

{
    "mon": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 3
    },
    "mds": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)": 10
    }
}


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

  No, this is a non-urgent BZ.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Can this issue be reproducible?
Yes, forcefully removing an ODF node without performing the steps in the node replacement ODF doc will cause this issue


Additional info:

  There have been a couple of ocs-tech-list emails about this issue. There were also many sync sessions with OCS team members, including Ashish, and the result is that nobody has yet answered the question of how to remove this node from the storagecluster.yaml. Editing the storagecluster does not save; the edit is discarded. So it was determined that we'd open a BZ to address this issue.

  It was mentioned to check LSO. We checked LSO, and node newprod-n6z2g-storage-0-8vc9h is completely gone from LSO. It does come up in an LSO must-gather under the LocalVolumeDiscoveryResults entry, but so do all other nodes that previously existed. So most likely, LSO isn't the issue.

  We tried to remove the labels, but that didn't work because the node no longer exists.

  FYI, the customer did not follow this documentation when they removed the node:
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.9/html-single/replacing_nodes/index#replacing-an-operational-node-using-local-storage-devices_bm-upi-failed

  All logs/must-gathers have been yanked in supportshell; however, if you'd like, I can upload any specific logs/must-gathers to gDrive and share them to make it easier. Just let me know.

Comment 6 Malay Kumar parida 2023-05-31 06:28:27 UTC
Hi Craig,
As it's a field in the status, we cannot edit it, even after scaling down the operator. The root cause is that when constructing the node topology field, the operator simply appends any new nodes to the existing list in the status; it never removes entries. So even after a node has been removed, it just continues to be listed there. In 4.13 we changed this behavior to fix the issue, but I couldn't think of a workaround for past versions. So we really don't have a workaround for this issue, as of now at least.
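The effect of that append-only behavior can be illustrated offline. Below is a minimal sketch using jq on a simplified copy of status.nodeTopologies.labels (the real CRD carries more topology keys, and the "-replacement" node name here is hypothetical; only the 8vc9h node name comes from this case):

```shell
# Simplified nodeTopologies.labels as the operator leaves it after a forced
# node removal: the old node's hostname is still listed alongside the new one,
# because entries are only ever appended, never pruned.
cat > /tmp/node-topologies.json <<'EOF'
{
  "labels": {
    "kubernetes.io/hostname": [
      "newprod-n6z2g-storage-0-8vc9h",
      "newprod-n6z2g-storage-0-replacement"
    ]
  }
}
EOF

# Both the removed node and its replacement show up in the list:
jq -r '.labels["kubernetes.io/hostname"][]' /tmp/node-topologies.json
```

On a live cluster, the equivalent check would be against `oc get storagecluster ... -o jsonpath='{.status.nodeTopologies}'`.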

Comment 8 Malay Kumar parida 2023-06-05 05:28:21 UTC
Hi Craig, yes, this is correct. I am closing the bug too. In case of any more queries in this regard, feel free to reopen the BZ or reach out to me directly.
Thanks.

Comment 9 Malay Kumar parida 2023-08-03 20:31:24 UTC
Just found a workaround for this BZ while dealing with another case. Pasting it here for reference in case someone needs it.

* Scale down the ocs-operator
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 0 }]'

* Patch the StorageCluster to remove the nodeTopologies labels so the field is reconstructed freshly when the ocs-operator comes back
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --subresource status --patch '[{ "op": "remove", "path": "/status/nodeTopologies/labels" }]'
(if this patch command doesn't work, please upgrade your oc CLI to 4.11 or later; the --subresource flag requires it)

* Now scale the ocs-operator back up
oc patch deployment ocs-operator -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicas", "value": 1 }]'

* Check that the new nodeTopologyMap is now the desired one
oc get storagecluster ocs-storagecluster -n openshift-storage -o=jsonpath='{.status.nodeTopologies}' | jq
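After the scale-up, the rebuilt map should list only nodes that currently exist. The check can be sketched offline against a saved copy of the rebuilt status (simplified shape; the "-replacement" node name is hypothetical, while the 8vc9h name is the forcefully removed node from this case):

```shell
# Simulate the rebuilt nodeTopologies.labels after the workaround: only
# currently existing nodes remain.
cat > /tmp/rebuilt-topologies.json <<'EOF'
{
  "labels": {
    "kubernetes.io/hostname": [
      "newprod-n6z2g-storage-0-replacement"
    ]
  }
}
EOF

# jq's index() returns null when the name is absent, so `jq -e` exits
# non-zero and we take the else branch.
if jq -e '.labels["kubernetes.io/hostname"] | index("newprod-n6z2g-storage-0-8vc9h")' \
     /tmp/rebuilt-topologies.json > /dev/null; then
  echo "stale node still present"
else
  echo "stale node removed"
fi
```

Against a live cluster, the same filter can be applied to the output of the `oc get storagecluster ... -o=jsonpath='{.status.nodeTopologies}'` command above.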