Bug 2030290

Summary: [GSS] rook does not unset noout flag on failure domain after MC update
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Sonal <sarora>
Component: rook
Assignee: Santosh Pillai <sapillai>
Status: CLOSED CURRENTRELEASE
QA Contact: Mahesh Shetty <mashetty>
Severity: high
Priority: unspecified
Version: 4.7
CC: abhishku, bkunal, hnallurv, madam, mashetty, mhackett, mmuench, muagarwa, ocs-bugs, odf-bz-bot, owasserm, sapillai, shan, s.heijmans, sshome, tnielsen
Keywords: Reopened
Hardware: x86_64
OS: Unspecified
Last Closed: 2022-04-05 13:50:50 UTC
Type: Bug

Description Sonal 2021-12-08 11:34:12 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The customer performed an MCP/kubeletconfig update, which rebooted the OCS nodes. After rack2 was rebooted, the noout flag remained set on it, causing issues when draining the OSDs belonging to rack0.


Version of all relevant components (if applicable):
OCS 4.7.5

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

=> Yes, unable to drain pods and hence the node.

Is there any workaround available to the best of your knowledge?

=> Yes, the customer restarted the rook-ceph-operator, which unset the noout flag.
Alternatively, the flag can be unset manually on the failure domain.
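
For reference, a minimal Go sketch of the manual workaround (Rook itself is written in Go). It assumes the ceph CLI is reachable, e.g. from the rook-ceph-tools pod, and that the cluster supports the "ceph osd unset-group" subcommand; the failure domain name used here is only an example, not taken from the customer data:

package main

import (
	"fmt"
	"os/exec"
)

// unsetNoout clears the noout flag on a single CRUSH bucket (failure domain),
// for example "rack2", by shelling out to the ceph CLI.
func unsetNoout(failureDomain string) error {
	out, err := exec.Command("ceph", "osd", "unset-group", "noout", failureDomain).CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to unset noout on %s: %v: %s", failureDomain, err, out)
	}
	return nil
}

func main() {
	// "rack2" is the failure domain mentioned in this case; adjust as needed.
	if err := unsetNoout("rack2"); err != nil {
		fmt.Println(err)
	}
}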

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

2

Is this issue reproducible?
In the customer's environment

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
NA


Actual results:
The noout flag remained set on the failure domain after the node was back up.

Expected results:
The rook-ceph-operator should unset the noout flag once the node is back up.

Additional info:
In the next private comment

Comment 3 Travis Nielsen 2021-12-08 18:35:46 UTC
Not blocking for 4.9

Comment 8 Santosh Pillai 2021-12-10 07:35:08 UTC
The noout flag is not removed if the PGs are not active+clean, even after the drained node is back and the OSDs on it have started running again. Users should wait for the PGs to be active+clean before draining the next node.

The operator logs suggest that the PGs were not active+clean.

Can you please confirm whether that was the case?
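
To make the rule above concrete, here is a minimal Go sketch (illustrative only, not Rook's actual implementation; the type and function names are hypothetical) of the decision described in this comment: noout on the drained failure domain is only cleared once every PG reports active+clean.

package main

import "fmt"

// pgSummary is a simplified view of PG health: the total number of PGs
// and how many of them are currently active+clean.
type pgSummary struct {
	total       int
	activeClean int
}

// shouldUnsetNoout mirrors the rule described above: the noout flag on the
// drained failure domain is removed only once every PG is active+clean.
func shouldUnsetNoout(pgs pgSummary) bool {
	return pgs.total > 0 && pgs.activeClean == pgs.total
}

func main() {
	// While PGs are still recovering, noout stays set and the next
	// failure domain should not be drained yet.
	fmt.Println(shouldUnsetNoout(pgSummary{total: 192, activeClean: 189})) // false
	fmt.Println(shouldUnsetNoout(pgSummary{total: 192, activeClean: 192})) // true
}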

Comment 17 Santosh Pillai 2022-01-20 13:38:34 UTC
Regarding the customer having to wait for more than an hour for the OSDs to rebalance, I checked with someone working on Ceph. A wait time of more than an hour is quite possible; it depends on factors such as the size of the OSD that was lost, the network, and the workload on the cluster.

Comment 19 Travis Nielsen 2022-01-31 16:21:03 UTC
Mahesh, any more info yet on the repro, or shall we close this issue? It's not clear there is a bug here for Rook.

Comment 23 Travis Nielsen 2022-02-07 16:12:37 UTC
Closing since there is no repro, thanks

Comment 25 Travis Nielsen 2022-02-28 16:28:38 UTC
Moving to 4.11 to continue investigation.

Comment 44 Travis Nielsen 2022-04-04 19:22:43 UTC
Sonal, any more questions from the customer? I believe we have exhausted the analysis on 4.7 and will need a repro on a newer release if further investigation is needed.