Description of problem (please be as detailed as possible and provide log snippets):
Customer performed an MCP/kubeletconfig update which rebooted the OCS nodes. After rebooting rack2, the noout flag remained set on it, causing issues in draining OSDs belonging to rack0.

Version of all relevant components (if applicable):
OCS 4.7.5

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
=> Yes, unable to drain the pods and hence the node.

Is there any workaround available to the best of your knowledge?
=> Yes, the customer restarted the rook-ceph-operator, which unset the noout flag. Alternatively, the flag can be unset manually on the failure domain (a sketch of the commands is included after this report).

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
In the customer's environment

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
NA

Actual results:
The noout flag remained set on the failure domain once the node was back up.

Expected results:
The rook-ceph operator should unset the noout flag once the node is up.

Additional info:
In the next private comment
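For reference, a minimal sketch of the workaround described above. This assumes the cluster namespace is openshift-storage, the standard app=rook-ceph-operator and app=rook-ceph-tools pod labels, and that the affected failure domain (CRUSH bucket) is named rack2; adjust for the actual environment.

  # Option 1: restart the Rook operator so it reconciles and clears the flag
  oc -n openshift-storage delete pod -l app=rook-ceph-operator

  # Option 2: unset the flag on the failure domain manually from the toolbox pod
  oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
  ceph osd unset-group noout rack2   # bucket name "rack2" is an assumption
  ceph health detail                 # confirm the noout warning for the CRUSH node is gone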
Not blocking for 4.9
The noout flag is not removed if the PGs are not active+clean, even after the drained node is back and the OSDs on that node have started running again. Users should wait for the PGs to be active+clean before draining the next node. The operator logs suggest that the PGs were not active+clean; can you please confirm whether that was the case?
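As a rough example (assuming the same openshift-storage namespace and toolbox label as above), the PG state can be checked from the toolbox pod before draining the next node:

  oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
  ceph status
  ceph pg stat   # safe to proceed when all PGs are reported active+clean, e.g. "192 pgs: 192 active+clean"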
Regarding the customer having to wait for more than an hour for the OSDs to rebalance, I checked with someone working on Ceph. A wait of more than an hour is entirely possible; it depends on factors such as the size of the OSD that was lost, the network, and the workload on the cluster.
Mashesh, any more info yet on the repro, or shall we close this issue? It's not clear there is a bug here for Rook.
Closing since there is no repro, thanks
Moving to 4.11 to continue investigation.
Sonal, any more questions from the customer? I believe we have exhausted the analysis on 4.7 and will need a repro on a newer release if further investigation is needed.