Description of problem (please be as detailed as possible and provide log snippets):

IBM/Netezza performed a MachineConfig (MC) update in their environment, and this action brought down OCS.

Scenario: Once OCP applied the NoSchedule taint to an OCS node, a drain of pods was triggered. The OSD PDB prevented the node drain in this instance. Since the pods never finished draining, MC didn't issue a reboot; nor did the customer. Because the health check timeout had lapsed at this point, the rook operator went to work and introduced a 4th mon member.

Version of all relevant components (if applicable): 4.6.5

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Yes. Recovery requires understanding of the components and manual intervention.

Is there any workaround available to the best of your knowledge?

Delete the mgr pod, allowing a new mgr pod to be spawned, which in turn leads to Ceph returning to a healthy state and the PDBs returning to normal. The customer is writing automation to work around the bug by checking the mon quorum before starting a MachineConfig, and deleting the mgr pod if it is stuck in Init at any phase of the MachineConfig.

Can this issue be reproduced? Yes.

If this is a regression, please provide more details to justify this:
https://bugzilla.redhat.com/show_bug.cgi?id=1955831

Steps to Reproduce:
Intermittent issue

Actual results:
Once we deleted the PDB for rack0, the MachineConfig passed; however, a new PDB was created for rack1 at that time, and the rack2 PDB remained as well. The MachineConfig at that point showed updating=false.
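For reference, a minimal sketch of that manual workaround, assuming a default OCS install in the openshift-storage namespace with the rook-ceph-tools deployment enabled and the standard app=rook-ceph-mgr label (all of these are assumptions, not taken from the customer environment):

  # Check mon quorum before kicking off the MachineConfig update
  # (namespace and rook-ceph-tools deployment are assumed defaults).
  oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph quorum_status -f json-pretty

  # If the mgr pod gets stuck in Init during any phase of the MachineConfig,
  # delete it so a replacement mgr pod is spawned and Ceph/PDBs can recover.
  oc -n openshift-storage delete pod -l app=rook-ceph-mgr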
Can you please share the must-gather logs for the issue? Thanks.
(In reply to akretzsc from comment #0)

(trying to understand the scenario)

> Scenario: Once OCP applied NoSchedule taint to OCS node, a drain of pods was
> triggered. The OSD pdb prevented node drain in this instance.

The OSD PDB would generally prevent the drain if another OSD in the same failure domain was drained previously and never came back up. Was that the case?

> Since the pods never finished draining, MC didn't issue a reboot; nor did the
> customer.
There are a number of issues happening in this BZ that don't seem related:

- If mon quorum is lost, force deleting the mgr pod would not restore quorum.
- If mon quorum is lost, the operator wouldn't be trying to fail over the mon (and add a 4th one). Quorum is required for that operation.
- PDBs for the OSDs aren't related to the mon quorum or the mgr pod availability. The OSD PDBs would only affect the OSD pod drains.
- If the mgr pod is down, it shouldn't affect the general data path, only certain operations such as creating new PVs.
- If a node is forcefully shut down, the operator will attempt to force delete the pods such as the mgr to allow them to start on another node. Is the node completely unavailable, or is the node still responding?
- Does it help to check for mon quorum before starting the node drain? That also doesn't seem related to the mgr pod stuck in pending.
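To help tell these failure modes apart during a stuck drain, the OSD PDBs, node state, and mon quorum can be checked independently; a rough sketch, again assuming the default openshift-storage namespace and the rook-ceph-tools deployment:

  # The OSD PDBs listed here only gate OSD pod drains; they say nothing
  # about mon quorum or mgr availability.
  oc -n openshift-storage get pdb

  # Check whether the drained node is still responding or fully unavailable.
  oc get nodes -o wide

  # Check mon quorum and overall cluster health from the toolbox.
  oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status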
Spent some time looking into the shared logs in comment #7. They only show information about the drain events that happened on `rack1`, `rack2` and `rack3`. All of these drain events went well. Drain events on `rack0` are not there, probably because the operator got restarted. The mon quorum looks OK as well: the mons got failed over successfully after the defined wait period and formed quorum.
Per discussion with Sam, the scenario to investigate is specifically that the mgr pod sometimes gets stuck in the init state, and the only way to restore cluster health is by restarting the mgr pod.

In 4.6 (and 4.7, but removed in 4.8), there is an init container on the mgr pod that calls a "ceph config set" command for the prometheus endpoint, and that command is getting stuck. See the command generated in this method:
https://github.com/openshift/rook/blob/release-4.6/pkg/operator/ceph/cluster/mgr/spec.go#L167-L177

My working theory is that mon quorum is temporarily down when this happens, related to the node drain, and the ceph command then gets stuck and doesn't time out or retry. Adding a timeout on the command will likely allow it to go ahead and fail sooner, after which the pod will restart and try again. I'm going to attempt a repro with increased logging to see if we can track down the cause and whether that actually fixes it.
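The command that init container runs is roughly of the following form; this is an illustrative sketch only (the exact config key, mgr daemon id, and the POD_IP variable below are assumptions, not copied from spec.go). Without mon quorum, a mon-backed command like this can sit and wait, which is what leaves the pod stuck in Init:

  # Illustrative only: a mon-backed "ceph config set" for the prometheus
  # server address, similar in shape to what the 4.6/4.7 mgr init container
  # runs. With no mon quorum and no timeout, this call hangs and the init
  # container never completes.
  ceph config set mgr.a mgr/prometheus/server_addr "${POD_IP}"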
Results of preliminary testing...

If mon quorum is lost, the mgr gets stuck in init as expected after a pod restart. Then, after mon quorum is restored, the mgr finishes its startup sequence just fine within a few seconds of the mons coming back into quorum. The mgr pod did not continue to hang in the init state as reported in this BZ.

With --connect-timeout=15 added to the init container, the pod tries restarting again after 15 seconds of failing to connect to the mons, then follows the exponential backoff for pod CrashLoopBackOff. So if the mgr is indeed getting stuck on the mon connection, the connect-timeout param could help the mgr pod get unstuck from the init container and restart automatically. But JC and I are still seeing if we can get a better repro of the reported issue to confirm...
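For concreteness, the two timeout approaches under discussion look roughly like this (illustrative command shape only; the config key and daemon id are the same assumptions as in the sketch above):

  # Rely on the ceph CLI's own connection timeout: only fires on connection
  # failures the client detects, so other hangs can still block the init container.
  ceph --connect-timeout=15 config set mgr.a mgr/prometheus/server_addr "${POD_IP}"

  # Wrap the whole command with coreutils timeout (the alternative compared in
  # the next comment): any kind of hang fails the init container, so the pod
  # can restart and retry.
  timeout 15 ceph config set mgr.a mgr/prometheus/server_addr "${POD_IP}"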
Since this is an intermittent issue, I created a script that would do the following 100 times:

1. Scale down two mons (causing loss of quorum).
2. Restart the mgr pod (to see if the mgr init container hangs during init).
3. Scale the two mons back up (to restore quorum).
4. Watch to see if the mgr starts successfully. If the mgr doesn't start after several minutes, fail the test and return to step 1.

I tested three configurations before finding a reliable fix (a simplified sketch of the test loop is below).

First, I tested with no change and found about a 15% failure rate:
- RESULTS after 72 tries: SUCCESS: 61, FAILURES: 11

Second, I added --connect-timeout=20 to the ceph commands in the init containers and found an improvement, to about an 8% failure rate:
- RESULTS after 100 tries: SUCCESS: 92, FAILURES: 8

Third, I removed the --connect-timeout from the ceph command and instead used a bash timeout command, which times out after any failure instead of only connection failures detected by the ceph client:
- RESULTS after 100 tries: SUCCESS: 100, FAILURES: 0

Thus, let's use a bash timeout on the init containers for reliable mgr startup.

Again, the init container was removed in 4.8, so this only needs to be considered for 4.6 and 4.7. The fix is low risk and will improve reliability for IBM, so I recommend we take it for those two releases.
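The loop the script follows is roughly the sketch below. This is a simplified illustration, not the attached script itself; the namespace, mon deployment names, and mgr label are assumed rook-ceph defaults:

  #!/usr/bin/env bash
  # Sketch of the repro loop: break mon quorum, bounce the mgr,
  # restore quorum, then check whether the mgr comes back up on its own.
  TRIES=100            # iteration count (the attached script exposes this as "tries")
  NS=openshift-storage # assumed namespace

  for i in $(seq 1 "$TRIES"); do
    # 1. Scale down two mons to lose quorum (deployment names are assumed defaults).
    oc -n "$NS" scale deploy rook-ceph-mon-b rook-ceph-mon-c --replicas=0

    # 2. Restart the mgr pod so its init container runs without quorum.
    oc -n "$NS" delete pod -l app=rook-ceph-mgr

    # 3. Scale the two mons back up to restore quorum.
    oc -n "$NS" scale deploy rook-ceph-mon-b rook-ceph-mon-c --replicas=1

    # 4. Wait several minutes for the mgr pod to become Ready; count a failure if it doesn't.
    if oc -n "$NS" wait pod -l app=rook-ceph-mgr --for=condition=Ready --timeout=300s; then
      echo "try $i: SUCCESS"
    else
      echo "try $i: FAILURE"
    fi
  done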
Created attachment 1824032 [details]
Script and output files for testing mgr restart

See the attached restart-mgr.zip for the test script that automated testing for mgr restart reliability, and the output files for the three separate tests.
Hi Travis,

To verify this BZ, do we need to use the steps mentioned in comment 17? I see that for the 3rd configuration you mentioned the success rate is 100% with no failures, so how do we repro this issue, or what would the exact steps be if we try to automate this scenario? Could you please suggest?
Yes, to verify this, the steps in comment 17 describe how it was repro'd, and the script I used to validate my fix is attached in comment 18. So you could automate something similar to the script in comment 18, then validate that the mgr always starts.
Please add doc text
Doc text added
I tested the BZ with the following steps:

1. Create a new cluster with the configuration: vSphere dynamic cluster, OCP 4.6, OCS 4.6.8.
2. Run the script in https://bugzilla.redhat.com/show_bug.cgi?id=1990031#c18, but with the value of "tries" changed to 30, which I think suffices in this case.

I ran the script twice and it succeeded in all 30 tries both times. I will add the file with the script results.

Link to Jenkins job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/6620/

Versions:

OCP version:
Client Version: 4.9.0-0.nightly-2021-10-08-232649
Server Version: 4.6.0-0.nightly-2021-10-11-122011
Kubernetes Version: v1.19.14+fcff70a

OCS version:
ocs-operator.v4.6.8   OpenShift Container Storage   4.6.8   ocs-operator.v4.6.7   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-10-11-122011   True        False         19h     Cluster version is 4.6.0-0.nightly-2021-10-11-122011

Rook version:
rook: 4.6-109.a684974.release_4.6
go: go1.15.14

Ceph version:
ceph version 14.2.11-199.el8cp (f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)
According to the above results, can we move the BZ to Verified?
Yes, sounds good to move to Verified, thanks! That should be more than enough, with 60 tries at a 100% success rate.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.6.8 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4015
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days