Description of problem (please be as detailed as possible and provide log snippets):

IBM/Netezza performed a MachineConfig (MC) update in their environment, and this action brought down OCS.

Scenario: Once OCP applied the NoSchedule taint to an OCS node, a drain of pods was triggered. The OSD PDB prevented the node drain in this instance. Since the pods never finished draining, MC didn't issue a reboot; nor did the customer. Because the health check timeout had lapsed at this point, the rook operator went to work and introduced a 4th mon member.

Version of all relevant components (if applicable): 4.6.5

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Yes. Recovery requires understanding of the components and manual intervention.

Is there any workaround available to the best of your knowledge?

Delete the mgr pod, allowing a new mgr pod to be spawned, which in turn leads to Ceph returning to a healthy state and the PDBs returning to normal. The customer is writing automation to work around the bug by checking the mon quorum before starting a MachineConfig, and deleting the mgr pod if it is stuck in Init at any phase of the MachineConfig.

Can this issue be reproduced? Yes.

If this is a regression, please provide more details to justify this:
https://bugzilla.redhat.com/show_bug.cgi?id=1955831

Steps to Reproduce:
Intermittent issue

Actual results:
Once we deleted the PDB for rack0, the MachineConfig passed; however, a new PDB was created for rack1 at that time, and the rack2 PDB remained as well. The MachineConfig at that point showed updating=false.
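For reference, a minimal sketch of that manual workaround, assuming a default OCS install in the openshift-storage namespace with the rook-ceph-tools deployment enabled and the standard app=rook-ceph-mgr label (all of these are assumptions, not taken from the customer environment):

  # Check mon quorum before kicking off the MachineConfig update
  # (namespace and rook-ceph-tools deployment are assumed defaults).
  oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph quorum_status -f json-pretty

  # If the mgr pod gets stuck in Init during any phase of the MachineConfig,
  # delete it so a replacement mgr pod is spawned and Ceph/PDBs can recover.
  oc -n openshift-storage delete pod -l app=rook-ceph-mgr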
Can you please share the must-gather logs for the issue? Thanks.
(In reply to akretzsc from comment #0)

(trying to understand the scenario)

> Scenario: Once OCP applied NoSchedule taint to OCS node, a drain of pods was
> triggered. The OSD pdb prevented node drain in this instance.

The OSD PDB would generally prevent the drain if another OSD in the same failure domain was drained previously and never came back up. Was that the case?

> Since the pods never finished draining, MC didn't issue a reboot; nor did the
> customer.
There are a number of issues happening in this BZ that don't seem related:

- If mon quorum is lost, force deleting the mgr pod would not restore quorum.
- If mon quorum is lost, the operator wouldn't be trying to fail over the mon (and add a 4th one). Quorum is required for that operation.
- PDBs for the OSDs aren't related to the mon quorum or the mgr pod availability. The OSD PDBs would only affect the OSD pod drains.
- If the mgr pod is down, it shouldn't affect the general data path, only certain operations such as creating new PVs.
- If a node is forcefully shut down, the operator will attempt to force delete the pods such as the mgr to allow them to start on another node. Is the node completely unavailable, or is the node still responding?
- Does it help to check for mon quorum before starting the node drain? That also doesn't seem related to the mgr pod stuck in pending.
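To help tell these failure modes apart during a stuck drain, the OSD PDBs, node state, and mon quorum can be checked independently; a rough sketch, again assuming the default openshift-storage namespace and the rook-ceph-tools deployment:

  # The OSD PDBs listed here only gate OSD pod drains; they say nothing
  # about mon quorum or mgr availability.
  oc -n openshift-storage get pdb

  # Check whether the drained node is still responding or fully unavailable.
  oc get nodes -o wide

  # Check mon quorum and overall cluster health from the toolbox.
  oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status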
Spent some time looking into the shared logs in comment #7. They only show information about the drain events that happened on `rack1`, `rack2` and `rack3`. All of these drain events went well. Drain events on `rack0` are not there, probably because the operator got restarted. The mon quorum looks OK as well: the mons got failed over successfully after the defined wait period and formed quorum.
Per discussion with Sam, the scenario to investigate is specifically that the mgr pod sometimes gets stuck in the init state, and the only way to restore cluster health is by restarting the mgr pod.

In 4.6 (and 4.7, but removed in 4.8), there is an init container on the mgr pod that calls a "ceph config set" command for the prometheus endpoint, and that command is getting stuck. See the command generated in this method:
https://github.com/openshift/rook/blob/release-4.6/pkg/operator/ceph/cluster/mgr/spec.go#L167-L177

My working theory is that mon quorum is temporarily down when this happens, related to the node drain, and the ceph command then gets stuck and doesn't time out or retry. Adding a timeout on the command will likely allow it to go ahead and fail sooner, after which the pod will restart and try again. I'm going to attempt a repro with increased logging to see if we can track down the cause and whether that actually fixes it.
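The command that init container runs is roughly of the following form; this is an illustrative sketch only (the exact config key, mgr daemon id, and the POD_IP variable below are assumptions, not copied from spec.go). Without mon quorum, a mon-backed command like this can sit and wait, which is what leaves the pod stuck in Init:

  # Illustrative only: a mon-backed "ceph config set" for the prometheus
  # server address, similar in shape to what the 4.6/4.7 mgr init container
  # runs. With no mon quorum and no timeout, this call hangs and the init
  # container never completes.
  ceph config set mgr.a mgr/prometheus/server_addr "${POD_IP}"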
Results of preliminary testing...

If mon quorum is lost, the mgr gets stuck in init as expected after a pod restart. Then, after mon quorum is restored, the mgr finishes its startup sequence just fine within a few seconds of the mons coming back into quorum. The mgr pod did not continue to hang in the init state as reported in this BZ.

With --connect-timeout=15 added to the init container, the pod tries restarting again after 15 seconds of failing to connect to the mons, then follows the exponential backoff for pod CrashLoopBackOff. So if the mgr is indeed getting stuck on the mon connection, the connect-timeout param could help the mgr pod get unstuck from the init container and restart automatically. But JC and I are still seeing if we can get a better repro of the reported issue to confirm...
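For concreteness, the two timeout approaches under discussion look roughly like this (illustrative command shape only; the config key and daemon id are the same assumptions as in the sketch above):

  # Rely on the ceph CLI's own connection timeout: only fires on connection
  # failures the client detects, so other hangs can still block the init container.
  ceph --connect-timeout=15 config set mgr.a mgr/prometheus/server_addr "${POD_IP}"

  # Wrap the whole command with coreutils timeout (the alternative compared in
  # the next comment): any kind of hang fails the init container, so the pod
  # can restart and retry.
  timeout 15 ceph config set mgr.a mgr/prometheus/server_addr "${POD_IP}"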
Since this is an intermittent issue, I created a script that would do the following 100 times:

1. Scale down two mons (causing loss of quorum).
2. Restart the mgr pod (to see if the mgr init container hangs during init).
3. Scale the two mons back up (to restore quorum).
4. Watch to see if the mgr starts successfully. If the mgr doesn't start after several minutes, fail the test and return to step 1.

I tested three configurations before finding a reliable fix (a simplified sketch of the test loop is below).

First, I tested with no change and found about a 15% failure rate:
- RESULTS after 72 tries: SUCCESS: 61, FAILURES: 11

Second, I added --connect-timeout=20 to the ceph commands in the init containers and found an improvement, to about an 8% failure rate:
- RESULTS after 100 tries: SUCCESS: 92, FAILURES: 8

Third, I removed the --connect-timeout from the ceph command and instead used a bash timeout command, which times out after any failure instead of only connection failures detected by the ceph client:
- RESULTS after 100 tries: SUCCESS: 100, FAILURES: 0

Thus, let's use a bash timeout on the init containers for reliable mgr startup.

Again, the init container was removed in 4.8, so this only needs to be considered for 4.6 and 4.7. The fix is low risk and will improve reliability for IBM, so I recommend we take it for those two releases.
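The loop the script follows is roughly the sketch below. This is a simplified illustration, not the attached script itself; the namespace, mon deployment names, and mgr label are assumed rook-ceph defaults:

  #!/usr/bin/env bash
  # Sketch of the repro loop: break mon quorum, bounce the mgr,
  # restore quorum, then check whether the mgr comes back up on its own.
  TRIES=100            # iteration count (the attached script exposes this as "tries")
  NS=openshift-storage # assumed namespace

  for i in $(seq 1 "$TRIES"); do
    # 1. Scale down two mons to lose quorum (deployment names are assumed defaults).
    oc -n "$NS" scale deploy rook-ceph-mon-b rook-ceph-mon-c --replicas=0

    # 2. Restart the mgr pod so its init container runs without quorum.
    oc -n "$NS" delete pod -l app=rook-ceph-mgr

    # 3. Scale the two mons back up to restore quorum.
    oc -n "$NS" scale deploy rook-ceph-mon-b rook-ceph-mon-c --replicas=1

    # 4. Wait several minutes for the mgr pod to become Ready; count a failure if it doesn't.
    if oc -n "$NS" wait pod -l app=rook-ceph-mgr --for=condition=Ready --timeout=300s; then
      echo "try $i: SUCCESS"
    else
      echo "try $i: FAILURE"
    fi
  done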
Created attachment 1824032 [details]
Script and output files for testing mgr restart

See the attached restart-mgr.zip for the test script that automated testing for mgr restart reliability, and the output files for the three separate tests.
Hi Travis,

To verify this BZ, do we need to use the steps mentioned in comment 17? I see that for the 3rd configuration you mentioned the success rate is 100% with no failures, so how do we repro this issue, or what would the exact steps be if we try to automate this scenario? Could you please suggest?
Yes, to verify this, the steps in comment 17 describe how it was repro'd, and the script I used to validate my fix is attached in comment 18. So you could automate something similar to the script in comment 18, then validate that the mgr always starts.
Please add doc text
Doc text added
I tested the BZ with the following steps:

1. Create a new cluster with the configuration: vSphere dynamic cluster, OCP 4.6, OCS 4.6.8.
2. Run the script in https://bugzilla.redhat.com/show_bug.cgi?id=1990031#c18, but with the value of "tries" changed to 30, which I think suffices in this case.

I ran the script twice and it succeeded in all 30 tries both times. I will add the file with the script results.

Link to Jenkins job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/6620/

Versions:

OCP version:
Client Version: 4.9.0-0.nightly-2021-10-08-232649
Server Version: 4.6.0-0.nightly-2021-10-11-122011
Kubernetes Version: v1.19.14+fcff70a

OCS version:
ocs-operator.v4.6.8   OpenShift Container Storage   4.6.8   ocs-operator.v4.6.7   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-10-11-122011   True        False         19h     Cluster version is 4.6.0-0.nightly-2021-10-11-122011

Rook version:
rook: 4.6-109.a684974.release_4.6
go: go1.15.14

Ceph version:
ceph version 14.2.11-199.el8cp (f5470cbfb5a4dac5925284cef1215f3e4e191a38) nautilus (stable)
According to the above results, can we move the BZ to Verified?
Yes, sounds good to move to Verified, thanks! That should be more than enough, with 60 tries at a 100% success rate.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.6.8 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4015
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days