Bug 2063301

Summary: Rook can fail to deploy due to startup probe failures on mon canary pods
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Blaine Gardner <brgardne>
Component: rook Assignee: Travis Nielsen <tnielsen>
Status: CLOSED CURRENTRELEASE QA Contact: Vijay Avuthu <vavuthu>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.10 CC: madam, muagarwa, nberry, ocs-bugs, odf-bz-bot, rperiyas, tnielsen
Target Milestone: ---   
Target Release: ODF 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.10.0-210 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-04-21 09:12:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Blaine Gardner 2022-03-11 18:05:51 UTC
Description of problem (please be detailed as possible and provide log
snippets):

In upstream Rook, we have had users identify a problem where Rook can fail to deploy due to an error with a startup probe on mon canary pods. This bug could affect ODF 4.10 installs.

Travis has already completed an upstream PR to fix the issue here: https://github.com/rook/rook/pull/9888


Version of all relevant components (if applicable):

ODF 4.10 is affected. 
ODF/OCS 4.9 and below are NOT affected.


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?

No known workaround.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1: this could show up during any install but is most likely to show up in environments where there are CPU constraints.


Is this issue reproducible?

Yes, but the issue is flaky.


Can this issue reproduce from the UI?

Yes.


If this is a regression, please provide more details to justify this:

Issue affects install. I believe this makes it a regression.


Steps to Reproduce:

Install ODF (Rook), most commonly in environments with low CPU availability.


Actual results:

Rook can fail to install because Kubernetes shuts down the mon canary pods before Rook can start the Ceph mons.


Expected results:

Canary pods should not be shut down by Kubernetes.


Additional info:

Comment 5 Blaine Gardner 2022-03-14 15:09:04 UTC
Unfortunately, we don't have a repro even upstream, but we have had reports of users with the issue, and we identified the root cause. 

The best way to "verify" the fix is to get the details of `rook-ceph-mon-X-canary` pods when they come online and ensure that there isn't a startup probe on any of the containers. It's probably also good to make sure there isn't a readiness or liveness probe as well.
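The check described above could be scripted along these lines. This is only a minimal sketch, not part of the fix: it assumes the pod list has been exported as JSON (for example with `oc get pods -o json` in the namespace where Rook runs), and the helper name `find_probes` is hypothetical. It reports every startup, readiness, or liveness probe found on a container of a `rook-ceph-mon-X-canary` pod, so an empty result is what we want to see.

```python
# Hypothetical helper: scan an exported pod list for probes on mon canary pods.
PROBE_FIELDS = ("startupProbe", "readinessProbe", "livenessProbe")


def find_probes(pod_list):
    """Return (pod, container, probe field) for each probe set on a mon canary pod.

    `pod_list` is the parsed JSON of a Kubernetes PodList, i.e. a dict
    with an "items" list of pod objects.
    """
    findings = []
    for pod in pod_list.get("items") or []:
        pod_name = pod.get("metadata", {}).get("name", "")
        # Assumption: canary pod names look like rook-ceph-mon-<id>-canary-<suffix>.
        if not (pod_name.startswith("rook-ceph-mon-") and "-canary" in pod_name):
            continue
        spec = pod.get("spec", {})
        containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
        for container in containers:
            for field in PROBE_FIELDS:
                if container.get(field) is not None:
                    findings.append((pod_name, container.get("name", ""), field))
    return findings
```

To use it, load the exported JSON with `json.load` and pass the resulting dict to `find_probes`. Since canary pods are short-lived, the pod list has to be captured while they are still running.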

Comment 6 Mudit Agarwal 2022-03-15 07:59:41 UTC
Travis, please backport it to 4.10

Comment 7 Vijay Avuthu 2022-04-05 16:59:13 UTC
Verified with ocs-registry:4.10.0-217

Job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/11489/console

In the mon logs, I didn't see any readiness or liveness probe.

Also, I didn't see any mon-related deployment issues during pipeline executions.

Comment 8 Ramakrishnan Periyasamy 2022-08-17 09:34:37 UTC
There are no consistent steps to reproduce the issue. Hence, not in favor of automation.