Bug 2063301 - Rook can fail to deploy due to startup probe failures on mon canary pods
Summary: Rook can fail to deploy due to startup probe failures on mon canary pods
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.10.0
Assignee: Travis Nielsen
QA Contact: Vijay Avuthu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-03-11 18:05 UTC by Blaine Gardner
Modified: 2023-08-09 17:03 UTC

Fixed In Version: 4.10.0-210
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-21 09:12:53 UTC
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | red-hat-storage rook pull 361 | 0 | None | open | Bug 2063301: mon: Disable startup probe on canary pods | 2022-03-15 16:15:08 UTC
Github | rook rook pull 9888 | 0 | None | Merged | mon: Disable startup probe on canary pods | 2022-03-11 18:05:51 UTC

Description Blaine Gardner 2022-03-11 18:05:51 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

In upstream Rook, we have had users identify a problem where Rook can fail to deploy due to an error with a startup probe on mon canary pods. This bug could affect ODF 4.10 installs.

Travis has already completed an upstream PR to fix the issue here: https://github.com/rook/rook/pull/9888


Version of all relevant components (if applicable):

ODF 4.10 is affected. 
ODF/OCS 4.9 and below are NOT affected.


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?

No known workaround.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1: this could show up during any install but is most likely to show up in environments where there are CPU constraints.


Is this issue reproducible?

Yes, but the issue is flaky.


Can this issue reproduce from the UI?

Yes.


If this is a regression, please provide more details to justify this:

Issue affects install. I believe this makes it a regression.


Steps to Reproduce:

Install ODF (Rook), most commonly in environments with low CPU availability.


Actual results:

Rook can fail to install because Kubernetes shuts down the mon canary pods before Rook can start the Ceph mons.


Expected results:

Canary pods should not be shut down by Kubernetes.


Additional info:

Comment 5 Blaine Gardner 2022-03-14 15:09:04 UTC
Unfortunately, we don't have a repro even upstream, but we have had reports of users with the issue, and we identified the root cause. 

The best way to "verify" the fix is to get the details of the `rook-ceph-mon-X-canary` pods when they come online and ensure that there isn't a startup probe on any of the containers. It's probably also good to make sure there isn't a readiness or liveness probe either.
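
The check described above could be scripted roughly as follows. This is only a sketch, not part of the fix: it assumes the pod list has been exported as JSON (for example with `kubectl get pods -o json` or `oc get pods -o json` in the namespace where Rook runs), and the function name `find_canary_probes` and the sample container name `mon` are hypothetical, chosen for illustration.

```python
# Probe fields defined by the Kubernetes container spec.
PROBE_FIELDS = ("startupProbe", "readinessProbe", "livenessProbe")


def find_canary_probes(pod_list):
    """Return (pod, container, probe field) for each probe set on a mon canary pod.

    `pod_list` is the parsed JSON of a Kubernetes pod list (a dict with "items").
    """
    findings = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "")
        # Assumption: canary pods are named rook-ceph-mon-X-canary.
        if not (name.startswith("rook-ceph-mon-") and name.endswith("-canary")):
            continue
        spec = pod.get("spec", {})
        containers = spec.get("containers", []) + spec.get("initContainers", [])
        for container in containers:
            for field in PROBE_FIELDS:
                if field in container:
                    findings.append((name, container.get("name", ""), field))
    return findings


# Hypothetical sample data: one canary pod that still carries a startup probe,
# one without probes, and one regular mon pod that the check ignores.
sample = {
    "items": [
        {"metadata": {"name": "rook-ceph-mon-a-canary"},
         "spec": {"containers": [{"name": "mon", "startupProbe": {}}]}},
        {"metadata": {"name": "rook-ceph-mon-b-canary"},
         "spec": {"containers": [{"name": "mon"}]}},
        {"metadata": {"name": "rook-ceph-mon-c"},
         "spec": {"containers": [{"name": "mon", "livenessProbe": {}}]}},
    ]
}

for pod, container, field in find_canary_probes(sample):
    print(f"{pod}/{container}: {field} is set")
```

An empty result means no canary container has any of the three probes. Because canary pods are short-lived, the pod list would need to be captured while they are online.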

Comment 6 Mudit Agarwal 2022-03-15 07:59:41 UTC
Travis, please backport it to 4.10

Comment 7 Vijay Avuthu 2022-04-05 16:59:13 UTC
Verified with ocs-registry:4.10.0-217

Job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/11489/console

In the mon logs, didn't see any readiness or liveness probe.

Also, didn't see any mon-related deployment issues during pipeline executions.

Comment 8 Ramakrishnan Periyasamy 2022-08-17 09:34:37 UTC
There are no consistent steps to reproduce the issue. Hence, not in favor of automation.

