Bug 1807632 - the readiness probe for etcd should reflect actual readiness of etcd on that pod
Summary: the readiness probe for etcd should reflect actual readiness of etcd on that pod
Keywords:
Status: CLOSED DUPLICATE of bug 1906484
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.5
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Dan Mace
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1899033 1899034 1899036
Reported: 2020-02-26 19:03 UTC by Alay Patel
Modified: 2024-03-25 15:42 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1899033 1899034 1899036
Environment:
Last Closed: 2021-01-18 21:04:57 UTC
Target Upstream Version:
Embargoed:



Description Alay Patel 2020-02-26 19:03:18 UTC
Description of problem:

The readiness probe for the static pods laid down by cluster-etcd-operator should pass only if etcd has started serving on that node.

Comment 4 Sam Batschelet 2020-05-19 12:06:18 UTC
I don't see examples of this failure case. The more complex we make the readiness probe, the more hurdles we add to startup. For now, I think simple is better. Lowering to low priority; if we hit examples of failure, we can revisit.

Comment 5 Michal Fojtik 2020-05-20 10:57:42 UTC
This bug is actively worked on.

Comment 7 Dan Mace 2020-06-09 15:50:48 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1844727 seems like a plausible impact of the current check, although not because of the port check per se, but because of the use of an exec pod generally.

In https://bugzilla.redhat.com/show_bug.cgi?id=1844727#c7 I propose we try and introduce a TCP probe that relies on the /health endpoint which would resolve this issue and the byproducts of exec (e.g. #1844727).

We could keep this bug open to track the upstream work and then make https://bugzilla.redhat.com/show_bug.cgi?id=1844727 blocked by this bug. What does everyone think?
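
For readers following along, here is a minimal sketch of what a readiness check based on /health could look like. It is not the operator's code. It assumes the endpoint answers HTTP 200 with a JSON body whose "health" field is the string "true" when the member is healthy; the function names healthyFromBody and checkHealth are hypothetical, and a local test server stands in for etcd so the sketch runs on its own. It also ignores TLS and client certificates, which a real probe would have to deal with.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// healthyFromBody reports whether a /health response body signals health.
// Assumption: the body is JSON with a "health" field equal to "true".
func healthyFromBody(body []byte) bool {
	var resp struct {
		Health string `json:"health"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return false
	}
	return resp.Health == "true"
}

// checkHealth performs one HTTP GET against url and applies healthyFromBody.
// Any transport error or non-200 status counts as not ready.
func checkHealth(client *http.Client, url string) bool {
	res, err := client.Get(url)
	if err != nil {
		return false
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return false
	}
	body, err := io.ReadAll(res.Body)
	if err != nil {
		return false
	}
	return healthyFromBody(body)
}

func main() {
	// Stand-in server so the sketch runs without a real etcd member.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"health":"true"}`)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	fmt.Println("ready:", checkHealth(client, srv.URL+"/health"))
}
```

Unlike a port check, this only reports ready when the server actually answers the health request, and it does not need an exec into the container.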

Comment 8 Dan Mace 2020-06-09 17:03:54 UTC
Raised the health probe issue upstream: https://github.com/etcd-io/etcd/issues/11993

Comment 9 Dan Mace 2020-06-17 18:34:14 UTC
Raising the health check endpoint TLS discussion at the next etcd community meeting (tentatively June 25, 2020; agenda is here: https://docs.google.com/document/d/16XEGyPBisZvmmoIHSZzv__LoyOeluC5a4x353CX0SIM/edit).

Comment 10 Dan Mace 2020-07-06 14:15:18 UTC
I raised the issue upstream at the last community meeting and got some indication that my concerns are justified and possible solutions I proposed are reasonable. At this point I think what's left is to choose an approach and try a PR for deeper consideration.

Comment 12 Dan Mace 2020-08-11 13:57:38 UTC
I have buy-in upstream to do the work, but need to build upon https://github.com/etcd-io/etcd/pull/10504. It's not possible for us to get any of it done for 4.6; we have to coordinate an etcd release and upgrade in a future upstream kube release. Moving to 4.7.

Comment 15 Dan Mace 2020-11-19 15:25:04 UTC
We discovered the fix in https://github.com/openshift/cluster-etcd-operator/pull/474 inadvertently exposes etcd metrics insecurely and so can't ship. I'm reverting the change in https://github.com/openshift/cluster-etcd-operator/pull/500 and we'll have to continue pursuing either my original upstream changes or some new solution.

Comment 16 Jim Minter 2020-11-26 21:50:46 UTC
@dmace would it be possible to use a tcpSocket readinessProbe in the interim?  Although not as nice as /health, it would be broadly similar to the lsof and would stop all the execs?

Comment 17 Dan Mace 2020-11-30 13:58:14 UTC
(In reply to Jim Minter from comment #16)
> @dmace would it be possible to use a tcpSocket readinessProbe in
> the interim?  Although not as nice as /health, it would be broadly similar
> to the lsof and would stop all the execs?

Interesting thought. Offhand I can't think of any downsides. I will open a PR to test this and drive a discussion. Thanks for the idea!

Comment 18 Dan Mace 2020-11-30 14:06:15 UTC
I've opened https://github.com/openshift/cluster-etcd-operator/pull/502 to test the socket probe idea. That fix will be associated with https://bugzilla.redhat.com/show_bug.cgi?id=1844727 and not this bug, which is for improving the fidelity of the check itself. If the socket probe resolves the zombie issue, I'll make sure to sort out the bugs so that any report concerned with zombies will be associated with https://bugzilla.redhat.com/show_bug.cgi?id=1844727, and we can carry this bug forward to 4.8 independent of the zombie issues.

I'll report back on this soon.

Comment 19 Sam Batschelet 2021-01-18 21:04:57 UTC

*** This bug has been marked as a duplicate of bug 1906484 ***

