Description of problem: Static pods laid down by cluster-etcd-operator should only pass if etcd has started serving on that node.
I don't see examples of this failure case. The more complex we make the readiness probe, the more hurdles we add to startup. For now, I think simple is better. Lowering to low priority; if we hit examples of failure we can revisit.
This bug is actively being worked on.
https://bugzilla.redhat.com/show_bug.cgi?id=1844727 seems like a plausible impact of the current check, although not because of the port check per se, but because of the use of an exec probe generally. In https://bugzilla.redhat.com/show_bug.cgi?id=1844727#c7 I propose we try to introduce a TCP probe that relies on the /health endpoint, which would resolve this issue and the byproducts of exec (e.g. #1844727). We could keep this bug open to track the upstream work and then make https://bugzilla.redhat.com/show_bug.cgi?id=1844727 blocked by this bug. What does everyone think?
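For illustration, the kind of kubelet-driven probe being proposed might look like the sketch below. The port and timing values are assumptions for illustration, not the operator's actual manifest; note that etcd serving /health only over the client-cert-protected TLS port is precisely the complication being raised upstream, since kubelet httpGet probes cannot present client certificates.

```yaml
# Hypothetical sketch only -- not the shipped cluster-etcd-operator manifest.
# Assumes etcd serves /health on the standard client port 2379.
readinessProbe:
  httpGet:
    path: /health
    port: 2379
    scheme: HTTPS   # kubelet skips server cert verification and cannot send
                    # client certs, hence the upstream etcd work discussed here
  initialDelaySeconds: 3
  periodSeconds: 5
```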
Raised the health probe issue upstream: https://github.com/etcd-io/etcd/issues/11993
Raising the health check endpoint TLS discussion at the next etcd community meeting (tentatively June 25, 2020; agenda is here: https://docs.google.com/document/d/16XEGyPBisZvmmoIHSZzv__LoyOeluC5a4x353CX0SIM/edit).
I raised the issue upstream at the last community meeting and got some indication that my concerns are justified and possible solutions I proposed are reasonable. At this point I think what's left is to choose an approach and try a PR for deeper consideration.
I have buy-in upstream to do the work, but need to build upon https://github.com/etcd-io/etcd/pull/10504. It's not possible for us to get any of it done for 4.6; we have to coordinate an etcd release and upgrade in a future upstream kube release. Moving to 4.7.
We discovered that the fix in https://github.com/openshift/cluster-etcd-operator/pull/474 inadvertently exposes etcd metrics insecurely and so can't ship. I'm reverting the change in https://github.com/openshift/cluster-etcd-operator/pull/500, and we'll have to continue pursuing either my original upstream changes or some new solution.
@dmace would it be possible to use a tcpSocket readinessProbe in the interim? Although not as nice as /health, it would be broadly similar to the lsof and would stop all the execs?
(In reply to Jim Minter from comment #16)
> @dmace would it be possible to use a tcpSocket readinessProbe in
> the interim? Although not as nice as /health, it would be broadly similar
> to the lsof and would stop all the execs?

Interesting thought. Offhand I can't think of any downsides. I will open a PR to test this and drive a discussion. Thanks for the idea!
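As a sketch of the interim idea Jim suggests (values are assumptions for illustration, not the contents of the eventual PR): replacing the exec-based lsof check with a kubelet TCP probe avoids spawning processes inside the container entirely, at the cost of only verifying that the socket accepts connections.

```yaml
# Hypothetical sketch of a tcpSocket readiness probe for the etcd container.
# Port and timings are illustrative assumptions.
readinessProbe:
  tcpSocket:
    port: 2379        # etcd client port; probe succeeds once etcd is listening
  initialDelaySeconds: 3
  periodSeconds: 5
  failureThreshold: 3
```

This only confirms the listener is up, not that etcd is healthy or serving quorum reads, which is why it is an interim step rather than a replacement for a proper /health check.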
I've opened https://github.com/openshift/cluster-etcd-operator/pull/502 to test the socket probe idea. That fix will be associated with https://bugzilla.redhat.com/show_bug.cgi?id=1844727 and not this bug, which is for improving the fidelity of the check itself. If the socket probe resolves the zombie issue, I'll make sure to sort out the bugs so that any report concerned with zombies will be associated with https://bugzilla.redhat.com/show_bug.cgi?id=1844727, and we can carry this bug forward to 4.8 independent of the zombie issues. I'll report back on this soon.
*** This bug has been marked as a duplicate of bug 1906484 ***