Description of problem:
For some reason, the etcd container has many zombie processes that are children of the etcd container's main PID. All of them are "grep" or "lsof", which suggests they come from either the readiness probe or the startup script.
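To illustrate the mechanism: an exec probe spawns grep/lsof children inside the container, and if the process that forked them never calls wait() (and the container's PID 1 does not reap orphans), the exited children linger as zombies. A minimal Linux-only sketch of that failure mode, unrelated to the actual probe script:

```python
import os
import time

# Fork a child and deliberately never wait() for it, the same way an
# exec probe's grep/lsof children are left unreaped.
pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits immediately

time.sleep(0.5)          # give the child time to exit; parent has not reaped it

# Read the child's state from /proc/<pid>/stat; the field after the
# "(comm)" part is the process state, and 'Z' marks a zombie (defunct).
with open(f"/proc/{pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print(state)             # prints: Z

os.waitpid(pid, 0)       # reaping the child removes the zombie entry
```

Once the parent finally calls waitpid() the zombie disappears, which is why restarting the container (or rebooting the node) clears the accumulated entries.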
Version-Release number of selected component (if applicable):
Fresh install on customer environment
Steps to Reproduce:
1. Install the cluster in a non-optimal storage scenario
2. Wait several days
Etcd with zombie child processes
Etcd without zombie child processes
Not sure whether this might need some investigation from the node team. However, before transferring it, it would be good if the etcd team could review:
- Can the lsof+grep readiness probes be replaced by TCP port probes? Exec probes are more costly in terms of container engine performance.
- Are there any measures that could be taken, by improving the startup scripts and/or probes, so that this does not happen even in a faulty scenario?
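On the first question above, a tcpSocket readiness probe involves no in-container process at all: the kubelet opens the connection itself, so there is nothing to leave behind as a zombie. This is only a hedged sketch, not the shipped configuration; the port and timing values are assumptions:

```yaml
# Hypothetical replacement for an exec-based lsof+grep check.
# The kubelet performs the TCP connect itself, so no grep/lsof
# children are ever spawned inside the container.
readinessProbe:
  tcpSocket:
    port: 2379          # etcd client port (assumed)
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 5
```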
upstream issue around exec probes
Here's my take on it so far:
1. The existing exec-based check is weak to begin with (a blind port check)
2. exec probes may carry weird performance implications
3. The zombie issue requires time to diagnose and explain
4. etcd exposes /health behind the metrics endpoint, which is secured by mTLS, defeating our ability to use a TCP probe (as the k8s API and kubelet don't support mTLS for probes)
5. I don't think health endpoints need to be secure (see: router, dns), but metrics do need to be secure
Given all that, I wonder if the way forward is to propose a change to etcd to allow independent configuration of /health so that we can use a TCP probe, bypassing all the oddities of exec while simultaneously providing the best available health check.
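For reference, etcd already has a knob pointing in this direction: the upstream --listen-metrics-urls flag serves /metrics and /health on additional listeners, which can be plain HTTP. A hedged sketch of how that could back an httpGet probe; the port number and URL here are illustrative assumptions, not the eventual fix:

```yaml
# etcd side (flag exists upstream; this URL is an assumption):
#   etcd --listen-metrics-urls=http://0.0.0.0:9978 ...
#
# Probe side: the kubelet queries /health over plain HTTP, so
# neither mTLS nor an exec'd shell is involved.
readinessProbe:
  httpGet:
    path: /health
    port: 9978
    scheme: HTTP
  periodSeconds: 5
  timeoutSeconds: 5
```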
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. The team will revisit this bug next sprint.
The latest updates on the favored resolution can be tracked on the issue which blocks this one: https://bugzilla.redhat.com/show_bug.cgi?id=1807632#c10
We won't be taking any action on this in 4.6 (see https://bugzilla.redhat.com/show_bug.cgi?id=1807632#c12).
*** Bug 1878774 has been marked as a duplicate of this bug. ***
Hi Holger and Wolfgang, in order to help the team to better evaluate BZ 1878774 that you have opened, could you provide some information here about why you would consider "On several nodes I have observed zombie processes caused by etcd (usually 2 or 3)" to be blockers instead of regular bugs?
Bug 1878774 was opened together with the others tracked here in bug 1881153; they were originally filed together and we were asked to split them into separate bugs. That is why they are grouped, and the discussion of which of those bugs are the real blockers, and how they relate to each other, still needs to happen.
Thanks Holger. If I am understanding this correctly, the series of bugs BZ 1878772, 1878774, 1878780, as a whole is a blocker issue, but individual components such as 1878774 may, or may not, be a blocker by itself. I think it would be helpful to identify which bug among the three is the real blocker. Does your team have sufficient information to identify the blocker, or could you provide some information (error message, logs, etc.) regarding 1878774?
Hi @Trevor, I think for the moment, even though 1878774 does not seem to be a blocker, I would agree with identifying all three bugs in the above paragraph as potential blockers until we identify the true blocking issue. Are you aware of any information that we can ask the IBM team to provide in order to better diagnose this error (whether it is to be fixed in 4.6 or 4.7)?
There seems to be a generic issue with exec probes that can lead to zombies. There may also be completely unrelated bugs which can lead to zombies. For each component using exec probes today, they may be able to mitigate by migrating to other probe types (e.g. httpGet). This etcd bug is blocked on bug 1807632, which involves moving etcd off of exec probes. There should also be a bug tracking the upstream Kube issues with exec probes leaking; I'm not sure if that has a tracker yet.
As far as blockers are concerned, I think there are some valid concerns about the overall zombie level, especially if the number of zombie processes grows without bound over the lifetime of the node. But the question would be "do we expect significant resource consumption from $LEAK over the lifespan of the node between reboots?". So say we reboot quarterly with each minor release (ignoring any CVE fixes within z streams, which is a bit crazy, but we're looking for a worst-case scenario). Do we expect "2 to 3" zombie etcd probe processes (it's not clear what the timescale is for those leaks) to build up enough to cause a problem over three months? Are there downsides to a mitigation strategy like "rolling reboot of the control plane every month", or however frequently is needed to manage the leak? Again, it would be nice if the leak were fixed, but to make the blocker argument, I think you need to make the case that mitigation is too onerous.
Thank you for the explanation and describing the mitigation, Trevor.
Also worth noting Holger's latest comment in related bug BZ 1878772, that "The key blocking release bug is the high CPU usage https://bugzilla.redhat.com/show_bug.cgi?id=1878770
There seems to be a connection between etcd and the crash api nodes which I suspect is related to the zombies in this bug here"
We may want to wait until BZ 1878770 is evaluated to see if this bug is (or is not) a blocker.
As an update on BZ 1878772 we have not seen any zombie processes on a high level. Therefore and with the update from Trevor, I do not see this as a blocker anymore.
The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1807632 should address this simultaneously.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.