Bug 1844727 - Etcd container leaves grep and lsof zombie processes
Summary: Etcd container leaves grep and lsof zombie processes
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.0
Assignee: Dan Mace
QA Contact: ge liu
Duplicates: 1878774
Depends On:
Blocks: 1903353
Reported: 2020-06-06 16:34 UTC by Pablo Alonso Rodriguez
Modified: 2021-02-24 15:13 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The readiness probe uses lsof and grep.
Consequence: Many zombie processes accumulate.
Fix: Replace the lsof+grep readiness probes with TCP port probes.
Result: TCP port probes are less expensive, and no zombie processes are left behind.
Clone Of:
Last Closed: 2021-02-24 15:12:15 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 502 0 None closed Bug 1844727: Use socket readiness probe to avoid generating zombies 2021-02-20 09:19:50 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:13:02 UTC

Description Pablo Alonso Rodriguez 2020-06-06 16:34:29 UTC
Description of problem:

For some reason, the etcd container has many zombie processes that are children of the etcd container's main PID. All of them are "grep" or "lsof", which suggests they come from either the readiness probe or the startup script.
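The zombie children can be spotted from the node with a quick ps filter. A minimal sketch (the PID value is a placeholder, not taken from this report):

```shell
# Hypothetical sketch: list defunct (zombie) children of the etcd container's
# main PID. Substitute the real PID; "1" is only a placeholder here.
ETCD_PID=1
# A STAT field beginning with "Z" marks a zombie; COMMAND would show the
# leftover grep/lsof entries described above.
ps -o pid=,stat=,comm= --ppid "$ETCD_PID" | awk '$2 ~ /^Z/'
```

The pipeline prints one line per zombie child, or nothing when the parent has no defunct children.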

Version-Release number of selected component (if applicable):


How reproducible:

Fresh install on a customer environment

Steps to Reproduce:
1. Install the cluster in a non-optimal storage scenario
2. Wait several days

Actual results:

Etcd with zombie children processes

Expected results:

Etcd without children zombie processes

Additional info:

In comments

Comment 5 Pablo Alonso Rodriguez 2020-06-06 16:42:50 UTC
Not sure whether this might need some investigation from the node team. However, before transferring it, it would be good if the etcd team could review:

- Can the lsof+grep readiness probes be replaced by TCP port probes? Exec probes are more costly in terms of container engine performance.
- Is there any measure that can be taken, by improving the startup scripts and/or probes, so that this would not happen even in a faulty scenario?
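For reference, the suggested TCP port probe would look roughly like this in the pod spec. This is a sketch only; the port is etcd's standard client port, and the timing values are illustrative, not the ones that ultimately shipped:

```yaml
# Hypothetical sketch of a tcpSocket readiness probe replacing an exec
# (lsof+grep) check. The kubelet opens a TCP connection to the port and
# considers the probe successful if the connection is established.
readinessProbe:
  tcpSocket:
    port: 2379        # etcd client port; adjust as appropriate
  initialDelaySeconds: 3
  periodSeconds: 5
  timeoutSeconds: 5
```

Because the kubelet performs the connection itself, no process is forked inside the container, so there is nothing to reap and no zombies can accumulate.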

Comment 6 Sam Batschelet 2020-06-07 02:01:07 UTC
Upstream issue[1] around exec probes:

[1] https://github.com/kubernetes/kubernetes/issues/82440

Comment 7 Dan Mace 2020-06-09 12:53:39 UTC
Here's my take on it so far:

1. The existing exec-based check is weak to begin with (a blind port check)
2. exec probes may carry weird performance implications[1]
3. The zombie issue requires time to diagnose and explain
4. etcd exposes /health behind the metrics endpoint, which is secured by mTLS, defeating our ability to use a TCP probe (as the k8s API and kubelet don't support mTLS for probes)
5. I don't think health endpoints need to be secure (see: router, dns), but metrics do need to be secure

Given all that, I wonder if the way forward is to propose a change to etcd to allow independent configuration of /health so that we can use a TCP probe, bypassing all the oddities of exec while simultaneously providing the best available health check.
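As a point of reference, etcd does expose a `--listen-metrics-urls` flag that serves /health (and /metrics) on additional listener URLs, which is the kind of independent health configuration being discussed. A sketch of how a plaintext health listener could be combined with a kubelet probe (the port is illustrative, not the configuration that shipped):

```shell
# Sketch, not the shipped configuration: serve /health and /metrics on an
# extra plaintext listener so the kubelet can probe it without mTLS, while
# the client port stays fully secured.
etcd --listen-client-urls=https://0.0.0.0:2379 \
     --listen-metrics-urls=http://0.0.0.0:9978   # illustrative port
```

With such a listener in place, either an httpGet probe against /health or a plain TCP probe against the extra port becomes possible without touching the mTLS-protected endpoints.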

[1] https://github.com/kubernetes/kubernetes/issues/82440

Comment 8 Sam Batschelet 2020-06-20 13:03:13 UTC
Iā€™m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. The team will revisit this bug next sprint.

Comment 10 Dan Mace 2020-07-06 14:17:12 UTC
The latest updates on the favored resolution can be tracked on the issue which blocks this one: https://bugzilla.redhat.com/show_bug.cgi?id=1807632#c10

Comment 12 Dan Mace 2020-08-11 18:19:42 UTC
We won't be taking any action on this in 4.6 (see https://bugzilla.redhat.com/show_bug.cgi?id=1807632#c12).

Comment 14 Sam Batschelet 2020-09-14 14:56:29 UTC
*** Bug 1878774 has been marked as a duplicate of this bug. ***

Comment 18 Dan Li 2020-09-24 12:10:44 UTC
Hi Holger and Wolfgang, in order to help the team better evaluate BZ 1878774 that you have opened, could you provide some information here about why you would consider "On several nodes I have observed zombie processes caused by etcd (usually 2 or 3)" to be a blocker instead of a regular bug?

Comment 20 Holger Wolf 2020-09-24 14:15:09 UTC
Hi Dan,

#1878774 was opened in a series with other bugs tracked under #1881153, and they were opened together. We were asked to split the bugs. Having grouped them that way, we now need the discussion to determine which of those bugs are real blockers and how they relate to each other.

Comment 21 Dan Li 2020-09-24 15:34:09 UTC
Thanks Holger. If I am understanding this correctly, the series of bugs BZ 1878772, 1878774, 1878780, as a whole is a blocker issue, but individual components such as 1878774 may, or may not, be a blocker by itself. I think it would be helpful to identify which bug among the three is the real blocker. Does your team have sufficient information to identify the blocker, or could you provide some information (error message, logs, etc.) regarding 1878774? 

Hi @Trevor, I think for the moment, even though 1878774 does not seem to be a blocker, I would agree with identifying all three bugs in the above paragraph as potential blockers until we identify the true blocking issue. Are you aware of any information that we can ask the IBM team to provide in order to better diagnose this error (whether it is to be fixed in 4.6 or 4.7)?

Comment 22 W. Trevor King 2020-09-24 16:54:08 UTC
There seems to be a generic issue with exec probes that can lead to zombies.  There may also be completely unrelated bugs which can lead to zombies.  For each component using exec probes today, they may be able to mitigate by migrating to other probe types (e.g. httpGet).  This etcd bug is blocked on bug 1807632, which involves moving etcd off of exec probes.  There should also be a bug tracking the upstream Kube issues with exec probes leaking; I'm not sure if that has a tracker yet.

As far as blockers are concerned, I think there are some valid concerns about the overall zombie level, especially if the number of zombie processes grows without bound over the lifetime of the node.  But the concern would be "do we expect significant resource consumption from $LEAK over the lifespan of the node between reboots?".  So say we reboot quarterly with each minor release [1] (ignoring any CVE fixes within z streams, which is a bit crazy, but we're looking for a worst-case scenario).  Do we expect "2 to 3" zombie etcd probe processes (it's not clear what the timescale is for those leaks) to build up enough to cause a problem over three months?  Are there downsides to a mitigation strategy like "rolling reboot of the control plane every month", or however frequently is needed to manage the leak?  Again, it would be nice if the leak were fixed, but to make the blocker argument, I think you need to make the case that mitigation is too onerous.

[1]: https://access.redhat.com/support/policy/updates/openshift#dates

Comment 23 Dan Li 2020-09-24 21:08:53 UTC
Thank you for the explanation and describing the mitigation, Trevor.

Also worth noting Holger's latest comment in related bug BZ 1878772, that "The key blocking release bug is the high CPU usage https://bugzilla.redhat.com/show_bug.cgi?id=1878770
There seems to be a connection between etcd and the crash api nodes which I suspect is related to the zombies in this bug here"

We may want to wait until BZ 1878770 is evaluated to see if this bug is (or is not) a blocker.

Comment 24 Holger Wolf 2020-09-29 15:43:04 UTC
As an update on BZ 1878772: we have not seen zombie processes at a high level. Therefore, and with the update from Trevor, I no longer see this as a blocker.

Comment 26 Dan Mace 2020-10-23 12:45:27 UTC
The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1807632 should address this simultaneously.

Comment 35 errata-xmlrpc 2021-02-24 15:12:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

