Created attachment 1714798 [details]
output of "ps -ef --forest".
Description of problem:
On several nodes I have observed zombie processes. Usually there are 2 or 3, but on one occasion I counted 547.
I observed this on the following two separate clusters with these versions:
Cluster a) Z13: Version: 4.6.0-0.nightly-s390x-2020-08-27-080214 RHCOS: 46.82.202008261939-0
Cluster b) Z13: Version: 4.6.0-0.nightly-s390x-2020-09-05-222506 RHCOS: 46.82.202009042339-0
Please let me know what information you need for further debugging and I shall provide it happily.
$ ps -ef --forest # because of the lengthy output I removed some parts. Full list is added as attachment.
UID PID PPID C STIME TTY TIME CMD
root 7137 1 0 Aug27 ? 00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/7e73bfbb7e581ff9198d3b1ef1df1a9d103ec1a873a35e4f2f9f75800deadcce/userdata -c 7e73bfbb7e581ff9198d3b1ef1df1a9d103ec1a873a35e4f2f9f75800deadcce -
1000340+ 7199 7137 0 Aug27 ? 00:42:14 \_ /bin/thanos query --grpc-address=127.0.0.1:10901 --http-address=127.0.0.1:9090 --query.replica-label=prometheus_replica --query.replica-label=thanos_ruler_replica --store=dnssrv+_grpc._tcp.prometheus-operated.opens
1000340+ 3897828 7199 0 Sep01 ? 00:00:00 \_ [curl] <defunct>
1000340+ 1859562 7199 0 Sep09 ? 00:00:00 \_ [curl] <defunct>
1000340+ 1886089 7199 0 Sep09 ? 00:00:00 \_ [sh] <defunct>
1000340+ 1887618 7199 0 Sep09 ? 00:00:00 \_ [curl] <defunct>
1000340+ 3497600 7199 0 15:15 ? 00:00:00 \_ [curl] <defunct>
root 9084 1 0 Aug27 ? 00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/f573dd497990651a4fbe47b6165854bf7d8af0dc1cdb8f240e6baec31efce51e/userdata -c f573dd497990651a4fbe47b6165854bf7d8af0dc1cdb8f240e6baec31efce51e -
1000340+ 9152 9084 27 Aug27 ? 4-04:29:31 \_ /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometh
1000340+ 1843748 9152 0 Sep09 ? 00:00:00 \_ [curl] <defunct>
root 9085 1 0 Aug27 ? 00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/c8aab67010bec144cab5835cfc58e4ccce22e85f002a1bf723d9f112cfb8cba3/userdata -c c8aab67010bec144cab5835cfc58e4ccce22e85f002a1bf723d9f112cfb8cba3 -
1000340+ 9137 9085 27 Aug27 ? 4-04:14:48 \_ /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometh
1000340+ 1787823 9137 0 Sep09 ? 00:00:00 \_ [curl] <defunct>
1000340+ 1813632 9137 0 Sep09 ? 00:00:00 \_ [curl] <defunct>
1000340+ 931043 9137 0 Sep10 ? 00:00:00 \_ [curl] <defunct>
root 3293814 1 0 13:27 ? 00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/52f89083fd2fecbcc9d9e02d5cb956d67928d0fe5cd25e15459aca5fabbf0924/userdata -c 52f89083fd2fecbcc9d9e02d5cb956d67928d0fe5cd25e15459aca5fabbf0924 -
1000340+ 3293840 3293814 3 13:27 ? 00:03:27 \_ /bin/thanos query --grpc-address=127.0.0.1:10901 --http-address=127.0.0.1:9090 --query.replica-label=prometheus_replica --query.replica-label=thanos_ruler_replica --store=dnssrv+_grpc._tcp.prometheus-operated.opens
1000340+ 3300653 3293840 0 13:31 ? 00:00:00 \_ [curl] <defunct>
1000340+ 3343560 3293840 0 13:54 ? 00:00:00 \_ [curl] <defunct>
1000340+ 3344894 3293840 0 13:54 ? 00:00:00 \_ [sh] <defunct>
1000340+ 3345400 3293840 0 13:55 ? 00:00:00 \_ [curl] <defunct>
1000340+ 3501803 3293840 0 15:17 ? 00:00:00 \_ [curl] <defunct>
It looks like all zombie processes are coming from liveness and readiness probes. IMHO those should be reaped by the container runtime.
Reassigning to node team for further investigation.
I am not sure I'm interpreting the tree correctly, but it seems like all the defunct PIDs are children of prometheus or thanos processes, which are still running. conmon won't call waitpid() until the main exec process exits. Are prometheus and thanos reaping their children?
I was also concerned about this tree. However, neither thanos nor prometheus forks any curl processes; both use the Go HTTP implementation for such operations. In fact, the only place `sh` and `curl` are executed is in the readiness and liveness probes.
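To illustrate the mechanism (my own minimal sketch, not taken from the cluster): a `<defunct>` entry like the ones above appears whenever a child exits while its current parent never calls wait(). The snippet below mimics a probe's `sh` being replaced while its child is still running, leaving the child attached to a long-lived, non-reaping parent:

sh -c 'sleep 1 & exec sleep 60' &   # sh forks a child, then execs a process that never wait()s
sleep 5
ps -ef --forest | grep '[d]efunct'  # the finished child now shows up as <defunct>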
Could you change the ps command to confirm that these processes are actually in Z state, just to rule out other conditions? Do the zombie processes vanish over time, or does their number continue to increase? What is the overall system load when you see that behavior?
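For example, something along these lines would do (a suggestion, not the exact command used in this report):

# print only processes whose state starts with Z (zombie), keeping
# the header and the parent PID for tracing back to the non-reaper
ps -eo pid,ppid,user,stat,etime,comm | awk 'NR==1 || $4 ~ /^Z/'

# check whether the count grows over time
watch -n 60 'ps -eo stat= | grep -c "^Z"'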
@Pawel: Are you aware of any problems related to SIGCHLD handling/reaping of processes in those readiness/liveness probes?
Changing the reported "Version" from 4.6.z to 4.6, as the 4.6 z-stream has not been released yet.
@Hendrik: this is the first report of such problems, and I cannot reproduce it on other platforms.
The probes used in the Thanos querier and Prometheus are among the simplest implementations and shouldn't cause issues. Each is just a shell script with one `if` and a `curl`/`wget` statement:
sh -c '
if [ -x "$(command -v curl)" ]; then
  # curl branch reconstructed -- the original comment truncated it
  curl http://localhost:9090/-/healthy
elif [ -x "$(command -v wget)" ]; then
  wget --quiet --tries=1 --spider http://localhost:9090/-/healthy
else
  exit 1
fi'
If you have any suggestions for improving those probes, please share.
In case it helps, the curl version installed is:
$ curl -V
curl 7.61.1 (s390x-ibm-linux-gnu) libcurl/7.61.1 OpenSSL/1.1.1c zlib/1.2.11 brotli/1.0.6 libidn2/2.2.0 libpsl/0.20.2 (+libidn2/2.0.5) libssh/0.9.0/openssl/zlib nghttp2/1.33.0
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz brotli TLS-SRP HTTP2 UnixSockets HTTPS-proxy PSL Metalink
(In reply to Pawel Krupa from comment #1)
> It looks like all zombie processes are coming from liveness and readiness probes...
In a related bug about etcd leaking zombies, Dan suspects buggy exec probes. Do you need to use exec probes for thanos/Prometheus? Ideally, exec probes would not leak zombies, but curling 9090 doesn't seem like it needs an exec probe. Can't you use an httpGet probe? Then you wouldn't have to worry about exec probe bugs.
The key release-blocking bug is the high CPU usage issue: https://bugzilla.redhat.com/show_bug.cgi?id=1878770
There seems to be a connection between etcd and the crashing API nodes, which I suspect is related to the zombies in this bug.
A short update:
- I have installed OCP version 4.6.0-0.nightly-s390x-2020-09-24-083041, which contains a fix for the mentioned performance issue.
- As of now (cluster age ~4 days) I have not noticed zombies, apart from zombies caused by podman.
- I want to wait until the cluster is around 10 days old, the time frame in which I saw the many zombies from this report.
We will get an update next sprint on whether this is fixed.
> Can't you use an httpGet probe? Then you wouldn't have to worry about exec probe bugs.
Prometheus listens only on localhost for security reasons and is protected with kube-rbac-proxy in front of it. In that case we cannot use an httpGet probe, as it is executed from outside of the Pod, which means it would need to traverse kube-rbac-proxy and carry a token. An exec probe runs from inside the Pod and can be executed against `localhost`, which is closer to the source. The Prometheus operator already has code for choosing which type of probe to use based on the configured exposition address.
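To make the distinction concrete (a hypothetical sketch; the kube-rbac-proxy port and token handling here are assumptions for illustration):

# inside the pod, which is what the exec probe does -- plain localhost works:
curl -s http://localhost:9090/-/healthy

# from outside the pod, which is roughly what an httpGet probe would amount to --
# traffic has to traverse kube-rbac-proxy and carry a ServiceAccount token
# (9091 as the proxy port is assumed here):
TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
curl -sk -H "Authorization: Bearer ${TOKEN}" "https://${POD_IP}:9091/-/healthy"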
Created attachment 1718981 [details]
output of "ps -ef --forest" on OCP cluster deployed on ppc64le
I installed OCP 4.6 on ppc64le (little endian). The issue was not seen (no zombie processes).
Build used: 4.6.0-0.nightly-ppc64le-2020-10-02-231830
> Prometheus is listening only on localhost for security reasons and it is protected with kube-rbac-proxy in front of it.
Ahh. How about using `exec` in the probe script to help out in this space? That will also ensure that if the exec probe process (currently 'sh') is killed and reaped, there's no child wget/curl process left to orphan. The exec probe process just starts executing wget/curl itself.
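A minimal sketch of that change (the wget invocation matches the probe quoted earlier; the curl branch is reconstructed):

sh -c '
if [ -x "$(command -v curl)" ]; then
  # exec replaces sh, so there is never an orphanable curl child
  exec curl http://localhost:9090/-/healthy
elif [ -x "$(command -v wget)" ]; then
  exec wget --quiet --tries=1 --spider http://localhost:9090/-/healthy
else
  exit 1
fi'

With `exec`, the shell's PID is taken over by curl/wget, so whatever waits on the probe process reaps the HTTP client directly.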
What's the status of this bug? I believe it's out of the hands of the Node team.
We could pick my PR from comment 16 over into openshift/prometheus-operator and try to measure the reduction. Noah also suggested checking the operand pods for a SIGCHLD reaper. Both of those seem like monitoring actions to me.
Because of other bugs, we will update our downstream to the next version of prometheus-operator (v0.43.0) as soon as it's available. Note that this release will also remove liveness probes, because they may kill pods during WAL replay.
As for adding SIGCHLD handling to Prometheus, I'm quite sure it won't be accepted upstream: it's the job of the init system or container runtime.
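For reference, this is roughly what such reaping looks like when done where it belongs -- in a PID-1 init wrapper (a bash sketch of the general idea only, not anything shipped in the product; dedicated inits such as tini do this properly):

#!/bin/bash
# Sketch of a reaping PID-1 wrapper: start the real workload, then
# sit in a wait loop. As PID 1, orphaned probe processes get
# reparented to this shell, and the loop lets bash reap every child
# that exits (bash >= 4.3 for `wait -n`), so nothing stays <defunct>.
"$@" &
main=$!
while kill -0 "$main" 2>/dev/null; do
  wait -n 2>/dev/null
done
wait "$main"   # bash remembers the workload's exit status

Hypothetically invoked as e.g. `reaper.sh /bin/thanos query ...`.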
Since we got fixes for the performance issues in other areas, I have noticed almost no zombies on the nodes anymore.
However, zombies still appear when the nodes running Prometheus are under heavy load and steal time rises (OCP version: 4.6.0-rc.4). When the load decreases, the zombies vanish.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 2067497 has been marked as a duplicate of this bug. ***