Created attachment 1714798 [details]
output of "ps -ef --forest".
Description of problem:
On several nodes I have observed zombie processes. Usually there are 2 or 3, but on one occasion I counted 547.
I observed this on the following two separate clusters with these versions:
Cluster a) Z13: Version: 4.6.0-0.nightly-s390x-2020-08-27-080214 RHCOS: 46.82.202008261939-0
Cluster b) Z13: Version: 4.6.0-0.nightly-s390x-2020-09-05-222506 RHCOS: 46.82.202009042339-0
Please let me know what information you need for further debugging and I shall provide it happily.
$ ps -ef --forest # because of the lengthy output I removed some parts. Full list is added as attachment.
UID PID PPID C STIME TTY TIME CMD
root 7137 1 0 Aug27 ? 00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/7e73bfbb7e581ff9198d3b1ef1df1a9d103ec1a873a35e4f2f9f75800deadcce/userdata -c 7e73bfbb7e581ff9198d3b1ef1df1a9d103ec1a873a35e4f2f9f75800deadcce -
1000340+ 7199 7137 0 Aug27 ? 00:42:14 \_ /bin/thanos query --grpc-address=127.0.0.1:10901 --http-address=127.0.0.1:9090 --query.replica-label=prometheus_replica --query.replica-label=thanos_ruler_replica --store=dnssrv+_grpc._tcp.prometheus-operated.opens
1000340+ 3897828 7199 0 Sep01 ? 00:00:00 \_ [curl] <defunct>
1000340+ 1859562 7199 0 Sep09 ? 00:00:00 \_ [curl] <defunct>
1000340+ 1886089 7199 0 Sep09 ? 00:00:00 \_ [sh] <defunct>
1000340+ 1887618 7199 0 Sep09 ? 00:00:00 \_ [curl] <defunct>
1000340+ 3497600 7199 0 15:15 ? 00:00:00 \_ [curl] <defunct>
root 9084 1 0 Aug27 ? 00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/f573dd497990651a4fbe47b6165854bf7d8af0dc1cdb8f240e6baec31efce51e/userdata -c f573dd497990651a4fbe47b6165854bf7d8af0dc1cdb8f240e6baec31efce51e -
1000340+ 9152 9084 27 Aug27 ? 4-04:29:31 \_ /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometh
1000340+ 1843748 9152 0 Sep09 ? 00:00:00 \_ [curl] <defunct>
root 9085 1 0 Aug27 ? 00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/c8aab67010bec144cab5835cfc58e4ccce22e85f002a1bf723d9f112cfb8cba3/userdata -c c8aab67010bec144cab5835cfc58e4ccce22e85f002a1bf723d9f112cfb8cba3 -
1000340+ 9137 9085 27 Aug27 ? 4-04:14:48 \_ /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometh
1000340+ 1787823 9137 0 Sep09 ? 00:00:00 \_ [curl] <defunct>
1000340+ 1813632 9137 0 Sep09 ? 00:00:00 \_ [curl] <defunct>
1000340+ 931043 9137 0 Sep10 ? 00:00:00 \_ [curl] <defunct>
root 3293814 1 0 13:27 ? 00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/52f89083fd2fecbcc9d9e02d5cb956d67928d0fe5cd25e15459aca5fabbf0924/userdata -c 52f89083fd2fecbcc9d9e02d5cb956d67928d0fe5cd25e15459aca5fabbf0924 -
1000340+ 3293840 3293814 3 13:27 ? 00:03:27 \_ /bin/thanos query --grpc-address=127.0.0.1:10901 --http-address=127.0.0.1:9090 --query.replica-label=prometheus_replica --query.replica-label=thanos_ruler_replica --store=dnssrv+_grpc._tcp.prometheus-operated.opens
1000340+ 3300653 3293840 0 13:31 ? 00:00:00 \_ [curl] <defunct>
1000340+ 3343560 3293840 0 13:54 ? 00:00:00 \_ [curl] <defunct>
1000340+ 3344894 3293840 0 13:54 ? 00:00:00 \_ [sh] <defunct>
1000340+ 3345400 3293840 0 13:55 ? 00:00:00 \_ [curl] <defunct>
1000340+ 3501803 3293840 0 15:17 ? 00:00:00 \_ [curl] <defunct>
It looks like all zombie processes are coming from liveness and readiness probes. IMHO those should be reaped by the container runtime.
Reassigning to node team for further investigation.
I am not sure I'm interpreting the tree correctly, but it seems like all the defunct PIDs are children of prometheus or thanos processes, which are still running. conmon won't call waitpid() until the main exec process exits. Are prometheus and thanos reaping their children?
I was also concerned about this tree. However, neither thanos nor prometheus forks any curl processes; both use the Go HTTP implementation for such operations. In fact, the only place `sh` and `curl` are executed is in the readiness and liveness probes.
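To illustrate the mechanism (my own minimal sketch, not taken from the cluster): a `<defunct>` entry like the ones above appears whenever a child exits while its current parent never calls wait(). The snippet below mimics a probe's `sh` being replaced while its child is still running, leaving the child attached to a long-lived, non-reaping parent:

sh -c 'sleep 1 & exec sleep 60' &   # sh forks a child, then execs a process that never wait()s
sleep 5
ps -ef --forest | grep '[d]efunct'  # the finished child now shows up as <defunct>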
Could you change the ps command to confirm that these processes are actually in Z state, just to rule out other conditions? Do the zombie processes vanish over time, or does their number continue to increase? What is the overall system load when you see that behavior?
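For example, something along these lines would do (a suggestion, not the exact command used in this report):

# print only processes whose state starts with Z (zombie), keeping
# the header and the parent PID for tracing back to the non-reaper
ps -eo pid,ppid,user,stat,etime,comm | awk 'NR==1 || $4 ~ /^Z/'

# check whether the count grows over time
watch -n 60 'ps -eo stat= | grep -c "^Z"'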
@Pawel: Are you aware of any problems related to SIGCHLD handling/reaping of processes in those readiness/liveness probes?
Changing the reported "Version" from 4.6.z to 4.6, as the 4.6 z-stream has not been released yet.
@Hendrik: this is the first report of such problems, and I cannot reproduce it on other platforms.
The probes used in the Thanos querier and Prometheus are among the simplest implementations and shouldn't cause issues. Each is just a shell script with one `if` and a `curl`/`wget` statement:
sh -c '
if [ -x "$(command -v curl)" ]; then
  # curl branch reconstructed -- the original comment truncated it
  curl http://localhost:9090/-/healthy
elif [ -x "$(command -v wget)" ]; then
  wget --quiet --tries=1 --spider http://localhost:9090/-/healthy
else
  exit 1
fi'
If you have any suggestions for improving those probes, please share.
In case it helps, the curl version installed is:
$ curl -V
curl 7.61.1 (s390x-ibm-linux-gnu) libcurl/7.61.1 OpenSSL/1.1.1c zlib/1.2.11 brotli/1.0.6 libidn2/2.2.0 libpsl/0.20.2 (+libidn2/2.0.5) libssh/0.9.0/openssl/zlib nghttp2/1.33.0
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz brotli TLS-SRP HTTP2 UnixSockets HTTPS-proxy PSL Metalink
(In reply to Pawel Krupa from comment #1)
> It looks like all zombie processes are coming from liveness and readiness probes...
In a related bug about etcd leaking zombies, Dan suspects buggy exec probes. Do you need to use exec probes for thanos/Prometheus? Ideally, exec probes would not leak zombies, but curling 9090 doesn't seem like it needs an exec probe. Can't you use an httpGet probe? Then you wouldn't have to worry about exec probe bugs.
The key release-blocking bug is the high CPU usage issue: https://bugzilla.redhat.com/show_bug.cgi?id=1878770
There seems to be a connection between etcd and the crashing API nodes, which I suspect is related to the zombies in this bug.
A short update:
- I have installed OCP version 4.6.0-0.nightly-s390x-2020-09-24-083041, which contains a fix for the mentioned performance issue.
- As of now (cluster age ~4 days) I have not noticed zombies, apart from zombies caused by podman.
- I want to wait until the cluster is around 10 days old, the time frame in which I saw the many zombies from this report.
We will get an update next sprint on whether this is fixed.
> Can't you use an httpGet probe? Then you wouldn't have to worry about exec probe bugs.
Prometheus listens only on localhost for security reasons and is protected with kube-rbac-proxy in front of it. In that case we cannot use an httpGet probe, as it is executed from outside of the Pod, which means it would need to traverse kube-rbac-proxy and carry a token. An exec probe runs from inside the Pod and can be executed against `localhost`, which is closer to the source. The Prometheus operator already has code for choosing which type of probe to use based on the configured exposition address.
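To make the distinction concrete (a hypothetical sketch; the kube-rbac-proxy port and token handling here are assumptions for illustration):

# inside the pod, which is what the exec probe does -- plain localhost works:
curl -s http://localhost:9090/-/healthy

# from outside the pod, which is roughly what an httpGet probe would amount to --
# traffic has to traverse kube-rbac-proxy and carry a ServiceAccount token
# (9091 as the proxy port is assumed here):
TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
curl -sk -H "Authorization: Bearer ${TOKEN}" "https://${POD_IP}:9091/-/healthy"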
Created attachment 1718981 [details]
output of "ps -ef --forest" on OCP cluster deployed on ppc64le
I installed OCP 4.6 on ppc64le (little endian). The issue was not seen (no zombie processes).
Build used: 4.6.0-0.nightly-ppc64le-2020-10-02-231830
> Prometheus is listening only on localhost for security reasons and it is protected with kube-rbac-proxy in front of it.
Ahh. How about using `exec` in the probe script to help out in this space? That will also ensure that if the exec probe process (currently 'sh') is killed and reaped, there's no child wget/curl process left to orphan. The exec probe process just starts executing wget/curl itself.
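A minimal sketch of that change (the wget invocation matches the probe quoted earlier; the curl branch is reconstructed):

sh -c '
if [ -x "$(command -v curl)" ]; then
  # exec replaces sh, so there is never an orphanable curl child
  exec curl http://localhost:9090/-/healthy
elif [ -x "$(command -v wget)" ]; then
  exec wget --quiet --tries=1 --spider http://localhost:9090/-/healthy
else
  exit 1
fi'

With `exec`, the shell's PID is taken over by curl/wget, so whatever waits on the probe process reaps the HTTP client directly.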
What's the status of this bug? I believe it's out of the hands of the Node team.
We could pick my PR from comment 16 over into openshift/prometheus-operator and try to measure the reduction. Noah also suggested checking the operand pods for a SIGCHLD reaper. Both of those seem like monitoring actions to me.
Because of other bugs, we will update our downstream to the next version of prometheus-operator (v0.43.0) as soon as it's available. Note that this release will also remove liveness probes, because they may kill pods during WAL replay.
As for adding SIGCHLD handling to Prometheus, I'm quite sure it won't be accepted upstream: it's the job of the init system or container runtime.
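For reference, this is roughly what such reaping looks like when done where it belongs -- in a PID-1 init wrapper (a bash sketch of the general idea only, not anything shipped in the product; dedicated inits such as tini do this properly):

#!/bin/bash
# Sketch of a reaping PID-1 wrapper: start the real workload, then
# sit in a wait loop. As PID 1, orphaned probe processes get
# reparented to this shell, and the loop lets bash reap every child
# that exits (bash >= 4.3 for `wait -n`), so nothing stays <defunct>.
"$@" &
main=$!
while kill -0 "$main" 2>/dev/null; do
  wait -n 2>/dev/null
done
wait "$main"   # bash remembers the workload's exit status

Hypothetically invoked as e.g. `reaper.sh /bin/thanos query ...`.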
Since we got fixes for the performance issues in other areas, I have noticed almost no zombies on the nodes anymore.
However, zombies still appear when the nodes running Prometheus are under heavy load and steal time rises (OCP version: 4.6.0-rc.4). When the load decreases, the zombies vanish.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 2067497 has been marked as a duplicate of this bug. ***