Bug 1878772 - On the nodes there are up to 547 zombie processes caused by thanos and Prometheus.
Summary: On the nodes there are up to 547 zombie processes caused by thanos and Prometheus.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: s390x
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: ocp-46-z-tracker
Reported: 2020-09-14 13:41 UTC by wvoesch
Modified: 2021-02-24 15:18 UTC
CC List: 21 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:17:46 UTC
Target Upstream Version:


Attachments
output of "ps -ef --forest". (101.76 KB, text/plain)
2020-09-14 13:41 UTC, wvoesch
output of "ps -ef --forest" on OCP cluster deployed on ppc64le (16.18 KB, text/plain)
2020-10-05 11:58 UTC, alisha


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 970 0 None closed Bug 1885244: bump prometheus operator to v0.43.0 2021-02-19 14:18:50 UTC
Github openshift cluster-monitoring-operator pull 977 0 None closed Bug 1878772: jsonnet/thanos-querier: exec probes, use correct endpoint for readiness 2021-02-19 14:18:50 UTC
Github openshift prometheus-operator pull 98 0 None closed Bug 1885244: bump to v0.43.0 2021-02-19 14:18:50 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:18:22 UTC

Description wvoesch 2020-09-14 13:41:16 UTC
Created attachment 1714798 [details]
output of "ps -ef --forest".

Description of problem:

On several nodes I have observed zombie processes. Usually there are 2 or 3, but once 547 were counted.


I observed this on the following two separate clusters with these versions:
Cluster a) Z13: Version: 4.6.0-0.nightly-s390x-2020-08-27-080214 RHCOS: 46.82.202008261939-0
Cluster b) Z13: Version: 4.6.0-0.nightly-s390x-2020-09-05-222506 RHCOS: 46.82.202009042339-0


Please let me know what information you need for further debugging and I shall provide it happily. 
Thank you. 


Additional info:

$ ps -ef --forest # because of the lengthy output I removed some parts. Full list is added as attachment.  
UID          PID    PPID  C STIME TTY          TIME CMD
root        7137       1  0 Aug27 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/7e73bfbb7e581ff9198d3b1ef1df1a9d103ec1a873a35e4f2f9f75800deadcce/userdata -c 7e73bfbb7e581ff9198d3b1ef1df1a9d103ec1a873a35e4f2f9f75800deadcce -
1000340+    7199    7137  0 Aug27 ?        00:42:14  \_ /bin/thanos query --grpc-address=127.0.0.1:10901 --http-address=127.0.0.1:9090 --query.replica-label=prometheus_replica --query.replica-label=thanos_ruler_replica --store=dnssrv+_grpc._tcp.prometheus-operated.opens
1000340+ 3897828    7199  0 Sep01 ?        00:00:00      \_ [curl] <defunct>
...
1000340+ 1859562    7199  0 Sep09 ?        00:00:00      \_ [curl] <defunct>
1000340+ 1886089    7199  0 Sep09 ?        00:00:00      \_ [sh] <defunct>
1000340+ 1887618    7199  0 Sep09 ?        00:00:00      \_ [curl] <defunct>
...
1000340+ 3497600    7199  0 15:15 ?        00:00:00      \_ [curl] <defunct>

root        9084       1  0 Aug27 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/f573dd497990651a4fbe47b6165854bf7d8af0dc1cdb8f240e6baec31efce51e/userdata -c f573dd497990651a4fbe47b6165854bf7d8af0dc1cdb8f240e6baec31efce51e -
1000340+    9152    9084 27 Aug27 ?        4-04:29:31  \_ /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometh
1000340+ 1843748    9152  0 Sep09 ?        00:00:00      \_ [curl] <defunct>
root        9085       1  0 Aug27 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/c8aab67010bec144cab5835cfc58e4ccce22e85f002a1bf723d9f112cfb8cba3/userdata -c c8aab67010bec144cab5835cfc58e4ccce22e85f002a1bf723d9f112cfb8cba3 -
1000340+    9137    9085 27 Aug27 ?        4-04:14:48  \_ /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometh
1000340+ 1787823    9137  0 Sep09 ?        00:00:00      \_ [curl] <defunct>
1000340+ 1813632    9137  0 Sep09 ?        00:00:00      \_ [curl] <defunct>
1000340+  931043    9137  0 Sep10 ?        00:00:00      \_ [curl] <defunct>

root     3293814       1  0 13:27 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/52f89083fd2fecbcc9d9e02d5cb956d67928d0fe5cd25e15459aca5fabbf0924/userdata -c 52f89083fd2fecbcc9d9e02d5cb956d67928d0fe5cd25e15459aca5fabbf0924 -
1000340+ 3293840 3293814  3 13:27 ?        00:03:27  \_ /bin/thanos query --grpc-address=127.0.0.1:10901 --http-address=127.0.0.1:9090 --query.replica-label=prometheus_replica --query.replica-label=thanos_ruler_replica --store=dnssrv+_grpc._tcp.prometheus-operated.opens
1000340+ 3300653 3293840  0 13:31 ?        00:00:00      \_ [curl] <defunct>
...
1000340+ 3343560 3293840  0 13:54 ?        00:00:00      \_ [curl] <defunct>
1000340+ 3344894 3293840  0 13:54 ?        00:00:00      \_ [sh] <defunct>
1000340+ 3345400 3293840  0 13:55 ?        00:00:00      \_ [curl] <defunct>
...
1000340+ 3501803 3293840  0 15:17 ?        00:00:00      \_ [curl] <defunct>

Comment 1 Pawel Krupa 2020-09-16 09:35:00 UTC
It looks like all zombie processes are coming from the liveness and readiness probes [1]. IMHO those should be reaped by the container runtime.

Reassigning to node team for further investigation.

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/thanos-querier/deployment.yaml#L55-L74

Comment 2 Peter Hunt 2020-09-16 15:57:14 UTC
I am not sure I'm interpreting the tree correctly, but it seems like all the defunct PIDs are children of prometheus or thanos processes, which are still running. conmon won't call waitpid() until the main exec process exits. Are prometheus and thanos reaping their children?

Comment 3 Pawel Krupa 2020-09-16 16:21:31 UTC
I was also concerned about this tree. However, neither Thanos nor Prometheus forks any curl processes; both use Go's HTTP implementation for such operations. In fact, the only place `sh` and `curl` are executed is in the readiness and liveness probes.

Comment 5 Hendrik Brueckner 2020-09-16 17:54:30 UTC
Hi Wolfgang,

Could you change the ps command to confirm that the processes are really in Z state, just to rule out other conditions? Do the zombie processes vanish over time, or do they continue to increase? What is the overall system load when you see this behavior?

@Pawel: Are you aware of any problems related to sigchld handling/reaping of process in those readiness/liveness probes?
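As a sketch of the first question: restricting ps to processes in Z state can be done with something like the following (assumes a procps-style ps, as shipped on RHCOS/RHEL):

```shell
# Sketch: show only zombie (defunct) processes and their parents.
# The "stat" column starts with Z for zombies (procps ps assumed).
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'

# A quick count of zombies (grep -c exits 1 when there are none):
ps -eo stat | grep -c '^Z' || true
```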

Comment 6 Dan Li 2020-09-16 19:59:30 UTC
Changing the reported "Version" from 4.6.z to 4.6 as 4.6.zstream has not been released yet.

Comment 7 Pawel Krupa 2020-09-17 06:51:25 UTC
@Hendrik this is the first report of such problems and I cannot reproduce it on other platforms.

The probes used in thanos querier and prometheus are among the simplest possible implementations and shouldn't cause issues. Each is just a shell script with one `if` and a `curl`/`wget` call:
```
sh -c '
  if [ -x "$(command -v curl)" ]; then
    curl http://localhost:9090/-/healthy
  elif [ -x "$(command -v wget)" ]; then
    wget --quiet --tries=1 --spider http://localhost:9090/-/healthy
  else
    exit 1
  fi'
```

If you have any suggestions for improving those probes, please share.

Comment 8 wvoesch 2020-09-18 13:32:59 UTC
In case it helps, the curl version installed is:

$ curl -V
curl 7.61.1 (s390x-ibm-linux-gnu) libcurl/7.61.1 OpenSSL/1.1.1c zlib/1.2.11 brotli/1.0.6 libidn2/2.2.0 libpsl/0.20.2 (+libidn2/2.0.5) libssh/0.9.0/openssl/zlib nghttp2/1.33.0
Release-Date: 2018-09-05
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz brotli TLS-SRP HTTP2 UnixSockets HTTPS-proxy PSL Metalink

Comment 10 W. Trevor King 2020-09-23 22:58:25 UTC
(In reply to Pawel Krupa from comment #1)
> It looks like all zombie processes are coming from liveness and readiness probes...

In a related bug about etcd leaking zombies, Dan suspects buggy exec probes [1].  Do you need to use exec probes for thanos/Prometheus?  Ideally, exec probes would not leak zombies, but curling 9090 doesn't seem like it needs an exec probe.  Can't you use an httpGet probe [2]?  Then you wouldn't have to worry about exec probe bugs.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1844727#c7
[2]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request
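For reference, an httpGet probe of the kind suggested above would look roughly like this in the container spec (a sketch only; the path and port are assumptions taken from the probe script, and field names follow the Kubernetes Pod spec):

```yaml
# Sketch: httpGet readiness probe; the kubelet performs the HTTP request
# itself, so no sh/curl child processes are spawned in the container.
readinessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 5
```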

Comment 11 Holger Wolf 2020-09-24 16:19:40 UTC
The key release-blocking bug is the high CPU usage: https://bugzilla.redhat.com/show_bug.cgi?id=1878770
There seems to be a connection between etcd and the crashing API nodes, which I suspect is related to the zombies in this bug.

Comment 12 wvoesch 2020-09-28 15:01:35 UTC
A short update: 

- I have installed OCP version 4.6.0-0.nightly-s390x-2020-09-24-083041 which contains a fix for the mentioned performance issue [1]. 
- As of now (cluster age ~4 days) I have not noticed any zombies apart from those caused by podman [2].
- I want to wait until the cluster is around 10 days old, the time frame in which I saw the many zombies from this report.



[1] https://bugzilla.redhat.com/show_bug.cgi?id=1878770
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1878780

Comment 13 Peter Hunt 2020-10-01 16:58:14 UTC
We will provide an update next sprint on whether this is fixed.

Comment 14 Pawel Krupa 2020-10-05 11:54:45 UTC
> Can't you use an httpGet probe [2]?  Then you wouldn't have to worry about exec probe bugs.

Prometheus listens only on localhost for security reasons and is protected by kube-rbac-proxy in front of it. In that case we cannot use an httpGet probe, as it is executed from outside the Pod, which means it would need to traverse kube-rbac-proxy and carry a token. An exec probe runs inside the Pod and can be executed against `localhost`, which is closer to the source. The Prometheus operator already has code for choosing which type of probe to use based on the configured exposition address [1].

[1]: https://github.com/prometheus-operator/prometheus-operator/blob/b9b6d68e0b3265c3df953b317288683fc11e675d/pkg/prometheus/statefulset.go#L565-L579

Comment 15 alisha 2020-10-05 11:58:04 UTC
Created attachment 1718981 [details]
output of "ps -ef --forest" on OCP cluster deployed on ppc64le

I installed OCP 4.6 on ppc64le (little endian). The issue was not seen (no zombie processes were observed).
Build used: 4.6.0-0.nightly-ppc64le-2020-10-02-231830

Comment 16 W. Trevor King 2020-10-05 23:53:57 UTC
> Prometheus is listening only on localhost for security reasons and it is protected with kube-rbac-proxy in front of it.

Ahh.  How about [1] to help out in this space?  That will also ensure that if the exec probe process (currently 'sh') is killed and reaped, there's no child wget/curl process to orphan.  The exec probe process just starts executing wget/curl itself.

[1]: https://github.com/prometheus-operator/prometheus-operator/pull/3567
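The exec-into-curl pattern from [1] can be sketched as follows (an illustration only, not the exact PR content): with `exec`, the shell replaces itself with curl/wget, so killing and reaping the probe process cannot orphan a child.

```shell
# Sketch of an exec-based probe (illustrative; the actual PR may differ).
# `exec` replaces the shell process with curl/wget, leaving no intermediate
# sh that could be reaped while a child is still running.
sh -c '
  if [ -x "$(command -v curl)" ]; then
    exec curl http://localhost:9090/-/healthy
  elif [ -x "$(command -v wget)" ]; then
    exec wget --quiet --tries=1 --spider http://localhost:9090/-/healthy
  else
    exit 1
  fi'
```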

Comment 17 Peter Hunt 2020-10-15 15:24:52 UTC
what's the status of this bug? I believe it's out of the hands of the Node team

Comment 18 W. Trevor King 2020-10-15 23:55:49 UTC
We could pick my PR from comment 16 over into openshift/prometheus-operator and try to measure the reduction. Noah also suggested checking the operand pods for a SIGCHLD reaper [1]. Both of those seem like monitoring actions to me.

[1]: https://github.com/prometheus-operator/prometheus-operator/pull/3567#issuecomment-703969854

Comment 19 Simon Pasquier 2020-10-21 13:57:27 UTC
Because of other bugs, we will update our downstream to the next version of prometheus-operator (v0.43.0) as soon as it's available. Note that this release also removes liveness probes, because they may kill pods during WAL replay [1].
As for adding SIGCHLD handling to Prometheus, I'm quite sure it won't be accepted upstream: reaping is the job of the init system or container runtime [2].

[1] https://github.com/prometheus-operator/prometheus-operator/issues/3391
[2] https://github.com/kubernetes/kubernetes/issues/84210

Comment 21 wvoesch 2020-10-29 12:23:24 UTC
Since we got fixes for the performance issues in other areas, I have noticed almost no zombies on the nodes anymore.
However, zombies still appear when the node running Prometheus is under heavy load and steal time rises (OCP version 4.6.0-rc.4). When the load drops, the zombies vanish.

Comment 28 errata-xmlrpc 2021-02-24 15:17:46 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

