Bug 1906496 - [BUG] Thanos having possible memory leak consuming huge amounts of node's memory and killing them
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6.z
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1916645 1920602 1927448 1930630
Depends On:
Blocks: 1918792
 
Reported: 2020-12-10 16:29 UTC by Andre Costa
Modified: 2024-03-25 17:29 UTC
CC List: 49 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1918792
Environment:
Last Closed: 2021-02-24 15:41:51 UTC
Target Upstream Version:
Embargoed:


Attachments
pprof result shortly before OOM, usage ~10GB (168.83 KB, image/png)
2021-01-12 13:16 UTC, Sergiusz Urbaniak
thanos-querier pods' memory usage after releasing space for Prometheus (157.73 KB, image/png)
2021-01-25 08:57 UTC, Junqi Zhao
thanos-querier pods' memory usage after a while (157.03 KB, image/png)
2021-01-25 08:58 UTC, Junqi Zhao
thanos-querier pods' memory usage during the upgrade (137.04 KB, image/png)
2021-01-27 13:02 UTC, Junqi Zhao


Links
Github openshift/thanos pull 46 (closed): Bug 1906496: pkg/rules/proxy: fix hotlooping when receiving client errors (last updated 2021-02-21 06:48:30 UTC)
Github thanos-io/thanos issue 3717 (closed): Thanos rules fanout OOMs occasionally (last updated 2021-02-18 15:52:46 UTC)
Red Hat Knowledge Base (Solution) 5685771 (last updated 2021-01-07 08:45:00 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:42:14 UTC)

Description Andre Costa 2020-12-10 16:29:39 UTC
Description of problem:
We are seeing in OCP 4.6 that Thanos containers consume huge amounts of memory, ending with the nodes crashing and becoming completely unresponsive until the node is able to kill processes and reclaim the resources.
So far the only workaround is to set resource limits on Thanos, which results in Thanos being OOM-killed but prevents the nodes from crashing.
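For reference, a minimal sketch of that workaround as a cluster-monitoring-config ConfigMap (this is not the KB solution verbatim; the thanosQuerier.resources key and the limit values are assumptions, and comment 70 below reports that such limits appeared to be ignored on 4.6):

# Hedged sketch: memory limits for the Thanos querier via cluster-monitoring-config.
# The thanosQuerier.resources key and the values are assumptions, not the verified fix.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    thanosQuerier:
      resources:
        requests:
          memory: 1Gi
        limits:
          memory: 4Gi
EOF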

Version-Release number of selected component (if applicable):
OCP 4.6.z

How reproducible:
So far, 2 customers are reporting the exact same behaviour.

Steps to Reproduce:
Unknown

Comment 34 Rogerio Bastos 2021-01-11 17:24:08 UTC
This issue was identified at another customer today. The thanos-querier pod repeatedly uses all the memory of a worker node, causing the kubelet to starve and the node to be replaced by the machine-api-controller health check.


Here are a couple of findings:

The pods
$ oc get pods -l app.kubernetes.io/name=thanos-query -n openshift-monitoring
NAME                              READY   STATUS        RESTARTS   AGE
thanos-querier-79b58777f5-f6nmh   5/5     Running       0          4m33s
thanos-querier-79b58777f5-kxh86   5/5     Terminating   0          18m
thanos-querier-79b58777f5-qdjqt   5/5     Running       0          9m3s
thanos-querier-79b58777f5-zstfc   5/5     Terminating   0          12m
 

Logged into a node instance, right before it starved and became unresponsive:
$ ps auxw | grep thanos
1000260+   19771 45.4 54.4 18371544 17680824 ?   Ssl  16:19   3:07 /bin/thanos query --log.level=info --grpc-address=127.0.0.1:10901 --http-address=127.0.0.1:9090 --query.replica-label=prometheus_replica --query.
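For comparison from outside the node, the same consumption can be watched with standard oc commands (a minimal sketch, nothing specific to this cluster):

# Watch per-container memory of the thanos-querier pods and overall node usage every 30s.
watch -n 30 'oc adm top pod -n openshift-monitoring -l app.kubernetes.io/name=thanos-query --containers; oc adm top nodes'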

Comment 36 Simon Pasquier 2021-01-12 10:34:24 UTC
To get profiles from the thanos-query containers, you can run something like:

SLEEP_MINUTES=5
while true; do
  for i in $(oc get pods -n openshift-monitoring -l app.kubernetes.io/name=thanos-query -o go-template --template="{{range .items}}{{ .metadata.name }} {{end}}"); do
    echo "Retrieving pod usage for $i...";
    oc adm top -n openshift-monitoring pod $i  --containers > "top.$i.$(date +%Y%m%d-%H%M%S).log";
    echo "Retrieving memory profile for $i...";
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/heap > heap.$i.$(date +%Y%m%d-%H%M%S).pprof;
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/allocs > allocs.$i.$(date +%Y%m%d-%H%M%S).pprof;
  done
  echo "Sleeping for $SLEEP_MINUTES minutes..."
  sleep $(( 60 * "$SLEEP_MINUTES" ))
done
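Once collected, the heap and allocation profiles can be inspected offline with the standard Go pprof tooling (the file names below are placeholders for files written by the loop above):

# Show the top in-use allocators in one of the collected heap profiles.
go tool pprof -top heap.PODNAME.TIMESTAMP.pprof
# Or browse the same profile interactively in a local web UI.
go tool pprof -http=:8080 heap.PODNAME.TIMESTAMP.pprof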

Comment 37 Sergiusz Urbaniak 2021-01-12 13:16:12 UTC
Created attachment 1746641 [details]
pprof result shortly before OOM, usage ~10GB

This is an initial heap profile taken shortly before the OOM. We are allocating ~10 GB of data in the rules receive function, which is most likely the offender. We are continuing to investigate why so much data accumulates here.

Comment 39 Sergiusz Urbaniak 2021-01-13 14:13:44 UTC
The current hypothesis is that this is caused by a recursion issue during error handling, which eats both CPU and memory. Please also send us a goroutine stack dump taken shortly before the OOM:

SLEEP_SECONDS=30
while true; do
  t=$(date +%Y%m%d-%H%M%S)
  for i in $(oc get pods -n openshift-monitoring -l app.kubernetes.io/name=thanos-query  -o go-template --template="{{range .items}}{{ .metadata.name }} {{end}}"); do
    echo "Retrieving profiles for $i...";
    oc adm top -n openshift-monitoring pod $i  --containers > "top.$i.$t.log";
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/heap > heap.$i.$t.pprof;
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/allocs > allocs.$i.$t.pprof;
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutine.$i.$t.txt;
  done
  echo "Sleeping for $SLEEP_SECONDS seconds..."
  sleep "$SLEEP_SECONDS"
done
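A quick, rough way to look at the resulting goroutine dumps for signs of the suspected hot loop (the file name is a placeholder for one written by the loop above):

# Placeholder name; substitute a file produced by the collection loop.
f=goroutine.PODNAME.TIMESTAMP.txt
# How many goroutines were alive when the dump was taken.
grep -c '^goroutine ' "$f"
# Which functions appear most often across all stacks (a crude hot-spot check).
grep -E '^[a-zA-Z]' "$f" | grep -v '^goroutine ' | sed 's/(.*$//' | sort | uniq -c | sort -rn | head -20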

Comment 48 Sergiusz Urbaniak 2021-01-19 09:32:25 UTC
*** Bug 1913532 has been marked as a duplicate of this bug. ***

Comment 56 Sergiusz Urbaniak 2021-01-22 19:34:27 UTC
*** Bug 1916645 has been marked as a duplicate of this bug. ***

Comment 58 Junqi Zhao 2021-01-25 08:57:55 UTC
Created attachment 1750426 [details]
thanos-querier pods' memory usage after releasing space for Prometheus

Comment 59 Junqi Zhao 2021-01-25 08:58:33 UTC
Created attachment 1750427 [details]
thanos-querier pods' memory usage after a while

Comment 60 Andreas Karis 2021-01-26 22:30:40 UTC
*** Bug 1920602 has been marked as a duplicate of this bug. ***

Comment 62 Junqi Zhao 2021-01-27 13:02:13 UTC
Created attachment 1751252 [details]
thanos-querier pods' memory usage during the upgrade

During the upgrade from 4.6.13 to 4.7.0-0.nightly-2021-01-22-134922, the thanos-querier pods' memory usage does not increase much, but memory usage for Prometheus does increase for a short time; see
https://bugzilla.redhat.com/show_bug.cgi?id=1918683#c10
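For anyone repeating this verification, the same memory metric can also be pulled directly through the Thanos querier's HTTP API (a minimal sketch; the local 9090 endpoint matches the --http-address=127.0.0.1:9090 flag quoted in comment 34, and container_memory_working_set_bytes is the standard cAdvisor metric):

# Query current working-set memory of the thanos-query containers through one of the querier pods.
pod=$(oc get pods -n openshift-monitoring -l app.kubernetes.io/name=thanos-query -o jsonpath='{.items[0].metadata.name}')
oc exec -n openshift-monitoring "$pod" -c thanos-query -- \
  curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (pod) (container_memory_working_set_bytes{namespace="openshift-monitoring",pod=~"thanos-querier.*",container="thanos-query"})'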

Comment 63 Junqi Zhao 2021-01-27 13:06:03 UTC
Based on Comments 57 to 62, the thanos-querier pods' memory usage does not change much during the upgrade, so setting this to VERIFIED.

Comment 66 Dan Yocum 2021-02-05 20:18:30 UTC
disregard my last comment - wrong BZ

Comment 67 Sergiusz Urbaniak 2021-02-11 11:04:10 UTC
*** Bug 1927448 has been marked as a duplicate of this bug. ***

Comment 69 daniel.hagen 2021-02-18 16:45:37 UTC
Hi,

we have the same problem with the Thanos Querier on OpenShift 4.6.6. This happens (only?) if the machine that the Thanos pods run on has more than 30 GB RAM and more than 4 CPUs. On smaller nodes this effect doesn't occur.

Thanos simply eats up all the available memory, which triggers the OOM killer and makes the node useless.

Comment 70 daniel.hagen 2021-02-19 10:05:22 UTC
The workaround in https://access.redhat.com/solutions/5685771 did not fix it, as the resource limits seem to be ignored in the cluster-monitoring-config.

Comment 71 Sergiusz Urbaniak 2021-02-19 11:12:28 UTC
@daniel: please note that the final fix, which does not require applying limits, is available in 4.6.16 as per https://access.redhat.com/errata/RHSA-2021:0308.

Comment 73 errata-xmlrpc 2021-02-24 15:41:51 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 74 Fabian Deutsch 2021-03-19 09:24:20 UTC
*** Bug 1930630 has been marked as a duplicate of this bug. ***

