Bug 1906496 - [BUG] Thanos having possible memory leak consuming huge amounts of node's memory and killing them
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6.z
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1916645 1920602 1927448 1930630
Depends On:
Blocks: 1918792
 
Reported: 2020-12-10 16:29 UTC by Andre Costa
Modified: 2024-03-25 17:29 UTC
CC List: 49 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1918792
Environment:
Last Closed: 2021-02-24 15:41:51 UTC
Target Upstream Version:
Embargoed:


Attachments
pprof result shortly before OOM, usage ~10GB (168.83 KB, image/png)
2021-01-12 13:16 UTC, Sergiusz Urbaniak
thanos-querier pods' memory usage after releasing space for Prometheus (157.73 KB, image/png)
2021-01-25 08:57 UTC, Junqi Zhao
thanos-querier pods' memory usage after a while (157.03 KB, image/png)
2021-01-25 08:58 UTC, Junqi Zhao
thanos-querier pods' memory usage during the upgrade (137.04 KB, image/png)
2021-01-27 13:02 UTC, Junqi Zhao


Links
Github openshift/thanos pull 46 (closed): Bug 1906496: pkg/rules/proxy: fix hotlooping when receiving client errors (last updated 2021-02-21 06:48:30 UTC)
Github thanos-io/thanos issue 3717 (closed): Thanos rules fanout OOMs occasionally (last updated 2021-02-18 15:52:46 UTC)
Red Hat Knowledge Base (Solution) 5685771 (last updated 2021-01-07 08:45:00 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:42:14 UTC)

Description Andre Costa 2020-12-10 16:29:39 UTC
Description of problem:
We are seeing in OCP 4.6 that Thanos containers consume huge amounts of memory, ending with the nodes crashing and becoming completely unresponsive until the node is able to kill processes and reclaim the resources.
So far the only workaround is to set resource limits on Thanos, which results in Thanos being OOM-killed but prevents the nodes from crashing.
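For reference, a minimal sketch of that workaround as a cluster-monitoring-config ConfigMap (this is not the KB solution verbatim; the thanosQuerier.resources key and the limit values are assumptions, and comment 70 below reports that such limits appeared to be ignored on 4.6):

# Hedged sketch: memory limits for the Thanos querier via cluster-monitoring-config.
# The thanosQuerier.resources key and the values are assumptions, not the verified fix.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    thanosQuerier:
      resources:
        requests:
          memory: 1Gi
        limits:
          memory: 4Gi
EOF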

Version-Release number of selected component (if applicable):
OCP 4.6.z

How reproducible:
So far, 2 customers are reporting the exact same behaviour.

Steps to Reproduce:
Unknown

Comment 34 Rogerio Bastos 2021-01-11 17:24:08 UTC
This issue was identified at another customer today. The thanos-querier pod repeatedly uses all the memory of a worker node, causing the kubelet to starve and the node to be replaced by the machine-api-controller health check.


Here are a couple of findings:

The pods
$ oc get pods -l app.kubernetes.io/name=thanos-query -n openshift-monitoring
NAME                              READY   STATUS        RESTARTS   AGE
thanos-querier-79b58777f5-f6nmh   5/5     Running       0          4m33s
thanos-querier-79b58777f5-kxh86   5/5     Terminating   0          18m
thanos-querier-79b58777f5-qdjqt   5/5     Running       0          9m3s
thanos-querier-79b58777f5-zstfc   5/5     Terminating   0          12m
 

Logged into a node instance, right before it starved and became unresponsive:
$ ps auxw | grep thanos
1000260+   19771 45.4 54.4 18371544 17680824 ?   Ssl  16:19   3:07 /bin/thanos query --log.level=info --grpc-address=127.0.0.1:10901 --http-address=127.0.0.1:9090 --query.replica-label=prometheus_replica --query.
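For comparison from outside the node, the same consumption can be watched with standard oc commands (a minimal sketch, nothing specific to this cluster):

# Watch per-container memory of the thanos-querier pods and overall node usage every 30s.
watch -n 30 'oc adm top pod -n openshift-monitoring -l app.kubernetes.io/name=thanos-query --containers; oc adm top nodes'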

Comment 36 Simon Pasquier 2021-01-12 10:34:24 UTC
To get profiles from the thanos-query containers, you can run something like:

SLEEP_MINUTES=5
while true; do
  for i in $(oc get pods -n openshift-monitoring -l app.kubernetes.io/name=thanos-query -o go-template --template="{{range .items}}{{ .metadata.name }} {{end}}"); do
    echo "Retrieving pod usage for $i...";
    oc adm top -n openshift-monitoring pod $i  --containers > "top.$i.$(date +%Y%m%d-%H%M%S).log";
    echo "Retrieving memory profile for $i...";
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/heap > heap.$i.$(date +%Y%m%d-%H%M%S).pprof;
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/allocs > allocs.$i.$(date +%Y%m%d-%H%M%S).pprof;
  done
  echo "Sleeping for $SLEEP_MINUTES minutes..."
  sleep $(( 60 * "$SLEEP_MINUTES" ))
done
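Once collected, the heap and allocation profiles can be inspected offline with the standard Go pprof tooling (the file names below are placeholders for files written by the loop above):

# Show the top in-use allocators in one of the collected heap profiles.
go tool pprof -top heap.PODNAME.TIMESTAMP.pprof
# Or browse the same profile interactively in a local web UI.
go tool pprof -http=:8080 heap.PODNAME.TIMESTAMP.pprof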

Comment 37 Sergiusz Urbaniak 2021-01-12 13:16:12 UTC
Created attachment 1746641 [details]
pprof result shortly before OOM, usage ~10GB

This is an initial heap profile taken shortly before the OOM. We are allocating ~10 GB of data in the rules receive function, which is most likely the offender. We are continuing to investigate why so much data accumulates here.

Comment 39 Sergiusz Urbaniak 2021-01-13 14:13:44 UTC
The current hypothesis is that this is caused by a recursion issue during error handling, which eats both CPU and memory. Please also send us a goroutine stack dump taken shortly before the OOM:

SLEEP_SECONDS=30
while true; do
  t=$(date +%Y%m%d-%H%M%S)
  for i in $(oc get pods -n openshift-monitoring -l app.kubernetes.io/name=thanos-query  -o go-template --template="{{range .items}}{{ .metadata.name }} {{end}}"); do
    echo "Retrieving profiles for $i...";
    oc adm top -n openshift-monitoring pod $i  --containers > "top.$i.$t.log";
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/heap > heap.$i.$t.pprof;
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/allocs > allocs.$i.$t.pprof;
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutine.$i.$t.txt;
  done
  echo "Sleeping for $SLEEP_SECONDS seconds..."
  sleep "$SLEEP_SECONDS"
done
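A quick, rough way to look at the resulting goroutine dumps for signs of the suspected hot loop (the file name is a placeholder for one written by the loop above):

# Placeholder name; substitute a file produced by the collection loop.
f=goroutine.PODNAME.TIMESTAMP.txt
# How many goroutines were alive when the dump was taken.
grep -c '^goroutine ' "$f"
# Which functions appear most often across all stacks (a crude hot-spot check).
grep -E '^[a-zA-Z]' "$f" | grep -v '^goroutine ' | sed 's/(.*$//' | sort | uniq -c | sort -rn | head -20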

Comment 48 Sergiusz Urbaniak 2021-01-19 09:32:25 UTC
*** Bug 1913532 has been marked as a duplicate of this bug. ***

Comment 56 Sergiusz Urbaniak 2021-01-22 19:34:27 UTC
*** Bug 1916645 has been marked as a duplicate of this bug. ***

Comment 58 Junqi Zhao 2021-01-25 08:57:55 UTC
Created attachment 1750426 [details]
thanos-querier pods' memory usage after releasing space for Prometheus

Comment 59 Junqi Zhao 2021-01-25 08:58:33 UTC
Created attachment 1750427 [details]
thanos-querier pods' memory usage after a while

Comment 60 Andreas Karis 2021-01-26 22:30:40 UTC
*** Bug 1920602 has been marked as a duplicate of this bug. ***

Comment 62 Junqi Zhao 2021-01-27 13:02:13 UTC
Created attachment 1751252 [details]
thanos-querier pods' memory usage during the upgrade

During the upgrade from 4.6.13 to 4.7.0-0.nightly-2021-01-22-134922, the thanos-querier pods' memory usage does not increase much, but memory usage for Prometheus does increase for a short time; see
https://bugzilla.redhat.com/show_bug.cgi?id=1918683#c10
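For anyone repeating this verification, the same memory metric can also be pulled directly through the Thanos querier's HTTP API (a minimal sketch; the local 9090 endpoint matches the --http-address=127.0.0.1:9090 flag quoted in comment 34, and container_memory_working_set_bytes is the standard cAdvisor metric):

# Query current working-set memory of the thanos-query containers through one of the querier pods.
pod=$(oc get pods -n openshift-monitoring -l app.kubernetes.io/name=thanos-query -o jsonpath='{.items[0].metadata.name}')
oc exec -n openshift-monitoring "$pod" -c thanos-query -- \
  curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (pod) (container_memory_working_set_bytes{namespace="openshift-monitoring",pod=~"thanos-querier.*",container="thanos-query"})'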

Comment 63 Junqi Zhao 2021-01-27 13:06:03 UTC
Based on Comments 57 to 62, the thanos-querier pods' memory usage does not change much during the upgrade, so setting this to VERIFIED.

Comment 66 Dan Yocum 2021-02-05 20:18:30 UTC
disregard my last comment - wrong BZ

Comment 67 Sergiusz Urbaniak 2021-02-11 11:04:10 UTC
*** Bug 1927448 has been marked as a duplicate of this bug. ***

Comment 69 daniel.hagen 2021-02-18 16:45:37 UTC
Hi,

we have the same problem with the Thanos Querier on OpenShift 4.6.6. This happens (only?) if the machine that the Thanos pods run on has more than 30 GB RAM and more than 4 CPUs. On smaller nodes this effect doesn't occur.

Thanos simply eats up all the available memory, which triggers the OOM killer and makes the node useless.

Comment 70 daniel.hagen 2021-02-19 10:05:22 UTC
The workaround in https://access.redhat.com/solutions/5685771 did not fix it, as the resource limits seem to be ignored in the cluster-monitoring-config.

Comment 71 Sergiusz Urbaniak 2021-02-19 11:12:28 UTC
@daniel: please note that the final fix, which does not require applying limits, is available in 4.6.16 as per https://access.redhat.com/errata/RHSA-2021:0308.

Comment 73 errata-xmlrpc 2021-02-24 15:41:51 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 74 Fabian Deutsch 2021-03-19 09:24:20 UTC
*** Bug 1930630 has been marked as a duplicate of this bug. ***

