Description of problem:
We are noticing in OCP 4.6 that Thanos containers consume huge amounts of memory, ending up with nodes crashing and becoming completely unresponsive until the kernel is able to kill processes and reclaim resources. So far the only workaround is to set resources.limits on Thanos, which results in Thanos being OOM-killed but prevents the nodes from crashing.

Version-Release number of selected component (if applicable):
OCP 4.6.z

How reproducible:
So far 2 customers have reported the exact same behaviour.

Steps to Reproduce:
Unknown
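For clusters hitting this, a simple way to watch the memory consumption of the thanos-query containers over time is the following sketch (assuming the thanos-querier deployment name suggested by the pod names in this report, and that pod metrics are available to oc adm top):

# per-container memory usage of the thanos-querier pods
oc adm top pod -n openshift-monitoring -l app.kubernetes.io/name=thanos-query --containers

# resource requests/limits currently applied to the containers of the deployment
oc get deployment thanos-querier -n openshift-monitoring -o jsonpath='{.spec.template.spec.containers[*].resources}{"\n"}'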
This issue was identified in another customer environment today. The thanos-querier pod is recurrently using all the memory of worker nodes, causing the kubelet to starve and the node to be replaced by the machine-api-controller health check. Here are a couple of findings:

The pods:

$ oc get pods -l app.kubernetes.io/name=thanos-query -n openshift-monitoring
NAME                              READY   STATUS        RESTARTS   AGE
thanos-querier-79b58777f5-f6nmh   5/5     Running       0          4m33s
thanos-querier-79b58777f5-kxh86   5/5     Terminating   0          18m
thanos-querier-79b58777f5-qdjqt   5/5     Running       0          9m3s
thanos-querier-79b58777f5-zstfc   5/5     Terminating   0          12m

Logged into a node instance, right before it starved and became unresponsive:

$ ps auxw | grep thanos
1000260+ 19771 45.4 54.4 18371544 17680824 ? Ssl 16:19 3:07 /bin/thanos query --log.level=info --grpc-address=127.0.0.1:10901 --http-address=127.0.0.1:9090 --query.replica-label=prometheus_replica --query.
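For reference, one way to locate the node hosting a growing thanos-query pod and inspect its process memory without SSH access is a debug pod, roughly like the sketch below (node name is a placeholder to be taken from the NODE column):

$ oc get pods -n openshift-monitoring -l app.kubernetes.io/name=thanos-query -o wide
# pick the node from the NODE column, then:
$ oc debug node/<node-name> -- chroot /host ps auxww | grep '[t]hanos query'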
To get profiles from the thanos-query containers, you can run something like:

SLEEP_MINUTES=5
while true; do
  for i in $(oc get pods -n openshift-monitoring -l app.kubernetes.io/name=thanos-query -o go-template --template="{{range .items}}{{ .metadata.name }} {{end}}"); do
    echo "Retrieving pod usage for $i..."
    oc adm top -n openshift-monitoring pod $i --containers > "top.$i.$(date +%Y%m%d-%H%M%S).log"
    echo "Retrieving memory profile for $i..."
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/heap > heap.$i.$(date +%Y%m%d-%H%M%S).pprof
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/allocs > allocs.$i.$(date +%Y%m%d-%H%M%S).pprof
  done
  echo "Sleeping for $SLEEP_MINUTES minutes..."
  sleep $(( 60 * SLEEP_MINUTES ))
done
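Once collected, the profiles can be inspected offline with the Go pprof tool. A sketch, assuming a Go toolchain is available locally and using the file names produced by the loop above:

# functions holding the largest in-use heap
go tool pprof -top -sample_index=inuse_space heap.<pod>.<timestamp>.pprof

# cumulative allocations since process start
go tool pprof -top -sample_index=alloc_space allocs.<pod>.<timestamp>.pprof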
Created attachment 1746641 [details] pprof result shortly before OOM, usage ~10GB

This is an initial heap profile shortly before the OOM. We are allocating ~10 GB of data in the rules receive function, which is most likely the offender. We are continuing the investigation into why so much data is accumulated here.
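To reproduce this kind of finding from a collected heap profile, the report can be narrowed to the suspected code path. A sketch; the 'rules' regex below only illustrates how to focus the output, it is not the exact symbol name from the attached profile:

# restrict the report to frames matching a regex
go tool pprof -top -focus='rules' heap.<pod>.<timestamp>.pprof

# or inspect interactively with 'top', 'peek <regex>', 'web'
go tool pprof heap.<pod>.<timestamp>.pprof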
There is a potential hypothesis that this is caused by a recursion issue during error handling, eating both CPU and memory. Please also send us the goroutine stack dump shortly before the OOM:

SLEEP_SECONDS=30
while true; do
  t=$(date +%Y%m%d-%H%M%S)
  for i in $(oc get pods -n openshift-monitoring -l app.kubernetes.io/name=thanos-query -o go-template --template="{{range .items}}{{ .metadata.name }} {{end}}"); do
    echo "Retrieving profiles for $i..."
    oc adm top -n openshift-monitoring pod $i --containers > "top.$i.$t.log"
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/heap > heap.$i.$t.pprof
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s http://localhost:9090/debug/pprof/allocs > allocs.$i.$t.pprof
    oc exec -n openshift-monitoring $i -c thanos-query -- curl -s "http://localhost:9090/debug/pprof/goroutine?debug=2" > goroutine.$i.$t.txt
  done
  echo "Sleeping for $SLEEP_SECONDS seconds..."
  sleep "$SLEEP_SECONDS"
done
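To check the recursion hypothesis against a collected goroutine dump, a quick first pass with standard shell tools looks roughly like this (file name placeholder matches the loop above):

# total number of goroutines in the dump
grep -c '^goroutine ' goroutine.<pod>.<timestamp>.txt

# most frequent function frames; a runaway recursion shows up as a very high count for one function
grep -E '^[A-Za-z_].*\(' goroutine.<pod>.<timestamp>.txt | sed -E 's/\(.*$//' | sort | uniq -c | sort -rn | head -20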
*** Bug 1913532 has been marked as a duplicate of this bug. ***
*** Bug 1916645 has been marked as a duplicate of this bug. ***
Created attachment 1750426 [details] thanos-querier pods' memory usage after releasing space for prometheus
Created attachment 1750427 [details] thanos-querier pods' memory usage after a while
*** Bug 1920602 has been marked as a duplicate of this bug. ***
Created attachment 1751252 [details] thanos-querier pods' memory usage during the upgrade

During the upgrade from 4.6.13 to 4.7.0-0.nightly-2021-01-22-134922, the thanos-querier pods' memory usage does not increase much, but memory usage for prometheus increases for a short time; see https://bugzilla.redhat.com/show_bug.cgi?id=1918683#c10
Based on Comments 57 - 62, the thanos-querier pods' memory usage does not change much during the upgrade; setting this to VERIFIED.
disregard my last comment - wrong BZ
*** Bug 1927448 has been marked as a duplicate of this bug. ***
Hi, we have the same problem with the Thanos Querier under OpenShift 4.6.6. This happens (only?) if the machine that the Thanos pods run on has more than 30 GB RAM and more than 4 CPUs. On smaller nodes this effect doesn't occur. Thanos simply eats up all the available memory, which leads to the OOM killer making the node useless.
The workaround in https://access.redhat.com/solutions/5685771 did not fix it, as it seems that the resource limits are ignored in the cluster-monitoring-config.
@daniel: please note that the final fix, which does not require applying limits, is available in 4.6.16 as per https://access.redhat.com/errata/RHSA-2021:0308.
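A quick way to confirm whether a cluster is already on a release containing the fix (a sketch; 4.6.16 is the version referenced above):

# current version and update status
oc get clusterversion

# available updates in the current channel
oc adm upgrade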
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
*** Bug 1930630 has been marked as a duplicate of this bug. ***