Created attachment 1861508 [details] MemoryProfile

Description of problem:
After launching Thanos Querier, I noticed that Thanos consumes a lot of CPU and memory. Because the Querier deployment has no memory or CPU limit, kubelet does not terminate the process. Eventually, the kernel terminates the Thanos pod.

Version-Release number of selected component (if applicable):
Thanos-Query: 0.22.0

How reproducible:
I can't reproduce the error every time. I think it is some kind of race condition during startup.

Steps to Reproduce:
1. Kill the Thanos-Querier pod
2. Query the Prometheus API endpoint of Thanos Querier with a lot of PromQL queries

Actual results:
Thanos Querier sometimes consumes a lot of CPU and memory.

Expected results:
Normal Thanos startup using only a few resources.

Additional info:
I found Bugzilla Bug 1906496, which was very helpful in getting additional information about the Thanos process.
Created attachment 1861509 [details] allocsMemory
Created attachment 1861510 [details] goroutine.txt
Created attachment 1861511 [details] heap.pprof
Created attachment 1861512 [details] top.txt
It seems that the memory leak can be triggered by using the dashboard, because the dashboard also queries the metrics via the query component.
I can't reproduce the problem with Thanos v0.24.0 at the moment. Let's see what time reveals with Thanos v0.24.0.
@philipp.dallig

> After launching Thanos Querier, I noticed that Thanos consumes a lot of CPU and memory. Because the Querier deployment has no memory or CPU limit, kubelet does not terminate the process. Eventually, the kernel terminates the Thanos pod.

I have been trying to reproduce this without much luck. How often are you able to reproduce this?

Could you please confirm that by "launching Thanos Querier" you mean the thanos-querier deployment in the openshift-monitoring namespace? Could you also attach a must-gather when this happens?

> I can't reproduce the problem with Thanos v0.24.0 at the moment. Let's see what time reveals with Thanos v0.24.0.

How are you testing this?
From the pprof, it seems there is a memory leak in errors.callers:

```
(pprof) list github.com/pkg/errors.callers
Total: 7.25GB
ROUTINE ======================== github.com/pkg/errors.callers in /go/src/github.com/improbable-eng/thanos/vendor/github.com/pkg/errors/stack.go
    4.36GB     4.36GB (flat, cum) 60.08% of Total
         .          .    160:	return f
         .          .    161:}
         .          .    162:
         .          .    163:func callers() *stack {
         .          .    164:	const depth = 32
    3.99GB     3.99GB    165:	var pcs [depth]uintptr
         .          .    166:	n := runtime.Callers(3, pcs[:])
  380.51MB   380.51MB    167:	var st stack = pcs[0:n]
         .          .    168:	return &st
         .          .    169:}
         .          .    170:
         .          .    171:// funcname removes the path prefix component of a function's name reported by func.Name().
         .          .    172:func funcname(name string) string {
(pprof)
```
> Could you please confirm that by "launching Thanos Querier" you mean the thanos-querier deployment in the openshift-monitoring namespace?

Yes, I mean the thanos-querier deployment in the openshift-monitoring namespace.

> How are you testing this?

I built and installed my own OKD release with the following commands, replacing only the Thanos image with the official 0.24.0 from quay.io:

```
oc adm release new --from-release quay.io/openshift/okd:4.9.0-0.okd-2022-01-29-035536 --to-image harbor.avm.de/openshift/okd-thanos-bugfix:4.9.0-0.okd-2022-01-29-035536-1 thanos=quay.io/thanos/thanos:v0.24.0
oc adm upgrade --to-image harbor.avm.de/openshift/okd-thanos-bugfix:4.9.0-0.okd-2022-01-29-035536-1 --force --allow-explicit-upgrade
```

This was the only change I made, and I was unsuccessful in reproducing the error afterwards.

> How often are you able to reproduce this?

Approximately 30% of the time, after both querier pods were terminated at the same time and massively parallel queries were performed during the startup of the two new querier pods.

> Could you also attach a must-gather when this happens?

At the moment I'm not able to reproduce the error on my production cluster. I will downgrade my stage cluster and hopefully send you a must-gather report soon.

By the way, I checked the log output of the thanos-querier pod when it was in the error situation. The log looked normal.
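For illustration, the "massively parallel queries" in the reproduction are roughly of this shape. This is only a minimal sketch: the route URL, bearer token, and query list below are placeholders, not the actual values used.

```
// loadgen.go - fire many concurrent instant queries at the Thanos Querier
// Prometheus-compatible API while the new pods are starting up.
// The route URL, token, and queries are placeholders for illustration only.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/url"
	"sync"
)

func main() {
	base := "https://thanos-querier.example.invalid" // placeholder route
	token := "sha256~REDACTED"                       // placeholder bearer token
	queries := []string{"up", "sum(rate(apiserver_request_total[5m]))"}

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // test cluster only
	}}

	var wg sync.WaitGroup
	for i := 0; i < 200; i++ { // many parallel clients
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			q := queries[i%len(queries)]
			req, err := http.NewRequest("GET",
				base+"/api/v1/query?query="+url.QueryEscape(q), nil)
			if err != nil {
				return
			}
			req.Header.Set("Authorization", "Bearer "+token)
			resp, err := client.Do(req)
			if err != nil {
				fmt.Println("query failed:", err)
				return
			}
			resp.Body.Close()
		}(i)
	}
	wg.Wait()
}
```

Any similar load generator hitting the querier's /api/v1/query endpoint while the pods are starting should exercise the same path.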
Thank you for the info.

A related note: you can set limits on thanos-querier by modifying the cluster-monitoring-operator configmap, e.g.

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |-
    thanosQuerier:
      resources:
        limits: { ... }
```
Something that stood out in the heap.pprof is that Thanos seems to be holding too many errors in heap (I don't know why yet), which can contribute to a memory leak. To define what I mean by memory leak: holding more in memory (heap) than what is required.

Here is my analysis...

The existing `callers` in pkg/errors/stack.go is below:

```
174: func callers() *stack {
175: 	const depth = 32
176: 	var pcs [depth]uintptr
177: 	n := runtime.Callers(3, pcs[:])
178: 	var st stack = pcs[0:n]
179: 	return &st
180: }
```

What I gather from the above is that it is trying to allocate `pcs` on the stack so that it doesn't need to be garbage collected. But that isn't really what I see happen. Running `go build -gcflags='-m -m' vendor/github.com/pkg/errors/stack.go 2> tmp/stack.go` shows:

```
stack.go:178:6: st escapes to heap:
stack.go:178:6:   flow: ~r0 = &st:
stack.go:178:6:     from &st (address-of) at vendor/github.com/pkg/errors/stack.go:179:9
stack.go:178:6:     from return &st (return) at vendor/github.com/pkg/errors/stack.go:179:2
stack.go:176:6: pcs escapes to heap:            <<< This shouldn't happen >>>
stack.go:176:6:   flow: st = &pcs:
stack.go:176:6:     from pcs (address-of) at vendor/github.com/pkg/errors/stack.go:178:20
stack.go:176:6:     from pcs[0:n] (slice) at vendor/github.com/pkg/errors/stack.go:178:20
stack.go:176:6:     from st = pcs[0:n] (assign) at vendor/github.com/pkg/errors/stack.go:178:6
stack.go:176:6: moved to heap: pcs
stack.go:178:6: moved to heap: st
```

`pcs` does escape into the heap, so instead of holding `n` (pcs[:n]) items in heap, `st`, which is a slice of `pcs`, will actually hold 32 items for every error that is created (but not garbage collected). This is also confirmed by the profiler:

```
(pprof) top 3
Showing nodes accounting for 1267.21MB, 100% of 1267.25MB total
      flat  flat%   sum%        cum   cum%
 1108.23MB 87.45% 87.45%  1108.23MB 87.45%  github.com/pkg/errors.callers
   94.30MB  7.44% 94.89%  1202.53MB 94.89%  github.com/pkg/errors.New (inline)
   64.68MB  5.10%   100%  1267.21MB   100%  main.moo (inline)

(pprof) list github.com/pkg/errors.callers
Total: 1.24GB
ROUTINE ======================== github.com/pkg/errors.callers
    1.08GB     1.08GB (flat, cum) 87.45% of Total
         .          .    173:
         .          .    174:func callers() *stack {
         .          .    175:	const depth = 32
 1012.96MB  1012.96MB    176:	var pcs [depth]uintptr      <<< escapes to heap
         .          .    177:	n := runtime.Callers(3, pcs[:])
   95.27MB    95.27MB    178:	var st stack = pcs[0:n]
         .          .    179:	return &st
         .          .    180:}
         .          .    181:
```

A better way to achieve it is to prevent `pcs` from escaping to the heap, as below:

```
func callers() *stack {
	const depth = 32
	var mypcs [depth]uintptr
	n := runtime.Callers(3, mypcs[:])

	var myST = make(stack, n)
	copy(myST, mypcs[:n])
	return &myST
}
```

```
(pprof) top 3
      flat  flat%   sum%        cum   cum%
  324.49MB 67.81% 67.81%   324.49MB 67.81%  github.com/pkg/errors.callers
   89.34MB 18.67% 86.48%   413.83MB 86.48%  github.com/pkg/errors.New (inline)
   64.68MB 13.52%   100%   478.51MB   100%  main.moo

(pprof) list github.com/pkg/errors.callers
Total: 478.55MB
ROUTINE ======================== github.com/pkg/errors.callers
  324.49MB   324.49MB (flat, cum) 67.81% of Total
         .          .    164:func callers() *stack {
         .          .    165:	const depth = 32
         .          .    166:	var mypcs [depth]uintptr
         .          .    167:	n := runtime.Callers(3, mypcs[:])
         .          .    168:
  324.49MB   324.49MB    169:	var myST = make(stack, n)
         .          .    170:	copy(myST, mypcs[:n])
         .          .    171:	return &myST
         .          .    172:}
```

Both versions hold the same number of errors in memory, but the second version is significantly better in terms of allocations. Unfortunately https://github.com/pkg/errors says "This repository has been archived by the owner. It is now read-only."
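For context, a minimal harness along these lines can be used to regenerate such a heap profile and compare the two `callers` variants. This is an assumed reconstruction; the `main.moo` function referenced in the profiles above is not included in this report.

```
// heapdemo.go - hypothetical harness: create and retain many pkg/errors errors,
// then write a heap profile so the flat allocation inside errors.callers can be
// compared before and after the proposed change to callers().
package main

import (
	"os"
	"runtime"
	"runtime/pprof"

	"github.com/pkg/errors"
)

func main() {
	// Keep the errors reachable so their backing stack arrays cannot be
	// garbage collected, mimicking errors that are retained by the program.
	held := make([]error, 0, 1_000_000)
	for i := 0; i < 1_000_000; i++ {
		held = append(held, errors.New("synthetic error"))
	}

	runtime.GC() // flush recent allocations into the heap profile

	f, err := os.Create("heap.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.WriteHeapProfile(f); err != nil {
		panic(err)
	}
	runtime.KeepAlive(held)
}
```

Running this against the patched vendor copy and inspecting heap.pprof with `go tool pprof` should show the drop in flat allocation inside errors.callers.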
@philipp.dallig

What I gather from the pprof is that api/v1/targets is hit and that seems to result in errors. Without a must-gather it is hard to understand the errors and root cause. But since you mentioned that you are unable to reproduce this issue on 0.24, I think these two upstream PRs may have contributed to the fix:

* https://github.com/thanos-io/thanos/pull/4811
* https://github.com/thanos-io/thanos/pull/4805

If it is okay with you, I will close this bug as "INSUFFICIENT DATA".
In the next few days I want to roll my staging cluster back to the official OKD version. I will try to reproduce the bug there and provide a must-gather file. Please keep the ticket open until then.
Absolutely, a must-gather will be really useful.
Sorry for the delay. Today I had time to reproduce the OOM.

https://clouddrive.avm.de/s/KZQsYBRdRwW77jt
Password: RIY6251apE

Output of must-gather:
ClusterID: 8dbfcc9c-d2e7-4945-8b9c-86c3d984cfeb
ClusterVersion: Stable at "4.9.0-0.okd-2022-02-12-140851"
ClusterOperators: All healthy and stable
@philipp.dallig

Thank you for sharing the must-gather. Could you please shed some light on when the must-gather was taken?

* Is it at the time that thanos-querier started consuming too much memory, or after the event?
* What was the memory consumption of the Thanos Querier pods?
* What is the name of the pod?

A few things stood out in the must-gather.

1. At "2022-03-10T12:29:10Z", thanos-querier-647bb8fcbd-56vvw on worker3-cl1-dc99.s-ocp.cloud.avm.de got killed, but I can't find the pod logs. Is this the pod that went OOM? If that is the case, having the logs of that pod is important.

2. At "2022-03-10T12:33:44Z", worker2-cl1-dc99.s-ocp.cloud.avm.de is experiencing memory pressure.
   * Is this related to the thanos-querier pod - thanos-querier-647bb8fcbd-2dpbg - running on that node, which is going to run OOM soon?

The logs indicate network issues before the node started experiencing memory pressure. See:

```
2022-03-10T12:33:42.081265494Z level=warn ts=2022-03-10T12:33:42.081160578Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
2022-03-10T12:33:42.081388631Z level=warn ts=2022-03-10T12:33:42.081238634Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.18:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.18:10901
10.131.2.15:10901: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.131.2.15:10901: connect: connection refused\"" address=10.131.2.15:10901
```

Even the DNS SRV lookup seems to fail:

```
2022-03-10T12:35:06.105218244Z level=error ts=2022-03-10T12:35:06.105120537Z caller=query.go:511 msg="failed to resolve addresses for storeAPIs" err="2 errors: lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": could not resolve \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": all servers responded with errors to at least one search domain. Errs ;could not resolve _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local.: no servers returned a viable answer. Errs ;resolution against server 172.30.0.10 for _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local.: exchange: read udp 10.129.4.17:35668->172.30.0.10:53: i/o timeout; lookup IP addresses \"_grpc._tcp.prometheus-operated.openshift-user-workload-monitoring.svc.cluster.local\": could not resolve \"prometheus-user-workload-1.prometheus-operated.openshift-user-workload-monitoring.svc.cluster.local.\": all servers responded with errors to at least one search domain. Errs ;could not resolve prometheus-user-workload-1.prometheus-operated.openshift-user-workload-monitoring.svc.cluster.local.: no servers returned a viable answer. Errs ;resolution against server 172.30.0.10 for prometheus-user-workload-1.prometheus-operated.openshift-user-workload-monitoring.svc.cluster.local.: exchange: read udp 10.129.4.17:60544->172.30.0.10:53: i/o timeout"
2022-03-10T12:35:07.095000434Z level=warn ts=2022-03-10T12:35:07.094911754Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.18:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.18:10901
2022-03-10T12:35:07.095000434Z level=warn ts=2022-03-10T12:35:07.094940206Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
2022-03-10T12:35:10.129071765Z level=error ts=2022-03-10T12:35:10.128974645Z caller=query.go:514 msg="failed to resolve addresses for rulesAPIs" err="lookup SRV records \"_grpc._tcp.thanos-ruler-operated.openshift-user-workload-monitoring.svc.cluster.local\": could not resolve \"_grpc._tcp.thanos-ruler-operated.openshift-user-workload-monitoring.svc.cluster.local\": all servers responded with errors to at least one search domain. Errs ;could not resolve _grpc._tcp.thanos-ruler-operated.openshift-user-workload-monitoring.svc.cluster.local.: no servers returned a viable answer. Errs ;resolution against server 172.30.0.10 for _grpc._tcp.thanos-ruler-operated.openshift-user-workload-monitoring.svc.cluster.local.: exchange: read udp 10.129.4.17:54622->172.30.0.10:53: i/o timeout"
```

I am surprised that the same doesn't happen when you try 0.24, since these logs seem to point to infrastructure issues. A bit more context about when the must-gather was taken, a timeline of events, and the pod names would be of great help in digging deeper into the issues. Given the must-gather, it seems to point to a flaky network rather than any Thanos issue. As pointed out earlier, depending on the amount of errors generated, Thanos can indeed consume a lot of memory.
Because the last must-gather is quite old, I decided to create a new one.

New must-gather:
Link: https://clouddrive.avm.de/s/GmcxrWQSQfEwzSn
Password: RIY6251apE

After triggering the error (checking with "oc adm top pod -l app.kubernetes.io/name=thanos-query") I ran the must-gather. Please note that the error occurs quite often after starting a thanos-query process, so I delete the old pods that seem to start as expected in order to trigger the error.

Timeline after 2-3 deletions ($ oc delete pods -l app.kubernetes.io/name=thanos-query):

```
$ oc get pods -l app.kubernetes.io/name=thanos-query
NAME                              READY   STATUS    RESTARTS   AGE
thanos-querier-647bb8fcbd-2wx6v   5/5     Running   0          92s
thanos-querier-647bb8fcbd-jp5tk   5/5     Running   0          92s

$ oc adm top pod -l app.kubernetes.io/name=thanos-query
W0322 16:20:48.787344   21292 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                              CPU(cores)   MEMORY(bytes)
thanos-querier-647bb8fcbd-2wx6v   352m         5984Mi
thanos-querier-647bb8fcbd-jp5tk   1140m        4934Mi

$ oc adm must-gather
...

$ oc adm top pod -l app.kubernetes.io/name=thanos-query
W0322 16:25:14.924827   21365 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                              CPU(cores)   MEMORY(bytes)
thanos-querier-647bb8fcbd-2wx6v   167m         5994Mi
thanos-querier-647bb8fcbd-jp5tk   5m           76Mi

$ oc get pod -l app.kubernetes.io/name=thanos-query
NAME                              READY   STATUS    RESTARTS        AGE
thanos-querier-647bb8fcbd-2wx6v   4/5     Running   2 (72s ago)     6m14s
thanos-querier-647bb8fcbd-jp5tk   5/5     Running   1 (3m39s ago)   6m14s
```

As you can see, both thanos-querier pods have restarts.

> Is it at the time that thanos-querier started consuming too much memory or after the event?

As you can see, the must-gather was taken right during the OOM event.

> What was the memory consumption of the Thanos Querier pods?

It depends on how long the Thanos Querier pod stays alive until the OOM killer kills it.

> What is the name of the pod?

The timeline makes this clear. If you see memory pressure on nodes, this is most likely related to the Thanos Querier pods. Since this is a test environment, there is hardly any other workload.

I don't know where the network errors are coming from; maybe they are caused by the high memory consumption of the Thanos Querier pod. I don't notice any network problems when the pods are running normally.

I put a resource limit in the "cluster-monitoring-config" as you recommended. This seems to preempt the OOM killer of the operating system and thus minimizes the impact.
@philipp.dallig

Thank you for painting a picture of the incident and for uploading the newer must-gather.

TL;DR: I still can't find issues in the thanos-querier pod other than network errors. Here is my analysis based on the must-gather provided.

> If you see memory-pressure on nodes, this is most likely related to the Thanos querier pods. Since this is a test environment, there is hardly any other workload.

I started looking at the nodes with memory pressure and, as in the old must-gather, it is worker2.

```
❯ oc get pods -A -o wide -l app.kubernetes.io/name=thanos-query
NAMESPACE              NAME                              READY   STATUS    RESTARTS      AGE   IP             NODE                                  NOMINATED NODE   READINESS GATES
openshift-monitoring   thanos-querier-647bb8fcbd-2wx6v   4/5     Running   1 (14h ago)   14h   10.128.5.69    worker2-cl1-dc99.s-ocp.cloud.avm.de   <none>           <none>
openshift-monitoring   thanos-querier-647bb8fcbd-jp5tk   5/5     Running   1 (14h ago)   14h   10.130.5.167   worker4-cl1-dc99.s-ocp.cloud.avm.de   <none>           <none>
```

Looking at the nodes, I figured out that worker2 is experiencing high memory pressure, and I see that both thanos-querier pods have restarted. So I went on to look at the reason for the restarts by checking the pod logs.

### thanos-querier-647bb8fcbd-2wx6v

```
❯ oc logs -n openshift-monitoring thanos-querier-647bb8fcbd-2wx6v -c thanos-query | ag warn
```

So the thanos-query container in the pod (running on worker2) hasn't logged any errors but has restarted.

❯ kc describe pod thanos-querier-647bb8fcbd-2wx6v

```
  oauth-proxy:
    Container ID:  cri-o://2bc9e22c229bfb4eead6143aaafe297d502bd9d828d2a37631c74fdb12a0c70e
    Image:         quay.io/openshift/okd-content@sha256:0e72ed64cdf76828392ef8cfe4ee49d644a013ac00ff1590d6c53774d77615e3
    Image ID:      quay.io/openshift/okd-content@sha256:0e72ed64cdf76828392ef8cfe4ee49d644a013ac00ff1590d6c53774d77615e3
    Port:          9091/TCP
    Host Port:     0/TCP
    Args:
      -provider=openshift
      -https-address=:9091
    State:          Running
      Started:      Wed, 23 Mar 2022 01:22:19 +1000
    Last State:     Terminated
      Reason:       Error
      Message:      Cookie settings: name:_oauth_proxy secure(https):true httponly:true expiry:168h0m0s domain:<default> samesite: refresh:disabled
2022/03/22 15:19:12 main.go:156: using htpasswd file /etc/proxy/htpasswd/auth
I0322 15:19:12.678742       1 dynamic_serving_content.go:130] Starting serving::/etc/tls/private/tls.crt::/etc/tls/private/tls.key
2022/03/22 15:19:12 http.go:107: HTTPS: listening on [::]:9091
2022/03/22 15:19:34 oauthproxy.go:857: authenticated "internal" via basic auth
2022/03/22 15:19:40 oauthproxy.go:857: authenticated "internal" via basic auth
2022/03/22 15:20:04 reverseproxy.go:476: http: proxy error: context canceled
2022/03/22 15:20:10 reverseproxy.go:476: http: proxy error: context canceled
2022/03/22 15:21:35 server.go:3119: http: TLS handshake error from 10.131.2.2:53300: write tcp 10.128.5.69:9091->10.131.2.2:53300: write: connection reset by peer
2022/03/22 15:21:35 server.go:3119: http: TLS handshake error from 10.131.2.2:53310: write tcp 10.128.5.69:9091->10.131.2.2:53310: write: connection reset by peer
2022/03/22 15:21:35 reverseproxy.go:476: http: proxy error: context canceled
      Exit Code:    2
      Started:      Wed, 23 Mar 2022 01:19:12 +1000
      Finished:     Wed, 23 Mar 2022 01:22:14 +1000
    Ready:          False
    Restart Count:  1
```

If you look at the Events, we see there were ProbeErrors:

```
Warning  ProbeError  14h (x4 over 14h)  kubelet  Readiness probe error: Get "https://10.128.5.69:9091/-/ready": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) body:
Warning  Unhealthy   14h (x4 over 14h)  kubelet  Readiness probe failed: Get "https://10.128.5.69:9091/-/ready": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning  ProbeError  14h                kubelet  Liveness probe error: Get "https://10.128.5.69:9091/-/healthy": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) body:
Warning  Unhealthy   14h                kubelet  Liveness probe failed: Get "https://10.128.5.69:9091/-/healthy": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

Port 9091 is the oauth-proxy, not thanos-querier (9090), and it says it failed to reach itself, which IMHO points to n/w issues. So this doesn't really help us debug the issue with thanos-querier-647bb8fcbd-2wx6v.

### thanos-querier-647bb8fcbd-jp5tk

I also dug into thanos-querier-647bb8fcbd-jp5tk, which isn't running on worker2 (it runs on worker4) but has restarted. The logs show network-related errors as the reason for thanos-querier to restart.

```
    State:          Running
      Started:      Wed, 23 Mar 2022 01:21:48 +1000
    Last State:     Terminated
      Reason:       Error
      Message:      I to query storeset" address=10.129.4.13:10901
level=info ts=2022-03-22T15:19:17.911377617Z caller=storeset.go:463 component=storeset msg="adding new storeAPI to query storeset" address=10.129.4.13:10901 extLset="{prometheus=\"openshift-user-workload-monitoring/user-workload\", prometheus_replica=\"prometheus-user-workload-0\"}"
level=warn ts=2022-03-22T15:20:22.887799038Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
level=info ts=2022-03-22T15:20:22.888327839Z caller=storeset.go:426 component=storeset msg="removing store because it's unhealthy or does not exist" address=10.128.4.21:10901 extLset="{prometheus=\"openshift-user-workload-monitoring/user-workload\", prometheus_replica=\"prometheus-user-workload-1\"}"
level=warn ts=2022-03-22T15:20:27.889438131Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
level=warn ts=2022-03-22T15:20:32.897093152Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
level=warn ts=2022-03-22T15:20:37.918250249Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
level=warn ts=2022-03-22T15:20:42.918787711Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
      Exit Code:    137
      Started:      Wed, 23 Mar 2022 01:19:12 +1000
      Finished:     Wed, 23 Mar 2022 01:21:45 +1000
    Ready:          True
    Restart Count:  1
```

The exit code 137 indicates that the pod was terminated due to OOM, and the errors point to network issues. If you extract the IPs and look for the pods that failed to communicate, you can find that these are user-workload pods running on worker2 and worker1:

```
❯ ag =10\. current.log | ag err | sed -e 's|.*address=||g' | sort -u | cut -f1 -d:
10.128.4.21
10.131.2.15

❯ ag '10.128.4.21|10.131.2.15' pods.txt
221:openshift-user-workload-monitoring   prometheus-user-workload-1     5/5   Running   0             21d   10.128.4.21   worker2-cl1-dc99.s-ocp.cloud.avm.de   <none>   <none>
222:openshift-user-workload-monitoring   thanos-ruler-user-workload-0   3/3   Running   8 (13h ago)   21d   10.131.2.15   worker1-cl1-dc99.s-ocp.cloud.avm.de   <none>   <none>
...
```

As you can see, this is the only information I am able to extract from the must-gather. Would you be able to provide the heap info of thanos-querier along with the must-gather, like in the original report, when this happens, to get an idea of where the memory is allocated and in use the most?

So far my conclusion is that failing to reach the user-workload pods may be the reason that a lot of errors are generated, which could be hogging the memory.
It looks very suspicious from the heap profile indeed. I uploaded it here: https://share.polarsignals.com/6b0ae0c/

From the initial look, it looks like a very high number of targets with some warnings. Once the user hits this situation again, I wonder if we can grab their `/api/v1/targets` output to see the situation.

Otherwise, I created an upstream issue: https://github.com/thanos-io/thanos/issues/5257

It might be worth digging further and optimising this path in the case of a large number of targets (plus designing some limit on how many targets query supports).
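For reference, a rough sketch of how the `/api/v1/targets` output mentioned above could be grabbed and summarised. The querier URL and bearer token are placeholders; the response shape follows the standard Prometheus targets API.

```
// targets_count.go - fetch /api/v1/targets from the querier and print a rough
// count of active/dropped targets, to gauge whether a very large target set is
// in play. The URL and token below are placeholders for illustration only.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type targetsResponse struct {
	Status string `json:"status"`
	Data   struct {
		ActiveTargets  []json.RawMessage `json:"activeTargets"`
		DroppedTargets []json.RawMessage `json:"droppedTargets"`
	} `json:"data"`
}

func main() {
	req, err := http.NewRequest("GET",
		"https://thanos-querier.example.invalid/api/v1/targets", nil) // placeholder route
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer sha256~REDACTED") // placeholder token

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var tr targetsResponse
	if err := json.NewDecoder(resp.Body).Decode(&tr); err != nil {
		panic(err)
	}
	fmt.Printf("status=%s active=%d dropped=%d\n",
		tr.Status, len(tr.Data.ActiveTargets), len(tr.Data.DroppedTargets))
}
```

The counts alone would indicate whether the "very high number of targets" hypothesis holds for the affected cluster.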
@philipp.dallig

It seems https://github.com/thanos-io/thanos/issues/4676#issuecomment-922413448 may be the root cause of the issue, which is now fixed upstream and is part of 0.24. This explains why you don't see the same issue when you replace the image with Thanos 0.24.0.
Great that you found the bug and that there is already a bugfix. Will the bugfix be backported to the older Thanos versions that OpenShift uses, so that other OpenShift users don't accidentally run into the same problem?

I think it is no longer necessary for me to provide more information. It would be nice if this ticket gets links to the backport tickets.
Thank you very much @sthaha, you have helped me a great deal, and thank you @bplotka for finding the root cause.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069