Bug 2055305 - Thanos Querier high CPU and memory usage till OOM
Summary: Thanos Querier high CPU and memory usage till OOM
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: x86_64
OS: Linux
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Sunil Thaha
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 2071479
 
Reported: 2022-02-16 16:09 UTC by philipp.dallig
Modified: 2022-08-10 10:50 UTC
CC: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2071479 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:49:51 UTC
Target Upstream Version:
Embargoed:


Attachments
MemoryProfile (136.44 KB, image/svg+xml), 2022-02-16 16:09 UTC, philipp.dallig
allocsMemory (49.47 KB, application/gzip), 2022-02-16 16:10 UTC, philipp.dallig
goroutine.txt (63.99 KB, text/plain), 2022-02-16 16:10 UTC, philipp.dallig
heap.pprof (49.41 KB, application/gzip), 2022-02-16 16:11 UTC, philipp.dallig
top.txt (616 bytes, text/plain), 2022-02-16 16:13 UTC, philipp.dallig


Links
Github thanos-io/thanos issue 5257 (open): querier: Unexpected high heap usage for Targets API, last updated 2022-03-31 00:42:28 UTC
Red Hat Knowledge Base (Solution) 6965902, last updated 2022-07-04 15:40:33 UTC
Red Hat Product Errata RHSA-2022:5069, last updated 2022-08-10 10:50:23 UTC

Description philipp.dallig 2022-02-16 16:09:07 UTC
Created attachment 1861508 [details]
MemoryProfile

Description of problem:
After launching Thanos Querier, I noticed that Thanos consumes a lot of CPU and memory. Because the Querier deployment has no memory or CPU limit, kubelet does not terminate the process. Finally, the kernel terminates the Thanos pod.

Version-Release number of selected component (if applicable):
Thanos-Query: 0.22.0


How reproducible:
I can't reproduce the error every time. I think it is some kind of race condition during startup. 


Steps to Reproduce:
1. Kill Thanos-Querier Pod
2. Query the Prometheus API endpoint of Thanos Querier with a lot of PromQLs

Actual results:
Thanos Querier sometimes consumes a lot of CPU and memory.

Expected results:
Normal Thanos startup using only a few resources.

Additional info:
I found Bugzilla Bug 1906496, which was very helpful in getting additional information about the Thanos process.

Comment 1 philipp.dallig 2022-02-16 16:10:02 UTC
Created attachment 1861509 [details]
allocsMemory

Comment 2 philipp.dallig 2022-02-16 16:10:49 UTC
Created attachment 1861510 [details]
goroutine.txt

Comment 3 philipp.dallig 2022-02-16 16:11:22 UTC
Created attachment 1861511 [details]
heap.pprof

Comment 4 philipp.dallig 2022-02-16 16:13:12 UTC
Created attachment 1861512 [details]
top.txt

Comment 5 philipp.dallig 2022-02-17 08:18:41 UTC
It seems that the memory leak can be triggered by using the dashboard, because the dashboard also queries the metrics via the query component.

Comment 6 philipp.dallig 2022-02-17 10:37:30 UTC
I can't reproduce the problem with Thanos v0.24.0 at the moment. Let's see what time reveals with Thanos v0.24.0.

Comment 7 Sunil Thaha 2022-02-21 05:09:47 UTC
@philipp.dallig 

> After launching Thanos Querier, I noticed that Thanos consumes a lot of CPU and memory. Because the Querier deployment has no memory or CPU limit, kubelet does not terminate the process. Finally, the kernel terminates the Thanos pod.

I have been trying to reproduce this without much luck. How often are you able to reproduce this? 

Could you please confirm that by "launching Thanos Querier" you mean the thanos-querier deployment in the openshift-monitoring namespace?

Could you also attach a must-gather when this happens? 


> I can't reproduce the problem with Thanos v0.24.0 at the moment. Let's see what time reveals with Thanos v0.24.0.

How are you testing this?

Comment 8 Sunil Thaha 2022-02-21 06:24:02 UTC
From the pprof, it seems there is a memory leak in errors.callers

(pprof) list github.com/pkg/errors.callers
Total: 7.25GB
ROUTINE ======================== github.com/pkg/errors.callers in /go/src/github.com/improbable-eng/thanos/vendor/github.com/pkg/errors/stack.go
    4.36GB     4.36GB (flat, cum) 60.08% of Total
         .          .    160:   return f
         .          .    161:}
         .          .    162:
         .          .    163:func callers() *stack {
         .          .    164:   const depth = 32
    3.99GB     3.99GB    165:   var pcs [depth]uintptr
         .          .    166:   n := runtime.Callers(3, pcs[:])
  380.51MB   380.51MB    167:   var st stack = pcs[0:n]
         .          .    168:   return &st
         .          .    169:}
         .          .    170:
         .          .    171:// funcname removes the path prefix component of a function's name reported by func.Name().
         .          .    172:func funcname(name string) string {
(pprof)

Comment 9 philipp.dallig 2022-02-21 08:22:19 UTC
> Could you please confirm that by "launching Thanos Querier" you mean the thanos-querier deployment in the openshift-monitoring namespace?

Yes, I mean the thanos-querier deployment in the openshift-monitoring namespace.

> How are you testing this?

I built and installed my own OKD release with the following commands, which just replace the Thanos image with the official v0.24.0 from quay.io.

oc adm release new --from-release quay.io/openshift/okd:4.9.0-0.okd-2022-01-29-035536 --to-image harbor.avm.de/openshift/okd-thanos-bugfix:4.9.0-0.okd-2022-01-29-035536-1 thanos=quay.io/thanos/thanos:v0.24.0
oc adm upgrade --to-image harbor.avm.de/openshift/okd-thanos-bugfix:4.9.0-0.okd-2022-01-29-035536-1 --force --allow-explicit-upgrade

This was the only change I made, and I was unsuccessful in reproducing the error afterwards.

> How often are you able to reproduce this?

Approximately 30% of the time, after both querier pods were terminated at the same time and massively parallel queries were run during the startup of the two new querier pods.

> Could you also attach a must-gather when this happens?

At the moment I'm not able to reproduce the error on my production cluster. I will downgrade my staging cluster and hopefully send you a must-gather report soon.
By the way, I checked the log output of the thanos-querier pod while it was in the error situation; the logs looked normal.

Comment 10 Sunil Thaha 2022-02-21 23:32:56 UTC
Thank you for the info. 

On a related note, you can set resource limits on thanos-querier by modifying the cluster-monitoring-config ConfigMap read by the cluster-monitoring-operator.

e.g. 

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |-
    thanosQuerier:
      resources:
        limits: { ... }
```
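
For instance, here is a minimal sketch with purely illustrative values (the `resources` field takes standard Kubernetes resource requirements, so pick requests and limits that match your cluster):

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |-
    thanosQuerier:
      resources:
        requests:
          cpu: 100m      # illustrative value
          memory: 256Mi  # illustrative value
        limits:
          cpu: "1"       # illustrative value
          memory: 1Gi    # illustrative value
```

With a memory limit set, the container is OOM-killed and restarted when it crosses the limit, instead of the whole node coming under memory pressure first.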

Comment 11 Sunil Thaha 2022-02-22 04:13:25 UTC
Something that stood out in the heap.pprof is that Thanos seems to be holding too many errors on the heap (I don't know why yet), which can contribute to a memory leak.

To define what I mean by "memory leak": holding more in memory (heap) than is required.
Here is my analysis...

The existing `callers` in pkg/errors/stack.go is below:
```
174: func callers() *stack {
175:   const depth = 32
176:   var pcs [depth]uintptr
177:   n := runtime.Callers(3, pcs[:])
178:   var st stack = pcs[0:n]
179:   return &st
180: }

```

What I gather from the above is that it tries to allocate `pcs` on the stack so
that it doesn't need to be garbage collected. But that isn't really what happens.

Running `go build -gcflags='-m -m' vendor/github.com/pkg/errors/stack.go 2> tmp/stack.go` shows

```
stack.go:178:6: st escapes to heap:
stack.go:178:6:   flow: ~r0 = &st:
stack.go:178:6:     from &st (address-of) at vendor/github.com/pkg/errors/stack.go:179:9
stack.go:178:6:     from return &st (return) at vendor/github.com/pkg/errors/stack.go:179:2
stack.go:176:6: pcs escapes to heap:  <<< This shouldn't happen >>>
stack.go:176:6:   flow: st = &pcs:
stack.go:176:6:     from pcs (address-of) at vendor/github.com/pkg/errors/stack.go:178:20
stack.go:176:6:     from pcs[0:n] (slice) at vendor/github.com/pkg/errors/stack.go:178:20
stack.go:176:6:     from st = pcs[0:n] (assign) at vendor/github.com/pkg/errors/stack.go:178:6
stack.go:176:6: moved to heap: pcs
stack.go:178:6: moved to heap: st
```

`pcs` does escape to the heap, so instead of holding `n` items (pcs[:n]) on the heap, `st`, which
is a slice of `pcs`, actually keeps all 32 entries alive for every error that is created (but not yet
garbage collected). This is also confirmed by the profiler:

```
(pprof) top 3
Showing nodes accounting for 1267.21MB, 100% of 1267.25MB total
      flat  flat%   sum%        cum   cum%
 1108.23MB 87.45% 87.45%  1108.23MB 87.45%  github.com/pkg/errors.callers
   94.30MB  7.44% 94.89%  1202.53MB 94.89%  github.com/pkg/errors.New (inline)
   64.68MB  5.10%   100%  1267.21MB   100%  main.moo (inline)

(pprof) list github.com/pkg/errors.callers

Total: 1.24GB
ROUTINE ======================== github.com/pkg/errors.callers
    1.08GB     1.08GB (flat, cum) 87.45% of Total
         .          .    173:
         .          .    174:func callers() *stack {
         .          .    175:   const depth = 32
 1012.96MB  1012.96MB    176:   var pcs [depth]uintptr    <<< escapes to heap
         .          .    177:   n := runtime.Callers(3, pcs[:])
   95.27MB    95.27MB    178:   var st stack = pcs[0:n]
         .          .    179:   return &st
         .          .    180:}
         .          .    181:
```

A better way is to prevent `pcs` from escaping to the heap, as below:


```
func callers() *stack {
	const depth = 32
	var mypcs [depth]uintptr
	n := runtime.Callers(3, mypcs[:])

	var myST = make(stack, n)
	copy(myST, mypcs[:n])
	return &myST
}
```

```
(pprof) top 3
      flat  flat%   sum%        cum   cum%
  324.49MB 67.81% 67.81%   324.49MB 67.81%  github.com/pkg/errors.callers
   89.34MB 18.67% 86.48%   413.83MB 86.48%  github.com/pkg/errors.New (inline)
   64.68MB 13.52%   100%   478.51MB   100%  main.moo

(pprof) list github.com/pkg/errors.callers
Total: 478.55MB

ROUTINE ======================== github.com/pkg/errors.callers
  324.49MB   324.49MB (flat, cum) 67.81% of Total
         .          .    164:func callers() *stack {
         .          .    165:   const depth = 32
         .          .    166:   var mypcs [depth]uintptr
         .          .    167:   n := runtime.Callers(3, mypcs[:])
         .          .    168:
  324.49MB   324.49MB    169:   var myST = make(stack, n)
         .          .    170:   copy(myST, mypcs[:n])
         .          .    171:   return &myST
         .          .    172:}

```

Both versions hold the same number of errors in memory, but the second version is
significantly better in terms of allocations.
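
To make the allocation difference concrete, here is a minimal, self-contained Go benchmark sketch (not part of the bug report; the package and identifier names are made up) that pits the two variants against each other via `go test -bench . -benchmem`:

```
package errstack

import (
	"runtime"
	"testing"
)

// stack mirrors the type used by pkg/errors.
type stack []uintptr

// callersOriginal mirrors the existing pkg/errors implementation: returning a
// pointer to a slice of the local array forces the whole 32-entry array onto
// the heap, as shown by the escape analysis above.
func callersOriginal() *stack {
	const depth = 32
	var pcs [depth]uintptr
	n := runtime.Callers(3, pcs[:])
	var st stack = pcs[0:n]
	return &st
}

// callersCopy mirrors the proposed variant: only n entries are heap-allocated
// and the fixed-size array stays on the stack.
func callersCopy() *stack {
	const depth = 32
	var pcs [depth]uintptr
	n := runtime.Callers(3, pcs[:])
	st := make(stack, n)
	copy(st, pcs[:n])
	return &st
}

var sink *stack // keeps the compiler from optimising the calls away

func BenchmarkCallersOriginal(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = callersOriginal()
	}
}

func BenchmarkCallersCopy(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = callersCopy()
	}
}
```

The first variant keeps the full 32-entry backing array alive for every returned stack, while the second allocates only the entries actually captured, which is consistent with the roughly 3x drop in flat memory for `callers` shown in the profiles above.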

Unfortunately https://github.com/pkg/errors says "This repository has been archived by the owner. It is now read-only."

Comment 12 Sunil Thaha 2022-03-01 01:26:53 UTC
@philipp.dallig 

What I gather from the pprof is that /api/v1/targets is being hit and that this results in errors. Without a must-gather it is hard to understand the errors and the root cause,
but you mentioned that you are unable to reproduce this issue on 0.24.

I think these two commits may have contributed to the fix. 
* https://github.com/thanos-io/thanos/pull/4811
* https://github.com/thanos-io/thanos/pull/4805

If it is okay with you, I will close this bug as "INSUFFICIENT DATA".

Comment 13 philipp.dallig 2022-03-01 08:04:00 UTC
In the next few days I want to roll my staging cluster back to the official OKD version. I will try to reproduce the bug there and provide a must-gather file. Please keep the ticket open until then.

Comment 14 Sunil Thaha 2022-03-01 12:52:58 UTC
Absolutely, a must-gather will be really useful.

Comment 15 philipp.dallig 2022-03-10 13:08:08 UTC
Sorry for the delay.
Today I had time to reproduce the OOM.
https://clouddrive.avm.de/s/KZQsYBRdRwW77jt
Password: RIY6251apE

Output of must-gather

ClusterID: 8dbfcc9c-d2e7-4945-8b9c-86c3d984cfeb
ClusterVersion: Stable at "4.9.0-0.okd-2022-02-12-140851"
ClusterOperators:
	All healthy and stable

Comment 16 Sunil Thaha 2022-03-21 14:25:30 UTC
@philipp.dallig 

Thank you for sharing the must-gather. 

Could you please shed some light on when the must-gather was taken? 
  * Is it at the time that thanos-querier started consuming too much memory or after the event? 
  * What was the memory consumption of the Thanos Querier pods? 
  * What is the name of the pod? 



A few things stood out in the must-gather.

1. "2022-03-10T12:29:10Z" thanos-querier-647bb8fcbd-56vvw on worker3-cl1-dc99.s-ocp.cloud.avm.de got killed but I can't find the pod logs, is this the pod that went OOM?
    If that is the case, having the logs of that pod is important. 

2. "2022-03-10T12:33:44Z" worker2-cl1-dc99.s-ocp.cloud.avm.de is experiencing memory-pressure 
   * Is this related to the thanos-querier pod - thanos-querier-647bb8fcbd-2dpbg running on that node which is going to run OOM soon? 

The logs indicate network issues before the node started experiencing memory-pressure. See:

2022-03-10T12:33:42.081265494Z level=warn ts=2022-03-10T12:33:42.081160578Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
2022-03-10T12:33:42.081388631Z level=warn ts=2022-03-10T12:33:42.081238634Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.18:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.18:10901
10.131.2.15:10901: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.131.2.15:10901: connect: connection refused\"" address=10.131.2.15:10901

--- 
Even the DNS SRV lookup seems to fail

2022-03-10T12:35:06.105218244Z level=error ts=2022-03-10T12:35:06.105120537Z caller=query.go:511 msg="failed to resolve addresses for storeAPIs" err="2 errors: lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": could not resolve \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": all servers responded with errors to at least one search domain. Errs ;could not resolve _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local.: no servers returned a viable answer. Errs ;resolution against server 172.30.0.10 for _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local.: exchange: read udp 10.129.4.17:35668->172.30.0.10:53: i/o timeout; lookup IP addresses \"_grpc._tcp.prometheus-operated.openshift-user-workload-monitoring.svc.cluster.local\": could not resolve \"prometheus-user-workload-1.prometheus-operated.openshift-user-workload-monitoring.svc.cluster.local.\": all servers responded with errors to at least one search domain. Errs ;could not resolve prometheus-user-workload-1.prometheus-operated.openshift-user-workload-monitoring.svc.cluster.local.: no servers returned a viable answer. Errs ;resolution against server 172.30.0.10 for prometheus-user-workload-1.prometheus-operated.openshift-user-workload-monitoring.svc.cluster.local.: exchange: read udp 10.129.4.17:60544->172.30.0.10:53: i/o timeout"

2022-03-10T12:35:07.095000434Z level=warn ts=2022-03-10T12:35:07.094911754Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.18:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.18:10901

2022-03-10T12:35:07.095000434Z level=warn ts=2022-03-10T12:35:07.094940206Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901

2022-03-10T12:35:10.129071765Z level=error ts=2022-03-10T12:35:10.128974645Z caller=query.go:514 msg="failed to resolve addresses for rulesAPIs" err="lookup SRV records \"_grpc._tcp.thanos-ruler-operated.openshift-user-workload-monitoring.svc.cluster.local\": could not resolve \"_grpc._tcp.thanos-ruler-operated.openshift-user-workload-monitoring.svc.cluster.local\": all servers responded with errors to at least one search domain. Errs ;could not resolve _grpc._tcp.thanos-ruler-operated.openshift-user-workload-monitoring.svc.cluster.local.: no servers returned a viable answer. Errs ;resolution against server 172.30.0.10 for _grpc._tcp.thanos-ruler-operated.openshift-user-workload-monitoring.svc.cluster.local.: exchange: read udp 10.129.4.17:54622->172.30.0.10:53: i/o timeout"



I am surprised that the same doesn't happen when you try 0.24, since these logs seem to point to infrastructure issues.

A bit more context about when the must-gather was taken, a timeline of events, and the pod names would be of great help in digging deeper into the issue.
Given the must-gather, it points to a flaky network rather than to any Thanos issue. As pointed out earlier, depending on the number of errors generated, Thanos can indeed consume a lot of memory.

Comment 17 philipp.dallig 2022-03-22 16:04:12 UTC
Because the last must-gather is quite old, I decided to create a new must-gather.

New must-gather:
Link: https://clouddrive.avm.de/s/GmcxrWQSQfEwzSn
Password: RIY6251apE

After triggering the error (checking with "oc adm top pod -l app.kubernetes.io/name=thanos-query") I ran the must-gather.

Please note that the error occurs quite often, but not always, after starting a thanos-query pod, so to trigger it I delete the old pods that appear to have started as expected.

Timeline after 2-3 deletions ($oc delete pods -l app.kubernetes.io/name=thanos-query):
$ oc get pods -l app.kubernetes.io/name=thanos-query
NAME                              READY   STATUS    RESTARTS   AGE
thanos-querier-647bb8fcbd-2wx6v   5/5     Running   0          92s
thanos-querier-647bb8fcbd-jp5tk   5/5     Running   0          92s
$ oc adm top pod -l app.kubernetes.io/name=thanos-query
W0322 16:20:48.787344   21292 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                              CPU(cores)   MEMORY(bytes)   
thanos-querier-647bb8fcbd-2wx6v   352m         5984Mi          
thanos-querier-647bb8fcbd-jp5tk   1140m        4934Mi    
$ oc adm must-gather
...
$ oc adm top pod -l app.kubernetes.io/name=thanos-query
W0322 16:25:14.924827   21365 top_pod.go:140] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                              CPU(cores)   MEMORY(bytes)   
thanos-querier-647bb8fcbd-2wx6v   167m         5994Mi          
thanos-querier-647bb8fcbd-jp5tk   5m           76Mi            
$ oc get pod -l app.kubernetes.io/name=thanos-query
NAME                              READY   STATUS    RESTARTS        AGE
thanos-querier-647bb8fcbd-2wx6v   4/5     Running   2 (72s ago)     6m14s
thanos-querier-647bb8fcbd-jp5tk   5/5     Running   1 (3m39s ago)   6m14s

As you can see, both thanos-querier pods have restarts.

> Is it at the time that thanos-querier started consuming too much memory or after the event? 

As you can see, the must-gather was taken right during the OOM event.

> What was the memory consumption of the Thanos Querier pods?

It depends on how long the Thanos querier pod is alive until the OOM killer kills the pod. 

> What is the name of the pod? 

The timeline makes this clear.

If you see memory-pressure on nodes, this is most likely related to the Thanos querier pods. Since this is a test environment, there is hardly any other workload.

I don't know where the network errors are coming from; maybe they are caused by the high memory consumption of the Thanos Querier pod. I don't notice any network problems when the pods are running normally.

I put a resource limit in the "cluster-monitoring-config" as you recommended. This seems to preempt the OOM killer of the operating system and thus minimizes the impact.

Comment 18 Sunil Thaha 2022-03-23 07:23:24 UTC
@philipp.dallig 

Thank you for painting a picture of the incident and for uploading a newer must-gather.

TL;DR: I still can't find issues in the thanos-querier pod other than network errors. 


Here is my analysis based on the must-gather provided.


> If you see memory-pressure on nodes, this is most likely related to the Thanos querier pods. Since this is a test environment, there is hardly any other workload.

I started by looking at the nodes with memory pressure, and as in the old must-gather, it is worker2.

```
❯ oc get pods -A -o wide -l app.kubernetes.io/name=thanos-query
NAMESPACE              NAME                              READY   STATUS    RESTARTS      AGE   IP             NODE                                  NOMINATED NODE   READINESS GATES
openshift-monitoring   thanos-querier-647bb8fcbd-2wx6v   4/5     Running   1 (14h ago)   14h   10.128.5.69    worker2-cl1-dc99.s-ocp.cloud.avm.de   <none>           <none>
openshift-monitoring   thanos-querier-647bb8fcbd-jp5tk   5/5     Running   1 (14h ago)   14h   10.130.5.167   worker4-cl1-dc99.s-ocp.cloud.avm.de   <none>           <none>
```

Looking at the nodes, I found that worker2 is experiencing high memory pressure, and I see that both thanos-querier pods have restarted.
So I went on to look for the reason for the restarts in the pod logs.

### thanos-querier-647bb8fcbd-2wx6v 

```
❯ oc logs -n openshift-monitoring thanos-querier-647bb8fcbd-2wx6v -c thanos-query | ag warn
```

So the thanos-query container in the pod (running on worker2) hasn't logged any warnings or errors, but the pod has restarted.

❯ kc describe pod thanos-querier-647bb8fcbd-2wx6v


```
 oauth-proxy:
    Container ID:  cri-o://2bc9e22c229bfb4eead6143aaafe297d502bd9d828d2a37631c74fdb12a0c70e
    Image:         quay.io/openshift/okd-content@sha256:0e72ed64cdf76828392ef8cfe4ee49d644a013ac00ff1590d6c53774d77615e3
    Image ID:      quay.io/openshift/okd-content@sha256:0e72ed64cdf76828392ef8cfe4ee49d644a013ac00ff1590d6c53774d77615e3
    Port:          9091/TCP
    Host Port:     0/TCP
    Args:
      -provider=openshift
      -https-address=:9091
    State:       Running
      Started:   Wed, 23 Mar 2022 01:22:19 +1000
    Last State:  Terminated
      Reason:    Error
      Message:   Cookie settings: name:_oauth_proxy secure(https):true httponly:true expiry:168h0m0s domain:<default> samesite: refresh:disabled
2022/03/22 15:19:12 main.go:156: using htpasswd file /etc/proxy/htpasswd/auth
I0322 15:19:12.678742       1 dynamic_serving_content.go:130] Starting serving::/etc/tls/private/tls.crt::/etc/tls/private/tls.key
2022/03/22 15:19:12 http.go:107: HTTPS: listening on [::]:9091
2022/03/22 15:19:34 oauthproxy.go:857: authenticated "internal" via basic auth
2022/03/22 15:19:40 oauthproxy.go:857: authenticated "internal" via basic auth
2022/03/22 15:20:04 reverseproxy.go:476: http: proxy error: context canceled
2022/03/22 15:20:10 reverseproxy.go:476: http: proxy error: context canceled
2022/03/22 15:21:35 server.go:3119: http: TLS handshake error from 10.131.2.2:53300: write tcp 10.128.5.69:9091->10.131.2.2:53300: write: connection reset by peer

2022/03/22 15:21:35 server.go:3119: http: TLS handshake error from 10.131.2.2:53310: write tcp 10.128.5.69:9091->10.131.2.2:53310: write: connection reset by peer
2022/03/22 15:21:35 reverseproxy.go:476: http: proxy error: context canceled

      Exit Code:    2
      Started:      Wed, 23 Mar 2022 01:19:12 +1000
      Finished:     Wed, 23 Mar 2022 01:22:14 +1000
    Ready:          False
    Restart Count:  1
```

If you look at the Events, we see there were ProbeError warnings:

  Warning  ProbeError      14h (x4 over 14h)  kubelet            Readiness probe error: Get "https://10.128.5.69:9091/-/ready": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
body:
  Warning  Unhealthy   14h (x4 over 14h)  kubelet  Readiness probe failed: Get "https://10.128.5.69:9091/-/ready": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  ProbeError  14h                kubelet  Liveness probe error: Get "https://10.128.5.69:9091/-/healthy": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
body:
  Warning  Unhealthy  14h  kubelet  Liveness probe failed: Get "https://10.128.5.69:9091/-/healthy": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)


And 9091 is the oauth-proxy, not thanos-querier (9090), and it says it failed to reach itself, which IMHO points to network issues.

So this doesn't really help us debug the issue with thanos-querier-647bb8fcbd-2wx6v.


### thanos-querier-647bb8fcbd-jp5tk

I also dug into thanos-querier-647bb8fcbd-jp5tk, which isn't running on worker2 (it runs on worker4) but has also restarted.
The logs show network-related errors as the reason for the thanos-querier restart:

```

  State:       Running
      Started:   Wed, 23 Mar 2022 01:21:48 +1000
    Last State:  Terminated
      Reason:    Error
      Message:   I to query storeset" address=10.129.4.13:10901
level=info ts=2022-03-22T15:19:17.911377617Z caller=storeset.go:463 component=storeset msg="adding new storeAPI to query storeset" address=10.129.4.13:10901 extLset="{prometheus=\"openshift-user-workl
oad-monitoring/user-workload\", prometheus_replica=\"prometheus-user-workload-0\"}"
level=warn ts=2022-03-22T15:20:22.887799038Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: c
ode = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
level=info ts=2022-03-22T15:20:22.888327839Z caller=storeset.go:426 component=storeset msg="removing store because it's unhealthy or does not exist" address=10.128.4.21:10901 extLset="{prometheus=\"op
enshift-user-workload-monitoring/user-workload\", prometheus_replica=\"prometheus-user-workload-1\"}"
level=warn ts=2022-03-22T15:20:27.889438131Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: c
ode = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
level=warn ts=2022-03-22T15:20:32.897093152Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: c
ode = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
level=warn ts=2022-03-22T15:20:37.918250249Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: c
ode = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901
level=warn ts=2022-03-22T15:20:42.918787711Z caller=storeset.go:570 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from 10.128.4.21:10901: rpc error: c
ode = DeadlineExceeded desc = context deadline exceeded" address=10.128.4.21:10901

      Exit Code:    137
      Started:      Wed, 23 Mar 2022 01:19:12 +1000
      Finished:     Wed, 23 Mar 2022 01:21:45 +1000
    Ready:          True
    Restart Count:  1

```

Exit code 137 indicates that the pod was terminated due to OOM, and the errors point to network issues.
If you extract the IPs and look up the pods that failed to communicate, you find that these are user-workload monitoring pods running on worker2 and worker1:

❯ ag =10\. current.log | ag err |sed -e 's|.*address=||g' | sort -u | cut -f1 -d:
10.128.4.21
10.131.2.15

❯ ag '10.128.4.21|10.131.2.15' pods.txt
221:openshift-user-workload-monitoring                 prometheus-user-workload-1                                        5/5     Running     0              21d     10.128.4.21    worker2-cl1-dc99.s-ocp.cloud.avm.de   <none>           <none>
222:openshift-user-workload-monitoring                 thanos-ruler-user-workload-0                                      3/3     Running     8 (13h ago)    21d     10.131.2.15    worker1-cl1-dc99.s-ocp.cloud.avm.de   <none>           <none>

... 

As you can see, this is the only information I am able to extract from the must-gather.

Would you be able to provide the heap profile of thanos-querier along with the must-gather, as in the original report, the next time this happens? That would give us an idea of where memory is allocated and in use the most.

So far my conclusion is that failing to reach the user-workload pods may be the reason that a lot of errors are generated, which could be hogging the memory.

Comment 19 Bartek Plotka 2022-03-29 08:49:45 UTC
It looks very suspicious from the heap profile indeed. I uploaded it here: https://share.polarsignals.com/6b0ae0c/

From the initial look, it looks like a very high number of targets with some warnings.

Once the user hits this situation again, I wonder if we can grab their `/api/v1/targets` output to see what is going on.

Otherwise, I created an upstream issue: https://github.com/thanos-io/thanos/issues/5257 

It might be worth digging further and optimising this path for the case of a large number of targets (plus designing some limit on how many targets Query supports).

Comment 20 Sunil Thaha 2022-03-31 05:25:37 UTC
@philipp.dallig 

It seems https://github.com/thanos-io/thanos/issues/4676#issuecomment-922413448 may be the root cause of the issue, which is now fixed upstream and is part of 0.24. 

This explains why you don't see the same issue when you replace the image with Thanos v0.24.0.

Comment 21 philipp.dallig 2022-03-31 09:43:11 UTC
Great that you found the bug and that there is already a bugfix. Will the bugfix be backported to the older Thanos versions that OpenShift uses, so that other OpenShift users don't accidentally run into the same problem?

I think it is no longer necessary for me to provide more information. It would be nice if this ticket got links to the backport tickets.

Thank you very much @sthaha and thank you @

Comment 22 philipp.dallig 2022-03-31 09:44:47 UTC
Thank you very much @sthaha, you have helped me a great deal, and thank you @bplotka for finding the root cause.

Comment 35 errata-xmlrpc 2022-08-10 10:49:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

