Bug 1883765

Summary: [user workload monitoring] improve latency of Thanos sidecar when streaming read requests
Product: OpenShift Container Platform Reporter: Simon Pasquier <spasquie>
Component: Monitoring Assignee: Simon Pasquier <spasquie>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.6 CC: alegrand, anpicker, erooth, kakkoyun, lcosic, pkrupa, surbania
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:21:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
CPU utilization dashboard none

Description Simon Pasquier 2020-09-30 07:55:38 UTC
Backport https://github.com/thanos-io/thanos/pull/3146, which cuts latency by half for streaming read requests. This will significantly improve the request latency of Thanos querier (including evaluations from Thanos ruler) when a query returns many series (e.g. several hundred).

The fix should be available in Thanos v0.16.0 upstream.

Comment 1 Simon Pasquier 2020-09-30 11:42:32 UTC
https://github.com/thanos-io/thanos/pull/2783 might also be useful.

Comment 3 Simon Pasquier 2020-10-02 10:12:21 UTC
Created attachment 1718402
CPU utilization dashboard

I've tested locally with Thanos v0.16.0-rc.0 and performance is much better. For the same workload, the overall CPU usage of the prometheus pod goes from 700Mi (right side of the dashboard, 4.6 downstream version) to 450Mi (left side, upstream v0.16.0-rc.0).
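
For reference, the relative reduction implied by those two dashboard readings can be worked out as below. This is only a sketch of the arithmetic: the helper name `relative_reduction` is hypothetical, and the inputs are the two figures quoted in this comment, taken as-is.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional decrease going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Figures quoted above: 4.6 downstream build vs. upstream v0.16.0-rc.0.
reduction = relative_reduction(700, 450)
print(f"{reduction:.1%}")  # prints "35.7%"
```

So, for this particular workload, the sidecar change corresponds to roughly a one-third drop in the reported usage.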

Comment 7 Junqi Zhao 2020-11-05 07:18:33 UTC
Tested with 4.7.0-0.nightly-2020-11-05-010603: Thanos sidecar performance is improved and the Thanos version is 0.16.0.

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -g -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=thanos_build_info' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "thanos_build_info",
          "branch": "rhaos-4.7-rhel-8",
          "container": "oauth-proxy",
          "endpoint": "web",
          "goversion": "go1.15.2",
          "instance": "10.128.2.9:9091",
          "job": "thanos-querier",
          "namespace": "openshift-monitoring",
          "pod": "thanos-querier-66679569c6-gxf7p",
          "revision": "c8ee6fabf9fd5f3e44701f981204a1361965e59d",
          "service": "thanos-querier",
          "version": "0.16.0"
        },
        "value": [
          1604559215.114,
          "1"
        ]
      },
...
  }
}
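
To check the reported version across every returned series rather than reading the JSON by eye, the query response could be post-processed along these lines. This is a minimal sketch, not part of the verification above: the function name `thanos_versions` is hypothetical, and the sample payload is a trimmed copy of the output shown in this comment (only the `job` and `version` labels are kept).

```python
import json


def thanos_versions(response_text: str) -> dict:
    """Map each `job` label to the set of `version` labels it reports."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('status')}")
    versions = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        job = labels.get("job", "unknown")
        versions.setdefault(job, set()).add(labels.get("version"))
    return versions


# Trimmed sample shaped like the thanos_build_info response above.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "thanos_build_info",
                    "job": "thanos-querier",
                    "version": "0.16.0",
                },
                "value": [1604559215.114, "1"],
            }
        ],
    },
})

print(thanos_versions(sample_response))  # {'thanos-querier': {'0.16.0'}}
```

In practice the `curl` output from the command above could be piped into such a script in place of `jq`; a job that maps to more than one version would indicate a partially upgraded deployment.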

Comment 11 errata-xmlrpc 2021-02-24 15:21:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633