Bug 1982302 - Prometheus replica is consuming high CPU in requests from prometheus-adapter
Summary: Prometheus replica is consuming high CPU in requests from prometheus-adapter
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Prashant Balachandran
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-14 16:25 UTC by Marco Braga
Modified: 2021-09-17 05:16 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-26 08:23:28 UTC
Target Upstream Version:
Embargoed:



Description Marco Braga 2021-07-14 16:25:01 UTC
## Description of problem:

The customer is facing high CPU usage in one Prometheus replica at a time. Capturing packets from that replica, we can see that ~97% of the packets come from the prometheus-adapter pods:

~~~
Pkt Count Src IP        % Pkt   Pod
16        192.168.44.10  0.00%	
46        192.168.0.8    0.01%	
129       192.168.44.1   0.02%	
130       192.168.0.1    0.02%	
133       192.168.43.1   0.02%	
135       192.168.6.1    0.03%	
139       192.168.3.1    0.03%	
200       192.168.44.14  0.04%	
713       192.168.42.1   0.13%	
11482     192.168.2.147  2.16%	
188090    192.168.44.21  35.32%  prometheus-adapter-*f8
331257    192.168.44.9   62.21%  prometheus-adapter-*bt 
~~~

From "query log" file, we can see the most frequent queries  is using the metric `container_cpu_usage_seconds_total` and `container_memory_working_set_bytes`, 727 of 773.

~~~
$ zcat log_query_prom.log.gz |strings |grep query |awk -F'{"query":"' '{print$2}' |wc -l
773

$ zcat log_query_prom.log.gz |strings |grep query |awk -F'{"query":"' '{print$2}' |awk -F'{' '{print$1}' |sort |uniq -c |sort -n |tail 
      1 sum(rate(container_cpu_usage_seconds_total
      1 sum(rate(node_cpu_seconds_total
      1 sum without(device) (rate(node_network_receive_bytes_total
      2 container_memory_cache
      2 (kubelet_volume_stats_available_bytes
      2 (node_filesystem_files_free
      3 container_memory_rss
     21 ((sum(rate(apiserver_request_duration_seconds_count
    334 sum(irate(container_cpu_usage_seconds_total
    383 sum(container_memory_working_set_bytes
~~~

The queries that measure container CPU and memory are defined in `configmap/adapter-config."config.yaml".resourceRules.memory.containerQuery` and `...resourceRules.cpu.containerQuery` in the `openshift-monitoring` namespace. Their query count is much higher than that of the other active queries.

~~~
$ oc get cm -n openshift-monitoring adapter-config -o yaml |grep container
        
        "containerLabel": "container"
        "containerQuery": "sum(irate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,container!=\"POD\",container!=\"\",pod!=\"\"}[5m])) by (<<.GroupBy>>)"
        
        "containerLabel": "container"
        "containerQuery": "sum(container_memory_working_set_bytes{<<.LabelMatchers>>,container!=\"POD\",container!=\"\",pod!=\"\"}) by (<<.GroupBy>>)"
~~~

Looking at the Prometheus HTTP request metrics[1], we can see an unbalanced request count across replicas on the `/api/v1/query` handler. One replica receives most of the requests to this handler at a time, leading to high CPU usage on that replica only.

[1] Prometheus HTTP metrics
- prometheus_http_requests_total:
~~~
sum(rate(prometheus_http_requests_total[5m])) by (handler,pod)
~~~
- prometheus_http_request_duration_seconds_count
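For reference, an analogous rate query over the duration-count metric (a sketch; the handler filter is optional):
~~~
sum(rate(prometheus_http_request_duration_seconds_count{handler="/api/v1/query"}[5m])) by (handler,pod)
~~~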


Autoscaling information (a sketch of such an HPA follows):
- total number of HPAs: 224
- the most frequent target metric is CPU, with some memory targets as well
- the reference/object type used is DeploymentConfig (only)
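
For illustration, a minimal sketch of one such HPA (CPU utilization target, DeploymentConfig reference); the names and values are hypothetical:
~~~
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app        # hypothetical name
  namespace: example-ns    # hypothetical namespace
spec:
  scaleTargetRef:
    apiVersion: apps.openshift.io/v1
    kind: DeploymentConfig
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
~~~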


## Version-Release number of selected component (if applicable): 4.7.11

## How reproducible: 

At the moment: often, when using HPAs

## Steps to Reproduce:

1. Create a high number of HPAs
2. Check the metrics:
- Prometheus HTTP request metrics by handler and pod
- CPU usage of Prometheus by replica

## Actual results:

As described above, a high (unbalanced) number of requests to one replica leads to high resource usage, in this case CPU.

## Expected results:

- Requests could be balanced between the Prometheus replicas
- Review the HPA limitations vs. Prometheus performance
- Check if we can improve the performance of the queries done by prometheus-adapter
- Check if we can reduce the frequency at which those metrics are requested from Prometheus

## Additional info:

- Do we have any matrix of HPA limits vs. Prometheus resource requirements?
- The `prometheus-k8s` service uses `sessionAffinity: ClientIP`, which is why the requests appear to be unbalanced across the pods (excerpt below):
https://github.com/openshift/cluster-monitoring-operator/blob/a054912abb0f5144bcfc772e59dfaf2ea02edd23/assets/prometheus-k8s/service.yaml#L28
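
An assumed excerpt of that Service showing the relevant field (the selector and ports may differ slightly; see the linked asset for the authoritative definition):
~~~
apiVersion: v1
kind: Service
metadata:
  name: prometheus-k8s
  namespace: openshift-monitoring
spec:
  ports:
  - name: web
    port: 9091
    targetPort: web
  selector:
    app: prometheus
    prometheus: k8s
  sessionAffinity: ClientIP  # pins each client IP (e.g. a prometheus-adapter pod) to one replica
~~~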

Comment 2 Jan Fajerski 2021-07-15 07:42:25 UTC
Dropping severity as this isn't associated with loss of function.

Can you please share more details about the HPA setup? How tight is the HPA sync loop?

Comment 3 Marco Braga 2021-07-16 16:46:32 UTC
Hi Jan,

> Can you please share more details the HPA setup? How tight is the HPA sync loop?

I believe the sync loop you mentioned corresponds to the kube-controller-manager flag `--horizontal-pod-autoscaler-sync-period`. I didn't find any explicit change in the OCP deployment - here is part of the kube-controller-manager container flags:

```
--controllers=* --controllers=-bootstrapsigner\
      \ --controllers=-tokencleaner --controllers=-ttl 
```

So, looking at the doc[1], the HPA controller is enabled by default and, since no flag is defined for this resource, the sync loop seems to be the default of `15s` (see the example below).

[1]  https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
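
For reference, setting the flag explicitly to its default would look like this on the kube-controller-manager command line (illustrative only; we have not changed it):

```
--horizontal-pod-autoscaler-sync-period=15s
```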

Comment 5 Marco Braga 2021-07-16 20:10:18 UTC
Questions/ideas to discuss to help us create a workaround:
A) Is it an option to use/create some recording rule for the 'containerQuery' expression?
B) To load balance the requests: is it an option to change the 'prometheus-k8s' service sessionAffinity from ClientIP to None?
C) Prometheus resource usage: is there any tuning on Prometheus to increase the performance for this kind of usage (the query defined in containerQuery) run in parallel? The current Prometheus resources are (no limit was defined):
> Note: the node running the replica pod has 64 GiB and 12 vCPUs; this node is almost dedicated to this pod
~~~
    resources:
      requests:
        cpu: 70m
        memory: 1Gi
~~~
D) Config: based on the HPA objects, is there any advice for decreasing the number of requests made by the HPAs to Prometheus?
E) Could we reach better performance using autoscaling/v2beta2? I see that we can write more precise policies (e.g. changing periodSeconds); would that decrease the number of requests to Prometheus?

Any ideas to provide relief / decrease the pressure on the current replica?

Comment 6 Jan Fajerski 2021-07-19 10:04:43 UTC
(In reply to Marco Braga from comment #5)
> Questions/ideas to discuss to help us to create a workaround:
> A) Is it a option to use/create some rule to expression 'containerQuery' ?
> B) To load balance the requests. Is it a option to change the
> 'prometheus-k8s' service session from ClientIP to None?
> C) prometheus resource usage: do we have any tune on prometheus to increase
> the performance on this kind of usage (query defined on containerQuery) in
> parallel ? Current prometheus resources are (no limit was defined):
> > Note: the node running the replica pod has 64GiB w/ 12 vCPU , this node is almost dedicated to this pod
> ~~~
>     resources:
>       requests:
>         cpu: 70m
>         memory: 1Gi
> ~~~
> D) Config: based on HPA objects, do we have any advice to decrease the
> amount of requests doing by hpa to Prometheus? 
> E) Could we reach a better performance when using autoscaling/v2beta2 ? I
> see that we can write more accurate policies (Eg change periodSeconds), will
> it decrease the amount of requests to Prometheus?
> 
> Any idea to provide relief/decrease the pressure on current replica?

I'm still getting familiar with the whole HPA story; maybe Prashant has additional insights, but here are a few thoughts.

A) This might be a good approach but needs more research and testing before we can roll it out. So this seems less like a workaround and more like a potential long-term fix.
B) I think the current setup is intentional. The TSDBs of the two Prometheus replicas can differ slightly in sample timestamps (since each replica scrapes metrics individually, quite likely at slightly different times) and that might impact queries in unexpected ways.

D and E) I think setting periodSeconds in each HPA object would make sense as a workaround. The default is 15 seconds, i.e. within 15 seconds all HPA objects will fire a query against Prometheus. Unless the customer has a strong need for such tight autoscaler behavior, the period should be longer; basically, the longer the period, the lower the load on the system.
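
A sketch of what that could look like on an autoscaling/v2beta2 HPA; the behavior stanza and its values below are illustrative examples, not recommendations:
~~~
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60    # window used by this scale-up policy
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 120   # window used by this scale-down policy
~~~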

Comment 7 Damien Grisonnet 2021-07-20 11:45:20 UTC
To me, the best option would be to use recording rules here. As you mentioned previously, the HPA will make requests to prometheus-adapter every 15 seconds, resulting in queries to Prometheus. However, it is not useful to fetch container metrics every 15 seconds since they only change every 30 seconds based on the kubelet scrape interval, so at least half of the queries made by the HPA are wasted. We could control that better with recording rules: since we know the scrape interval of the metrics, we could set the evaluation interval to a meaningful value, and the expressions would be evaluated only once instead of once per HPA. I think the reason we didn't do it in the first place is that we never encountered a use case with so many HPAs that Prometheus would end up using too much CPU.
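
A rough sketch of what such a recording rule could look like (not something we ship today; the object name, group name, and interval are assumptions):
~~~
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-adapter-rules      # hypothetical
  namespace: openshift-monitoring
spec:
  groups:
  - name: prometheus-adapter.rules    # hypothetical
    interval: 30s                     # aligned with the kubelet scrape interval
    rules:
    - record: container_cpu_usage_seconds_total:sum:irate5m
      expr: |
        sum(
          irate(container_cpu_usage_seconds_total{container!="POD",container!="",pod!=""}[5m])
        ) by (node,namespace,pod,container)
~~~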

I think it would make sense to load balance the requests between the different Prometheus instances. Even if there is a slight difference in the values, it shouldn't amount to much since we only care about the 4 latest data points when computing the CPU/memory usage.

I don't think setting periodSeconds will help. As far as I understand it, it is bound to the policies, so the queries will still be executed every 15 seconds and the HPA will only take actions based on the periodSeconds defined in the policy. I may be wrong here, but I think our only option would be to customize the `--horizontal-pod-autoscaler-sync-period` flag at the controller level, and that's not something we want either: for node metrics we have a scrape interval of 15 seconds, so we want the HPA to query prometheus-adapter every 15 seconds to stay as reactive as possible.

Comment 8 Simon Pasquier 2021-07-20 12:18:24 UTC
> I think the reason we didn't do it in the first place was because we never encountered a use case with that many HPA that would lead to using too much CPU in Prometheus.

Another reason against recording rules would be that enabling them for all containers would consume additional CPU/RAM resources while it's only useful for resources that are scaled by the HPA.

> Even if there is a slight difference in the values, it shouldn't amount to much since we only care about the 4 latest data points when computing the CPU/memory usages.

One edge case would be when prometheus-k8s-0 scrapes container metrics successfully and prometheus-k8s-1 doesn't (because of network issues for instance). At least with today's implementation, you get a consistent view (either the HPA gets samples or it doesn't). prometheus-adapter should probably use Thanos query instead but that would double the CPU consumption since it would eventually query both prometheus instances :-/

Comment 9 Marco Braga 2021-07-20 17:54:19 UTC
@Jan, @Simon, @damien: OK, thanks for the feedback!

So I can discard the option of changing the controller flag; as I see it, we have no short-term option there to reduce the load on Prometheus, since changing the sync period could prevent the HPA from keeping up with the node scrape interval.

Digging into the cluster metrics, specifically `changes()` over the metric kube_hpa_status_desired_replicas, I saw a low frequency of scaling triggers for a huge number of the HPAs (~80%): for almost the whole period evaluated, the value was 0. So we are making requests every 15s on behalf of 'idle' HPAs that rarely scale. I am validating whether this is correct, so that in the short term we could decrease the number of calls (HPA objects) in this specific environment (see the query sketch below).
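
Something like the following query is what I mean (a sketch; the range is arbitrary):
~~~
sum(changes(kube_hpa_status_desired_replicas[24h])) by (namespace, hpa) == 0
~~~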

For the middle and long term, we are open to discussing the options.

@Simon:

> Another reason against recording rules would be that enabling them for all containers would consume additional CPU/RAM resources while it's only useful for resources that are scaled by the HPA.

Just to make this clearer for me, looking at Prometheus' internals:
- considering `sample1` (a random/original metric with 10+ labels) and `sample2` (the new series created by `rule1`): could running `query1` be more expensive (or equally expensive) for Prometheus overall (handler, TSDB, etc.) than `query2`?
- or is it only the evaluation of the rule that consumes more RAM/CPU, while improving the queries?
- could dropping those labels from the original sample impact other components?

[sample1]:
~~~
container_memory_working_set_bytes{
  container="3scale-operator",
  endpoint="https-metrics",
  id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8502d769_ab06_4517_8c81_fd4bea976ad3.slice/crio-287b3678447b162be755b9f7d0c2fe399ee45d6dd5a88a91c028fb82d4d9d59f.scope",
  image="registry.redhat.io/3scale-amp2/3scale-rhel7-operator@sha256:655062177bc53b155b87876dc1096530c365f18f9be3ceb0d32aa2d343968f9a",instance="10.10.94.249:10250",
  job="kubelet",
  metrics_path="/metrics/cadvisor",name="k8s_3scale-operator_3scale-operator-58cd76d755-mclrs_3scale_8502d769-ab06-4517-8c81-fd4bea976ad3_0",
  namespace="3scale",
  node="worker-1.sharedocp4upi46.lab.rdu2.",
  pod="3scale-operator-58cd76d755-mclrs",
  service="kubelet"
}
~~~

[sample2]:
~~~
container_cpu_usage_seconds_total:sum:irate5m{
  container="3scale-operator",
  name="k8s_3scale-operator_3scale-operator-58cd76d755-mclrs_3scale_8502d769-ab06-4517-8c81-fd4bea976ad3_0",
  namespace="3scale",
  node="worker-1.sharedocp4upi46.lab.rdu2.",
  pod="3scale-operator-58cd76d755-mclrs"
}
~~~

[rule1]
~~~
      - expr: |
          sum(
            irate(
              container_cpu_usage_seconds_total{container!=\"POD\",container!=\"\",pod!=\"\"}[5m]
            )
          ) by (node,namespace,pod,container,name)
        record: container_cpu_usage_seconds_total:sum:irate5m

~~~

[query1]:
~~~
sum(irate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,container!=\"POD\",container!=\"\",pod!=\"\"}[5m])) by (<<.GroupBy>>)
~~~

[query2]:
~~~
sum(container_cpu_usage_seconds_total:sum:irate5m{<<.LabelMatchers>>}) by (<<.GroupBy>>)
~~~


> One edge case would be when prometheus-k8s-0 scrapes container metrics successfully and prometheus-k8s-1 doesn't (because of network issues for instance). At least with today's implementation, you get a consistent view (either the HPA gets samples or it doesn't). prometheus-adapter should probably use Thanos query instead but that would double the CPU consumption since it would eventually query both prometheus instances :-/

Do you think that could be done by changing the `configmap/prometheus-adapter-prometheus-config` key `data."prometheus-config.yaml".clusters[].cluster.server` to something like `https://thanos-querier.openshift-monitoring.svc:9091` (see the sketch below)? Is the Thanos API endpoint fully compatible with the Prometheus API?
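
For clarity, a hypothetical excerpt of that kubeconfig-style key with only the server URL changed (I have not validated this):
~~~
# data."prometheus-config.yaml" (excerpt, assumed shape)
clusters:
- name: prometheus-k8s
  cluster:
    server: https://thanos-querier.openshift-monitoring.svc:9091
~~~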

Comment 15 Prashant Balachandran 2021-07-22 13:19:59 UTC
Marco, could you please share the prometheus adapter logs?

Comment 18 Prashant Balachandran 2021-07-23 08:06:34 UTC
Marco, from the prometheus-adapter logs, it seems like there is a process which is querying the metrics API for node metrics and pod metrics almost twice a minute. This process is not related to the HPA; it is the reason for the queries across all namespaces.

Comment 19 Prashant Balachandran 2021-07-23 08:26:49 UTC
The process is a python client named pykube-ng.

