## Description of problem:

The customer is facing high CPU usage in one Prometheus replica at a time. When capturing packets from that replica, we can see that ~97% of the packets come from the prometheus-adapter pods:

~~~
Pkt Count   Src IP           % Pkt    Pod
16          192.168.44.10     0.00%
46          192.168.0.8       0.01%
129         192.168.44.1      0.02%
130         192.168.0.1       0.02%
133         192.168.43.1      0.02%
135         192.168.6.1       0.03%
139         192.168.3.1       0.03%
200         192.168.44.14     0.04%
713         192.168.42.1      0.13%
11482       192.168.2.147     2.16%
188090      192.168.44.21    35.32%   prometheus-adapter-*f8
331257      192.168.44.9     62.21%   prometheus-adapter-*bt
~~~

From the query log file, we can see that the most frequent queries use the metrics `container_cpu_usage_seconds_total` and `container_memory_working_set_bytes` (727 of 773):

~~~
$ zcat log_query_prom.log.gz |strings |grep query |awk -F'{"query":"' '{print$2}' |wc -l
773
$ zcat log_query_prom.log.gz |strings |grep query |awk -F'{"query":"' '{print$2}' |awk -F'{' '{print$1}' |sort |uniq -c |sort -n |tail
      1 sum(rate(container_cpu_usage_seconds_total
      1 sum(rate(node_cpu_seconds_total
      1 sum without(device) (rate(node_network_receive_bytes_total
      2 container_memory_cache
      2 (kubelet_volume_stats_available_bytes
      2 (node_filesystem_files_free
      3 container_memory_rss
     21 ((sum(rate(apiserver_request_duration_seconds_count
    334 sum(irate(container_cpu_usage_seconds_total
    383 sum(container_memory_working_set_bytes
~~~

The metrics used to measure container CPU and memory are defined in `configmap/adapter-config."config.yaml".resourceRules.memory.containerQuery` and `...resourceRules.cpu.containerQuery` in the `openshift-monitoring` namespace. Their query count is far higher than that of the other active queries.
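As a sanity check, the ~97% adapter share quoted above can be recomputed from the raw packet counts; this is a small illustrative script of mine, not part of the original capture tooling:

```python
# Recompute per-source percentages from the packet-capture summary above.
# Counts are copied from the capture; pod names are only known for the two
# prometheus-adapter sources.
counts = {
    "192.168.44.10": 16, "192.168.0.8": 46, "192.168.44.1": 129,
    "192.168.0.1": 130, "192.168.43.1": 133, "192.168.6.1": 135,
    "192.168.3.1": 139, "192.168.44.14": 200, "192.168.42.1": 713,
    "192.168.2.147": 11482,
    "192.168.44.21": 188090,  # prometheus-adapter-*f8
    "192.168.44.9": 331257,   # prometheus-adapter-*bt
}
total = sum(counts.values())
adapter = counts["192.168.44.21"] + counts["192.168.44.9"]
for ip, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{n:>8} {ip:<16} {100 * n / total:6.2f}%")
print(f"prometheus-adapter share: {100 * adapter / total:.2f}%")
```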
~~~
$ oc get cm -n openshift-monitoring adapter-config -o yaml |grep container
    "containerLabel": "container"
    "containerQuery": "sum(irate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,container!=\"POD\",container!=\"\",pod!=\"\"}[5m])) by (<<.GroupBy>>)"
    "containerLabel": "container"
    "containerQuery": "sum(container_memory_working_set_bytes{<<.LabelMatchers>>,container!=\"POD\",container!=\"\",pod!=\"\"}) by (<<.GroupBy>>)"
~~~

Looking at the Prometheus HTTP request metrics[1], we can see an unbalanced request count across replicas on the `/api/v1/query` handler. One replica receives most of the requests for this handler at a time, leading to high CPU usage on that replica only.

[1] Prometheus HTTP metrics:
- prometheus_http_requests_total:
~~~
sum(rate(prometheus_http_requests_total[5m])) by (handler,pod)
~~~
- prometheus_http_request_duration_seconds_count

Autoscaling information:
- total number of HPAs: 224
- the most frequent target metric is CPU; there is a mix of memory too
- the reference/object type used is DeploymentConfig (only)

## Version-Release number of selected component (if applicable):
4.7.11

## How reproducible:
ATM: often when using HPAs

## Steps to Reproduce:
1. Create a high number of HPAs
2. Check the metrics:
   - HTTP Prometheus request metrics by handler and pod
   - CPU usage of Prometheus by replica

## Actual results:
As described above, a high number of requests goes to one replica (unbalanced), leading to high resource usage, in this case CPU.

## Expected results:
- Requests could be balanced between the Prometheus replicas
- Review HPA limitations vs Prometheus performance
- Check whether we can improve the performance of the queries issued by prometheus-adapter
- Check whether we can reduce the frequency at which those metrics are requested from Prometheus

## Additional info:
- Do we have any matrix of HPA limits vs Prometheus resource requirements?
- The `prometheus-k8s` service uses `sessionAffinity: ClientIP`; this is why the requests appear unbalanced across pods: https://github.com/openshift/cluster-monitoring-operator/blob/a054912abb0f5144bcfc772e59dfaf2ea02edd23/assets/prometheus-k8s/service.yaml#L28
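For reference, this is the shape of the setting in question; a sketch of the relevant part of the `prometheus-k8s` Service (field names as in the core/v1 Service API, port details assumed from a typical OpenShift deployment):

```yaml
# Sketch of the prometheus-k8s Service spec. With ClientIP affinity, all
# requests from a given prometheus-adapter pod stick to one Prometheus replica.
apiVersion: v1
kind: Service
metadata:
  name: prometheus-k8s
  namespace: openshift-monitoring
spec:
  sessionAffinity: ClientIP   # "None" would spread requests across replicas
  ports:
  - name: web
    port: 9091
    targetPort: web
```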
Dropping severity as this isn't associated with loss of function. Can you please share more details about the HPA setup? How tight is the HPA sync loop?
Hi Jan,

> Can you please share more details about the HPA setup? How tight is the HPA sync loop?

I believe the sync loop you mentioned comes from the kube-controller-manager flag `--horizontal-pod-autoscaler-sync-period`. I didn't find any explicit change in the OCP deployment; this is the relevant part of the kube-controller-manager container flags:

```
--controllers=* --controllers=-bootstrapsigner --controllers=-tokencleaner --controllers=-ttl
```

So, looking at the docs[1], the HPA controller is enabled by default, and since no flag is defined for this resource, the sync loop seems to be `15s`.

[1] https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
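With 224 HPA objects (from the report above) and a 15 s sync period, the implied query load on Prometheus can be roughly estimated; the two-queries-per-HPA assumption (one CPU, one memory query per sync) is mine:

```python
# Back-of-the-envelope estimate of the query rate generated by the HPA sync loop.
hpa_count = 224          # from the bug report
sync_period_s = 15       # default --horizontal-pod-autoscaler-sync-period
queries_per_hpa = 2      # assumption: one CPU + one memory query per sync

qps = hpa_count * queries_per_hpa / sync_period_s
per_minute = qps * 60
print(f"~{qps:.1f} queries/s, ~{per_minute:.0f} queries/min against Prometheus")
```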
Questions/ideas to discuss to help us create a workaround:

A) Is it an option to use/create a recording rule for the 'containerQuery' expression?

B) To load balance the requests: is it an option to change the 'prometheus-k8s' service sessionAffinity from ClientIP to None?

C) Prometheus resource usage: is there any tuning on Prometheus to increase performance for this kind of usage (the query defined in containerQuery) in parallel? The current Prometheus resources are (no limit was defined):

> Note: the node running the replica pod has 64GiB w/ 12 vCPU; this node is almost dedicated to this pod

~~~
resources:
  requests:
    cpu: 70m
    memory: 1Gi
~~~

D) Config: based on the HPA objects, is there any advice to decrease the number of requests made by the HPAs to Prometheus?

E) Could we reach better performance using autoscaling/v2beta2? I see that we can write more accurate policies (e.g. change periodSeconds); will that decrease the number of requests to Prometheus?

Any idea to provide relief/decrease the pressure on the current replica?
(In reply to Marco Braga from comment #5)
> Questions/ideas to discuss to help us to create a workaround:
> A) Is it a option to use/create some rule to expression 'containerQuery' ?
> B) To load balance the requests. Is it a option to change the
> 'prometheus-k8s' service session from ClientIP to None?
> C) prometheus resource usage: do we have any tune on prometheus to increase
> the performance on this kind of usage (query defined on containerQuery) in
> parallel ? Current prometheus resources are (no limit was defined):
> > Note: the node running the replica pod has 64GiB w/ 12 vCPU , this node is almost dedicated to this pod
> ~~~
> resources:
>   requests:
>     cpu: 70m
>     memory: 1Gi
> ~~~
> D) Config: based on HPA objects, do we have any advice to decrease the
> amount of requests doing by hpa to Prometheus?
> E) Could we reach a better performance when using autoscaling/v2beta2 ? I
> see that we can write more accurate policies (Eg change periodSeconds), will
> it decrease the amount of requests to Prometheus?
>
> Any idea to provide relief/decrease the pressure on current replica?

I'm still getting familiar with the whole HPA story; maybe Prashant has additional insights, but here are a few thoughts.

A) This might be a good approach but needs more research and testing before we can roll it out. So this seems less like a workaround and more like a potential long-term fix.

B) I think the current setup is intentional. The TSDBs of the two Prometheus replicas can differ slightly in sample timestamps (since each replica scrapes metrics individually, quite likely at slightly different times), and that might impact queries in unexpected ways.

D and E) I think setting periodSeconds in each HPA object would make sense as a workaround. The default is 15 seconds, i.e. within 15 seconds all HPA objects will fire a query against Prometheus. Unless the customer has a strong need for such tight autoscaler behavior, the period should be longer.
Basically, the longer the period, the lower the system load; it should be set as long as possible.
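For reference, periodSeconds lives under `spec.behavior` in autoscaling/v2beta2; a sketch of what such an HPA could look like, with hypothetical object and target names:

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa            # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps.openshift.io/v1
    kind: DeploymentConfig     # matches the object type used in this cluster
    name: example-app          # hypothetical
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60   # policy window; note this does not change the 15 s sync loop
```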
To me, the best option would be to use recording rules here. As mentioned previously, the HPA makes requests to prometheus-adapter every 15 seconds, resulting in queries to Prometheus. However, it is not useful to fetch the data every 15 seconds for container metrics, since they only change every 30 seconds based on the kubelet scrape interval, so at least half of the queries made by the HPA are useless. We could control that better with recording rules: we could set the evaluation interval to a meaningful value (since we know the scrape interval of the metrics), and this would also make sure that the expressions are evaluated only once, not once per HPA. I think the reason we didn't do it in the first place is that we never encountered a use case with so many HPAs that it would consume too much CPU in Prometheus.

I think it would make sense to load balance the requests between the different Prometheus instances. Even if there is a slight difference in the values, it shouldn't amount to much, since we only care about the 4 latest data points when computing the CPU/memory usage.

I don't think setting periodSeconds will help. As far as I understand it, it is bound to the scaling policies, so the queries will still be executed every 15 seconds; the HPA will only take actions based on the periodSeconds defined in the policy. I may be wrong here, but I think our only remaining option would be to customize the `--horizontal-pod-autoscaler-sync-period` flag at the controller level, and that's not something we want either: for node metrics we have a scrape interval of 15 seconds, so we want the HPA to query prometheus-adapter every 15 seconds to be as reactive as possible.
> I think the reason we didn't do it in the first place was because we never encountered a use case with that many HPA that would lead to using too much CPU in Prometheus.

Another reason against recording rules is that enabling them for all containers would consume additional CPU/RAM resources, while they are only useful for the resources that are actually scaled by an HPA.

> Even if there is a slight difference in the values, it shouldn't amount to much since we only care about the 4 latest data points when computing the CPU/memory usages.

One edge case would be when prometheus-k8s-0 scrapes container metrics successfully and prometheus-k8s-1 doesn't (because of network issues, for instance). At least with today's implementation, you get a consistent view (either the HPA gets samples or it doesn't).

prometheus-adapter should probably use Thanos Query instead, but that would double the CPU consumption since it would eventually query both Prometheus instances :-/
@Jan, @Simon, @Damien: OK, thanks for the feedback! So I can discard the option of changing the controller flag: we have no short-term options to reduce the load on Prometheus that way, since changing it could prevent the controller from keeping in sync with node scrapes.

Digging into the cluster metrics, specifically `changes()` of the metric kube_hpa_status_desired_replicas, I saw a low scaling-trigger frequency for a large share of the HPAs (~80%): for almost all of the time evaluated, the value was 0. So we are sending requests every 15s for 'idle' services that rarely scale. I am validating whether this is correct, so that we could decrease the number of calls (HPA objects) in the short term in this specific environment. For the middle and long term, we are open to discussing the options.

@Simon:
> Another reason against recording rules would be that enabling them for all containers would consume additional CPU/RAM resources while it's only useful for resources that are scaled by the HPA.

Just to make this clearer for me, looking into Prometheus' internals — considering `sample1`, a random/original metric with 10+ labels, and `sample2`, the new sample created by `rule1`:
- could running `query1` be more expensive (or equally expensive) for Prometheus overall (handler, TSDB, etc.) than `query2`?
- or is it only the rule evaluation that consumes more RAM/CPU, while the queries themselves get faster?
- can we impact other components by dropping those 5 labels from the original sample?
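The idle-HPA estimate above can be expressed as a query; a sketch, with the 24h window being my assumption (adjust to the period actually evaluated):

```promql
# Fraction of HPAs whose desired replica count never changed over the window
count(changes(kube_hpa_status_desired_replicas[24h]) == 0)
  /
count(kube_hpa_status_desired_replicas)
```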
[sample1]:
~~~
container_memory_working_set_bytes{
  container="3scale-operator",
  endpoint="https-metrics",
  id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8502d769_ab06_4517_8c81_fd4bea976ad3.slice/crio-287b3678447b162be755b9f7d0c2fe399ee45d6dd5a88a91c028fb82d4d9d59f.scope",
  image="registry.redhat.io/3scale-amp2/3scale-rhel7-operator@sha256:655062177bc53b155b87876dc1096530c365f18f9be3ceb0d32aa2d343968f9a",
  instance="10.10.94.249:10250",
  job="kubelet",
  metrics_path="/metrics/cadvisor",
  name="k8s_3scale-operator_3scale-operator-58cd76d755-mclrs_3scale_8502d769-ab06-4517-8c81-fd4bea976ad3_0",
  namespace="3scale",
  node="worker-1.sharedocp4upi46.lab.rdu2.",
  pod="3scale-operator-58cd76d755-mclrs",
  service="kubelet"
}
~~~

[sample2]:
~~~
container_cpu_usage_seconds_total:sum:irate5m{
  container="3scale-operator",
  name="k8s_3scale-operator_3scale-operator-58cd76d755-mclrs_3scale_8502d769-ab06-4517-8c81-fd4bea976ad3_0",
  namespace="3scale",
  node="worker-1.sharedocp4upi46.lab.rdu2.",
  pod="3scale-operator-58cd76d755-mclrs"
}
~~~

[rule1]:
~~~
- expr: |
    sum(
      irate(
        container_cpu_usage_seconds_total{container!="POD",container!="",pod!=""}[5m]
      )
    ) by (node,namespace,pod,container,name)
  record: container_cpu_usage_seconds_total:sum:irate5m
~~~

[query1]:
~~~
sum(irate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,container!=\"POD\",container!=\"\",pod!=\"\"}[5m])) by (<<.GroupBy>>)
~~~

[query2]:
~~~
sum(container_cpu_usage_seconds_total:sum:irate5m{<<.LabelMatchers>>}) by (<<.GroupBy>>)
~~~

> One edge case would be when prometheus-k8s-0 scrapes container metrics successfully and prometheus-k8s-1 doesn't (because of network issues for instance). At least with today's implementation, you get a consistent view (either the HPA gets samples or it doesn't).
> prometheus-adapter should probably use Thanos query instead but that would double the CPU consumption since it would eventually query both prometheus instances :-/

Do you think that could be done by changing the CM/prometheus-adapter-prometheus-config key `data."prometheus-config.yaml".clusters[].cluster.server` to something like `https://thanos-querier.openshift-monitoring.svc:9091`? Is the Thanos API endpoint fully compatible with the Prometheus API?
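If that route were taken, the change would presumably look like this in the kubeconfig-style file held by that ConfigMap; a sketch only, with the cluster name and surrounding structure assumed from a typical kubeconfig:

```yaml
# prometheus-config.yaml (kubeconfig format consumed by prometheus-adapter);
# only the server URL would change to point at Thanos Querier.
apiVersion: v1
kind: Config
clusters:
- name: prometheus-k8s
  cluster:
    server: https://thanos-querier.openshift-monitoring.svc:9091
```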
Marco, could you please share the prometheus adapter logs?
Marco, from the prometheus-adapter logs, it seems there is a process querying the metrics API for node metrics and pod metrics almost twice a minute. This process is not related to the HPA, and it is the reason for the queries across all namespaces.
The process is a python client named pykube-ng.