[sig-arch] events should not repeat pathologically: event happened 21 times, something is wrong: ns/e2e-horizontal-pod-autoscaling-7186 horizontalpodautoscaler/rc-light - reason/FailedGetResourceMetric failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API

is failing frequently in CI, see:
https://search.ci.openshift.org/?search=event+happened+.*+resource+cpu%3A+no+metrics+returned+from+resource+metrics+API&maxAge=48h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Example job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1427221583637778432

The tests in https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go seem to generate these alerts quite a bit, and since I notice they do have some init containers, is this a regression from https://bugzilla.redhat.com/show_bug.cgi?id=1867477 ?
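For reference, a minimal client-go sketch of the kind of CPU-based HPA the "rc-light" test sets up (this is not the actual test code; the namespace, object names, and the 50% CPU target are illustrative assumptions). It is the reconciliation of such an object that produces the FailedGetResourceMetric events when the resource metrics API has no CPU samples for the target pods:

```go
package main

import (
	"context"

	autoscalingv1 "k8s.io/api/autoscaling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	minReplicas := int32(1)
	targetCPU := int32(50) // assumed utilization target, for illustration only
	hpa := &autoscalingv1.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "rc-light", Namespace: "e2e-horizontal-pod-autoscaling"},
		Spec: autoscalingv1.HorizontalPodAutoscalerSpec{
			// The HPA targets a ReplicationController, like the "rc-light" e2e test does.
			ScaleTargetRef: autoscalingv1.CrossVersionObjectReference{
				Kind:       "ReplicationController",
				Name:       "rc-light",
				APIVersion: "v1",
			},
			MinReplicas:                    &minReplicas,
			MaxReplicas:                    2,
			TargetCPUUtilizationPercentage: &targetCPU,
		},
	}

	// The kube-controller-manager reconciles this object by asking the resource
	// metrics API for the target pods' CPU usage; when nothing comes back it emits
	// the FailedGetResourceMetric events seen in the failing jobs.
	if _, err := client.AutoscalingV1().HorizontalPodAutoscalers(hpa.Namespace).
		Create(context.TODO(), hpa, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```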
*** Bug 1994491 has been marked as a duplicate of this bug. ***
From https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848 job.

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-monitoring_prometheus-adapter-669d4d75bc-f4ln6_prometheus-adapter.log:

```
E0817 09:49:17.918404 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
E0817 09:49:33.079029 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
...
E0817 10:17:46.716726 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8883/rc-light-2c4sr, skipping
E0817 10:18:01.778282 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8883/rc-light-2c4sr, skipping
```

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-monitoring_prometheus-adapter-669d4d75bc-m586m_prometheus-adapter.log:

```
E0817 09:50:04.436593 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
E0817 09:50:34.944704 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
...
E0817 10:09:21.313227 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
E0817 10:10:10.628274 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
```

Checking https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-t8ldxsiv-2a78c-6nfxc-master-0_kube-controller-manager.log:

```
E0817 09:49:02.753810 1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
E0817 09:49:17.948165 1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
...
E0817 10:11:43.548957 1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
E0817 10:11:58.787296 1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
```

Checking the https://github.com/kubernetes-sigs/prometheus-adapter code at https://github.com/kubernetes-sigs/prometheus-adapter/blob/56df87890ce67c9c27d29dbf8586934a321caba4/pkg/resourceprovider/provider.go#L191:

```
func (p *resourceProvider) assignForPod(pod apitypes.NamespacedName, resultsByNs map[string]nsQueryResults, resMetrics *[]metrics.ContainerMetrics, resTime *api.TimeInfo) {
	// check to make sure everything is present
	nsRes, nsResPresent := resultsByNs[pod.Namespace]
	if !nsResPresent {
		klog.Errorf("unable to fetch metrics for pods in namespace %q, skipping pod %s", pod.Namespace, pod.String())
		return
	}
	cpuRes, hasResult := nsRes.cpu[pod.Name]
	if !hasResult {
		klog.Errorf("unable to fetch CPU metrics for pod %s, skipping", pod.String())
		return
	}
	...
```

resultsByNs is populated with the namespace, but the pod.Name entry for the e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf pod is missing from nsRes.cpu. Tracing back how resultsByNs is populated, the prometheus-adapter constructs a query against https://prometheus-k8s.openshift-monitoring.svc:9091 (see https://github.com/kubernetes-sigs/prometheus-adapter/blob/56df87890ce67c9c27d29dbf8586934a321caba4/cmd/config-gen/utils/default.go#L96):

```
sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>}[%s])) by (<<.GroupBy>>)
```

Just for testing purposes, you can run `sum(rate(container_cpu_usage_seconds_total{namespace=~'e2e-horizontal-pod-autoscaling.*',pod=~'rc-light-.*',pod!~'rc-light-ctrl-.*'}[1m])) by (pod)` in https://console-openshift-console.apps.XXX.devcluster.openshift.com/monitoring/query-browser to see how the query behaves while running the "[sig-autoscaling] [Feature:HPA] Horizontal pod autoscaling (scale resource: CPU) ReplicationController light Should scale from 1 pod to 2 pods [Suite:openshift/conformance/parallel] [Suite:k8s]" test locally. Unfortunately, I can't tell which response was returned by the Prometheus server, so I can't see whether the server returned an empty cpu entry for the e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf pod.
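For context, here is a minimal sketch (not part of the adapter) of how one could replay the adapter-style CPU query against the in-cluster Prometheus with the Prometheus Go client and check whether a given pod shows up. The address, the namespace/pod names, and the omitted TLS/bearer-token setup are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	// NOTE: TLS and bearer-token auth are omitted; in a real cluster you would
	// supply an api.Config.RoundTripper that injects the token and trusts the CA.
	client, err := api.NewClient(api.Config{
		Address: "https://prometheus-k8s.openshift-monitoring.svc:9091",
	})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Mirrors the default adapter CPU query, with the label matchers filled in by hand
	// for the pod the adapter reported as missing.
	query := `sum(rate(container_cpu_usage_seconds_total{namespace="e2e-horizontal-pod-autoscaling-8508",pod="rc-light-bn7rf"}[1m])) by (pod)`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}

	vec, ok := result.(model.Vector)
	if !ok || len(vec) == 0 {
		// This is the situation the adapter logs as
		// "unable to fetch CPU metrics for pod ..., skipping".
		fmt.Println("no CPU samples for the pod")
		return
	}
	for _, s := range vec {
		fmt.Printf("pod=%s cpu=%v\n", s.Metric["pod"], s.Value)
	}
}
```

If this returns an empty vector at the moment the HPA is being reconciled, the adapter hits exactly the "unable to fetch CPU metrics for pod ..., skipping" branch quoted above.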
Created attachment 1815500 [details] e2e-horizontal-pod-autoscaling-prometheus
You can use https://promecieus.dptools.openshift.org/ and feed it the https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848 link, then run the query there. The resulting graph is attached to this BZ. Based on the graph, the pod metrics were reported in <10:18.36; 10:23.08>. From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/openshift-e2e-test/artifacts/junit/junit_e2e_20210817-102245.xml, the test ran in <09:56:18.965; 10:03:07.481>. Clearly, the pod metrics only became available ~11m after the test ended.
Based upon Jan's work, it appears that the root cause is a delay in the metrics pipeline. I have seen metrics be a little slow on a loaded cluster, but this seems to be quite a large delay. Could someone from the monitoring team please take a look to see why there might be such a large delay? When the E2E tests exercise the horizontal pod autoscaler (HPA), even 10 minutes after the pod is created the HPA is still unable to get pod CPU metrics from the metrics endpoint served by the Prometheus adapter.
Created attachment 1819346 [details] container_scrape_error metric for the node running the scaled pods
Created attachment 1819347 [details] container_cpu_usage_seconds_total metric for the autoscaled pods
From attachment 1819347 [details] (PromCIeus using https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848), we can see that the container_cpu_usage_seconds_total metric is only sporadically present until 10:12.

Looking at the container_scrape_error metric (attachment 1819346 [details]) for the node on which the autoscaled pods are scheduled (ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd), it shows that most of the metric scrapes fail, which is why prometheus-adapter says it can't fetch CPU metrics for the pods.

In the ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd logs [1], there are lots of stats failures for the containers matching the autoscaled pods. For instance:

```
Aug 17 09:48:31.285494 ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd hyperkube[1436]: E0817 09:48:31.284845 1436 cadvisor_stats_provider.go:415] "Partial failure issuing cadvisor.ContainerInfoV2" err="partial failures: [\"/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc431efa1_15e3_439f_89c1_a199dc51187e.slice/crio-4db2590ff86dea742386d17247d764a937ed00324292aec208c621379b146953.scope\": RecentStats: unable to find data in memory cache], [\"/kubepods.slice/kubepods-podf6b3d1f7_73b4_40b7_a1c8_b7de2302aafe.slice/crio-22525f07eaeafffe9aea50bef7dfa3bb02e8f78e03ca980e2ad46b743e1f7839.scope\": RecentStats: unable to find data in memory cache], [\"/kubepods.slice/kubepods-podeaa32efc_4e92_4fd8_9d25_e1d80b14129e.slice\": RecentStats: unable to find data in memory cache], [\"/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6a8b5fed_77a6_4b46_bc5a_79429c99c67f.slice\": RecentStats: unable to find data in memory cache]"
```

Moving back to the Node team, since AFAICT it's a problem with cAdvisor failing to expose container metrics from time to time.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/nodes/ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd/journal
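To gauge how often this happens, here is a rough sketch (assuming the node journal linked in [1] has been downloaded locally as ./journal, a hypothetical filename) that counts the kubelet's cAdvisor partial-failure log lines and the per-container "RecentStats" misses:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Hypothetical local copy of the node journal from gather-extra.
	f, err := os.Open("journal")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	partialLines, recentStatsMisses := 0, 0
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // journal lines can be very long
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "Partial failure issuing cadvisor.ContainerInfoV2") {
			partialLines++
		}
		// Each affected container contributes one "RecentStats" entry per failure line.
		recentStatsMisses += strings.Count(line, "RecentStats: unable to find data in memory cache")
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
	fmt.Printf("partial-failure log lines: %d, missing RecentStats container entries: %d\n",
		partialLines, recentStatsMisses)
}
```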
Hasn't happened since the kubelet fixes went in... Closing. *** This bug has been marked as a duplicate of bug 2011815 ***