Bug 1993985

Summary: HPA failing to get metrics due to large delay in monitoring pipeline
Product: OpenShift Container Platform
Reporter: Stephen Benjamin <stbenjam>
Component: Node
Sub component: Kubelet
Assignee: Ryan Phillips <rphillips>
QA Contact: Weinan Liu <weinliu>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: alegrand, amuller, anpicker, aos-bugs, erooth, jchaloup, maszulik, sippy
Version: 4.9
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Environment: [sig-arch] events should not repeat pathologically
Last Closed: 2021-10-11 18:03:53 UTC
Type: Bug
Attachments:
- e2e-horizontal-pod-autoscaling-prometheus
- container_scrape_error metric for the node running the scaled pods
- container_cpu_usage_seconds_total metric for the autoscaled pods

Description Stephen Benjamin 2021-08-16 13:41:40 UTC
[sig-arch] events should not repeat pathologically:

event happened 21 times, something is wrong: ns/e2e-horizontal-pod-autoscaling-7186 horizontalpodautoscaler/rc-light - reason/FailedGetResourceMetric failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API


is failing frequently in CI, see:
https://search.ci.openshift.org/?search=event+happened+.*+resource+cpu%3A+no+metrics+returned+from+resource+metrics+API&maxAge=48h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Example job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1427221583637778432

The tests in https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go seem to be generating these alerts quite a bit - and I notice these pods do have some init containers. Is this a regression from https://bugzilla.redhat.com/show_bug.cgi?id=1867477 ?

Comment 1 Jan Chaloupka 2021-08-17 14:19:36 UTC
*** Bug 1994491 has been marked as a duplicate of this bug. ***

Comment 2 Jan Chaloupka 2021-08-18 16:24:20 UTC
From https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848 job.

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-monitoring_prometheus-adapter-669d4d75bc-f4ln6_prometheus-adapter.log:
```
E0817 09:49:17.918404       1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
E0817 09:49:33.079029       1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
...
E0817 10:17:46.716726       1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8883/rc-light-2c4sr, skipping
E0817 10:18:01.778282       1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8883/rc-light-2c4sr, skipping
```

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-monitoring_prometheus-adapter-669d4d75bc-m586m_prometheus-adapter.log:
```
E0817 09:50:04.436593       1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
E0817 09:50:34.944704       1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
...
E0817 10:09:21.313227       1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
E0817 10:10:10.628274       1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
```
Checking https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-t8ldxsiv-2a78c-6nfxc-master-0_kube-controller-manager.log:
```
E0817 09:49:02.753810       1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
E0817 09:49:17.948165       1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
...
E0817 10:11:43.548957       1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
E0817 10:11:58.787296       1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
```

Checking the https://github.com/kubernetes-sigs/prometheus-adapter code at https://github.com/kubernetes-sigs/prometheus-adapter/blob/56df87890ce67c9c27d29dbf8586934a321caba4/pkg/resourceprovider/provider.go#L191:
```
func (p *resourceProvider) assignForPod(pod apitypes.NamespacedName, resultsByNs map[string]nsQueryResults, resMetrics *[]metrics.ContainerMetrics, resTime *api.TimeInfo) {
	// check to make sure everything is present
	nsRes, nsResPresent := resultsByNs[pod.Namespace]
	if !nsResPresent {
		klog.Errorf("unable to fetch metrics for pods in namespace %q, skipping pod %s", pod.Namespace, pod.String())
		return
	}
	cpuRes, hasResult := nsRes.cpu[pod.Name]
	if !hasResult {
		klog.Errorf("unable to fetch CPU metrics for pod %s, skipping", pod.String())
		return
	}
...
```

resultsByNs is populated with the namespace, but nsRes.cpu is missing a pod.Name entry for the e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf pod. Tracing back how resultsByNs is populated, prometheus-adapter constructs a query against https://prometheus-k8s.openshift-monitoring.svc:9091 (see https://github.com/kubernetes-sigs/prometheus-adapter/blob/56df87890ce67c9c27d29dbf8586934a321caba4/cmd/config-gen/utils/default.go#L96):
```
sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>}[%s])) by (<<.GroupBy>>)
```

Just for testing purposes, you can run `sum(rate(container_cpu_usage_seconds_total{namespace=~'e2e-horizontal-pod-autoscaling.*',pod=~'rc-light-.*',pod!~'rc-light-ctrl-.*'}[1m])) by (pod)` in https://console-openshift-console.apps.XXX.devcluster.openshift.com/monitoring/query-browser to see how the query behaves when running the "[sig-autoscaling] [Feature:HPA] Horizontal pod autoscaling (scale resource: CPU) ReplicationController light Should scale from 1 pod to 2 pods [Suite:openshift/conformance/parallel] [Suite:k8s]" test locally.
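
The same query can also be run from a terminal against the in-cluster Prometheus HTTP API. A minimal sketch, assuming an active `oc` login and the default prometheus-k8s route in openshift-monitoring:
```
# Sketch: run the test query against the in-cluster Prometheus HTTP API.
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
  --data-urlencode "query=sum(rate(container_cpu_usage_seconds_total{namespace=~'e2e-horizontal-pod-autoscaling.*',pod=~'rc-light-.*',pod!~'rc-light-ctrl-.*'}[1m])) by (pod)"
```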

Unfortunately, I can't tell which response was returned by the Prometheus server, so I can't see whether the server returned an empty cpu entry for the e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf pod.
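
On a live reproduction one could also check what the adapter itself serves through the resource metrics API. A sketch, with the namespace/pod names taken from this run:
```
# Sketch: query the resource metrics API (served by prometheus-adapter) directly.
# An empty/NotFound response here would mirror the HPA's FailedGetResourceMetric events.
oc get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/e2e-horizontal-pod-autoscaling-8508/pods/rc-light-bn7rf"
```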

Comment 3 Jan Chaloupka 2021-08-19 07:06:32 UTC
Created attachment 1815500 [details]
e2e-horizontal-pod-autoscaling-prometheus

Comment 4 Jan Chaloupka 2021-08-19 07:17:46 UTC
You can use https://promecieus.dptools.openshift.org/ and feed it with the https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848 link, then run the query. The graph is attached to this BZ.

Based on the graph, the pod metrics were reported in <10:18.36; 10:23.08>. From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/openshift-e2e-test/artifacts/junit/junit_e2e_20210817-102245.xml the test ran in <09:56:18.965; 10:03:07.481>. Clearly, the pod metrics only became available ~11m after the test ended.
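
The graph can also be reproduced outside the PromCIeus UI with a range query. A sketch, assuming the PromCIeus-provisioned Prometheus is reachable (e.g. port-forwarded) at localhost:9090:
```
# Sketch: pull the per-pod CPU series for the window around the test run.
curl -sG "http://localhost:9090/api/v1/query_range" \
  --data-urlencode "query=sum(rate(container_cpu_usage_seconds_total{namespace=~'e2e-horizontal-pod-autoscaling.*',pod=~'rc-light-.*'}[1m])) by (pod)" \
  --data-urlencode "start=2021-08-17T09:50:00Z" \
  --data-urlencode "end=2021-08-17T10:25:00Z" \
  --data-urlencode "step=15s"
```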

Comment 5 Joel Smith 2021-08-26 17:03:47 UTC
Based upon Jan's work, it appears that the root cause is a delay in the metrics pipeline. I have seen metrics be a little slow on a loaded cluster, but this seems to be quite a large delay.

Could someone from the monitoring team please take a look to see why there might be such a large delay? When the E2E tests of the horizontal pod autoscaler (HPA) run, the HPA is still unable to get pod CPU metrics via the metrics endpoint served by the Prometheus adapter even 10 minutes after the pod was created.

Comment 6 Simon Pasquier 2021-08-31 12:03:28 UTC
Created attachment 1819346 [details]
container_scrape_error metric for the node running the scaled pods

Comment 7 Simon Pasquier 2021-08-31 12:05:10 UTC
Created attachment 1819347 [details]
container_cpu_usage_seconds_total metric for the autoscaled pods

Comment 8 Simon Pasquier 2021-08-31 12:26:42 UTC
From attachment 1819347 [details] (PromCIeus using https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848), we can see that the container_cpu_usage_seconds_total metric is only sporadically present until 10:12. Looking at the container_scrape_error metric for the node on which the autoscaled pods are scheduled (ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd), it shows that most of the metric scrapes fail, which is why prometheus-adapter says it can't fetch CPU metrics for the pods.

In the ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd logs [1], there are lots of stats failures for the containers matching the autoscaled pods. For instance:

```
Aug 17 09:48:31.285494 ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd hyperkube[1436]: E0817 09:48:31.284845    1436 cadvisor_stats_provider.go:415] "Partial failure issuing cadvisor.ContainerInfoV2" err="partial failures: [\"/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc431efa1_15e3_439f_89c1_a199dc51187e.slice/crio-4db2590ff86dea742386d17247d764a937ed00324292aec208c621379b146953.scope\": RecentStats: unable to find data in memory cache], [\"/kubepods.slice/kubepods-podf6b3d1f7_73b4_40b7_a1c8_b7de2302aafe.slice/crio-22525f07eaeafffe9aea50bef7dfa3bb02e8f78e03ca980e2ad46b743e1f7839.scope\": RecentStats: unable to find data in memory cache], [\"/kubepods.slice/kubepods-podeaa32efc_4e92_4fd8_9d25_e1d80b14129e.slice\": RecentStats: unable to find data in memory cache], [\"/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6a8b5fed_77a6_4b46_bc5a_79429c99c67f.slice\": RecentStats: unable to find data in memory cache]"
```
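
To gauge how frequent these failures were, the node journal from [1] can be scanned directly. A sketch, where JOURNAL_URL stands for the gcsweb artifact link below (the artifact may be compressed, in which case pipe it through gunzip first):
```
# Sketch: count cAdvisor partial-failure events in the node journal.
curl -s "$JOURNAL_URL" | grep -c 'Partial failure issuing cadvisor.ContainerInfoV2'
```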

Moving back to the Node team since AFAICT it's a problem with cAdvisor failing to expose container metrics from time to time.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/nodes/ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd/journal

Comment 12 Ryan Phillips 2021-10-11 18:03:53 UTC
Hasn't happened since the kubelet fixes went in... Closing.

*** This bug has been marked as a duplicate of bug 2011815 ***