[sig-arch] events should not repeat pathologically: event happened 21 times, something is wrong: ns/e2e-horizontal-pod-autoscaling-7186 horizontalpodautoscaler/rc-light - reason/FailedGetResourceMetric failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API

is failing frequently in CI, see:
https://search.ci.openshift.org/?search=event+happened+.*+resource+cpu%3A+no+metrics+returned+from+resource+metrics+API&maxAge=48h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Example job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1427221583637778432

The tests in https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go seem to generate these alerts quite a bit, and since I notice they do have some init containers, is this a regression from https://bugzilla.redhat.com/show_bug.cgi?id=1867477 ?
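For reference, a minimal client-go sketch of the kind of CPU-based HPA the "rc-light" test sets up (this is not the actual test code; the namespace, object names, and the 50% CPU target are illustrative assumptions). It is the reconciliation of such an object that produces the FailedGetResourceMetric events when the resource metrics API has no CPU samples for the target pods:

```go
package main

import (
	"context"

	autoscalingv1 "k8s.io/api/autoscaling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	minReplicas := int32(1)
	targetCPU := int32(50) // assumed utilization target, for illustration only
	hpa := &autoscalingv1.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "rc-light", Namespace: "e2e-horizontal-pod-autoscaling"},
		Spec: autoscalingv1.HorizontalPodAutoscalerSpec{
			// The HPA targets a ReplicationController, like the "rc-light" e2e test does.
			ScaleTargetRef: autoscalingv1.CrossVersionObjectReference{
				Kind:       "ReplicationController",
				Name:       "rc-light",
				APIVersion: "v1",
			},
			MinReplicas:                    &minReplicas,
			MaxReplicas:                    2,
			TargetCPUUtilizationPercentage: &targetCPU,
		},
	}

	// The kube-controller-manager reconciles this object by asking the resource
	// metrics API for the target pods' CPU usage; when nothing comes back it emits
	// the FailedGetResourceMetric events seen in the failing jobs.
	if _, err := client.AutoscalingV1().HorizontalPodAutoscalers(hpa.Namespace).
		Create(context.TODO(), hpa, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```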
*** Bug 1994491 has been marked as a duplicate of this bug. ***
From https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848 job.

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-monitoring_prometheus-adapter-669d4d75bc-f4ln6_prometheus-adapter.log:

```
E0817 09:49:17.918404 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
E0817 09:49:33.079029 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
...
E0817 10:17:46.716726 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8883/rc-light-2c4sr, skipping
E0817 10:18:01.778282 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8883/rc-light-2c4sr, skipping
```

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-monitoring_prometheus-adapter-669d4d75bc-m586m_prometheus-adapter.log:

```
E0817 09:50:04.436593 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
E0817 09:50:34.944704 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
...
E0817 10:09:21.313227 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
E0817 10:10:10.628274 1 provider.go:187] unable to fetch CPU metrics for pod e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf, skipping
```

Checking https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-t8ldxsiv-2a78c-6nfxc-master-0_kube-controller-manager.log:

```
E0817 09:49:02.753810 1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
E0817 09:49:17.948165 1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
...
E0817 10:11:43.548957 1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
E0817 10:11:58.787296 1 horizontal.go:227] failed to compute desired number of replicas based on listed metrics for ReplicationController/e2e-horizontal-pod-autoscaling-8508/rc-light: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
```

Checking the https://github.com/kubernetes-sigs/prometheus-adapter code at https://github.com/kubernetes-sigs/prometheus-adapter/blob/56df87890ce67c9c27d29dbf8586934a321caba4/pkg/resourceprovider/provider.go#L191:

```
func (p *resourceProvider) assignForPod(pod apitypes.NamespacedName, resultsByNs map[string]nsQueryResults, resMetrics *[]metrics.ContainerMetrics, resTime *api.TimeInfo) {
	// check to make sure everything is present
	nsRes, nsResPresent := resultsByNs[pod.Namespace]
	if !nsResPresent {
		klog.Errorf("unable to fetch metrics for pods in namespace %q, skipping pod %s", pod.Namespace, pod.String())
		return
	}
	cpuRes, hasResult := nsRes.cpu[pod.Name]
	if !hasResult {
		klog.Errorf("unable to fetch CPU metrics for pod %s, skipping", pod.String())
		return
	}
	...
```

resultsByNs is populated with the namespace, but the pod.Name entry for the e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf pod is missing from nsRes.cpu. Tracing back how resultsByNs is populated, the prometheus-adapter constructs a query against https://prometheus-k8s.openshift-monitoring.svc:9091 (see https://github.com/kubernetes-sigs/prometheus-adapter/blob/56df87890ce67c9c27d29dbf8586934a321caba4/cmd/config-gen/utils/default.go#L96):

```
sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>}[%s])) by (<<.GroupBy>>)
```

Just for testing purposes, you can run `sum(rate(container_cpu_usage_seconds_total{namespace=~'e2e-horizontal-pod-autoscaling.*',pod=~'rc-light-.*',pod!~'rc-light-ctrl-.*'}[1m])) by (pod)` in https://console-openshift-console.apps.XXX.devcluster.openshift.com/monitoring/query-browser to see how the query behaves while running the "[sig-autoscaling] [Feature:HPA] Horizontal pod autoscaling (scale resource: CPU) ReplicationController light Should scale from 1 pod to 2 pods [Suite:openshift/conformance/parallel] [Suite:k8s]" test locally. Unfortunately, I can't tell which response was returned by the Prometheus server, so I can't see whether the server returned an empty cpu entry for the e2e-horizontal-pod-autoscaling-8508/rc-light-bn7rf pod.
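For context, here is a minimal sketch (not part of the adapter) of how one could replay the adapter-style CPU query against the in-cluster Prometheus with the Prometheus Go client and check whether a given pod shows up. The address, the namespace/pod names, and the omitted TLS/bearer-token setup are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	// NOTE: TLS and bearer-token auth are omitted; in a real cluster you would
	// supply an api.Config.RoundTripper that injects the token and trusts the CA.
	client, err := api.NewClient(api.Config{
		Address: "https://prometheus-k8s.openshift-monitoring.svc:9091",
	})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Mirrors the default adapter CPU query, with the label matchers filled in by hand
	// for the pod the adapter reported as missing.
	query := `sum(rate(container_cpu_usage_seconds_total{namespace="e2e-horizontal-pod-autoscaling-8508",pod="rc-light-bn7rf"}[1m])) by (pod)`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}

	vec, ok := result.(model.Vector)
	if !ok || len(vec) == 0 {
		// This is the situation the adapter logs as
		// "unable to fetch CPU metrics for pod ..., skipping".
		fmt.Println("no CPU samples for the pod")
		return
	}
	for _, s := range vec {
		fmt.Printf("pod=%s cpu=%v\n", s.Metric["pod"], s.Value)
	}
}
```

If this returns an empty vector at the moment the HPA is being reconciled, the adapter hits exactly the "unable to fetch CPU metrics for pod ..., skipping" branch quoted above.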
Created attachment 1815500 [details] e2e-horizontal-pod-autoscaling-prometheus
You can use https://promecieus.dptools.openshift.org/ and feed it the https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848 link, then run the query there. The resulting graph is attached to this BZ. Based on the graph, the pod metrics were reported in <10:18.36; 10:23.08>. From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/openshift-e2e-test/artifacts/junit/junit_e2e_20210817-102245.xml, the test ran in <09:56:18.965; 10:03:07.481>. Clearly, the pod metrics only became available ~11m after the test ended.
Based upon Jan's work, it appears that the root cause is a delay in the metrics pipeline. I have seen metrics be a little slow on a loaded cluster, but this seems to be quite a large delay. Could someone from the monitoring team please take a look to see why there might be such a large delay? When the E2E tests exercise the horizontal pod autoscaler (HPA), even 10 minutes after the pod is created the HPA is still unable to get pod CPU metrics from the metrics endpoint served by the Prometheus adapter.
Created attachment 1819346 [details] container_scrape_error metric for the node running the scaled pods
Created attachment 1819347 [details] container_cpu_usage_seconds_total metric for the autoscaled pods
From attachment 1819347 [details] (PromCIeus using https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848), we can see that the container_cpu_usage_seconds_total metric is only sporadically present until 10:12.

Looking at the container_scrape_error metric (attachment 1819346 [details]) for the node on which the autoscaled pods are scheduled (ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd), it shows that most of the metric scrapes fail, which is why prometheus-adapter says it can't fetch CPU metrics for the pods.

In the ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd logs [1], there are lots of stats failures for the containers matching the autoscaled pods. For instance:

```
Aug 17 09:48:31.285494 ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd hyperkube[1436]: E0817 09:48:31.284845 1436 cadvisor_stats_provider.go:415] "Partial failure issuing cadvisor.ContainerInfoV2" err="partial failures: [\"/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc431efa1_15e3_439f_89c1_a199dc51187e.slice/crio-4db2590ff86dea742386d17247d764a937ed00324292aec208c621379b146953.scope\": RecentStats: unable to find data in memory cache], [\"/kubepods.slice/kubepods-podf6b3d1f7_73b4_40b7_a1c8_b7de2302aafe.slice/crio-22525f07eaeafffe9aea50bef7dfa3bb02e8f78e03ca980e2ad46b743e1f7839.scope\": RecentStats: unable to find data in memory cache], [\"/kubepods.slice/kubepods-podeaa32efc_4e92_4fd8_9d25_e1d80b14129e.slice\": RecentStats: unable to find data in memory cache], [\"/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod6a8b5fed_77a6_4b46_bc5a_79429c99c67f.slice\": RecentStats: unable to find data in memory cache]"
```

Moving back to the Node team, since AFAICT it's a problem with cAdvisor failing to expose container metrics from time to time.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26377/pull-ci-openshift-origin-master-e2e-gcp/1427556641120718848/artifacts/e2e-gcp/gather-extra/artifacts/nodes/ci-op-t8ldxsiv-2a78c-6nfxc-worker-a-mrdqd/journal
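To gauge how often this happens, here is a rough sketch (assuming the node journal linked in [1] has been downloaded locally as ./journal, a hypothetical filename) that counts the kubelet's cAdvisor partial-failure log lines and the per-container "RecentStats" misses:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Hypothetical local copy of the node journal from gather-extra.
	f, err := os.Open("journal")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	partialLines, recentStatsMisses := 0, 0
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // journal lines can be very long
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "Partial failure issuing cadvisor.ContainerInfoV2") {
			partialLines++
		}
		// Each affected container contributes one "RecentStats" entry per failure line.
		recentStatsMisses += strings.Count(line, "RecentStats: unable to find data in memory cache")
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
	fmt.Printf("partial-failure log lines: %d, missing RecentStats container entries: %d\n",
		partialLines, recentStatsMisses)
}
```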
Hasn't happened since the kubelet fixes went in... Closing. *** This bug has been marked as a duplicate of bug 2011815 ***