Description of problem:
The "Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics" test fails often. The test is implemented here:
https://github.com/openshift/origin/blob/ac8ca36f59f94c4413c0571ec7a9c8d9b2430fbe/test/extended/prometheus/prometheus.go#L390-L402

Version-Release number of selected component (if applicable):

How reproducible:
https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+should+have+non-Pod+host+cAdvisor+metrics&maxAge=120h0m0s&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:
1.
2.
3.

Actual results:
Test fails

Expected results:
Test should not fail

Additional info:
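For reference, a minimal, self-contained sketch of how the failing assertion can be reproduced by hand against the in-cluster Prometheus, using the official client_golang API. The query string is only an approximation of what the origin test asserts (cAdvisor series whose cgroup id is not under /kubepods.slice); the Prometheus address and the omitted bearer-token authentication are assumptions, not the actual test code:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Address and authentication are assumptions: the in-cluster Prometheus
	// requires a bearer token, which would normally be supplied through
	// api.Config.RoundTripper (omitted here for brevity).
	client, err := api.NewClient(api.Config{
		Address: "https://prometheus-k8s.openshift-monitoring.svc:9091",
	})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Approximation of the assertion: count cAdvisor CPU series whose cgroup
	// id is NOT under /kubepods.slice, i.e. host (non-Pod) cgroups such as
	// system services. The exact expression in the origin test may differ.
	query := `count(container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*",id!=""})`

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result) // an empty/zero result here reproduces the test failure
}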
Created attachment 1773270 [details]
cAdvisor metrics

I replayed the metrics from this CI run [1] and was able to find cAdvisor metrics for system services. I attached the result of the query that is tested in origin to this BZ but, as far as I can tell, the failure seems to be caused by scraping errors.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1121/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1383816160142692352
Created attachment 1773286 [details]
container_scrape_error metric

The container_scrape_error metric shows that cAdvisor can't get container metrics most of the time (see the attached screenshot). Reassigning to the Node team for investigation.
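To quantify "most of the time", the raw container_scrape_error series can be pulled over a window with a range query. A small sketch, assuming a v1.API handle built as in the earlier snippet (package and function names are hypothetical):

package promdebug

import (
	"context"
	"fmt"
	"time"

	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// scrapeErrorHistory returns the container_scrape_error samples for the last
// hour; samples with value 1 mark scrapes where cAdvisor returned no
// container metrics.
func scrapeErrorHistory(ctx context.Context, promAPI v1.API) (model.Value, error) {
	r := v1.Range{
		Start: time.Now().Add(-1 * time.Hour),
		End:   time.Now(),
		Step:  30 * time.Second,
	}
	result, warnings, err := promAPI.QueryRange(ctx, "container_scrape_error", r)
	if err != nil {
		return nil, err
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	return result, nil
}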
The issue is with the service account creation. Maybe there is a race in the test between creating the service account and creating the exec pod?

ErrStatus: {
    TypeMeta: {Kind: "", APIVersion: ""},
    ListMeta: {
        SelfLink: "",
        ResourceVersion: "",
        Continue: "",
        RemainingItemCount: nil,
    },
    Status: "Failure",
    Message: "pods \"execpod\" is forbidden: error looking up service account e2e-test-prometheus-f9wt8/default: serviceaccount \"default\" not found",
    Reason: "Forbidden",
    Details: {Name: "execpod", Group: "", Kind: "pods", UID: "", Causes: nil, RetryAfterSeconds: 0},
    Code: 403,
},
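If that race is confirmed, one possible mitigation on the test side would be to wait for the namespace's "default" service account to exist before creating the exec pod. A minimal sketch under that assumption, using client-go (the package, function name, and polling intervals are hypothetical):

package e2ehelpers

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDefaultServiceAccount polls until the "default" service account
// exists in the given namespace, so that pod creation does not hit the
// "serviceaccount \"default\" not found" 403 seen above.
func waitForDefaultServiceAccount(ctx context.Context, client kubernetes.Interface, ns string) error {
	return wait.PollImmediate(1*time.Second, 2*time.Minute, func() (bool, error) {
		_, err := client.CoreV1().ServiceAccounts(ns).Get(ctx, "default", metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return false, nil // keep polling until the SA is created
		}
		if err != nil {
			return false, err
		}
		return true, nil
	})
}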
@Ryan the failure you've spotted is legit, but it isn't exactly what was reported initially. I've looked only at "real" failures (i.e. when the Prometheus query didn't fail but returned no container metrics) [1] and they seem to be correlated with single-node/compact clusters.

Looking more specifically at https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-upgrade/1386207349999407104, I see lots of errors in the journal of ip-10-0-207-255.ec2 [2] about container metrics that fail to be collected:

Apr 25 08:38:30.732009 ip-10-0-207-255 hyperkube[1580]: W0425 08:38:30.731960 1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfb5503c7_7473_4520_9dad_541def516acc.slice": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfb5503c7_7473_4520_9dad_541def516acc.slice/crio-d15120b05438f37d9b5a8b7b6584b80aa6e6073e1dae498a6d67c312455fe0b7.scope": containerDataToContainerInfo: unable to find data in memory cache]

Apr 25 08:38:50.852883 ip-10-0-207-255 hyperkube[1580]: W0425 08:38:50.852839 1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-73255f8bc661b79156a6ca1f1efa7b613de6fd90d44aedc6cba3f2d57ed59202.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-efa11f10daec8f8c02d82a6bdb0ae15b7a23103a1b57f3a2a8e8c487a3795347.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2ef01847_9140_4ada_aa8d_20e909cccfc0.slice/crio-74c0a1734ddf75becb9a52c6e768680d0fe299e08108a2387be4aa5f1d33d0aa.scope": containerDataToContainerInfo: unable to find data in memory cache]

Apr 25 08:39:00.740491 ip-10-0-207-255 hyperkube[1580]: W0425 08:39:00.740450 1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-efa11f10daec8f8c02d82a6bdb0ae15b7a23103a1b57f3a2a8e8c487a3795347.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-73255f8bc661b79156a6ca1f1efa7b613de6fd90d44aedc6cba3f2d57ed59202.scope": containerDataToContainerInfo: unable to find data in memory cache]

Apr 25 08:40:20.866323 ip-10-0-207-255 hyperkube[1580]: W0425 08:40:20.866277 1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2abe929f_3ad0_4ffe_af45_026919cb467b.slice": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2abe929f_3ad0_4ffe_af45_026919cb467b.slice/crio-85d8a6d7170f3645bff82d458ebcf67171d15951a1b8a7dfab64f3766b85eb01.scope": containerDataToContainerInfo: unable to find data in memory cache]

If I read the code correctly [3], each of these log lines means that cAdvisor returned no container metrics at all to Prometheus for that scrape (a simplified illustration of this code path follows the references below).
[1] https://search.ci.openshift.org/?search=promQL+query+returned+unexpected+results%3A.*container_cpu_usage_seconds_total&maxAge=120h0m0s&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-upgrade/1386207349999407104/artifacts/e2e-aws-compact-upgrade/gather-extra/artifacts/nodes/ip-10-0-207-255.ec2.internal/journal
[3] https://github.com/openshift/origin/blob/581a8a0effc49410209e5d98735246dff9fddd4c/vendor/github.com/google/cadvisor/metrics/prometheus.go#L1821-L1826
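To make the failure mode concrete, here is a toy, self-contained collector (not the actual cAdvisor code; names and structure are simplified assumptions) that shows the pattern described in [3]: when the info provider reports partial failures, the collector logs the "Couldn't get containers" warning, sets the gauge that cAdvisor exports as container_scrape_error, and emits no container series at all for that scrape:

package main

import (
	"errors"
	"fmt"
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// containerCollector is a simplified stand-in for cAdvisor's Prometheus
// collector. getContainers stands in for the info-provider call that fails
// with "partial failures: [...]: unable to find data in memory cache".
type containerCollector struct {
	scrapeError   prometheus.Gauge
	cpuDesc       *prometheus.Desc
	getContainers func() (map[string]float64, error)
}

func (c *containerCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.cpuDesc
	c.scrapeError.Describe(ch)
}

func (c *containerCollector) Collect(ch chan<- prometheus.Metric) {
	containers, err := c.getContainers()
	if err != nil {
		c.scrapeError.Set(1) // surfaces as container_scrape_error == 1
		log.Printf("Couldn't get containers: %s", err)
		c.scrapeError.Collect(ch)
		return // no container_* series are emitted for this scrape
	}
	c.scrapeError.Set(0)
	c.scrapeError.Collect(ch)
	for id, seconds := range containers {
		ch <- prometheus.MustNewConstMetric(c.cpuDesc, prometheus.CounterValue, seconds, id)
	}
}

func main() {
	c := &containerCollector{
		scrapeError: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "container_scrape_error",
			Help: "1 if there was an error while getting container metrics, 0 otherwise",
		}),
		cpuDesc: prometheus.NewDesc("container_cpu_usage_seconds_total",
			"Cumulative cpu time consumed", []string{"id"}, nil),
		getContainers: func() (map[string]float64, error) {
			return nil, errors.New("partial failures: unable to find data in memory cache")
		},
	}
	reg := prometheus.NewRegistry()
	reg.MustRegister(c)
	families, err := reg.Gather()
	if err != nil {
		log.Fatal(err)
	}
	// Only container_scrape_error shows up; there is no
	// container_cpu_usage_seconds_total family, which matches the empty
	// result seen by the origin test query.
	for _, mf := range families {
		fmt.Println(mf.GetName(), "->", len(mf.GetMetric()), "series")
	}
}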
Based on comment 4 and after discussing offline with Ryan, reassigning to the Node team to investigate why cAdvisor repeatedly fails to collect container metrics.
*** Bug 1955247 has been marked as a duplicate of this bug. ***
*** Bug 1961395 has been marked as a duplicate of this bug. ***
This test is passing successfully on the 4.8 and 4.10 releases.
- CI test results (4.8): https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1466686801392439296
- CI test results (4.10): https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1466777487571685376