Bug 1950993

Summary: Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics test failing often
Product: OpenShift Container Platform
Reporter: Omer Tuchfeld <otuchfel>
Component: Node
Assignee: Swarup Ghosh <swghosh>
Node sub component: Kubelet
QA Contact: Weinan Liu <weinliu>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: medium
Priority: unspecified
CC: akrzos, anpicker, aos-bugs, dgrisonn, erooth, harpatil, jhusta, rfreiman, rphillips, spasquie, surbania, wking
Version: 4.8   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-12-06 06:47:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1973075    
Attachments:
  cAdvisor metrics (flags: none)
  container_scrape_error metric (flags: none)

Description Omer Tuchfeld 2021-04-19 10:44:29 UTC
Description of problem:
The "Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics" test fails often.

The test is implemented here: https://github.com/openshift/origin/blob/ac8ca36f59f94c4413c0571ec7a9c8d9b2430fbe/test/extended/prometheus/prometheus.go#L390-L402
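
For illustration, the check effectively asks Prometheus whether any cAdvisor CPU series exist for host-level (non-Pod) cgroups. Below is a minimal Go sketch of that kind of query using the Prometheus API client; the address, authentication, and exact PromQL expression are placeholders, the real ones live in the linked origin code.

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/prometheus/client_golang/api"
        promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
        "github.com/prometheus/common/model"
    )

    func main() {
        // Placeholder address; the origin test goes through the in-cluster
        // monitoring stack and authenticates with a service account token.
        client, err := api.NewClient(api.Config{
            Address: "https://prometheus-k8s.openshift-monitoring.svc:9091",
        })
        if err != nil {
            panic(err)
        }
        promAPI := promv1.NewAPI(client)

        // Illustrative approximation of the check: are there any cAdvisor CPU
        // series for host-level (non-Pod) cgroups such as system.slice units?
        query := `container_cpu_usage_seconds_total{id=~"/system.slice/.*"}`

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        result, warnings, err := promAPI.Query(ctx, query, time.Now())
        if err != nil {
            panic(err)
        }
        if len(warnings) > 0 {
            fmt.Println("warnings:", warnings)
        }
        if vec, ok := result.(model.Vector); ok && len(vec) == 0 {
            // This empty result is the failure mode reported in this bug.
            fmt.Println("no non-Pod host cAdvisor metrics found")
        }
    }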

Version-Release number of selected component (if applicable):


How reproducible:
Often in CI; see https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+should+have+non-Pod+host+cAdvisor+metrics&maxAge=120h0m0s&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


Steps to Reproduce:
1.
2.
3.

Actual results:
The test fails intermittently.

Expected results:
The test passes consistently.

Additional info:

Comment 1 Damien Grisonnet 2021-04-19 12:31:09 UTC
Created attachment 1773270 [details]
cAdvisor metrics

I replayed the metrics from this CI run [1] and was able to find cAdvisor metrics for system services. I attached to this BZ the result of the query that the origin test runs, but as far as I can tell the failure seems to be caused by scraping errors.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1121/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1383816160142692352

Comment 2 Simon Pasquier 2021-04-19 13:27:11 UTC
Created attachment 1773286 [details]
container_scrape_error metric

The container_scrape_error metric shows that cAdvisor cannot get container metrics most of the time (see the attached screenshot). Reassigning to the Node team for investigation.
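
For anyone reproducing this, the screenshot corresponds to graphing container_scrape_error over time; a sample with value 1 marks a scrape where cAdvisor hit an error while collecting container metrics. A minimal sketch of the same check with the Prometheus Go client (checkScrapeErrors is a hypothetical helper; the endpoint and window are up to the caller):

    import (
        "context"
        "time"

        promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
        "github.com/prometheus/common/model"
    )

    // checkScrapeErrors runs the range query behind the attached screenshot:
    // container_scrape_error over the given window. Samples with value 1 mark
    // scrapes where cAdvisor could not collect container metrics.
    func checkScrapeErrors(ctx context.Context, promAPI promv1.API, window time.Duration) (model.Value, error) {
        r := promv1.Range{
            Start: time.Now().Add(-window),
            End:   time.Now(),
            Step:  30 * time.Second,
        }
        result, _, err := promAPI.QueryRange(ctx, `container_scrape_error`, r)
        return result, err
    }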

Comment 3 Ryan Phillips 2021-04-26 13:34:32 UTC
The issue is with service account creation. Perhaps there is a race in the test, with the exec pod being created before the namespace's default service account exists?

   ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "pods \"execpod\" is forbidden: error looking up service account e2e-test-prometheus-f9wt8/default: serviceaccount \"default\" not found",
            Reason: "Forbidden",
            Details: {Name: "execpod", Group: "", Kind: "pods", UID: "", Causes: nil, RetryAfterSeconds: 0},
            Code: 403,
        },
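
If this is indeed a race, a common test-side mitigation is to wait for the namespace's default service account to exist before creating the exec pod. A minimal sketch with client-go (waitForDefaultServiceAccount is a hypothetical helper, not necessarily how origin handles it):

    import (
        "context"
        "time"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // waitForDefaultServiceAccount polls until the "default" service account
    // exists in the given namespace, so that pod creation is not rejected
    // with the 403 shown above.
    func waitForDefaultServiceAccount(ctx context.Context, c kubernetes.Interface, ns string) error {
        return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
            _, err := c.CoreV1().ServiceAccounts(ns).Get(ctx, "default", metav1.GetOptions{})
            if apierrors.IsNotFound(err) {
                return false, nil
            }
            return err == nil, err
        })
    }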

Comment 4 Simon Pasquier 2021-04-26 15:50:12 UTC
@Ryan the failure you've spotted is legitimate, but it isn't exactly what was reported initially. I've looked only at "real" failures (i.e. cases where the Prometheus query didn't fail but returned no container metrics) [1], and they seem to be correlated with single-node/compact clusters.

Looking more specifically at https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-upgrade/1386207349999407104, I see many errors in the logs of ip-10-0-207-255.ec2 [2] about container metrics failing to be collected:

Apr 25 08:38:30.732009 ip-10-0-207-255 hyperkube[1580]: W0425 08:38:30.731960    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfb5503c7_7473_4520_9dad_541def516acc.slice": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfb5503c7_7473_4520_9dad_541def516acc.slice/crio-d15120b05438f37d9b5a8b7b6584b80aa6e6073e1dae498a6d67c312455fe0b7.scope": containerDataToContainerInfo: unable to find data in memory cache]
Apr 25 08:38:50.852883 ip-10-0-207-255 hyperkube[1580]: W0425 08:38:50.852839    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-73255f8bc661b79156a6ca1f1efa7b613de6fd90d44aedc6cba3f2d57ed59202.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-efa11f10daec8f8c02d82a6bdb0ae15b7a23103a1b57f3a2a8e8c487a3795347.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2ef01847_9140_4ada_aa8d_20e909cccfc0.slice/crio-74c0a1734ddf75becb9a52c6e768680d0fe299e08108a2387be4aa5f1d33d0aa.scope": containerDataToContainerInfo: unable to find data in memory cache]
Apr 25 08:39:00.740491 ip-10-0-207-255 hyperkube[1580]: W0425 08:39:00.740450    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-efa11f10daec8f8c02d82a6bdb0ae15b7a23103a1b57f3a2a8e8c487a3795347.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-73255f8bc661b79156a6ca1f1efa7b613de6fd90d44aedc6cba3f2d57ed59202.scope": containerDataToContainerInfo: unable to find data in memory cache]
Apr 25 08:40:20.866323 ip-10-0-207-255 hyperkube[1580]: W0425 08:40:20.866277    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2abe929f_3ad0_4ffe_af45_026919cb467b.slice": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2abe929f_3ad0_4ffe_af45_026919cb467b.slice/crio-85d8a6d7170f3645bff82d458ebcf67171d15951a1b8a7dfab64f3766b85eb01.scope": containerDataToContainerInfo: unable to find data in memory cache]

If I read the code correctly [3], each log line means that cAdvisor returned no container metrics to Prometheus for that scrape.

[1] https://search.ci.openshift.org/?search=promQL+query+returned+unexpected+results%3A.*container_cpu_usage_seconds_total&maxAge=120h0m0s&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-upgrade/1386207349999407104/artifacts/e2e-aws-compact-upgrade/gather-extra/artifacts/nodes/ip-10-0-207-255.ec2.internal/journal
[3] https://github.com/openshift/origin/blob/581a8a0effc49410209e5d98735246dff9fddd4c/vendor/github.com/google/cadvisor/metrics/prometheus.go#L1821-L1826
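
As a quick way to quantify this from a node journal such as [2], the sketch below counts the kubelet lines carrying the cAdvisor "Couldn't get containers" warning, each of which (per the reading above) corresponds to a scrape that returned no container metrics. This is a hypothetical diagnostic helper that reads the journal on stdin:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // Reads a node journal on stdin (e.g. "journalctl -u kubelet | go run count.go")
    // and counts the cAdvisor warnings quoted above.
    func main() {
        count := 0
        scanner := bufio.NewScanner(os.Stdin)
        scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // journal lines can be long
        for scanner.Scan() {
            if strings.Contains(scanner.Text(), "Couldn't get containers: partial failures") {
                count++
            }
        }
        if err := scanner.Err(); err != nil {
            fmt.Fprintln(os.Stderr, "read error:", err)
            os.Exit(1)
        }
        fmt.Printf("cAdvisor collection failures in journal: %d\n", count)
    }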

Comment 5 Simon Pasquier 2021-04-27 13:19:12 UTC
Based on comment 4 and after discussing offline with Ryan, reassigning to the Node team to investigate why cAdvisor repeatedly fails to collect metrics.

Comment 6 Simon Pasquier 2021-04-30 09:08:16 UTC
*** Bug 1955247 has been marked as a duplicate of this bug. ***

Comment 11 Elana Hashman 2021-06-01 18:56:52 UTC
*** Bug 1961395 has been marked as a duplicate of this bug. ***