Bug 1950993

Summary: Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics test failing often
Product: OpenShift Container Platform
Reporter: Omer Tuchfeld <otuchfel>
Component: Node
Assignee: Swarup Ghosh <swghosh>
Node sub component: Kubelet
QA Contact: Weinan Liu <weinliu>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: medium
Priority: unspecified
CC: akrzos, anpicker, aos-bugs, dgrisonn, erooth, harpatil, jhusta, rfreiman, rphillips, spasquie, surbania, wking
Version: 4.8   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-12-06 06:47:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1973075    
Attachments:
  cAdvisor metrics (flags: none)
  container_scrape_error metric (flags: none)

Description Omer Tuchfeld 2021-04-19 10:44:29 UTC
Description of problem:
The "Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics" test fails often.

The test is implemented here: https://github.com/openshift/origin/blob/ac8ca36f59f94c4413c0571ec7a9c8d9b2430fbe/test/extended/prometheus/prometheus.go#L390-L402
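
For illustration, the check effectively asks Prometheus whether any cAdvisor CPU series exist for host-level (non-Pod) cgroups. Below is a minimal Go sketch of that kind of query using the Prometheus API client; the address, authentication, and exact PromQL expression are placeholders, the real ones live in the linked origin code.

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/prometheus/client_golang/api"
        promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
        "github.com/prometheus/common/model"
    )

    func main() {
        // Placeholder address; the origin test goes through the in-cluster
        // monitoring stack and authenticates with a service account token.
        client, err := api.NewClient(api.Config{
            Address: "https://prometheus-k8s.openshift-monitoring.svc:9091",
        })
        if err != nil {
            panic(err)
        }
        promAPI := promv1.NewAPI(client)

        // Illustrative approximation of the check: are there any cAdvisor CPU
        // series for host-level (non-Pod) cgroups such as system.slice units?
        query := `container_cpu_usage_seconds_total{id=~"/system.slice/.*"}`

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        result, warnings, err := promAPI.Query(ctx, query, time.Now())
        if err != nil {
            panic(err)
        }
        if len(warnings) > 0 {
            fmt.Println("warnings:", warnings)
        }
        if vec, ok := result.(model.Vector); ok && len(vec) == 0 {
            // This empty result is the failure mode reported in this bug.
            fmt.Println("no non-Pod host cAdvisor metrics found")
        }
    }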

Version-Release number of selected component (if applicable):


How reproducible:
Often in CI; see https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+should+have+non-Pod+host+cAdvisor+metrics&maxAge=120h0m0s&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


Steps to Reproduce:
1.
2.
3.

Actual results:
The test fails intermittently.

Expected results:
The test passes consistently.

Additional info:

Comment 1 Damien Grisonnet 2021-04-19 12:31:09 UTC
Created attachment 1773270 [details]
cAdvisor metrics

I replayed the metrics from this CI run [1] and was able to find cAdvisor metrics for system services. I attached to this BZ the result of the query that the origin test runs, but as far as I can tell the failure seems to be caused by scraping errors.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1121/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1383816160142692352

Comment 2 Simon Pasquier 2021-04-19 13:27:11 UTC
Created attachment 1773286 [details]
container_scrape_error metric

The container_scrape_error metric shows that cAdvisor cannot get container metrics most of the time (see the attached screenshot). Reassigning to the Node team for investigation.
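
For anyone reproducing this, the screenshot corresponds to graphing container_scrape_error over time; a sample with value 1 marks a scrape where cAdvisor hit an error while collecting container metrics. A minimal sketch of the same check with the Prometheus Go client (checkScrapeErrors is a hypothetical helper; the endpoint and window are up to the caller):

    import (
        "context"
        "time"

        promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
        "github.com/prometheus/common/model"
    )

    // checkScrapeErrors runs the range query behind the attached screenshot:
    // container_scrape_error over the given window. Samples with value 1 mark
    // scrapes where cAdvisor could not collect container metrics.
    func checkScrapeErrors(ctx context.Context, promAPI promv1.API, window time.Duration) (model.Value, error) {
        r := promv1.Range{
            Start: time.Now().Add(-window),
            End:   time.Now(),
            Step:  30 * time.Second,
        }
        result, _, err := promAPI.QueryRange(ctx, `container_scrape_error`, r)
        return result, err
    }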

Comment 3 Ryan Phillips 2021-04-26 13:34:32 UTC
The issue is with service account creation. Perhaps there is a race in the test, with the exec pod being created before the namespace's default service account exists?

   ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "pods \"execpod\" is forbidden: error looking up service account e2e-test-prometheus-f9wt8/default: serviceaccount \"default\" not found",
            Reason: "Forbidden",
            Details: {Name: "execpod", Group: "", Kind: "pods", UID: "", Causes: nil, RetryAfterSeconds: 0},
            Code: 403,
        },
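
If this is indeed a race, a common test-side mitigation is to wait for the namespace's default service account to exist before creating the exec pod. A minimal sketch with client-go (waitForDefaultServiceAccount is a hypothetical helper, not necessarily how origin handles it):

    import (
        "context"
        "time"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // waitForDefaultServiceAccount polls until the "default" service account
    // exists in the given namespace, so that pod creation is not rejected
    // with the 403 shown above.
    func waitForDefaultServiceAccount(ctx context.Context, c kubernetes.Interface, ns string) error {
        return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
            _, err := c.CoreV1().ServiceAccounts(ns).Get(ctx, "default", metav1.GetOptions{})
            if apierrors.IsNotFound(err) {
                return false, nil
            }
            return err == nil, err
        })
    }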

Comment 4 Simon Pasquier 2021-04-26 15:50:12 UTC
@Ryan the failure you've spotted is legitimate, but it isn't exactly what was reported initially. I've looked only at "real" failures (i.e. cases where the Prometheus query didn't fail but returned no container metrics) [1], and they seem to be correlated with single-node/compact clusters.

Looking more specifically at https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-upgrade/1386207349999407104, I see many errors in the logs of ip-10-0-207-255.ec2 [2] about container metrics failing to be collected:

Apr 25 08:38:30.732009 ip-10-0-207-255 hyperkube[1580]: W0425 08:38:30.731960    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfb5503c7_7473_4520_9dad_541def516acc.slice": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfb5503c7_7473_4520_9dad_541def516acc.slice/crio-d15120b05438f37d9b5a8b7b6584b80aa6e6073e1dae498a6d67c312455fe0b7.scope": containerDataToContainerInfo: unable to find data in memory cache]
Apr 25 08:38:50.852883 ip-10-0-207-255 hyperkube[1580]: W0425 08:38:50.852839    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-73255f8bc661b79156a6ca1f1efa7b613de6fd90d44aedc6cba3f2d57ed59202.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-efa11f10daec8f8c02d82a6bdb0ae15b7a23103a1b57f3a2a8e8c487a3795347.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2ef01847_9140_4ada_aa8d_20e909cccfc0.slice/crio-74c0a1734ddf75becb9a52c6e768680d0fe299e08108a2387be4aa5f1d33d0aa.scope": containerDataToContainerInfo: unable to find data in memory cache]
Apr 25 08:39:00.740491 ip-10-0-207-255 hyperkube[1580]: W0425 08:39:00.740450    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-efa11f10daec8f8c02d82a6bdb0ae15b7a23103a1b57f3a2a8e8c487a3795347.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-73255f8bc661b79156a6ca1f1efa7b613de6fd90d44aedc6cba3f2d57ed59202.scope": containerDataToContainerInfo: unable to find data in memory cache]
Apr 25 08:40:20.866323 ip-10-0-207-255 hyperkube[1580]: W0425 08:40:20.866277    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2abe929f_3ad0_4ffe_af45_026919cb467b.slice": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2abe929f_3ad0_4ffe_af45_026919cb467b.slice/crio-85d8a6d7170f3645bff82d458ebcf67171d15951a1b8a7dfab64f3766b85eb01.scope": containerDataToContainerInfo: unable to find data in memory cache]

If I read the code correctly [3], each log line means that cAdvisor returned no container metrics to Prometheus for that scrape.

[1] https://search.ci.openshift.org/?search=promQL+query+returned+unexpected+results%3A.*container_cpu_usage_seconds_total&maxAge=120h0m0s&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-upgrade/1386207349999407104/artifacts/e2e-aws-compact-upgrade/gather-extra/artifacts/nodes/ip-10-0-207-255.ec2.internal/journal
[3] https://github.com/openshift/origin/blob/581a8a0effc49410209e5d98735246dff9fddd4c/vendor/github.com/google/cadvisor/metrics/prometheus.go#L1821-L1826
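
As a quick way to quantify this from a node journal such as [2], the sketch below counts the kubelet lines carrying the cAdvisor "Couldn't get containers" warning, each of which (per the reading above) corresponds to a scrape that returned no container metrics. This is a hypothetical diagnostic helper that reads the journal on stdin:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // Reads a node journal on stdin (e.g. "journalctl -u kubelet | go run count.go")
    // and counts the cAdvisor warnings quoted above.
    func main() {
        count := 0
        scanner := bufio.NewScanner(os.Stdin)
        scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // journal lines can be long
        for scanner.Scan() {
            if strings.Contains(scanner.Text(), "Couldn't get containers: partial failures") {
                count++
            }
        }
        if err := scanner.Err(); err != nil {
            fmt.Fprintln(os.Stderr, "read error:", err)
            os.Exit(1)
        }
        fmt.Printf("cAdvisor collection failures in journal: %d\n", count)
    }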

Comment 5 Simon Pasquier 2021-04-27 13:19:12 UTC
Based on comment 4 and after discussing offline with Ryan, reassigning to the Node team to investigate why cAdvisor repeatedly fails to collect metrics.

Comment 6 Simon Pasquier 2021-04-30 09:08:16 UTC
*** Bug 1955247 has been marked as a duplicate of this bug. ***

Comment 11 Elana Hashman 2021-06-01 18:56:52 UTC
*** Bug 1961395 has been marked as a duplicate of this bug. ***