Bug 1950993 - Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics test failing often
Summary: Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics test failing often
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Swarup Ghosh
QA Contact: Weinan Liu
URL:
Whiteboard:
Duplicates: 1955247 1961395
Depends On:
Blocks: 1973075
 
Reported: 2021-04-19 10:44 UTC by Omer Tuchfeld
Modified: 2021-12-06 06:47 UTC
CC List: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-06 06:47:51 UTC
Target Upstream Version:
Embargoed:


Attachments
cAdvisor metrics (412.54 KB, image/png)
2021-04-19 12:31 UTC, Damien Grisonnet
container_scrape_error metric (78.84 KB, image/png)
2021-04-19 13:27 UTC, Simon Pasquier


Links
Github openshift/kubernetes pull 802 (open): Bug 1950993: UPSTREAM: <drop>: bump cadvisor for 2868 upstream patch - last updated 2021-06-10 16:02:40 UTC
Github openshift/kubernetes pull 892 - last updated 2021-08-18 10:03:51 UTC
Github openshift/origin pull 26232 (open): Bug 1950993: Replace cadvisor with openshift cadvisor fork - last updated 2021-06-15 15:01:17 UTC

Description Omer Tuchfeld 2021-04-19 10:44:29 UTC
Description of problem:
The "Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics" test fails often.

The test is implemented here: https://github.com/openshift/origin/blob/ac8ca36f59f94c4413c0571ec7a9c8d9b2430fbe/test/extended/prometheus/prometheus.go#L390-L402
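
For context, here is a minimal sketch of the kind of check the linked test performs: query the cluster's Prometheus for cAdvisor series whose cgroup id is not under /kubepods.slice and expect a non-empty result. The address, query selector, and program structure below are illustrative assumptions, not the actual origin helper code (which goes through the in-cluster Thanos querier with a bearer token).

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical endpoint; the real test authenticates against the
	// in-cluster Thanos querier, which is omitted here.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// cAdvisor series whose cgroup id is not under /kubepods.slice, i.e.
	// host (non-Pod) services. The exact selector in origin may differ.
	query := `container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*",id!=""}`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// An empty result here corresponds to the failure this bug reports.
	fmt.Println(result)
}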

Version-Release number of selected component (if applicable):


How reproducible:
https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+should+have+non-Pod+host+cAdvisor+metrics&maxAge=120h0m0s&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job 


Steps to Reproduce:
1.
2.
3.

Actual results:
Test fails

Expected results:
Test should not fail

Additional info:

Comment 1 Damien Grisonnet 2021-04-19 12:31:09 UTC
Created attachment 1773270 [details]
cAdvisor metrics

I replayed the metrics from this CI run [1] and was able to find cAdvisor metrics for system services. I've attached to this BZ the result of the query that origin tests, but as far as I can tell, the failure seems to be caused by scraping errors.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1121/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1383816160142692352

Comment 2 Simon Pasquier 2021-04-19 13:27:11 UTC
Created attachment 1773286 [details]
container_scrape_error metric

The container_scrape_error metric shows that cAdvisor fails to return container metrics most of the time (see the attached screenshot). Reassigning to the Node team for investigation.

Comment 3 Ryan Phillips 2021-04-26 13:34:32 UTC
The issue is with service account creation. Maybe a race in the test, where the exec pod is created before the namespace's default service account exists?

   ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "pods \"execpod\" is forbidden: error looking up service account e2e-test-prometheus-f9wt8/default: serviceaccount \"default\" not found",
            Reason: "Forbidden",
            Details: {Name: "execpod", Group: "", Kind: "pods", UID: "", Causes: nil, RetryAfterSeconds: 0},
            Code: 403,
        },
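
If it is a race like that, one possible mitigation (a sketch only, not taken from the PRs attached to this bug; the function name, interval, and timeout are illustrative) is to wait for the namespace's default service account before creating the exec pod:

package helpers

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDefaultServiceAccount polls until the "default" ServiceAccount
// exists in the given namespace, so that pod creation is not rejected with
// the 403 shown above.
func waitForDefaultServiceAccount(ctx context.Context, client kubernetes.Interface, ns string) error {
	return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
		_, err := client.CoreV1().ServiceAccounts(ns).Get(ctx, "default", metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return false, nil // not created yet, keep polling
		}
		return err == nil, err
	})
}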

Comment 4 Simon Pasquier 2021-04-26 15:50:12 UTC
@Ryan, the failure you've spotted is legitimate, but it isn't exactly what was reported initially. I've looked only at "real" failures (i.e. cases where the Prometheus query succeeded but returned no container metrics) [1], and they appear to be correlated with single-node/compact clusters.

Looking more specifically at https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-upgrade/1386207349999407104, I see lots of errors in the logs of ip-10-0-207-255.ec2 [2] about container metrics failing to be collected:

Apr 25 08:38:30.732009 ip-10-0-207-255 hyperkube[1580]: W0425 08:38:30.731960    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfb5503c7_7473_4520_9dad_541def516acc.slice": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfb5503c7_7473_4520_9dad_541def516acc.slice/crio-d15120b05438f37d9b5a8b7b6584b80aa6e6073e1dae498a6d67c312455fe0b7.scope": containerDataToContainerInfo: unable to find data in memory cache]
Apr 25 08:38:50.852883 ip-10-0-207-255 hyperkube[1580]: W0425 08:38:50.852839    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-73255f8bc661b79156a6ca1f1efa7b613de6fd90d44aedc6cba3f2d57ed59202.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-efa11f10daec8f8c02d82a6bdb0ae15b7a23103a1b57f3a2a8e8c487a3795347.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2ef01847_9140_4ada_aa8d_20e909cccfc0.slice/crio-74c0a1734ddf75becb9a52c6e768680d0fe299e08108a2387be4aa5f1d33d0aa.scope": containerDataToContainerInfo: unable to find data in memory cache]
Apr 25 08:39:00.740491 ip-10-0-207-255 hyperkube[1580]: W0425 08:39:00.740450    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-efa11f10daec8f8c02d82a6bdb0ae15b7a23103a1b57f3a2a8e8c487a3795347.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-73255f8bc661b79156a6ca1f1efa7b613de6fd90d44aedc6cba3f2d57ed59202.scope": containerDataToContainerInfo: unable to find data in memory cache]
Apr 25 08:40:20.866323 ip-10-0-207-255 hyperkube[1580]: W0425 08:40:20.866277    1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2abe929f_3ad0_4ffe_af45_026919cb467b.slice": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2abe929f_3ad0_4ffe_af45_026919cb467b.slice/crio-85d8a6d7170f3645bff82d458ebcf67171d15951a1b8a7dfab64f3766b85eb01.scope": containerDataToContainerInfo: unable to find data in memory cache]

If I read the code correctly [3], each log line means that cadvisor returned no container metrics to Prometheus.

[1] https://search.ci.openshift.org/?search=promQL+query+returned+unexpected+results%3A.*container_cpu_usage_seconds_total&maxAge=120h0m0s&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-upgrade/1386207349999407104/artifacts/e2e-aws-compact-upgrade/gather-extra/artifacts/nodes/ip-10-0-207-255.ec2.internal/journal
[3] https://github.com/openshift/origin/blob/581a8a0effc49410209e5d98735246dff9fddd4c/vendor/github.com/google/cadvisor/metrics/prometheus.go#L1821-L1826
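
In other words, the journal warnings, the container_scrape_error screenshot from comment 2, and the empty query results all line up with the following pattern (a paraphrased, self-contained sketch of the code path in [3]; the names and types are simplified stand-ins, not the real cAdvisor API, and tying the error gauge to this exact path is an assumption based on comment 2):

package sketch

import "log"

// Simplified stand-ins for the real cAdvisor types.
type containerInfo struct{ name string }

type infoProvider interface {
	// In real cAdvisor a partial failure for any container (for example
	// "unable to find data in memory cache") surfaces as a non-nil error.
	containersInfo() (map[string]containerInfo, error)
}

// scrapeError mimics the container_scrape_error gauge: 1 on failure, 0 otherwise.
var scrapeError float64

// collectContainers paraphrases the behavior described above: on any error
// from the provider, the collector logs a warning, flags the scrape as
// failed, and returns no container metrics at all for that scrape.
func collectContainers(p infoProvider) map[string]containerInfo {
	containers, err := p.containersInfo()
	if err != nil {
		scrapeError = 1
		log.Printf("Couldn't get containers: %s", err) // the journal lines above
		return nil
	}
	scrapeError = 0
	return containers
}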

Comment 5 Simon Pasquier 2021-04-27 13:19:12 UTC
Based on comment 4 and after discussing offline with Ryan, reassigning to the Node team to investigate why cAdvisor repeatedly fails to collect metrics.

Comment 6 Simon Pasquier 2021-04-30 09:08:16 UTC
*** Bug 1955247 has been marked as a duplicate of this bug. ***

Comment 11 Elana Hashman 2021-06-01 18:56:52 UTC
*** Bug 1961395 has been marked as a duplicate of this bug. ***

