Description of problem:
The "Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics" test fails often. The test is implemented here:
https://github.com/openshift/origin/blob/ac8ca36f59f94c4413c0571ec7a9c8d9b2430fbe/test/extended/prometheus/prometheus.go#L390-L402

Version-Release number of selected component (if applicable):

How reproducible:
https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+should+have+non-Pod+host+cAdvisor+metrics&maxAge=120h0m0s&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:
1.
2.
3.

Actual results:
Test fails

Expected results:
Test should not fail

Additional info:
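For reference, a minimal, self-contained sketch of how the failing assertion can be reproduced by hand against the in-cluster Prometheus, using the official client_golang API. The query string is only an approximation of what the origin test asserts (cAdvisor series whose cgroup id is not under /kubepods.slice); the Prometheus address and the omitted bearer-token authentication are assumptions, not the actual test code:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Address and authentication are assumptions: the in-cluster Prometheus
	// requires a bearer token, which would normally be supplied through
	// api.Config.RoundTripper (omitted here for brevity).
	client, err := api.NewClient(api.Config{
		Address: "https://prometheus-k8s.openshift-monitoring.svc:9091",
	})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Approximation of the assertion: count cAdvisor CPU series whose cgroup
	// id is NOT under /kubepods.slice, i.e. host (non-Pod) cgroups such as
	// system services. The exact expression in the origin test may differ.
	query := `count(container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*",id!=""})`

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result) // an empty/zero result here reproduces the test failure
}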
Created attachment 1773270 [details]
cAdvisor metrics

I replayed the metrics from this CI run [1] and was able to find cAdvisor metrics for system services. I attached the result of the query that is tested in origin to this BZ but, as far as I can tell, the failure seems to be caused by scraping errors.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1121/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1383816160142692352
Created attachment 1773286 [details]
container_scrape_error metric

The container_scrape_error metric shows that cAdvisor can't get container metrics most of the time (see the attached screenshot). Reassigning to the Node team for investigation.
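To quantify "most of the time", the raw container_scrape_error series can be pulled over a window with a range query. A small sketch, assuming a v1.API handle built as in the earlier snippet (package and function names are hypothetical):

package promdebug

import (
	"context"
	"fmt"
	"time"

	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// scrapeErrorHistory returns the container_scrape_error samples for the last
// hour; samples with value 1 mark scrapes where cAdvisor returned no
// container metrics.
func scrapeErrorHistory(ctx context.Context, promAPI v1.API) (model.Value, error) {
	r := v1.Range{
		Start: time.Now().Add(-1 * time.Hour),
		End:   time.Now(),
		Step:  30 * time.Second,
	}
	result, warnings, err := promAPI.QueryRange(ctx, "container_scrape_error", r)
	if err != nil {
		return nil, err
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	return result, nil
}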
The issue is with the service account creation. Maybe there is a race in the test between creating the service account and creating the exec pod?

ErrStatus: {
    TypeMeta: {Kind: "", APIVersion: ""},
    ListMeta: {
        SelfLink: "",
        ResourceVersion: "",
        Continue: "",
        RemainingItemCount: nil,
    },
    Status: "Failure",
    Message: "pods \"execpod\" is forbidden: error looking up service account e2e-test-prometheus-f9wt8/default: serviceaccount \"default\" not found",
    Reason: "Forbidden",
    Details: {Name: "execpod", Group: "", Kind: "pods", UID: "", Causes: nil, RetryAfterSeconds: 0},
    Code: 403,
},
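If that race is confirmed, one possible mitigation on the test side would be to wait for the namespace's "default" service account to exist before creating the exec pod. A minimal sketch under that assumption, using client-go (the package, function name, and polling intervals are hypothetical):

package e2ehelpers

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDefaultServiceAccount polls until the "default" service account
// exists in the given namespace, so that pod creation does not hit the
// "serviceaccount \"default\" not found" 403 seen above.
func waitForDefaultServiceAccount(ctx context.Context, client kubernetes.Interface, ns string) error {
	return wait.PollImmediate(1*time.Second, 2*time.Minute, func() (bool, error) {
		_, err := client.CoreV1().ServiceAccounts(ns).Get(ctx, "default", metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return false, nil // keep polling until the SA is created
		}
		if err != nil {
			return false, err
		}
		return true, nil
	})
}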
@Ryan the failure you've spotted is legit, but it isn't exactly what was reported initially. I've looked only at "real" failures (i.e. when the Prometheus query didn't fail but returned no container metrics) [1] and they seem to be correlated with single-node/compact clusters.

Looking more specifically at https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-upgrade/1386207349999407104, I see lots of errors in the journal of ip-10-0-207-255.ec2 [2] about container metrics that fail to be collected:

Apr 25 08:38:30.732009 ip-10-0-207-255 hyperkube[1580]: W0425 08:38:30.731960 1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfb5503c7_7473_4520_9dad_541def516acc.slice": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podfb5503c7_7473_4520_9dad_541def516acc.slice/crio-d15120b05438f37d9b5a8b7b6584b80aa6e6073e1dae498a6d67c312455fe0b7.scope": containerDataToContainerInfo: unable to find data in memory cache]

Apr 25 08:38:50.852883 ip-10-0-207-255 hyperkube[1580]: W0425 08:38:50.852839 1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-73255f8bc661b79156a6ca1f1efa7b613de6fd90d44aedc6cba3f2d57ed59202.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-efa11f10daec8f8c02d82a6bdb0ae15b7a23103a1b57f3a2a8e8c487a3795347.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2ef01847_9140_4ada_aa8d_20e909cccfc0.slice/crio-74c0a1734ddf75becb9a52c6e768680d0fe299e08108a2387be4aa5f1d33d0aa.scope": containerDataToContainerInfo: unable to find data in memory cache]

Apr 25 08:39:00.740491 ip-10-0-207-255 hyperkube[1580]: W0425 08:39:00.740450 1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-efa11f10daec8f8c02d82a6bdb0ae15b7a23103a1b57f3a2a8e8c487a3795347.scope": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf3c4f7a1_78a6_44f3_a357_fa3db0e5d9fd.slice/crio-73255f8bc661b79156a6ca1f1efa7b613de6fd90d44aedc6cba3f2d57ed59202.scope": containerDataToContainerInfo: unable to find data in memory cache]

Apr 25 08:40:20.866323 ip-10-0-207-255 hyperkube[1580]: W0425 08:40:20.866277 1580 prometheus.go:1856] Couldn't get containers: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2abe929f_3ad0_4ffe_af45_026919cb467b.slice": containerDataToContainerInfo: unable to find data in memory cache], ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod2abe929f_3ad0_4ffe_af45_026919cb467b.slice/crio-85d8a6d7170f3645bff82d458ebcf67171d15951a1b8a7dfab64f3766b85eb01.scope": containerDataToContainerInfo: unable to find data in memory cache]

If I read the code correctly [3], each of these log lines means that cAdvisor returned no container metrics at all to Prometheus for that scrape (a simplified illustration of this code path follows the references below).
[1] https://search.ci.openshift.org/?search=promQL+query+returned+unexpected+results%3A.*container_cpu_usage_seconds_total&maxAge=120h0m0s&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-upgrade/1386207349999407104/artifacts/e2e-aws-compact-upgrade/gather-extra/artifacts/nodes/ip-10-0-207-255.ec2.internal/journal
[3] https://github.com/openshift/origin/blob/581a8a0effc49410209e5d98735246dff9fddd4c/vendor/github.com/google/cadvisor/metrics/prometheus.go#L1821-L1826
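To make the failure mode concrete, here is a toy, self-contained collector (not the actual cAdvisor code; names and structure are simplified assumptions) that shows the pattern described in [3]: when the info provider reports partial failures, the collector logs the "Couldn't get containers" warning, sets the gauge that cAdvisor exports as container_scrape_error, and emits no container series at all for that scrape:

package main

import (
	"errors"
	"fmt"
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// containerCollector is a simplified stand-in for cAdvisor's Prometheus
// collector. getContainers stands in for the info-provider call that fails
// with "partial failures: [...]: unable to find data in memory cache".
type containerCollector struct {
	scrapeError   prometheus.Gauge
	cpuDesc       *prometheus.Desc
	getContainers func() (map[string]float64, error)
}

func (c *containerCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.cpuDesc
	c.scrapeError.Describe(ch)
}

func (c *containerCollector) Collect(ch chan<- prometheus.Metric) {
	containers, err := c.getContainers()
	if err != nil {
		c.scrapeError.Set(1) // surfaces as container_scrape_error == 1
		log.Printf("Couldn't get containers: %s", err)
		c.scrapeError.Collect(ch)
		return // no container_* series are emitted for this scrape
	}
	c.scrapeError.Set(0)
	c.scrapeError.Collect(ch)
	for id, seconds := range containers {
		ch <- prometheus.MustNewConstMetric(c.cpuDesc, prometheus.CounterValue, seconds, id)
	}
}

func main() {
	c := &containerCollector{
		scrapeError: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "container_scrape_error",
			Help: "1 if there was an error while getting container metrics, 0 otherwise",
		}),
		cpuDesc: prometheus.NewDesc("container_cpu_usage_seconds_total",
			"Cumulative cpu time consumed", []string{"id"}, nil),
		getContainers: func() (map[string]float64, error) {
			return nil, errors.New("partial failures: unable to find data in memory cache")
		},
	}
	reg := prometheus.NewRegistry()
	reg.MustRegister(c)
	families, err := reg.Gather()
	if err != nil {
		log.Fatal(err)
	}
	// Only container_scrape_error shows up; there is no
	// container_cpu_usage_seconds_total family, which matches the empty
	// result seen by the origin test query.
	for _, mf := range families {
		fmt.Println(mf.GetName(), "->", len(mf.GetMetric()), "series")
	}
}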
Based on comment 4 and after discussing offline with Ryan, reassigning to the Node team to investigate why cAdvisor repeatedly fails to collect container metrics.
*** Bug 1955247 has been marked as a duplicate of this bug. ***
*** Bug 1961395 has been marked as a duplicate of this bug. ***
This test is passing successfully on the 4.8 and 4.10 releases.
- CI test results (4.8): https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1466686801392439296
- CI test results (4.10): https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1466777487571685376