1973075 – Prometheus when installed on the cluster should have non-Pod host cAdvisor metrics test failing often

Summary: Prometheus when installed on the cluster should have non-Pod host cAdvisor me...

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Harshal Patil
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Depends On:	1950993
Blocks:

Reported:	2021-06-17 08:37 UTC by OpenShift BugZilla Robot
Modified:	2021-07-01 11:06 UTC
CC List:	16 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-01 11:06:48 UTC
Target Upstream Version:
Embargoed:

Comment 2 W. Trevor King 2021-06-29 17:09:52 UTC

If we decide not to backport a fix to 4.8 or earlier, I think we should either use this bug or a new bug series for "soften the 4.8 and earlier test suites so we don't complain about the expected issue".  Looking in CI search:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+should+have+non-Pod+host+cAdvisor+metrics&maxAge=336h&type=junit' | grep 'failures match' | grep -v 'rehearse-\|pull-ci-' | sort
openshift-ibm-roks-toolkit-release-4.5-create-cluster-periodics (all) - 28 runs, 93% failed, 4% of failures match = 4% impact
openshift-ibm-roks-toolkit-release-4.6-create-cluster-periodics (all) - 28 runs, 50% failed, 7% of failures match = 4% impact
openshift-ibm-roks-toolkit-release-4.7-create-cluster-periodics (all) - 28 runs, 50% failed, 7% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-openstack-parallel (all) - 40 runs, 35% failed, 7% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 150 runs, 41% failed, 2% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.4-e2e-ovirt (all) - 28 runs, 96% failed, 37% of failures match = 36% impact
periodic-ci-openshift-release-master-nightly-4.5-e2e-aws-fips (all) - 10 runs, 10% failed, 100% of failures match = 10% impact
periodic-ci-openshift-release-master-nightly-4.5-e2e-ovirt (all) - 33 runs, 100% failed, 21% of failures match = 21% impact
periodic-ci-openshift-release-master-nightly-4.5-e2e-vsphere-upi (all) - 41 runs, 100% failed, 73% of failures match = 73% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-proxy (all) - 31 runs, 81% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-remote-libvirt-ppc64le (all) - 12 runs, 100% failed, 8% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-vsphere (all) - 68 runs, 19% failed, 8% of failures match = 1% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-csi-migration (all) - 28 runs, 100% failed, 68% of failures match = 68% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node (all) - 14 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-remote-libvirt-s390x (all) - 20 runs, 80% failed, 13% of failures match = 10% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-csi-migration (all) - 28 runs, 100% failed, 57% of failures match = 57% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node (all) - 14 runs, 100% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-compact-remote-libvirt-ppc64le (all) - 12 runs, 100% failed, 8% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt (all) - 85 runs, 79% failed, 3% of failures match = 2% impact
release-openshift-ocp-installer-e2e-aws-mirrors-4.7 (all) - 7 runs, 100% failed, 86% of failures match = 86% impact
release-openshift-ocp-installer-e2e-azure-ovn-4.9 (all) - 85 runs, 39% failed, 3% of failures match = 1% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.9 (all) - 85 runs, 66% failed, 4% of failures match = 2% impact
release-openshift-ocp-installer-e2e-metal-4.8 (all) - 39 runs, 49% failed, 5% of failures match = 3% impact
release-openshift-ocp-installer-e2e-metal-compact-4.8 (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
release-openshift-ocp-installer-e2e-remote-libvirt-compact-s390x-4.8 (all) - 16 runs, 94% failed, 7% of failures match = 6% impact
release-openshift-ocp-installer-e2e-remote-libvirt-ppc64le-4.6 (all) - 16 runs, 31% failed, 20% of failures match = 6% impact
release-openshift-ocp-installer-e2e-remote-libvirt-ppc64le-4.7 (all) - 16 runs, 75% failed, 8% of failures match = 6% impact
release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.6 (all) - 16 runs, 63% failed, 10% of failures match = 6% impact
release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.8 (all) - 16 runs, 100% failed, 38% of failures match = 38% impact
release-openshift-origin-installer-e2e-gcp-compact-4.5 (all) - 7 runs, 43% failed, 33% of failures match = 14% impact

So there are still a number of 4.6 through 4.8 failures in releases that will be supported for a long time, and whose CI health we will be closely monitoring for as long as those releases are supported.
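
The percentages in the CI search output above appear to combine as impact ≈ (share of runs that failed) × (share of those failures that match the search). Below is a minimal Python sketch, assuming each line follows the "(all) - N runs, X% failed, Y% of failures match = Z% impact" layout shown above. The helper names parse_line and approx_impact are hypothetical, and the recomputed value can differ from the printed one by a point or so because the printed percentages are already rounded.

```python
import re

# Layout assumed from the CI search output quoted above:
# "<job> (all) - N runs, X% failed, Y% of failures match = Z% impact"
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_line(line):
    """Parse one result line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "job": m["job"],
        "runs": int(m["runs"]),
        "failed_pct": int(m["failed"]),
        "match_pct": int(m["match"]),
        "impact_pct": int(m["impact"]),
    }


def approx_impact(rec):
    """Recompute impact from the two rounded percentages (approximate)."""
    return rec["failed_pct"] * rec["match_pct"] / 100


# Three lines copied from the output above.
SAMPLE = [
    "periodic-ci-openshift-release-master-nightly-4.5-e2e-vsphere-upi (all) - 41 runs, 100% failed, 73% of failures match = 73% impact",
    "periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-csi-migration (all) - 28 runs, 100% failed, 68% of failures match = 68% impact",
    "release-openshift-ocp-installer-e2e-metal-compact-4.8 (all) - 7 runs, 57% failed, 25% of failures match = 14% impact",
]

records = [r for r in (parse_line(line) for line in SAMPLE) if r is not None]
# Rank jobs by reported impact, highest first.
ranked = sorted(records, key=lambda r: r["impact_pct"], reverse=True)
for rec in ranked:
    print(rec["job"], rec["impact_pct"], approx_impact(rec))
```

Sorting by impact rather than by raw failure rate keeps jobs that fail for unrelated reasons from dominating the list.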

Comment 3 Harshal Patil 2021-06-30 09:46:34 UTC

Hi Trevor, 

My apologies, I should have been more elaborate in my comment when I closed that bug. 


Let me clarify a few things before I try to explain why I decided to close this bug for 4.8. 


I found and fixed an issue [1] in upstream cadvisor which modified how cadvisor reports errors.


cadvisor gets imported into openshift in 2 different places: first in the openshift/kubernetes repo, and second in the openshift/origin repo. PR [2] brought the changes I made upstream [1] into openshift/kubernetes, while PR [3] brought those upstream changes [1] into openshift/origin. This reduced the failures in this test significantly, down to occasional flakes.

Since the PR for openshift/kubernetes [2] was merged before the 4.8 window closed, the changes made it into the 4.8 branch of openshift/kubernetes. However, when I raised PR [4], Seth Jennings pointed out that the code in openshift/origin is no longer used in building the kubelet (even though it imports cadvisor). This means we do not need to merge [4] in order to fix this issue; we only need the changes in openshift/kubernetes, not in openshift/origin. This was confirmed when we looked at the SNO CI for 4.8 [6], which had the changes in openshift/kubernetes but not in openshift/origin.

So, since we don't need the changes in openshift/origin, we already have the required changes in openshift/kubernetes, and we see the tests going from failing to occasional flakes [6], I decided to close this bug.

The search [8] you mentioned in your comment is slightly misleading IMO. I opened a couple of random results from that search [9], [10]. Although those jobs failed, it wasn't due to failure of this test specifically; rather, pretty much all the tests in those jobs failed. So I am not sure I would link those runs to this BZ. A good example of a job failure caused by the issue in this BZ is [11], where the job is clearly failing due to failure of the test linked with this BZ.
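
The distinction drawn above, between runs where this test failed on its own and runs where nearly everything failed, could be applied mechanically. This is only a rough Python sketch, assuming a per-run list of failed test names and a total test count are available. The function name attributable, the 50% threshold, and the sample runs are all hypothetical and are not taken from any real job.

```python
TARGET = (
    "Prometheus when installed on the cluster "
    "should have non-Pod host cAdvisor metrics"
)


def attributable(failed_tests, total_tests, target=TARGET, max_failed_fraction=0.5):
    """Return True if `target` failed and the run was not a mass failure.

    A run counts as a mass failure when more than `max_failed_fraction`
    of its tests failed; the 0.5 default is an arbitrary assumption.
    """
    if total_tests <= 0:
        return False
    if not any(target in name for name in failed_tests):
        return False
    return len(failed_tests) / total_tests <= max_failed_fraction


# Hypothetical runs, invented purely for illustration.
mass_failure_run = [f"unrelated-test-{i}" for i in range(900)] + [TARGET]
isolated_run = [TARGET]

print(attributable(mass_failure_run, 1000))  # mass failure, not linked to this BZ
print(attributable(isolated_run, 1000))      # isolated failure, plausibly linked
```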

 

[1] https://github.com/google/cadvisor/pull/2868
[2] https://github.com/openshift/kubernetes/pull/802
[3] https://github.com/openshift/origin/pull/26232
[4] https://github.com/openshift/origin/pull/26243
[5] https://coreos.slack.com/archives/GK6BJJ1J5/p1623919675075400
[6] https://testgrid.k8s.io/redhat-single-node#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node&include-filter-by-regex=cAdvisor
[7] https://bugzilla.redhat.com/show_bug.cgi?id=1973075#c2
[8] https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+should+have+non-Pod+host+cAdvisor+metrics&maxAge=336h&type=junit
[9] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_multus-cni/90/pull-ci-openshift-multus-cni-release-4.7-e2e-aws/1410116803887108096
[10] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_multus-cni/90/pull-ci-openshift-multus-cni-release-4.7-e2e-aws/1409203294936502272
[11] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1121/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1383816160142692352
