If we decide not to backport a fix to 4.8 or earlier, I think we should either use this bug or a new bug series for "soften the 4.8 and earlier test suites so we don't complain about the expected issue". Looking in CI search:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+should+have+non-Pod+host+cAdvisor+metrics&maxAge=336h&type=junit' | grep 'failures match' | grep -v 'rehearse-\|pull-ci-' | sort
openshift-ibm-roks-toolkit-release-4.5-create-cluster-periodics (all) - 28 runs, 93% failed, 4% of failures match = 4% impact
openshift-ibm-roks-toolkit-release-4.6-create-cluster-periodics (all) - 28 runs, 50% failed, 7% of failures match = 4% impact
openshift-ibm-roks-toolkit-release-4.7-create-cluster-periodics (all) - 28 runs, 50% failed, 7% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-openstack-parallel (all) - 40 runs, 35% failed, 7% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 150 runs, 41% failed, 2% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.4-e2e-ovirt (all) - 28 runs, 96% failed, 37% of failures match = 36% impact
periodic-ci-openshift-release-master-nightly-4.5-e2e-aws-fips (all) - 10 runs, 10% failed, 100% of failures match = 10% impact
periodic-ci-openshift-release-master-nightly-4.5-e2e-ovirt (all) - 33 runs, 100% failed, 21% of failures match = 21% impact
periodic-ci-openshift-release-master-nightly-4.5-e2e-vsphere-upi (all) - 41 runs, 100% failed, 73% of failures match = 73% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-proxy (all) - 31 runs, 81% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-remote-libvirt-ppc64le (all) - 12 runs, 100% failed, 8% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-vsphere (all) - 68 runs, 19% failed, 8% of failures match = 1% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-csi-migration (all) - 28 runs, 100% failed, 68% of failures match = 68% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node (all) - 14 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-remote-libvirt-s390x (all) - 20 runs, 80% failed, 13% of failures match = 10% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-csi-migration (all) - 28 runs, 100% failed, 57% of failures match = 57% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node (all) - 14 runs, 100% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-compact-remote-libvirt-ppc64le (all) - 12 runs, 100% failed, 8% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt (all) - 85 runs, 79% failed, 3% of failures match = 2% impact
release-openshift-ocp-installer-e2e-aws-mirrors-4.7 (all) - 7 runs, 100% failed, 86% of failures match = 86% impact
release-openshift-ocp-installer-e2e-azure-ovn-4.9 (all) - 85 runs, 39% failed, 3% of failures match = 1% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.9 (all) - 85 runs, 66% failed, 4% of failures match = 2% impact
release-openshift-ocp-installer-e2e-metal-4.8 (all) - 39 runs, 49% failed, 5% of failures match = 3% impact
release-openshift-ocp-installer-e2e-metal-compact-4.8 (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
release-openshift-ocp-installer-e2e-remote-libvirt-compact-s390x-4.8 (all) - 16 runs, 94% failed, 7% of failures match = 6% impact
release-openshift-ocp-installer-e2e-remote-libvirt-ppc64le-4.6 (all) - 16 runs, 31% failed, 20% of failures match = 6% impact
release-openshift-ocp-installer-e2e-remote-libvirt-ppc64le-4.7 (all) - 16 runs, 75% failed, 8% of failures match = 6% impact
release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.6 (all) - 16 runs, 63% failed, 10% of failures match = 6% impact
release-openshift-ocp-installer-e2e-remote-libvirt-s390x-4.8 (all) - 16 runs, 100% failed, 38% of failures match = 38% impact
release-openshift-origin-installer-e2e-gcp-compact-4.5 (all) - 7 runs, 43% failed, 33% of failures match = 14% impact

So there are still a number of 4.6 through 4.8 failures in releases that will be supported for a long time, and whose CI health we will be closely monitoring for as long as those releases are supported.
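For anyone reading the output above: the per-job "impact" column appears to be the product of the failure rate and the matching rate, i.e. the fraction of all runs in which this particular test failure showed up. A minimal sketch of that arithmetic (the helper name is mine, not part of CI search):

```python
def impact(failed_pct, match_pct):
    """Approximate CI-search 'impact': share of all runs where the matching
    failure appeared = (share of runs that failed) * (share of those
    failures that match the query), rounded to a whole percent."""
    return round(failed_pct / 100 * match_pct / 100 * 100)

# 4.5-e2e-vsphere-upi line: 100% failed, 73% of failures match
print(impact(100, 73))  # -> 73
# 4.5 roks-toolkit line: 93% failed, 4% of failures match
print(impact(93, 4))    # -> 4
```

So a job with a high match rate but a low failure rate (like 4.5-e2e-aws-fips) can still end up with a modest impact number.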
Hi Trevor, my apologies, I should have been more elaborate in my comment when I closed that bug. Let me clarify a few things before I explain why I decided to close this bug for 4.8. I found and fixed an issue [1] in upstream cAdvisor which modified how cAdvisor reports errors. cAdvisor is imported into OpenShift in two different places: the openshift/kubernetes repo and the openshift/origin repo. This PR [2] brought my upstream changes [1] into openshift/kubernetes, while this PR [3] brought those upstream changes [1] into openshift/origin. This reduced the failures in this test significantly, down to occasional flakes. Since the PR for openshift/kubernetes [2] was merged before the 4.8 window closed, the changes made it into the 4.8 branch of openshift/kubernetes. However, when I raised the PR [4], Seth Jennings pointed out that the code in openshift/origin is no longer used in building the kubelet (even though it imports cAdvisor). This means we do not need to merge [4] in order to fix this issue; we only need the changes in openshift/kubernetes, not openshift/origin. This was confirmed when we looked at the SNO CI for 4.8 [6], which had the changes in openshift/kubernetes but not in openshift/origin. So, since we don't need the changes in openshift/origin, we already have the required changes in openshift/kubernetes, and we see the test going from failing to occasional flakes [6], I decided to close this bug. The search [8] you mentioned in your comment is slightly misleading IMO. I tried opening some random results from that search [9], [10]. Although those jobs failed, it wasn't due to a failure of this test specifically; rather, pretty much all the tests in those jobs failed. So I am not sure I would link those runs to this BZ. A good example of a job failing due to the issue in this BZ would be this one [11], where the job is clearly failing because of the test linked with this BZ.
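The triage rule I applied to [9] and [10] can be sketched as a small heuristic over a run's junit results: only count a run against this BZ if the cAdvisor test failed while most other tests passed. This is a hypothetical sketch, not CI tooling; the test-name string and the failure threshold are assumptions of mine:

```python
import xml.etree.ElementTree as ET

# Abbreviated test name, assumed for illustration; the real junit
# name carries additional sig/apigroup prefixes.
CADVISOR_TEST = ("Prometheus when installed on the cluster "
                 "should have non-Pod host cAdvisor metrics")

def implicates_bz(junit_xml, test_name=CADVISOR_TEST, max_other_failures=5):
    """Return True only if this test failed in an otherwise mostly healthy
    run. If nearly every test failed, the run was broken for some other
    reason and shouldn't be linked to this BZ."""
    root = ET.fromstring(junit_xml)
    failed = [case.get("name") for case in root.iter("testcase")
              if case.find("failure") is not None]
    return test_name in failed and len(failed) - 1 <= max_other_failures
```

Under this rule, runs like [9] and [10] (where almost everything failed) would not be attributed to the BZ, while a run like [11] (isolated cAdvisor test failure) would be.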
[1] https://github.com/google/cadvisor/pull/2868
[2] https://github.com/openshift/kubernetes/pull/802
[3] https://github.com/openshift/origin/pull/26232
[4] https://github.com/openshift/origin/pull/26243
[5] https://coreos.slack.com/archives/GK6BJJ1J5/p1623919675075400
[6] https://testgrid.k8s.io/redhat-single-node#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node&include-filter-by-regex=cAdvisor
[7] https://bugzilla.redhat.com/show_bug.cgi?id=1973075#c2
[8] https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+should+have+non-Pod+host+cAdvisor+metrics&maxAge=336h&type=junit
[9] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_multus-cni/90/pull-ci-openshift-multus-cni-release-4.7-e2e-aws/1410116803887108096
[10] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_multus-cni/90/pull-ci-openshift-multus-cni-release-4.7-e2e-aws/1409203294936502272
[11] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1121/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1383816160142692352