Seen in 4.4 Informing: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-origin-installer-e2e-aws-compact-4.4&sort-by-flakiness=

Example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.4/60

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:163]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh|FailingOperator\",alertstate=\"firing\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh|FailingOperator\",alertstate=\"firing\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubeCPUOvercommit\",\"alertstate\":\"firing\",\"severity\":\"warning\"},\"value\":[1583968250.407,\"60\"]}]",
        },
    }
to be empty
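For reference, the query the test runs, re-formatted for readability. The alertname regex is the test's allowlist, and count_over_time(...[2h]) >= 1 means a non-allowlisted alert was seen firing at least once during the 2h window:

  count_over_time(
    ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh|FailingOperator",
           alertstate="firing"}[2h]
  ) >= 1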
Whoever owns that alert is going to have to find a way to throttle it down on compact clusters. Going to start with the Node team (not 100% sure who owns the alert).
KubeCPUOvercommit is shipped by cluster-monitoring-operator
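For anyone not familiar with the alert: it comes from the upstream kubernetes-mixin, and (hedging, since the exact recording-rule names vary between mixin versions) its expression is roughly of this shape:

  sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum)
    / sum(kube_node_status_allocatable_cpu_cores)
    > (count(kube_node_status_allocatable_cpu_cores) - 1)
      / count(kube_node_status_allocatable_cpu_cores)

i.e. it fires when total CPU requests no longer fit on the cluster minus one node. On a 3-node compact cluster that threshold is only 2/3 of allocatable CPU, which is why this job trips it so easily.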
*** Bug 1813077 has been marked as a duplicate of this bug. ***
Sorry, I am raising this severity back up. The purpose of the bug is not so much to determine why we are firing this alert as to find a way to make the "compact" (low node count) e2e job pass. Whether that means disabling this test in that job, changing the nature of the alert, or reducing the overall OpenShift footprint. But as long as the job is considered a release-informing job, tests that fail/flake in it will be tied to high-severity bugs that we'll be asking teams to prioritize. We can also start a discussion about whether "compact" should be a release-informing job, but that's a conversation that needs to include our architecture team.
We are hopefully fixing CPU resource requests in 4.5 right now. If that doesn't clear this, we will bump the compact cluster instance sizes.
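For a quick sanity check on a compact cluster, the per-node CPU request totals can be read from the node descriptions; a sketch, assuming the standard master node-role label and the usual `oc describe` output layout:

  $ oc describe nodes -l node-role.kubernetes.io/master= | grep -A 8 'Allocated resources'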
@Ben to disable checking for firing alerts, you can (ab)use the TEST_UNSUPPORTED_ALLOW_VERSION_SKEW variable [1] in the e2e job config. Since the workaround exists and we are actively working on fixing resource requests, I am reducing severity once again.

[1]: https://github.com/openshift/origin/blob/master/test/extended/prometheus/prometheus.go#L51-L65
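A sketch of that workaround when running the suite by hand (hypothetical invocation; the CI jobs would set this through the job environment instead, and per my reading of [1] any non-empty value is enough, so please verify against the linked code):

  $ TEST_UNSUPPORTED_ALLOW_VERSION_SKEW=true openshift-tests run openshift/conformance/parallel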
This seems to be one of the top flakes in non-compact job: release-openshift-origin-installer-e2e-gcp-4.4: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-blocking#release-openshift-origin-installer-e2e-gcp-4.4&sort-by-flakiness=
> This seems to be one of the top flakes in non-compact job:

The problem is, this can be any alert that is firing in the cluster, so it's normal to have this as the top "flake", but it's most likely not related, since it can be various different alerts firing. Unless we can find that it's one alert that is firing in all of those?
I have analyzed the last 8 test failures in relation to the `Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured` test. Here is what happened (some labels removed for clarity):

4 times it failed due to the console operator being down:
- {"alertname":"TargetDown","job":"metrics","namespace":"openshift-console-operator"}

1 time there was a failure in deploying the apiserver, which triggered 5 alerts:
- {"alertname":"ClusterOperatorDegraded","name":"openshift-apiserver","reason":"APIServerDeployment_UnavailablePod"}
- {"alertname":"ClusterOperatorDown","name":"openshift-apiserver"}
- {"alertname":"KubeDeploymentReplicasMismatch","deployment":"apiserver"}
- {"alertname":"KubePodNotReady","namespace":"openshift-apiserver","pod":"apiserver-7744d8d8fb-l7zcn"}
- {"alertname":"TargetDown","namespace":"openshift-apiserver","service":"api"}

1 time there was something wrong with 3 operators, which manifested in 4 alerts:
- {"alertname":"ClusterOperatorDown","job":"cluster-version-operator","name":"authentication"}
- {"alertname":"ClusterOperatorDown","job":"cluster-version-operator","name":"console"}
- {"alertname":"ClusterOperatorDown","job":"cluster-version-operator","name":"kube-apiserver"}
- {"alertname":"KubePodCrashLooping","container":"console"}

And we had one test failure for each of the following alerts:
- {"alertname":"TargetDown","job":"controller-manager","namespace":"openshift-controller-manager"}
- {"alertname":"KubePodCrashLooping","container":"sdn-controller"}

Also, in most of those cases there were other test failures. In the end this test is not flaky, but it covers a very broad spectrum of possible failures, and with the increased number of alerts we should be prepared for it to fire more often than before.
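For anyone repeating this kind of triage: the failure message prints the offending alerts as a JSON array, so something like the following jq one-liner tallies them by alertname ($failure_json is just a placeholder for that array, not a real artifact path):

  $ jq -r '.[].metric.alertname' <<<"$failure_json" | sort | uniq -c | sort -rn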
I still see the KubeCPUOvercommit alert firing in compact-4.4 jobs: https://search.svc.ci.openshift.org/?search=KubeCPUOvercommit&maxAge=48h&context=2&type=junit
I don't see this alert firing in CI anymore.
From Test Grid [1], looks like the test was fixed between [2] and [3]. Diffing:

$ JQ='[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]'
$ diff -U0 <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.4/73/artifacts/release-images-latest/release-images-latest | jq -r "${JQ}") <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.4/74/artifacts/release-images-latest/release-images-latest | jq -r "${JQ}")
--- /dev/fd/63	2020-05-26 20:35:38.805701029 -0700
+++ /dev/fd/62	2020-05-26 20:35:38.805701029 -0700
@@ -25 +25 @@
-cluster-monitoring-operator https://github.com/openshift/cluster-monitoring-operator/commit/76b306f220278f831fd0cacdb054947fb9861773
+cluster-monitoring-operator https://github.com/openshift/cluster-monitoring-operator/commit/0e9cf5f4b5adf87b15358f0858190f262b48ed16
@@ -51 +51 @@
-hyperkube https://github.com/openshift/origin/commit/bd89b9146c62dfce68680971155b1e9cbe4cb7a5
+hyperkube https://github.com/openshift/origin/commit/83745f6e012036ef6579cd7cb080d574b6f786bc
@@ -89 +89 @@
-openshift-apiserver https://github.com/openshift/openshift-apiserver/commit/325d99cfedd79c654f262d3796d16575d2cbee8c
+openshift-apiserver https://github.com/openshift/openshift-apiserver/commit/c28dd306e7bb0b553b1e76429892990c0adbc665
@@ -109 +109 @@
-tests https://github.com/openshift/origin/commit/bd89b9146c62dfce68680971155b1e9cbe4cb7a5
+tests https://github.com/openshift/origin/commit/83745f6e012036ef6579cd7cb080d574b6f786bc

Checking those:

cluster-monitoring-operator:

$ git --no-pager log --oneline 76b306f220..0e9cf5f4b5
0e9cf5f4 Merge pull request #706 from s-urbaniak/requests-4.4
9ef4b111 (origin/pr/706) assets: regenerate
08795a0f jsonnet/alertmanager: remove config-reloader resources
4c07aa15 jsonnet: add missing resource requests for sidecars
434e0305 jsonnet: bump telemeter-client
e83843e8 assets: regenerate
06214ebd jsonnet/*: adapt resource requests

origin:

$ git --no-pager log --oneline bd89b9146..83745f6e01
83745f6e01 Merge pull request #24822 from openshift-cherrypick-robot/cherry-pick-24804-to-release-4.4
22a6c2eb00 Merge pull request #24728 from ecordell/flake-olm-fix
1d8112f083 (origin/pr/24822) use rbac vs direct scc edit for s2i root bld test
d25e59067d (origin/pr/24728) fix test flake in operators test
51ef06d196 fix test flake in olm tests

openshift-apiserver:

$ git --no-pager log --oneline 325d99cf...c28dd306e
c28dd306e (origin/release-4.4) Merge pull request #87 from p0lyn0mial/bump-kubernetes-apiserver
48089add1 (origin/pr/87) bump (kubernetes-apiserver)
43a568f3a pin openshift/kubernetes-apiserver

The only thing in there that seems to be bumping CPU requests is 06214ebd [4], so I'm going to close this as a dup of bug 1813221.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-origin-installer-e2e-aws-compact-4.4&sort-by-flakiness
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.4/73
[3]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.4/74
[4]: https://github.com/openshift/cluster-monitoring-operator/pull/706/commits/06214ebd0d28d66ed8854811de06a07ffe20509a

*** This bug has been marked as a duplicate of bug 1813221 ***