Bug 1812999

Summary: CI: compact: Alerts shouldn't report any alerts in firing state: KubeCPUOvercommit
Product: OpenShift Container Platform
Reporter: Corey Daley <cdaley>
Component: Monitoring
Assignee: Pawel Krupa <pkrupa>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: low
Priority: unspecified
Version: 4.4
CC: alegrand, anpicker, aos-bugs, bparees, ccoleman, erooth, hongkliu, jesusr, jokerman, kakkoyun, lcosic, mloibl, pkrupa, pmuller, sdodson, surbania, wking
Target Milestone: ---
Target Release: 4.4.z
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2020-05-26 07:53:43 UTC
Bug Depends On: 1812719    
Bug Blocks:    

Description Corey Daley 2020-03-12 17:04:42 UTC
Seen in 4.4 Informing: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-origin-installer-e2e-aws-compact-4.4&sort-by-flakiness=

Example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.4/60

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:163]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh|FailingOperator\",alertstate=\"firing\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh|FailingOperator\",alertstate=\"firing\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubeCPUOvercommit\",\"alertstate\":\"firing\",\"severity\":\"warning\"},\"value\":[1583968250.407,\"60\"]}]",
        },
    }
to be empty
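
For reference, the failing check is a single instant PromQL query against the in-cluster Prometheus. A minimal sketch of running the same query by hand (assuming a prometheus-k8s-0 pod in openshift-monitoring and that a local port-forward exposes the unauthenticated Prometheus API on 9090):

    # forward the Prometheus API to localhost, then run the same instant query the test uses
    $ oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090 &
    $ curl -s http://localhost:9090/api/v1/query \
        --data-urlencode 'query=count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh|FailingOperator",alertstate="firing"}[2h]) >= 1' \
        | jq -r '.data.result'    # non-empty output means some unexpected alert fired during the window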

Comment 1 Ben Parees 2020-03-12 17:19:27 UTC
Whoever owns that alert is going to have to find a way to throttle it down on compact clusters.

Going to start with the Node team (not 100% sure who owns the alert).

Comment 2 Pawel Krupa 2020-03-13 07:11:15 UTC
KubeCPUOvercommit is shipped by the cluster-monitoring-operator.
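
For context, the alert comes from the upstream kubernetes-mixin bundled into cluster-monitoring-operator; its expression is roughly the following (paraphrased, exact recording-rule names vary between releases):

    sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum)
      /
    sum(kube_node_status_allocatable_cpu_cores)
      >
    (count(kube_node_status_allocatable_cpu_cores) - 1) / count(kube_node_status_allocatable_cpu_cores)

That is, the alert fires (after a short `for` delay) once total CPU requests exceed the cluster's allocatable CPU minus one node's worth of headroom. On a 3-node compact cluster that threshold is only 2/3 of allocatable CPU, which a combined control-plane plus worker footprint can easily exceed.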

Comment 5 Sergiusz Urbaniak 2020-03-13 11:29:55 UTC
*** Bug 1813077 has been marked as a duplicate of this bug. ***

Comment 6 Ben Parees 2020-03-13 13:38:08 UTC
Sorry, I am raising this severity back up. The purpose of the bug is not so much to determine why we are firing this alert as to find a way to make the "compact" (low node count) e2e job pass, whether that means disabling this test in that job, changing the nature of the alert, or reducing the overall OpenShift footprint.

But as long as the job is considered a release-informing job, tests that fail or flake in it will be tied to high-severity bugs that we'll be asking teams to prioritize.

We can also start a discussion about whether "compact" should be a release-informing job, but that's a conversation that needs to include our architecture team.

Comment 7 Clayton Coleman 2020-03-13 14:56:15 UTC
We are hopefully fixing CPU resource requests in 4.5 right now.  If that doesn't clear this, we will bump the compact cluster instance sizes.

Comment 8 Pawel Krupa 2020-03-16 10:23:51 UTC
@Ben: to disable checking for firing alerts, you can (ab)use the TEST_UNSUPPORTED_ALLOW_VERSION_SKEW variable [1] in the e2e job config. Since the workaround exists and we are actively working on fixing the resource requests, I am reducing the severity once again.


[1]: https://github.com/openshift/origin/blob/master/test/extended/prometheus/prometheus.go#L51-L65
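
A minimal sketch of that workaround, assuming the suite only checks whether the variable is set to a non-empty value (the real CI jobs would set it through the job's environment rather than a manual invocation):

    $ export TEST_UNSUPPORTED_ALLOW_VERSION_SKEW=true
    $ openshift-tests run openshift/conformance/parallel   # or whichever suite the job runs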

Comment 10 Petr Muller 2020-03-24 17:23:54 UTC
This seems to be one of the top flakes in the non-compact job:

release-openshift-origin-installer-e2e-gcp-4.4: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-blocking#release-openshift-origin-installer-e2e-gcp-4.4&sort-by-flakiness=

Comment 11 Lili Cosic 2020-03-24 18:05:43 UTC
> This seems to be one of the top flakes in the non-compact job:

The problem is that this can be any alert that is firing in the cluster, so it's normal for this test to be the top "flake", but the failures are most likely not related, as various different alerts can be firing. Unless we can find that it's one alert that is firing in all of those?

Comment 12 Pawel Krupa 2020-03-26 14:40:00 UTC
I have analyzed the last 8 failures of the `Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured` test. Here is what happened (some labels removed for clarity):

4 times it failed due to the console operator being down:
- {"alertname":"TargetDown","job":"metrics","namespace":"openshift-console-operator"}

1 time there was a failure deploying the apiserver, which triggered 5 alerts:
- {"alertname":"ClusterOperatorDegraded","name":"openshift-apiserver","reason":"APIServerDeployment_UnavailablePod"}
- {"alertname":"ClusterOperatorDown","name":"openshift-apiserver"},
- {"alertname":"KubeDeploymentReplicasMismatch","deployment":"apiserver"}
- {"alertname":"KubePodNotReady","namespace":"openshift-apiserver","pod":"apiserver-7744d8d8fb-l7zcn"}
- {"alertname":"TargetDown","namespace":"openshift-apiserver","service":"api"}

1 time there was something wrong with 3 operators, which manifested as 4 alerts:
- {"alertname":"ClusterOperatorDown","job":"cluster-version-operator","name":"authentication"}
- {"alertname":"ClusterOperatorDown","job":"cluster-version-operator","name":"console"}
- {"alertname":"ClusterOperatorDown","job":"cluster-version-operator","name":"kube-apiserver"}
- {"alertname":"KubePodCrashLooping","container":"console"}

And as for the other cases, we had one test failure for each of the following alerts:
 - {"alertname":"TargetDown","job":"controller-manager","namespace":"openshift-controller-manager"}
 - {"alertname":"KubePodCrashLooping","container":"sdn-controller"}
 
Also, in most of those cases there were other test failures. In the end this test is not flaky, but it covers a very broad spectrum of possible failures, and with an increased number of alerts we should be prepared for it to fail more often than before.

Comment 14 Ben Parees 2020-03-31 13:34:37 UTC
I still see the KubeCPUOvercommit alert firing in compact-4.4 jobs:

https://search.svc.ci.openshift.org/?search=KubeCPUOvercommit&maxAge=48h&context=2&type=junit

Comment 16 Pawel Krupa 2020-05-26 07:53:43 UTC
I don't see this alert firing in CI anymore.

Comment 17 W. Trevor King 2020-05-27 03:45:57 UTC
From Test Grid [1], looks like the test was fixed between [2] and [3].  Diffing:

$ JQ='[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]'
$ diff -U0 <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.4/73/artifacts/release-images-latest/release-images-latest | jq -r "${JQ}") <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.4/74/artifacts/release-images-latest/release-images-latest | jq -r "${JQ}")
--- /dev/fd/63	2020-05-26 20:35:38.805701029 -0700
+++ /dev/fd/62	2020-05-26 20:35:38.805701029 -0700
@@ -25 +25 @@
-cluster-monitoring-operator https://github.com/openshift/cluster-monitoring-operator/commit/76b306f220278f831fd0cacdb054947fb9861773
+cluster-monitoring-operator https://github.com/openshift/cluster-monitoring-operator/commit/0e9cf5f4b5adf87b15358f0858190f262b48ed16
@@ -51 +51 @@
-hyperkube https://github.com/openshift/origin/commit/bd89b9146c62dfce68680971155b1e9cbe4cb7a5
+hyperkube https://github.com/openshift/origin/commit/83745f6e012036ef6579cd7cb080d574b6f786bc
@@ -89 +89 @@
-openshift-apiserver https://github.com/openshift/openshift-apiserver/commit/325d99cfedd79c654f262d3796d16575d2cbee8c
+openshift-apiserver https://github.com/openshift/openshift-apiserver/commit/c28dd306e7bb0b553b1e76429892990c0adbc665
@@ -109 +109 @@
-tests https://github.com/openshift/origin/commit/bd89b9146c62dfce68680971155b1e9cbe4cb7a5
+tests https://github.com/openshift/origin/commit/83745f6e012036ef6579cd7cb080d574b6f786bc

Checking those:

cluster-monitoring-operator $ git --no-pager log --oneline 76b306f220..0e9cf5f4b5
0e9cf5f4 Merge pull request #706 from s-urbaniak/requests-4.4
9ef4b111 (origin/pr/706) assets: regenerate
08795a0f jsonnet/alertmanager: remove config-reloader resources
4c07aa15 jsonnet: add missing resource requests for sidecars
434e0305 jsonnet: bump telemeter-client
e83843e8 assets: regenerate
06214ebd jsonnet/*: adapt resource requests
origin $ git --no-pager log --oneline bd89b9146..83745f6e01
83745f6e01 Merge pull request #24822 from openshift-cherrypick-robot/cherry-pick-24804-to-release-4.4
22a6c2eb00 Merge pull request #24728 from ecordell/flake-olm-fix
1d8112f083 (origin/pr/24822) use rbac vs direct scc edit for s2i root bld test
d25e59067d (origin/pr/24728) fix test flake in operators test
51ef06d196 fix test flake in olm tests
openshift-apiserver $ git --no-pager log --oneline 325d99cf...c28dd306e
c28dd306e (origin/release-4.4) Merge pull request #87 from p0lyn0mial/bump-kubernetes-apiserver
48089add1 (origin/pr/87) bump (kubernetes-apiserver)
43a568f3a pin openshift/kubernetes-apiserver

The only thing in there that seems to be bumping CPU requests is 06214ebd [4], so I'm going to close this as a dup of bug 1813221.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-origin-installer-e2e-aws-compact-4.4&sort-by-flakiness
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.4/73
[3]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.4/74
[4]: https://github.com/openshift/cluster-monitoring-operator/pull/706/commits/06214ebd0d28d66ed8854811de06a07ffe20509a

*** This bug has been marked as a duplicate of bug 1813221 ***