1821661 – CI: Alerts shouldn't report any alerts in firing state, SHOULD NOT HAPPEN: duplicate entries for key

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-apiserver
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Stefan Schimanski
QA Contact:	Ke Wang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1878862

Reported:	2020-04-07 11:05 UTC by Sinny Kumari
Modified:	2020-10-27 15:58 UTC
CC List:	14 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1878862
Environment:	test: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
Last Closed:	2020-10-27 15:57:47 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubernetes kubernetes issues 88182	None	closed	"SHOULD NOT HAPPEN" should not happen	2021-02-02 12:28:43 UTC
Github	openshift kubernetes pull 335	None	closed	Bug 1821661: UPSTREAM: 94614: e2e: fix deployment non-unique env vars to avoid SSA error	2021-02-02 12:28:43 UTC
Github	openshift origin pull 25495	None	closed	Bug 1873043: Bump to kube 1.19.0	2021-02-02 12:29:28 UTC
Red Hat Product Errata	RHBA-2020:4196	None	None	None	2020-10-27 15:58:14 UTC

Description Sinny Kumari 2020-04-07 11:05:40 UTC

Seeing this in release-openshift-origin-installer-e2e-gcp-4.5 CI tests, as reported by https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#release-openshift-origin-installer-e2e-gcp-4.5&sort-by-flakiness=

Example of failing jobs:
* Alert TargetDown - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1260
* Alert KubeletPlegDurationHigh - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1261
* Alert KubeAPIErrorBudgetBurn - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1256

Error message from one of the failing jobs:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:167]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubeletPlegDurationHigh\",\"alertstate\":\"firing\",\"instance\":\"10.0.0.5:10250\",\"node\":\"ci-op-snzkl-m-0.c.openshift-gce-devel-ci.internal\",\"quantile\":\"0.99\",\"severity\":\"warning\"},\"value\":[1586250781.73,\"1\"]}]",
        },
    }
to be empty
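
When triaging a failure like the one above, it can help to pull the alert names out of the result payload embedded in the message. The following is a minimal sketch, not the code the origin test uses (that test is written in Go); the helper name firing_alert_names is hypothetical, and the payload is the unescaped result copied from the error message.

```python
import json
from typing import List


def firing_alert_names(result_json: str) -> List[str]:
    """Return the sorted, de-duplicated alertname labels from a Prometheus vector result."""
    samples = json.loads(result_json)
    return sorted({s["metric"].get("alertname", "<unknown>") for s in samples})


# Result payload copied from the failure message above, with the escaping removed.
payload = (
    '[{"metric":{"alertname":"KubeletPlegDurationHigh","alertstate":"firing",'
    '"instance":"10.0.0.5:10250",'
    '"node":"ci-op-snzkl-m-0.c.openshift-gce-devel-ci.internal",'
    '"quantile":"0.99","severity":"warning"},"value":[1586250781.73,"1"]}]'
)

print(firing_alert_names(payload))  # ['KubeletPlegDurationHigh']
```

For this job the only firing alert in the payload is KubeletPlegDurationHigh; the other example jobs would need their own payloads.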


Additional information: This may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1812999

Comment 3 W. Trevor King 2020-04-07 19:40:03 UTC

With the openshift-console-operator issue spun off into bug 1821708 (since closed as a dup of bug 1783881) and the KubeletPlegDurationHigh issue spun off into bug 1821697, I guess this bug is now just about the KubeAPIErrorBudgetBurn issue?  I'll update the title to reflect that, and hope I'm correct.

Comment 4 Venkata Siva Teja Areti 2020-04-07 20:06:41 UTC

I could see different errors in kube-apiserver logs.

> E0407 06:33:48.878868       1 structuredmerge.go:103] [SHOULD NOT HAPPEN] failed to create typed new object of type apps/v1, Kind=Deployment: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]

These and other errors around it are related to the upstream issue https://github.com/kubernetes/kubernetes/issues/88182
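
The log line complains that container "httpd" lists the env var name "A" more than once. A minimal sketch of how such duplicates could be spotted in a pod spec before it is sent to the API server follows. The function name duplicate_env_names and the sample spec (including the image name) are hypothetical and only mirror the names in the log line; this is not the apiserver's own validation.

```python
from collections import Counter


def duplicate_env_names(pod_spec: dict) -> dict:
    """Map container name -> sorted list of env var names that appear more than once."""
    dupes = {}
    for container in pod_spec.get("containers", []):
        counts = Counter(e["name"] for e in container.get("env", []))
        repeated = sorted(name for name, count in counts.items() if count > 1)
        if repeated:
            dupes[container["name"]] = repeated
    return dupes


# Hypothetical pod template spec: container "httpd" sets env var "A" twice.
pod_spec = {
    "containers": [
        {
            "name": "httpd",
            "image": "example/httpd",  # placeholder image name
            "env": [
                {"name": "A", "value": "1"},
                {"name": "A", "value": "2"},
                {"name": "B", "value": "3"},
            ],
        }
    ]
}

print(duplicate_env_names(pod_spec))  # {'httpd': ['A']}
```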

There are other errors for which BZs are already filed and are being tracked:

server side validation error: https://bugzilla.redhat.com/show_bug.cgi?id=1786269
Metrics group version log spamming: https://bugzilla.redhat.com/show_bug.cgi?id=1819053

Apart from these issues, there are no other errors that would cause 503 as far as I looked.

As most of these issues are related to k8s 1.18, moving this to 4.5

Comment 5 W. Trevor King 2020-04-23 04:44:32 UTC

Bumping to high priority because this will start to block update CI once [1] lands to make alerting during updates illegal.

[1]: https://github.com/openshift/origin/pull/24786

Comment 6 W. Trevor King 2020-04-27 18:02:14 UTC

As part of fixing this bug, [1] should be reverted.

[1]: https://github.com/openshift/origin/pull/24786/commits/3a9233400053c036838bdbf7f992874b7a0805fd

Comment 7 Stefan Schimanski 2020-05-06 10:22:44 UTC

E0407 06:33:48.878868       1 structuredmerge.go:103] [SHOULD NOT HAPPEN] failed to create typed new object of type apps/v1, Kind=Deployment: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]

This is from an early beta feature which is not GA. We have to ignore these errors. I wonder where these requests come from. No controller should use server-side apply today. But we will probably have e2e tests for that feature.

Comment 8 Stefan Schimanski 2020-05-06 10:32:35 UTC

Should be fixed with 1.18 rebase due to https://github.com/kubernetes/kubernetes/issues/88182.

Comment 9 W. Trevor King 2020-05-06 18:26:44 UTC

Per comment 6, fixing this bug at least requires reverting origin@3a92334000.

Comment 10 W. Trevor King 2020-05-19 15:19:23 UTC

Updates team has no special ownership of this test; not clear to me why Jack would be on the hook to revert origin@3a92334000.

Comment 11 Stefan Schimanski 2020-05-20 09:29:16 UTC

Postponing to 4.6. This is about server-side apply. The feature is not GA, but early beta. We will see in 4.6 how it behaves.

Comment 12 Stefan Schimanski 2020-06-18 10:16:11 UTC

Same as comment 11. Waiting for 1.19.

Comment 14 Stefan Schimanski 2020-08-03 10:22:31 UTC

We have rebased to 1.19. This is supposed to be fixed.

Comment 18 Ke Wang 2020-08-21 08:34:02 UTC

Per PR https://github.com/openshift/origin/pull/25314, 4.6 has already been rebased to kube 1.19-rc.2.

Checked the repo:
$ git log --date local --pretty="%h %an %cd - %s" | grep 'kube 1.19'
d9ca44ba95 Maru Newby Thu Jul 30 02:02:04 2020 - bump(*) to kube 1.19.0-rc.2

Searched for 'shouldn't report any alerts in firing state apart from Watchdog' in the release-openshift-origin-installer-e2e-azure-4.6 CI tests reported at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-blocking#release-openshift-origin-installer-e2e-azure-4.6&sort-by-flakiness and found no KubeAPIErrorBudgetBurn-related errors in the failed tests; most related tests passed. Moving the bug to Verified.

Comment 19 W. Trevor King 2020-09-02 19:51:29 UTC

(In reply to W. Trevor King from comment #9)
> Per comment 6, fixing this bug at least requires reverting origin@3a92334000.

This is still true, and the revert has still not landed in master:

$ git --no-pager log --oneline -G KubeAPIErrorBudgetBurn test/e2e/upgrade
3a92334000 (origin/pr/24786) Ignore KubeAPIErrorBudgetBurn alert

Comment 20 Stefan Schimanski 2020-09-08 11:33:07 UTC

1.19 does not fix the root cause. The root cause is user data.

Compare discussion in https://github.com/kubernetes/kubernetes/issues/88182 and https://github.com/kubernetes/kubernetes/pull/88600. The latter only reduces frequency.

Comment 22 W. Trevor King 2020-09-14 17:58:45 UTC

Reverting the KubeAPIErrorBudgetBurn alert has been spun off into bug 1878862.

Comment 23 W. Trevor King 2020-09-14 17:59:13 UTC

Oops, I meant "Reverting the KubeAPIErrorBudgetBurn ignore".

Comment 24 Stefan Schimanski 2020-09-15 11:36:22 UTC

This is blocked on origin 1.19 rebase.

Comment 26 Ke Wang 2020-09-23 02:00:36 UTC

Per comment #22, this bug no longer involves 'KubeAPIErrorBudgetBurn', so I will change the bug subject.

Comment 27 Ke Wang 2020-09-23 02:03:35 UTC

Verification:
1. 4.6 has already been rebased to kube 1.19-rc.2. Searching for the keyword 'SHOULD NOT HAPPEN' at https://search.ci.openshift.org/?search=SHOULD+NOT+HAPPEN&maxAge=336h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job returns results such as this one:
...
E0910 02:03:11.289096      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (apps/v1, Kind=Deployment) to smd typed: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]
E0910 02:03:12.365113      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (/v1, Kind=Pod) to smd typed: .spec.containers[name="httpd"].env: duplicate entries for key [name="A"]
...

The 'SHOULD NOT HAPPEN' error message above now appears about once per second rather than spamming the log many times per second, so the PR https://github.com/kubernetes/kubernetes/pull/88600 works as expected.
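
The per-second rate of these messages can be checked with a small script. This is a minimal sketch, assuming klog-style line prefixes (severity letter, MMDD, then HH:MM:SS.microseconds) as seen in the excerpt above; the function name should_not_happen_per_second is hypothetical, and the two sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# klog-style prefix: severity letter, MMDD, then HH:MM:SS.microseconds.
KLOG_PREFIX = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2})\.\d+\s")


def should_not_happen_per_second(lines):
    """Count '[SHOULD NOT HAPPEN]' log lines per (MMDD, HH:MM:SS) bucket."""
    counts = Counter()
    for line in lines:
        if "[SHOULD NOT HAPPEN]" not in line:
            continue
        match = KLOG_PREFIX.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# The two log lines quoted in the excerpt above.
sample = [
    'E0910 02:03:11.289096      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (apps/v1, Kind=Deployment) to smd typed: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]',
    'E0910 02:03:12.365113      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (/v1, Kind=Pod) to smd typed: .spec.containers[name="httpd"].env: duplicate entries for key [name="A"]',
]

counts = should_not_happen_per_second(sample)
print(max(counts.values()))  # 1
```

Run against a full build log, a maximum well above 1 in any bucket would suggest the messages are still being emitted in bursts.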

2. The latest build, 4.6.0-0.nightly-2020-09-20-184226, includes the merged https://github.com/openshift/kubernetes PR (#335, per the log below):
$ git log --date local --pretty="%h %an %cd - %s" 4336ff45 | grep '#335 '
0634471ce54 OpenShift Merge Robot Tue Sep 8 23:43:06 2020 - Merge pull request #335 from sttts/sttts-fix-non-unique-test-env-var-openshift

Comment 29 errata-xmlrpc 2020-10-27 15:57:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
