Bug 1821661 - CI: Alerts shouldn't report any alerts in firing state, SHOULD NOT HAPPEN: duplicate entries for key
Summary: CI: Alerts shouldn't report any alerts in firing state, SHOULD NOT HAPPEN: du...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks: 1878862
Reported: 2020-04-07 11:05 UTC by Sinny Kumari
Modified: 2020-10-27 15:58 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1878862 (view as bug list)
Environment:
test: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
Last Closed: 2020-10-27 15:57:47 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubernetes kubernetes issues 88182 | 0 | None | closed | "SHOULD NOT HAPPEN" should not happen | 2021-02-02 12:28:43 UTC
Github | openshift kubernetes pull 335 | 0 | None | closed | Bug 1821661: UPSTREAM: 94614: e2e: fix deployment non-unique env vars to avoid SSA error | 2021-02-02 12:28:43 UTC
Github | openshift origin pull 25495 | 0 | None | closed | Bug 1873043: Bump to kube 1.19.0 | 2021-02-02 12:29:28 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 15:58:14 UTC

Description Sinny Kumari 2020-04-07 11:05:40 UTC
Seeing this in release-openshift-origin-installer-e2e-gcp-4.5 CI tests reported by https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#release-openshift-origin-installer-e2e-gcp-4.5&sort-by-flakiness=

Example of failing jobs:
* Alert TargetDown - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1260
* Alert KubeletPlegDurationHigh - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1261
* Alert KubeAPIErrorBudgetBurn - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1256

Error message from one of the failing jobs:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:167]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubeletPlegDurationHigh\",\"alertstate\":\"firing\",\"instance\":\"10.0.0.5:10250\",\"node\":\"ci-op-snzkl-m-0.c.openshift-gce-devel-ci.internal\",\"quantile\":\"0.99\",\"severity\":\"warning\"},\"value\":[1586250781.73,\"1\"]}]",
        },
    }
to be empty
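
For readers unfamiliar with the check, here is a minimal Python sketch of what the ALERTS selector in the failing query matches. It is not the test's Go code. The function name `unexpected_firing` and the in-memory `samples` list are hypothetical stand-ins for a Prometheus query result, and the Watchdog sample's severity value is made up for illustration. Only the label matchers are taken from the query above.

```python
import re

# Alert names the test tolerates, copied from the query in the failure above.
ALLOWED = re.compile(r"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh")


def unexpected_firing(samples):
    """Return the label sets that the ALERTS selector in the query would match.

    `samples` is a list of label dicts (hypothetical stand-in for the
    Prometheus result, not the real client API).
    """
    matched = []
    for labels in samples:
        # alertname!~"...": PromQL regex matchers are fully anchored.
        if ALLOWED.fullmatch(labels.get("alertname", "")):
            continue
        # alertstate="firing"
        if labels.get("alertstate") != "firing":
            continue
        # severity!="info"
        if labels.get("severity") == "info":
            continue
        matched.append(labels)
    return matched


# Illustrative samples: one tolerated alert and the one from the failure above.
samples = [
    {"alertname": "Watchdog", "alertstate": "firing", "severity": "none"},
    {"alertname": "KubeletPlegDurationHigh", "alertstate": "firing", "severity": "warning"},
]
print([s["alertname"] for s in unexpected_firing(samples)])
```

The test expects this set to be empty over the 2h window, so a single firing KubeletPlegDurationHigh sample is enough to fail it.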


Additional information: This may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1812999

Comment 3 W. Trevor King 2020-04-07 19:40:03 UTC
With the openshift-console-operator issue spun off into bug 1821708 (since closed as a dup of bug 1783881) and the KubeletPlegDurationHigh issue spun off into bug 1821697, I guess this bug is now just about the KubeAPIErrorBudgetBurn issue?  I'll update the title to reflect that, and hope I'm correct.

Comment 4 Venkata Siva Teja Areti 2020-04-07 20:06:41 UTC
I could see different errors in kube-apiserver logs.

> E0407 06:33:48.878868       1 structuredmerge.go:103] [SHOULD NOT HAPPEN] failed to create typed new object of type apps/v1, Kind=Deployment: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]

These and other errors around it are related to the upstream issue https://github.com/kubernetes/kubernetes/issues/88182
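
The error message indicates that the container's env list is keyed by name, so two entries named "A" cannot be converted to a typed object. As an illustration only, a small Python sketch that finds such duplicates in a pod spec represented as plain dicts. The helper name `duplicate_env_names` and the sample `pod_spec` are hypothetical; the sample merely resembles the object named in the log line.

```python
from collections import Counter


def duplicate_env_names(pod_spec):
    """Map each container name to the env var names that occur more than once."""
    duplicates = {}
    for container in pod_spec.get("containers", []):
        env = container.get("env") or []
        counts = Counter(entry["name"] for entry in env)
        repeated = sorted(name for name, n in counts.items() if n > 1)
        if repeated:
            duplicates[container["name"]] = repeated
    return duplicates


# Hypothetical pod template with a repeated env var name "A".
pod_spec = {
    "containers": [
        {
            "name": "httpd",
            "env": [
                {"name": "A", "value": "1"},
                {"name": "A", "value": "2"},
                {"name": "B", "value": "3"},
            ],
        }
    ]
}
print(duplicate_env_names(pod_spec))
```

Running a check like this over test manifests is one way to locate the objects that trigger the log line.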

There are other errors for which BZs are already filed and are being tracked.

server side validation error: https://bugzilla.redhat.com/show_bug.cgi?id=1786269
Metrics group version log spamming: https://bugzilla.redhat.com/show_bug.cgi?id=1819053

Apart from these issues, I found no other errors that would cause 503s.

As most of these issues are related to k8s 1.18, moving this to 4.5

Comment 5 W. Trevor King 2020-04-23 04:44:32 UTC
Bumping to high priority because this will start to block update CI once [1] lands to make alerting during updates illegal.

[1]: https://github.com/openshift/origin/pull/24786

Comment 6 W. Trevor King 2020-04-27 18:02:14 UTC
As part of fixing this bug, [1] should be reverted.

[1]: https://github.com/openshift/origin/pull/24786/commits/3a9233400053c036838bdbf7f992874b7a0805fd

Comment 7 Stefan Schimanski 2020-05-06 10:22:44 UTC
E0407 06:33:48.878868       1 structuredmerge.go:103] [SHOULD NOT HAPPEN] failed to create typed new object of type apps/v1, Kind=Deployment: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]

This is from an early beta feature which is not GA. We have to ignore these errors. I wonder where these requests come from. No controller should use server-side apply today. But we will probably have e2e tests for that feature.

Comment 8 Stefan Schimanski 2020-05-06 10:32:35 UTC
Should be fixed with 1.18 rebase due to https://github.com/kubernetes/kubernetes/issues/88182.

Comment 9 W. Trevor King 2020-05-06 18:26:44 UTC
Per comment 6, fixing this bug at least requires reverting origin@3a92334000.

Comment 10 W. Trevor King 2020-05-19 15:19:23 UTC
Updates team has no special ownership of this test; not clear to me why Jack would be on the hook to revert origin@3a92334000.

Comment 11 Stefan Schimanski 2020-05-20 09:29:16 UTC
Postponing to 4.6. This is about server-side apply. The feature is not GA, but early beta. We will see in 4.6 how it behaves.

Comment 12 Stefan Schimanski 2020-06-18 10:16:11 UTC
Same as comment 11. Waiting for 1.19.

Comment 14 Stefan Schimanski 2020-08-03 10:22:31 UTC
We have rebased to 1.19. This is supposed to be fixed.

Comment 18 Ke Wang 2020-08-21 08:34:02 UTC
Per PR https://github.com/openshift/origin/pull/25314, 4.6 has already been rebased to kube 1.19-rc.2.

Checked the repo,
$ git log --date local --pretty="%h %an %cd - %s" | grep 'kube 1.19'
d9ca44ba95 Maru Newby Thu Jul 30 02:02:04 2020 - bump(*) to kube 1.19.0-rc.2

Searched for 'shouldn't report any alerts in firing state apart from Watchdog' in the release-openshift-origin-installer-e2e-azure-4.6 CI tests reported at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-blocking#release-openshift-origin-installer-e2e-azure-4.6&sort-by-flakiness and found no Alert KubeAPIErrorBudgetBurn related errors in the failed tests. Most related tests passed, so moving the bug to Verified.

Comment 19 W. Trevor King 2020-09-02 19:51:29 UTC
(In reply to W. Trevor King from comment #9)
> Per comment 6, fixing this bug at least requires reverting origin@3a92334000.

This is still true, and the revert has still not landed in master:

$ git --no-pager log --oneline -G KubeAPIErrorBudgetBurn test/e2e/upgrade
3a92334000 (origin/pr/24786) Ignore KubeAPIErrorBudgetBurn alert

Comment 20 Stefan Schimanski 2020-09-08 11:33:07 UTC
1.19 does not fix the root cause. The root cause is user data.

Compare discussion in https://github.com/kubernetes/kubernetes/issues/88182 and https://github.com/kubernetes/kubernetes/pull/88600. The latter only reduces frequency.

Comment 22 W. Trevor King 2020-09-14 17:58:45 UTC
Reverting the KubeAPIErrorBudgetBurn alert has been spun off into bug 1878862.

Comment 23 W. Trevor King 2020-09-14 17:59:13 UTC
Oops, I meant "Reverting the KubeAPIErrorBudgetBurn ignore".

Comment 24 Stefan Schimanski 2020-09-15 11:36:22 UTC
This is blocked on origin 1.19 rebase.

Comment 26 Ke Wang 2020-09-23 02:00:36 UTC
Per comment #22, this bug no longer involves 'KubeAPIErrorBudgetBurn', so I will change the bug subject.

Comment 27 Ke Wang 2020-09-23 02:03:35 UTC
Verification:
1. 4.6 has already been rebased to kube 1.19-rc.2. Searching for the keyword 'SHOULD NOT HAPPEN' at https://search.ci.openshift.org/?search=SHOULD+NOT+HAPPEN&maxAge=336h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job gives results such as:
...
E0910 02:03:11.289096      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (apps/v1, Kind=Deployment) to smd typed: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]
E0910 02:03:12.365113      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (/v1, Kind=Pod) to smd typed: .spec.containers[name="httpd"].env: duplicate entries for key [name="A"]
...

The 'SHOULD NOT HAPPEN' error message above now appears about once per second instead of spamming many times per second, so the PR https://github.com/kubernetes/kubernetes/pull/88600 works as expected.
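
To put a number on the frequency, a rough Python sketch that counts '[SHOULD NOT HAPPEN]' lines per wall-clock second. It assumes the klog header layout visible in the excerpt (severity letter, MMDD, then HH:MM:SS.microseconds). The function name `should_not_happen_per_second` is hypothetical, and the two input lines are the ones quoted above.

```python
import re
from collections import Counter

# klog header: severity letter, MMDD, then HH:MM:SS.microseconds
KLOG = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2})\.\d+\s")


def should_not_happen_per_second(lines):
    """Count '[SHOULD NOT HAPPEN]' log lines per (MMDD, HH:MM:SS) bucket."""
    counts = Counter()
    for line in lines:
        header = KLOG.match(line)
        if header and "[SHOULD NOT HAPPEN]" in line:
            counts[(header.group(1), header.group(2))] += 1
    return counts


# The two log lines quoted in this comment.
lines = [
    'E0910 02:03:11.289096      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (apps/v1, Kind=Deployment) to smd typed: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]',
    'E0910 02:03:12.365113      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (/v1, Kind=Pod) to smd typed: .spec.containers[name="httpd"].env: duplicate entries for key [name="A"]',
]
counts = should_not_happen_per_second(lines)
print(max(counts.values()))
```

Feeding a full build log through the same function would show whether any second still contains a burst of these messages.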

2. The latest build, 4.6.0-0.nightly-2020-09-20-184226, includes the merged https://github.com/openshift/kubernetes PR #335:
$ git log --date local --pretty="%h %an %cd - %s" 4336ff45 | grep '#335 '
0634471ce54 OpenShift Merge Robot Tue Sep 8 23:43:06 2020 - Merge pull request #335 from sttts/sttts-fix-non-unique-test-env-var-openshift

Comment 29 errata-xmlrpc 2020-10-27 15:57:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

