Bug 1821661 - CI: Alerts shouldn't report any alerts in firing state, SHOULD NOT HAPPEN: duplicate entries for key
Summary: CI: Alerts shouldn't report any alerts in firing state, SHOULD NOT HAPPEN: du...
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks: 1878862
Reported: 2020-04-07 11:05 UTC by Sinny Kumari
Modified: 2020-09-23 02:03 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1878862
Environment:
test: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
Last Closed: 2020-05-06 10:32:35 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
Github kubernetes kubernetes issues 88182 None closed "SHOULD NOT HAPPEN" should not happen 2020-09-22 02:37:26 UTC
Github openshift kubernetes pull 335 None closed Bug 1821661: UPSTREAM: 94614: e2e: fix deployment non-unique env vars to avoid SSA error 2020-09-22 02:37:26 UTC
Github openshift origin pull 25495 None closed Bug 1873043: Bump to kube 1.19.0 2020-09-22 02:37:27 UTC

Description Sinny Kumari 2020-04-07 11:05:40 UTC
Seeing this in the release-openshift-origin-installer-e2e-gcp-4.5 CI tests, as reported by https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#release-openshift-origin-installer-e2e-gcp-4.5&sort-by-flakiness=

Example of failing jobs:
* Alert TargetDown - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1260
* Alert KubeletPlegDurationHigh - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1261
* Alert KubeAPIErrorBudgetBurn - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1256

Error message from one of the failing jobs:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:167]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubeletPlegDurationHigh\",\"alertstate\":\"firing\",\"instance\":\"10.0.0.5:10250\",\"node\":\"ci-op-snzkl-m-0.c.openshift-gce-devel-ci.internal\",\"quantile\":\"0.99\",\"severity\":\"warning\"},\"value\":[1586250781.73,\"1\"]}]",
        },
    }
to be empty
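
For triage, the JSON embedded in a failure message like the one above can be decoded to list which alerts were firing. The following is only a minimal Python sketch, assuming the payload has already been unescaped out of the Go error string; the helper name `firing_alerts` is hypothetical and is not part of the origin test suite.

```python
import json
from typing import List, Tuple

# Result payload copied from the failure message above, with the Go string
# escaping removed by hand.
RESULT = (
    '[{"metric":{"alertname":"KubeletPlegDurationHigh","alertstate":"firing",'
    '"instance":"10.0.0.5:10250",'
    '"node":"ci-op-snzkl-m-0.c.openshift-gce-devel-ci.internal",'
    '"quantile":"0.99","severity":"warning"},"value":[1586250781.73,"1"]}]'
)


def firing_alerts(payload: str) -> List[Tuple[str, str]]:
    """Return sorted, de-duplicated (alertname, severity) pairs.

    `payload` is assumed to be a JSON list of samples, each carrying a
    "metric" label map, as in the failure message.
    """
    samples = json.loads(payload)
    pairs = {
        (
            sample["metric"].get("alertname", "<unknown>"),
            sample["metric"].get("severity", "<unknown>"),
        )
        for sample in samples
    }
    return sorted(pairs)


if __name__ == "__main__":
    for name, severity in firing_alerts(RESULT):
        print(f"{name} ({severity})")
```

The test itself expects this list to be empty, so any pair returned here corresponds to an alert that would fail the job.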


Additional information: This may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1812999

Comment 3 W. Trevor King 2020-04-07 19:40:03 UTC
With the openshift-console-operator issue spun off into bug 1821708 (since closed as a dup of bug 1783881) and the KubeletPlegDurationHigh issue spun off into bug 1821697, I guess this bug is now just about the KubeAPIErrorBudgetBurn issue?  I'll update the title to reflect that, and hope I'm correct.

Comment 4 Venkata Siva Teja Areti 2020-04-07 20:06:41 UTC
I could see different errors in kube-apiserver logs.

> E0407 06:33:48.878868       1 structuredmerge.go:103] [SHOULD NOT HAPPEN] failed to create typed new object of type apps/v1, Kind=Deployment: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]

These and other errors around it are related to the upstream issue https://github.com/kubernetes/kubernetes/issues/88182
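
The log line above says that the `env` list of the `httpd` container holds two entries with the same `name`, and `name` is the key the merge logic uses to identify entries in that list. Below is a minimal Python sketch of a check for that condition on a pod spec held as a plain dict. The function names and sample values are hypothetical, and this is not how the apiserver implements the check.

```python
from collections import Counter


def duplicate_env_names(container):
    """Return the sorted env var names that occur more than once in one container."""
    names = [entry["name"] for entry in container.get("env", [])]
    return sorted(name for name, count in Counter(names).items() if count > 1)


def find_duplicates(pod_spec):
    """Map container name -> duplicated env var names, skipping clean containers."""
    result = {}
    for container in pod_spec.get("containers", []):
        dups = duplicate_env_names(container)
        if dups:
            result[container["name"]] = dups
    return result


# Hypothetical spec shaped like the object in the log line: container
# "httpd" declares env var "A" twice. The values are made up.
SAMPLE_SPEC = {
    "containers": [
        {
            "name": "httpd",
            "env": [
                {"name": "A", "value": "1"},
                {"name": "A", "value": "2"},
                {"name": "B", "value": "3"},
            ],
        }
    ]
}

if __name__ == "__main__":
    print(find_duplicates(SAMPLE_SPEC))
```

A check like this could be run against manifests before they are submitted, to spot objects that would trigger the error.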

There are other errors for which BZs are already filed and are being tracked.

server side validation error: https://bugzilla.redhat.com/show_bug.cgi?id=1786269
Metrics group version log spamming: https://bugzilla.redhat.com/show_bug.cgi?id=1819053

Apart from these issues, there are no other errors that would cause 503 as far as I looked.

As most of these issues are related to k8s 1.18, moving this to 4.5

Comment 5 W. Trevor King 2020-04-23 04:44:32 UTC
Bumping to high priority because this will start to block update CI once [1] lands to make alerting during updates illegal.

[1]: https://github.com/openshift/origin/pull/24786

Comment 6 W. Trevor King 2020-04-27 18:02:14 UTC
As part of fixing this bug, [1] should be reverted.

[1]: https://github.com/openshift/origin/pull/24786/commits/3a9233400053c036838bdbf7f992874b7a0805fd

Comment 7 Stefan Schimanski 2020-05-06 10:22:44 UTC
E0407 06:33:48.878868       1 structuredmerge.go:103] [SHOULD NOT HAPPEN] failed to create typed new object of type apps/v1, Kind=Deployment: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]

This is from an early beta feature which is not GA. We have to ignore these errors. I wonder where these requests come from. No controller should use server-side apply today. But we will probably have e2e tests for that feature.

Comment 8 Stefan Schimanski 2020-05-06 10:32:35 UTC
Should be fixed with 1.18 rebase due to https://github.com/kubernetes/kubernetes/issues/88182.

Comment 9 W. Trevor King 2020-05-06 18:26:44 UTC
Per comment 6, fixing this bug at least requires reverting origin@3a92334000.

Comment 10 W. Trevor King 2020-05-19 15:19:23 UTC
Updates team has no special ownership of this test; not clear to me why Jack would be on the hook to revert origin@3a92334000.

Comment 11 Stefan Schimanski 2020-05-20 09:29:16 UTC
Postponing to 4.6. This is about server-side apply. The feature is not GA, but early beta. We will see in 4.6 how it behaves.

Comment 12 Stefan Schimanski 2020-06-18 10:16:11 UTC
Same as comment 11. Waiting for 1.19.

Comment 14 Stefan Schimanski 2020-08-03 10:22:31 UTC
We have rebased to 1.19. This is supposed to be fixed.

Comment 18 Ke Wang 2020-08-21 08:34:02 UTC
Per PR https://github.com/openshift/origin/pull/25314, 4.6 has already been rebased to kube 1.19-rc.2.

Checked the repo:
$ git log --date local --pretty="%h %an %cd - %s" | grep 'kube 1.19'
d9ca44ba95 Maru Newby Thu Jul 30 02:02:04 2020 - bump(*) to kube 1.19.0-rc.2

Searched for 'shouldn't report any alerts in firing state apart from Watchdog' in the release-openshift-origin-installer-e2e-azure-4.6 CI tests reported at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-blocking#release-openshift-origin-installer-e2e-azure-4.6&sort-by-flakiness. There is no KubeAPIErrorBudgetBurn alert error in the failed tests, and most related tests passed. Moving the bug to VERIFIED.

Comment 19 W. Trevor King 2020-09-02 19:51:29 UTC
(In reply to W. Trevor King from comment #9)
> Per comment 6, fixing this bug at least requires reverting origin@3a92334000.

This is still true, and the revert has still not landed in master:

$ git --no-pager log --oneline -G KubeAPIErrorBudgetBurn test/e2e/upgrade
3a92334000 (origin/pr/24786) Ignore KubeAPIErrorBudgetBurn alert

Comment 20 Stefan Schimanski 2020-09-08 11:33:07 UTC
1.19 does not fix the root cause. The root cause is user data.

Compare discussion in https://github.com/kubernetes/kubernetes/issues/88182 and https://github.com/kubernetes/kubernetes/pull/88600. The latter only reduces frequency.

Comment 22 W. Trevor King 2020-09-14 17:58:45 UTC
Reverting the KubeAPIErrorBudgetBurn alert has been spun off into bug 1878862.

Comment 23 W. Trevor King 2020-09-14 17:59:13 UTC
Oops, I meant "Reverting the KubeAPIErrorBudgetBurn ignore".

Comment 24 Stefan Schimanski 2020-09-15 11:36:22 UTC
This is blocked on origin 1.19 rebase.

Comment 26 Ke Wang 2020-09-23 02:00:36 UTC
Per comment #22, this bug no longer involves 'KubeAPIErrorBudgetBurn', so I will change the bug subject.

Comment 27 Ke Wang 2020-09-23 02:03:35 UTC
Verification:
1. 4.6 has already been rebased to kube 1.19-rc.2. Searching for the keyword 'SHOULD NOT HAPPEN' at https://search.ci.openshift.org/?search=SHOULD+NOT+HAPPEN&maxAge=336h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job gives results such as this one:
...
E0910 02:03:11.289096      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (apps/v1, Kind=Deployment) to smd typed: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]
E0910 02:03:12.365113      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (/v1, Kind=Pod) to smd typed: .spec.containers[name="httpd"].env: duplicate entries for key [name="A"]
...

The 'SHOULD NOT HAPPEN' error message above shows up about once per second rather than many times per second, so the PR https://github.com/kubernetes/kubernetes/pull/88600 works as expected.
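
To check the message rate in a larger build log, the klog-style lines can be bucketed by timestamp. A minimal Python sketch, assuming each line starts with the severity letter, `MMDD`, and `HH:MM:SS.micros` as in the excerpt above. The regular expression and the function name `per_second_counts` are illustrative assumptions, not an existing tool.

```python
import re
from collections import Counter

# Matches the prefix of lines such as:
#   E0910 02:03:11.289096      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] ...
# and captures the date (MMDD) and the time truncated to whole seconds.
KLOG_SHOULD_NOT_HAPPEN = re.compile(
    r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2})\.\d+\s+\d+\s+\S+\] \[SHOULD NOT HAPPEN\]"
)


def per_second_counts(lines):
    """Count '[SHOULD NOT HAPPEN]' log lines per (MMDD, HH:MM:SS) bucket."""
    counts = Counter()
    for line in lines:
        match = KLOG_SHOULD_NOT_HAPPEN.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# The two log lines quoted in the verification step above.
SAMPLE_LINES = [
    'E0910 02:03:11.289096      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (apps/v1, Kind=Deployment) to smd typed: .spec.template.spec.containers[name="httpd"].env: duplicate entries for key [name="A"]',
    'E0910 02:03:12.365113      17 fieldmanager.go:175] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (/v1, Kind=Pod) to smd typed: .spec.containers[name="httpd"].env: duplicate entries for key [name="A"]',
]

if __name__ == "__main__":
    for (date, second), count in sorted(per_second_counts(SAMPLE_LINES).items()):
        print(date, second, count)
```

A maximum bucket count close to one would match the reduced frequency described above; a log that is being spammed would show much larger counts per bucket.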

2. The latest build, 4.6.0-0.nightly-2020-09-20-184226, includes the merged https://github.com/openshift/kubernetes PR:
$ git log --date local --pretty="%h %an %cd - %s" 4336ff45 | grep '#335 '
0634471ce54 OpenShift Merge Robot Tue Sep 8 23:43:06 2020 - Merge pull request #335 from sttts/sttts-fix-non-unique-test-env-var-openshift

