Bug 1832511

Summary: Failed upgrade from 4.5.0-0.nightly-2020-05-06-003431 to 4.5.0-0.nightly-2020-05-06-130506: KubePodCrashLooping kube-apiserver
Product: OpenShift Container Platform
Reporter: Eric Stroczynski <estroczy>
Component: kube-apiserver
Assignee: Stefan Schimanski <sttts>
Status: CLOSED DUPLICATE
QA Contact: Xingxing Xia <xxia>
Severity: unspecified
Priority: unspecified
Version: 4.5
CC: alegrand, anpicker, aos-bugs, erooth, kakkoyun, lcosic, mfojtik, mloibl, pkrupa, surbania, wking
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Type: Bug
Last Closed: 2020-05-19 09:39:40 UTC

Description Eric Stroczynski 2020-05-06 18:25:07 UTC
Description of problem:
Release upgrade from 4.5.0-0.nightly-2020-05-06-003431 to 4.5.0-0.nightly-2020-05-06-130506 failed: the KubePodCrashLooping alert fired for the kube-apiserver container during the upgrade.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-06-130506

How reproducible:
-

Steps to Reproduce:
1. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/28108

Actual results:

fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh|KubeAPIErrorBudgetBurn\",alertstate=\"firing\",severity=\"critical\"}[1m]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh|KubeAPIErrorBudgetBurn\",alertstate=\"firing\",severity=\"critical\"}[1m]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"kube-apiserver\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.8:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-apiserver\",\"pod\":\"kube-apiserver-ip-10-0-143-249.us-west-2.compute.internal\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1588773918.056,\"2\"]}]",
        },
    }
to be empty
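The query in the failure above is the e2e suite's guard against any critical alert firing during the run (Watchdog and a few known-noisy alerts are excluded via `alertname!~`). The reported sample embedded in the error message is plain Prometheus result JSON; the following is a minimal, illustrative Python sketch (the original test is Go) that decodes it and pulls out the interesting labels:

```python
import json

# Sample payload copied from the failure message above. The structure is a
# list of samples, each with a "metric" label set and a "value" pair of
# [timestamp, count]; whitespace between JSON tokens is insignificant.
payload = '''[{"metric":{"alertname":"KubePodCrashLooping","alertstate":"firing",
"container":"kube-apiserver","endpoint":"https-main","instance":"10.131.0.8:8443",
"job":"kube-state-metrics","namespace":"openshift-kube-apiserver",
"pod":"kube-apiserver-ip-10-0-143-249.us-west-2.compute.internal",
"service":"kube-state-metrics","severity":"critical"},
"value":[1588773918.056,"2"]}]'''

for sample in json.loads(payload):
    m = sample["metric"]
    # count_over_time(...[1m]) counts how many samples in the trailing
    # minute saw the alert firing; ">= 1" means "fired at all".
    print(m["alertname"], m["namespace"], m["pod"], "samples:", sample["value"][1])
```

The value "2" therefore means the alert was observed firing in two samples within the one-minute window, which is enough to trip the `>= 1` check and fail the run.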


Expected results:
Pass

Additional info:
Release job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/28108

Version is mentioned in: https://bugzilla.redhat.com/show_bug.cgi?id=1832180

Comment 1 Eric Stroczynski 2020-05-06 18:27:35 UTC
Similar issue when upgrading from 4.5.0-0.nightly-2020-05-06-003431 to 4.5.0-0.nightly-2020-05-06-112104: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/28104

Comment 3 W. Trevor King 2020-05-16 16:10:05 UTC
This test catches any critical alerts that flare up during the test run, so I'm adjusting the title to make it clear that this ticket is about kube-apiserver crashlooping.  If this test fails on other alerts (and it does [1]) and folks want to track in Bugzilla, they should file separate bugs.

[1]: https://search.svc.ci.openshift.org/?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&groupBy=job

$ curl -s 'https://search.svc.ci.openshift.org/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=24h&context=0&type=junit&name=upgrade' \
    | jq -r '. | to_entries[].value | to_entries[].value[].context[]' \
    | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' \
    | sed 's|\\||g' \
    | jq -r '.[].metric.alertname' \
    | sort | uniq -c | sort -n | tail
      1 KubeAPIErrorBudgetBurn
      1 TargetDown
      2 etcdMembersDown
      4 ImagePruningDisabled
      7 KubePodCrashLooping
$ curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=24h&context=0&type=junit&name=upgrade' \
    | jq -r '. | to_entries[].value | to_entries[].value[].context[]' \
    | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' \
    | sed 's|\\||g' \
    | jq -r '.[].metric | select(.alertname == "KubePodCrashLooping").container' \
    | sort | uniq -c | sort -n
      1 cluster-policy-controller
      1 kube-controller-manager
      1 kube-scheduler
      4 kube-apiserver
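The two pipelines above scrape the CI search API and bucket recent failures by alert name and by container. For readers without `jq`, here is a rough Python equivalent of the extraction and counting steps, run against a tiny hypothetical response in the nested job → build → context-lines shape those jq filters assume:

```python
import json
import re
from collections import Counter

# Hypothetical search response in the shape the jq filters above assume.
# Matching context lines embed the escaped PromQL result after the literal
# text "had reported incorrect results:\n".
search_response = {
    "release-openshift-origin-installer-e2e-aws-upgrade": {
        "28108": [{"context": [
            'fail ... had reported incorrect results:\\n'
            '[{\\"metric\\":{\\"alertname\\":\\"KubePodCrashLooping\\",'
            '\\"container\\":\\"kube-apiserver\\"}}]",'
        ]}],
    },
}

counts = Counter()
for builds in search_response.values():
    for entries in builds.values():
        for entry in entries:
            for line in entry["context"]:
                # Mirrors the sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' step.
                m = re.search(r'incorrect results:\\n(.*)",$', line)
                if not m:
                    continue
                # Mirrors sed 's|\\||g': strip the escaping backslashes.
                samples = json.loads(m.group(1).replace("\\", ""))
                counts.update(s["metric"]["alertname"] for s in samples)

print(counts)  # Counter({'KubePodCrashLooping': 1})
```

This is only a sketch of the aggregation logic; the real search API response carries many more jobs, builds, and context lines per match.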

Comment 4 Stefan Schimanski 2020-05-19 09:39:40 UTC

*** This bug has been marked as a duplicate of bug 1828606 ***