Bug 1832511

Summary: Failed upgrade from 4.5.0-0.nightly-2020-05-06-003431 to 4.5.0-0.nightly-2020-05-06-130506: KubePodCrashLooping kube-apiserver
Product: OpenShift Container Platform
Reporter: Eric Stroczynski <estroczy>
Component: kube-apiserver
Assignee: Stefan Schimanski <sttts>
Status: CLOSED DUPLICATE
QA Contact: Xingxing Xia <xxia>
Severity: unspecified
Priority: unspecified
Version: 4.5
CC: alegrand, anpicker, aos-bugs, erooth, kakkoyun, lcosic, mfojtik, mloibl, pkrupa, surbania, wking
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Type: Bug
Last Closed: 2020-05-19 09:39:40 UTC

Description Eric Stroczynski 2020-05-06 18:25:07 UTC
Description of problem:
Release upgrade from 4.5.0-0.nightly-2020-05-06-003431 to 4.5.0-0.nightly-2020-05-06-130506 failed: the KubePodCrashLooping alert fired for the kube-apiserver container during the upgrade.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-06-130506

How reproducible:
-

Steps to Reproduce:
1. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/28108

Actual results:

fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh|KubeAPIErrorBudgetBurn\",alertstate=\"firing\",severity=\"critical\"}[1m]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh|KubeAPIErrorBudgetBurn\",alertstate=\"firing\",severity=\"critical\"}[1m]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"KubePodCrashLooping\",\"alertstate\":\"firing\",\"container\":\"kube-apiserver\",\"endpoint\":\"https-main\",\"instance\":\"10.131.0.8:8443\",\"job\":\"kube-state-metrics\",\"namespace\":\"openshift-kube-apiserver\",\"pod\":\"kube-apiserver-ip-10-0-143-249.us-west-2.compute.internal\",\"service\":\"kube-state-metrics\",\"severity\":\"critical\"},\"value\":[1588773918.056,\"2\"]}]",
        },
    }
to be empty
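The query in the failure above is the e2e suite's guard against any critical alert firing during the run (Watchdog and a few known-noisy alerts are excluded via `alertname!~`). The reported sample embedded in the error message is plain Prometheus result JSON; the following is a minimal, illustrative Python sketch (the original test is Go) that decodes it and pulls out the interesting labels:

```python
import json

# Sample payload copied from the failure message above. The structure is a
# list of samples, each with a "metric" label set and a "value" pair of
# [timestamp, count]; whitespace between JSON tokens is insignificant.
payload = '''[{"metric":{"alertname":"KubePodCrashLooping","alertstate":"firing",
"container":"kube-apiserver","endpoint":"https-main","instance":"10.131.0.8:8443",
"job":"kube-state-metrics","namespace":"openshift-kube-apiserver",
"pod":"kube-apiserver-ip-10-0-143-249.us-west-2.compute.internal",
"service":"kube-state-metrics","severity":"critical"},
"value":[1588773918.056,"2"]}]'''

for sample in json.loads(payload):
    m = sample["metric"]
    # count_over_time(...[1m]) counts how many samples in the trailing
    # minute saw the alert firing; ">= 1" means "fired at all".
    print(m["alertname"], m["namespace"], m["pod"], "samples:", sample["value"][1])
```

The value "2" therefore means the alert was observed firing in two samples within the one-minute window, which is enough to trip the `>= 1` check and fail the run.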


Expected results:
Pass

Additional info:
Release job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/28108

Version is mentioned in: https://bugzilla.redhat.com/show_bug.cgi?id=1832180

Comment 1 Eric Stroczynski 2020-05-06 18:27:35 UTC
Similar issue when upgrading from 4.5.0-0.nightly-2020-05-06-003431 to 4.5.0-0.nightly-2020-05-06-112104: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/28104

Comment 3 W. Trevor King 2020-05-16 16:10:05 UTC
This test catches any critical alerts that flare up during the test run, so I'm adjusting the title to make it clear that this ticket is about kube-apiserver crashlooping.  If this test fails on other alerts (and it does [1]) and folks want to track in Bugzilla, they should file separate bugs.

[1]: https://search.svc.ci.openshift.org/?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&groupBy=job

$ curl -s 'https://search.svc.ci.openshift.org/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=24h&context=0&type=junit&name=upgrade' \
    | jq -r '. | to_entries[].value | to_entries[].value[].context[]' \
    | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' \
    | sed 's|\\||g' \
    | jq -r '.[].metric.alertname' \
    | sort | uniq -c | sort -n | tail
      1 KubeAPIErrorBudgetBurn
      1 TargetDown
      2 etcdMembersDown
      4 ImagePruningDisabled
      7 KubePodCrashLooping
$ curl -s 'https://search.apps.build01.ci.devcluster.openshift.com/search?search=promQL+query%3A+count_over_time.*ALERTS.*had+reported+incorrect+results&maxAge=24h&context=0&type=junit&name=upgrade' \
    | jq -r '. | to_entries[].value | to_entries[].value[].context[]' \
    | sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' \
    | sed 's|\\||g' \
    | jq -r '.[].metric | select(.alertname == "KubePodCrashLooping").container' \
    | sort | uniq -c | sort -n
      1 cluster-policy-controller
      1 kube-controller-manager
      1 kube-scheduler
      4 kube-apiserver
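The two pipelines above scrape the CI search API and bucket recent failures by alert name and by container. For readers without `jq`, here is a rough Python equivalent of the extraction and counting steps, run against a tiny hypothetical response in the nested job → build → context-lines shape those jq filters assume:

```python
import json
import re
from collections import Counter

# Hypothetical search response in the shape the jq filters above assume.
# Matching context lines embed the escaped PromQL result after the literal
# text "had reported incorrect results:\n".
search_response = {
    "release-openshift-origin-installer-e2e-aws-upgrade": {
        "28108": [{"context": [
            'fail ... had reported incorrect results:\\n'
            '[{\\"metric\\":{\\"alertname\\":\\"KubePodCrashLooping\\",'
            '\\"container\\":\\"kube-apiserver\\"}}]",'
        ]}],
    },
}

counts = Counter()
for builds in search_response.values():
    for entries in builds.values():
        for entry in entries:
            for line in entry["context"]:
                # Mirrors the sed -n 's/.*incorrect results:\\n\(.*\)",$/\1/p' step.
                m = re.search(r'incorrect results:\\n(.*)",$', line)
                if not m:
                    continue
                # Mirrors sed 's|\\||g': strip the escaping backslashes.
                samples = json.loads(m.group(1).replace("\\", ""))
                counts.update(s["metric"]["alertname"] for s in samples)

print(counts)  # Counter({'KubePodCrashLooping': 1})
```

This is only a sketch of the aggregation logic; the real search API response carries many more jobs, builds, and context lines per match.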

Comment 4 Stefan Schimanski 2020-05-19 09:39:40 UTC

*** This bug has been marked as a duplicate of bug 1828606 ***