Bug 1822289
| Summary: | [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: KubePodCrashLooping: console (OAuth 404s) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Dan Williams <dcbw> |
| Component: | apiserver-auth | Assignee: | Standa Laznicka <slaznick> |
| Status: | CLOSED CANTFIX | QA Contact: | pmali |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | alegrand, anpicker, aos-bugs, bbennett, bpeterse, btofel, erooth, jerzhang, jokerman, kakkoyun, lcosic, lmohanty, mankulka, mfojtik, mloibl, periklis, pkrupa, surbania, vareti, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-08-10 07:48:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
Dan Williams, 2020-04-08 16:48:25 UTC
There is already an open BZ for the "KubeletPlegDurationHigh" alert: https://bugzilla.redhat.com/show_bug.cgi?id=1821697. Maybe we need to track the console pod crashlooping in this BZ. There are many errors in the console operator log about route health being degraded. There is also a failed e2e test related to ingress in the same job, so they all might be related.

> E0408 15:23:44.328499 1 status.go:74] OAuthClientSyncProgressing FailedHost waiting on route host
> E0408 15:23:44.328654 1 controller.go:129] {Console Console} failed with: waiting on route host
> E0408 15:23:45.973749 1 status.go:74] OAuthClientSyncProgressing FailedHost waiting on route host
> E0408 15:23:45.974031 1 controller.go:129] {Console Console} failed with: waiting on route host
> E0408 15:23:51.651979 1 status.go:74] OAuthClientSyncProgressing FailedHost waiting on route host
> E0408 15:23:51.652121 1 controller.go:129] {Console Console} failed with: waiting on route host
> E0408 15:24:02.778443 1 status.go:74] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health): Get https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health: dial tcp: lookup console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com on 172.30.0.10:53: no such host
> E0408 15:24:02.778501 1 status.go:74] RouteAvailable FailedAdmittedIngress console route is not admitted
> E0408 15:24:02.778518 1 status.go:74] RouteSyncProgressing FailedHost route is not available at canonical host []
> E0408 15:24:02.797467 1 controller.go:199] console-route-sync--work-queue-key failed with : route is not available at canonical host []
> E0408 15:25:13.640359 1 status.go:74] OAuthClientSyncProgressing FailedHost waiting on route host
> E0408 15:25:13.640522 1 controller.go:129] {Console Console} failed with: waiting on route host
> I0408 15:25:15.501141 1 status_controller.go:172] clusteroperator/console diff {"status":{"conditions":[{"lastTransitionTime":"2020-04-08T15:25:15Z","message":"RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health): Get https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health: dial tcp: lookup console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com on 172.30.0.10:53: no such host","reason":"RouteHealth_FailedGet","status":"True","type":"Degraded"},{"lastTransitionTime":"2020-04-08T15:23:15Z","message":"RouteSyncProgressing: route is not available at canonical host []\nOAuthClientSyncProgressing: waiting on route host","reason":"OAuthClientSync_FailedHost::RouteSync_FailedHost","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-04-08T15:23:15Z","message":"RouteAvailable: console route is not admitted","reason":"Route_FailedAdmittedIngress","status":"False","type":"Available"},{"lastTransitionTime":"2020-04-08T15:23:09Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
> I0408 15:25:15.517358 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-console-operator", Name:"console-operator", UID:"3fc85b1b-1cd7-4e17-941c-9fea8efbfc7a", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/console changed: Degraded changed from False to True ("RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health): Get https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health: dial tcp: lookup console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com on 172.30.0.10:53: no such host")

As Venkata pointed out, KubePodCrashLooping is a generic error. I'm updating the title to reflect the focus on the console pod here, although console-65bd9dd477-jssmt was gone by the time the original job was gathering assets [1]. It's possible that it was crashlooping on something like "2020-04-08T15:39:41Z auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://kubernetes.default.svc/.well-known/oauth-authorization-server failed: 404 Not Found", which we see in [2]. Not sure if we have deeper insight into what was going on with console-65bd9dd477-jssmt.

[1]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/24833/pull-ci-openshift-origin-master-e2e-gcp/7122/artifacts/e2e-gcp/pods/
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/24833/pull-ci-openshift-origin-master-e2e-gcp/7122/artifacts/e2e-gcp/pods/openshift-console_console-7b87d7d756-85wj7_console_previous.log
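To make the two failure modes above easier to poke at, here is a minimal, hypothetical Go sketch that reproduces both probes by hand: the GET against the console route's /health endpoint that RouteHealthDegraded complains about, and the GET against the OAuth discovery endpoint that the console pod logs the 404 for. The route hostname is a placeholder, InsecureSkipVerify is only there to keep the sketch short, and this is not the console's or the operator's actual code.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// checkURL performs a plain GET and prints the status or the error, mirroring
// the two symptoms quoted above: a DNS/connect failure for the console route
// and a 404 from the OAuth discovery endpoint.
func checkURL(client *http.Client, target string) {
	resp, err := client.Get(target)
	if err != nil {
		fmt.Printf("%s: request failed: %v\n", target, err)
		return
	}
	defer resp.Body.Close()
	fmt.Printf("%s: %s\n", target, resp.Status)
}

func main() {
	// InsecureSkipVerify keeps the example self-contained; a real check would
	// trust the cluster's ingress and service CAs instead.
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	// Placeholder host: substitute the console route of the cluster under
	// investigation (oc get route console -n openshift-console).
	checkURL(client, "https://console-openshift-console.apps.example.com/health")

	// Only resolvable from inside the cluster; this is the discovery URL the
	// console pod reported "404 Not Found" from.
	checkURL(client, "https://kubernetes.default.svc/.well-known/oauth-authorization-server")
}
```

On a healthy cluster both requests should come back 200; in the failing runs quoted above, the first kind of GET failed DNS resolution and the second returned 404.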
Failing job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1599

Failing job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1782

[buildcop] seeing this consistently as of today, e.g.: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/1612

After reviewing the logs from the failed run in https://bugzilla.redhat.com/show_bug.cgi?id=1822289#c5, the Ingress and Console Operators are reporting expected status conditions. I see that [1] made changes to the prometheus test pkg shortly before this bz was created. Reassigning to the monitoring team for further investigation.

[1] https://github.com/openshift/origin/commit/3b8cb3ca9b57e17980aa321bb8402ac9c144a17e

Bumping this to 4.6.

Reproduced this issue with a crashlooping console pod. Indeed, the console-operator is reporting expected status conditions. That's because the crashlooping pod prevents the new deployment rollout, which means that the working console pods are still being served, even for the health check. This is how deployments are designed. Closing this issue since it's expected behaviour.
> Closing this issue since it's expected behaviour.

Continuing to function in the face of crashlooping is great, but the point of this bug is that the console pod should not be crashlooping during a healthy update. Do we understand why the console pod was crashlooping? Can we make changes to not crashloop under those conditions?
So we have a CI job that installs the previous minor version of an OpenShift cluster and updates it to the latest. That said, I think this issue with the crashlooping console pod is related to the Ingress issues we saw a couple of weeks back, since the output logs look similar. Do we have any reproduction steps?

I dunno about reproducer steps short of "do what CI does, and this happens occasionally". Spot-checking the jobs reported in this thread, the original 7122 had "404 Not Found" for OAuth, as reported in comment 2. Same for 1782 from comment 4. 1599 from comment 3 and 1612 from comment 5 have a number of crashlooping pods, but none of them are console pods. You can also use a CI search like [1] to find additional likely suspects and failure rates. Looks like 13% of release-openshift-origin-installer-e2e-gcp-4.5 jobs match, including [2], which is also the OAuth 404. So fixing whatever is behind that OAuth 404 will probably reduce the console crashlooping. Moving to auth to investigate the 404s.

[1]: https://search.apps.build01.ci.devcluster.openshift.com/?search=promQL%20query:%20count_over_time.*KubePodCrashLooping.*console
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2376

This seems to have failed 100% of 4.5 CI runs recently, examples:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2463
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2464
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2465
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2466
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2467

Bumping priority to urgent, could someone take a look?

(In reply to Yu Qi Zhang from comment #13)
> This seems to have failed 100% of 4.5 CI runs recently, examples:
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2463
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2464
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2465
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2466
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2467
>
> Bumping priority to urgent, could someone take a look?

Moving back to high. Your comment makes it seem like the issue reported in this bz is the only thing failing the jobs linked above. That is not the case - there are other failures mixed in.

Sorry, I was just using those as examples. Query: https://search.apps.build01.ci.devcluster.openshift.com/?search=Alerts+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=24h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Just in the last 24 hours there were a lot of failures due to this test across many tests and versions. There are many runs that only have that as a failure.
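For context on what the failing test is actually checking: it asks Prometheus for alerts that fired during the run other than Watchdog and AlertmanagerReceiversNotConfigured, which is why the CI search in comment 12 greps the test output for "promQL query: count_over_time.*KubePodCrashLooping.*console". Below is a rough, hypothetical Go sketch of that kind of query against the Prometheus HTTP API; the exact PromQL, time window, and authentication used by the origin test suite may differ, and PROM_URL is a placeholder (for example a local `oc port-forward` to the prometheus-k8s pod).

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
	"os"
)

func main() {
	// Illustrative only: point PROM_URL at a reachable Prometheus endpoint,
	// e.g. http://localhost:9090 after a port-forward.
	promURL := os.Getenv("PROM_URL")

	// Roughly what the [sig-instrumentation][Late] test asserts is empty:
	// any alert that fired during the run other than Watchdog and
	// AlertmanagerReceiversNotConfigured. KubePodCrashLooping for the
	// console container is what keeps showing up in these runs.
	query := `count_over_time(ALERTS{alertstate="firing",alertname!~"Watchdog|AlertmanagerReceiversNotConfigured"}[2h]) >= 1`

	resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		fmt.Println("query failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("reading response failed:", err)
		os.Exit(1)
	}
	// Any series in the result is a firing alert the test would flag.
	fmt.Println(string(body))
}
```

Any series returned by a query of that shape, such as KubePodCrashLooping for the console container, is what turns the [sig-instrumentation][Late] test red.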
(In reply to Yu Qi Zhang from comment #15)
> Sorry, I was just using those as examples. Query:
> https://search.apps.build01.ci.devcluster.openshift.com/?search=Alerts+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=24h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
>
> Just in the last 24 hours there were a lot of failures due to this test across many tests and versions. There are many runs that only have that as a failure.

Can you describe the methodology you are using to determine that 'many runs have only that as a failure'? It's not at all obvious to me which results of your query exhibit that characteristic.

Sure, I've just been monitoring tests as build-cop today. One thing we were trying to get unblocked was machine-os-content promotion, which is now failing on that after we fixed the previous issue: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.6/1271120771367833600

Other recent runs that just had that failure:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5/1271149384825835520
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/615/pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi/267
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1271128314525782016
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/375/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/1271120089801822208
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3744/pull-ci-openshift-installer-master-e2e-aws-fips/1271114903146467328
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/615/pull-ci-openshift-machine-api-operator-master-e2e-gcp/1271119113317519360
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1271113473794772992

I'm closing this bug because the frequency of this flake occurring in our CI system is 0% in the past 7 days, as determined by a CI search for `KubePodCrashLooping: console`: https://search.apps.build01.ci.devcluster.openshift.com/?search=KubePodCrashLooping%3A+console&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

I don't think it ever shows up just separated by a colon. My query from comment 12 turns up a few 4.4 failures, but it seems like the master/4.6/4.5 log output may have changed. Adjusting the query to [1] turns up [2] in 4.5.0-0.ci-2020-06-20-222827 GCP a few days ago. The test job still passed, and I'm not sure why, but I don't think we can close this based on a lack of occurrences.

[1]: https://search.apps.build01.ci.devcluster.openshift.com/?search=KubePodCrashLooping.*%22container.%22%3A.*%22console.%22&maxAge=168h&context=1&type=junit&groupBy=job&name=release-openshift-
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1274470610822500352
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

This BZ is a mess; people throw searches and CI runs around, and even issues that are completely disjoint from what this BZ was originally for - see private comment 21, for example. The issue behind this BZ, as identified by Siva in comment 2, got fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1845446, and https://search.ci.openshift.org/?search=KubePodCrashLooping.*%22container.%22%3A.*%22console.%22&maxAge=168h&context=1&type=junit&groupBy=job&name=release-openshift shows a 6% match in 4.5 in the last week. Since the 404 on the KAS is usually caused by node issues (a new KAS revision with the endpoint available fails to roll out), that is a good enough percentage for me to close this BZ, because I am not going to pass the mess on to another team. If you think it is still an issue, open a new BZ and present a single search that shows it's being hit very often, but please don't reopen this.