Bug 1822289
| Summary: | [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: KubePodCrashLooping: console (OAuth 404s) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Dan Williams <dcbw> |
| Component: | apiserver-auth | Assignee: | Standa Laznicka <slaznick> |
| Status: | CLOSED CANTFIX | QA Contact: | pmali |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | alegrand, anpicker, aos-bugs, bbennett, bpeterse, btofel, erooth, jerzhang, jokerman, kakkoyun, lcosic, lmohanty, mankulka, mfojtik, mloibl, periklis, pkrupa, surbania, vareti, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-08-10 07:48:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
Dan Williams, 2020-04-08 16:48:25 UTC
There is already an open BZ for the "KubeletPlegDurationHigh" alert: https://bugzilla.redhat.com/show_bug.cgi?id=1821697. Maybe we need to track the console pod crashlooping in this BZ. There are many errors in the console operator log about route health being degraded. There is also a failed e2e test related to ingress in the same job, so they all might be related.

> E0408 15:23:44.328499 1 status.go:74] OAuthClientSyncProgressing FailedHost waiting on route host
> E0408 15:23:44.328654 1 controller.go:129] {Console Console} failed with: waiting on route host
> E0408 15:23:45.973749 1 status.go:74] OAuthClientSyncProgressing FailedHost waiting on route host
> E0408 15:23:45.974031 1 controller.go:129] {Console Console} failed with: waiting on route host
> E0408 15:23:51.651979 1 status.go:74] OAuthClientSyncProgressing FailedHost waiting on route host
> E0408 15:23:51.652121 1 controller.go:129] {Console Console} failed with: waiting on route host
> E0408 15:24:02.778443 1 status.go:74] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health): Get https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health: dial tcp: lookup console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com on 172.30.0.10:53: no such host
> E0408 15:24:02.778501 1 status.go:74] RouteAvailable FailedAdmittedIngress console route is not admitted
> E0408 15:24:02.778518 1 status.go:74] RouteSyncProgressing FailedHost route is not available at canonical host []
> E0408 15:24:02.797467 1 controller.go:199] console-route-sync--work-queue-key failed with : route is not available at canonical host []
> E0408 15:25:13.640359 1 status.go:74] OAuthClientSyncProgressing FailedHost waiting on route host
> E0408 15:25:13.640522 1 controller.go:129] {Console Console} failed with: waiting on route host
> I0408 15:25:15.501141 1 status_controller.go:172] clusteroperator/console diff {"status":{"conditions":[{"lastTransitionTime":"2020-04-08T15:25:15Z","message":"RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health): Get https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health: dial tcp: lookup console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com on 172.30.0.10:53: no such host","reason":"RouteHealth_FailedGet","status":"True","type":"Degraded"},{"lastTransitionTime":"2020-04-08T15:23:15Z","message":"RouteSyncProgressing: route is not available at canonical host []\nOAuthClientSyncProgressing: waiting on route host","reason":"OAuthClientSync_FailedHost::RouteSync_FailedHost","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-04-08T15:23:15Z","message":"RouteAvailable: console route is not admitted","reason":"Route_FailedAdmittedIngress","status":"False","type":"Available"},{"lastTransitionTime":"2020-04-08T15:23:09Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
> I0408 15:25:15.517358 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-console-operator", Name:"console-operator", UID:"3fc85b1b-1cd7-4e17-941c-9fea8efbfc7a", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/console changed: Degraded changed from False to True ("RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health): Get https://console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com/health: dial tcp: lookup console-openshift-console.apps.ci-op-jpx6yl0w-2a78c.origin-ci-int-gce.dev.openshift.com on 172.30.0.10:53: no such host")

As Venkata pointed out, KubePodCrashLooping is a generic error. I'm updating the title to reflect the focus on the console pod here, although console-65bd9dd477-jssmt was gone by the time the original job was gathering assets [1]. It's possible that it was crashlooping on something like "2020-04-08T15:39:41Z auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://kubernetes.default.svc/.well-known/oauth-authorization-server failed: 404 Not Found", which we see in [2]. Not sure if we have deeper insight into what was going on with console-65bd9dd477-jssmt.

[1]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/24833/pull-ci-openshift-origin-master-e2e-gcp/7122/artifacts/e2e-gcp/pods/
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/24833/pull-ci-openshift-origin-master-e2e-gcp/7122/artifacts/e2e-gcp/pods/openshift-console_console-7b87d7d756-85wj7_console_previous.log
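To make the two failure modes above easier to poke at, here is a minimal, hypothetical Go sketch that reproduces both probes by hand: the GET against the console route's /health endpoint that RouteHealthDegraded complains about, and the GET against the OAuth discovery endpoint that the console pod logs the 404 for. The route hostname is a placeholder, InsecureSkipVerify is only there to keep the sketch short, and this is not the console's or the operator's actual code.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// checkURL performs a plain GET and prints the status or the error, mirroring
// the two symptoms quoted above: a DNS/connect failure for the console route
// and a 404 from the OAuth discovery endpoint.
func checkURL(client *http.Client, target string) {
	resp, err := client.Get(target)
	if err != nil {
		fmt.Printf("%s: request failed: %v\n", target, err)
		return
	}
	defer resp.Body.Close()
	fmt.Printf("%s: %s\n", target, resp.Status)
}

func main() {
	// InsecureSkipVerify keeps the example self-contained; a real check would
	// trust the cluster's ingress and service CAs instead.
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	// Placeholder host: substitute the console route of the cluster under
	// investigation (oc get route console -n openshift-console).
	checkURL(client, "https://console-openshift-console.apps.example.com/health")

	// Only resolvable from inside the cluster; this is the discovery URL the
	// console pod reported "404 Not Found" from.
	checkURL(client, "https://kubernetes.default.svc/.well-known/oauth-authorization-server")
}
```

On a healthy cluster both requests should come back 200; in the failing runs quoted above, the first kind of GET failed DNS resolution and the second returned 404.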
Failing job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1599

Failing job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1782

[buildcop] seeing this consistently as of today, e.g.: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/1612

After reviewing the logs from the failed run in https://bugzilla.redhat.com/show_bug.cgi?id=1822289#c5, the Ingress and Console Operators are reporting expected status conditions. I see that [1] made changes to the prometheus test pkg shortly before this bz was created. Reassigning to the monitoring team for further investigation.

[1] https://github.com/openshift/origin/commit/3b8cb3ca9b57e17980aa321bb8402ac9c144a17e

Bumping this to 4.6.

Reproduced this issue with a crashlooping console pod. Indeed, the console-operator is reporting expected status conditions. That's because the crashlooping pod prevents the new deployment rollout, which means that the working console pods are still being served, even for the health check. This is how deployments are designed. Closing this issue since it's expected behaviour.
> Closing this issue since it's expected behaviour.

Continuing to function in the face of crashlooping is great, but the point of this bug is that the console pod should not be crashlooping during a healthy update. Do we understand why the console pod was crashlooping? Can we make changes to not crashloop under those conditions?
So we have a CI job that installs the previous minor version of an OpenShift cluster and updates it to the latest. That said, I think this issue with the crashlooping console pod is related to the Ingress issues we saw a couple of weeks back, since the output logs look similar. Do we have any reproduction steps?

I dunno about reproducer steps short of "do what CI does, and this happens occasionally". Spot-checking the jobs reported in this thread, the original 7122 had "404 Not Found" for OAuth, as reported in comment 2. Same for 1782 from comment 4. 1599 from comment 3 and 1612 from comment 5 have a number of crashlooping pods, but none of them are console pods. You can also use a CI search like [1] to find additional likely suspects and failure rates. Looks like 13% of release-openshift-origin-installer-e2e-gcp-4.5 jobs match, including [2], which is also the OAuth 404. So fixing whatever is behind that OAuth 404 will probably reduce the console crashlooping. Moving to auth to investigate the 404s.

[1]: https://search.apps.build01.ci.devcluster.openshift.com/?search=promQL%20query:%20count_over_time.*KubePodCrashLooping.*console
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2376

This seems to have failed 100% of 4.5 CI runs recently, examples:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2463
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2464
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2465
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2466
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2467

Bumping priority to urgent, could someone take a look?

(In reply to Yu Qi Zhang from comment #13)
> This seems to have failed 100% of 4.5 CI runs recently, examples:
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2463
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2464
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2465
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2466
> https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2467
>
> Bumping priority to urgent, could someone take a look?

Moving back to high. Your comment makes it seem like the issue reported in this bz is the only thing failing the jobs linked above. That is not the case - there are other failures mixed in.

Sorry, I was just using those as examples. Query: https://search.apps.build01.ci.devcluster.openshift.com/?search=Alerts+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=24h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Just in the last 24 hours there were a lot of failures due to this test across many tests and versions. There are many runs that only have that as a failure.
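For context on what the failing test is actually checking: it asks Prometheus for alerts that fired during the run other than Watchdog and AlertmanagerReceiversNotConfigured, which is why the CI search in comment 12 greps the test output for "promQL query: count_over_time.*KubePodCrashLooping.*console". Below is a rough, hypothetical Go sketch of that kind of query against the Prometheus HTTP API; the exact PromQL, time window, and authentication used by the origin test suite may differ, and PROM_URL is a placeholder (for example a local `oc port-forward` to the prometheus-k8s pod).

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
	"os"
)

func main() {
	// Illustrative only: point PROM_URL at a reachable Prometheus endpoint,
	// e.g. http://localhost:9090 after a port-forward.
	promURL := os.Getenv("PROM_URL")

	// Roughly what the [sig-instrumentation][Late] test asserts is empty:
	// any alert that fired during the run other than Watchdog and
	// AlertmanagerReceiversNotConfigured. KubePodCrashLooping for the
	// console container is what keeps showing up in these runs.
	query := `count_over_time(ALERTS{alertstate="firing",alertname!~"Watchdog|AlertmanagerReceiversNotConfigured"}[2h]) >= 1`

	resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		fmt.Println("query failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("reading response failed:", err)
		os.Exit(1)
	}
	// Any series in the result is a firing alert the test would flag.
	fmt.Println(string(body))
}
```

Any series returned by a query of that shape, such as KubePodCrashLooping for the console container, is what turns the [sig-instrumentation][Late] test red.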
(In reply to Yu Qi Zhang from comment #15)
> Sorry, I was just using those as examples. Query:
> https://search.apps.build01.ci.devcluster.openshift.com/?search=Alerts+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=24h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
>
> Just in the last 24 hours there were a lot of failures due to this test across many tests and versions. There are many runs that only have that as a failure.

Can you describe the methodology you are using to determine that 'many runs have only that as a failure'? It's not at all obvious to me which results of your query exhibit that characteristic.

Sure, I've just been monitoring tests as build-cop today. One thing we were trying to get unblocked was machine-os-content promotion, which is now failing on that after we fixed the previous issue: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.6/1271120771367833600

Other recent runs that just had that failure:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5/1271149384825835520
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/615/pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi/267
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1271128314525782016
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/375/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/1271120089801822208
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3744/pull-ci-openshift-installer-master-e2e-aws-fips/1271114903146467328
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/615/pull-ci-openshift-machine-api-operator-master-e2e-gcp/1271119113317519360
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1271113473794772992

I'm closing this bug because the frequency of this flake occurring in our CI system is 0% in the past 7 days, as determined by a CI search for `KubePodCrashLooping: console`: https://search.apps.build01.ci.devcluster.openshift.com/?search=KubePodCrashLooping%3A+console&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

I don't think it ever shows up just separated by a colon. My query from comment 12 turns up a few 4.4 failures, but it seems like the master/4.6/4.5 log output may have changed. Adjusting the query to [1] turns up [2] in 4.5.0-0.ci-2020-06-20-222827 GCP a few days ago. The test job still passed, and I'm not sure why, but I don't think we can close this based on a lack of occurrences.

[1]: https://search.apps.build01.ci.devcluster.openshift.com/?search=KubePodCrashLooping.*%22container.%22%3A.*%22console.%22&maxAge=168h&context=1&type=junit&groupBy=job&name=release-openshift-
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1274470610822500352
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

This BZ is a mess; people throw searches and CI runs around, and even issues that are completely disjoint from what this BZ was originally for - see private comment 21, for example. The issue behind this BZ, as identified by Siva in comment 2, got fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1845446, and https://search.ci.openshift.org/?search=KubePodCrashLooping.*%22container.%22%3A.*%22console.%22&maxAge=168h&context=1&type=junit&groupBy=job&name=release-openshift shows a 6% match in 4.5 in the last week. Since the 404 on the KAS is usually caused by node issues (a new KAS revision with the endpoint available fails to roll out), that is a good enough percentage for me to close this BZ, because I am not going to pass the mess on to another team. If you think it is still an issue, open a new BZ and present a single search that shows it's being hit very often, but please don't reopen this.