Description of problem:
When running the openshift/conformance/parallel tests on a single-node cluster running on AWS, this test:

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

fails because the unexpected "KubePodCrashLooping" alert is firing (among other unexpected alerts).

The alert is fired for a console pod, as shown by this example alert JSON:

{"alertname":"KubePodCrashLooping","alertstate":"firing","container":"console","endpoint":"https-main","job":"kube-state-metrics","namespace":"openshift-console","pod":"console-77c96c8d6b-ttwpq","service":"kube-state-metrics","severity":"warning"}

Version-Release number of selected component (if applicable):
4.8.0

How reproducible:
I saw it happen multiple times; here are some examples:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25936/pull-ci-openshift-origin-master-e2e-aws-single-node/1371419097056677888
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25936/pull-ci-openshift-origin-master-e2e-aws-single-node/1371180320862244864
This did not happen before; it started recently and now seems to happen consistently.

Steps to Reproduce:
1. Launch a single-node cluster on an m5d.2xlarge AWS instance
2. Run the openshift/conformance/parallel tests
3. Observe whether the alert fires

Actual results:
The KubePodCrashLooping alert fires.

Expected results:
The alert should not fire.

Additional info:
None
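For anyone triaging this, a minimal sketch of how to see what the test sees on a live cluster. The thanos-querier route, the bearer-token query, and the pod name (taken from the example alert above) are assumptions about a standard 4.x monitoring stack, not commands taken from the CI artifacts:

```
# Restart counts for the console pods named in the alert
oc get pods -n openshift-console

# Previous (crashed) logs of the console container; pod name is the one from the example alert
oc logs -n openshift-console console-77c96c8d6b-ttwpq -c console --previous

# Ask the in-cluster Thanos querier for alerts firing besides Watchdog/AlertmanagerReceiversNotConfigured,
# which is roughly what the [sig-instrumentation][Late] test asserts on
TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
  --data-urlencode 'query=ALERTS{alertstate="firing",alertname!~"Watchdog|AlertmanagerReceiversNotConfigured"}'
```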
The console pod's status message is:
```
t-aws.dev.rhcloud.com: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
E0315 11:58:13.536731 1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
E0315 11:58:23.539159 1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
E0315 11:58:33.541710 1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
E0315 11:58:43.545960 1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
I0315 11:58:53.558215 1 main.go:670] Binding to [::]:8443...
I0315 11:58:53.558245 1 main.go:672] using TLS
```
which indicates that the console is unable to contact the OAuth server.

After checking the console-operator pod logs, I see that ConsoleRouteController is failing to sync the console route, with RouteHealthDegraded set to true.

After checking the openshift-oauth-apiserver logs, I see they are flooded with the following errors:
```
2021-03-15T12:09:19.344928047Z I0315 12:09:19.344887 1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.168.253:2379 <nil> 0 <nil>}] <nil> <nil>}
2021-03-15T12:09:19.344928047Z I0315 12:09:19.344898 1 clientconn.go:948] ClientConn switching balancer to "pick_first"
2021-03-15T12:09:19.345090982Z I0315 12:09:19.345064 1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000ef21a0, {CONNECTING <nil>}
2021-03-15T12:09:19.358775364Z I0315 12:09:19.357807 1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000ef21a0, {READY <nil>}
2021-03-15T12:09:19.361115821Z I0315 12:09:19.359803 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2021-03-15T12:09:22.455476705Z E0315 12:09:22.455430 1 fieldmanager.go:186] [SHOULD NOT HAPPEN] failed to update managedFields for authentication.k8s.io/v1, Kind=TokenReview: failed to convert new object (authentication.k8s.io/v1, Kind=TokenReview) to smd typed: no corresponding type for authentication.k8s.io/v1, Kind=TokenReview
2021-03-15T12:09:30.630849297Z E0315 12:09:30.630751 1 fieldmanager.go:186] [SHOULD NOT HAPPEN] failed to update managedFields for authentication.k8s.io/v1, Kind=TokenReview: failed to convert new object (authentication.k8s.io/v1, Kind=TokenReview) to smd typed: no corresponding type for authentication.k8s.io/v1, Kind=TokenReview
```
On the other hand, `openshift-authentication-operator` is also logging errors, with `OAuthRouteCheckEndpointAccessibleController` reporting Degraded:
```
2021-03-15T11:48:42.019362533Z I0315 11:48:42.019310 1 status_controller.go:213] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-03-15T11:45:37Z","message":"OAuthRouteCheckEndpointAccessibleControllerDegraded: Get \"https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/healthz\": dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host\nWellKnownReadyControllerDegraded: kube-apiserver oauth endpoint https://10.0.168.253:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)","reason":"OAuthRouteCheckEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-03-15T11:43:52Z","message":"OAuthVersionRouteProgressing: Request to \"https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/healthz\" not successful yet","reason":"OAuthVersionRoute_WaitingForRoute","status":"True","type":"Progressing"},{"lastTransitionTime":"2021-03-15T11:35:44Z","message":"OAuthVersionRouteAvailable: HTTP request to \"https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/healthz\" failed: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host\nOAuthRouteCheckEndpointAccessibleControllerAvailable: Get \"https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/healthz\": dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host","reason":"OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed","status":"False","type":"Available"},{"lastTransitionTime":"2021-03-15T11:35:45Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
```
Apparently this is not a console-related issue, since console functionality depends on various other components. It looks like these components are having routing issues.
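Since every symptom above is the same "no such host" answer from the cluster resolver (172.30.0.10:53) for the *.apps route hostname, a rough triage sketch would be to start from the route and the ingress/DNS side rather than from the console. These are suggested commands, not ones taken from this run's artifacts:

```
# Is the OAuth route present and admitted by the default router?
oc get route oauth-openshift -n openshift-authentication -o wide

# Operators that own the *.apps wildcard DNS record and the in-cluster resolver
oc get clusteroperator ingress dns
oc get pods -n openshift-ingress
oc get pods -n openshift-dns

# Wildcard DNSRecord the ingress operator publishes for the default IngressController
oc get dnsrecords -n openshift-ingress-operator -o yaml
```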
*** This bug has been marked as a duplicate of bug 1943578 ***