Bug 1939070 - Single node E2E tests on AWS have KubePodCrashLooping alert caused by console pod
Summary: Single node E2E tests on AWS have KubePodCrashLooping alert caused by console...
Keywords:
Status: CLOSED DUPLICATE of bug 1943578
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Stephen Greene
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-15 14:44 UTC by Omer Tuchfeld
Modified: 2022-08-04 22:39 UTC
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-26 19:59:37 UTC
Target Upstream Version:
Embargoed:



Description Omer Tuchfeld 2021-03-15 14:44:49 UTC
Description of problem:
When running the openshift/conformance/parallel tests on a single node cluster running on AWS, this test:

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] 

fails because the unexpected "KubePodCrashLooping" alert is firing (among other unexpected alerts). 

The alert fires because of a console pod, as shown by this example alert JSON:

{"alertname":"KubePodCrashLooping","alertstate":"firing","container":"console","endpoint":"https-main","job":"kube-state-metrics","namespace":"openshift-console","pod":"console-77c96c8d6b-ttwpq","service":"kube-state-metrics","severity":"warning"}
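For reference, the check the [sig-instrumentation] test performs boils down to an allow-list over firing alerts. A minimal sketch of that logic (not the actual origin test code; `unexpected_firing` and `ALLOWED_ALERTS` are illustrative names), applied to the alert JSON above:

```python
import json

# Alerts the e2e test tolerates while firing, per the test name above.
ALLOWED_ALERTS = {"Watchdog", "AlertmanagerReceiversNotConfigured"}

def unexpected_firing(alerts):
    """Return the firing alerts that would fail the test."""
    return [a for a in alerts
            if a.get("alertstate") == "firing"
            and a["alertname"] not in ALLOWED_ALERTS]

# The example alert from this report (abbreviated):
sample = json.loads('{"alertname":"KubePodCrashLooping","alertstate":"firing",'
                    '"container":"console","namespace":"openshift-console",'
                    '"severity":"warning"}')

print([a["alertname"] for a in unexpected_firing([sample])])
# → ['KubePodCrashLooping']
```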

Version-Release number of selected component (if applicable):
4.8.0

How reproducible:
I have seen it happen multiple times; here are some examples:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25936/pull-ci-openshift-origin-master-e2e-aws-single-node/1371419097056677888
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25936/pull-ci-openshift-origin-master-e2e-aws-single-node/1371180320862244864

This did not use to happen; it started recently and now seems to happen consistently.

Steps to Reproduce:
1. Launch a single node cluster on an m5d.2xlarge AWS instance
2. Run openshift/conformance/parallel tests
3. Alert should probably fire

Actual results:
Alert fires

Expected results:
Alert shouldn't fire

Additional info:
none

Comment 1 Jakub Hadvig 2021-03-16 09:08:13 UTC
The console pod status message is:
```
t-aws.dev.rhcloud.com: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
E0315 11:58:13.536731       1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
E0315 11:58:23.539159       1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
E0315 11:58:33.541710       1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
E0315 11:58:43.545960       1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
I0315 11:58:53.558215       1 main.go:670] Binding to [::]:8443...
I0315 11:58:53.558245       1 main.go:672] using TLS
```

which indicates that the console is unable to contact the OAuth server. After checking the console-operator pod logs, I see that ConsoleRouteController is failing to sync the console route, with RouteHealthDegraded set to true.
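The crash-looping itself is visible in the pod's container statuses, which is what the KubePodCrashLooping alert (via kube-state-metrics restart counters) effectively surfaces. A hedged sketch, with the pod status abbreviated to the relevant Kubernetes Pod API fields and `crash_looping` being an illustrative helper:

```python
# Abbreviated Pod status for a container stuck in CrashLoopBackOff.
pod_status = {
    "containerStatuses": [
        {"name": "console",
         "restartCount": 7,
         "state": {"waiting": {"reason": "CrashLoopBackOff",
                               "message": "back-off restarting failed container"}}}
    ]
}

def crash_looping(status):
    """Return names of containers currently in CrashLoopBackOff."""
    return [c["name"] for c in status.get("containerStatuses", [])
            if c.get("state", {}).get("waiting", {}).get("reason") == "CrashLoopBackOff"]

print(crash_looping(pod_status))
# → ['console']
```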

After checking the openshift-oauth-apiserver logs, I see they are flooded with the following errors:
```
2021-03-15T12:09:19.344928047Z I0315 12:09:19.344887       1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.168.253:2379  <nil> 0 <nil>}] <nil> <nil>}
2021-03-15T12:09:19.344928047Z I0315 12:09:19.344898       1 clientconn.go:948] ClientConn switching balancer to "pick_first"
2021-03-15T12:09:19.345090982Z I0315 12:09:19.345064       1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000ef21a0, {CONNECTING <nil>}
2021-03-15T12:09:19.358775364Z I0315 12:09:19.357807       1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000ef21a0, {READY <nil>}
2021-03-15T12:09:19.361115821Z I0315 12:09:19.359803       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2021-03-15T12:09:22.455476705Z E0315 12:09:22.455430       1 fieldmanager.go:186] [SHOULD NOT HAPPEN] failed to update managedFields for authentication.k8s.io/v1, Kind=TokenReview: failed to convert new object (authentication.k8s.io/v1, Kind=TokenReview) to smd typed: no corresponding type for authentication.k8s.io/v1, Kind=TokenReview
2021-03-15T12:09:30.630849297Z E0315 12:09:30.630751       1 fieldmanager.go:186] [SHOULD NOT HAPPEN] failed to update managedFields for authentication.k8s.io/v1, Kind=TokenReview: failed to convert new object (authentication.k8s.io/v1, Kind=TokenReview) to smd typed: no corresponding type for authentication.k8s.io/v1, Kind=TokenReview
```

On the other hand, `openshift-authentication-operator` is also logging errors, with `OAuthRouteCheckEndpointAccessibleController` being degraded:
```
2021-03-15T11:48:42.019362533Z I0315 11:48:42.019310       1 status_controller.go:213] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-03-15T11:45:37Z","message":"OAuthRouteCheckEndpointAccessibleControllerDegraded: Get \"https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/healthz\": dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host\nWellKnownReadyControllerDegraded: kube-apiserver oauth endpoint https://10.0.168.253:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)","reason":"OAuthRouteCheckEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-03-15T11:43:52Z","message":"OAuthVersionRouteProgressing: Request to \"https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/healthz\" not successful yet","reason":"OAuthVersionRoute_WaitingForRoute","status":"True","type":"Progressing"},{"lastTransitionTime":"2021-03-15T11:35:44Z","message":"OAuthVersionRouteAvailable: HTTP request to \"https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/healthz\" failed: dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host\nOAuthRouteCheckEndpointAccessibleControllerAvailable: Get \"https://oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com/healthz\": dial tcp: lookup oauth-openshift.apps.ci-op-vl3520jd-1ee30.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such 
host","reason":"OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed","status":"False","type":"Available"},{"lastTransitionTime":"2021-03-15T11:35:45Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
```
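To pull the relevant condition out of a clusteroperator status blob like the one above, a small parsing sketch can help (the condition list below is abbreviated from the diff; `condition` is an illustrative helper, and the structure matches `oc get clusteroperator authentication -o json`):

```python
import json

# Abbreviated conditions from the status diff above.
status = json.loads("""
{"conditions": [
  {"type": "Degraded", "status": "True",
   "reason": "OAuthRouteCheckEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError"},
  {"type": "Progressing", "status": "True", "reason": "OAuthVersionRoute_WaitingForRoute"},
  {"type": "Available", "status": "False",
   "reason": "OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed"},
  {"type": "Upgradeable", "status": "True", "reason": "AsExpected"}
]}
""")

def condition(conditions, cond_type):
    """Return the first condition of the given type, or None."""
    return next((c for c in conditions if c["type"] == cond_type), None)

degraded = condition(status["conditions"], "Degraded")
print(degraded["status"], degraded["reason"])
```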

Apparently this is not a console-related issue, since console functionality depends on various other components. It looks like these components have issues with routing.

Comment 4 Stephen Greene 2021-03-26 19:59:37 UTC

*** This bug has been marked as a duplicate of bug 1943578 ***

