Bug 1765276 - authentication operator reports RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Standa Laznicka
QA Contact: scheng
URL:
Whiteboard:
Duplicates: 1750953 1779429
Depends On:
Blocks:
 
Reported: 2019-10-24 17:52 UTC by Dan Mace
Modified: 2023-10-06 18:42 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-10 11:06:26 UTC
Target Upstream Version:
Embargoed:



Description Dan Mace 2019-10-24 17:52:55 UTC
Description of problem:

The authentication operator will sometimes report the following degraded condition:

    RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout

Observed on the following platforms in CI over the past 14 days: aws, openstack, gcp, azure, metal

The nature of the error (which could be from the backend/app) and the broad list of platforms (many of which don't use cloud LB for ingress) seem like clues.
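
For anyone looking at a live cluster, the condition itself can be inspected with something like the following (a minimal sketch assuming oc access; it just dumps each condition's type, status, and message from the authentication clusteroperator):

$ oc get clusteroperator authentication \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'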


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Dan Mace 2019-10-24 19:46:42 UTC
*** Bug 1750953 has been marked as a duplicate of this bug. ***

Comment 3 Dan Mace 2019-10-25 13:55:20 UTC
Just a small update from my initial analysis. In the examples I've looked through so far, ingress becomes available within a reasonable timeframe (as far as it's possible to tell from the state we have), and the auth operator eventually acknowledges this, but the operator then enters a lengthy period (up to 30m) during which the TLS error appears when accessing the route.

So far I'm not seeing any evidence of a networking issue in this regard.

The error could be coming from the auth endpoint behind the route. This could stem from some sort of botched certificate interaction between the ingress operator and auth (in openshift-config-managed), but since the secret bytes are opaque in the state dumps it's hard to say for sure.

Comment 4 Dan Mace 2019-10-29 15:27:54 UTC
I need to run a couple of experiments and analyze some more data sets to confirm, but I suspect now that the issue is that when the client pod is on a master node, traffic destined for the LB IP on AWS is getting messed up either on the way to the LB or the way back (perhaps due to VPC/SG configurations).

Note that on AWS in Kube 1.16, unlike GCP and Azure, pod traffic destined for LB Service IPs will actually egress to the ELB and back to the node.
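
A quick way to check whether the default router on a given cluster is actually fronted by a cloud LoadBalancer Service (router-default in openshift-ingress is the default ingresscontroller's service; treat the names as assumptions about a stock install):

$ oc -n openshift-ingress get svc router-default
$ oc -n openshift-ingress get svc router-default -o jsonpath='{.status.loadBalancer.ingress[*].hostname}{"\n"}'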

Comment 5 Dan Mace 2019-10-31 13:57:37 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.3/228

This error has been observed to occur even on the bare metal tests which use HostNetwork ingress. The error is always transient, but the window of time before resolution varies. In a random metal/HostNetwork example[1], consistent with other examples, there is no evidence of a router/ingress issue. The problem window does not seem to coincide with any interesting direct or indirect ingress event. No rollout is happening, the router is reporting successful health checks, the containers haven't restarted. The iptables rules on the node look consistent.

If this is an issue with haproxy itself or an SDN issue, it might be too subtle for me to imagine at this point.

Isn't it possible the operator health check client or oauth health check endpoint is the problem? These seem like more obvious explanations given the evidence at hand so far.

Even if so, there could still be some issue with ingress/auth certificate integration which causes the problematic request or response, but we need a reproducer to see where the evidence leads.

At this point, I've done a lot of due diligence on the networking side, and ask that the auth team try to reproduce and collect more evidence which can rule out the health check client or server.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.3/228
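
For completeness, the checks described above (ingress operator status, router restarts/placement, iptables state on a node) can be gathered roughly like this; <node-name> is a placeholder and the grep pattern is only a rough guess at the relevant service rules:

$ oc get clusteroperator ingress
$ oc -n openshift-ingress get pods -o wide
$ oc debug node/<node-name> -- chroot /host iptables-save | grep openshift-authentication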

Comment 6 Dan Mace 2019-10-31 15:32:04 UTC
More info from the bare metal example[1]:

authentication-operator (client):

E1031 01:03:50.697204       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: net/http: TLS handshake timeout

oauth-openshift (server):

I1031 01:16:04.005363       1 log.go:172] http: TLS handshake error from 10.128.0.1:57498: EOF

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.3/228
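
The two excerpts above are from the operator (the health check client) and the oauth server; on a live cluster the equivalent logs can be pulled with something like the following (the app=oauth-openshift label is an assumption about the default deployment):

$ oc -n openshift-authentication-operator logs deploy/authentication-operator | grep 'failed to GET route'
$ oc -n openshift-authentication logs -l app=oauth-openshift --tail=-1 | grep 'TLS handshake error'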

Comment 9 W. Trevor King 2019-12-03 22:07:10 UTC
Happened 58 times in the last 24h (4% of all e2e failures) [1].

[1]: https://search.svc.ci.openshift.org/chart?search=RouteHealthDegraded%3A+failed+to+GET+route.*TLS+handshake+timeout

Comment 10 Maru Newby 2019-12-04 02:35:34 UTC
(In reply to W. Trevor King from comment #9)
> Happened 58 times in the last 24h (4% of all e2e failures) [1].
> 
> [1]: https://search.svc.ci.openshift.org/chart?search=RouteHealthDegraded%3A+failed+to+GET+route.*TLS+handshake+timeout

Note that this search is not a good indication whether the issue reported by this bz is still occurring. The error message is likely to appear when a cluster is started or upgraded since the required route is likely to be unavailable while router instances are starting or restarting. Only when this condition remains for an extended period of time - and when the ingress operator is otherwise reporting healthy - is the condition reported by this bz likely to be reproducing. 

Ideally it would be possible to determine from build log output that a route health issue was persisting for a non-trivial duration (more than a couple of minutes) while the ingress operator was reporting healthy. The former suggests adding lastTransitionTime to the build log output, and the latter suggests that the auth operator consider the status of the ingress operator when checking route health. Would it be reasonable for me to pursue these changes? I'm not sure whether the absence of lastTransitionTime is intentional or an omission, and I'm not clear on whether operators are allowed to consider the status of other operators.
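
For what it's worth, the transition time is already recorded on the ClusterOperator object even though it isn't surfaced in the build log, e.g.:

$ oc get clusteroperator authentication \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].lastTransitionTime}{"\n"}'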

Comment 11 W. Trevor King 2019-12-04 03:05:44 UTC
> The error message is likely to appear when a cluster is started...

The build-log entries are from the monitor which runs in parallel with the tests.  It should not be running in parallel with startup.

> ... or upgraded since the required route is likely to be unavailable while router instances are starting or restarting.

Why would the route become unavailable during update?  I'd expect a zero-downtime handoff?  Also happens during non-update jobs like [1].

> Only when this condition remains for an extended period of time...

This is why the Degraded API says that operators should only go Degraded when the trigger exists for long enough to impact quality of service [2].  It's a bug for operators to flap Degraded=True briefly when there is no QoS impact.  That also means you don't need something about lastTransitionTime in the build-log output.

> ...suggests that the auth operator be considering the status of the ingress operator when checking route health...

I don't think we want this.  If auth is degraded because the route is broken enough to impact QoS, it's degraded.  If other ingress stuff is degraded too, that's fine, and both operators should be reporting Degraded=True and explaining the QoS impacts.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/514
[2]: https://github.com/openshift/api/blob/2ea89d203c53704f1fcfeb55c13ededab14fd020/config/v1/types_cluster_operator.go#L152-L156
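
To make the "only go Degraded after a sustained hit" rule concrete, here is a minimal shell sketch of the idea (purely illustrative, not how the operator implements it; the route name and the /healthz path are assumptions):

$ ROUTE_HOST=$(oc -n openshift-authentication get route oauth-openshift -o jsonpath='{.spec.host}')
$ WINDOW=120; first_failure=
$ while sleep 10; do
    if curl -sk --max-time 10 -o /dev/null "https://${ROUTE_HOST}/healthz"; then
      first_failure=                       # healthy again: reset the failure window
    else
      now=$(date +%s); : "${first_failure:=$now}"
      # only report Degraded once the failure has lasted the whole window
      (( now - first_failure >= WINDOW )) && echo "degraded: route failing for >= ${WINDOW}s"
    fi
  done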

Comment 12 W. Trevor King 2019-12-04 05:33:55 UTC
To differentiate this from broader ingress issues, here are release jobs that failed with the auth degradation but no ingress degradation:

$ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=ingress.*Degraded&search=RouteHealthDegraded:%20failed%20to%20GET%20route:%20net/http:%20TLS%20handshake%20timeout' | jq -r '. | to_entries[] | select(.value["RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout"] != null and .value["ingress.*Degraded"] == null) | .key' | grep /release
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-rhel7-workers-4.2/104
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.4/105
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/509
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/514
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/515
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/104
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.3/509
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/861
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/862
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/318
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11942
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11949
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11954
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11965
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11969
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11974
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11982
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11984
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11989
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11991
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11993
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/222
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-old-rhcos-e2e-aws-4.3/148
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-old-rhcos-e2e-aws-4.3/149

Picking on one of those [1], monitored error events are:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.4/105/build-log.txt | grep Degraded | sort | uniq
Dec 03 20:06:17.166 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Dec 03 20:06:22.178 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Dec 03 20:19:37.203 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "" to "OperatorSyncDegraded: failed syncing configuration objects: config maps [v4-0-config-user-idp-0-ca] in openshift-authentication not synced"
Dec 03 20:19:41.308 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "OperatorSyncDegraded: failed syncing configuration objects: config maps [v4-0-config-user-idp-0-ca] in openshift-authentication not synced" to "",Progressing changed from False to True ("Progressing: deployment's observed generation did not reach the expected generation")
Dec 03 20:20:20.404 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "" to "RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout"
Dec 03 20:20:21.841 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout" to "",Progressing changed from True to False ("")

So: a brief flap in the DNS Degraded condition (probably a bug in the DNS operator, but also probably unrelated to this bug).  Then auth flaps OperatorSyncDegraded for a few seconds (does failing to sync configs for a few seconds really have a QoS impact?  Seems unlikely).  Then, a minute after the sync flap, we have a one-second RouteHealthDegraded flap.  I'll look at what else was happening around 20:20:20 to see if we can find a culprit, although I'm also skeptical that any serious client should be impacted by a one-second OAuth route outage.  Maybe some tests are super-sensitive and are impacted by such a brief outage?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.4/105

Comment 13 W. Trevor King 2019-12-04 05:39:30 UTC
Also from the build log [1]:

passed: (11.4s) 2019-12-03T20:19:33 "[Feature:OpenShiftAuthorization][Serial] authorization  TestAuthorizationResourceAccessReview should succeed [Suite:openshift/conformance/serial]"

started: (0/172/221) "[Serial] [Feature:OAuthServer] [RequestHeaders] [IdP] test RequestHeaders IdP [Suite:openshift/conformance/serial]"

W1203 20:20:16.832891     239 reflector.go:299] github.com/openshift/origin/pkg/monitor/operator.go:126: watch of *v1.ClusterOperator ended with: too old resource version: 47648 (47684)
passed: (50.4s) 2019-12-03T20:20:23 "[Serial] [Feature:OAuthServer] [RequestHeaders] [IdP] test RequestHeaders IdP [Suite:openshift/conformance/serial]"

So both auth flaps happened during that OAuthServer test, and that test passed.  Hiccups on that job might be due to something about that test itself?  The actual failing test on that job was:

started: (0/186/221) "[sig-api-machinery] Namespaces [Serial] should always delete fast (ALL of 100 namespaces in 150 seconds) [Feature:ComprehensiveNamespaceDraining] [Suite:openshift/conformance/serial] [Suite:k8s]"

W1203 20:23:23.832303     239 reflector.go:299] github.com/openshift/origin/pkg/monitor/operator.go:126: watch of *v1.ClusterOperator ended with: too old resource version: 48994 (50365)
...
fail [k8s.io/kubernetes/test/e2e/apimachinery/namespace.go:50]: failed to create namespace: nslifetest-31
Unexpected error:
    <*errors.errorString | 0xc0002ea760>: {
        s: "watch closed before UntilWithoutRetry timeout",
    }
    watch closed before UntilWithoutRetry timeout
occurred

failed: (1m12s) 2019-12-03T20:24:25 "[sig-api-machinery] Namespaces [Serial] should always delete fast (ALL of 100 namespaces in 150 seconds) [Feature:ComprehensiveNamespaceDraining] [Suite:openshift/conformance/serial] [Suite:k8s]"

That failing test ran a few minutes later, when the auth operator was happy again.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.4/105/build-log.txt

Comment 14 W. Trevor King 2019-12-04 05:46:31 UTC
Checking another one of the serial jobs (this time on Azure), I see the same auth Degraded flapping during that same test:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/509/build-log.txt | grep Degraded
Dec 03 05:17:04.674 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "" to "OperatorSyncDegraded: failed syncing configuration objects: config maps [v4-0-config-user-idp-0-ca] in openshift-authentication not synced"
Dec 03 05:17:08.665 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "OperatorSyncDegraded: failed syncing configuration objects: config maps [v4-0-config-user-idp-0-ca] in openshift-authentication not synced" to "",Progressing changed from False to True ("Progressing: deployment's observed generation did not reach the expected generation")
Dec 03 05:17:44.668 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "" to "RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout"
Dec 03 05:17:46.118 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout" to ""
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/509/build-log.txt | grep -3 'test RequestHeaders IdP'

skipped: (4.1s) 2019-12-03T05:16:57 "[sig-storage] CSI Volumes [Driver: pd.csi.storage.gke.io][Serial] [Testpattern: Inline-volume (default fs)] subPath should be able to unmount after the subpath directory is deleted [Suite:openshift/conformance/serial] [Suite:k8s] [Skipped:gce]"

started: (0/152/221) "[Serial] [Feature:OAuthServer] [RequestHeaders] [IdP] test RequestHeaders IdP [Suite:openshift/conformance/serial]"

W1203 05:17:53.831381     254 reflector.go:299] github.com/openshift/origin/pkg/monitor/operator.go:126: watch of *v1.ClusterOperator ended with: too old resource version: 56017 (56061)
W1203 05:17:59.081110     254 reflector.go:299] github.com/openshift/origin/pkg/monitor/operator.go:126: watch of *v1.ClusterOperator ended with: too old resource version: 56061 (56086)
passed: (1m59s) 2019-12-03T05:18:57 "[Serial] [Feature:OAuthServer] [RequestHeaders] [IdP] test RequestHeaders IdP [Suite:openshift/conformance/serial]"

started: (0/153/221) "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: gce-localssd-scsi-fs] [Serial] [Testpattern: Inline-volume (default fs)] volumes should store data [Suite:openshift/conformance/serial] [Suite:k8s] [Skipped:gce]"

So I'm pretty sure at least some of these are due to something about that specific test's logic.

Comment 15 Maru Newby 2019-12-04 13:52:02 UTC
The next step in resolving this bz is ensuring the auth operator can differentiate between route health (a responsibility of the ingress operator) and the health of endpoints managed by the auth operator. A previously submitted PR [1] proposed checking endpoint health; merging it will enable diagnosis of the problem reported by this bz.

1: https://github.com/openshift/cluster-authentication-operator/pull/211
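
Until that lands, a manual way to tell the two failure modes apart is to probe the oauth server both through the route and directly against the service (names are the 4.x defaults; the in-cluster probe assumes the pod image ships curl, otherwise run it from a debug pod):

$ ROUTE_HOST=$(oc -n openshift-authentication get route oauth-openshift -o jsonpath='{.spec.host}')
$ curl -sk -o /dev/null -w 'via route:   %{http_code}\n' "https://${ROUTE_HOST}/healthz"
$ oc -n openshift-authentication exec deploy/oauth-openshift -- \
    curl -sk -o /dev/null -w 'via service: %{http_code}\n' https://oauth-openshift.openshift-authentication.svc/healthz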

Comment 17 Maru Newby 2020-01-10 11:10:23 UTC
Revised PR to support differentiating between endpoint and route health issues: 

https://github.com/openshift/cluster-authentication-operator/pull/237

Comment 19 Maru Newby 2020-02-26 06:10:14 UTC
I haven't found a way to reliably reproduce these problems to get to a root cause. Moving to 4.5.

Comment 20 Michal Fojtik 2020-05-12 10:32:51 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale" and decreasing severity from "medium" to "low".

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 21 OpenShift BugZilla Robot 2020-05-20 00:29:54 UTC
This bug hasn't had any activity in the 7 days since it was marked as LifecycleStale, so we are closing it as WONTFIX. If you consider this bug still valuable, please reopen it or create a new bug.

Comment 22 W. Trevor King 2020-05-20 00:59:09 UTC
Still see these in 1% of failures, using the query from comment 9.  Recent example job was 4.3.21 -> 4.4.0-0.ci-2020-05-18-233313 [1], where it was not fatal.  Also looks like I reported a dup in bug 1779429, which I'll close in favor of this one.  

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/29522

Comment 23 W. Trevor King 2020-05-20 01:00:00 UTC
*** Bug 1779429 has been marked as a duplicate of this bug. ***

Comment 24 Michal Fojtik 2020-05-27 00:01:53 UTC
This bug hasn't had any activity in the 7 days since it was marked as LifecycleStale, so we are closing it as WONTFIX. If you consider this bug still valuable, please reopen it or create a new bug.

Comment 25 W. Trevor King 2020-05-27 02:39:16 UTC
Why should you expect weekly bumps?  I commented on the 20th, and then this gets closed in the next comment on the 27th with "hasn't had any activity"?  What sort of activity are you expecting?

Comment 27 Maru Newby 2020-06-18 14:25:36 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 29 Maru Newby 2020-07-11 02:55:22 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 30 Venkata Siva Teja Areti 2020-07-23 21:49:16 UTC
This does not look like an auth operator issue. The degraded condition points to the routing component. If there were an issue with the oauth server pods, we would also expect to see a degraded condition like "MissingEndpoints" or "NonReadyEndpoints" alongside the "RouteDegraded" message.
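
A quick way to sanity-check that theory on an affected cluster is to look at the Degraded message together with the oauth endpoints and pods (names assume the default deployment):

$ oc get clusteroperator authentication -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'
$ oc -n openshift-authentication get endpoints oauth-openshift
$ oc -n openshift-authentication get pods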

Comment 31 W. Trevor King 2020-07-23 23:59:37 UTC
> The degraded condition points to the routing component.

What auth functionality has a degraded QoS when the RouteHealthDegraded condition is firing [1]?  Is it just "external folks won't be able to reach the OAuth server"?  If that's the case, and this is really about the routing component, maybe auth should drop the condition and leave it to the ingress operator to sound the alarm if incoming Routes are broken?

[1]: https://github.com/openshift/api/blob/787191c0c3c8cec8e481c9e1c4cf922404069da8/config/v1/types_cluster_operator.go#L153

Comment 32 Maru Newby 2020-07-24 19:58:46 UTC
(In reply to W. Trevor King from comment #31)
> > The degraded condition points to the routing component.
> 
> What auth functionality has a degraded QoS when the RouteHealthDegraded
> condition is firing [1]?  Is it just "external folks won't be able to reach
> the OAuth server"?  If that's the case, and this is really about the routing
> component, maybe auth should drop the condition and leave it to the ingress
> operator to sound the alarm if incoming Routes are broken?

How is it not the responsibility of the auth operator to report degraded if it can detect that the oauth server is likely to be unreachable outside the cluster? 

Note that degraded ingress is not necessarily required for the route to be degraded. Route admission could also be at fault, which would not represent a degraded condition for the ingress operator.

Comment 33 Standa Laznicka 2020-08-03 08:04:33 UTC
Yeah, I think the authentication operator should care whether authentication is working or not, so we're keeping that condition.

Also, I'm adding UpcomingSprint because I was occupied by other things.

Comment 34 Standa Laznicka 2020-08-10 11:06:26 UTC
I went through many of the https://search.ci.openshift.org/?search=RouteHealthDegraded%3A+failed+to+GET+route.*TLS+handshake+timeout&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job jobs; most (like 98%?) involve 4.1 and 4.2, which are EOL, so I don't care about them at all. Honestly, I don't understand why we even run them; they only pollute the searches.

The only 4.3-to-4.4 job I saw did not flap; the operator unset the condition shortly after it came up. Closing.

Comment 35 W. Trevor King 2020-08-10 22:24:55 UTC
$ curl -s 'https://search.ci.openshift.org/search?search=RouteHealthDegraded%3A+failed+to+GET+route.*TLS+handshake+timeout&search=Resolved+release&maxAge=168h&context=1&type=build-log&name=release-openshift.*upgrade' | jq -r 'to_entries[] | select((.value | length) == 2) | .key + "\n" + ([.value["Resolved release"][].context[]] | join("\n"))'

turned up:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1291727055905361920/build-log.txt | grep 'Resolved release\|Container setup in pod\|RouteHealthDegraded: failed to GET route.*TLS handshake timeout'
2020/08/07 13:25:13 Resolved release initial to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-08-07-112550
2020/08/07 13:25:13 Resolved release latest to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-08-07-132242
2020/08/07 14:08:38 Container setup in pod e2e-aws-upgrade completed successfully
Aug 07 14:18:40.462 I ns/openshift-authentication-operator deployment/authentication-operator reason/OperatorStatusChanged Status for clusteroperator/authentication changed: Degraded message changed from "RouteHealthDegraded: failed to GET route: EOF" to "RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout"
Aug 07 14:18:42.504 I ns/openshift-authentication-operator deployment/authentication-operator reason/OperatorStatusChanged Status for clusteroperator/authentication changed: Degraded message changed from "RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout" to "RouteHealthDegraded: failed to GET route: EOF"

So I'm not convinced that we've solved the problem here.  But we can always re-open if we run into this again in future promotion informers.

Comment 36 Red Hat Bugzilla 2023-09-15 00:19:17 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

