Bug 1779429

Summary: flapping: RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: apiserver-auth
Assignee: Maru Newby <mnewby>
Status: CLOSED DUPLICATE
QA Contact: scheng
Severity: medium
Docs Contact:
Priority: medium
Version: 4.3.0
CC: aos-bugs, mfojtik, nagrawal, slaznick, sttts
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-05-20 01:00:00 UTC
Type: Bug

Description W. Trevor King 2019-12-04 00:21:07 UTC
From a 4.3 promotion job [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/512/build-log.txt | grep 'authentication.*RouteHealthDegraded' | sort | uniq
Dec 03 19:26:36.990 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "" to "RouteHealthDegraded: failed to GET route: EOF"
Dec 03 19:26:38.355 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "RouteHealthDegraded: failed to GET route: EOF" to ""
Dec 03 19:27:41.117 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "" to "RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout"
Dec 03 19:27:42.561 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout" to ""
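
Pairing each "set" transition with the following "clear" transition makes the flap duration obvious; a rough awk sketch over the same build log (illustrative only, it just re-groups the four lines above) shows Degraded windows of roughly 1.4 seconds each:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/512/build-log.txt \
    | grep 'authentication.*RouteHealthDegraded' | sort | uniq \
    | awk '/changed from "" to/ { start = $1 " " $2 " " $3 }             # Degraded message set
           /to ""$/             { print start, "->", $1 " " $2 " " $3 }  # Degraded message cleared'
Dec 03 19:26:36.990 -> Dec 03 19:26:38.355
Dec 03 19:27:41.117 -> Dec 03 19:27:42.561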

Might be related to bug 1765280, which covers the "connection refused" version of RouteHealthDegraded.  And it's probably not worth flipping in and out of Degraded on a timescale of seconds.  From [2]:

  Degraded indicates that the operator's current state does not match its desired state over a period of time resulting in a lower quality of service. The period of time may vary by component, but a Degraded state represents persistent observation of a condition.  As a result, a component should not oscillate in and out of Degraded state

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/512
[2]: https://github.com/openshift/api/blob/2ea89d203c53704f1fcfeb55c13ededab14fd020/config/v1/types_cluster_operator.go#L152-L156
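
On a live cluster, the same oscillation should show up as the authentication ClusterOperator's Degraded condition bumping its lastTransitionTime every few seconds.  A rough way to watch it (just a sketch, using only standard ClusterOperator condition fields):

$ oc get clusteroperator/authentication -o json \
    | jq '.status.conditions[] | select(.type == "Degraded") | {status, lastTransitionTime, message}'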

Comment 1 W. Trevor King 2019-12-04 00:37:02 UTC
[1] suggests this is slightly more common than the "connection refused" form (59 vs. 36 e2e jobs over the past 24h), while the EOF form hit only 10 jobs.  The EOF jobs seem to be a subset of the TLS-handshake-timeout jobs, and those in turn seem to be completely distinct from the "connection refused" jobs.

[1]: https://search.svc.ci.openshift.org/chart?search=RouteHealthDegraded:%20failed%20to%20GET%20route.*connection%20refused&search=RouteHealthDegraded:%20failed%20to%20GET%20route.*TLS%20handshake%20timeout&search=RouteHealthDegraded:%20failed%20to%20GET%20route.*EOF
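
A quick way to re-derive those counts from the command line, assuming the /search endpoint returns a JSON object keyed by job URL (a sketch, not a supported interface):

$ for symptom in 'connection%20refused' 'TLS%20handshake%20timeout' 'EOF'; do
    printf '%s: ' "$symptom"
    curl -sL "https://search.svc.ci.openshift.org/search?search=RouteHealthDegraded:%20failed%20to%20GET%20route.*${symptom}&maxAge=24h&type=build-log" \
      | jq 'keys | length'   # number of matching jobs in the last 24h
  done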

Comment 4 Michal Fojtik 2020-05-12 10:33:33 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale" and decreasing severity from "medium" to "low".

If you have further information on the current state of the bug, please update it; otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 5 W. Trevor King 2020-05-13 05:23:06 UTC
Still seeing a lot of these:

$ curl -sL 'https://search.svc.ci.openshift.org/search?search=RouteHealthDegraded:+failed+to+GET+route:+net/http:+TLS+handshake+timeout&maxAge=24h&context=-1&type=build-log' | jq -r '. | keys[]'
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-knative-serverless-operator-release-1.4-4.2-e2e-aws-ocp-42-continuous/435
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1143
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/1357
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.3/1312
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.3/1313
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/226
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly/92
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/79
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3-to-4.4-to-4.5-ci/64
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci/44
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/28719
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/28746
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/28749
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/28779
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-old-rhcos-e2e-aws-4.2/519
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-old-rhcos-e2e-aws-4.3/428
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/552/pull-ci-openshift-cluster-network-operator-master-e2e-ovn-hybrid-step-registry/324
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/624/pull-ci-openshift-cluster-network-operator-release-4.4-e2e-gcp-ovn/788
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1726/pull-ci-openshift-machine-config-operator-release-4.3-e2e-gcp-upgrade/227
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1726/pull-ci-openshift-machine-config-operator-release-4.3-e2e-gcp-upgrade/228

Comment 6 W. Trevor King 2020-05-20 01:00:00 UTC
Symptoms look very similar to bug 1765276, so closing as a dup.  We can always reopen if it turns out there are two different issues at play.

*** This bug has been marked as a duplicate of bug 1765276 ***