Description of problem:

The authentication operator will sometimes report the following degraded condition:

  RouteHealthDegraded: failed to GET route: dial tcp: lookup oauth-openshift.apps.<namespace>.<domain> on 172.30.0.10:53: no such host

Observed on the following platforms in CI over the past 14 days: azure, gcp

The nature of the error (which looks like a DNS failure) and the fact that it has only been observed on GCP and Azure seem like clues.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
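For illustration, here is a minimal Go sketch (hypothetical names and URL, not the auth operator's actual code) of the kind of route health probe that produces this condition. A missing wildcard DNS record surfaces from the HTTP client as a *net.DNSError, i.e. "no such host":

package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
)

// checkRouteHealth performs the kind of GET the degraded condition reports
// failing. A missing DNS record comes back as a *net.DNSError.
func checkRouteHealth(routeURL string) error {
	resp, err := http.Get(routeURL)
	if err != nil {
		var dnsErr *net.DNSError
		if errors.As(err, &dnsErr) {
			// The failure mode in this bug: the route host doesn't resolve
			// because the wildcard record was never published.
			return fmt.Errorf("failed to GET route: %w", err)
		}
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	// Hypothetical route host; in-cluster this resolves via the cluster
	// DNS service (172.30.0.10 in the report).
	if err := checkRouteHealth("https://oauth-openshift.apps.example.com/healthz"); err != nil {
		fmt.Println("RouteHealthDegraded:", err)
	}
}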
https://ci-search-ci-search-next.svc.ci.openshift.org/?search=RouteHealthDegraded%3A+failed+to+GET+route.*no+such+host&maxAge=336h&context=2&type=all
I found one way in which this can manifest.

Ingress is reporting Degraded=False,Available=True:

https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-cluster-autoscaler-operator-e2e-azure-master/680/artifacts/e2e-azure-operator/clusteroperators.json

Meanwhile, the default ingress controller is failing to provision a cloud LB:

https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-cluster-autoscaler-operator-e2e-azure-master/680/artifacts/e2e-azure-operator/must-gather/registry-svc-ci-openshift-org-ci-op-gr65yhw1-stable-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-ingress-operator/operator.openshift.io/ingresscontrollers/default.yaml

  - lastTransitionTime: "2019-10-24T16:03:39Z"
    message: |-
      The service-controller component is reporting SyncLoadBalancerFailed events like:
      Error syncing load balancer: failed to ensure load balancer: azure - cloud provider rate limited(read) for operation:PublicIPGet
      The kube-controller-manager logs may contain more details.
    reason: SyncLoadBalancerFailed
    status: "False"
    type: LoadBalancerReady

However, the ingresscontroller itself is Degraded=False,Available=True, so there's a propagation issue here. The ingresscontroller should report something like Degraded=True,Available=False, propagated up through the clusteroperator/ingress resource. Then it would be clearer that ingress is broken, and why.
(In reply to Dan Mace from comment #2)

I forgot to add that, in the referenced case, because the LB is still pending, DNS wouldn't yet have been set up to point at the LB, so the auth operator gets a DNS lookup failure on the route host.
Additionally, this DNS condition should have contributed to a degraded condition on the ingresscontroller:

  - lastTransitionTime: "2019-10-24T15:55:07Z"
    message: The wildcard record resource was not found.
    reason: RecordNotFound
    status: "False"
    type: DNSReady
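To make the expected propagation concrete, here is a minimal sketch (simplified stand-in types, not the real operator API) of deriving the ingresscontroller's Degraded condition from sub-conditions like LoadBalancerReady and DNSReady, so failures like the SyncLoadBalancerFailed and RecordNotFound conditions above stop being swallowed:

package main

import "fmt"

// Condition is a simplified stand-in for the operator API's condition type.
type Condition struct {
	Type   string
	Status string // "True" or "False"
	Reason string
}

// computeDegraded reports Degraded=True whenever a prerequisite condition
// (LB provisioned, wildcard DNS record published) is False.
func computeDegraded(conds []Condition) Condition {
	for _, c := range conds {
		if (c.Type == "LoadBalancerReady" || c.Type == "DNSReady") && c.Status == "False" {
			return Condition{Type: "Degraded", Status: "True", Reason: c.Reason}
		}
	}
	return Condition{Type: "Degraded", Status: "False", Reason: "AsExpected"}
}

func main() {
	conds := []Condition{
		{Type: "LoadBalancerReady", Status: "False", Reason: "SyncLoadBalancerFailed"},
		{Type: "DNSReady", Status: "False", Reason: "RecordNotFound"},
	}
	// Prints a Degraded=True condition carrying the first failing reason.
	fmt.Printf("%+v\n", computeDegraded(conds))
}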
A preliminary scan through the most readily available data [1] suggests to me that this RouteHealthDegraded reason tends to coincide with initial ingress rollouts. Normally, the RouteHealthDegraded condition is transient and short-lived as ingress starts and completes rolling out for the first time. However, given certain cloud provider or DNS misconfigurations, cloud provider rate limiting, etc., the ingress rollout may stall for much longer and indeterminate durations. In any case, the ingress operator is failing to accurately propagate a degraded or unavailable condition; fixing that would probably get humans to take restorative action sooner.

Importantly, for this particular condition reason, I see no evidence of a networking bug. However, the tendency of the k8s cloud provider code to be rate limited seems like a related but separate and interesting discovery. It's not yet clear what gains there are from optimizing cloud provider API calls in that and other components.

[1] https://ci-search-ci-search-next.svc.ci.openshift.org/?search=RouteHealthDegraded%3A+failed+to+GET+route.*no+such+host&maxAge=336h&context=2&type=all
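Since the condition is normally transient during initial rollout, one common operator pattern for separating the transient case from a stalled rollout (a sketch of the general technique, not necessarily what the eventual fix implements) is to report Degraded only after a sub-condition has been bad for longer than a grace period:

package main

import (
	"fmt"
	"time"
)

// TimedCondition is a simplified condition carrying its last transition time.
type TimedCondition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// degradedAfterGrace treats a False prerequisite as Degraded only once it
// has persisted past gracePeriod; shorter windows (a normal first rollout)
// are tolerated as transient.
func degradedAfterGrace(c TimedCondition, gracePeriod time.Duration, now time.Time) bool {
	return c.Status == "False" && now.Sub(c.LastTransitionTime) > gracePeriod
}

func main() {
	stalled := TimedCondition{
		Type:               "LoadBalancerReady",
		Status:             "False",
		LastTransitionTime: time.Now().Add(-30 * time.Minute),
	}
	// A rollout stuck for 30 minutes (e.g. cloud provider rate limiting)
	// exceeds a 5-minute grace period and should surface as Degraded.
	fmt.Println(degradedAfterGrace(stalled, 5*time.Minute, time.Now()))
}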
*** Bug 1765456 has been marked as a duplicate of this bug. ***
*** Bug 1765776 has been marked as a duplicate of this bug. ***
Setting sev to urgent since https://bugzilla.redhat.com/show_bug.cgi?id=1765776 was dup-ed to this bz and it causes install failures.
(In reply to Mike Fiedler from comment #8)
> Setting sev to urgent since
> https://bugzilla.redhat.com/show_bug.cgi?id=1765776 was dup-ed to this bz
> and it causes install failures.

I disagree with your assessment. The ingress operator's inaccurate status propagation here is not the cause of the failures, and the status fixes won't fix the install failures associated with this bug. Fixing this bug will only make it more obvious to observers what is wrong (i.e., cloud provider misconfiguration, rate limiting, etc.) so that unspecified follow-on actions can be taken. A dedicated investigator could already discover these details manually today, so I don't think it's accurate to say that resolving the underlying problems triggering the status reporting depends on this bug's resolution.
https://bugzilla.redhat.com/show_bug.cgi?id=1765282#c2 and https://bugzilla.redhat.com/show_bug.cgi?id=1765282#c4 are the two actionable items here. The ingress-operator should be fixed to correctly report its available/degraded status based on those comments, and then we can consider this bug fixed. After status reporting is fixed, we can open additional bugs as necessary if other critical reporting gaps are discovered.
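As a sketch of the second half of that fix (simplified types again, not the real config.openshift.io/v1 API), the roll-up would mirror the default ingresscontroller's conditions onto the clusteroperator/ingress resource, so a broken ingress is visible at the clusteroperator level instead of hidden:

package main

import "fmt"

// Condition is a simplified stand-in for a clusteroperator status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// rollUp mirrors the default ingresscontroller's Available and Degraded
// conditions onto clusteroperator/ingress, falling back to healthy defaults
// when the ingresscontroller doesn't report a condition.
func rollUp(ic []Condition) []Condition {
	out := []Condition{
		{Type: "Available", Status: "True", Reason: "AsExpected"},
		{Type: "Degraded", Status: "False", Reason: "AsExpected"},
	}
	for _, c := range ic {
		switch c.Type {
		case "Available":
			out[0] = c
		case "Degraded":
			out[1] = c
		}
	}
	return out
}

func main() {
	ic := []Condition{
		{Type: "Available", Status: "False", Reason: "IngressUnavailable"},
		{Type: "Degraded", Status: "True", Reason: "SyncLoadBalancerFailed"},
	}
	fmt.Printf("%+v\n", rollUp(ic))
}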
*** Bug 1752562 has been marked as a duplicate of this bug. ***
I think https://github.com/openshift/cluster-ingress-operator/pull/314 ends up being the fix for this issue.
Dan, I notice a similar issue at https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/161, but it says 'connection refused' instead of 'no such host'. Could you PTAL?
(In reply to Lokesh Mandvekar from comment #13)
> Dan, I notice a similar issue at
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/161
> but it says 'connection refused' instead of 'no such host', could you PTAL?

Bug 1765280 seems like a better match.
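The distinction matters for triage: "no such host" is a DNS resolution failure (this bug, where the wildcard record was never published), while "connection refused" means the host resolved but nothing accepted the TCP connection. A small Go sketch of how the two surface differently:

package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// classify distinguishes the two failure modes discussed in this thread.
func classify(err error) string {
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) && dnsErr.IsNotFound {
		return "no such host: DNS record missing (this bug)"
	}
	if errors.Is(err, syscall.ECONNREFUSED) {
		return "connection refused: host resolved, but no listener (a different bug)"
	}
	return "other network error"
}

func main() {
	// .invalid is reserved and never resolves, so this yields "no such host".
	_, err := net.Dial("tcp", "no-such-host.invalid:443")
	fmt.Println(classify(err))
}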
https://github.com/openshift/cluster-ingress-operator/pull/314 has already merged, so this should be MODIFIED.
See https://bugzilla.redhat.com/show_bug.cgi?id=1740374#c11; this issue has been verified with 4.3.0-0.nightly-2019-11-17-224250.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062