Description of problem:

The authentication operator will sometimes report the following degraded condition:

  RouteHealthDegraded: failed to GET route: dial tcp: lookup oauth-openshift.apps.<namespace>.<domain> on 172.30.0.10:53: no such host

Observed on the following platforms in CI over the past 14 days: azure, gcp

The nature of the error (which looks like a DNS failure) and the fact that it has only been observed on GCP and Azure seem like clues.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
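For illustration, here is a minimal Go sketch (hypothetical names and URL, not the auth operator's actual code) of the kind of route health probe that produces this condition. A missing wildcard DNS record surfaces from the HTTP client as a *net.DNSError, i.e. "no such host":

package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
)

// checkRouteHealth performs the kind of GET the degraded condition reports
// failing. A missing DNS record comes back as a *net.DNSError.
func checkRouteHealth(routeURL string) error {
	resp, err := http.Get(routeURL)
	if err != nil {
		var dnsErr *net.DNSError
		if errors.As(err, &dnsErr) {
			// The failure mode in this bug: the route host doesn't resolve
			// because the wildcard record was never published.
			return fmt.Errorf("failed to GET route: %w", err)
		}
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	// Hypothetical route host; in-cluster this resolves via the cluster
	// DNS service (172.30.0.10 in the report).
	if err := checkRouteHealth("https://oauth-openshift.apps.example.com/healthz"); err != nil {
		fmt.Println("RouteHealthDegraded:", err)
	}
}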
https://ci-search-ci-search-next.svc.ci.openshift.org/?search=RouteHealthDegraded%3A+failed+to+GET+route.*no+such+host&maxAge=336h&context=2&type=all
I found one way in which this can manifest.

Ingress is reporting Degraded=False,Available=True:

https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-cluster-autoscaler-operator-e2e-azure-master/680/artifacts/e2e-azure-operator/clusteroperators.json

Meanwhile, the default ingress controller is failing to provision a cloud LB:

https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-cluster-autoscaler-operator-e2e-azure-master/680/artifacts/e2e-azure-operator/must-gather/registry-svc-ci-openshift-org-ci-op-gr65yhw1-stable-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-ingress-operator/operator.openshift.io/ingresscontrollers/default.yaml

  - lastTransitionTime: "2019-10-24T16:03:39Z"
    message: |-
      The service-controller component is reporting SyncLoadBalancerFailed events like:
      Error syncing load balancer: failed to ensure load balancer: azure - cloud provider rate limited(read) for operation:PublicIPGet
      The kube-controller-manager logs may contain more details.
    reason: SyncLoadBalancerFailed
    status: "False"
    type: LoadBalancerReady

However, the ingresscontroller itself is Degraded=False,Available=True, so there's a propagation issue here. The ingresscontroller should report something like Degraded=True,Available=False, propagated up through the clusteroperator/ingress resource. Then it would be clearer that ingress is broken, and why.
(In reply to Dan Mace from comment #2)

I forgot to add that, in the referenced case, because the LB is still pending, DNS wouldn't yet have been set up to point at the LB, so the auth operator gets a DNS lookup failure on the route host.
Additionally, this DNS condition should have contributed to a degraded condition on the ingresscontroller:

  - lastTransitionTime: "2019-10-24T15:55:07Z"
    message: The wildcard record resource was not found.
    reason: RecordNotFound
    status: "False"
    type: DNSReady
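To make the expected propagation concrete, here is a minimal sketch (simplified stand-in types, not the real operator API) of deriving the ingresscontroller's Degraded condition from sub-conditions like LoadBalancerReady and DNSReady, so failures like the SyncLoadBalancerFailed and RecordNotFound conditions above stop being swallowed:

package main

import "fmt"

// Condition is a simplified stand-in for the operator API's condition type.
type Condition struct {
	Type   string
	Status string // "True" or "False"
	Reason string
}

// computeDegraded reports Degraded=True whenever a prerequisite condition
// (LB provisioned, wildcard DNS record published) is False.
func computeDegraded(conds []Condition) Condition {
	for _, c := range conds {
		if (c.Type == "LoadBalancerReady" || c.Type == "DNSReady") && c.Status == "False" {
			return Condition{Type: "Degraded", Status: "True", Reason: c.Reason}
		}
	}
	return Condition{Type: "Degraded", Status: "False", Reason: "AsExpected"}
}

func main() {
	conds := []Condition{
		{Type: "LoadBalancerReady", Status: "False", Reason: "SyncLoadBalancerFailed"},
		{Type: "DNSReady", Status: "False", Reason: "RecordNotFound"},
	}
	// Prints a Degraded=True condition carrying the first failing reason.
	fmt.Printf("%+v\n", computeDegraded(conds))
}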
A preliminary scan through the most readily available data [1] suggests to me that this RouteHealthDegraded reason tends to coincide with initial ingress rollouts. Normally, the RouteHealthDegraded condition is transient and short-lived as ingress starts and completes rolling out for the first time. However, given certain cloud provider or DNS misconfigurations, cloud provider rate limiting, etc., the ingress rollout may stall for much longer and indeterminate durations. In any case, the ingress operator is failing to accurately propagate a degraded or unavailable condition; fixing that would probably get humans to take restorative action sooner.

Importantly, for this particular condition reason, I see no evidence of a networking bug. However, the tendency of the k8s cloud provider code to be rate limited seems like a related but separate and interesting discovery. It's not yet clear what gains there are from optimizing cloud provider API calls in that and other components.

[1] https://ci-search-ci-search-next.svc.ci.openshift.org/?search=RouteHealthDegraded%3A+failed+to+GET+route.*no+such+host&maxAge=336h&context=2&type=all
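Since the condition is normally transient during initial rollout, one common operator pattern for separating the transient case from a stalled rollout (a sketch of the general technique, not necessarily what the eventual fix implements) is to report Degraded only after a sub-condition has been bad for longer than a grace period:

package main

import (
	"fmt"
	"time"
)

// TimedCondition is a simplified condition carrying its last transition time.
type TimedCondition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// degradedAfterGrace treats a False prerequisite as Degraded only once it
// has persisted past gracePeriod; shorter windows (a normal first rollout)
// are tolerated as transient.
func degradedAfterGrace(c TimedCondition, gracePeriod time.Duration, now time.Time) bool {
	return c.Status == "False" && now.Sub(c.LastTransitionTime) > gracePeriod
}

func main() {
	stalled := TimedCondition{
		Type:               "LoadBalancerReady",
		Status:             "False",
		LastTransitionTime: time.Now().Add(-30 * time.Minute),
	}
	// A rollout stuck for 30 minutes (e.g. cloud provider rate limiting)
	// exceeds a 5-minute grace period and should surface as Degraded.
	fmt.Println(degradedAfterGrace(stalled, 5*time.Minute, time.Now()))
}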
*** Bug 1765456 has been marked as a duplicate of this bug. ***
*** Bug 1765776 has been marked as a duplicate of this bug. ***
Setting sev to urgent since https://bugzilla.redhat.com/show_bug.cgi?id=1765776 was dup-ed to this bz and it causes install failures.
(In reply to Mike Fiedler from comment #8)
> Setting sev to urgent since
> https://bugzilla.redhat.com/show_bug.cgi?id=1765776 was dup-ed to this bz
> and it causes install failures.

I disagree with your assessment. The ingress operator's inaccurate status propagation here is not the cause of the failures, and the status fixes won't fix the install failures associated with this bug. Fixing this bug will only make it more obvious to observers what is wrong (i.e., cloud provider misconfiguration, rate limiting, etc.) so that unspecified follow-on actions can be taken. A dedicated investigator could already discover these details manually today, so I don't think it's accurate to say that resolving the underlying problems triggering the status reporting depends on this bug's resolution.
https://bugzilla.redhat.com/show_bug.cgi?id=1765282#c2 and https://bugzilla.redhat.com/show_bug.cgi?id=1765282#c4 are the two actionable items here. The ingress-operator should be fixed to correctly report its available/degraded status based on those comments, and then we can consider this bug fixed. After status reporting is fixed, we can open additional bugs as necessary if other critical reporting gaps are discovered.
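As a sketch of the second half of that fix (simplified types again, not the real config.openshift.io/v1 API), the roll-up would mirror the default ingresscontroller's conditions onto the clusteroperator/ingress resource, so a broken ingress is visible at the clusteroperator level instead of hidden:

package main

import "fmt"

// Condition is a simplified stand-in for a clusteroperator status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// rollUp mirrors the default ingresscontroller's Available and Degraded
// conditions onto clusteroperator/ingress, falling back to healthy defaults
// when the ingresscontroller doesn't report a condition.
func rollUp(ic []Condition) []Condition {
	out := []Condition{
		{Type: "Available", Status: "True", Reason: "AsExpected"},
		{Type: "Degraded", Status: "False", Reason: "AsExpected"},
	}
	for _, c := range ic {
		switch c.Type {
		case "Available":
			out[0] = c
		case "Degraded":
			out[1] = c
		}
	}
	return out
}

func main() {
	ic := []Condition{
		{Type: "Available", Status: "False", Reason: "IngressUnavailable"},
		{Type: "Degraded", Status: "True", Reason: "SyncLoadBalancerFailed"},
	}
	fmt.Printf("%+v\n", rollUp(ic))
}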
*** Bug 1752562 has been marked as a duplicate of this bug. ***
I think https://github.com/openshift/cluster-ingress-operator/pull/314 ends up being the fix for this issue.
Dan, I notice a similar issue at https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/161, but it says 'connection refused' instead of 'no such host'. Could you PTAL?
(In reply to Lokesh Mandvekar from comment #13)
> Dan, I notice a similar issue at
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/161
> but it says 'connection refused' instead of 'no such host', could you PTAL?

Bug 1765280 seems like a better match.
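The distinction matters for triage: "no such host" is a DNS resolution failure (this bug, where the wildcard record was never published), while "connection refused" means the host resolved but nothing accepted the TCP connection. A small Go sketch of how the two surface differently:

package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// classify distinguishes the two failure modes discussed in this thread.
func classify(err error) string {
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) && dnsErr.IsNotFound {
		return "no such host: DNS record missing (this bug)"
	}
	if errors.Is(err, syscall.ECONNREFUSED) {
		return "connection refused: host resolved, but no listener (a different bug)"
	}
	return "other network error"
}

func main() {
	// .invalid is reserved and never resolves, so this yields "no such host".
	_, err := net.Dial("tcp", "no-such-host.invalid:443")
	fmt.Println(classify(err))
}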
https://github.com/openshift/cluster-ingress-operator/pull/314 has already merged, so this should be MODIFIED.
See https://bugzilla.redhat.com/show_bug.cgi?id=1740374#c11; this issue has been verified with 4.3.0-0.nightly-2019-11-17-224250.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062