Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2034795

Summary: cluster operator console/authentication shows degraded for about 6 minutes after updating ingresscontroller LB scope
Product: OpenShift Container Platform
Component: Networking
Sub component: router
Version: 4.10
Reporter: Hongan Li <hongli>
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Hongan Li <hongli>
CC: aos-bugs, mmasters
Status: CLOSED DEFERRED
Severity: high
Priority: high
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2023-03-09 01:10:20 UTC

Description Hongan Li 2021-12-22 07:43:04 UTC
Description of problem:
The cluster operators console and authentication show Degraded for about 6 minutes after updating the ingresscontroller LB scope.

OpenShift release version:
4.10.0-0.nightly-2021-12-21-130047

Cluster Platform:
AWS

How reproducible:
100%

Steps to Reproduce (in detail):
1. launch a cluster on AWS
2. change the LB scope:
$ oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"Internal"}}}}'

3. Check the message from "oc get co/ingress", then follow its instructions and delete the LB service:
$ oc -n openshift-ingress delete svc/router-default
service "router-default" deleted

4. check the status of cluster operators
$ oc get co
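To time how long the operators stay Degraded after step 3, a small watch loop helps. This is a hypothetical helper (the `degraded_ops` and `watch_degraded` names are mine, not part of the report) and assumes `oc` is logged in to the cluster:

```shell
#!/usr/bin/env bash
# Print the names of clusteroperators whose DEGRADED column is "True".
# Expects `oc get co --no-headers` output on stdin:
#   NAME  VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE  MESSAGE
degraded_ops() {
    awk '$5 == "True" { print $1 }'
}

# Poll `oc get co` every 10 seconds and log which operators are
# degraded, with a UTC timestamp, until none are left degraded.
watch_degraded() {
    local bad
    while :; do
        bad=$(oc get co --no-headers | degraded_ops)
        printf '%s degraded: %s\n' "$(date -u +%T)" "${bad:-none}"
        [ -z "$bad" ] && break
        sleep 10
    done
}

# Usage (against a live cluster), right after deleting the service:
#   watch_degraded
```

The first and last timestamps in the log bracket the disruption window described below.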


Actual results:
While the LB is re-provisioned and the DNS records are refreshed, co/console and co/authentication show Degraded for about 6 minutes. See:

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-2021-12-21-130047   False       False         True       5m20s   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.hongli-a22.qe.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.hongli-a22.qe.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
<---snip--->
console                                    4.10.0-0.nightly-2021-12-21-130047   False       False         False      5m24s   RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com": dial tcp: lookup console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com on 172.30.0.10:53: no such host


### tried again: after a while the authentication is available but console still shows degraded (6m6s)
$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-2021-12-21-130047   True        False         False      37s     
<---snip--->
console                                    4.10.0-0.nightly-2021-12-21-130047   False       False         False      6m6s    RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com": dial tcp: lookup console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com on 172.30.0.10:53: no such host


Expected results:
Checking the DNS record from outside the cluster with nslookup shows it is refreshed within about 2 minutes, so co/console and co/authentication should not stay Degraded for such a long time.
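The ~2-minute external refresh can be measured with a polling loop. A rough sketch (the `wait_for_dns` helper and the host name are illustrative, not from the report), run from a machine outside the cluster right after deleting svc/router-default:

```shell
#!/usr/bin/env bash
# Poll an A record with dig until it resolves, then report how many
# seconds it took. Returns non-zero if the deadline passes first.
wait_for_dns() {
    local host=$1 timeout=${2:-600}
    local start=$SECONDS deadline=$((SECONDS + timeout))
    while [ "$SECONDS" -lt "$deadline" ]; do
        # A non-empty +short answer means the record resolves again.
        if [ -n "$(dig +short "$host" A 2>/dev/null)" ]; then
            echo $((SECONDS - start))
            return 0
        fi
        sleep 5
    done
    return 1
}

# Usage (example host name):
#   wait_for_dns console-openshift-console.apps.example.com
```

Comparing this external number against the operators' Degraded duration separates DNS propagation delay from anything OpenShift-side.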

Impact of the problem:
unfriendly user experience

Additional info:




Comment 1 Miciah Dashiel Butler Masters 2021-12-23 18:36:28 UTC
Setting blocker- as this is not a regression or upgrade issue.  

We do warn the user that changing the scope on AWS can cause disruption.  However, there may be some issue that unnecessarily extends the duration of the disruption.  I will investigate whether the ingress operator is taking longer than necessary to update the DNS record, whether CoreDNS or the authentication and console operators are caching an NXDOMAIN response too long, or whether I can find some other issue that we can improve.

Comment 2 Miciah Dashiel Butler Masters 2022-12-13 02:08:53 UTC
I believe most of the ~6-minute disruption is caused by Route 53 and AWS's name servers.

I tried changing the "default" IngressController's scope while monitoring DNS and the "console" clusteroperator. To monitor DNS, I checked which node the "console-operator" pod was running on, found the "dns-default" pod on the same node, and started a loop in that pod doing DNS lookups for the "console" route's host name (running the lookups inside the "dns-default" pod means they use the AWS name server):

    % oc -n openshift-console-operator get pods -o wide
    NAME                               READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
    console-operator-6945b699b-bmjpm   1/1     Running   0          132m   10.128.0.37   ip-10-0-185-155.us-west-1.compute.internal   <none>           <none>
    % oc -n openshift-dns get pods -o wide
    NAME                  READY   STATUS    RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
    [...]
    dns-default-29x7p     2/2     Running   0          142m   10.128.0.12    ip-10-0-185-155.us-west-1.compute.internal   <none>           <none>
    % oc -n openshift-console get routes/console
    NAME      HOST/PORT                                                                   PATH   SERVICES   PORT    TERMINATION          WILDCARD
    console   console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org          console    https   reencrypt/Redirect   None
    % oc -n openshift-dns rsh -c dns dns-default-29x7p bash -c 'while :; do sleep 1; date; dig console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org.; done'

To monitor the console clusteroperator, I just watched its status conditions:

    % oc get clusteroperators/console -o yaml -w

Initially, the clusteroperator reports Available=True:

      - lastTransitionTime: "2022-12-13T01:41:24Z"
        message: All is well
        reason: AsExpected
        status: "True"
        type: Available

Then I toggle the IngressController's scope:

    % oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"scope":"Internal"}}}}'
    ingresscontroller.operator.openshift.io/default patched
    % oc -n openshift-ingress delete svc/router-default
    service "router-default" deleted

The clusteroperator reports a problem a short time later; initially, there is disruption when the old ELB is deleted (this disruption is expected):

      - lastTransitionTime: "2022-12-13T01:44:25Z"
        message: 'RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org):
          Get "https://console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org":
          context deadline exceeded (Client.Timeout exceeded while awaiting headers)'
        reason: RouteHealth_FailedGet
        status: "False"
        type: Available

A short time later, the clusteroperator's status is updated to reflect that DNS lookups started failing ("lastTransitionTime" remains the same because "status" is already "False"):

      - lastTransitionTime: "2022-12-13T01:44:25Z"
        message: 'RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org):
          Get "https://console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org":
          dial tcp: lookup console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org
          on 172.30.0.10:53: no such host'
        reason: RouteHealth_FailedGet
        status: "False"
        type: Available

Around this time, the DNS lookups in the dns-default pod stop returning an answer with the old ELB's address and start returning no answer:

    Tue Dec 13 01:44:50 UTC 2022
    
    ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4.1 <<>> console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org.
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14956
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. IN A
    
    ;; ANSWER SECTION:
    console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. 1 IN A 52.53.143.139
    
    ;; Query time: 0 msec
    ;; SERVER: 10.0.0.2#53(10.0.0.2)
    ;; WHEN: Tue Dec 13 01:44:50 UTC 2022
    ;; MSG SIZE  rcvd: 118
    
    Tue Dec 13 01:44:51 UTC 2022
    
    ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4.1 <<>> console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org.
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3751
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. IN A
    
    ;; AUTHORITY SECTION:
    ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. 900 IN SOA ns-1536.awsdns-00.co.uk. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
    
    ;; Query time: 1 msec
    ;; SERVER: 10.0.0.2#53(10.0.0.2)
    ;; WHEN: Tue Dec 13 01:44:51 UTC 2022
    ;; MSG SIZE  rcvd: 189

After about 5 minutes, the name server goes from returning no answer to returning an answer with the new ELB's address:

    Tue Dec 13 01:49:50 UTC 2022
    
    ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4.1 <<>> console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org.
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22164
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. IN A
    
    ;; AUTHORITY SECTION:
    ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. 900 IN SOA ns-1536.awsdns-00.co.uk. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
    
    ;; Query time: 0 msec
    ;; SERVER: 10.0.0.2#53(10.0.0.2)
    ;; WHEN: Tue Dec 13 01:49:50 UTC 2022
    ;; MSG SIZE  rcvd: 231
    
    Tue Dec 13 01:49:51 UTC 2022
    
    ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4.1 <<>> console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org.
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13571
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. IN A
    
    ;; ANSWER SECTION:
    console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. 60 IN A 10.0.175.28
    
    ;; Query time: 1 msec
    ;; SERVER: 10.0.0.2#53(10.0.0.2)
    ;; WHEN: Tue Dec 13 01:49:51 UTC 2022
    ;; MSG SIZE  rcvd: 118

Soon after, the clusteroperator reports Available=True again:

      - lastTransitionTime: "2022-12-13T01:50:20Z"
        message: All is well
        reason: AsExpected
        status: "True"
        type: Available

So it seems that the problem lies in the name server, not in OpenShift.  

The ingress operator upserts the DNS record in Route 53 when the new ELB is provisioned. Evidently AWS not only has a long propagation delay, but it also stops returning the old record during that delay, which is surprising to me.
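To confirm when the upsert actually lands in Route 53, the hosted zone can be queried directly with the AWS CLI. A sketch, assuming credentials for the cluster's AWS account; the `show_app_record` name and the zone ID/domain values are placeholders:

```shell
#!/usr/bin/env bash
# List the resource record sets in a Route 53 hosted zone whose name
# contains the given app domain, to see what Route 53 currently holds.
show_app_record() {
    local zone_id=$1 domain=$2
    aws route53 list-resource-record-sets \
        --hosted-zone-id "$zone_id" \
        --query "ResourceRecordSets[?contains(Name, '$domain')]" \
        --output json
}

# Usage (placeholder hosted zone ID):
#   show_app_record Z0EXAMPLE 'apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org'
```

If the record set already points at the new ELB while the name servers still return no answer, that would pin the delay on propagation rather than on the operator's upsert.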

Comment 3 Shiftzilla 2023-03-09 01:10:20 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9057