Bug 2034795
| Summary: | cluster operator console/authentication shows degraded for about 6 minutes after updating ingresscontroller LB scope | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Hongan Li <hongli> |
| Component: | Networking | Assignee: | Miciah Dashiel Butler Masters <mmasters> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aos-bugs, mmasters |
| Version: | 4.10 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-09 01:10:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Setting blocker- as this is not a regression or upgrade issue. We do warn the user that changing the scope on AWS can cause disruption. However, there may be some issue that unnecessarily extends the duration of the disruption. I will investigate whether the ingress operator is taking longer than necessary to update the DNS record, whether CoreDNS or the authentication and console operators are caching an NXDOMAIN response too long, or whether I can find some other issue that we can improve. I believe most of the ~6-minute disruption is caused by Route 53 and AWS's name servers.
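On the negative-caching suspect specifically: if a resolver follows RFC 2308, a NODATA or NXDOMAIN response is cached for the minimum of the TTL on the returned SOA record and the SOA MINIMUM field. A quick sketch (assuming a compliant resolver) using the values from the AWS zone's SOA as seen in the dig output later in this comment:

```shell
# RFC 2308 negative-caching sketch (assumption: the resolver is compliant).
# Values come from the SOA in the AUTHORITY section of the dig output later
# in this comment: "... 900 IN SOA ... 1 7200 900 1209600 86400".
soa_record_ttl=900   # TTL on the SOA record accompanying the empty answer
soa_minimum=86400    # last field of the SOA RDATA
neg_ttl=$(( soa_record_ttl < soa_minimum ? soa_record_ttl : soa_minimum ))
echo "negative-cache TTL: ${neg_ttl}s"   # 900s, i.e. up to 15 minutes
```

So a strictly compliant cache could in principle hold the empty answer for up to 15 minutes, which is in the same ballpark as (and longer than) the observed disruption.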
I tried changing the "default" IngressController's scope while monitoring DNS and the "console" clusteroperator. To monitor DNS, I checked which node the "console-operator" pod was running on, found the "dns-default" pod on the same node, and started a loop doing DNS lookups for the "console" route's host name inside that pod (running the lookups inside the "dns-default" pod means they go to the AWS name server):
% oc -n openshift-console-operator get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
console-operator-6945b699b-bmjpm 1/1 Running 0 132m 10.128.0.37 ip-10-0-185-155.us-west-1.compute.internal <none> <none>
% oc -n openshift-dns get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
[...]
dns-default-29x7p 2/2 Running 0 142m 10.128.0.12 ip-10-0-185-155.us-west-1.compute.internal <none> <none>
% oc -n openshift-console get routes/console
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
console console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org console https reencrypt/Redirect None
% oc -n openshift-dns rsh -c dns dns-default-29x7p bash -c 'while :; do sleep 1; date; dig console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org.; done'
To monitor the console clusteroperator, I just watch its status conditions:
% oc get clusteroperators/console -o yaml -w
Initially, the clusteroperator reports Available=True:
- lastTransitionTime: "2022-12-13T01:41:24Z"
message: All is well
reason: AsExpected
status: "True"
type: Available
Then I toggle the IngressController's scope:
% oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"scope":"Internal"}}}}'
ingresscontroller.operator.openshift.io/default patched
% oc -n openshift-ingress delete svc/router-default
service "router-default" deleted
The clusteroperator reports a problem a short time later; initially, there is disruption when the old ELB is deleted (this disruption is expected):
- lastTransitionTime: "2022-12-13T01:44:25Z"
message: 'RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org):
Get "https://console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org":
context deadline exceeded (Client.Timeout exceeded while awaiting headers)'
reason: RouteHealth_FailedGet
status: "False"
type: Available
A short time later, the clusteroperator's status is updated to reflect that DNS lookups started failing ("lastTransitionTime" remains the same because "status" is already "False"):
- lastTransitionTime: "2022-12-13T01:44:25Z"
message: 'RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org):
Get "https://console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org":
dial tcp: lookup console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org
on 172.30.0.10:53: no such host'
reason: RouteHealth_FailedGet
status: "False"
type: Available
Around this time, the DNS lookups in the dns-default pod stop returning an answer with the old ELB's address and start returning no answer:
Tue Dec 13 01:44:50 UTC 2022
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4.1 <<>> console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14956
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. IN A
;; ANSWER SECTION:
console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. 1 IN A 52.53.143.139
;; Query time: 0 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Tue Dec 13 01:44:50 UTC 2022
;; MSG SIZE rcvd: 118
Tue Dec 13 01:44:51 UTC 2022
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4.1 <<>> console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3751
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. IN A
;; AUTHORITY SECTION:
ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. 900 IN SOA ns-1536.awsdns-00.co.uk. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
;; Query time: 1 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Tue Dec 13 01:44:51 UTC 2022
;; MSG SIZE rcvd: 189
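As an aside, this flip is easy to detect mechanically: the ANSWER count in the dig header drops from 1 to 0. A hypothetical helper (not part of the original reproduction) that could be layered onto the monitoring loop to log only the transitions:

```shell
# Hypothetical helper: pull the answer count out of a dig response so a
# monitoring loop can log only answer -> no-answer transitions.
answer_count() {
  grep -o 'ANSWER: [0-9]*' | head -n1 | awk '{print $2}'
}

# Abridged header line from the NODATA response above:
sample=';; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1'
printf '%s\n' "$sample" | answer_count   # prints 0
```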
After about 5 minutes, the name server goes from returning no answer to returning an answer with the new ELB's address:
Tue Dec 13 01:49:50 UTC 2022
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4.1 <<>> console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22164
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. IN A
;; AUTHORITY SECTION:
ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. 900 IN SOA ns-1536.awsdns-00.co.uk. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
;; Query time: 0 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Tue Dec 13 01:49:50 UTC 2022
;; MSG SIZE rcvd: 231
Tue Dec 13 01:49:51 UTC 2022
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4.1 <<>> console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13571
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. IN A
;; ANSWER SECTION:
console-openshift-console.apps.ci-ln-qc0dfgb-76ef8.aws-2.ci.openshift.org. 60 IN A 10.0.175.28
;; Query time: 1 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Tue Dec 13 01:49:51 UTC 2022
;; MSG SIZE rcvd: 118
Soon after, the clusteroperator reports Available=True again:
- lastTransitionTime: "2022-12-13T01:50:20Z"
message: All is well
reason: AsExpected
status: "True"
type: Available
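For reference, the total unavailability window follows from the two lastTransitionTime values above (GNU date assumed):

```shell
# Disruption window from the clusteroperator's transition timestamps:
# Available went False at 01:44:25 and back to True at 01:50:20.
start=$(date -u -d '2022-12-13T01:44:25Z' +%s)
end=$(date -u -d '2022-12-13T01:50:20Z' +%s)
echo "disruption: $(( end - start ))s"   # 355s, roughly the ~6 minutes reported
```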
So it seems that the problem lies in the name server, not in OpenShift.
The ingress operator upserts the DNS record in Route 53 when the new ELB is provisioned. Evidently AWS not only has a long propagation delay, but it also stops returning the old record during that delay, which is surprising to me.
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9057
Description of problem:
The cluster operators console and authentication show degraded for about 6 minutes after updating the ingresscontroller LB scope.

OpenShift release version: 4.10.0-0.nightly-2021-12-21-130047

Cluster Platform: AWS

How reproducible: 100%

Steps to Reproduce (in detail):
1. Launch a cluster on AWS.
2. Change the LB scope:
$ oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"Internal"}}}}'
3. Check the message from "oc get co/ingress", follow the instructions, and delete the LB service:
$ oc -n openshift-ingress delete svc/router-default
service "router-default" deleted
4. Check the status of the cluster operators:
$ oc get co

Actual results:
While the LB is re-provisioned and the DNS records are refreshed, co/console and co/authentication show degraded for about 6 minutes:
$ oc get co
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.10.0-0.nightly-2021-12-21-130047   False       False         True       5m20s   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.hongli-a22.qe.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.hongli-a22.qe.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
<---snip--->
console          4.10.0-0.nightly-2021-12-21-130047   False       False         False      5m24s   RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com": dial tcp: lookup console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com on 172.30.0.10:53: no such host

Trying again a while later, authentication is available, but console still shows degraded (6m6s):
$ oc get co
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.10.0-0.nightly-2021-12-21-130047   True        False         False      37s
<---snip--->
console          4.10.0-0.nightly-2021-12-21-130047   False       False         False      6m6s    RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com": dial tcp: lookup console-openshift-console.apps.hongli-a22.qe.devcluster.openshift.com on 172.30.0.10:53: no such host

Expected results:
Checking the DNS record from outside the cluster with nslookup shows that it is refreshed within about 2 minutes, so co/console and co/authentication should not stay Degraded for so long.

Impact of the problem:
Unfriendly user experience.
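The outside-the-cluster nslookup check mentioned above can be timed with a small wrapper (a hypothetical bash helper, not something from the original report):

```shell
# Hypothetical bash helper: count the seconds until a command starts
# succeeding, e.g. until the route's record resolves again after the flip.
seconds_until_ok() {
  local t0=$SECONDS
  until "$@" >/dev/null 2>&1; do sleep 1; done
  echo $(( SECONDS - t0 ))
}

# Example (would block until the record propagates; the host is a placeholder):
# seconds_until_ok nslookup console-openshift-console.apps.<cluster-domain>
```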