Bug 2037190
| Summary: | dns operator status flaps between True/False/False and True/True/(False\|True) after updating dnses.operator.openshift.io/default | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Hongan Li <hongli> |
| Component: | Networking | Assignee: | Suleyman Akbas <sakbas> |
| Networking sub component: | DNS | QA Contact: | Melvin Joseph <mjoseph> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | medium | CC: | hongli, mmasters, sakbas |
| Version: | 4.10 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.11.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:41:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Setting blocker- because this is not a regression or upgrade issue, but it looks like we are doing a lot of spurious status updates when the user uses the new logging API. Miheer, please take a look, and let me know if you need help with the status logic.

After some investigation, I believe this is a longstanding, general issue that doesn't apply only to the log-level API and predates 4.10. I can reproduce the issue on OpenShift 4.9 using the following steps:
1. Launch a 4.9 cluster. I used 4.9.0-0.nightly-2022-01-24-212243.
2. In one terminal, watch pods: oc -n openshift-dns get pods -w
3. In a second terminal, watch the clusteroperator: oc get co/dns -w
4. In a third terminal, watch dns pod logs: oc -n openshift-dns logs -c dns -f -l dns.operator.openshift.io/daemonset-dns=default --max-log-requests=6
5. Patch the dnses.operator.openshift.io/default in some way that causes a change to the Corefile config; for example: oc patch dns.operator/default --type=merge --patch='{"spec":{"servers":[{"name":"test","zones":["test"],"forwardPlugin":{"upstreams":["127.0.0.1:5353"]}}]}}'
After performing these steps, I see the status updates for the clusteroperator:
% oc get co/dns -w
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
dns 4.9.0-0.nightly-2022-01-24-212243 True False False 25m
dns 4.9.0-0.nightly-2022-01-24-212243 True True False 25m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.9.0-0.nightly-2022-01-24-212243 True False False 25m
dns 4.9.0-0.nightly-2022-01-24-212243 True True False 25m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.9.0-0.nightly-2022-01-24-212243 True True True 26m DNS default is degraded
dns 4.9.0-0.nightly-2022-01-24-212243 True True False 26m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.9.0-0.nightly-2022-01-24-212243 True False False 26m
dns 4.9.0-0.nightly-2022-01-24-212243 True True False 26m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.9.0-0.nightly-2022-01-24-212243 True True True 26m DNS default is degraded
dns 4.9.0-0.nightly-2022-01-24-212243 True True False 26m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.9.0-0.nightly-2022-01-24-212243 True True True 26m DNS default is degraded
dns 4.9.0-0.nightly-2022-01-24-212243 True True False 26m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.9.0-0.nightly-2022-01-24-212243 True False False 27m
I also see each dns pod briefly go unready, then ready again:
% oc -n openshift-dns get pods -w
NAME READY STATUS RESTARTS AGE
dns-default-242pw 2/2 Running 0 25m
dns-default-28grx 2/2 Running 0 18m
dns-default-44sv4 2/2 Running 0 25m
dns-default-9wnx9 2/2 Running 0 18m
dns-default-bn6kt 2/2 Running 0 19m
dns-default-jj2kf 2/2 Running 0 25m
node-resolver-7x9gq 1/1 Running 0 19m
node-resolver-dldk4 1/1 Running 0 19m
node-resolver-drvrj 1/1 Running 0 25m
node-resolver-jqsnv 1/1 Running 0 19m
node-resolver-scmnj 1/1 Running 0 25m
node-resolver-ztnnm 1/1 Running 0 25m
dns-default-44sv4 1/2 Running 0 25m
dns-default-44sv4 2/2 Running 0 26m
dns-default-28grx 1/2 Running 0 19m
dns-default-9wnx9 1/2 Running 0 19m
dns-default-28grx 2/2 Running 0 19m
dns-default-9wnx9 2/2 Running 0 19m
dns-default-242pw 1/2 Running 0 26m
dns-default-jj2kf 1/2 Running 0 27m
dns-default-242pw 2/2 Running 0 27m
dns-default-bn6kt 1/2 Running 0 21m
dns-default-jj2kf 2/2 Running 0 27m
dns-default-bn6kt 2/2 Running 0 21m
And finally, the logs reveal the issue; the following is a snippet from one dns pod's logs:
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] plugin/reload: Running configuration MD5 = 746216286a0ba8ae7a62a714fa095855
[INFO] Reloading complete
I see the above repeated for every dns pod. Every time the Corefile configmap is updated, each CoreDNS instance reloads the Corefile, and when CoreDNS reloads, it goes into "lameduck mode" for 20 seconds, during which time it reports not-ready to the kubelet.
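The 20-second lameduck period comes from the `health` plugin stanza in the Corefile that the operator renders. As an illustrative sketch only (the actual operator-generated Corefile contains additional plugins and server blocks), the relevant fragment looks roughly like this:

```
.:5353 {
    # ... other plugins elided ...
    health {
        lameduck 20s
    }
}
```

During lameduck, the health endpoint reports unhealthy while in-flight queries drain, which is why the kubelet marks the dns container not-ready for those 20 seconds after every reload.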
We can add hysteresis to the status reporting logic to prevent flapping. I believe this work is best left till after 4.10.0.
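To sketch what such hysteresis could look like: the operator would only commit a condition transition after the newly observed value has persisted for a grace period, so 20-second lameduck blips never reach the clusteroperator status. This is a minimal, hypothetical Python sketch (the operator itself is written in Go; the class name, API, and 30-second grace period are illustrative, not the actual fix):

```python
import time


class StatusDebouncer:
    """Report a condition transition only after the new value has
    persisted for `grace` seconds, suppressing short-lived flaps."""

    def __init__(self, initial, grace=30.0, clock=time.monotonic):
        self.reported = initial    # value currently reported upstream
        self.pending = initial     # most recently observed value
        self.pending_since = None  # when the pending value first appeared
        self.grace = grace
        self.clock = clock

    def observe(self, value):
        """Feed in the latest observed condition; return the reported one."""
        now = self.clock()
        if value != self.pending:
            # New candidate value: start (or restart) the grace timer.
            self.pending = value
            self.pending_since = now
        if (self.pending != self.reported
                and self.pending_since is not None
                and now - self.pending_since >= self.grace):
            # The candidate value has been stable long enough; commit it.
            self.reported = self.pending
        return self.reported
```

With this approach, a Degraded=True blip that clears within the grace period is never reported, while a genuinely stuck rollout still surfaces as Degraded once the condition has been stable for the full grace period.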
I think there is still some flapping, but it is reduced compared to the previous time.
melvinjoseph@mjoseph-mac Downloads % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest True False 40m Cluster version is 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest
melvinjoseph@mjoseph-mac Downloads % oc patch dns.operator/default --type=merge --patch='{"spec":{"servers":[{"name":"test","zones":["test"],"forwardPlugin":{"upstreams":["127.0.0.1:5353"]}}]}}'
dns.operator.openshift.io/default patched
melvinjoseph@mjoseph-mac Downloads % oc get co/dns -w
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
dns 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest True False False 57m
dns 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest True True False 57m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest True True True 57m DNS default is degraded
dns 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest True True False 57m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest True False False 57m
dns 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest True False True 58m DNS default is degraded
dns 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest True False False 58m
dns 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest True True False 58m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest True False False 58m
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get pods -w
NAME READY STATUS RESTARTS AGE
dns-default-44sv6 2/2 Running 0 57m
dns-default-9fffj 2/2 Running 0 50m
dns-default-b9mgt 2/2 Running 0 57m
dns-default-ln49d 2/2 Running 0 57m
dns-default-wb66v 2/2 Running 0 50m
dns-default-zlj5f 2/2 Running 0 51m
node-resolver-59dsx 1/1 Running 0 52m
node-resolver-6trnh 1/1 Running 0 52m
node-resolver-gdhc6 1/1 Running 0 57m
node-resolver-h6dt4 1/1 Running 0 57m
node-resolver-szstk 1/1 Running 0 57m
node-resolver-t27dz 1/1 Running 0 52m
dns-default-ln49d 1/2 Running 0 57m
dns-default-44sv6 1/2 Running 0 57m
dns-default-ln49d 2/2 Running 0 57m
dns-default-b9mgt 1/2 Running 0 57m
dns-default-44sv6 2/2 Running 0 57m
dns-default-b9mgt 2/2 Running 0 58m
dns-default-wb66v 1/2 Running 0 52m
dns-default-9fffj 1/2 Running 0 52m
dns-default-wb66v 2/2 Running 0 52m
dns-default-9fffj 2/2 Running 0 52m
dns-default-zlj5f 1/2 Running 0 52m
dns-default-zlj5f 2/2 Running 0 52m
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns logs -c dns -f -l dns.operator.openshift.io/daemonset-dns=default --max-log-requests=6
[INFO] plugin/reload: Running configuration MD5 = 477feca3db059e6b9063274509ca76ee
CoreDNS-1.8.7
linux/amd64, go1.17.5,
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete
melvinjoseph@mjoseph-mac Downloads % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-04-014713 True False 7h9m Cluster version is 4.11.0-0.nightly-2022-06-04-014713
melvinjoseph@mjoseph-mac Downloads %
melvinjoseph@mjoseph-mac Downloads % oc patch dns.operator/default --type=merge --patch='{"spec":{"servers":[{"name":"test","zones":["test"],"forwardPlugin":{"upstreams":["127.0.0.1:5353"]}}]}}'
dns.operator.openshift.io/default patched
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get pods -w
NAME READY STATUS RESTARTS AGE
dns-default-4j8gl 2/2 Running 0 6h37m
dns-default-cnv29 2/2 Running 0 6h37m
dns-default-pjsrd 2/2 Running 0 6h29m
dns-default-vbqn8 2/2 Running 0 6h30m
dns-default-vqkpp 2/2 Running 0 6h37m
node-resolver-6js2l 1/1 Running 0 6h37m
node-resolver-9p6vn 1/1 Running 0 6h30m
node-resolver-n9257 1/1 Running 0 6h30m
node-resolver-qp5wm 1/1 Running 0 6h37m
node-resolver-qr5lb 1/1 Running 0 6h37m
dns-default-4j8gl 1/2 Running 0 6h40m
dns-default-pjsrd 1/2 Running 0 6h33m
dns-default-4j8gl 2/2 Running 0 6h40m
dns-default-pjsrd 2/2 Running 0 6h33m
dns-default-vbqn8 1/2 Running 0 6h33m
dns-default-vbqn8 2/2 Running 0 6h34m
dns-default-vqkpp 1/2 Running 0 6h41m
dns-default-cnv29 1/2 Running 0 6h41m
dns-default-vqkpp 2/2 Running 0 6h41m
dns-default-cnv29 2/2 Running 0 6h41m
melvinjoseph@mjoseph-mac Downloads % oc get co/dns -w
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
dns 4.11.0-0.nightly-2022-06-04-014713 True False False 6h37m
dns 4.11.0-0.nightly-2022-06-04-014713 True True False 6h40m DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
dns 4.11.0-0.nightly-2022-06-04-014713 True True True 6h40m DNS default is degraded
dns 4.11.0-0.nightly-2022-06-04-014713 True True False 6h40m DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
dns 4.11.0-0.nightly-2022-06-04-014713 True False False 6h40m
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns logs -c dns -f -l dns.operator.openshift.io/daemonset-dns=default --max-log-requests=6
[INFO] 10.128.0.51:48836 - 5141 "AAAA IN registry.redhat.io.us-west-2.compute.internal. udp 63 false 512" NXDOMAIN qr,rd,ra 63 0.003339211s
[INFO] 10.128.0.51:45118 - 61895 "A IN registry.redhat.io.mjoseph-ip18.qe.devcluster.openshift.com. udp 77 false 512" NXDOMAIN qr,rd,ra 77 0.002902669s
[INFO] 10.128.0.51:34660 - 34948 "AAAA IN registry.redhat.io.mjoseph-ip18.qe.devcluster.openshift.com. udp 77 false 512" NXDOMAIN qr,rd,ra 77 0.003045752s
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] 10.129.0.1:4062 - 14450 "AAAA IN api.openshift.com.us-west-2.compute.internal. udp 62 false 512" NXDOMAIN qr,rd,ra 62 0.007191427s
[INFO] 10.129.0.1:60402 - 11971 "A IN api.openshift.com.us-west-2.compute.internal. udp 62 false 512" NXDOMAIN qr,rd,ra 62 0.007292815s
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
Based on the output and https://bugzilla.redhat.com/show_bug.cgi?id=2037190#c8, I am marking this bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069 |
Description of problem:
dns operator status flaps between True/False/False and True/True/(False|True) after updating dnses.operator.openshift.io/default

OpenShift release version: 4.10.0-0.nightly-2021-12-23-153012
Cluster Platform: any
How reproducible: 100%

Steps to Reproduce (in detail):
1. Update dnses.operator.openshift.io/default, e.g.:
$ oc patch dnses.operator.openshift.io/default -p '{"spec":{"logLevel":"Debug"}}' --type=merge
2. Open another terminal and check/watch the co/dns status:
$ oc get co/dns -w

Actual results:
$ oc get co/dns -w
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
dns 4.10.0-0.nightly-2021-12-23-153012 True False False 6h7m
dns 4.10.0-0.nightly-2021-12-23-153012 True True False 6h8m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.10.0-0.nightly-2021-12-23-153012 True False False 6h8m
dns 4.10.0-0.nightly-2021-12-23-153012 True True False 6h8m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.10.0-0.nightly-2021-12-23-153012 True True True 6h8m DNS default is degraded
dns 4.10.0-0.nightly-2021-12-23-153012 True True False 6h8m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.10.0-0.nightly-2021-12-23-153012 True True True 6h8m DNS default is degraded
dns 4.10.0-0.nightly-2021-12-23-153012 True True False 6h8m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.10.0-0.nightly-2021-12-23-153012 True False False 6h9m
dns 4.10.0-0.nightly-2021-12-23-153012 True True False 6h9m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.10.0-0.nightly-2021-12-23-153012 True False False 6h9m
dns 4.10.0-0.nightly-2021-12-23-153012 True True False 6h9m DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns 4.10.0-0.nightly-2021-12-23-153012 True False False 6h9m

During the rolling update, co/dns status flaps between True/False/False and True/True/(False|True).

Expected results:
During the DNS rolling update, co/dns status should remain True/True/False until the update is finished.

Impact of the problem:
co/dns doesn't show the real reconciling status.