Bug 2037190 - dns operator status flaps between True/False/False and True/True/(False|True) after updating dnses.operator.openshift.io/default
Summary: dns operator status flaps between True/False/False and True/True/(False|True)...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Suleyman Akbas
QA Contact: Melvin Joseph
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-05 08:01 UTC by Hongan Li
Modified: 2022-08-10 10:41 UTC
CC List: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:41:16 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-dns-operator pull 318 0 None open Bug 2037190: Skip frequent updates to progressing and degraded conditions to prevent status flaps 2022-04-05 16:02:52 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:41:49 UTC

Description Hongan Li 2022-01-05 08:01:24 UTC
Description of problem:
dns operator status flaps between True/False/False and True/True/(False|True) after updating dnses.operator.openshift.io/default

OpenShift release version:
4.10.0-0.nightly-2021-12-23-153012

Cluster Platform:
any

How reproducible:
100%

Steps to Reproduce (in detail):
1. Update dnses.operator.openshift.io/default, e.g.:
$ oc patch dnses.operator.openshift.io/default -p '{"spec":{"logLevel":"Debug"}}' --type=merge

2. Open another terminal and check/watch the co/dns status
$ oc get co/dns -w


Actual results:
$ oc get co/dns -w
NAME   VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
dns    4.10.0-0.nightly-2021-12-23-153012   True        False         False      6h7m    
dns    4.10.0-0.nightly-2021-12-23-153012   True        True          False      6h8m    DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns    4.10.0-0.nightly-2021-12-23-153012   True        False         False      6h8m    
dns    4.10.0-0.nightly-2021-12-23-153012   True        True          False      6h8m    DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns    4.10.0-0.nightly-2021-12-23-153012   True        True          True       6h8m    DNS default is degraded
dns    4.10.0-0.nightly-2021-12-23-153012   True        True          False      6h8m    DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns    4.10.0-0.nightly-2021-12-23-153012   True        True          True       6h8m    DNS default is degraded
dns    4.10.0-0.nightly-2021-12-23-153012   True        True          False      6h8m    DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns    4.10.0-0.nightly-2021-12-23-153012   True        False         False      6h9m    
dns    4.10.0-0.nightly-2021-12-23-153012   True        True          False      6h9m    DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns    4.10.0-0.nightly-2021-12-23-153012   True        False         False      6h9m    
dns    4.10.0-0.nightly-2021-12-23-153012   True        True          False      6h9m    DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns    4.10.0-0.nightly-2021-12-23-153012   True        False         False      6h9m    

During the rolling update, co/dns status flaps between True/False/False and True/True/(False|True).


Expected results:
During the DNS rolling update, co/dns status should always be True/True/False until the update is finished.

Impact of the problem:
co/dns doesn't reflect the actual reconciliation status.

Additional info:




Comment 1 Miciah Dashiel Butler Masters 2022-01-06 17:06:24 UTC
Setting blocker- because this is not a regression or upgrade issue, but it looks like we are doing a lot of spurious status updates when the user uses the new logging API.  

Miheer, please take a look, and let me know if you need help with the status logic.

Comment 2 Miciah Dashiel Butler Masters 2022-01-25 02:05:33 UTC
After some investigation, I believe this is a longstanding, general issue that doesn't apply only to the log-level API and predates 4.10.  I can reproduce the issue on OpenShift 4.9 using the following steps:

1. Launch a 4.9 cluster.  I used 4.9.0-0.nightly-2022-01-24-212243.

2. In one terminal, watch pods: oc -n openshift-dns get pods -w

3. In a second terminal, watch the clusteroperator: oc get co/dns -w

4. In a third terminal, watch dns pod logs: oc -n openshift-dns logs -c dns -f -l dns.operator.openshift.io/daemonset-dns=default --max-log-requests=6

5. Patch the dnses.operator.openshift.io/default in some way that causes a change to the Corefile config; for example: oc patch dns.operator/default --type=merge --patch='{"spec":{"servers":[{"name":"test","zones":["test"],"forwardPlugin":{"upstreams":["127.0.0.1:5353"]}}]}}'

After performing these steps, I see the status updates for the clusteroperator:

    % oc get co/dns -w
    NAME   VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    dns    4.9.0-0.nightly-2022-01-24-212243   True        False         False      25m
    dns    4.9.0-0.nightly-2022-01-24-212243   True        True          False      25m     DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
    dns    4.9.0-0.nightly-2022-01-24-212243   True        False         False      25m
    dns    4.9.0-0.nightly-2022-01-24-212243   True        True          False      25m     DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
    dns    4.9.0-0.nightly-2022-01-24-212243   True        True          True       26m     DNS default is degraded
    dns    4.9.0-0.nightly-2022-01-24-212243   True        True          False      26m     DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
    dns    4.9.0-0.nightly-2022-01-24-212243   True        False         False      26m
    dns    4.9.0-0.nightly-2022-01-24-212243   True        True          False      26m     DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
    dns    4.9.0-0.nightly-2022-01-24-212243   True        True          True       26m     DNS default is degraded
    dns    4.9.0-0.nightly-2022-01-24-212243   True        True          False      26m     DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
    dns    4.9.0-0.nightly-2022-01-24-212243   True        True          True       26m     DNS default is degraded
    dns    4.9.0-0.nightly-2022-01-24-212243   True        True          False      26m     DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
    dns    4.9.0-0.nightly-2022-01-24-212243   True        False         False      27m

I also see each dns pod briefly go unready, then ready again:

    % oc -n openshift-dns get pods -w
    NAME                  READY   STATUS    RESTARTS   AGE
    dns-default-242pw     2/2     Running   0          25m
    dns-default-28grx     2/2     Running   0          18m
    dns-default-44sv4     2/2     Running   0          25m
    dns-default-9wnx9     2/2     Running   0          18m
    dns-default-bn6kt     2/2     Running   0          19m
    dns-default-jj2kf     2/2     Running   0          25m
    node-resolver-7x9gq   1/1     Running   0          19m
    node-resolver-dldk4   1/1     Running   0          19m
    node-resolver-drvrj   1/1     Running   0          25m
    node-resolver-jqsnv   1/1     Running   0          19m
    node-resolver-scmnj   1/1     Running   0          25m
    node-resolver-ztnnm   1/1     Running   0          25m
    dns-default-44sv4     1/2     Running   0          25m
    dns-default-44sv4     2/2     Running   0          26m
    dns-default-28grx     1/2     Running   0          19m
    dns-default-9wnx9     1/2     Running   0          19m
    dns-default-28grx     2/2     Running   0          19m
    dns-default-9wnx9     2/2     Running   0          19m
    dns-default-242pw     1/2     Running   0          26m
    dns-default-jj2kf     1/2     Running   0          27m
    dns-default-242pw     2/2     Running   0          27m
    dns-default-bn6kt     1/2     Running   0          21m
    dns-default-jj2kf     2/2     Running   0          27m
    dns-default-bn6kt     2/2     Running   0          21m

And finally, the logs reveal the issue; following is a snippet from one dns pod's logs:

    [INFO] Reloading
    [INFO] plugin/health: Going into lameduck mode for 20s
    [INFO] plugin/reload: Running configuration MD5 = 746216286a0ba8ae7a62a714fa095855
    [INFO] Reloading complete

I see the above repeated for every dns pod.  Every time the Corefile configmap is updated, each CoreDNS instance reloads the Corefile, and when CoreDNS reloads, it goes into "lameduck mode" for 20 seconds, during which time it reports not-ready to the kubelet.  
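
For reference, the 20-second lameduck window comes from the health plugin in the operator-generated Corefile. A minimal sketch (indicative only; the actual openshift-dns Corefile contains more plugins and may differ by version):

    .:5353 {
        health {
            # lameduck: keep failing health checks for 20s while the old
            # server instance drains during a reload or shutdown
            lameduck 20s
        }
        reload
        forward . /etc/resolv.conf
        cache 900
    }

Because reload gracefully shuts down the old server instance, every Corefile change makes each pod report not-ready for the full lameduck duration.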

We can add hysteresis to the status reporting logic to prevent flapping.  I believe this work is best left till after 4.10.0.

Comment 5 Melvin Joseph 2022-04-13 06:09:34 UTC
There is still some flapping, but it is reduced compared to before.

melvinjoseph@mjoseph-mac Downloads % oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest   True        False         40m     Cluster version is 4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest
melvinjoseph@mjoseph-mac Downloads % oc patch dns.operator/default --type=merge --patch='{"spec":{"servers":[{"name":"test","zones":["test"],"forwardPlugin":{"upstreams":["127.0.0.1:5353"]}}]}}'
dns.operator.openshift.io/default patched

melvinjoseph@mjoseph-mac Downloads % oc get co/dns -w
NAME   VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
dns    4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest   True        False         False      57m     
dns    4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest   True        True          False      57m     DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns    4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest   True        True          True       57m     DNS default is degraded
dns    4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest   True        True          False      57m     DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns    4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest   True        False         False      57m     
dns    4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest   True        False         True       58m     DNS default is degraded
dns    4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest   True        False         False      58m     
dns    4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest   True        True          False      58m     DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
dns    4.11.0-0.ci.test-2022-04-13-042208-ci-ln-i9fzbyk-latest   True        False         False      58m     


melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get pods -w                                                                                                                              
NAME                  READY   STATUS    RESTARTS   AGE
dns-default-44sv6     2/2     Running   0          57m
dns-default-9fffj     2/2     Running   0          50m
dns-default-b9mgt     2/2     Running   0          57m
dns-default-ln49d     2/2     Running   0          57m
dns-default-wb66v     2/2     Running   0          50m
dns-default-zlj5f     2/2     Running   0          51m
node-resolver-59dsx   1/1     Running   0          52m
node-resolver-6trnh   1/1     Running   0          52m
node-resolver-gdhc6   1/1     Running   0          57m
node-resolver-h6dt4   1/1     Running   0          57m
node-resolver-szstk   1/1     Running   0          57m
node-resolver-t27dz   1/1     Running   0          52m
dns-default-ln49d     1/2     Running   0          57m
dns-default-44sv6     1/2     Running   0          57m
dns-default-ln49d     2/2     Running   0          57m
dns-default-b9mgt     1/2     Running   0          57m
dns-default-44sv6     2/2     Running   0          57m
dns-default-b9mgt     2/2     Running   0          58m
dns-default-wb66v     1/2     Running   0          52m
dns-default-9fffj     1/2     Running   0          52m
dns-default-wb66v     2/2     Running   0          52m
dns-default-9fffj     2/2     Running   0          52m
dns-default-zlj5f     1/2     Running   0          52m
dns-default-zlj5f     2/2     Running   0          52m

melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns logs -c dns -f -l dns.operator.openshift.io/daemonset-dns=default --max-log-requests=6
[INFO] plugin/reload: Running configuration MD5 = 477feca3db059e6b9063274509ca76ee
CoreDNS-1.8.7
linux/amd64, go1.17.5, 
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] plugin/reload: Running configuration MD5 = ca3dd6b3842292b76f3c39bece74187e
[INFO] Reloading complete

Comment 10 Melvin Joseph 2022-06-06 09:47:56 UTC
melvinjoseph@mjoseph-mac Downloads % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-04-014713   True        False         7h9m    Cluster version is 4.11.0-0.nightly-2022-06-04-014713
melvinjoseph@mjoseph-mac Downloads % 

melvinjoseph@mjoseph-mac Downloads % oc patch dns.operator/default --type=merge --patch='{"spec":{"servers":[{"name":"test","zones":["test"],"forwardPlugin":{"upstreams":["127.0.0.1:5353"]}}]}}'
dns.operator.openshift.io/default patched

melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get pods -w

NAME                  READY   STATUS    RESTARTS   AGE
dns-default-4j8gl     2/2     Running   0          6h37m
dns-default-cnv29     2/2     Running   0          6h37m
dns-default-pjsrd     2/2     Running   0          6h29m
dns-default-vbqn8     2/2     Running   0          6h30m
dns-default-vqkpp     2/2     Running   0          6h37m
node-resolver-6js2l   1/1     Running   0          6h37m
node-resolver-9p6vn   1/1     Running   0          6h30m
node-resolver-n9257   1/1     Running   0          6h30m
node-resolver-qp5wm   1/1     Running   0          6h37m
node-resolver-qr5lb   1/1     Running   0          6h37m
dns-default-4j8gl     1/2     Running   0          6h40m
dns-default-pjsrd     1/2     Running   0          6h33m
dns-default-4j8gl     2/2     Running   0          6h40m
dns-default-pjsrd     2/2     Running   0          6h33m
dns-default-vbqn8     1/2     Running   0          6h33m
dns-default-vbqn8     2/2     Running   0          6h34m
dns-default-vqkpp     1/2     Running   0          6h41m
dns-default-cnv29     1/2     Running   0          6h41m
dns-default-vqkpp     2/2     Running   0          6h41m
dns-default-cnv29     2/2     Running   0          6h41m

melvinjoseph@mjoseph-mac Downloads % oc get co/dns -w
NAME   VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
dns    4.11.0-0.nightly-2022-06-04-014713   True        False         False      6h37m   
dns    4.11.0-0.nightly-2022-06-04-014713   True        True          False      6h40m   DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
dns    4.11.0-0.nightly-2022-06-04-014713   True        True          True       6h40m   DNS default is degraded
dns    4.11.0-0.nightly-2022-06-04-014713   True        True          False      6h40m   DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
dns    4.11.0-0.nightly-2022-06-04-014713   True        False         False      6h40m   

melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns logs -c dns -f -l dns.operator.openshift.io/daemonset-dns=default --max-log-requests=6
[INFO] 10.128.0.51:48836 - 5141 "AAAA IN registry.redhat.io.us-west-2.compute.internal. udp 63 false 512" NXDOMAIN qr,rd,ra 63 0.003339211s
[INFO] 10.128.0.51:45118 - 61895 "A IN registry.redhat.io.mjoseph-ip18.qe.devcluster.openshift.com. udp 77 false 512" NXDOMAIN qr,rd,ra 77 0.002902669s
[INFO] 10.128.0.51:34660 - 34948 "AAAA IN registry.redhat.io.mjoseph-ip18.qe.devcluster.openshift.com. udp 77 false 512" NXDOMAIN qr,rd,ra 77 0.003045752s
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s
[INFO] 10.129.0.1:4062 - 14450 "AAAA IN api.openshift.com.us-west-2.compute.internal. udp 62 false 512" NXDOMAIN qr,rd,ra 62 0.007191427s
[INFO] 10.129.0.1:60402 - 11971 "A IN api.openshift.com.us-west-2.compute.internal. udp 62 false 512" NXDOMAIN qr,rd,ra 62 0.007292815s
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 20s

Based on the output and https://bugzilla.redhat.com/show_bug.cgi?id=2037190#c8, I am marking this bug as verified.

Comment 12 errata-xmlrpc 2022-08-10 10:41:16 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

