Bug 2002621 - DNS operator performs spurious updates in response to API's defaulting of service's internalTrafficPolicy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.9.z
Assignee: Miciah Dashiel Butler Masters
QA Contact: Shudi Li
URL:
Whiteboard:
Depends On: 2002461
Blocks:
 
Reported: 2021-09-09 11:32 UTC by OpenShift BugZilla Robot
Modified: 2022-08-04 22:39 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When the DNS operator reconciles its operands, the operator gets the cluster DNS service object from the API to determine whether the operator needs to create or update the service. If the service already exists, the operator compares it with what the operator expects to get in order to determine whether an update is needed. Kubernetes 1.22, on which OpenShift 4.9 is based, introduced a new spec.internalTrafficPolicy API field for services. The operator leaves this field empty when it creates the service, but the API sets a default value for this field. The operator was observing this default value and trying to update the field back to the empty value.
Consequence: The operator's update logic would keep trying to revert the default value that the API set for the service's internal traffic policy.
Fix: When comparing services to determine whether an update is required, the operator now treats the empty value and default value for spec.internalTrafficPolicy as equal.
Result: The operator no longer spuriously tries to update the cluster DNS service when the API sets a default value for the service's spec.internalTrafficPolicy field.
Clone Of:
Environment:
Last Closed: 2021-11-01 13:44:33 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-dns-operator pull 295 0 None None None 2021-09-09 11:32:11 UTC
Red Hat Product Errata RHBA-2021:4005 0 None None None 2021-11-01 13:44:47 UTC

Description OpenShift BugZilla Robot 2021-09-09 11:32:00 UTC
+++ This bug was initially created as a clone of Bug #2002461 +++

Description of problem:

When the DNS operator reconciles its operands, the operator gets the DNS service from the API to determine whether the operator needs to create or update the service.  If the service already exists, the operator compares it with what the operator expects to get in order to determine whether an update is needed.  In this comparison, if the API has set the default value for the service's spec.internalTrafficPolicy field, the operator detects the update and tries to set the field back to the empty value.  The operator should not update the service in response to API defaulting.
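
The comparison described above can be illustrated with a minimal Go sketch. It is not the operator's actual code: the types `ServiceSpec` and `InternalTrafficPolicy`, the `SessionAffinity` field, and the functions `effectivePolicy` and `serviceChanged` are hypothetical stand-ins defined locally so the example runs without the Kubernetes libraries. The only assumption taken from Kubernetes itself is that the API defaults spec.internalTrafficPolicy to "Cluster". The idea is to normalize an unset value to the default before comparing, so that API defaulting alone never counts as a change.

```go
package main

import "fmt"

// InternalTrafficPolicy is a local stand-in for the Kubernetes internal
// traffic policy type; it is not the real API type.
type InternalTrafficPolicy string

// PolicyCluster mirrors the value that the API sets by default.
const PolicyCluster InternalTrafficPolicy = "Cluster"

// ServiceSpec is a hypothetical, trimmed-down service spec used only for
// this illustration.
type ServiceSpec struct {
	SessionAffinity       string
	InternalTrafficPolicy *InternalTrafficPolicy
}

// effectivePolicy maps an unset (nil or empty) policy to the API default.
func effectivePolicy(p *InternalTrafficPolicy) InternalTrafficPolicy {
	if p == nil || *p == "" {
		return PolicyCluster
	}
	return *p
}

// serviceChanged reports whether the current service differs from the
// expected one in a way that requires an update. An empty policy and the
// default policy are treated as equal.
func serviceChanged(current, expected ServiceSpec) bool {
	if current.SessionAffinity != expected.SessionAffinity {
		return true
	}
	return effectivePolicy(current.InternalTrafficPolicy) !=
		effectivePolicy(expected.InternalTrafficPolicy)
}

func main() {
	defaulted := PolicyCluster
	// current: what the API returns after defaulting the field.
	current := ServiceSpec{SessionAffinity: "None", InternalTrafficPolicy: &defaulted}
	// expected: what the operator builds, with the field left empty.
	expected := ServiceSpec{SessionAffinity: "None"}
	fmt.Println(serviceChanged(current, expected)) // prints "false"
}
```

A naive pointer or deep-equality comparison of the two specs would report a difference here, which is the behavior this bug describes.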


OpenShift release version:

Kubernetes 1.22 and OpenShift 4.9 enable the new internalTrafficPolicy field by default.  


Cluster Platform:

Observed on AWS and GCP but can be expected to be the same on all platforms.


How reproducible:

100%.


Steps to Reproduce (in detail):

1. Launch a new OpenShift 4.9 or 4.10 cluster.

2. Check the DNS operator's logs: 

    oc -n openshift-dns-operator logs -c dns-operator deploy/dns-operator

3. Restart the operator: 

    oc -n openshift-dns-operator delete pods -l name=dns-operator

4. Check the DNS operator's logs again.  

Actual results:

The operator logs many "updated dns service openshift-dns/dns-default" messages.  


Expected results:

The operator should log only a few such messages when it first starts, and it shouldn't log any such messages when restarted (unless something else besides the operator itself or API defaulting modifies the service).  
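
To compare actual and expected results without reading the log by eye, the update messages can be counted. The following is a minimal sketch, assuming only that each spurious update produces a log line containing the text "updated dns service" as quoted above; the function name `countUpdatesInText` and the constant `updateMarker` are made up for this illustration.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// updateMarker is the message text quoted in this report; the exact log
// format around it is not assumed.
const updateMarker = "updated dns service"

// countUpdatesInText returns the number of lines in logs that contain the
// update message.
func countUpdatesInText(logs string) int {
	count := 0
	for _, line := range strings.Split(logs, "\n") {
		if strings.Contains(line, updateMarker) {
			count++
		}
	}
	return count
}

func main() {
	// Read the operator's logs from standard input.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading logs:", err)
		os.Exit(1)
	}
	fmt.Println(countUpdatesInText(string(data)))
}
```

Saved under a hypothetical file name such as countupdates.go, it could be fed from the log command in step 2:

    oc -n openshift-dns-operator logs -c dns-operator deploy/dns-operator | go run countupdates.go

A count that keeps growing after a restart would indicate the spurious updates; a count of zero after a restart matches the expected results.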


Impact of the problem:

Extra API load and CPU usage performing spurious updates.


Additional information:

The fix for this BZ should be backported to OpenShift 4.9.

Comment 1 Shudi Li 2021-09-15 07:13:13 UTC
Verified in 4.9.0-0.ci.test-2021-09-15-053753-ci-ln-3s3plwb-latest

1. Show the cluster's version
% oc get clusterversion
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.ci.test-2021-09-15-053753-ci-ln-3s3plwb-latest   True        False         14m     Cluster version is 4.9.0-0.ci.test-2021-09-15-053753-ci-ln-3s3plwb-latest

2. Delete the dns-operator pod
% oc -n openshift-dns-operator delete pods -l name=dns-operator
pod "dns-operator-59bd9659bd-rmcmj" deleted

3. The dns-operator pod was recreated
% oc -n openshift-dns-operator get pods
NAME                            READY   STATUS    RESTARTS   AGE
dns-operator-59bd9659bd-wpt47   2/2     Running   0          2m10s

4. Wait for some time and check the log again; the "updated dns service openshift-dns" message no longer appears
% oc -n openshift-dns-operator get pods
NAME                            READY   STATUS    RESTARTS   AGE
dns-operator-59bd9659bd-wpt47   2/2     Running   0          18m
 
% oc -n openshift-dns-operator logs -c dns-operator deploy/dns-operator                     
I0915 06:25:19.561393       1 request.go:668] Waited for 1.029170503s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/quota.openshift.io/v1?timeout=32s
time="2021-09-15T06:25:20Z" level=info msg="reconciling request: /default"
time="2021-09-15T06:25:21Z" level=info msg="reconciling request: /default"

Comment 2 Miciah Dashiel Butler Masters 2021-09-23 16:20:31 UTC
Setting priority to low as the PR is blocked on 4.9.0 GA.

Comment 5 Shudi Li 2021-10-25 06:03:47 UTC
Verified with 4.9.0-0.nightly-2021-10-22-102153; passed.

1.
% oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-10-22-102153   True        False         128m    Cluster version is 4.9.0-0.nightly-2021-10-22-102153
%

2.
% oc -n openshift-dns-operator delete pods -l name=dns-operator
pod "dns-operator-5d469849fd-cpsf9" deleted
%

3.
% oc -n openshift-dns-operator get pods                        
NAME                            READY   STATUS    RESTARTS   AGE
dns-operator-5d469849fd-ztcdj   2/2     Running   0          24s
%

4. Wait for some time and check the log; the "updated dns service openshift-dns" message does not appear
% oc -n openshift-dns-operator logs -c dns-operator deploy/dns-operator
I1025 05:57:51.853656       1 request.go:668] Waited for 1.028907827s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/template.openshift.io/v1?timeout=32s
time="2021-10-25T05:57:53Z" level=info msg="reconciling request: /default"
time="2021-10-25T05:57:53Z" level=info msg="reconciling request: /default"
%

Comment 8 errata-xmlrpc 2021-11-01 13:44:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.5 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4005

