Bug 2002461 - DNS operator performs spurious updates in response to API's defaulting of service's internalTrafficPolicy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: jechen
URL:
Whiteboard:
Depends On:
Blocks: 2002621
 
Reported: 2021-09-08 21:49 UTC by Miciah Dashiel Butler Masters
Modified: 2022-12-27 21:17 UTC (History)
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When the DNS operator reconciles its operands, the operator gets the cluster DNS service object from the API to determine whether the operator needs to create or update the service. If the service already exists, the operator compares it with what the operator expects to get in order to determine whether an update is needed. Kubernetes 1.22, on which OpenShift 4.9 is based, introduced a new spec.internalTrafficPolicy API field for services. The operator leaves this field empty when it creates the service, but the API sets a default value for this field. The operator was observing this default value and trying to update the field back to the empty value.
Consequence: The operator's update logic would keep trying to revert the default value that the API set for the service's internal traffic policy.
Fix: When comparing services to determine whether an update is required, the operator now treats the empty value and the default value for spec.internalTrafficPolicy as equal.
Result: The operator no longer spuriously tries to update the cluster DNS service when the API sets a default value for the service's spec.internalTrafficPolicy field.
Clone Of:
Environment:
Last Closed: 2022-03-10 16:08:57 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/cluster-dns-operator pull 294 (Merged): Bug 2002461: serviceChanged: Fix internalTrafficPolicy (last updated 2022-02-10 21:23:36 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:09:24 UTC)

Description Miciah Dashiel Butler Masters 2021-09-08 21:49:44 UTC
Description of problem:

When the DNS operator reconciles its operands, the operator gets the DNS service from the API to determine whether the operator needs to create or update the service.  If the service already exists, the operator compares it with what the operator expects to get in order to determine whether an update is needed.  In this comparison, if the API has set the default value for the service's spec.internalTrafficPolicy field, the operator detects a difference and tries to set the field back to the empty value.  The operator should not update the service in response to API defaulting.
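
For illustration, a minimal Go sketch (against the k8s.io/api/core/v1 types) of the kind of comparison the fix describes: treat an unset spec.internalTrafficPolicy and the API's default as equal so that defaulting alone never triggers an update.  The helper name and structure below are hypothetical and are not the operator's actual code; the real change is in the linked cluster-dns-operator pull request.

    package operator

    import (
        corev1 "k8s.io/api/core/v1"
    )

    // internalTrafficPolicyChanged is a hypothetical helper: it reports whether
    // the current and expected services differ in spec.internalTrafficPolicy,
    // treating an unset (nil) value as equal to the API's default ("Cluster").
    func internalTrafficPolicyChanged(current, expected *corev1.Service) bool {
        normalize := func(p *corev1.ServiceInternalTrafficPolicyType) corev1.ServiceInternalTrafficPolicyType {
            if p == nil {
                return corev1.ServiceInternalTrafficPolicyCluster
            }
            return *p
        }
        return normalize(current.Spec.InternalTrafficPolicy) != normalize(expected.Spec.InternalTrafficPolicy)
    }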


OpenShift release version:

Kubernetes 1.22 and OpenShift 4.9 enable the new internalTrafficPolicy field by default.  
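
To observe the defaulting itself, here is a hedged client-go sketch (assuming a reachable cluster, a kubeconfig at the default location, and an arbitrary throwaway service name and namespace): create a Service with spec.internalTrafficPolicy unset and read back the value the API server fills in, which on Kubernetes 1.22+ is "Cluster".

    package main

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Load the default kubeconfig (~/.kube/config).
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(config)

        // Create a throwaway service with spec.internalTrafficPolicy left unset.
        svc := &corev1.Service{
            ObjectMeta: metav1.ObjectMeta{Name: "itp-default-demo", Namespace: "default"},
            Spec: corev1.ServiceSpec{
                Selector: map[string]string{"app": "demo"},
                Ports:    []corev1.ServicePort{{Port: 80, TargetPort: intstr.FromInt(8080)}},
            },
        }
        created, err := client.CoreV1().Services("default").Create(context.TODO(), svc, metav1.CreateOptions{})
        if err != nil {
            panic(err)
        }

        // The API server defaults the field on creation; expect "Cluster".
        if created.Spec.InternalTrafficPolicy != nil {
            fmt.Println(*created.Spec.InternalTrafficPolicy)
        }
    }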


Cluster Platform:

Observed on AWS and GCP, but the behavior is expected to be the same on all platforms.


How reproducible:

100%.


Steps to Reproduce (in detail):

1. Launch a new OpenShift 4.9 or 4.10 cluster.

2. Check the DNS operator's logs: 

    oc -n openshift-dns-operator logs -c dns-operator deploy/dns-operator

3. Restart the operator: 

    oc -n openshift-dns-operator delete pods -l name=dns-operator

4. Check the DNS operator's logs again.  

Actual results:

The operator logs many "updated dns service openshift-dns/dns-default" messages.  


Expected results:

The operator should log only a few such messages when it first starts, and it shouldn't log any such messages when restarted (unless something else besides the operator itself or API defaulting modifies the service).  


Impact of the problem:

Extra API load and CPU usage from performing spurious updates.


Additional information:

The fix for this BZ should be backported to OpenShift 4.9.

Comment 31 jechen 2021-09-13 15:16:34 UTC
Verified in 4.10.0-0.nightly-2021-09-10-083647

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-09-10-083647   True        False         8m46s   Cluster version is 4.10.0-0.nightly-2021-09-10-083647

After deleting the dns-operator pod:
$ oc -n openshift-dns-operator delete pods -l name=dns-operator
pod "dns-operator-598b8b6cc7-vt58d" deleted

The dns-operator pod was recreated:
$ oc -n openshift-dns-operator get pod
NAME                            READY   STATUS    RESTARTS   AGE
dns-operator-598b8b6cc7-xnvc2   2/2     Running   0          19m


# check log again
$ oc -n openshift-dns-operator logs -c dns-operator deploy/dns-operator
I0913 13:07:22.257263       1 request.go:668] Waited for 1.030837528s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/quota.openshift.io/v1?timeout=32s
time="2021-09-13T13:07:23Z" level=info msg="reconciling request: /default"
time="2021-09-13T13:07:23Z" level=info msg="reconciling request: /default"


Do not see any "updated dns service" messages after 19 minutes; the issue is fixed.

Comment 34 errata-xmlrpc 2022-03-10 16:08:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

