Bug 1873728

Summary: Ingress operator fails to update existing DNSRecord status conditions
Product: OpenShift Container Platform
Reporter: Miciah Dashiel Butler Masters <mmasters>
Component: Networking
Sub component: DNS
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
CC: aos-bugs, mfojtik, sgreene
Version: 4.6
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-10-27 16:36:11 UTC
Bug Blocks: 1874244

Description Miciah Dashiel Butler Masters 2020-08-29 16:57:49 UTC
Description of problem:

When the ingress operator's DNS controller reconciles a DNSRecord and computes its new status, it compares the old and new status conditions to determine whether they have changed, and thus whether the controller should update the DNSRecord's status.  This equality check returns false positives for status conditions that are already set but whose status value has changed (for example, from Failed=True to Failed=False).  As a result, the controller fails to record success after an earlier failure: the DNS controller endlessly retries publishing the DNSRecord, and the IngressController's status conditions report DNSReady=False and Degraded=True.
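The class of bug described above can be illustrated with a minimal Go sketch.  The `Condition` type and both helper functions below are hypothetical simplifications for illustration only, not the operator's actual code (the real `DNSZoneCondition` type and the comparison logic touched by PR 390 differ):

```go
package main

import "fmt"

// Condition is a simplified stand-in for a DNS zone status condition;
// the real type in cluster-ingress-operator has more fields.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// conditionsChangedBuggy illustrates the faulty equality check: it treats
// two condition slices as equal whenever they contain the same condition
// types, ignoring whether the Status values differ.
func conditionsChangedBuggy(old, new []Condition) bool {
	if len(old) != len(new) {
		return true
	}
	oldTypes := map[string]bool{}
	for _, c := range old {
		oldTypes[c.Type] = true
	}
	for _, c := range new {
		if !oldTypes[c.Type] {
			return true
		}
	}
	return false
}

// conditionsChangedFixed compares the fields that matter, so a flip from
// Failed=True to Failed=False is detected and the status update proceeds.
func conditionsChangedFixed(old, new []Condition) bool {
	if len(old) != len(new) {
		return true
	}
	oldByType := map[string]Condition{}
	for _, c := range old {
		oldByType[c.Type] = c
	}
	for _, c := range new {
		prev, ok := oldByType[c.Type]
		if !ok || prev.Status != c.Status || prev.Reason != c.Reason {
			return true
		}
	}
	return false
}

func main() {
	// The scenario from this bug: the record previously failed to publish,
	// then the provider succeeds on retry.
	before := []Condition{{Type: "Failed", Status: "True", Reason: "ProviderError"}}
	after := []Condition{{Type: "Failed", Status: "False", Reason: "ProviderSuccess"}}
	fmt.Println(conditionsChangedBuggy(before, after)) // buggy check misses the flip
	fmt.Println(conditionsChangedFixed(before, after)) // fixed check detects it
}
```

With the buggy comparison, the controller concludes nothing changed and never writes Failed=False back to the DNSRecord, which matches the symptom in the logs below.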


Version-Release number of selected component (if applicable):

4.6.0-0.ci.test-2020-08-29-131655-ci-op-rv6xn61c


How reproducible:

Happens when the DNS provider returns an error and subsequently returns success.  Observed in some CI runs.  For example, <https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_router/170/pull-ci-openshift-router-master-e2e/1299697433579622400>.


Actual results:

Here, the DNS controller initially fails to publish the DNS records:

    2020-08-29T13:36:49.275Z	ERROR	operator.dns_controller	dns/controller.go:181	failed to publish DNS record to zone	{"record": {"dnsName":"*.apps.ci-op-rv6xn61c-8f0fe.origin-ci-int-gce.dev.openshift.com.","targets":["34.73.141.20"],"recordType":"A","recordTTL":30}, "dnszone": {"id":"ci-op-rv6xn61c-8f0fe-wm6nv-private-zone"}, "error": "Post https://dns.googleapis.com/dns/v1/projects/openshift-gce-devel-ci/managedZones/ci-op-rv6xn61c-8f0fe-wm6nv-private-zone/changes?alt=json&prettyPrint=false: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: i/o timeout"}
...
    2020-08-29T13:37:09.283Z	ERROR	operator.dns_controller	dns/controller.go:181	failed to publish DNS record to zone	{"record": {"dnsName":"*.apps.ci-op-rv6xn61c-8f0fe.origin-ci-int-gce.dev.openshift.com.","targets":["34.73.141.20"],"recordType":"A","recordTTL":30}, "dnszone": {"id":"origin-ci-int-gce-new"}, "error": "Post https://dns.googleapis.com/dns/v1/projects/openshift-gce-devel-ci/managedZones/origin-ci-int-gce-new/changes?alt=json&prettyPrint=false: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: lookup oauth2.googleapis.com on 172.30.0.10:53: read udp 10.130.0.22:41182->172.30.0.10:53: read: connection refused"}

Shortly after, the DNS controller succeeds in publishing the records:

2020-08-29T13:37:09.295Z	INFO	operator.dns_controller	controller/controller.go:233	updated dnsrecord	{"dnsrecord": {"metadata":{"name":"default-wildcard","namespace":"openshift-ingress-operator","selfLink":"/apis/ingress.operator.openshift.io/v1/namespaces/openshift-ingress-operator/dnsrecords/default-wildcard/status","uid":"9448c7dc-e082-47fc-8f45-062e0ea092b6","resourceVersion":"17508","generation":1,"creationTimestamp":"2020-08-29T13:36:19Z","labels":{"ingresscontroller.operator.openshift.io/owning-ingresscontroller":"default"},"ownerReferences":[{"apiVersion":"operator.openshift.io/v1","kind":"IngressController","name":"default","uid":"6300ae77-9e39-4432-81d2-d8c64dc2340f","controller":true,"blockOwnerDeletion":true}],"finalizers":["operator.openshift.io/ingress-dns"],"managedFields":[{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2020-08-29T13:37:09Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"operator.openshift.io/ingress-dns\"":{}},"f:labels":{".":{},"f:ingresscontroller.operator.openshift.io/owning-ingresscontroller":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"6300ae77-9e39-4432-81d2-d8c64dc2340f\"}":{".":{},"f:apiVersion":{},"f:blockOwnerDeletion":{},"f:controller":{},"f:kind":{},"f:name":{},"f:uid":{}}}},"f:spec":{".":{},"f:dnsName":{},"f:recordTTL":{},"f:recordType":{},"f:targets":{}},"f:status":{".":{},"f:observedGeneration":{},"f:zones":{}}}}]},"spec":{"dnsName":"*.apps.ci-op-rv6xn61c-8f0fe.origin-ci-int-gce.dev.openshift.com.","targets":["34.73.141.20"],"recordType":"A","recordTTL":30},"status":{"zones":[{"dnsZone":{"id":"ci-op-rv6xn61c-8f0fe-wm6nv-private-zone"},"conditions":[{"type":"Failed","status":"True","lastTransitionTime":"2020-08-29T13:36:19Z","reason":"ProviderError","message":"The DNS provider failed to ensure the record: Post 
https://dns.googleapis.com/dns/v1/projects/openshift-gce-devel-ci/managedZones/ci-op-rv6xn61c-8f0fe-wm6nv-private-zone/changes?alt=json&prettyPrint=false: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: i/o timeout"}]},{"dnsZone":{"id":"origin-ci-int-gce-new"},"conditions":[{"type":"Failed","status":"True","lastTransitionTime":"2020-08-29T13:36:49Z","reason":"ProviderError","message":"The DNS provider failed to ensure the record: Post https://dns.googleapis.com/dns/v1/projects/openshift-gce-devel-ci/managedZones/origin-ci-int-gce-new/changes?alt=json&prettyPrint=false: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: lookup oauth2.googleapis.com on 172.30.0.10:53: read udp 10.130.0.22:41182->172.30.0.10:53: read: connection refused"}]}],"observedGeneration":1}}}

However, the operator does not update the DNSRecord's status conditions to reflect the success, and soon the ingress controller reports DNSReady=False:

2020-08-29T13:37:09.330Z	ERROR	operator.ingress_controller	controller/controller.go:233	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False, DeploymentReplicasMinAvailable=False, DNSReady=False"}

The DNS controller continues retrying publishing the records:

    2020-08-29T13:37:19.651Z	INFO	operator.dns_controller	dns/controller.go:181	published DNS record to zone	{"record": {"dnsName":"*.apps.ci-op-rv6xn61c-8f0fe.origin-ci-int-gce.dev.openshift.com.","targets":["34.73.141.20"],"recordType":"A","recordTTL":30}, "dnszone": {"id":"ci-op-rv6xn61c-8f0fe-wm6nv-private-zone"}}
    2020-08-29T13:37:20.137Z	INFO	operator.dns_controller	dns/controller.go:181	published DNS record to zone	{"record": {"dnsName":"*.apps.ci-op-rv6xn61c-8f0fe.origin-ci-int-gce.dev.openshift.com.","targets":["34.73.141.20"],"recordType":"A","recordTTL":30}, "dnszone": {"id":"origin-ci-int-gce-new"}}

The IngressController remains degraded because the DNSRecord is never updated from Failed=True to Failed=False:

    2020-08-29T13:41:19.061Z	ERROR	operator.ingress_controller	controller/controller.go:233	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: DNSReady=False"}


Expected results:

The DNS controller should log "updated dnsrecord" with the "Failed" status conditions set to "False" after it succeeds in publishing the records.


Additional info:

The logic error was introduced in 4.5.0 with https://github.com/openshift/cluster-ingress-operator/pull/390/commits/d953fa97c7f90d8ec733fdbf9ba12aa5fb433cc1.

Comment 4 Andrew McDermott 2020-09-01 17:12:19 UTC
*** Bug 1874051 has been marked as a duplicate of this bug. ***

Comment 5 Hongan Li 2020-09-07 07:17:35 UTC
Verified with 4.6.0-0.nightly-2020-09-05-015624; the issue has been fixed.

Test steps:
1. Run "oc -n openshift-ingress-operator edit secret cloud-credentials" and change the key.
2. Delete the dnsrecords and wait for them to be recreated.
3. Ensure the DNSRecord status is Failed=True and co/ingress is Degraded.
4. Delete the secret cloud-credentials and wait for it to be recreated.
5. Ensure the DNSRecord status is Failed=False and co/ingress is not Degraded.

Comment 7 errata-xmlrpc 2020-10-27 16:36:11 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196