1873728 – Ingress operator fails to update existing DNSRecord status conditions

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Miciah Dashiel Butler Masters
QA Contact:	Hongan Li
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1874051
Depends On:
Blocks:	1874244

Reported:	2020-08-29 16:57 UTC by Miciah Dashiel Butler Masters
Modified:	2022-08-04 22:39 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-27 16:36:11 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-ingress-operator pull 446	0	None	closed	Bug 1873728: publishRecordToZones: Fix status merge	2021-02-15 05:54:37 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:36:39 UTC

Description Miciah Dashiel Butler Masters 2020-08-29 16:57:49 UTC

Description of problem:

When the ingress operator's DNS controller reconciles a DNSRecord and computes its new status, it compares the old and new status conditions to decide whether they have changed, and thus whether the controller should update the DNSRecord's status.  This equality check wrongly reports the conditions as equal when a condition is already set but its status has changed (for example, from Failed=False to Failed=True).  As a result, the controller fails to record success after an earlier failure, the DNS controller endlessly retries publishing the DNSRecord, and the IngressController's status conditions show DNSReady=False and Degraded=True.
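
To illustrate the kind of comparison described above, here is a minimal sketch of an equality check that does notice a status flip. It is not the operator's actual code: the `ZoneCondition` type and the `conditionsEqual` helper are hypothetical, simplified stand-ins, and the sketch assumes each condition type appears at most once per zone.

```go
package main

import "fmt"

// ZoneCondition is a hypothetical, simplified stand-in for a DNS zone
// status condition; it is not the operator's real API type.
type ZoneCondition struct {
	Type, Status, Reason, Message string
}

// conditionsEqual reports whether a and b hold the same conditions,
// ignoring order. Conditions are matched by Type and then compared
// field by field, so a change in Status, Reason, or Message counts
// as a difference.
func conditionsEqual(a, b []ZoneCondition) bool {
	if len(a) != len(b) {
		return false
	}
	byType := make(map[string]ZoneCondition, len(a))
	for _, c := range a {
		byType[c.Type] = c
	}
	for _, c := range b {
		old, ok := byType[c.Type]
		if !ok || old != c {
			return false
		}
	}
	return true
}

func main() {
	old := []ZoneCondition{{Type: "Failed", Status: "True", Reason: "ProviderError"}}
	cur := []ZoneCondition{{Type: "Failed", Status: "False"}}

	// A status update is needed only when the conditions differ.
	fmt.Println("needs update:", !conditionsEqual(old, cur))
}
```

A check that matched conditions by type alone would treat `old` and `cur` as equal and skip the status update, which is the symptom reported here.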


Version-Release number of selected component (if applicable):

4.6.0-0.ci.test-2020-08-29-131655-ci-op-rv6xn61c


How reproducible:

Happens when the DNS provider returns an error and subsequently returns success.  Observed in some CI runs.  For example, <https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_router/170/pull-ci-openshift-router-master-e2e/1299697433579622400>.


Actual results:

Here, the DNS controller initially fails to publish the DNS records:

    2020-08-29T13:36:49.275Z	ERROR	operator.dns_controller	dns/controller.go:181	failed to publish DNS record to zone	{"record": {"dnsName":"*.apps.ci-op-rv6xn61c-8f0fe.origin-ci-int-gce.dev.openshift.com.","targets":["34.73.141.20"],"recordType":"A","recordTTL":30}, "dnszone": {"id":"ci-op-rv6xn61c-8f0fe-wm6nv-private-zone"}, "error": "Post https://dns.googleapis.com/dns/v1/projects/openshift-gce-devel-ci/managedZones/ci-op-rv6xn61c-8f0fe-wm6nv-private-zone/changes?alt=json&prettyPrint=false: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: i/o timeout"}
    ...
    2020-08-29T13:37:09.283Z	ERROR	operator.dns_controller	dns/controller.go:181	failed to publish DNS record to zone	{"record": {"dnsName":"*.apps.ci-op-rv6xn61c-8f0fe.origin-ci-int-gce.dev.openshift.com.","targets":["34.73.141.20"],"recordType":"A","recordTTL":30}, "dnszone": {"id":"origin-ci-int-gce-new"}, "error": "Post https://dns.googleapis.com/dns/v1/projects/openshift-gce-devel-ci/managedZones/origin-ci-int-gce-new/changes?alt=json&prettyPrint=false: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: lookup oauth2.googleapis.com on 172.30.0.10:53: read udp 10.130.0.22:41182->172.30.0.10:53: read: connection refused"}

Shortly after, the DNS controller succeeds in publishing the records:

    2020-08-29T13:37:09.295Z	INFO	operator.dns_controller	controller/controller.go:233	updated dnsrecord	{"dnsrecord": {"metadata":{"name":"default-wildcard","namespace":"openshift-ingress-operator","selfLink":"/apis/ingress.operator.openshift.io/v1/namespaces/openshift-ingress-operator/dnsrecords/default-wildcard/status","uid":"9448c7dc-e082-47fc-8f45-062e0ea092b6","resourceVersion":"17508","generation":1,"creationTimestamp":"2020-08-29T13:36:19Z","labels":{"ingresscontroller.operator.openshift.io/owning-ingresscontroller":"default"},"ownerReferences":[{"apiVersion":"operator.openshift.io/v1","kind":"IngressController","name":"default","uid":"6300ae77-9e39-4432-81d2-d8c64dc2340f","controller":true,"blockOwnerDeletion":true}],"finalizers":["operator.openshift.io/ingress-dns"],"managedFields":[{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2020-08-29T13:37:09Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"operator.openshift.io/ingress-dns\"":{}},"f:labels":{".":{},"f:ingresscontroller.operator.openshift.io/owning-ingresscontroller":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"6300ae77-9e39-4432-81d2-d8c64dc2340f\"}":{".":{},"f:apiVersion":{},"f:blockOwnerDeletion":{},"f:controller":{},"f:kind":{},"f:name":{},"f:uid":{}}}},"f:spec":{".":{},"f:dnsName":{},"f:recordTTL":{},"f:recordType":{},"f:targets":{}},"f:status":{".":{},"f:observedGeneration":{},"f:zones":{}}}}]},"spec":{"dnsName":"*.apps.ci-op-rv6xn61c-8f0fe.origin-ci-int-gce.dev.openshift.com.","targets":["34.73.141.20"],"recordType":"A","recordTTL":30},"status":{"zones":[{"dnsZone":{"id":"ci-op-rv6xn61c-8f0fe-wm6nv-private-zone"},"conditions":[{"type":"Failed","status":"True","lastTransitionTime":"2020-08-29T13:36:19Z","reason":"ProviderError","message":"The DNS provider failed to ensure the record: Post https://dns.googleapis.com/dns/v1/projects/openshift-gce-devel-ci/managedZones/ci-op-rv6xn61c-8f0fe-wm6nv-private-zone/changes?alt=json&prettyPrint=false: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: i/o timeout"}]},{"dnsZone":{"id":"origin-ci-int-gce-new"},"conditions":[{"type":"Failed","status":"True","lastTransitionTime":"2020-08-29T13:36:49Z","reason":"ProviderError","message":"The DNS provider failed to ensure the record: Post https://dns.googleapis.com/dns/v1/projects/openshift-gce-devel-ci/managedZones/origin-ci-int-gce-new/changes?alt=json&prettyPrint=false: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: lookup oauth2.googleapis.com on 172.30.0.10:53: read udp 10.130.0.22:41182->172.30.0.10:53: read: connection refused"}]}],"observedGeneration":1}}}

However, the operator does not update the DNSRecord, and soon the ingress controller reports DNSReady=False:

    2020-08-29T13:37:09.330Z	ERROR	operator.ingress_controller	controller/controller.go:233	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False, DeploymentReplicasMinAvailable=False, DNSReady=False"}

The DNS controller continues retrying publishing the records:

    2020-08-29T13:37:19.651Z	INFO	operator.dns_controller	dns/controller.go:181	published DNS record to zone	{"record": {"dnsName":"*.apps.ci-op-rv6xn61c-8f0fe.origin-ci-int-gce.dev.openshift.com.","targets":["34.73.141.20"],"recordType":"A","recordTTL":30}, "dnszone": {"id":"ci-op-rv6xn61c-8f0fe-wm6nv-private-zone"}}
    2020-08-29T13:37:20.137Z	INFO	operator.dns_controller	dns/controller.go:181	published DNS record to zone	{"record": {"dnsName":"*.apps.ci-op-rv6xn61c-8f0fe.origin-ci-int-gce.dev.openshift.com.","targets":["34.73.141.20"],"recordType":"A","recordTTL":30}, "dnszone": {"id":"origin-ci-int-gce-new"}}

The IngressController remains degraded because the DNSRecord is never updated from Failed=True to Failed=False:

    2020-08-29T13:41:19.061Z	ERROR	operator.ingress_controller	controller/controller.go:233	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: DNSReady=False"}
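
The link between the stale Failed=True conditions and DNSReady=False can be sketched as follows. This is an illustration under an assumption, namely that DNSReady is reported as False whenever any zone of the DNSRecord still has Failed=True. The `Condition` and `Zone` types and the `failedZones` helper are hypothetical and are not the operator's real API.

```go
package main

import "fmt"

// Condition and Zone are hypothetical, simplified stand-ins for the
// DNSRecord status types; they are not the operator's real API.
type Condition struct{ Type, Status string }

type Zone struct {
	ID         string
	Conditions []Condition
}

// failedZones returns the IDs of zones whose Failed condition is True.
func failedZones(zones []Zone) []string {
	var failed []string
	for _, z := range zones {
		for _, c := range z.Conditions {
			if c.Type == "Failed" && c.Status == "True" {
				failed = append(failed, z.ID)
				break
			}
		}
	}
	return failed
}

func main() {
	// Zone IDs taken from the logs above; both still carry Failed=True
	// because the status update was skipped.
	zones := []Zone{
		{ID: "ci-op-rv6xn61c-8f0fe-wm6nv-private-zone", Conditions: []Condition{{"Failed", "True"}}},
		{ID: "origin-ci-int-gce-new", Conditions: []Condition{{"Failed", "True"}}},
	}
	// Assumption: DNSReady holds only when no zone is marked as failed.
	fmt.Println("DNSReady:", len(failedZones(zones)) == 0)
}
```

Under this assumption, later successful publishes do not help until the Failed conditions stored in the DNSRecord status are actually rewritten to False.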


Expected results:

The DNS controller should log "updated dnsrecord" with the "Failed" status conditions set to "False" after it succeeds in publishing the records.


Additional info:

The logic error was introduced in 4.5.0 with https://github.com/openshift/cluster-ingress-operator/pull/390/commits/d953fa97c7f90d8ec733fdbf9ba12aa5fb433cc1.

Comment 4 Andrew McDermott 2020-09-01 17:12:19 UTC

*** Bug 1874051 has been marked as a duplicate of this bug. ***

Comment 5 Hongan Li 2020-09-07 07:17:35 UTC

Verified with 4.6.0-0.nightly-2020-09-05-015624; the issue has been fixed.

Test steps:
1. Run "oc -n openshift-ingress-operator edit secret cloud-credentials" and change the key.
2. Delete the dnsrecords and wait for them to be recreated.
3. Ensure the DNSRecord status is Failed=True and co/ingress is Degraded.
4. Delete the cloud-credentials secret and wait for it to be recreated.
5. Ensure the DNSRecord status is Failed=False and co/ingress is not Degraded.

Comment 7 errata-xmlrpc 2020-10-27 16:36:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
