Description of problem:

The Ingress Operator fails to delete the wildcard DNS record when the record is not in the operator's in-memory cache (for example, after the operator pod has been restarted).

Version-Release number of selected component (if applicable):

$ oc get clusterversions.config.openshift.io
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-23-160540   True        False         6h23m   Cluster version is 4.6.0-0.nightly-2020-06-23-160540

How reproducible:

Always

Steps to Reproduce:
1. Install an AWS cluster using IPI.
2. Kill the ingress operator pod to remove the cached dnsrecord.
3. Start the ingress operator.
4. Delete the default ingresscontroller.

Actual results:

2020-06-24T16:10:31.413-0700 ERROR operator.dns_controller dns/controller.go:88 failed to delete dnsrecord; will retry {"dnsrecord": {"metadata":{"name":"default-wildcard","namespace":"openshift-ingress-operator","selfLink":"/apis/ingress.operator.openshift.io/v1/namespaces/openshift-ingress-operator/dnsrecords/default-wildcard","uid":"fcd8789a-dad4-411d-a6c1-689ca55efa4c","resourceVersion":"271966","generation":2,"creationTimestamp":"2020-06-24T23:09:12Z","deletionTimestamp":"2020-06-24T23:10:30Z","deletionGracePeriodSeconds":0,"labels":{"ingresscontroller.operator.openshift.io/owning-ingresscontroller":"default"},"ownerReferences":[{"apiVersion":"operator.openshift.io/v1","kind":"IngressController","name":"default","uid":"fb11358d-a3dd-4015-83bf-a3168fbc3e34","controller":true,"blockOwnerDeletion":true}],"finalizers":["operator.openshift.io/ingress-dns"],"managedFields":[{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2020-06-24T23:09:14Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"operator.openshift.io/ingress-dns\"":{}},"f:labels":{".":{},"f:ingresscontroller.operator.openshift.io/owning-ingresscontroller":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"fb11358d-a3dd-4015-83bf-a3168fbc3e34\"}":{".":{},"f:apiVersion":{},"f:blockOwnerDeletion":{},"f:controller":{},"f:kind":{},"f:name":{},"f:uid":{}}}},"f:spec":{".":{},"f:dnsName":{},"f:recordTTL":{},"f:recordType":{},"f:targets":{}},"f:status":{".":{},"f:observedGeneration":{},"f:zones":{}}}}]},"spec":{"dnsName":"*.apps.dhansen.devcluster.openshift.com.","targets":["a3c35848dd37540b4bb0b54d1e3f84bf-105426093.us-west-2.elb.amazonaws.com"],"recordType":"CNAME","recordTTL":30},"status":{"zones":[{"dnsZone":{"tags":{"Name":"dhansen-ksg8r-int","kubernetes.io/cluster/dhansen-ksg8r":"owned"}},"conditions":[{"type":"Failed","status":"False","lastTransitionTime":"2020-06-24T23:09:12Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]},{"dnsZone":{"id":"Z3URY6TWQ91KVV"},"conditions":[{"type":"Failed","status":"False","lastTransitionTime":"2020-06-24T23:09:14Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]}],"observedGeneration":1}}, "error": "failed to get hosted zone for load balancer target \"a3c35848dd37540b4bb0b54d1e3f84bf-105426093.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB a3c35848dd37540b4bb0b54d1e3f84bf-105426093.us-west-2.elb.amazonaws.com", "errorCauses": [{"error": "failed to get hosted zone for load balancer target \"a3c35848dd37540b4bb0b54d1e3f84bf-105426093.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB a3c35848dd37540b4bb0b54d1e3f84bf-105426093.us-west-2.elb.amazonaws.com"}, {"error": "failed to get hosted zone for load balancer target \"a3c35848dd37540b4bb0b54d1e3f84bf-105426093.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB a3c35848dd37540b4bb0b54d1e3f84bf-105426093.us-west-2.elb.amazonaws.com"}]}

Expected results:

The default ingresscontroller and its dependent resources are deleted.

Additional info:
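The reproduction steps above can be run as oc commands. This is a sketch, not the reporter's exact procedure; in particular, the "name=ingress-operator" pod label selector and the "ingress-operator" container name are assumptions that may need adjusting for a given release.

# 2. Kill the ingress operator pod so the operator loses its cached dnsrecord state.
#    Assumption: the operator pods carry the label name=ingress-operator.
oc -n openshift-ingress-operator delete pod -l name=ingress-operator

# 3. Wait for the replacement operator pod to come up.
oc -n openshift-ingress-operator wait pod -l name=ingress-operator --for=condition=Ready

# 4. Delete the default ingresscontroller, then check the operator logs for the error.
#    Assumption: the operator container is named ingress-operator.
oc -n openshift-ingress-operator delete ingresscontroller default
oc -n openshift-ingress-operator logs deployment/ingress-operator -c ingress-operator | grep "failed to delete dnsrecord"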
*** Bug 1850812 has been marked as a duplicate of this bug. ***
Verified with 4.6.0-0.nightly-2020-07-15-170241; the issue has been fixed. Following the steps above, the default ingresscontroller can be deleted, and no "failed to delete dnsrecord" errors appear in the logs.
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

This bug addresses a niche use case in which the Ingress Operator is restarted after the default ingresscontroller (and any additional ingresscontrollers) were created, and one of those ingresscontrollers is then deleted.

What is the impact? Is it serious enough to warrant blocking edges?

The ingresscontroller deletion hangs, with the log message in https://bugzilla.redhat.com/show_bug.cgi?id=1850813#c0 observed. Not serious enough to warrant blocking edges.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

Remediation involves removing the ingresscontroller finalizer and manually deleting the dependent resources (see the sketch below).

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

No, it's always been like this; we just never noticed.
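A minimal remediation sketch for a hung deletion, assuming the stuck objects are the default-wildcard dnsrecord and the default ingresscontroller (names taken from the log in comment 0). Note that clearing a finalizer bypasses the operator's own cleanup, so any orphaned cloud resources, such as the wildcard record in the Route 53 hosted zone, may need to be removed manually afterward.

# Clear the finalizer on the stuck dnsrecord so its deletion can complete.
oc -n openshift-ingress-operator patch dnsrecord default-wildcard \
  --type=merge -p '{"metadata":{"finalizers":null}}'

# If the ingresscontroller itself remains stuck in deletion, clear its finalizer too.
oc -n openshift-ingress-operator patch ingresscontroller default \
  --type=merge -p '{"metadata":{"finalizers":null}}'

# Any DNS record the operator failed to delete is now orphaned in the
# cloud provider and must be cleaned up manually (e.g., in Route 53 on AWS).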
Pulling UpgradeBlocker based on comment 7.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196