Description of problem:

I set up an additional ingresscontroller, which looks like this:

```
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2022-01-17T21:18:01Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: clusters-alvaro-test
  namespace: openshift-ingress-operator
  resourceVersion: "314326"
  uid: eddcf948-4b21-4e78-a297-fa1bab375d20
spec:
  domain: alvaro-test.hypershift.local
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: Internal
    type: LoadBalancerService
  httpErrorCodePages:
    name: ""
  routeSelector:
    matchLabels:
      hypershift.openshift.io/hosted-control-plane: clusters-alvaro-test
  tuningOptions: {}
  unsupportedConfigOverrides: null
```

The DNS for this ingresscontroller is managed outside of the ingress operator by a different controller. As a result, the ingress operator continuously tries to set up Route 53 entries for the ingresscontroller under the cluster's Route 53 zone, which can't work, as the ingresscontroller uses a completely different domain:

```
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2022-01-17T21:18:01Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2022-01-17T21:18:01Z"
    status: "True"
    type: PodsScheduled
  - lastTransitionTime: "2022-01-17T21:18:37Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2022-01-17T21:18:37Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2022-01-17T21:18:37Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2022-01-17T21:18:01Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2022-01-17T21:18:03Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2022-01-17T21:18:01Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2022-01-17T21:18:04Z"
    message: 'The record failed to provision in some zones: [{ map[Name:alvaro-host-l9kbq-int kubernetes.io/cluster/alvaro-host-l9kbq:owned]} {Z01753031XC9KEOLEZ50O map[]}]'
    reason: FailedZones
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2022-01-17T21:18:37Z"
    message: 'One or more status conditions indicate unavailable: DNSReady=False (FailedZones: The record failed to provision in some zones: [{ map[Name:alvaro-host-l9kbq-int kubernetes.io/cluster/alvaro-host-l9kbq:owned]} {Z01753031XC9KEOLEZ50O map[]}])'
    reason: IngressControllerUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2022-01-17T21:18:37Z"
    message: 'One or more other status conditions indicate a degraded state: DNSReady=False (FailedZones: The record failed to provision in some zones: [{ map[Name:alvaro-host-l9kbq-int kubernetes.io/cluster/alvaro-host-l9kbq:owned]} {Z01753031XC9KEOLEZ50O map[]}])'
    reason: DegradedConditions
    status: "True"
    type: Degraded
```

The cluster's base domain is not `hypershift.local`:

```
$ oc get dnses.config/cluster -o 'jsonpath={.spec.baseDomain}'
alvaro-host.alvaroaleman.hypershift.devcluster.openshift.com
```

OpenShift release version:

```
Server Version: 4.8.11
Kubernetes Version: v1.21.1+9807387
```

Cluster Platform: AWS

How reproducible:

Steps to Reproduce (in detail):
1. Apply the IngressController manifest from above

Actual results:
The ingress operator continuously tries to create Route 53 records for the ingresscontroller's domain in the cluster's Route 53 zones, fails, and reports the ingresscontroller as degraded (DNSReady=False).

Expected results:
The operator does not try to reconcile DNS records for a domain that is not below the cluster's base domain.

Impact of the problem:
The operator reports the IngressController as degraded even though it works fine, and its logs are full of errors from trying to set up the Route 53 records.

Additional info:
Setting blocker- as this isn't a regression or upgrade issue. It would make sense to change the ingress operator not to try to manage DNS for an ingresscontroller with a spec.domain that does not match the spec.baseDomain of the cluster DNS config (i.e., `oc get dnses.config/cluster -o 'jsonpath={.spec.baseDomain}'`). However, I'm nervous about making a change like that so close to code freeze for 4.10.0 or in a z-stream release, so it might be best to address this in a future y-stream release.
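For illustration, here is a minimal sketch of what that domain check could look like; the function and variable names are hypothetical and are not taken from the ingress operator's code:

```go
// Hypothetical sketch of the proposed check: manage DNS only when the
// ingresscontroller's spec.domain is the cluster base domain or a subdomain
// of it. Names here are illustrative, not the actual operator code.
package main

import (
	"fmt"
	"strings"
)

// domainMatchesBaseDomain reports whether domain equals baseDomain or is a
// subdomain of it, ignoring case and any trailing dot.
func domainMatchesBaseDomain(domain, baseDomain string) bool {
	d := strings.ToLower(strings.TrimSuffix(domain, "."))
	b := strings.ToLower(strings.TrimSuffix(baseDomain, "."))
	return d == b || strings.HasSuffix(d, "."+b)
}

func main() {
	base := "alvaro-host.alvaroaleman.hypershift.devcluster.openshift.com"

	// The domain from this report does not fall under the cluster base domain,
	// so DNS management would be skipped instead of the operator repeatedly
	// failing to provision Route 53 records for it.
	fmt.Println(domainMatchesBaseDomain("alvaro-test.hypershift.local", base)) // false
	fmt.Println(domainMatchesBaseDomain("apps."+base, base))                   // true
}
```

With the domains from this report, such a check would return false for `alvaro-test.hypershift.local`, and the operator could set DNSManaged=False rather than reporting the ingresscontroller as degraded.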
To follow up on comment 1, it would seem reasonable to apply the aforementioned change not only for AWS but for all platforms. However, Azure has unusual behavior with respect to DNS, as described in bug 1919151: when the operator creates a DNS record with a domain outside the hosted zone's domain, Azure concatenates the domains. For example, if an IngressController specifies the domain "apps.foo.tld" and the cluster domain is "bar.tld", then when the operator tries to create a DNS record for "*.apps.foo.tld", Azure creates a record "*.apps.foo.tld.bar.tld".

The situation gets more complicated if the cluster has different domains for the public zone and the private zone. In a test cluster, I noticed that the private zone's domain is a subdomain of the public zone's domain. So, for example, suppose the public zone has the domain "bar.tld", the private zone has the domain "baz.bar.tld", and the IngressController has the domain "apps.foo.tld"; then the operator tells Azure to create a DNS record for "*.apps.foo.tld", and Azure creates a DNS record for "*.apps.foo.tld.bar.tld" in the public zone and a DNS record for "*.apps.foo.tld.baz.bar.tld" in the private zone. This makes things *very* tricky. In order not to risk breaking existing Azure clusters, we could do one of the following:

* Apply the change only for AWS.
* Apply the change for all platforms except Azure to preserve the current behavior there.
* Apply the change for all platforms, add a big release note warning users of the new behavior on Azure, and maybe add logic in the previous version of OpenShift to set Upgradeable=False if some IngressController with endpointPublishingStrategy.type: LoadBalancerService has a domain outside the cluster's domain.

I hope no one actually wants the existing behavior on Azure; it is bizarre, undocumented, and not likely to be useful for any realistic use case. However, there is always the risk that if something is possible, someone may have come to rely on it, no matter how bizarre it is. If we apply the change more broadly than only for AWS, then we also need to investigate whether any relevant idiosyncrasies exist for the other supported cloud platforms: Alibaba, GCP, IBM Cloud, and Power VS.
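To make the options above concrete, here is a rough sketch of how a platform-conditional gate could look if Azure were exempted to preserve its current behavior; the types and names are hypothetical and do not come from the operator:

```go
// Rough sketch of a platform-conditional gate for the proposed domain check.
// All identifiers are hypothetical and for illustration only.
package main

import "fmt"

type platformType string

const (
	awsPlatform   platformType = "AWS"
	azurePlatform platformType = "Azure"
	gcpPlatform   platformType = "GCP"
)

// enforceDomainMatch reports whether the operator should refuse to manage DNS
// for an ingresscontroller whose domain is outside the cluster base domain.
func enforceDomainMatch(p platformType) bool {
	switch p {
	case azurePlatform:
		// Second option from the list above: skip the new check on Azure so the
		// existing domain-concatenation behavior keeps working for anyone who
		// might have come to rely on it.
		return false
	default:
		return true
	}
}

func main() {
	for _, p := range []platformType{awsPlatform, azurePlatform, gcpPlatform} {
		fmt.Printf("%s: enforce domain match = %v\n", p, enforceDomainMatch(p))
	}
}
```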
Verified in the "4.11.0-0.nightly-2022-06-25-081133" release version. With this payload deployed in HyperShift environments, the ingress operator no longer attempts to add Route 53 entries for controllers created with the HyperShift domain:

```
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         90m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133
```

Template for deploying the ingress controller:

```
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: intapps
  namespace: openshift-ingress-operator
spec:
  domain: intapps.hypershift-ci-736.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: Internal
    type: LoadBalancerService
  httpErrorCodePages:
    name: ""
  tuningOptions: {}
  unsupportedConfigOverrides: null
```

```
$ oc -n openshift-ingress-operator get ingresscontroller intapps -ojsonpath='{.spec}'
"domain": "intapps.hypershift-ci-736.qe.devcluster.openshift.com",
"endpointPublishingStrategy": {
  "loadBalancer": {
    "providerParameters": {
      "aws": {
        "type": "NLB"
      },
      "type": "AWS"
    },
    "scope": "Internal"
  },
  "type": "LoadBalancerService"

$ oc -n openshift-ingress-operator get ingresscontroller
NAME      AGE
default   127m
intapps   40m
```

Ingress operator logs after the controller creation:

```
$ oc -n openshift-ingress-operator logs pod/ingress-operator-6bf85c9ffc-kf8b7 -c ingress-operator | grep -i "intapps"
2022-06-27T07:00:05.536Z DEBUG operator.init.events record/event.go:311 Warning {"object": {"kind":"IngressController","namespace":"openshift-ingress-operator","name":"intapps","uid":"d9cdc356-639e-415a-88b3-5d1741ca1534","apiVersion":"operator.openshift.io/v1","resourceVersion":"59071"}, "reason": "DomainNotMatching", "message": "Domain [intapps.hypershift-ci-736.qe.devcluster.openshift.com] of ingresscontroller does not match the baseDomain [aiyengar411hi.qe.devcluster.openshift.com] of the cluster DNS config, so DNS management is not supported."} <=====
2022-06-27T07:00:05.543Z DEBUG operator.init.events record/event.go:311 Normal {"object": {"kind":"IngressController","namespace":"openshift-ingress-operator","name":"intapps","uid":"d9cdc356-639e-415a-88b3-5d1741ca1534","apiVersion":"operator.openshift.io/v1","resourceVersion":"59071"}, "reason": "Admitted", "message": "ingresscontroller passed validation"}
2022-06-27T07:00:05.544Z INFO operator.ingressclass_controller controller/controller.go:121 reconciling {"request": "openshift-ingress-operator/intapps"}
2022-06-27T07:00:05.544Z INFO operator.ingress_controller controller/controller.go:121 reconciling {"request": "openshift-ingress-operator/intapps"}
```

```
$ oc -n openshift-ingress-operator get ingresscontroller intapps -oyaml
  - lastTransitionTime: "2022-06-27T07:00:05Z"
    message: DNS management is not supported for ingresscontrollers with domain not matching the baseDomain of the cluster DNS config.
    reason: DomainNotMatching
    status: "False"
    type: DNSManaged
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069