Description of problem:

I set up an additional ingresscontroller, which looks like this:

```
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2022-01-17T21:18:01Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: clusters-alvaro-test
  namespace: openshift-ingress-operator
  resourceVersion: "314326"
  uid: eddcf948-4b21-4e78-a297-fa1bab375d20
spec:
  domain: alvaro-test.hypershift.local
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: Internal
    type: LoadBalancerService
  httpErrorCodePages:
    name: ""
  routeSelector:
    matchLabels:
      hypershift.openshift.io/hosted-control-plane: clusters-alvaro-test
  tuningOptions: {}
  unsupportedConfigOverrides: null
```

The DNS for this ingresscontroller is managed outside of the ingress operator by a different controller. As a result, the ingress operator continuously tries to set up Route 53 entries for the ingresscontroller under the cluster's Route 53 zone, which can't work, as the ingresscontroller uses a completely different domain:

```
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2022-01-17T21:18:01Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2022-01-17T21:18:01Z"
    status: "True"
    type: PodsScheduled
  - lastTransitionTime: "2022-01-17T21:18:37Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2022-01-17T21:18:37Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2022-01-17T21:18:37Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2022-01-17T21:18:01Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2022-01-17T21:18:03Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2022-01-17T21:18:01Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2022-01-17T21:18:04Z"
    message: 'The record failed to provision in some zones: [{ map[Name:alvaro-host-l9kbq-int kubernetes.io/cluster/alvaro-host-l9kbq:owned]} {Z01753031XC9KEOLEZ50O map[]}]'
    reason: FailedZones
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2022-01-17T21:18:37Z"
    message: 'One or more status conditions indicate unavailable: DNSReady=False (FailedZones: The record failed to provision in some zones: [{ map[Name:alvaro-host-l9kbq-int kubernetes.io/cluster/alvaro-host-l9kbq:owned]} {Z01753031XC9KEOLEZ50O map[]}])'
    reason: IngressControllerUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2022-01-17T21:18:37Z"
    message: 'One or more other status conditions indicate a degraded state: DNSReady=False (FailedZones: The record failed to provision in some zones: [{ map[Name:alvaro-host-l9kbq-int kubernetes.io/cluster/alvaro-host-l9kbq:owned]} {Z01753031XC9KEOLEZ50O map[]}])'
    reason: DegradedConditions
    status: "True"
    type: Degraded
```

The cluster's base domain is not `hypershift.local`:

```
$ oc get dnses.config/cluster -o 'jsonpath={.spec.baseDomain}'
alvaro-host.alvaroaleman.hypershift.devcluster.openshift.com
```

OpenShift release version:

```
Server Version: 4.8.11
Kubernetes Version: v1.21.1+9807387
```

Cluster Platform: AWS

How reproducible:

Steps to Reproduce (in detail):
1. Apply the IngressController manifest from above

Actual results:
The ingress operator continuously tries to create Route 53 records for the ingresscontroller's domain in the cluster's Route 53 zones, fails, and reports the ingresscontroller as degraded (DNSReady=False).

Expected results:
The operator does not try to reconcile DNS records for a domain that is not below the cluster's base domain.

Impact of the problem:
The operator reports the IngressController as degraded even though it works fine, and its logs are full of errors from trying to set up the Route 53 records.

Additional info:
Setting blocker- as this isn't a regression or upgrade issue. It would make sense to change the ingress operator not to try to manage DNS for an ingresscontroller with a spec.domain that does not match the spec.baseDomain of the cluster DNS config (i.e., `oc get dnses.config/cluster -o 'jsonpath={.spec.baseDomain}'`). However, I'm nervous about making a change like that so close to code freeze for 4.10.0 or in a z-stream release, so it might be best to address this in a future y-stream release.
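For illustration, here is a minimal sketch of what that domain check could look like; the function and variable names are hypothetical and are not taken from the ingress operator's code:

```go
// Hypothetical sketch of the proposed check: manage DNS only when the
// ingresscontroller's spec.domain is the cluster base domain or a subdomain
// of it. Names here are illustrative, not the actual operator code.
package main

import (
	"fmt"
	"strings"
)

// domainMatchesBaseDomain reports whether domain equals baseDomain or is a
// subdomain of it, ignoring case and any trailing dot.
func domainMatchesBaseDomain(domain, baseDomain string) bool {
	d := strings.ToLower(strings.TrimSuffix(domain, "."))
	b := strings.ToLower(strings.TrimSuffix(baseDomain, "."))
	return d == b || strings.HasSuffix(d, "."+b)
}

func main() {
	base := "alvaro-host.alvaroaleman.hypershift.devcluster.openshift.com"

	// The domain from this report does not fall under the cluster base domain,
	// so DNS management would be skipped instead of the operator repeatedly
	// failing to provision Route 53 records for it.
	fmt.Println(domainMatchesBaseDomain("alvaro-test.hypershift.local", base)) // false
	fmt.Println(domainMatchesBaseDomain("apps."+base, base))                   // true
}
```

With the domains from this report, such a check would return false for `alvaro-test.hypershift.local`, and the operator could set DNSManaged=False rather than reporting the ingresscontroller as degraded.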
To follow up on comment 1, it would seem reasonable to apply the aforementioned change not only for AWS but for all platforms. However, Azure has unusual behavior with respect to DNS, as described in bug 1919151: when the operator creates a DNS record with a domain outside the hosted zone's domain, Azure concatenates the domains. For example, if an IngressController specifies the domain "apps.foo.tld" and the cluster domain is "bar.tld", then when the operator tries to create a DNS record for "*.apps.foo.tld", Azure creates a record "*.apps.foo.tld.bar.tld".

The situation gets more complicated if the cluster has different domains for the public zone and the private zone. In a test cluster, I noticed that the private zone's domain is a subdomain of the public zone's domain. So, for example, suppose the public zone has the domain "bar.tld", the private zone has the domain "baz.bar.tld", and the IngressController has the domain "apps.foo.tld"; then the operator tells Azure to create a DNS record for "*.apps.foo.tld", and Azure creates a DNS record for "*.apps.foo.tld.bar.tld" in the public zone and a DNS record for "*.apps.foo.tld.baz.bar.tld" in the private zone. This makes things *very* tricky. In order not to risk breaking existing Azure clusters, we could do one of the following:

* Apply the change only for AWS.
* Apply the change for all platforms except Azure to preserve the current behavior there.
* Apply the change for all platforms, add a big release note warning users of the new behavior on Azure, and maybe add logic in the previous version of OpenShift to set Upgradeable=False if some IngressController with endpointPublishingStrategy.type: LoadBalancerService has a domain outside the cluster's domain.

I hope no one actually wants the existing behavior on Azure; it is bizarre, undocumented, and not likely to be useful for any realistic use case. However, there is always the risk that if something is possible, someone may have come to rely on it, no matter how bizarre it is. If we apply the change more broadly than only for AWS, then we also need to investigate whether any relevant idiosyncrasies exist for the other supported cloud platforms: Alibaba, GCP, IBM Cloud, and Power VS.
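To make the options above concrete, here is a rough sketch of how a platform-conditional gate could look if Azure were exempted to preserve its current behavior; the types and names are hypothetical and do not come from the operator:

```go
// Rough sketch of a platform-conditional gate for the proposed domain check.
// All identifiers are hypothetical and for illustration only.
package main

import "fmt"

type platformType string

const (
	awsPlatform   platformType = "AWS"
	azurePlatform platformType = "Azure"
	gcpPlatform   platformType = "GCP"
)

// enforceDomainMatch reports whether the operator should refuse to manage DNS
// for an ingresscontroller whose domain is outside the cluster base domain.
func enforceDomainMatch(p platformType) bool {
	switch p {
	case azurePlatform:
		// Second option from the list above: skip the new check on Azure so the
		// existing domain-concatenation behavior keeps working for anyone who
		// might have come to rely on it.
		return false
	default:
		return true
	}
}

func main() {
	for _, p := range []platformType{awsPlatform, azurePlatform, gcpPlatform} {
		fmt.Printf("%s: enforce domain match = %v\n", p, enforceDomainMatch(p))
	}
}
```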
Verified in the "4.11.0-0.nightly-2022-06-25-081133" release version. With this payload deployed in HyperShift environments, the ingress operator no longer attempts to add Route 53 entries for controllers created with the HyperShift domain:

```
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         90m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133
```

Template for deploying the ingress controller:

```
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: intapps
  namespace: openshift-ingress-operator
spec:
  domain: intapps.hypershift-ci-736.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: Internal
    type: LoadBalancerService
  httpErrorCodePages:
    name: ""
  tuningOptions: {}
  unsupportedConfigOverrides: null
```

```
$ oc -n openshift-ingress-operator get ingresscontroller intapps -ojsonpath='{.spec}'
"domain": "intapps.hypershift-ci-736.qe.devcluster.openshift.com",
"endpointPublishingStrategy": {
  "loadBalancer": {
    "providerParameters": {
      "aws": {
        "type": "NLB"
      },
      "type": "AWS"
    },
    "scope": "Internal"
  },
  "type": "LoadBalancerService"

$ oc -n openshift-ingress-operator get ingresscontroller
NAME      AGE
default   127m
intapps   40m
```

Ingress operator logs after the controller creation:

```
$ oc -n openshift-ingress-operator logs pod/ingress-operator-6bf85c9ffc-kf8b7 -c ingress-operator | grep -i "intapps"
2022-06-27T07:00:05.536Z DEBUG operator.init.events record/event.go:311 Warning {"object": {"kind":"IngressController","namespace":"openshift-ingress-operator","name":"intapps","uid":"d9cdc356-639e-415a-88b3-5d1741ca1534","apiVersion":"operator.openshift.io/v1","resourceVersion":"59071"}, "reason": "DomainNotMatching", "message": "Domain [intapps.hypershift-ci-736.qe.devcluster.openshift.com] of ingresscontroller does not match the baseDomain [aiyengar411hi.qe.devcluster.openshift.com] of the cluster DNS config, so DNS management is not supported."} <=====
2022-06-27T07:00:05.543Z DEBUG operator.init.events record/event.go:311 Normal {"object": {"kind":"IngressController","namespace":"openshift-ingress-operator","name":"intapps","uid":"d9cdc356-639e-415a-88b3-5d1741ca1534","apiVersion":"operator.openshift.io/v1","resourceVersion":"59071"}, "reason": "Admitted", "message": "ingresscontroller passed validation"}
2022-06-27T07:00:05.544Z INFO operator.ingressclass_controller controller/controller.go:121 reconciling {"request": "openshift-ingress-operator/intapps"}
2022-06-27T07:00:05.544Z INFO operator.ingress_controller controller/controller.go:121 reconciling {"request": "openshift-ingress-operator/intapps"}
```

```
$ oc -n openshift-ingress-operator get ingresscontroller intapps -oyaml
  - lastTransitionTime: "2022-06-27T07:00:05Z"
    message: DNS management is not supported for ingresscontrollers with domain not matching the baseDomain of the cluster DNS config.
    reason: DomainNotMatching
    status: "False"
    type: DNSManaged
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069