Bug 1942657

Summary: ingress operator stays degraded after privateZone fixed in DNS
Product: OpenShift Container Platform
Reporter: Matthew Staebler <mstaeble>
Component: Networking
Networking sub component: router
Assignee: Luigi Mario Zuccarelli <luzuccar>
QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA
Severity: low
Priority: low
CC: amcdermo, aos-bugs, mmasters, rfredette, wking
Version: 4.8
Target Release: 4.9.0
Doc Type: Bug Fix
Doc Text:
Cause: The .spec.privateZone field of the dns.config.openshift.io resource is filled out incorrectly, so the ingress operator cannot find the private hosted zone and reports Degraded=True.
Consequence: Even after the .spec.privateZone field is fixed, the ingress operator stays degraded. The operator finds the hosted zone and adds the *.apps resource record, but it does not reset the degraded status.
Fix: The ingress operator now watches the cluster DNS config object for changes to the .spec.privateZone field and updates the operator status accordingly.
Result: The operator's Degraded condition returns to False once a correct .spec.privateZone field is set.
Last Closed: 2021-10-18 17:29:50 UTC
Type: Bug

Description Matthew Staebler 2021-03-24 17:00:54 UTC
Description of problem:
If the .spec.privateZone field of the dns.config.openshift.io resource is filled out incorrectly, so that the ingress operator cannot find the private hosted zone, the ingress operator goes degraded. That is good. However, even after fixing the .spec.privateZone field, the ingress operator stays degraded. The ingress operator finds the hosted zone and adds the *.apps resource record, but it does not reset the degraded status.
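
For reference, the field in question can be inspected on a running cluster (output abbreviated; the zone tag values shown here are hypothetical):

$ oc get dns.config.openshift.io cluster -o yaml
...
spec:
  baseDomain: example.devcluster.openshift.com
  privateZone:
    tags:
      Name: example-w96j7-int
      kubernetes.io/cluster/example-w96j7: owned
...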


How reproducible:
I only attempted once, so I cannot say whether it is reproducible.


Steps to Reproduce:
1. openshift-install create manifests
2. Change the .spec.privateZone fields in the manifests/cluster-dns-02-config.yml file (a sample of the relevant stanza appears after these steps).
3. openshift-install create cluster
4. Wait for the cluster to install enough for the ingress operator to report degraded due to "FailedZones: The record failed to provision in some zones: [...]"
5. Fix the .spec.privateZone fields via `oc edit dns.config.openshift.io cluster`.
6. Wait for the ingress operator to create the *.apps record in the hosted zone.
7. Observe that the ingress operator is still degraded with the same "FailedZones" message.
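
Example of the manifest edit in step 2 (a minimal sketch; the cluster name and tag values are hypothetical, and the Name tag only needs to be wrong enough that the zone lookup fails):

$ cat manifests/cluster-dns-02-config.yml
apiVersion: config.openshift.io/v1
kind: DNS
metadata:
  name: cluster
spec:
  baseDomain: example.devcluster.openshift.com
  privateZone:
    tags:
      Name: example-w96j7-xx-int                  <---- broken on purpose
      kubernetes.io/cluster/example-w96j7: owned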

Actual results:
ingress operator remains degraded

Expected results:
The ingress operator should reset its degraded status and report Degraded=False.

Comment 1 Stephen Greene 2021-04-23 17:22:04 UTC
(In reply to Matthew Staebler from comment #0)
> How reproducible:
> I only attempted once, so I cannot say whether it is reproducible.

This is reproducible post-installation by modifying .spec.privateZone in the cluster DNS config to an invalid zone value, and then reverting the change.
The DNS controller does not remove the status condition for the zone that we no longer care about.
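
For example (a sketch; the tag values are hypothetical, and the wildcard DNSRecord for the default ingresscontroller is assumed to be named default-wildcard in the openshift-ingress-operator namespace):

# break the private zone lookup
$ oc patch dns.config.openshift.io cluster --type=merge \
    -p '{"spec":{"privateZone":{"tags":{"Name":"example-w96j7-xx-int"}}}}'
# wait for co/ingress to report Degraded=True, then revert
$ oc patch dns.config.openshift.io cluster --type=merge \
    -p '{"spec":{"privateZone":{"tags":{"Name":"example-w96j7-int"}}}}'
# the stale failed-zone condition remains visible under status.zones
$ oc -n openshift-ingress-operator get dnsrecord default-wildcard -o yaml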

Workaround would be to delete the DNSRecord resource and let the DNS operator re-create it (which is inconvenient).
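
A sketch of that workaround, assuming the same default-wildcard DNSRecord name:

$ oc -n openshift-ingress-operator delete dnsrecord default-wildcard
# the operator re-creates the record and recomputes the zone status conditions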

Comment 4 Hongan Li 2021-08-18 04:01:44 UTC
Verified with 4.9.0-0.nightly-2021-08-17-122812 and passed.

steps:
1. oc adm release extract --command=openshift-install registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-08-17-122812 -a ../pull-secret-full.txt
2. ./openshift-install create manifests --dir ./818/
3. change the .spec.privateZone fields in the manifests/cluster-dns-02-config.yml file.
4. ./openshift-install create cluster --dir ./818/
5. ingress reports the below error during the installation:

The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DNSReady=False (FailedZones: The record failed to provision in some zones: [{ map[Name:hongli-xx-j7tk5-int kubernetes.io/cluster/hongli-bz-j7tk5:owned]}])


6. oc edit dnses.config.openshift.io cluster
<---snip--->
  privateZone:
    tags:
      Name: hongli-bz-j7tk5-int                           <---- update from hongli-xx-j7tk5-int to correct one
      kubernetes.io/cluster/hongli-bz-j7tk5: owned

7. wait for a while and find ingress is available
$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.9.0-0.nightly-2021-08-17-122812   True        False         False      3m52s   

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-17-122812   True        False         11s     Cluster version is 4.9.0-0.nightly-2021-08-17-122812


install log:
$ ./openshift-install create cluster --dir ./818/
INFO Consuming Worker Machines from target directory 
INFO Consuming Common Manifests from target directory 
INFO Consuming OpenShift Install (Manifests) from target directory 
INFO Consuming Master Machines from target directory 
INFO Consuming Openshift Manifests from target directory 
INFO Credentials loaded from the "default" profile in file "/home/hongan/.aws/credentials" 
INFO Creating infrastructure resources...         
INFO Waiting up to 20m0s for the Kubernetes API at https://api.hongli-bz.qe.devcluster.openshift.com:6443... 
INFO API v1.22.0-rc.0+3dfed96 up                  
INFO Waiting up to 30m0s for bootstrapping to complete... 
INFO Destroying the bootstrap resources...        
INFO Waiting up to 40m0s for the cluster at https://api.hongli-bz.qe.devcluster.openshift.com:6443 to initialize... 
INFO Waiting up to 10m0s for the openshift-console route to be created... 
INFO Install complete!

Comment 8 errata-xmlrpc 2021-10-18 17:29:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759