Bug 1798789

Summary: Ingress-operator fails without route53 privileges
Product: OpenShift Container Platform Reporter: Paul Weil <pweil>
Component: Routing Assignee: Dan Mace <dmace>
Status: CLOSED ERRATA QA Contact: Hongan Li <hongli>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.3.0 CC: aos-bugs, dmace, hongli, scuppett, sjenning
Target Milestone: ---   
Target Release: 4.3.z   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1793541 Environment:
Last Closed: 2020-02-19 05:40:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1793541    
Bug Blocks: 1780703    

Description Paul Weil 2020-02-06 01:45:00 UTC
+++ This bug was initially created as a clone of Bug #1793541 +++

Created from investigation on private clusters in gov cloud type environments.  See https://docs.google.com/document/d/1rY5Wklqx8Rvjd-SynXwOvUvqaUotMmNKCqPN0IkTTpc/edit?ts=5e20e2c5#

ingress-operator fails without route53 privileges
Team: network-edge

$ oc logs ingress-operator-5ff98dcb6c-mzh6p -c ingress-operator
2020-01-16T19:29:59.650Z    INFO    operator.main    ingress-operator/start.go:80    using operator namespace    {"namespace": "openshift-ingress-operator"}
2020-01-16T19:29:59.659Z    ERROR    operator.main    ingress-operator/start.go:123    failed to create DNS manager    {"error": "failed to get cloud credentials from secret /: secrets \"cloud-credentials\" not found"

Even if the privateZone and publicZone blocks are removed from the cluster DNS spec, which indicates to the ingress-operator that DNS management should be disabled, the missing credentials secret is still fatal.

$ oc get dns cluster -oyaml
apiVersion: config.openshift.io/v1
kind: DNS
metadata:
  name: cluster
spec:
  baseDomain: sjenning.devcluster.openshift.com
# start removal
  privateZone:
    tags:
      Name: sjenning-h6n24-int
      kubernetes.io/cluster/sjenning-h6n24: owned
  publicZone:
    id: Z3URY6TWQ91KVV
# end removal
status: {}

In fact, the failure is so early, it happens before the operator creates the ingress ClusterOperator CR, so there is no top level visibility into why ingress is not starting.

Ingress failure causes the dependent operators (authentication, monitoring, and console) to be degraded or unavailable.

Proposed solutions
If the privateZone and publicZone blocks are removed from the cluster DNS CR, ingress-operator should not start the DNS manager (the part of the operator that requires all the AWS privileges).  In this way, ingress could get away with not needing an IAM user at all.

There is a catch here in that the ELB that the wildcard DNS record targets doesn't exist until the ingress operator creates a Service type LoadBalancer that results in the ELB's creation.  Thus the out-of-band mapping of the wildcard record to the ELB would need to be day-2.
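
The proposed behavior can be illustrated with a minimal Go sketch. It is not the operator's actual code: the types `DNSSpec` and `DNSZone`, the `ProviderKind` constants, and the function `selectDNSProvider` are hypothetical, simplified stand-ins for the cluster DNS config and the operator's startup logic. The sketch only shows the decision itself: when neither zone is set, pick a no-op ("fake") provider that needs no cloud credentials.

```go
package main

import "fmt"

// DNSZone is a hypothetical, simplified stand-in for a zone reference
// in the cluster DNS config (identified by ID or by tags).
type DNSZone struct {
	ID   string
	Tags map[string]string
}

// DNSSpec is a hypothetical, simplified stand-in for the spec of the
// cluster DNS config shown above.
type DNSSpec struct {
	BaseDomain  string
	PrivateZone *DNSZone
	PublicZone  *DNSZone
}

// ProviderKind names the kind of DNS provider the operator would start.
type ProviderKind string

const (
	// FakeProvider performs no DNS management and needs no credentials.
	FakeProvider ProviderKind = "fake"
	// CloudProvider manages records and needs cloud credentials.
	CloudProvider ProviderKind = "cloud"
)

// selectDNSProvider returns FakeProvider when neither a private nor a
// public zone is configured, and CloudProvider otherwise.
func selectDNSProvider(spec DNSSpec) ProviderKind {
	if spec.PrivateZone == nil && spec.PublicZone == nil {
		return FakeProvider
	}
	return CloudProvider
}

func main() {
	withoutZones := DNSSpec{BaseDomain: "example.com"}
	withZone := DNSSpec{
		BaseDomain: "example.com",
		PublicZone: &DNSZone{ID: "ZEXAMPLE"},
	}
	fmt.Println(selectDNSProvider(withoutZones))
	fmt.Println(selectDNSProvider(withZone))
}
```

With this shape, the credentials secret would only need to be read on the `CloudProvider` path, so a cluster with both zones removed would never hit the "cloud-credentials not found" error.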

--- Additional comment from Stephen Cuppett on 2020-01-21 15:18:11 UTC ---

Setting target release to the active development version (4.4). Fixes, if any, where requested/required for previous versions will result in clones targeting those z-stream releases.

--- Additional comment from errata-xmlrpc on 2020-01-24 16:07:35 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2019:47983-01

--- Additional comment from Hongan Li on 2020-02-03 08:29:49 UTC ---

Verified with 4.4.0-0.nightly-2020-02-02-201619; the issue has been fixed.

No error is reported after removing the privateZone and publicZone, and the logs show:

$ oc -n openshift-ingress-operator logs ingress-operator-88f857479-tshnc -c ingress-operator | grep -i "fake dns"
2020-02-03T07:38:26.827Z	INFO	operator.main	ingress-operator/start.go:218	using fake DNS provider because no public or private zone is defined in the cluster DNS configuration

Comment 4 Hongan Li 2020-02-10 03:09:58 UTC
Verified with 4.3.0-0.nightly-2020-02-09-195913; the issue has been fixed.

After removing both the public and private zones from dns/cluster and restarting ingress-operator, no error is reported.

$ oc -n openshift-ingress-operator logs ingress-operator-6454978cf5-npqf4 -c ingress-operator | grep  DNS
2020-02-10T02:52:01.626Z	INFO	operator.main	ingress-operator/start.go:201	using fake DNS provider because no public or private zone is defined in the cluster DNS configuration

$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.3.0-0.nightly-2020-02-09-195913   True        False         False      20m

Comment 6 errata-xmlrpc 2020-02-19 05:40:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.