Bug 1793541

Summary: Ingress-operator fails without route53 privileges
Product: OpenShift Container Platform Reporter: Paul Weil <pweil>
Component: Networking Assignee: Dan Mace <dmace>
Networking sub component: router QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: aos-bugs, scuppett, sjenning
Version: 4.3.0   
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1798789 (view as bug list) Environment:
Last Closed: 2020-05-04 11:25:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1798789    

Description Paul Weil 2020-01-21 14:28:32 UTC
Created from investigation on private clusters in gov cloud type environments.  See https://docs.google.com/document/d/1rY5Wklqx8Rvjd-SynXwOvUvqaUotMmNKCqPN0IkTTpc/edit?ts=5e20e2c5#

ingress-operator fails without route53 privileges
Team: network-edge

$ oc logs ingress-operator-5ff98dcb6c-mzh6p -c ingress-operator
2020-01-16T19:29:59.650Z    INFO    operator.main    ingress-operator/start.go:80    using operator namespace    {"namespace": "openshift-ingress-operator"}
2020-01-16T19:29:59.659Z    ERROR    operator.main    ingress-operator/start.go:123    failed to create DNS manager    {"error": "failed to get cloud credentials from secret /: secrets \"cloud-credentials\" not found"

Even if the privateZone and publicZone blocks are removed from the cluster DNS spec, which indicates to the ingress-operator that DNS management should be disabled, the missing cloud-credentials secret is still fatal.

$ oc get dns cluster -oyaml
apiVersion: config.openshift.io/v1
kind: DNS
metadata:
  name: cluster
...
spec:
  baseDomain: sjenning.devcluster.openshift.com
# start removal
  privateZone:
    tags:
      Name: sjenning-h6n24-int
      kubernetes.io/cluster/sjenning-h6n24: owned
  publicZone:
    id: Z3URY6TWQ91KVV
# end removal
status: {}

In fact, the failure is so early, it happens before the operator creates the ingress ClusterOperator CR, so there is no top level visibility into why ingress is not starting.

The ingress failure causes the dependent operators (authentication, monitoring, and console) to become degraded or unavailable.

Proposed solutions
If the privateZone and publicZone blocks are removed from the cluster DNS CR, ingress-operator should not start the DNS manager (the part of the operator that requires all the AWS privileges).  In this way, ingress could get away with not needing an IAM user at all.
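
The proposed behavior can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the types DNSSpec, DNSZone, and Provider, the function selectProvider, and the variable errSecretNotFound are all hypothetical names invented for this sketch. The idea is only that credentials are never looked up when neither zone is defined, so a missing secret cannot be fatal in that case.

```go
package main

import (
	"errors"
	"fmt"
)

// DNSZone is a simplified, hypothetical stand-in for a zone reference in the
// cluster DNS spec: a zone is identified either by ID or by tags.
type DNSZone struct {
	ID   string
	Tags map[string]string
}

// DNSSpec is a simplified, hypothetical stand-in for the cluster DNS spec.
type DNSSpec struct {
	BaseDomain  string
	PrivateZone *DNSZone
	PublicZone  *DNSZone
}

// Provider is a hypothetical interface for whatever manages DNS records.
type Provider interface {
	Name() string
}

// fakeProvider does nothing; it stands in when DNS management is disabled.
type fakeProvider struct{}

func (fakeProvider) Name() string { return "fake" }

// cloudProvider represents a real provider that needs cloud credentials.
type cloudProvider struct{}

func (cloudProvider) Name() string { return "cloud" }

// errSecretNotFound simulates the missing cloud-credentials secret.
var errSecretNotFound = errors.New(`secrets "cloud-credentials" not found`)

// selectProvider returns a no-op provider when neither zone is defined and
// only consults loadCreds (an assumed credential loader) otherwise.
func selectProvider(spec DNSSpec, loadCreds func() error) (Provider, error) {
	if spec.PrivateZone == nil && spec.PublicZone == nil {
		// DNS management is disabled: skip the credential lookup entirely.
		return fakeProvider{}, nil
	}
	if err := loadCreds(); err != nil {
		return nil, fmt.Errorf("failed to get cloud credentials: %w", err)
	}
	return cloudProvider{}, nil
}

func main() {
	noCreds := func() error { return errSecretNotFound }

	// No zones defined: the missing secret is not an error.
	p, err := selectProvider(DNSSpec{BaseDomain: "example.com"}, noCreds)
	fmt.Println(p.Name(), err)

	// A zone is defined: the missing secret is still reported.
	_, err = selectProvider(DNSSpec{PublicZone: &DNSZone{ID: "Z0000000000000"}}, noCreds)
	fmt.Println(err)
}
```

With this shape, a cluster whose DNS spec has no zones would start without any IAM user, while a cluster that does define a zone would still report the credential failure.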

There is a catch here in that the ELB that the wildcard DNS record targets doesn't exist until the ingress operator creates a Service of type LoadBalancer, which results in the ELB's creation.  Thus the out-of-band mapping of the wildcard record to the ELB would need to be a day-2 operation.

Comment 1 Stephen Cuppett 2020-01-21 15:18:11 UTC
Setting target release to the active development version (4.4). Fixes, if any, where requested/required for previous versions will result in clones targeting those z-stream releases.

Comment 3 Hongan Li 2020-02-03 08:29:49 UTC
Verified with 4.4.0-0.nightly-2020-02-02-201619; the issue has been fixed.

No error occurs after removing the privateZone and publicZone blocks, and the logs show:

$ oc -n openshift-ingress-operator logs ingress-operator-88f857479-tshnc -c ingress-operator | grep -i "fake dns"
2020-02-03T07:38:26.827Z	INFO	operator.main	ingress-operator/start.go:218	using fake DNS provider because no public or private zone is defined in the cluster DNS configuration

Comment 5 errata-xmlrpc 2020-05-04 11:25:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581