Bug 1798789 - Ingress-operator fails without route53 privileges
Summary: Ingress-operator fails without route53 privileges
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.3.z
Assignee: Dan Mace
QA Contact: Hongan Li
Depends On: 1793541
Blocks: 1780703
Reported: 2020-02-06 01:45 UTC by Paul Weil
Modified: 2020-02-19 05:40 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1793541
Last Closed: 2020-02-19 05:40:10 UTC
Target Upstream Version:


System ID Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 356 None closed [release-4.3] Bug 1798789: Only initialize DNS providers if DNS config is present 2020-05-22 07:19:47 UTC
Red Hat Product Errata RHBA-2020:0492 None None None 2020-02-19 05:40:27 UTC

Description Paul Weil 2020-02-06 01:45:00 UTC
+++ This bug was initially created as a clone of Bug #1793541 +++

Created from investigation on private clusters in gov cloud type environments.  See https://docs.google.com/document/d/1rY5Wklqx8Rvjd-SynXwOvUvqaUotMmNKCqPN0IkTTpc/edit?ts=5e20e2c5#

ingress-operator fails without route53 privileges
Team: network-edge

$ oc logs ingress-operator-5ff98dcb6c-mzh6p -c ingress-operator
2020-01-16T19:29:59.650Z    INFO    operator.main    ingress-operator/start.go:80    using operator namespace    {"namespace": "openshift-ingress-operator"}
2020-01-16T19:29:59.659Z    ERROR    operator.main    ingress-operator/start.go:123    failed to create DNS manager    {"error": "failed to get cloud credentials from secret /: secrets \"cloud-credentials\" not found"}

Even if the privateZone and publicZone blocks are removed from the cluster DNS spec, which signals to the ingress-operator that DNS management should be disabled, the operator still fails fatally.

$ oc get dns cluster -oyaml
apiVersion: config.openshift.io/v1
kind: DNS
metadata:
  name: cluster
spec:
  baseDomain: sjenning.devcluster.openshift.com
# start removal
  privateZone:
    tags:
      Name: sjenning-h6n24-int
      kubernetes.io/cluster/sjenning-h6n24: owned
  publicZone:
    id: Z3URY6TWQ91KVV
# end removal
status: {}

In fact, the failure occurs so early that it happens before the operator creates the ingress ClusterOperator CR, so there is no top-level visibility into why ingress is not starting.

Ingress failure causes the dependent authentication, monitoring, and console operators to become degraded or unavailable.
Proposed solutions
If the privateZone and publicZone blocks are removed from the cluster DNS CR, the ingress-operator should not start the DNS manager (the part of the operator that requires all the AWS privileges). In this way, ingress could get away with not needing an IAM user at all.

There is a catch here in that the ELB that the wildcard DNS record targets doesn't exist until the ingress operator creates a Service type LoadBalancer that results in the ELB's creation.  Thus the out-of-band mapping of the wildcard record to the ELB would need to be day-2.
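The proposed guard amounts to a small check over the cluster DNS spec: initialize a real, cloud-backed DNS provider only when at least one zone is defined, and otherwise fall back to a no-op ("fake") provider. The sketch below illustrates the idea using simplified stand-in types; it is not the operator's actual code (the real types live in github.com/openshift/api and the real fix is in cluster-ingress-operator PR 356).

```go
package main

import "fmt"

// DNSZone is a simplified stand-in for the zone reference in the
// cluster DNS config: either a hosted-zone ID or a set of tags.
type DNSZone struct {
	ID   string
	Tags map[string]string
}

// DNSSpec is a simplified stand-in for the cluster DNS spec.
type DNSSpec struct {
	BaseDomain  string
	PublicZone  *DNSZone
	PrivateZone *DNSZone
}

// needsDNSProvider reports whether a real (cloud-credentialed) DNS
// provider is required. With both zones unset, the operator can use a
// no-op provider and never needs route53 privileges at all.
func needsDNSProvider(spec DNSSpec) bool {
	return spec.PublicZone != nil || spec.PrivateZone != nil
}

func main() {
	noZones := DNSSpec{BaseDomain: "example.com"}
	withZone := DNSSpec{
		BaseDomain: "example.com",
		PublicZone: &DNSZone{ID: "Z3URY6TWQ91KVV"},
	}
	fmt.Println(needsDNSProvider(noZones))  // false -> fake DNS provider
	fmt.Println(needsDNSProvider(withZone)) // true  -> real DNS provider
}
```

This matches the behavior later confirmed by QE: with both zones removed, the operator logs that it is "using fake DNS provider" instead of failing on missing cloud credentials.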

--- Additional comment from Stephen Cuppett on 2020-01-21 15:18:11 UTC ---

Setting target release to the active development version (4.4). Fixes, if any, where requested/required for previous versions will result in clones targeting those z-stream releases.

--- Additional comment from errata-xmlrpc on 2020-01-24 16:07:35 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2019:47983-01

--- Additional comment from Hongan Li on 2020-02-03 08:29:49 UTC ---

Verified with 4.4.0-0.nightly-2020-02-02-201619; the issue has been fixed.

No error after removing the privateZone and publicZone; the logs show:

$ oc -n openshift-ingress-operator logs ingress-operator-88f857479-tshnc -c ingress-operator | grep -i "fake dns"
2020-02-03T07:38:26.827Z	INFO	operator.main	ingress-operator/start.go:218	using fake DNS provider because no public or private zone is defined in the cluster DNS configuration

Comment 4 Hongan Li 2020-02-10 03:09:58 UTC
Verified with 4.3.0-0.nightly-2020-02-09-195913; the issue has been fixed.

Removed both the public and private zone from dns/cluster and restarted the ingress-operator; no error was reported.

$ oc -n openshift-ingress-operator logs ingress-operator-6454978cf5-npqf4 -c ingress-operator | grep  DNS
2020-02-10T02:52:01.626Z	INFO	operator.main	ingress-operator/start.go:201	using fake DNS provider because no public or private zone is defined in the cluster DNS configuration

$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.3.0-0.nightly-2020-02-09-195913   True        False         False      20m

Comment 6 errata-xmlrpc 2020-02-19 05:40:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

