Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1765776

Summary: OCP 4.2.2 install fails to initialize with authentication and console operators failing to rollout - unable to check route health: failed to GET route
Product: OpenShift Container Platform Reporter: Walid A. <wabouham>
Component: Networking Assignee: Dan Mace <dmace>
Networking sub component: router QA Contact: Hongan Li <hongli>
Status: CLOSED DUPLICATE Docs Contact:
Severity: urgent    
Priority: unspecified CC: aos-bugs, mifiedle, wking
Version: 4.2.0   
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-10-26 00:22:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Walid A. 2019-10-25 23:29:22 UTC
Description of problem:
OCP 4.2.2 IPI install on AWS fails after 54 minutes with the authentication and console cluster operators still updating.

Install log shows:
time="2019-10-25T16:56:15Z" level=fatal msg="failed to initialize the cluster: Working towards 4.2.2: 100% complete"

CVO log shows:
I1025 19:11:32.016838       1 task_graph.go:611] Result of work: [Cluster operator authentication is still updating Cluster operator console has not yet reported success]
I1025 19:11:32.016856       1 sync_worker.go:741] Summarizing 2 errors
I1025 19:11:32.016862       1 sync_worker.go:745] Update error 135 of 433: ClusterOperatorNotAvailable Cluster operator authentication is still updating (*errors.errorString: cluster operator authentication is still updating)
I1025 19:11:32.016870       1 sync_worker.go:745] Update error 294 of 433: ClusterOperatorNotAvailable Cluster operator console has not yet reported success (*errors.errorString: cluster operator console is not done; it is available=false, progressing=true, degraded=false)
E1025 19:11:32.016885       1 sync_worker.go:311] unable to synchronize image (waiting 2m52.525702462s): Some cluster operators are still updating: authentication, console
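
When triaging several failed installs, the operator names can be pulled out of the CVO "still updating" line mechanically. The following is a minimal Python sketch, not part of any OpenShift tooling; the helper name `pending_operators` and the regular expression are assumptions made for this illustration and only match the message format shown in the log above.

```python
import re

# Hypothetical helper (not an OpenShift API): matches the tail of a CVO
# sync_worker line such as
#   "... Some cluster operators are still updating: authentication, console"
_PENDING = re.compile(r"Some cluster operators are still updating: (.+)$")


def pending_operators(line):
    """Return the operator names listed in a CVO 'still updating' line."""
    match = _PENDING.search(line.strip())
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


sample = (
    "E1025 19:11:32.016885       1 sync_worker.go:311] unable to synchronize "
    "image (waiting 2m52.525702462s): Some cluster operators are still "
    "updating: authentication, console"
)
print(pending_operators(sample))  # ['authentication', 'console']
```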

Authentication operator logs show:

E1025 19:36:46.938680       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: dial tcp: lookup oauth-openshift.apps.<cluster-domain-name> on 172.30.0.10:53: no such host
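
The "no such host" error above can be checked independently with a plain DNS lookup of the route hostname. The sketch below is a rough illustration in Python using only the standard library; the function names `route_host` and `resolves` are hypothetical, and the lookup uses whatever resolver the machine running the script is configured with, which is not necessarily the in-cluster resolver at 172.30.0.10 that the operator queried.

```python
import socket


def route_host(cluster_domain, route="oauth-openshift"):
    """Build the route hostname under the cluster's *.apps wildcard domain."""
    return f"{route}.apps.{cluster_domain}"


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves, False if the resolver reports an error.

    The resolver argument exists only so the check can be exercised
    without network access; by default the system resolver is used.
    """
    try:
        resolver(hostname, 443)
    except socket.gaierror:
        return False
    return True
```

For example, `resolves(route_host("mycluster.example.com"))` (with a placeholder cluster domain) would be expected to return False while the wildcard record is missing.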


The Route53 hosted zone for this cluster is missing the A record for *.apps.

Also:

# oc -n openshift-ingress-operator get -o yaml dnsrecord default-wildcard
Error from server (NotFound): dnsrecords.ingress.operator.openshift.io "default-wildcard" not found
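
The missing *.apps A record can also be confirmed from a listing of the hosted zone's record sets. The following Python sketch assumes the records are supplied as a list of dictionaries with `Name` and `Type` keys, the shape of the `ResourceRecordSets` entries returned by the Route 53 ListResourceRecordSets API; fetching that list is left out. The function name `has_wildcard_a_record` is hypothetical. Route 53 returns the asterisk in record names as the escape `\052`, so the sketch normalizes that before comparing.

```python
def has_wildcard_a_record(record_sets, cluster_domain):
    """Return True if record_sets contains an A record for *.apps.<cluster_domain>.

    record_sets: list of dicts with "Name" and "Type" keys (assumed shape).
    """
    wanted = f"*.apps.{cluster_domain}".rstrip(".").lower()
    for record in record_sets:
        # Route 53 reports "*" as the octal escape "\052" and appends a trailing dot.
        name = record.get("Name", "").replace("\\052", "*").rstrip(".").lower()
        if name == wanted and record.get("Type") == "A":
            return True
    return False
```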

Version-Release number of selected component (if applicable):


How reproducible:

Reproduced twice so far

Steps to Reproduce:
1. AWS IPI install of OCP version 4.2.2 with OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE: quay.io/openshift-release-dev/ocp-release:4.2.2

2. The install fails after around 54 minutes, as shown in the install logs.
3. `oc get co` shows auth and console operators progressing or degraded

Actual results:

Install fails after 54 minutes.

Expected results:
Install to succeed and all cluster operators to be available and not progressing/degraded
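
The expected state (every cluster operator available and neither progressing nor degraded) can be expressed as a check over the JSON form of the operator list, for example the output of `oc get co -o json`. This is a minimal Python sketch under that assumption; the function name `unhealthy_operators` and the sample data are made up for illustration, with the console conditions taken from the CVO message in the description.

```python
import json


def unhealthy_operators(co_list_json):
    """Return names of operators that are not Available=True,
    Progressing=False and Degraded=False."""
    doc = json.loads(co_list_json)
    unhealthy = []
    for item in doc.get("items", []):
        conditions = {
            c["type"]: c["status"]
            for c in item.get("status", {}).get("conditions", [])
        }
        healthy = (
            conditions.get("Available") == "True"
            and conditions.get("Progressing") == "False"
            and conditions.get("Degraded") == "False"
        )
        if not healthy:
            unhealthy.append(item["metadata"]["name"])
    return unhealthy


# Illustrative sample only: "console" mirrors available=false,
# progressing=true, degraded=false from the CVO log; "dns" is healthy.
sample_json = json.dumps({
    "items": [
        {"metadata": {"name": "console"},
         "status": {"conditions": [
             {"type": "Available", "status": "False"},
             {"type": "Progressing", "status": "True"},
             {"type": "Degraded", "status": "False"}]}},
        {"metadata": {"name": "dns"},
         "status": {"conditions": [
             {"type": "Available", "status": "True"},
             {"type": "Progressing", "status": "False"},
             {"type": "Degraded", "status": "False"}]}},
    ]
})
print(unhealthy_operators(sample_json))  # ['console']
```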

Additional info:
Links to the operator logs and `oc adm must-gather` logs will be in the next comment.

Comment 2 W. Trevor King 2019-10-25 23:41:01 UTC
Changing the component to Routing, because the ingress operator sets up the DNSRecords (the DNS operator is about in-cluster CoreDNS resolution, not about managing the out-of-cluster Route 53 records).

Comment 3 Dan Mace 2019-10-26 00:22:06 UTC
Already tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1765282.

*** This bug has been marked as a duplicate of bug 1765282 ***