Bug 1849036

Summary: [IPI][OSP] Installer fails on proxy + externalDNS configuration. Authentication cluster
Product: OpenShift Container Platform Reporter: David Sanz <dsanzmor>
Component: Installer Assignee: Martin André <m.andre>
Installer sub component: OpenShift on OpenStack QA Contact: David Sanz <dsanzmor>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: m.andre, scuppett, wking
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:08:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1851344, 1866379, 1868178    
Bug Blocks:    

Description David Sanz 2020-06-19 13:59:20 UTC
Description of problem:

When installing with proxy + externalDNS, the authentication cluster operator never becomes Available.

Logs from authentication-operator pod:

E0619 13:51:20.270807       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: dial tcp 10.0.111.196:443: i/o timeout

10.0.111.196 is the API floating IP:

[morenod@morenod-laptop ~]$ openstack floating ip list --long | grep 10.0.111.196
| d7bb6c14-82e0-45d6-9e66-875b6bb8d72e | 10.0.111.196        | 192.168.0.5      | 6e88657a-e418-40e5-8631-5a43ab54e70a | 316eeb47-1498-46b4-b39e-00ddf73bd2a5 | 542c6ebd48bf40fa857fc245c7572e30 | a4684936-c0d0-491b-b623-7659ba1ea501 | None   | preserve mrnd-13-46-px                                     | []   | None     | None       |
[morenod@morenod-laptop ~]$ openstack port list | grep mrnd-13-46-px | grep 192.168.0.5
| 6e88657a-e418-40e5-8631-5a43ab54e70a | mrnd-13-46-px-l6tr2-api-port       | fa:16:3e:c7:79:f7 | ip_address='192.168.0.5', subnet_id='8f3467f5-2771-4f1c-af5f-656ea1ee1657'                      | DOWN   |


By comparison, in an installation without proxy or externalDNS, the check at controller.go:129 fails with an EOF rather than a timeout:

E0619 13:00:51.627750       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: EOF
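
To compare the two failures side by side, the log lines can be classified mechanically. The following is only a rough sketch, assuming the standard klog header layout (severity letter, mmdd, time, thread id, file:line); the function name classify_route_failure and the returned dictionary keys are made up for illustration and are not part of the operator.

```python
import re

# klog header: Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg
KLOG_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<tid>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def classify_route_failure(log_line):
    """Return a small summary for a 'failed to GET route' log line.

    Hypothetical helper: returns None when the line is not a klog line
    or does not report a route GET failure.
    """
    m = KLOG_RE.match(log_line.strip())
    if m is None:
        return None
    msg = m.group("msg")
    if "failed to GET route" not in msg:
        return None

    # Distinguish the two failure modes seen in this report.
    if msg.endswith("i/o timeout"):
        kind = "timeout"
    elif msg.endswith("EOF"):
        kind = "eof"
    else:
        kind = "other"

    # A dial error carries the address that could not be reached.
    dial = re.search(r"dial tcp (\S+?):(\d+):", msg)
    target = (dial.group(1), int(dial.group(2))) if dial else None

    return {
        "source": f"{m.group('file')}:{m.group('line')}",
        "kind": kind,
        "target": target,
    }
```

Fed the two controller.go:129 lines above, this would label the proxy + externalDNS case as a timeout against 10.0.111.196:443 and the plain installation as an EOF with no dial target.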

The EOF case seems to be handled, as the operator becomes available even with the EOF error:

I0619 13:00:53.430840       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"9dbd31c7-0d3b-4e8e-9360-0b4774253913", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Degraded changed from True to False ("RouteHealthDegraded: failed to GET route: EOF")
I0619 13:01:01.413373       1 status_controller.go:172] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2020-06-19T13:00:53Z","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-06-19T13:01:01Z","message":"Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://192.168.3.157:6443/.well-known/oauth-authorization-server endpoint data","reason":"_WellKnownNotReady","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-06-19T13:01:01Z","status":"False","type":"Available"},{"lastTransitionTime":"2020-06-19T12:49:58Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0619 13:01:01.428440       1 status_controller.go:172] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2020-06-19T13:00:53Z","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-06-19T13:01:01Z","message":"Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://192.168.3.157:6443/.well-known/oauth-authorization-server endpoint data","reason":"_WellKnownNotReady","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-06-19T13:01:01Z","status":"False","type":"Available"},{"lastTransitionTime":"2020-06-19T12:49:58Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0619 13:01:01.428948       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"9dbd31c7-0d3b-4e8e-9360-0b4774253913", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Degraded message changed from "RouteHealthDegraded: failed to GET route: EOF" to "",Progressing changed from Unknown to True ("Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://192.168.3.157:6443/.well-known/oauth-authorization-server endpoint data"),Available changed from Unknown to False ("")
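
The status_controller.go diff lines above embed the condition list as JSON, so the condition states can be pulled out without reading the whole line. This is a minimal sketch, assuming the payload always follows the literal text " diff " as it does in these logs; conditions_from_diff and is_available are hypothetical names used only here.

```python
import json


def conditions_from_diff(log_line):
    """Map condition type -> status from a status_controller 'diff' log line.

    Assumes the JSON document starts right after the text ' diff '.
    Returns an empty dict when the marker is missing.
    """
    marker = " diff "
    idx = log_line.find(marker)
    if idx == -1:
        return {}
    doc = json.loads(log_line[idx + len(marker):])
    conditions = doc.get("status", {}).get("conditions", [])
    return {c["type"]: c["status"] for c in conditions}


def is_available(conditions):
    """Hypothetical check: the operator reports Available=True."""
    return conditions.get("Available") == "True"
```

For the 13:01:01 diff lines this gives Degraded False, Progressing True, Available False, and Upgradeable True, which matches the transition described in the event that follows them.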

Version-Release number of the following components:
4.6.0-0.nightly-2020-06-19-051412

How reproducible:

Steps to Reproduce:
1. Install IPI on OSP using proxy and externalDNS
2. Check the status of the authentication cluster operator

Actual results:
The authentication and console cluster operators (console depends on authentication) never become Available, which causes the installation to fail.

Expected results:
The authentication cluster operator handles the timeout the same way it handles the EOF error and continues its initialization.


Additional info:

Comment 1 Stephen Cuppett 2020-06-19 14:25:24 UTC
Setting target release to current development version (4.6) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 2 Martin André 2020-06-25 14:09:07 UTC
The team considers this bug as valid. Considering this bug priority and our capacity, we are deferring this bug to an upcoming sprint. If there are reasons for us to reprioritise, please let us know.

Comment 3 David Sanz 2020-07-16 15:15:01 UTC
Verified on 4.6.0-0.nightly-2020-07-15-170241

Comment 6 errata-xmlrpc 2020-10-27 16:08:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196