Description of problem:
Authentication operator reports RouteHealthDegraded: failed to GET route: dial tcp 192.168.0.7:443: connect: connection refused
Version-Release number of selected component (if applicable):
[ramakasturinarra@dhcp35-60 cucushift]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.4.0-0.nightly-2020-03-30-180504 True False 131m Error while reconciling 4.4.0-0.nightly-2020-03-30-180504: the cluster operator authentication is degraded
[ramakasturinarra@dhcp35-60 cucushift]$ oc describe co/authentication
API Version: config.openshift.io/v1
Creation Timestamp: 2020-03-31T02:54:32Z
Resource Version: 194866
Self Link: /apis/config.openshift.io/v1/clusteroperators/authentication
Last Transition Time: 2020-03-31T08:08:54Z
Message: RouteHealthDegraded: failed to GET route: dial tcp 192.168.0.7:443: connect: connection refused
Last Transition Time: 2020-03-31T08:04:12Z
Last Transition Time: 2020-03-31T03:10:24Z
Last Transition Time: 2020-03-31T02:54:34Z
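For anyone triaging a similar message: "connection refused" means the TCP connect to the route endpoint was actively rejected, which is a different failure from a timeout. A minimal sketch for telling these cases apart from a host that can reach the address is below. The helper name probe_tcp is hypothetical and this is not the operator's own health check, only a plain TCP connect attempt.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connect and classify the result.

    Returns "open", "refused", "timeout", or "error: <detail>".
    Hypothetical helper for manual triage, not part of any OpenShift tooling.
    """
    try:
        # create_connection performs the TCP handshake only; no TLS or HTTP.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or something in front of it) answered with a reset:
        # nothing is listening on that port at that address.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else, for example "no route to host".
        return f"error: {exc}"
```

Calling probe_tcp("192.168.0.7", 443) with the address from the message above would return "refused" while the condition persists, assuming the caller has the same network view as the operator pod.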
Hit it once
Steps to Reproduce:
1. Install OCP_4.4 rc with params "UPI_OSP 13_Connected_No Proxy_RHCOS 4.4_Disk Encyption off_FIPS on_OpenShift-SDN (network policy)_IPv4_Etcd Encyption Off_CRIO-1.17_Fluentd_Etcd-3.3_OpenIDconnect_File System_Cinder_Object_Swift_overlay2_OVS-2.11"
2. Now run the command below to upgrade to the latest nightly version available
oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-03-30-180504 --force=true --allow-explicit-upgrade=true
The upgrade succeeds, but the authentication operator is in a degraded state due to RouteHealthDegraded.
Upgrade should succeed and no operator should be in a degraded state.
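The expected result can be checked mechanically after an upgrade. The sketch below assumes the output of `oc get clusteroperators -o json` has been loaded into a dict, and it relies on the ClusterOperator status.conditions fields (type, status, message). The function name degraded_operators and the SAMPLE data are hypothetical; the sample is hand-written for illustration, with only the authentication message taken from this report.

```python
import json


def degraded_operators(co_list: dict) -> dict:
    """Map operator name -> Degraded message for each degraded operator.

    co_list is the parsed JSON of `oc get clusteroperators -o json`
    (a List object whose "items" are ClusterOperator resources).
    """
    degraded = {}
    for item in co_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unknown>")
        for cond in item.get("status", {}).get("conditions", []):
            # Condition status is the string "True"/"False"/"Unknown".
            if cond.get("type") == "Degraded" and cond.get("status") == "True":
                degraded[name] = cond.get("message", "")
    return degraded


# Hand-written sample, NOT real cluster output. The "console" entry is
# invented for contrast; the authentication message is from this report.
SAMPLE = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "authentication"},
      "status": {"conditions": [
        {"type": "Degraded", "status": "True",
         "message": "RouteHealthDegraded: failed to GET route: dial tcp 192.168.0.7:443: connect: connection refused"},
        {"type": "Available", "status": "True"}
      ]}
    },
    {
      "metadata": {"name": "console"},
      "status": {"conditions": [
        {"type": "Degraded", "status": "False"},
        {"type": "Available", "status": "True"}
      ]}
    }
  ]
}
""")

for operator, message in degraded_operators(SAMPLE).items():
    print(f"{operator}: {message}")
```

An empty result corresponds to the expected state described above; any entry names an operator that would cause the clusterversion status to report an error while reconciling.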
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.
Who is impacted?
Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
Up to 2 minute disruption in edge routing
Up to 90 seconds of API downtime
etcd loses quorum and you have to restore from backup
How involved is remediation?
Issue resolves itself after five minutes
Admin uses oc to fix things
Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
No, it’s always been like this; we just never noticed
Yes, from 4.2 and 4.3.1
Regarding edge routing, workload routing (and thus auth/console) disruption during upgrades is improved in 4.3+ (see https://bugzilla.redhat.com/show_bug.cgi?id=1809665 and linked backports). There are related upgrade disruption improvements in other areas including the SDN, apiserver, console, and auth. There are no plans I'm aware of to backport those improvements to 4.2, so the benefits will only be realized in 4.3+ upgrade scenarios.
Note that the 4.3 backports for these fixes are still in flight. To test them today, you would need to upgrade to a 4.4 or 4.5 build and upgrade from there.
I don't think there's any plan to do further disruption investigation or fixes in the 4.2 line at this point.
(In reply to Dan Mace from comment #3)
> Regarding edge routing, workload routing (and thus auth/console) disruption
> during upgrades is improved in 4.3+ (see
> https://bugzilla.redhat.com/show_bug.cgi?id=1809665 and linked backports).
> There are related upgrade disruption improvements in other areas including
> the SDN, apiserver, console, and auth. There are no plans I'm aware of to
> backport those improvements to 4.2, so the benefits will only be realized in
> 4.3+ upgrade scenarios.
> Note that the 4.3 backports for these fixes are still in flight. To test
> them today, you would need to upgrade to a 4.4 or 4.5 build and upgrade from
> there.
> I don't think there's any plan to do further disruption investigation or
> fixes in the 4.2 line at this point.
Dan, this bug was reported on 4.4 upgrades. I think the example answers to the assessment questions caused the confusion about 4.2.
After talking with Clayton, it turns out I was incorrect about the current backporting status: we're apparently still not 100% done even with the totality of known 4.5 fixes, and 4.4 may not yet be in sync with all of what's already done for 4.5. On the surface this appears to be just another data point related to known issues, and a duplicate of one of the other disruption-related bugs, probably https://bugzilla.redhat.com/show_bug.cgi?id=1809667. Only a root cause analysis would reveal whether there's another novel issue at play, but so far I'm not seeing enough interesting new evidence to justify the effort.
Right now I'd recommend closing this one as a dupe of 1809667, or if you want to leave it open, mark this bug blocked by 1809667.
Clayton and I are going to try and fix up a meta-bug to associate all these disjoint symptom bugs with. Stay tuned...
I'm not sure there's much value in keeping this bug open, but for now we'll keep it and I've made it depend on #1809665, which is the canonical issue for disruption to workloads during upgrades (which encompasses auth and the console).
Dropping the UpgradeBlocker flag since this is tied to existing, well-understood route availability issues that have existed throughout the life of 4.x.