Description of problem:
Authentication operator reports RouteHealthDegraded: failed to GET route: dial tcp 192.168.0.7:443: connect: connection refused

Version-Release number of selected component (if applicable):
[ramakasturinarra@dhcp35-60 cucushift]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-03-30-180504   True        False         131m    Error while reconciling 4.4.0-0.nightly-2020-03-30-180504: the cluster operator authentication is degraded

[ramakasturinarra@dhcp35-60 cucushift]$ oc describe co/authentication
Name:         authentication
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-03-31T02:54:32Z
  Generation:          1
  Resource Version:    194866
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/authentication
  UID:                 ec3127a1-f45c-412a-be4a-9b915f0fbb78
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-03-31T08:08:54Z
    Message:               RouteHealthDegraded: failed to GET route: dial tcp 192.168.0.7:443: connect: connection refused
    Reason:                RouteHealth_FailedGet
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-03-31T08:04:12Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-03-31T03:10:24Z
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-03-31T02:54:34Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  authentications
    Group:     config.openshift.io
    Name:      cluster
    Resource:  authentications
    Group:     config.openshift.io
    Name:      cluster
    Resource:  infrastructures
    Group:     config.openshift.io
    Name:      cluster
    Resource:  oauths
    Group:     route.openshift.io
    Name:      oauth-openshift
    Resource:  routes
    Group:
    Name:      oauth-openshift
    Resource:  services
    Group:
    Name:      openshift-config
    Resource:  namespaces
    Group:
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:
    Name:      openshift-authentication
    Resource:  namespaces
    Group:
    Name:      openshift-authentication-operator
    Resource:  namespaces
    Group:
    Name:      openshift-ingress
    Resource:  namespaces
  Versions:
    Name:     oauth-openshift
    Version:  4.4.0-0.nightly-2020-03-30-180504_openshift
    Name:     operator
    Version:  4.4.0-0.nightly-2020-03-30-180504
Events:       <none>

How reproducible:
Hit it once

Steps to Reproduce:
1. Install OCP_4.4 rc with params "UPI_OSP 13_Connected_No Proxy_RHCOS 4.4_Disk Encyption off_FIPS on_OpenShift-SDN (network policy)_IPv4_Etcd Encyption Off_CRIO-1.17_Fluentd_Etcd-3.3_OpenIDconnect_File System_Cinder_Object_Swift_overlay2_OVS-2.11"
2. Run the command below to upgrade to the latest available nightly version:
   oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-03-30-180504 --force=true --allow-explicit-upgrade=true

Actual results:
The upgrade succeeds, but the authentication operator is in a degraded state due to RouteHealthDegraded.

Expected results:
The upgrade should succeed and no operator should be in a degraded state.

Additional info:
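For anyone re-checking this after an upgrade, the Degraded condition shown above can also be pulled out of the JSON form of the same object (for example from `oc get co authentication -o json`). The snippet below is only a rough sketch: the helper name `degraded_conditions` and the trimmed `SAMPLE` document are made up for illustration, with the condition values copied from the describe output in this report.

```python
import json


def degraded_conditions(cluster_operator):
    """Return (reason, message) pairs for each Degraded condition whose status is True.

    `cluster_operator` is a dict parsed from a ClusterOperator object in JSON form.
    Hypothetical helper for illustration; not part of oc or any OpenShift library.
    """
    conditions = cluster_operator.get("status", {}).get("conditions", [])
    return [
        (c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("type") == "Degraded" and c.get("status") == "True"
    ]


# Trimmed sample: only the fields the helper reads, values taken from this report.
SAMPLE = json.loads("""
{
  "kind": "ClusterOperator",
  "metadata": {"name": "authentication"},
  "status": {
    "conditions": [
      {"type": "Degraded", "status": "True",
       "reason": "RouteHealth_FailedGet",
       "message": "RouteHealthDegraded: failed to GET route: dial tcp 192.168.0.7:443: connect: connection refused"},
      {"type": "Progressing", "status": "False", "reason": "AsExpected"},
      {"type": "Available", "status": "True", "reason": "AsExpected"},
      {"type": "Upgradeable", "status": "True", "reason": "AsExpected"}
    ]
  }
}
""")

if __name__ == "__main__":
    for reason, message in degraded_conditions(SAMPLE):
        print(f"{reason}: {message}")
```

An empty result means the operator is not reporting Degraded=True at the time the object was captured; it says nothing about earlier transient failures during the upgrade.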
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it's always been like this we just never noticed
  Yes, from 4.2 and 4.3.1
Regarding edge routing, workload routing (and thus auth/console) disruption during upgrades is improved in 4.3+ (see https://bugzilla.redhat.com/show_bug.cgi?id=1809665 and linked backports). There are related upgrade disruption improvements in other areas including the SDN, apiserver, console, and auth. There are no plans I'm aware of to backport those improvements to 4.2, so the benefits will only be realized in 4.3+ upgrade scenarios. Note that the 4.3 backports for these fixes are still in flight. To test them today, you would need to upgrade to a 4.4 or 4.5 build and upgrade from there. I don't think there's any plan to do further disruption investigation or fixes in the 4.2 line at this point.
(In reply to Dan Mace from comment #3)
> Regarding edge routing, workload routing (and thus auth/console) disruption
> during upgrades is improved in 4.3+ (see
> https://bugzilla.redhat.com/show_bug.cgi?id=1809665 and linked backports).
> There are related upgrade disruption improvements in other areas including
> the SDN, apiserver, console, and auth. There are no plans I'm aware of to
> backport those improvements to 4.2, so the benefits will only be realized in
> 4.3+ upgrade scenarios.
>
> Note that the 4.3 backports for these fixes are still in flight. To test
> them today, you would need to upgrade to a 4.4 or 4.5 build and upgrade from
> there.
>
> I don't think there's any plan to do further disruption investigation or
> fixes in the 4.2 line at this point.

Dan,

This bug is reported on 4.4 upgrades. I think the example answers to the assessment questions created the confusion about 4.2.
After talking with Clayton, it turns out I was incorrect about the current backporting status, and we're apparently still not 100% done even with the totality of known 4.5 fixes, and 4.4 may not yet be in sync with all of what's already done for 4.5. On the surface this appears to be just another data point related to known issues, and a duplicate of one of the other disruption related bugs. Probably https://bugzilla.redhat.com/show_bug.cgi?id=1809667. Only a root cause analysis would reveal whether there's another novel issue at play, but so far I'm not seeing enough interesting new evidence to justify the effort. Right now I'd recommend closing this one as a dupe of 1809667, or if you want to leave it open, mark this bug blocked by 1809667.
Clayton and I are going to try and fix up a meta-bug to associate all these disjoint symptom bugs with. Stay tuned...
I'm not sure there's much value in keeping this bug open, but for now we'll keep it and I've made it depend on #1809665, which is the canonical issue for disruption to workloads during upgrades (which encompasses auth and the console).
Dropping the UpgradeBlocker flag since this is tied to an existing, well-understood route availability issue that has existed throughout the life of 4.x.