Bug 1819147 - [UPI] Failed to upgrade from OCP_4.4. rc to 4.4 nightly_Upgrade Testing due to RouteHealthDegraded: failed to GET route: dial tcp 192.168.0.7:443: connect: connection refused
Keywords:
Status: CLOSED DUPLICATE of bug 1809665
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Dan Mace
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On: 1809668 1809665 1869785
Blocks:
 
Reported: 2020-03-31 10:24 UTC by RamaKasturi
Modified: 2020-08-24 06:13 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-07 15:57:45 UTC
Target Upstream Version:



Description RamaKasturi 2020-03-31 10:24:58 UTC
Description of problem:
The authentication operator reports RouteHealthDegraded: failed to GET route: dial tcp 192.168.0.7:443: connect: connection refused
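
The `dial tcp ...: connect: connection refused` part of the message means the TCP connection to the route's address was rejected before any TLS or HTTP exchange took place. A minimal Python sketch for checking this by hand from a host that can reach the cluster network follows. The helper name `can_connect` and the default timeout are illustrative assumptions, and this is not the operator's actual health check.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Attempt a plain TCP connect and return (ok, error_text)."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as err:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False, str(err)
```

Calling `can_connect("192.168.0.7", 443)` with the address from the message above would be expected to return `False` plus an error string while the endpoint is refusing connections.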

Version-Release number of selected component (if applicable):
[ramakasturinarra@dhcp35-60 cucushift]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-03-30-180504   True        False         131m    Error while reconciling 4.4.0-0.nightly-2020-03-30-180504: the cluster operator authentication is degraded
[ramakasturinarra@dhcp35-60 cucushift]$ oc describe co/authentication
Name:         authentication
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-03-31T02:54:32Z
  Generation:          1
  Resource Version:    194866
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/authentication
  UID:                 ec3127a1-f45c-412a-be4a-9b915f0fbb78
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-03-31T08:08:54Z
    Message:               RouteHealthDegraded: failed to GET route: dial tcp 192.168.0.7:443: connect: connection refused
    Reason:                RouteHealth_FailedGet
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-03-31T08:04:12Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-03-31T03:10:24Z
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-03-31T02:54:34Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  authentications
    Group:     config.openshift.io
    Name:      cluster
    Resource:  authentications
    Group:     config.openshift.io
    Name:      cluster
    Resource:  infrastructures
    Group:     config.openshift.io
    Name:      cluster
    Resource:  oauths
    Group:     route.openshift.io
    Name:      oauth-openshift
    Resource:  routes
    Group:     
    Name:      oauth-openshift
    Resource:  services
    Group:     
    Name:      openshift-config
    Resource:  namespaces
    Group:     
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:     
    Name:      openshift-authentication
    Resource:  namespaces
    Group:     
    Name:      openshift-authentication-operator
    Resource:  namespaces
    Group:     
    Name:      openshift-ingress
    Resource:  namespaces
  Versions:
    Name:     oauth-openshift
    Version:  4.4.0-0.nightly-2020-03-30-180504_openshift
    Name:     operator
    Version:  4.4.0-0.nightly-2020-03-30-180504
Events:       <none>



How reproducible:
Hit it once

Steps to Reproduce:
1. Install OCP_4.4 rc with params "UPI_OSP 13_Connected_No Proxy_RHCOS 4.4_Disk Encyption off_FIPS on_OpenShift-SDN (network policy)_IPv4_Etcd Encyption Off_CRIO-1.17_Fluentd_Etcd-3.3_OpenIDconnect_File System_Cinder_Object_Swift_overlay2_OVS-2.11"
2. Now run the command below to upgrade to the latest nightly version available
oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-03-30-180504 --force=true --allow-explicit-upgrade=true 

Actual results:
The upgrade succeeds, but the authentication operator is left in a degraded state due to RouteHealthDegraded.

Expected results:
The upgrade should succeed and no operator should be in a degraded state.
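
One way to verify the expected result after an upgrade is to scan the ClusterOperator conditions for `Degraded` with status `True`. The sketch below assumes the JSON shape produced by `oc get clusteroperators -o json` (a list under `items`, each entry carrying `status.conditions`). The helper name `degraded_operators` is hypothetical.

```python
import json


def degraded_operators(cluster_operators_json):
    """Return (name, reason, message) for each operator with Degraded=True."""
    doc = json.loads(cluster_operators_json)
    found = []
    for item in doc.get("items", []):
        name = item.get("metadata", {}).get("name", "")
        for cond in item.get("status", {}).get("conditions", []):
            # Condition status values are the strings "True" / "False".
            if cond.get("type") == "Degraded" and cond.get("status") == "True":
                found.append((name, cond.get("reason", ""), cond.get("message", "")))
    return found
```

Feeding it the saved output of `oc get clusteroperators -o json` should yield an empty list when no operator reports Degraded.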

Additional info:

Comment 2 Lalatendu Mohanty 2020-03-31 11:32:05 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2-minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities
Is this a regression?
  No, it's always been like this; we just never noticed
  Yes, from 4.2 and 4.3.1

Comment 3 Dan Mace 2020-03-31 12:15:40 UTC
Regarding edge routing, workload routing (and thus auth/console) disruption during upgrades is improved in 4.3+ (see https://bugzilla.redhat.com/show_bug.cgi?id=1809665 and linked backports). There are related upgrade disruption improvements in other areas including the SDN, apiserver, console, and auth. There are no plans I'm aware of to backport those improvements to 4.2, so the benefits will only be realized in 4.3+ upgrade scenarios.

Note that the 4.3 backports for these fixes are still in flight. To test them today, you would need to upgrade to a 4.4 or 4.5 build and upgrade from there.

I don't think there's any plan to do further disruption investigation or fixes in the 4.2 line at this point.

Comment 4 Lalatendu Mohanty 2020-03-31 13:27:28 UTC
(In reply to Dan Mace from comment #3)
> Regarding edge routing, workload routing (and thus auth/console) disruption
> during upgrades is improved in 4.3+ (see
> https://bugzilla.redhat.com/show_bug.cgi?id=1809665 and linked backports).
> There are related upgrade disruption improvements in other areas including
> the SDN, apiserver, console, and auth. There are no plans I'm aware of to
> backport those improvements to 4.2, so the benefits will only be realized in
> 4.3+ upgrade scenarios.
> 
> Note that the 4.3 backports for these fixes are still in flight. To test
> them today, you would need to upgrade to a 4.4 or 4.5 build and upgrade from
> there.
> 
> I don't think there's any plan to do further disruption investigation or
> fixes in the 4.2 line at this point.

Dan, this bug is reported on 4.4 upgrades. I think the sample answers to the assessment questions caused the confusion about 4.2.

Comment 5 Dan Mace 2020-03-31 13:52:03 UTC
After talking with Clayton, it turns out I was incorrect about the current backporting status. We're apparently still not 100% done even with the totality of known 4.5 fixes, and 4.4 may not yet be in sync with everything that's already done for 4.5. On the surface this appears to be just another data point related to known issues, and a duplicate of one of the other disruption-related bugs, probably https://bugzilla.redhat.com/show_bug.cgi?id=1809667. Only a root cause analysis would reveal whether there's another novel issue at play, but so far I'm not seeing enough interesting new evidence to justify the effort.

Right now I'd recommend closing this one as a dupe of 1809667, or if you want to leave it open, mark this bug blocked by 1809667.

Comment 6 Dan Mace 2020-03-31 14:02:35 UTC
Clayton and I are going to try and fix up a meta-bug to associate all these disjoint symptom bugs with. Stay tuned...

Comment 7 Dan Mace 2020-03-31 17:43:53 UTC
I'm not sure there's much value in keeping this bug open, but for now we'll keep it and I've made it depend on #1809665, which is the canonical issue for disruption to workloads during upgrades (which encompasses auth and the console).

Comment 8 Scott Dodson 2020-04-08 18:26:18 UTC
Dropping the UpgradeBlocker flag since this is tied to an existing, well-understood route availability issue that has existed throughout the life of 4.x.

