Bug 1819147
Summary: | [UPI] Failed to upgrade from OCP_4.4. rc to 4.4 nightly_Upgrade Testing due to RouteHealthDegraded: failed to GET route: dial tcp 192.168.0.7:443: connect: connection refused | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | RamaKasturi <knarra> |
Component: | Networking | Assignee: | Dan Mace <dmace> |
Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
Status: | CLOSED DUPLICATE | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | amcdermo, aos-bugs, ccoleman, dmace, lmohanty, sdodson, wking, wsun |
Version: | 4.4 | Keywords: | TestBlocker, Upgrades |
Target Milestone: | --- | ||
Target Release: | 4.5.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-05-07 15:57:45 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1809665, 1809668, 1869785 | ||
Bug Blocks: |
Description
RamaKasturi
2020-03-31 10:24:58 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
  No, it’s always been like this, we just never noticed
  Yes, from 4.2 and 4.3.1

Regarding edge routing, workload routing (and thus auth/console) disruption during upgrades is improved in 4.3+ (see https://bugzilla.redhat.com/show_bug.cgi?id=1809665 and linked backports). There are related upgrade disruption improvements in other areas including the SDN, apiserver, console, and auth. There are no plans I'm aware of to backport those improvements to 4.2, so the benefits will only be realized in 4.3+ upgrade scenarios.

Note that the 4.3 backports for these fixes are still in flight. To test them today, you would need to upgrade to a 4.4 or 4.5 build and upgrade from there.

I don't think there's any plan to do further disruption investigation or fixes in the 4.2 line at this point.
(In reply to Dan Mace from comment #3)
> Regarding edge routing, workload routing (and thus auth/console) disruption
> during upgrades is improved in 4.3+ (see
> https://bugzilla.redhat.com/show_bug.cgi?id=1809665 and linked backports).
> There are related upgrade disruption improvements in other areas including
> the SDN, apiserver, console, and auth. There are no plans I'm aware of to
> backport those improvements to 4.2, so the benefits will only be realized in
> 4.3+ upgrade scenarios.
>
> Note that the 4.3 backports for these fixes are still in flight. To test
> them today, you would need to upgrade to a 4.4 or 4.5 build and upgrade from
> there.
>
> I don't think there's any plan to do further disruption investigation or
> fixes in the 4.2 line at this point.

Dan, this bug is reported on 4.4 upgrades. I think the example answers to the assessment questions created the confusion about 4.2.

After talking with Clayton, it turns out I was incorrect about the current backporting status, and we're apparently still not 100% done even with the totality of known 4.5 fixes, and 4.4 may not yet be in sync with all of what's already done for 4.5.

On the surface this appears to be just another data point related to known issues, and a duplicate of one of the other disruption-related bugs. Probably https://bugzilla.redhat.com/show_bug.cgi?id=1809667. Only a root cause analysis would reveal whether there's another novel issue at play, but so far I'm not seeing enough interesting new evidence to justify the effort.

Right now I'd recommend closing this one as a dupe of 1809667, or if you want to leave it open, mark this bug blocked by 1809667.

Clayton and I are going to try and fix up a meta-bug to associate all these disjoint symptom bugs with. Stay tuned...
I'm not sure there's much value in keeping this bug open, but for now we'll keep it and I've made it depend on #1809665, which is the canonical issue for disruption to workloads during upgrades (which encompasses auth and the console).

Dropping the UpgradeBlocker flag since this is tied to existing, well-understood route availability issues that have existed throughout the life of 4.x.