Bug 1873046
Summary: [OVN&Upgrade] Upgrade from 4.4.18 to latest 4.4 nightly builds failed; pods on different nodes cannot communicate.

| Field | Value | Field | Value |
|---|---|---|---|
| Product | OpenShift Container Platform | Reporter | huirwang |
| Component | Networking | Assignee | Alexander Constantinescu <aconstan> |
| Networking sub component | ovn-kubernetes | QA Contact | Anurag Saxena <anusaxen> |
| Status | CLOSED DUPLICATE | Docs Contact | |
| Severity | high | | |
| Priority | high | CC | aconstan, bbennett, jhou, lmohanty, rbrattai, sdodson, weliang, wking, zzhao |
| Version | 4.4 | Keywords | Reopened, Upgrades |
| Target Milestone | --- | | |
| Target Release | 4.4.z | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2020-09-03 14:17:46 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
Description (huirwang, 2020-08-27 09:10:16 UTC)
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

- example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?

- example: Up to 2 minute disruption in edge routing
- example: Up to 90 seconds of API downtime
- example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

- example: Issue resolves itself after five minutes
- example: Admin uses oc to fix things
- example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- example: No, it's always been like this, we just never noticed
- example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

From the affected cluster:

```
$ oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2020-08-26T10:56:50Z"
    generation: 2
    name: version
    resourceVersion: "998596"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: f02ba116-e2ab-4398-a2db-c0f3425c130a
  spec:
    channel: stable-4.4
    clusterID: b348e412-eb0f-473b-9fbb-ce26ec1f231d
    desiredUpdate:
      force: true
      image: registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-08-25-142845
      version: ""
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-08-26T11:19:54Z"
      message: Done applying 4.4.18
      status: "True"
      type: Available
    - lastTransitionTime: "2020-08-27T10:48:05Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-08-26T13:04:14Z"
      message: 'Working towards 4.4.0-0.nightly-2020-08-25-142845: 17% complete'
      status: "True"
      type: Progressing
    - lastTransitionTime: "2020-08-26T10:56:56Z"
      message: 'Unable to retrieve available updates: currently installed version 4.4.0-0.nightly-2020-08-25-142845 not found in the "stable-4.4" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2020-08-26T11:21:36Z"
      message: |-
        Multiple cluster operators cannot be upgradeable:
        * Cluster operator service-catalog-apiserver cannot be upgraded: _Managed: Upgradeable: The Service Catalog is deprecated, upgrades are not possible. Please visit this link for further details: https://docs.openshift.com/container-platform/4.4/applications/service_brokers/installing-service-catalog.html
        * Cluster operator service-catalog-controller-manager cannot be upgraded: _Managed: Upgradeable: The Service Catalog is deprecated, upgrades are not possible. Please visit this link for further details: https://docs.openshift.com/container-platform/4.4/applications/service_brokers/installing-service-catalog.html
        * Cluster operator marketplace cannot be upgraded: DeprecatedAPIsInUse: The cluster has custom OperatorSource/CatalogSourceConfig, which are deprecated in future versions. Please visit this link for further deatils: https://docs.openshift.com/container-platform/4.4/release_notes/ocp-4-4-release-notes.html#ocp-4-4-marketplace-apis-deprecated
      reason: ClusterOperatorsNotUpgradeable
      status: "False"
      type: Upgradeable
    desired:
      force: true
      image: registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-08-25-142845
      version: 4.4.0-0.nightly-2020-08-25-142845
    history:
    - completionTime: null
      image: registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-08-25-142845
      startedTime: "2020-08-26T13:04:14Z"
      state: Partial
      verified: false
      version: 4.4.0-0.nightly-2020-08-25-142845
    - completionTime: "2020-08-26T11:19:54Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:3250780b072ed81a561350a5f3e4688076bd7eceb29991caf5d4fd0a5c03b7a5
      startedTime: "2020-08-26T10:56:56Z"
      state: Completed
      verified: false
      version: 4.4.18
    observedGeneration: 2
    versionHash: a6jbAqQCiOU=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

Response to comment 4:

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
ovn-kubernetes is not supported on 4.4 except for one customer with a support exception, and they do not upgrade clusters on 4.4; they reinstall.

What is the impact? Is it serious enough to warrant blocking edges?
Cross-node SDN is broken on ovn-kube. But 4.4 ovn-kube is not supported.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
I suspect restarting the ovn-kube controllers will fix the issue, but that is untested.
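The upgrade state reported above lives entirely in `status.conditions` of the ClusterVersion object. A minimal sketch of reading those conditions programmatically (e.g. from `oc get clusterversion version -o json`); the helper name `summarize_conditions` and the threshold-free summary shape are illustrative, not part of any OpenShift tooling:

```python
import json

def summarize_conditions(cluster_version):
    """Index status.conditions by type and report upgrade progress/failure."""
    conditions = {c["type"]: c for c in cluster_version["status"]["conditions"]}
    return {
        "progressing": conditions.get("Progressing", {}).get("status") == "True",
        "failing": conditions.get("Failing", {}).get("status") == "True",
        "progress_message": conditions.get("Progressing", {}).get("message", ""),
    }

# Abbreviated conditions copied from the ClusterVersion output in this report.
cv = {"status": {"conditions": [
    {"type": "Available", "status": "True", "message": "Done applying 4.4.18"},
    {"type": "Failing", "status": "False"},
    {"type": "Progressing", "status": "True",
     "message": "Working towards 4.4.0-0.nightly-2020-08-25-142845: 17% complete"},
]}}

print(json.dumps(summarize_conditions(cv)))
```

With the conditions from this cluster, the summary shows an upgrade stuck in progress (`progressing` true, `failing` false at that timestamp), matching the "17% complete" message above.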
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
It would be if this always fails. It may also be a one-time flake. However, this should not block the edges, since the only supported customer for ovn-kube does not upgrade their test clusters on 4.4, and they are using 4.5 more now anyway.

As per the previous comment, removing the upgrade-blocker keyword.

This bug seems to be a duplicate of bug https://bugzilla.redhat.com/show_bug.cgi?id=1874385, except that this one is reported against 4.4; the symptoms are the same. Thanks to Alex for pointing me to this bug.

Looking at the apiserver pods in the above cluster, this bug also seems to be about the transport-closing issue:

```
controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
W0902 08:18:41.941578 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.xx.xx:xxxx 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.61.114:2379: connect: connection refused". Reconnecting...
I0902 08:18:41.943152 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
W0902 08:18:41.943420 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.xx.xx:xxxx 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.61.114:2379: connect: connection refused". Reconnecting...
I0902 08:18:41.944314 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
```

As said by Alex in https://bugzilla.redhat.com/show_bug.cgi?id=1874385#c8, the issue seems to have been understood. This is the related bug filing the PR, which is what seems to break the pod-to-pod networking between nodes: https://bugzilla.redhat.com/show_bug.cgi?id=1868392. This will be fixed in master and in 4.5.
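The log excerpt above is an endless reconnect loop: every failed dial to etcd on :2379 produces a "connection refused" line followed by "transport is closing". A quick, unofficial way to confirm a saved log shows this loop is to count the two patterns; this is only a sketch over the lines quoted above, not a supported diagnostic:

```python
import re

# The two symptom patterns seen in the apiserver logs quoted above.
TRANSPORT_CLOSING = re.compile(r"transport is closing")
CONNECT_REFUSED = re.compile(r"connect: connection refused")

def count_reconnect_symptoms(log_lines):
    """Count transport-closing and connection-refused lines in a log excerpt."""
    closing = sum(1 for line in log_lines if TRANSPORT_CLOSING.search(line))
    refused = sum(1 for line in log_lines if CONNECT_REFUSED.search(line))
    return closing, refused

# Sample lines abbreviated from the excerpt above.
sample = [
    'I0902 08:18:41.943152 1 controlbuf.go:508] transport: loopyWriter.run '
    'returning. connection error: desc = "transport is closing"',
    'W0902 08:18:41.943420 1 clientconn.go:1120] grpc: addrConn.createTransport '
    'failed to connect ... connect: connection refused". Reconnecting...',
    'I0902 08:18:41.944314 1 controlbuf.go:508] transport: loopyWriter.run '
    'returning. connection error: desc = "transport is closing"',
]

closing, refused = count_reconnect_symptoms(sample)
print(closing, refused)  # 2 transport-closing lines, 1 connection-refused line
```

A high, still-growing count of either pattern over a short time window is what distinguishes the stuck reconnect loop from a one-off connection blip.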
But upgrades of OVN on 4.4 are not supported.

Re-opening, as we decided the back-port should be sufficiently easy to do given the reward of keeping 4.4 -> 4.4.N upgrades secure.

*** This bug has been marked as a duplicate of bug 1875438 ***