Version-Release number of selected component (if applicable):
Base version: 4.4.18
Target version: 4.4.0-0.nightly-2020-08-25-142845

How reproducible:
Sometimes

Steps to Reproduce:
1. Set up a 4.4.18 OVN cluster
2. Upgrade to 4.4.0-0.nightly-2020-08-25-142845:
oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-08-25-142845 --force=true --allow-explicit-upgrade=true

Actual results:
Upgrade failed with many cluster operators degraded.

oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-08-25-142845   True        True          True       20h
cloud-credential                           4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
cluster-autoscaler                         4.4.0-0.nightly-2020-08-25-142845   True        False         False      20h
console                                    4.4.0-0.nightly-2020-08-25-142845   False       True          True       18h
csi-snapshot-controller                    4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
dns                                        4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
etcd                                       4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
image-registry                             4.4.0-0.nightly-2020-08-25-142845   True        False         False      19h
ingress                                    4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
insights                                   4.4.0-0.nightly-2020-08-25-142845   True        False         True       21h
kube-apiserver                             4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
kube-controller-manager                    4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
kube-scheduler                             4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
kube-storage-version-migrator              4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
machine-api                                4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
machine-config                             4.4.0-0.nightly-2020-08-25-142845   True        False         False      17h
marketplace                                4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
monitoring                                 4.4.0-0.nightly-2020-08-25-142845   False       True          True       18h
network                                    4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
node-tuning                                4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
openshift-apiserver                        4.4.0-0.nightly-2020-08-25-142845   False       False         False      18h
openshift-controller-manager               4.4.0-0.nightly-2020-08-25-142845   True        False         False      33m
openshift-samples                          4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
operator-lifecycle-manager                 4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-08-25-142845   False       True          False      16m
service-ca                                 4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
service-catalog-apiserver                  4.4.0-0.nightly-2020-08-25-142845   False       False         False      18h
service-catalog-controller-manager         4.4.0-0.nightly-2020-08-25-142845   False       False         False      18h
storage                                    4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h

oc get co authentication -o yaml (excerpt)
  - lastTransitionTime: "2020-08-26T13:28:57Z"
    message: |-
      RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)
      RouteHealthDegraded: failed to GET route: dial tcp: i/o timeout
    reason: RouteHealth_FailedGet::RouteStatus_FailedCreate

oc get pods -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE                                        NOMINATED NODE   READINESS GATES
router-default-679d8ff997-9pcfm   1/1     Running   0          19h   10.131.2.5   ip-10-0-52-215.us-east-2.compute.internal   <none>           <none>
router-default-679d8ff997-hdm92   1/1     Running   5          19h   10.130.2.3   ip-10-0-57-180.us-east-2.compute.internal   <none>           <none>

huiran-mac:script hrwang$ oc rsh -n openshift-ingress router-default-679d8ff997-9pcfm
sh-4.2$ curl 10.130.2.3 -v
* About to connect() to 10.130.2.3 port 80 (#0)
*   Trying 10.130.2.3...
^C
sh-4.2$ exit
exit

Pods on the same node can communicate, but pods on different nodes cannot (hello-pod and hello-lqx4s share a node; hello-xxmkd is on a different one):

oc get pods -o wide
NAME          READY   STATUS    RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
hello-6ldjz   1/1     Running   0          48m     10.130.0.20   ip-10-0-49-72.us-east-2.compute.internal    <none>           <none>
hello-6vbbl   1/1     Running   0          48m     10.131.2.11   ip-10-0-52-215.us-east-2.compute.internal   <none>           <none>
hello-lqx4s   1/1     Running   0          48m     10.130.2.7    ip-10-0-57-180.us-east-2.compute.internal   <none>           <none>
hello-pod     1/1     Running   0          2m35s   10.130.2.17   ip-10-0-57-180.us-east-2.compute.internal   <none>           <none>
hello-sphrv   1/1     Running   0          48m     10.128.0.19   ip-10-0-65-178.us-east-2.compute.internal   <none>           <none>
hello-xxmkd   1/1     Running   0          48m     10.129.0.48   ip-10-0-60-174.us-east-2.compute.internal   <none>           <none>

huiran-mac:script hrwang$ oc project
No project has been set. Pass a project name to make that the default.
huiran-mac:script hrwang$ oc rsh hello-pod
/ # curl 10.130.2.7:8080
Hello-OpenShift-1 http-8080
/ # curl 10.129.0.48:8080
curl: (7) Failed to connect to 10.129.0.48 port 8080: Operation timed out
/ # curl 10.129.0.48:8080
^C
/ # exit

Expected results:
Should upgrade successfully.
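For anyone reproducing the cross-node check above, it can be scripted instead of curling pod by pod. A minimal sketch, run inside the test pod (oc rsh hello-pod); the IPs and the port-8080 hello service are the ones from this cluster, so substitute your own:

# Curl every hello pod IP with a 5s timeout; unreachable peers get flagged.
# (IPs taken from the "oc get pods -o wide" output above.)
for ip in 10.130.0.20 10.131.2.11 10.130.2.7 10.128.0.19 10.129.0.48; do
  curl -s -m 5 "http://$ip:8080" > /dev/null && echo "$ip ok" || echo "$ip UNREACHABLE"
done

Comparing the failing IPs against the node column of the pod listing makes the same-node vs. cross-node pattern obvious.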
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1
From the affected cluster:

$ oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2020-08-26T10:56:50Z"
    generation: 2
    name: version
    resourceVersion: "998596"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: f02ba116-e2ab-4398-a2db-c0f3425c130a
  spec:
    channel: stable-4.4
    clusterID: b348e412-eb0f-473b-9fbb-ce26ec1f231d
    desiredUpdate:
      force: true
      image: registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-08-25-142845
      version: ""
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-08-26T11:19:54Z"
      message: Done applying 4.4.18
      status: "True"
      type: Available
    - lastTransitionTime: "2020-08-27T10:48:05Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-08-26T13:04:14Z"
      message: 'Working towards 4.4.0-0.nightly-2020-08-25-142845: 17% complete'
      status: "True"
      type: Progressing
    - lastTransitionTime: "2020-08-26T10:56:56Z"
      message: 'Unable to retrieve available updates: currently installed version 4.4.0-0.nightly-2020-08-25-142845 not found in the "stable-4.4" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2020-08-26T11:21:36Z"
      message: |-
        Multiple cluster operators cannot be upgradeable:
        * Cluster operator service-catalog-apiserver cannot be upgraded: _Managed: Upgradeable: The Service Catalog is deprecated, upgrades are not possible. Please visit this link for further details: https://docs.openshift.com/container-platform/4.4/applications/service_brokers/installing-service-catalog.html
        * Cluster operator service-catalog-controller-manager cannot be upgraded: _Managed: Upgradeable: The Service Catalog is deprecated, upgrades are not possible. Please visit this link for further details: https://docs.openshift.com/container-platform/4.4/applications/service_brokers/installing-service-catalog.html
        * Cluster operator marketplace cannot be upgraded: DeprecatedAPIsInUse: The cluster has custom OperatorSource/CatalogSourceConfig, which are deprecated in future versions. Please visit this link for further deatils: https://docs.openshift.com/container-platform/4.4/release_notes/ocp-4-4-release-notes.html#ocp-4-4-marketplace-apis-deprecated
      reason: ClusterOperatorsNotUpgradeable
      status: "False"
      type: Upgradeable
    desired:
      force: true
      image: registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-08-25-142845
      version: 4.4.0-0.nightly-2020-08-25-142845
    history:
    - completionTime: null
      image: registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-08-25-142845
      startedTime: "2020-08-26T13:04:14Z"
      state: Partial
      verified: false
      version: 4.4.0-0.nightly-2020-08-25-142845
    - completionTime: "2020-08-26T11:19:54Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:3250780b072ed81a561350a5f3e4688076bd7eceb29991caf5d4fd0a5c03b7a5
      startedTime: "2020-08-26T10:56:56Z"
      state: Completed
      verified: false
      version: 4.4.18
    observedGeneration: 2
    versionHash: a6jbAqQCiOU=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
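As an aside, the stalled rollout can be followed without dumping the whole object. These are standard oc/jsonpath queries, nothing specific to this cluster:

# Current CVO progress message (the "17% complete" line above):
oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}'

# Operators still degraded (DEGRADED is the 5th column of "oc get co"):
oc get co --no-headers | awk '$5 == "True" {print $1}'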
Response to comment 4:

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
ovn-kubernetes is not supported on 4.4 except for one customer with a support exception, and they do not upgrade clusters on 4.4; they reinstall.

What is the impact? Is it serious enough to warrant blocking edges?
Cross-node SDN is broken on ovn-kube. But 4.4 ovn-kube is not supported.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
I suspect restarting the ovn-kube controllers will fix the issue, but that is untested. (A sketch of how one might do that follows below.)

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
It would be if this always fails; it may also be a one-time flake. However, this should not block the edges, since the only supported ovn-kube customer does not upgrade their test clusters on 4.4 and is using 4.5 more now anyway.
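For anyone who wants to try that untested remediation: the usual approach is to delete the OVN pods and let their DaemonSets recreate them. A minimal sketch, assuming the standard labels in the openshift-ovn-kubernetes namespace (verify with "oc get pods --show-labels" before running):

# Untested suggestion: bounce the ovn-kube control plane, then the per-node pods.
oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-master
oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-node
oc -n openshift-ovn-kubernetes get pods -w   # wait until everything is back to Running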
As per the previous comment, removing the upgrade blocker keyword.
This bug seems to be a duplicate of bug https://bugzilla.redhat.com/show_bug.cgi?id=1874385, except this one is reported against 4.4; the symptoms are the same. Thanks to Alex for pointing me to that bug. Looking at the apiserver pods in the above cluster, this bug also seems to be about the transport-closing issue:

controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
W0902 08:18:41.941578       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.xx.xx:xxxx  0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.61.114:2379: connect: connection refused". Reconnecting...
I0902 08:18:41.943152       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
W0902 08:18:41.943420       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.xx.xx:xxxx  0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.61.114:2379: connect: connection refused". Reconnecting...
I0902 08:18:41.944314       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

As Alex said in https://bugzilla.redhat.com/show_bug.cgi?id=1874385#c8, the issue seems to have been understood.
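To cross-check another cluster for the same symptom, something like the following should work; the app=openshift-kube-apiserver label and the kube-apiserver container name are the standard ones, but verify them on your cluster first:

# Count "transport is closing" occurrences per apiserver pod.
for p in $(oc -n openshift-kube-apiserver get pods -l app=openshift-kube-apiserver -o name); do
  echo "$p: $(oc -n openshift-kube-apiserver logs "$p" -c kube-apiserver --tail=-1 \
    | grep -c 'transport is closing')"
done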
This is the related bug whose PR seems to be what breaks pod-to-pod networking between nodes: https://bugzilla.redhat.com/show_bug.cgi?id=1868392
This will be fixed in master and 4.5, but upgrades of OVN on 4.4 are not supported.
Re-opening, as we decided the backport should be sufficiently easy to do given the reward of making 4.4 -> 4.4.N upgrades safe.
*** This bug has been marked as a duplicate of bug 1875438 ***