Bug 1873046 - [OVN&Upgrade] Upgrade from 4.4.18 to latest 4.4 nightly builds failed, pods on different nodes cannot communicate.
Keywords:
Status: CLOSED DUPLICATE of bug 1875438
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.z
Assignee: Alexander Constantinescu
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-27 09:10 UTC by huirwang
Modified: 2020-09-03 14:17 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-03 14:17:46 UTC
Target Upstream Version:
Embargoed:



Description huirwang 2020-08-27 09:10:16 UTC
Version-Release number of selected component (if applicable):
Base version: 4.4.18
Target version: 4.4.0-0.nightly-2020-08-25-142845

How reproducible:
Sometimes

Steps to Reproduce:
1. Setup a 4.4.18 OVN cluster
2. Upgrade to 4.4.0-0.nightly-2020-08-25-142845
oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-08-25-142845 --force=true --allow-explicit-upgrade=true
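(Not part of the original report: a minimal way to follow the upgrade from a shell, assuming a logged-in oc session with cluster-admin.)
# Follow overall upgrade progress reported by the ClusterVersion object
oc get clusterversion -w
# Periodically re-list cluster operator status (as captured under "Actual results" below)
watch -n 30 oc get co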



Actual results:
The upgrade failed with many cluster operators (COs) degraded.
oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-08-25-142845   True        True          True       20h
cloud-credential                           4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
cluster-autoscaler                         4.4.0-0.nightly-2020-08-25-142845   True        False         False      20h
console                                    4.4.0-0.nightly-2020-08-25-142845   False       True          True       18h
csi-snapshot-controller                    4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
dns                                        4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
etcd                                       4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
image-registry                             4.4.0-0.nightly-2020-08-25-142845   True        False         False      19h
ingress                                    4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
insights                                   4.4.0-0.nightly-2020-08-25-142845   True        False         True       21h
kube-apiserver                             4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
kube-controller-manager                    4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
kube-scheduler                             4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
kube-storage-version-migrator              4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
machine-api                                4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
machine-config                             4.4.0-0.nightly-2020-08-25-142845   True        False         False      17h
marketplace                                4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
monitoring                                 4.4.0-0.nightly-2020-08-25-142845   False       True          True       18h
network                                    4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
node-tuning                                4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
openshift-apiserver                        4.4.0-0.nightly-2020-08-25-142845   False       False         False      18h
openshift-controller-manager               4.4.0-0.nightly-2020-08-25-142845   True        False         False      33m
openshift-samples                          4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
operator-lifecycle-manager                 4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-08-25-142845   False       True          False      16m
service-ca                                 4.4.0-0.nightly-2020-08-25-142845   True        False         False      21h
service-catalog-apiserver                  4.4.0-0.nightly-2020-08-25-142845   False       False         False      18h
service-catalog-controller-manager         4.4.0-0.nightly-2020-08-25-142845   False       False         False      18h
storage                                    4.4.0-0.nightly-2020-08-25-142845   True        False         False      18h
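
(Aside, not in the original report: a quick way to isolate the unhealthy operators from the table above and see why one is degraded.)
# Show only operators that are not Available or are Degraded (AVAILABLE is column 3, DEGRADED is column 5 in the output above)
oc get co | awk 'NR == 1 || $3 != "True" || $5 == "True"'
# Then inspect an unhealthy operator's conditions in detail, for example:
oc describe co monitoring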

oc get co authentication -o yaml
- lastTransitionTime: "2020-08-26T13:28:57Z"
  message: |-
    RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)
    RouteHealthDegraded: failed to GET route: dial tcp: i/o timeout
  reason: RouteHealth_FailedGet::RouteStatus_FailedCreate

oc get pods -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE                                        NOMINATED NODE   READINESS GATES
router-default-679d8ff997-9pcfm   1/1     Running   0          19h   10.131.2.5   ip-10-0-52-215.us-east-2.compute.internal   <none>           <none>
router-default-679d8ff997-hdm92   1/1     Running   5          19h   10.130.2.3   ip-10-0-57-180.us-east-2.compute.internal   <none>           <none>
huiran-mac:script hrwang$ oc rsh -n openshift-ingress router-default-679d8ff997-9pcfm
sh-4.2$ 
sh-4.2$ curl   10.130.2.3 -v
* About to connect() to 10.130.2.3 port 80 (#0)
*   Trying 10.130.2.3...
^C
sh-4.2$ exit
exit


Checked pod-to-pod connectivity: pods on the same node can communicate, but pods on different nodes cannot (a scripted version of this check is sketched after the transcript below).
oc get pods -o wide
NAME          READY   STATUS    RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
hello-6ldjz   1/1     Running   0          48m     10.130.0.20   ip-10-0-49-72.us-east-2.compute.internal    <none>           <none>
hello-6vbbl   1/1     Running   0          48m     10.131.2.11   ip-10-0-52-215.us-east-2.compute.internal   <none>           <none>
hello-lqx4s   1/1     Running   0          48m     10.130.2.7    ip-10-0-57-180.us-east-2.compute.internal   <none>           <none>
hello-pod     1/1     Running   0          2m35s   10.130.2.17   ip-10-0-57-180.us-east-2.compute.internal   <none>           <none>
hello-sphrv   1/1     Running   0          48m     10.128.0.19   ip-10-0-65-178.us-east-2.compute.internal   <none>           <none>
hello-xxmkd   1/1     Running   0          48m     10.129.0.48   ip-10-0-60-174.us-east-2.compute.internal   <none>           <none>
huiran-mac:script hrwang$ oc project
No project has been set. Pass a project name to make that the default.
huiran-mac:script hrwang$ oc rsh hello-pod
/ # curl  10.130.2.7:8080
Hello-OpenShift-1 http-8080
/ # curl  10.129.0.48:8080
curl: (7) Failed to connect to 10.129.0.48 port 8080: Operation timed out
/ # curl 10.129.0.48:8080
^C
/ # exit
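
(Not part of the original report: a scripted version of the cross-node check above. It assumes the current project is the one containing the hello pods shown earlier, that they serve on port 8080, and that curl is available in the hello-pod image, as the transcript confirms.)
# From hello-pod, try every pod IP in the current project on port 8080
for ip in $(oc get pods -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}'); do
  echo "== $ip =="
  oc exec hello-pod -- curl -s -m 5 "$ip:8080" || echo "FAILED to reach $ip"
done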

Expected results:
The upgrade should complete successfully.

Comment 4 Lalatendu Mohanty 2020-08-27 12:11:30 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 5 Lalatendu Mohanty 2020-08-27 12:54:25 UTC
From the affected cluster

$ oc get clusterversion -o yaml                                                                                                                                                                               
apiVersion: v1                                                                                                                                                                                                                                
items:                                                                                                                                                                                                                                        
- apiVersion: config.openshift.io/v1                                                                                                                                                                                                          
  kind: ClusterVersion                                                                                                                                                                                                                        
  metadata:                                     
    creationTimestamp: "2020-08-26T10:56:50Z"
    generation: 2         
    name: version                                                                                                      
    resourceVersion: "998596"            
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: f02ba116-e2ab-4398-a2db-c0f3425c130a
  spec:                                         
    channel: stable-4.4                     
    clusterID: b348e412-eb0f-473b-9fbb-ce26ec1f231d                                                                                                                                                                                           
    desiredUpdate:                       
      force: true     
      image: registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-08-25-142845
      version: ""    
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:                    
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-08-26T11:19:54Z"
      message: Done applying 4.4.18
      status: "True"              
      type: Available
    - lastTransitionTime: "2020-08-27T10:48:05Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-08-26T13:04:14Z"
      message: 'Working towards 4.4.0-0.nightly-2020-08-25-142845: 17% complete'
      status: "True"
      type: Progressing
    - lastTransitionTime: "2020-08-26T10:56:56Z"
      message: 'Unable to retrieve available updates: currently installed version
        4.4.0-0.nightly-2020-08-25-142845 not found in the "stable-4.4" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2020-08-26T11:21:36Z"
      message: |-
        Multiple cluster operators cannot be upgradeable:
        * Cluster operator service-catalog-apiserver cannot be upgraded: _Managed: Upgradeable: The Service Catalog is deprecated, upgrades are not possible. Please visit this link for further details: https://docs.openshift.com/container-platform/4.4/applications/service_brokers/installing-service-catalog.html
        * Cluster operator service-catalog-controller-manager cannot be upgraded: _Managed: Upgradeable: The Service Catalog is deprecated, upgrades are not possible. Please visit this link for further details: https://docs.openshift.com/container-platform/4.4/applications/service_brokers/installing-service-catalog.html
        * Cluster operator marketplace cannot be upgraded: DeprecatedAPIsInUse: The cluster has custom OperatorSource/CatalogSourceConfig, which are deprecated in future versions. Please visit this link for further deatils: https://docs.openshift.com/container-platform/4.4/release_notes/ocp-4-4-release-notes.html#ocp-4-4-marketplace-apis-deprecated
      reason: ClusterOperatorsNotUpgradeable
      status: "False"
      type: Upgradeable
    desired:
      force: true
      image: registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-08-25-142845
      version: 4.4.0-0.nightly-2020-08-25-142845
    history:
    - completionTime: null
      image: registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-08-25-142845
      startedTime: "2020-08-26T13:04:14Z"
      state: Partial
      verified: false
      version: 4.4.0-0.nightly-2020-08-25-142845
    - completionTime: "2020-08-26T11:19:54Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:3250780b072ed81a561350a5f3e4688076bd7eceb29991caf5d4fd0a5c03b7a5
      startedTime: "2020-08-26T10:56:56Z"
      state: Completed
      verified: false
      version: 4.4.18
    observedGeneration: 2
    versionHash: a6jbAqQCiOU=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 7 Ben Bennett 2020-08-27 16:11:43 UTC
Response to comment 4

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  ovn-kubernetes is not supported on 4.4 except for one customer with a support exception, and they do not upgrade clusters on 4.4; they reinstall.
What is the impact?  Is it serious enough to warrant blocking edges?
  Cross-node SDN is broken on ovn-kube.  But 4.4 ovn-kube is not supported.
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  I suspect restarting the ovn-kube controllers would fix the issue, but that is untested (a possible restart sketch is included at the end of this comment).
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  It would be if this always fails.  It may also be a one-time flake.

However, this should not block the edges, since the only supported ovn-kube customer does not upgrade their test clusters on 4.4 and is mostly using 4.5 now anyway.
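
(Not from the original comment: if someone wanted to try the untested restart suggested above, a minimal sketch follows. It assumes the default openshift-ovn-kubernetes namespace and the usual app=ovnkube-master / app=ovnkube-node pod labels; the daemonsets recreate the deleted pods.)
# Restart the ovn-kube control-plane pods
oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-master
# Restart the per-node ovn-kube pods
oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-node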

Comment 8 Lalatendu Mohanty 2020-08-27 16:23:28 UTC
As per the previous comment, removing the upgrade blocker keyword.

Comment 11 Surya Seetharaman 2020-09-02 11:29:47 UTC
This bug appears to be a duplicate of bug https://bugzilla.redhat.com/show_bug.cgi?id=1874385, except that this one is reported against 4.4; the symptoms are the same. Thanks to Alex for pointing me to that bug.

Looking at the apiserver pods in the above cluster, this bug also appears to involve the same transport-closing issue:

controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
W0902 08:18:41.941578       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.xx.xx:xxxx 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.61.114:2379: connect: connection refused". Reconnecting...
I0902 08:18:41.943152       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
W0902 08:18:41.943420       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.xx.xx:xxxx 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.61.114:2379: connect: connection refused". Reconnecting...
I0902 08:18:41.944314       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

As said by Alex in https://bugzilla.redhat.com/show_bug.cgi?id=1874385#c8 the issue seems to have been understood.

Comment 12 Surya Seetharaman 2020-09-02 11:35:32 UTC
This is the related bug whose PR appears to be what broke pod-to-pod networking between nodes: https://bugzilla.redhat.com/show_bug.cgi?id=1868392

Comment 13 Ben Bennett 2020-09-03 13:10:18 UTC
This will be fixed in master and 4.5. However, upgrades of OVN clusters on 4.4 are not supported.

Comment 14 Alexander Constantinescu 2020-09-03 14:02:32 UTC
Re-opening, as we decided the backport should be easy enough to do given the reward of making 4.4 -> 4.4.N upgrades reliable.

Comment 15 Ben Bennett 2020-09-03 14:17:46 UTC

*** This bug has been marked as a duplicate of bug 1875438 ***

