Bug 1858834 - [OVN] 4.5.3 upgrade failure: some ovnkube-master and ovnkube-node pods are in CrashLoopBackOff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Federico Paolinelli
QA Contact: Anurag saxena
URL:
Whiteboard:
Duplicates: 1859365 (view as bug list)
Depends On:
Blocks: 1858712
 
Reported: 2020-07-20 14:06 UTC by W. Trevor King
Modified: 2021-04-05 17:46 UTC (History)
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1858712
Environment:
Last Closed: 2020-10-27 16:16:00 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 723 0 None closed Bug 1858834: Revert ovn db consistency check. 2021-02-08 04:19:02 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:16:31 UTC

Description W. Trevor King 2020-07-20 14:06:13 UTC
+++ This bug was initially created as a clone of Bug #1858712 +++

Version-Release number of selected component (if applicable):

Base version: 4.5.2-x86_64
Target version: 4.5.0-0.nightly-2020-07-18-024505


How reproducible: 
always

Steps to Reproduce:
Use the upgrade CI to trigger an upgrade from 4.5.2-x86_64 to 4.5.0-0.nightly-2020-07-18-024505:
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/3787/console

The upgrade ultimately failed.


Actual Result:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.2     True        True          3h19m   Unable to apply 4.5.0-0.nightly-2020-07-18-024505: an unknown error has occurred: MultipleErrors

oc get co network -o yaml
status:
  conditions:
  - lastTransitionTime: "2020-07-20T04:43:27Z"
    message: |-
      DaemonSet "openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - last change 2020-07-20T04:33:13Z
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2020-07-20T04:32:51Z
    reason: RolloutHung
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-07-20T03:27:52Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2020-07-20T04:31:05Z"
    message: |-
      DaemonSet "openshift-multus/multus-admission-controller" update is rolling out (1 out of 3 updated)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 1 nodes)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" update is rolling out (4 out of 6 updated)
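The RolloutHung condition above means the two DaemonSets stopped converging. One way to confirm that directly from the CLI is a small helper around the standard `oc rollout status` command (the helper name is ours; the command and namespace come from this report):

```shell
# Small helper (name is ours) around the standard `oc rollout status`
# command, which blocks until the DaemonSet converges or the timeout expires.
check_rollout() {
  oc -n openshift-ovn-kubernetes rollout status "ds/$1" --timeout=60s
}

# Usage, against the two hung DaemonSets from the condition above:
#   check_rollout ovnkube-master
#   check_rollout ovnkube-node
```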


One multus pod is in ContainerCreating with the error:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_multus-admission-controller-2wxwn_openshift-multus_7dc31947-f76a-4207-9288-38778b17eafe_0(be44e539169c537a7867a026c71109fb91d3b7086580e86471269665b7548578): Multus: [openshift-multus/multus-admission-controller-2wxwn]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking confAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-multus/multus-admission-controller-2wxwn] failed to configure pod interface: failure in plugging pod interface: failed to run 'ovs-vsctl --timeout=30 add-port br-int be44e539169c537 -- set interface be44e539169c537 external_ids:attached_mac=2e:a8:2b:82:00:04 external_ids:iface-id=openshift-multus_multus-admission-controller-2wxwn external_ids:ip_addresses=10.130.0.3/23 external_ids:sandbox=be44e539169c537a7867a026c71109fb91d3b7086580e86471269665b7548578': exit status 1
 "ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)\n"
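The "No such file or directory" in that ovs-vsctl failure means the local ovsdb-server Unix socket is absent, so CNI setup cannot program br-int. A minimal node-side check, assuming the standard Open vSwitch socket path from the error message:

```shell
# Check whether the local ovsdb-server socket exists before blaming the CNI
# plugin; the path matches the one in the error message above.
SOCK=/var/run/openvswitch/db.sock

if [ -S "$SOCK" ]; then
  echo "ovsdb-server socket present"
  # With the socket in place, this is the same client the CNI plugin uses:
  # ovs-vsctl --timeout=30 show
else
  echo "ovsdb-server socket missing; check the ovs-node pod on this host"
fi
```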
 
oc get pods -o wide -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovnkube-master-gjs4n   2/4     CrashLoopBackOff   39         3h      10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovnkube-master-gx75c   4/4     Running            0          3h2m    10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovnkube-master-mn5vc   4/4     Running            0          3h1m    10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovnkube-node-2bwg2     2/2     Running            0          3h2m    10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovnkube-node-7vszf     1/2     CrashLoopBackOff   33         3h1m    10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovnkube-node-8clcn     2/2     Running            0          3h49m   10.0.135.242   ip-10-0-135-242.us-east-2.compute.internal   <none>           <none>
ovnkube-node-srwtf     2/2     Running            0          3h2m    10.0.165.9     ip-10-0-165-9.us-east-2.compute.internal     <none>           <none>
ovnkube-node-v8j5p     2/2     Running            0          3h2m    10.0.214.190   ip-10-0-214-190.us-east-2.compute.internal   <none>           <none>
ovnkube-node-zslrj     2/2     Running            0          4h5m    10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovs-node-cwfz2         1/1     Running            0          3h      10.0.135.242   ip-10-0-135-242.us-east-2.compute.internal   <none>           <none>
ovs-node-fvblz         1/1     Running            0          3h1m    10.0.178.208   ip-10-0-178-208.us-east-2.compute.internal   <none>           <none>
ovs-node-t8vl6         1/1     Running            0          179m    10.0.204.230   ip-10-0-204-230.us-east-2.compute.internal   <none>           <none>
ovs-node-thbn7         1/1     Running            0          3h2m    10.0.136.16    ip-10-0-136-16.us-east-2.compute.internal    <none>           <none>
ovs-node-vp2rp         1/1     Running            0          3h2m    10.0.214.190   ip-10-0-214-190.us-east-2.compute.internal   <none>           <none>
ovs-node-vwrps         1/1     Running            0          179m    10.0.165.9     ip-10-0-165-9.us-east-2.compute.internal     <none> 

oc logs -c ovnkube-master ovnkube-master-gjs4n -n openshift-ovn-kubernetes
+ [[ -f /env/_master ]]
+ hybrid_overlay_flags=
+ [[ -n '' ]]
++ ovn-nbctl --pidfile=/var/run/ovn/ovn-nbctl.pid --detach -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:10.0.136.16:9641,ssl:10.0.178.208:9641,ssl:10.0.204.230:9641
2020-07-20T05:39:13Z|00184|stream_ssl|WARN|SSL_connect: unexpected SSL connection close
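The stream_ssl warning shows the ovn-nbctl daemon failing to hold an SSL connection to the northbound db raft cluster. A hedged way to probe each endpoint's TLS listener with openssl (the helper name is ours; endpoints and cert paths are copied from the ovn-nbctl invocation above, and this must run from a pod where those certs are mounted):

```shell
# Hypothetical helper to probe one OVN NB db TLS endpoint; the cert/key/CA
# paths match the ovn-nbctl flags in the log above.
probe_nbdb() {
  local ep=$1
  timeout 5 openssl s_client \
    -connect "$ep:9641" \
    -cert /ovn-cert/tls.crt -key /ovn-cert/tls.key \
    -CAfile /ovn-ca/ca-bundle.crt </dev/null 2>&1 | head -n 3
}

# Usage (from a master pod with the client certs mounted), against the
# three endpoints listed in the --db argument above:
#   for ep in 10.0.136.16 10.0.178.208 10.0.204.230; do probe_nbdb "$ep"; done
```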


oc logs -c ovnkube-node ovnkube-node-7vszf -n openshift-ovn-kubernetes
.....
I0720 06:48:08.321931  404343 ovs.go:249] exec(122): stdout: "not connected\n"
I0720 06:48:08.321965  404343 ovs.go:250] exec(122): stderr: ""
I0720 06:48:08.321981  404343 node.go:116] node ip-10-0-204-230.us-east-2.compute.internal connection status = not connected
I0720 06:48:08.792527  404343 ovs.go:246] exec(123): /usr/bin/ovs-appctl --timeout=15 -t /var/run/ovn/ovn-controller.93522.ctl connection-status
I0720 06:48:08.820573  404343 ovs.go:249] exec(123): stdout: "not connected\n"
I0720 06:48:08.820724  404343 ovs.go:250] exec(123): stderr: ""
I0720 06:48:08.820748  404343 node.go:116] node ip-10-0-204-230.us-east-2.compute.internal connection status = not connected
I0720 06:48:08.820767  404343 ovs.go:246] exec(124): /usr/bin/ovs-appctl --timeout=15 -t /var/run/ovn/ovn-controller.93522.ctl connection-status
I0720 06:48:08.847272  404343 ovs.go:249] exec(124): stdout: "not connected\n"
I0720 06:48:08.847306  404343 ovs.go:250] exec(124): stderr: ""
I0720 06:48:08.847321  404343 node.go:116] node ip-10-0-204-230.us-east-2.compute.internal connection status = not connected
F0720 06:48:08.847355  404343 ovnkube.go:129] timed out waiting sbdb for node ip-10-0-204-230.us-east-2.compute.internal: timed out waiting for the condition
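The log above shows ovnkube-node polling ovn-controller for its southbound db connection status until it gives up. The same check can be run by hand with ovs-appctl; this sketch globs for the pid-stamped control socket rather than hard-coding the pid from the log (the helper name and the glob are our assumptions):

```shell
# Re-create the readiness check ovnkube-node is looping on: ask the local
# ovn-controller for its southbound db connection status. The .ctl filename
# embeds ovn-controller's pid, so glob for it instead of hard-coding it.
sbdb_status() {
  local ctl
  ctl=$(ls /var/run/ovn/ovn-controller.*.ctl 2>/dev/null | head -n 1)
  if [ -z "$ctl" ]; then
    echo "no ovn-controller control socket found"
    return
  fi
  ovs-appctl --timeout=15 -t "$ctl" connection-status
}

# Usage (inside the ovnkube-node pod): sbdb_status
# A healthy node reports "connected"; this cluster was stuck on "not connected".
```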

Comment 1 W. Trevor King 2020-07-20 14:18:19 UTC
Closing as a duplicate of bug 1837953; see [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1858712#c16

*** This bug has been marked as a duplicate of bug 1837953 ***

Comment 2 W. Trevor King 2020-07-20 15:34:50 UTC
Un-duping, based on Scott's change to bug 1858712.

Comment 7 Anurag saxena 2020-07-29 14:57:43 UTC
Upgrade from 4.6.0-0.nightly-2020-07-25-065959 to 4.6.0-0.nightly-2020-07-25-091217 looks good. Verifying this bug based on the same observations.

Comment 8 Aniket Bhat 2020-07-31 19:53:29 UTC
*** Bug 1859365 has been marked as a duplicate of this bug. ***

Comment 10 errata-xmlrpc 2020-10-27 16:16:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 11 W. Trevor King 2021-04-05 17:46:52 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

