Bug 1860522

Summary: OVN-kubernetes --- servers get stuck after reboot on ovnkube-node pods
Product: OpenShift Container Platform Reporter: Andreas Karis <akaris>
Component: Networking Assignee: Dumitru Ceara <dceara>
Networking sub component: ovn-kubernetes QA Contact: Ross Brattain <rbrattai>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: anusaxen, bbennett, dcbw, ealcaniz, mmichels, nusiddiq, rkhan, trozet
Version: 4.4 Keywords: Reopened
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1867183 Environment:
Last Closed: 2020-10-27 16:17:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1867183, 1867185, 1878099    
Bug Blocks:    

Description Andreas Karis 2020-07-24 22:07:10 UTC
Description of problem:
OVN-kubernetes --- servers get stuck after reboot on ovnkube-node pods
The customer can reproduce this by rebooting the nodes of their 4.4.11 cluster.




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 11 Dan Williams 2020-07-29 18:31:43 UTC
Just a note: if we ever see transaction failures or database inconsistency reported in the northd logs or anywhere else, we need to get *all* the master DBs.

Comment 16 Dan Williams 2020-07-29 21:33:51 UTC
@Andreas, is the cluster still 4.4.11?

For those playing along at home, 4.4.11 has:

ovn2.13.x86_64 0:2.13.0-31.el7fdp
openvswitch2.13.x86_64 0:2.13.0-29.el7fdp

Comment 27 Dan Williams 2020-08-07 19:35:03 UTC
*** Bug 1861087 has been marked as a duplicate of this bug. ***

Comment 30 Ben Bennett 2020-08-24 14:49:16 UTC
Reopening so we can use this bug to update the OVS version to get the fix.

Comment 31 Dan Williams 2020-09-08 20:16:11 UTC
OCP 4.6 is using RHEL8 content now, and openvswitch2.13-2.13.0-52.el8fdp is the latest available in OCP repos. So we currently have this fix in OCP 4.6.

We do *not* have this fix in earlier OCP versions yet, but that is a simple matter of agreeing as a team that we are comfortable with tagging the given OVS versions into OCP 4.4 and 4.5.

In any case, we'll get the fix when FDP 20.G ships at the end of September.
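
For anyone checking which build a node or image carries, a package string like the one quoted above can be split into name, version, and release. This is a minimal sketch, assuming the plain name-version-release form used in this comment (no epoch, no trailing arch); `parse_nvr` and `build_number` are hypothetical helper names, not part of any Red Hat tooling. It only splits the string and says nothing about which builds contain the fix beyond what is stated above.

```python
from typing import NamedTuple


class Package(NamedTuple):
    name: str
    version: str
    release: str


def parse_nvr(nvr: str) -> Package:
    """Split 'name-version-release' on the last two hyphens.

    Assumes no epoch prefix and no '.arch' suffix (hypothetical helper).
    """
    name, version, release = nvr.rsplit("-", 2)
    return Package(name, version, release)


def build_number(pkg: Package) -> int:
    """Return the leading integer of the release field, e.g. 52 for '52.el8fdp'."""
    return int(pkg.release.split(".", 1)[0])


pkg = parse_nvr("openvswitch2.13-2.13.0-52.el8fdp")
print(pkg.name, pkg.version, pkg.release, build_number(pkg))
```

Build numbers from different dist tags (el7fdp vs. el8fdp) are separate streams, so this sketch should only be used to compare packages within the same stream.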

Comment 33 Ross Brattain 2020-09-14 14:10:18 UTC
Tested on 4.6.0-0.ci-2020-09-13-124145 with openvswitch2.13-2.13.0-52.el8fdp.x86_64

Rebooting master succeeded; the cluster recovered and is healthy, with no "violations" in the ovnkube-master logs.

Blocked waiting on the correct RPM versions in nightly builds.

Comment 34 Ross Brattain 2020-09-14 21:40:52 UTC
Verified on 4.6.0-0.nightly-2020-09-12-230035

Comment 36 errata-xmlrpc 2020-10-27 16:17:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 37 errata-xmlrpc 2020-10-27 16:20:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196