Bug 2091634

Summary: OVS 2.15 stops handling traffic once ovs-dpctl(2.17.2) is used against it
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Reporter: Patryk Diak <pdiak>
Assignee: Patryk Diak <pdiak>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: ctrautma, dcbw, jhsiao, ralongi, surya, vpickard, vrutkovs
Version: 4.11
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Linux
Last Closed: 2022-08-10 11:15:13 UTC
Type: Bug

Description Patryk Diak 2022-05-30 14:34:15 UTC
Description of problem:

The latest OVN-Kubernetes image ships with the openvswitch2.17-2.17.0-8.el8fdp.x86_64 RPM.
Starting with commit https://github.com/openshift/cluster-network-operator/commit/d38a80372198130a1a43603d036f1b202ae8fbcb, ovnkube-node pods periodically scrape OVS metrics by running commands such as:
  - ovs-dpctl dump-dps
  - ovs-dpctl show ovs-system

When running on OKD with openvswitch-2.15.0-8.fc35.x86_64 installed on the host, OVS stops handling all pod traffic as soon as `ovs-dpctl` (2.17-2) is called from the ovnkube-node pod.

There is a version mismatch, but the outcome seems disproportionately severe.

Version-Release number of selected component (if applicable):
- openvswitch2.17-2
- openvswitch-2.15.0

How reproducible:
Always

Steps to Reproduce:
1. Set up the latest OKD cluster without https://github.com/openshift/cluster-network-operator/commit/d38a80372198130a1a43603d036f1b202ae8fbcb so that installation succeeds
2. Run "ovs-dpctl show" or "ovs-dpctl dump-dps" from any ovnkube-node pod
3. Pods running on the node that hosts the ovnkube-node pod from the previous step lose all connectivity.

I think the same behavior should be observed just by using ovs-dpctl 2.17-2 with OVS 2.15 outside of a cluster, but I have not tried that.
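
A minimal sketch of how such a mismatch could be detected before any datapath command is run. It only parses version banners; the function names are hypothetical, and it assumes the banners look like the "(Open vSwitch) X.Y.Z" lines that `ovs-dpctl --version` and `ovs-appctl version` normally print. Comparing only major.minor is an illustrative choice, not a rule taken from OVS.

```python
import re
from typing import Tuple

# Assumed banner shape, e.g. "ovs-vswitchd (Open vSwitch) 2.15.0".
_VERSION_RE = re.compile(r"\(Open vSwitch\)\s+(\d+)\.(\d+)\.(\d+)")


def parse_ovs_version(banner: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from an OVS version banner."""
    match = _VERSION_RE.search(banner)
    if match is None:
        raise ValueError(f"unrecognized OVS version banner: {banner!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def tools_match_daemon(tool_banner: str, daemon_banner: str) -> bool:
    """Return True when the tool and the daemon share major.minor."""
    return parse_ovs_version(tool_banner)[:2] == parse_ovs_version(daemon_banner)[:2]


if __name__ == "__main__":
    # Banners are hard-coded here; in practice they would be captured
    # from the container's tools and from the host's ovs-vswitchd.
    tool = "ovs-dpctl (Open vSwitch) 2.17.2"
    daemon = "ovs-vswitchd (Open vSwitch) 2.15.0"
    if not tools_match_daemon(tool, daemon):
        print("version mismatch: avoid calling ovs-dpctl directly")
```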

Comment 1 Dan Williams 2022-05-31 16:43:31 UTC
The root cause is that ovs-dpctl uses the same interface as ovs-vswitchd to talk to the kernel, and that includes the capability setup that vswitchd performs when it starts. Running a newer dpctl therefore overwrites what the vswitchd on the host has configured. Fixing this would require changes deep in the call stack, so it is unlikely to be fixed in OVS.

Instead, ovn-kubernetes can use `ovs-appctl dpctl/*` with the same subcommands. These talk to vswitchd on the host (instead of directly to the kernel) and return the same information. The response formatting should also be the same as with a direct dpctl call.

This will prevent the issue since vswitchd will be the only thing opening the netlink channel to the kernel.
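
As a rough illustration of that substitution, the sketch below maps an `ovs-dpctl <subcommand>` invocation to its `ovs-appctl dpctl/<subcommand>` form. The helper names are hypothetical and this is not the ovn-kubernetes implementation; it assumes the pod can reach the host vswitchd's control socket so that `ovs-appctl` can connect to it.

```python
import subprocess
from typing import List


def appctl_argv(dpctl_subcommand: str, *args: str) -> List[str]:
    """Build the ovs-appctl equivalent of `ovs-dpctl <subcommand> [args...]`."""
    if not dpctl_subcommand or "/" in dpctl_subcommand:
        raise ValueError(f"invalid dpctl subcommand: {dpctl_subcommand!r}")
    # vswitchd exposes the dpctl commands under the "dpctl/" prefix.
    return ["ovs-appctl", f"dpctl/{dpctl_subcommand}", *args]


def run_dpctl_via_vswitchd(dpctl_subcommand: str, *args: str,
                           timeout: float = 5.0) -> str:
    """Ask the running vswitchd for datapath info instead of opening netlink."""
    result = subprocess.run(
        appctl_argv(dpctl_subcommand, *args),
        capture_output=True,
        text=True,
        check=True,       # raise CalledProcessError on a non-zero exit
        timeout=timeout,  # do not hang if vswitchd is unresponsive
    )
    return result.stdout


if __name__ == "__main__":
    # Replacements for the two commands named in the description.
    print(run_dpctl_via_vswitchd("dump-dps"))
    print(run_dpctl_via_vswitchd("show", "ovs-system"))
```

Keeping the argument construction separate from the subprocess call makes the mapping checkable on a machine without OVS installed.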

Comment 5 Vadim Rutkovsky 2022-06-04 14:11:09 UTC
*** Bug 2089148 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2022-08-10 11:15:13 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069