Bug 1874696

Summary: Openshift-sdn starts ovs instance in container, instead of using the systemd service on node.
Product: OpenShift Container Platform
Component: Networking
Sub component: openshift-sdn
Reporter: Peng Liu <pliu>
Assignee: Juan Luis de Sousa-Valadas <jdesousa>
QA Contact: Ross Brattain <rbrattai>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: aprabhak, bbennett, danili, deads, dosmith, dsanzmor, eslutsky, gzaidman, huirwang, jcallen, jdesousa, mfojtik, mjahangi, mjtarsel, mtarsel, sdodson, tnozicka, vvoronko, wking, wsun, xtian, yanyang, yunjiang, zzhao
Version: 4.6
Keywords: NeedsTestCase, TestBlocker
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: TechnicalReleaseBlocker
Type: Bug
Last Closed: 2020-12-16 14:06:10 UTC
Bug Blocks: 1854306

Description Peng Liu 2020-09-02 03:09:56 UTC
Description of problem:
Openshift-sdn starts ovs instance in container, instead of using the systemd service on node.

Version-Release number of selected component (if applicable):
4.6.0-0.ci-2020-09-01-180917

How reproducible:


Steps to Reproduce:
1. Create a cluster with openshift-sdn as the cluster network provider.
2. Check the log of the ovs-xxx pod in the openshift-sdn namespace (see the example commands below).
3. Look at the messages at the beginning of the log.
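
For example, something along these lines can be used to find an ovs pod and look at the start of its log (the pod name is only an illustration, borrowed from comment 2):

  oc get pods -n openshift-sdn -o wide | grep ovs
  oc logs -n openshift-sdn ovs-2gvr5 2>&1 | head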

Actual results:

Failed to connect to bus: No data available
openvswitch is running in container
Starting ovsdb-server.
...

Expected results:

openvswitch is running in systemd
...

Additional info:
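
A rough, hypothetical sketch of the kind of check that appears to produce the "openvswitch is running in container/systemd" messages above (not the actual entrypoint script, which is rendered by the network operator and may differ):

  # Hypothetical detection logic only, reconstructed from the log messages in this report.
  if systemctl is-active ovs-vswitchd >/dev/null 2>&1; then
      echo "openvswitch is running in systemd"
      # just tail the host's OVS logs; do not start a second instance
  else
      # From a container that cannot reach the host's systemd, systemctl
      # fails ("Failed to connect to bus: No data available"), so the
      # script falls back to launching OVS inside the container.
      echo "openvswitch is running in container"
      /usr/share/openvswitch/scripts/ovs-ctl start   # placeholder start command
  fi

The bug is that on these builds the fallback path is taken, even though in 4.6 OVS is supposed to run as the host's openvswitch systemd service.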

Comment 1 Ross Brattain 2020-09-08 01:51:29 UTC
Was this part of the SDN to OVN migration? I don't think we have seen this in our SDN testing.

On 4.6.0-0.nightly-2020-09-07-104243 the systemctl commands seem to run okay in the OVS pods without hostPID. Does something change in the pods during migration?
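
As a quick way to poke at this (a sketch only; <ovs-pod> is a placeholder for a real pod name), systemctl can be exercised from inside an OVS pod with something like:

  oc exec -n openshift-sdn <ovs-pod> -- systemctl is-active ovs-vswitchd
  oc exec -n openshift-sdn <ovs-pod> -- systemctl status openvswitch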

Comment 2 Peng Liu 2020-09-08 06:07:58 UTC
Ross, this is not about migration. It happens in a freshly installed cluster with openshift-sdn. Did you check the beginning of the ovs pod log, e.g. with `oc logs -n openshift-sdn ovs-2gvr5 2>&1 | less`? I tested again with 4.6.0-0.ci-2020-09-06-060329 and still got the same issue.

Comment 3 Ross Brattain 2020-09-08 18:47:46 UTC
I also reproduced on 4.6.0-0.ci-2020-09-06-060329, and the OVS pod YAML seems to be the same as in 4.6.0-0.nightly-2020-09-07-104243, so something else is breaking systemctl.

Maybe it is SELinux? We should try to find the root cause before switching to hostPID.

Comment 4 Peng Liu 2020-09-09 10:57:00 UTC
Reproduced on 4.6.0-0.ci-2020-09-09-061410 too. I don't think it's related to SELinux; I tried `setenforce 0` on the node and it doesn't help.
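
For anyone else trying to rule SELinux in or out (a sketch; only the setenforce step was actually run here):

  getenforce                   # current mode (Enforcing/Permissive)
  setenforce 0                 # switch to permissive, as tried above
  ausearch -m avc -ts recent   # look for recent AVC denials in the audit log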

Comment 5 Juan Luis de Sousa-Valadas 2020-09-15 13:41:39 UTC
*** Bug 1874820 has been marked as a duplicate of this bug. ***

Comment 6 Aniket Bhat 2020-09-16 13:24:44 UTC
*** Bug 1878707 has been marked as a duplicate of this bug. ***

Comment 7 Wei Sun 2020-09-16 13:41:50 UTC
Adding the TestBlocker keyword, since the duplicate bug https://bugzilla.redhat.com/show_bug.cgi?id=1878707 is blocking QE's testing.

Comment 8 Evgeny Slutsky 2020-09-17 06:52:20 UTC
We've verified in this PR [0] that this bug is also affecting all oVirt CI presubmit jobs.

[0] https://github.com/openshift/machine-config-operator/pull/2090

Comment 9 Ben Bennett 2020-09-17 13:12:01 UTC
*** Bug 1879524 has been marked as a duplicate of this bug. ***

Comment 10 Aniket Bhat 2020-09-17 13:26:47 UTC
*** Bug 1879591 has been marked as a duplicate of this bug. ***

Comment 11 Dan Williams 2020-09-17 19:43:00 UTC
*** Bug 1880110 has been marked as a duplicate of this bug. ***

Comment 12 Dan Winship 2020-09-18 15:30:51 UTC
*** Bug 1880425 has been marked as a duplicate of this bug. ***

Comment 13 Prashanth Sundararaman 2020-09-18 18:41:40 UTC
*** Bug 1878657 has been marked as a duplicate of this bug. ***

Comment 14 Ben Bennett 2020-09-22 13:06:58 UTC
*** Bug 1881188 has been marked as a duplicate of this bug. ***

Comment 18 zhaozhanqi 2020-09-30 11:07:44 UTC
I just gave it a try with CI build 4.6.0-0.ci-2020-09-30-071822; OVS is running on the host:

openvswitch is running in systemd
==> /host/var/log/openvswitch/ovs-vswitchd.log <==
2020-09-30T08:42:15.251Z|00007|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2020-09-30T08:42:15.254Z|00008|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.13.2
2020-09-30T08:43:02.250Z|00009|memory|INFO|50004 kB peak resident set size after 47.1 seconds
2020-09-30T08:43:28.037Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-09-30T08:43:28.074Z|00002|ovs_numa|INFO|Discovered 4 CPU cores on NUMA node 0
2020-09-30T08:43:28.074Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 4 CPU cores
2020-09-30T08:43:28.074Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-09-30T08:43:28.074Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-09-30T08:43:28.075Z|00006|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2020-09-30T08:43:28.078Z|00007|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.13.2

We need a nightly build to double-confirm.
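
One way to double-confirm on a node (a sketch; <node-name> is a placeholder) is to check that the host units are active:

  oc debug node/<node-name> -- chroot /host systemctl is-active ovsdb-server ovs-vswitchd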

Comment 19 Ross Brattain 2020-10-01 02:03:19 UTC
Something went wrong when upgrading from 4.5.0-0.nightly-2020-09-28-124031 to 4.6.0-0.nightly-2020-09-30-145011 on Azure.

The nodes that switched to host OVS are stuck at SchedulingDisabled, probably because the openvswitch service is disabled.

So maybe the MCO didn't enable it, which would not be a bug in the CNO.

Upgrade from 4.5.0-0.nightly-2020-09-28-124031 to 4.6.0-0.nightly-2020-09-30-091659 succeeded on AWS.


sh-4.4# systemctl status openvswitch
● openvswitch.service - Open vSwitch
Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
Active: inactive (dead)

sh-4.4# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
  Drop-In: /etc/systemd/system/ovs-vswitchd.service.d
           └─10-ovs-vswitchd-restart.conf
   Active: inactive (dead)

sh-4.4# ls -l /etc/systemd/system/multi-user.target.wants/openvswitch.service
ls: cannot access '/etc/systemd/system/multi-user.target.wants/openvswitch.service': No such file or directory
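
If the MCO really didn't enable the unit, a manual workaround on the node (a sketch only; it does not address the underlying issue) would presumably be:

  systemctl enable --now openvswitch
  systemctl status ovs-vswitchd ovsdb-server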

container logs

openvswitch is running in container
Starting ovsdb-server.
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
Configuring Open vSwitch system IDs.
Enabling remote OVSDB managers.
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
Starting ovs-vswitchd.
Enabling remote OVSDB managers.
2020-09-30 21:11:30 info: Loading previous flows ...
2020-09-30 21:11:30 info: Adding br0 if it doesn't exist ...
2020-09-30 21:11:30 info: Created br0, now adding flows ...
+ ovs-ofctl add-tlv-map br0 ''
2020-09-30T21:11:30Z|00001|vconn|WARN|unix:/var/run/openvswitch/br0.mgmt: version negotiation failed (we support version 0x01, peer supports version 0x04)
ovs-ofctl: br0: failed to connect to socket (Broken pipe)
+ ovs-ofctl -O OpenFlow13 add-groups br0 /var/run/openvswitch/ovs-save.nVSt9McrJW/br0.groups.dump
+ ovs-ofctl -O OpenFlow13 replace-flows br0 /var/run/openvswitch/ovs-save.nVSt9McrJW/br0.flows.dump
+ rm -rf /var/run/openvswitch/ovs-save.nVSt9McrJW
2020-09-30 21:11:30 info: Done restoring the existing flows ...
2020-09-30 21:11:30 info: Remove other config ...
2020-09-30 21:11:30 info: Removed other config ...
2020-09-30T21:11:29.736Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2020-09-30T21:11:29.741Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.11.5
2020-09-30T21:11:29.748Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2020-09-30T21:11:29.748Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2020-09-30T21:11:30.131Z|00031|bridge|INFO|bridge br0: added interface vethec1140a0 on port 4
2020-09-30T21:11:30.132Z|00032|bridge|INFO|bridge br0: added interface br0 on port 65534
2020-09-30T21:11:30.132Z|00033|bridge|INFO|bridge br0: added interface vetha0b45f6d on port 6
2020-09-30T21:11:30.132Z|00034|bridge|INFO|bridge br0: added interface vethe77d62ce on port 10
2020-09-30T21:11:30.132Z|00035|bridge|INFO|bridge br0: using datapath ID 00001a0aabc20744
2020-09-30T21:11:30.132Z|00036|connmgr|INFO|br0: added service controller "punix:/var/run/openvswitch/br0.mgmt"
2020-09-30T21:11:30.135Z|00037|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.5
2020-09-30T21:11:30.197Z|00038|vconn|WARN|unix#0: version negotiation failed (we support version 0x04, peer supports version 0x01)
2020-09-30T21:11:30.197Z|00039|rconn|WARN|br0<->unix#0: connection dropped (Protocol error)
2020-09-30T21:11:30.252Z|00040|connmgr|INFO|br0<->unix#6: 111 flow_mods in the last 0 s (111 adds)
2020-09-30T21:11:39.747Z|00005|memory|INFO|7496 kB peak resident set size after 10.0 seconds
2020-09-30T21:11:39.747Z|00006|memory|INFO|cells:652 json-caches:1 monitors:2 sessions:2
2020-09-30T21:11:40.138Z|00041|memory|INFO|59596 kB peak resident set size after 10.3 seconds
2020-09-30T21:11:40.138Z|00042|memory|INFO|handlers:1 ports:10 revalidators:1 rules:115 udpif keys:132
2020-09-30T21:18:15.278Z|00043|connmgr|INFO|br0<->unix#58: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:18:15.309Z|00044|connmgr|INFO|br0<->unix#61: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:18:15.339Z|00045|bridge|INFO|bridge br0: deleted interface veth956fb903 on port 3
2020-09-30T21:18:26.104Z|00046|bridge|INFO|bridge br0: added interface vethd3e3323a on port 12
2020-09-30T21:18:26.142Z|00047|connmgr|INFO|br0<->unix#64: 5 flow_mods in the last 0 s (5 adds)
2020-09-30T21:18:26.183Z|00048|connmgr|INFO|br0<->unix#67: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:05.860Z|00049|connmgr|INFO|br0<->unix#132: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:06.011Z|00050|connmgr|INFO|br0<->unix#137: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:28:06.121Z|00051|bridge|INFO|bridge br0: deleted interface vetha0b45f6d on port 6
2020-09-30T21:28:06.256Z|00052|connmgr|INFO|br0<->unix#141: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:06.400Z|00053|connmgr|INFO|br0<->unix#144: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:28:06.725Z|00054|bridge|INFO|bridge br0: deleted interface veth7add96e2 on port 9
2020-09-30T21:28:06.878Z|00055|connmgr|INFO|br0<->unix#147: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:07.031Z|00056|connmgr|INFO|br0<->unix#150: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:28:07.334Z|00057|bridge|INFO|bridge br0: deleted interface vethec1140a0 on port 4
2020-09-30T21:28:07.471Z|00058|connmgr|INFO|br0<->unix#153: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:07.594Z|00059|connmgr|INFO|br0<->unix#156: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:28:07.675Z|00060|bridge|INFO|bridge br0: deleted interface vethe77d62ce on port 10
2020-09-30T21:28:08.166Z|00061|connmgr|INFO|br0<->unix#159: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:08.249Z|00062|connmgr|INFO|br0<->unix#162: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:28:08.376Z|00063|bridge|INFO|bridge br0: deleted interface vethac02a791 on port 8
2020-09-30 21:28:16 info: Saving flows ...
ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
rm: cannot remove '/var/run/openvswitch/ovs-vswitchd.pid': No such file or directory
openvswitch is running in systemd
(objectpath '/org/freedesktop/systemd1/job/796',)
tail: cannot open '/host/var/log/openvswitch/ovs-vswitchd.log' for reading: No such file or directory
tail: cannot open '/host/var/log/openvswitch/ovsdb-server.log' for reading: No such file or directory
tail: '/host/var/log/openvswitch/ovsdb-server.log' has appeared;  following new file
2020-09-30T21:28:56.511Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2020-09-30T21:28:56.518Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.2
2020-09-30T21:28:58.661Z|00003|jsonrpc|WARN|unix#4: receive error: Connection reset by peer
2020-09-30T21:28:58.661Z|00004|reconnect|WARN|unix#4: connection dropped (Connection reset by peer)
2020-09-30T21:29:00.177Z|00005|jsonrpc|WARN|unix#7: receive error: Connection reset by peer
2020-09-30T21:29:00.177Z|00006|reconnect|WARN|unix#7: connection dropped (Connection reset by peer)
2020-09-30T21:29:06.526Z|00007|memory|INFO|7640 kB peak resident set size after 10.0 seconds
2020-09-30T21:29:06.526Z|00008|memory|INFO|cells:122 monitors:2 sessions:1
2020-09-30T21:29:44.579Z|00009|jsonrpc|WARN|unix#19: receive error: Connection reset by peer
2020-09-30T21:29:44.579Z|00010|reconnect|WARN|unix#19: connection dropped (Connection reset by peer)
2020-09-30T21:29:47.487Z|00011|jsonrpc|WARN|unix#21: receive error: Connection reset by peer
2020-09-30T21:29:47.487Z|00012|reconnect|WARN|unix#21: connection dropped (Connection reset by peer)
2020-09-30T21:29:52.488Z|00013|jsonrpc|WARN|unix#22: receive error: Connection reset by peer
2020-09-30T21:29:52.488Z|00014|reconnect|WARN|unix#22: connection dropped (Connection reset by peer)
2020-09-30T21:29:57.488Z|00015|jsonrpc|WARN|unix#23: receive error: Connection reset by peer
2020-09-30T21:29:57.488Z|00016|reconnect|WARN|unix#23: connection dropped (Connection reset by peer)
2020-09-30T21:30:02.484Z|00017|jsonrpc|WARN|unix#24: receive error: Connection reset by peer
2020-09-30T21:30:02.484Z|00018|reconnect|WARN|unix#24: connection dropped (Connection reset by peer)
2020-09-30T21:30:07.487Z|00019|jsonrpc|WARN|unix#25: receive error: Connection reset by peer
2020-09-30T21:30:07.487Z|00020|reconnect|WARN|unix#25: connection dropped (Connection reset by peer)
2020-09-30T21:30:12.494Z|00021|jsonrpc|WARN|unix#26: receive erro

Comment 20 Ross Brattain 2020-10-01 02:48:55 UTC
Upgrade failure issue https://bugzilla.redhat.com/show_bug.cgi?id=1884101

Comment 21 Scott Dodson 2020-10-01 13:21:44 UTC
*** Bug 1875534 has been marked as a duplicate of this bug. ***

Comment 22 Ross Brattain 2020-10-01 17:00:17 UTC
We no longer start OVS in the container, so the behavior described in this issue has been fixed; marking this Verified.
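
(For reference, a sketch of the check; <ovs-pod> is a placeholder. The first lines of each ovs pod log should now report the host-managed mode:

  oc logs -n openshift-sdn <ovs-pod> | head -5
  # expected near the top: "openvswitch is running in systemd")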

Comment 25 errata-xmlrpc 2020-10-27 16:36:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 26 Selim Jahangir 2020-12-16 05:25:21 UTC
Hi,
The issue has not been fixed in many OCP 4.5.x upgrades or in fresh installations of OCP 4.6.x.
I have a few cases where customers are upgrading from OCP 4.5 and openvswitch failed. Manually restarting openvswitch fixed the issue.

But in the case of a fresh installation of OCP 4.6, the installation does not progress until openvswitch is restarted manually, which customers do not want to do for a fresh install. Customers want a permanent solution.

Regards,
selim
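
(For anyone hitting this on an affected node, the manual workaround mentioned above is, roughly, restarting the host service; <node> is a placeholder:

  oc debug node/<node> -- chroot /host systemctl restart openvswitch

This is only a workaround, not the permanent solution customers are asking for.)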

Comment 28 Juan Luis de Sousa-Valadas 2020-12-16 14:06:10 UTC
Hi Selim,
Please do not reopen bugs that are CLOSED ERRATA; file a new one instead.