Bug 1874696
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Openshift-sdn starts ovs instance in container, instead of using the systemd service on node. | | |
| Product: | OpenShift Container Platform | Reporter: | Peng Liu <pliu> |
| Component: | Networking | Assignee: | Juan Luis de Sousa-Valadas <jdesousa> |
| Networking sub component: | openshift-sdn | QA Contact: | Ross Brattain <rbrattai> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | aprabhak, bbennett, danili, deads, dosmith, dsanzmor, eslutsky, gzaidman, huirwang, jcallen, jdesousa, mfojtik, mjahangi, mjtarsel, mtarsel, sdodson, tnozicka, vvoronko, wking, wsun, xtian, yanyang, yunjiang, zzhao |
| Version: | 4.6 | Keywords: | NeedsTestCase, TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | TechnicalReleaseBlocker | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-12-16 14:06:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1854306 | | |
Description
Peng Liu, 2020-09-02 03:09:56 UTC
Was this part of the SDN to OVN migration? I don't think we have seen this in our SDN testing. On 4.6.0-0.nightly-2020-09-07-104243 the systemctl commands seem to run okay in the OVS pods without hostPID. Does something change in the pods during migration?

Ross, this is not about migration. It happens on a freshly installed cluster with openshift-sdn. Did you check the beginning of the ovs pod log, e.g. with `oc logs -n openshift-sdn ovs-2gvr5 2>&1 | less`? I tested again with 4.6.0-0.ci-2020-09-06-060329 and still got the same issue.

I also reproduced this on 4.6.0-0.ci-2020-09-06-060329, and the OVS pod YAML seems to be the same as on 4.6.0-0.nightly-2020-09-07-104243, so something else is breaking systemctl. Maybe it is SELinux? We should try to root-cause this before switching to hostPID.

Reproduced on 4.6.0-0.ci-2020-09-09-061410 too.

I don't think it's related to SELinux. I tried `setenforce 0` on the node; it doesn't help.

*** Bug 1874820 has been marked as a duplicate of this bug. ***

*** Bug 1878707 has been marked as a duplicate of this bug. ***

Adding the TestBlocker keyword, since the duplicate bug https://bugzilla.redhat.com/show_bug.cgi?id=1878707 is blocking QE's testing.

We've verified in this PR [0] that this bug is also affecting all oVirt CI presubmit jobs.
[0] https://github.com/openshift/machine-config-operator/pull/2090

*** Bug 1879524 has been marked as a duplicate of this bug. ***

*** Bug 1879591 has been marked as a duplicate of this bug. ***

*** Bug 1880110 has been marked as a duplicate of this bug. ***

*** Bug 1880425 has been marked as a duplicate of this bug. ***

*** Bug 1878657 has been marked as a duplicate of this bug. ***

*** Bug 1881188 has been marked as a duplicate of this bug. ***

I just gave this a try with CI build 4.6.0-0.ci-2020-09-30-071822. OVS is running on the host:

openvswitch is running in systemd
==> /host/var/log/openvswitch/ovs-vswitchd.log <==
2020-09-30T08:42:15.251Z|00007|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2020-09-30T08:42:15.254Z|00008|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.13.2
2020-09-30T08:43:02.250Z|00009|memory|INFO|50004 kB peak resident set size after 47.1 seconds
2020-09-30T08:43:28.037Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-09-30T08:43:28.074Z|00002|ovs_numa|INFO|Discovered 4 CPU cores on NUMA node 0
2020-09-30T08:43:28.074Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 4 CPU cores
2020-09-30T08:43:28.074Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-09-30T08:43:28.074Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-09-30T08:43:28.075Z|00006|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2020-09-30T08:43:28.078Z|00007|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.13.2

Need a nightly build to double-confirm.

Something went wrong when upgrading from 4.5.0-0.nightly-2020-09-28-124031 to 4.6.0-0.nightly-2020-09-30-145011 on Azure. The nodes that switched to host OVS are stuck at SchedulingDisabled, probably because the openvswitch service is disabled. So maybe MCO didn't enable it, which wouldn't be a bug in CNO. An upgrade from 4.5.0-0.nightly-2020-09-28-124031 to 4.6.0-0.nightly-2020-09-30-091659 succeeded on AWS.
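A minimal sketch of the checks discussed above (looking at the beginning of an ovs pod's log and at the host OVS systemd units). The node name below is a placeholder and the commands are illustrative, not taken from the bug:

```shell
# Minimal sketch; NODE is a placeholder for an affected node's name.
NODE=worker-0.example.com

# The first lines of the ovs pod log state whether OVS was started inside
# the container or found already running under systemd on the host.
OVS_POD=$(oc -n openshift-sdn get pods -o name --field-selector spec.nodeName="$NODE" | grep ovs-)
oc -n openshift-sdn logs "$OVS_POD" 2>&1 | head -n 20

# Cross-check the host units from a debug pod.
oc debug node/"$NODE" -- chroot /host systemctl is-active openvswitch ovsdb-server ovs-vswitchd
oc debug node/"$NODE" -- chroot /host systemctl is-enabled openvswitch
```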
sh-4.4# systemctl status openvswitch
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
sh-4.4# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
  Drop-In: /etc/systemd/system/ovs-vswitchd.service.d
           └─10-ovs-vswitchd-restart.conf
   Active: inactive (dead)
sh-4.4# ls -l /etc/systemd/system/multi-user.target.wants/openvswitch.service
ls: cannot access '/etc/systemd/system/multi-user.target.wants/openvswitch.service': No such file or directory

Container logs:

openvswitch is running in container
Starting ovsdb-server.
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
Configuring Open vSwitch system IDs.
Enabling remote OVSDB managers.
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
Starting ovs-vswitchd.
Enabling remote OVSDB managers.
2020-09-30 21:11:30 info: Loading previous flows ...
2020-09-30 21:11:30 info: Adding br0 if it doesn't exist ...
2020-09-30 21:11:30 info: Created br0, now adding flows ...
+ ovs-ofctl add-tlv-map br0 ''
2020-09-30T21:11:30Z|00001|vconn|WARN|unix:/var/run/openvswitch/br0.mgmt: version negotiation failed (we support version 0x01, peer supports version 0x04)
ovs-ofctl: br0: failed to connect to socket (Broken pipe)
+ ovs-ofctl -O OpenFlow13 add-groups br0 /var/run/openvswitch/ovs-save.nVSt9McrJW/br0.groups.dump
+ ovs-ofctl -O OpenFlow13 replace-flows br0 /var/run/openvswitch/ovs-save.nVSt9McrJW/br0.flows.dump
+ rm -rf /var/run/openvswitch/ovs-save.nVSt9McrJW
2020-09-30 21:11:30 info: Done restoring the existing flows ...
2020-09-30 21:11:30 info: Remove other config ...
2020-09-30 21:11:30 info: Removed other config ...
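The systemctl output at the start of this comment shows openvswitch.service disabled with no multi-user.target.wants symlink, which is what raised the question of whether MCO enabled the unit. A minimal, read-only sketch of gathering that enablement state from a debug pod (the node name is a placeholder):

```shell
# Placeholder node name; this only inspects state, it does not change anything.
NODE=worker-0.example.com

oc debug node/"$NODE" -- chroot /host sh -c '
  systemctl is-enabled openvswitch ovs-vswitchd ovsdb-server;
  ls -l /etc/systemd/system/multi-user.target.wants/ | grep -i openvswitch \
    || echo "no openvswitch symlink under multi-user.target.wants"
'
```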
2020-09-30T21:11:29.736Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2020-09-30T21:11:29.741Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.11.5
2020-09-30T21:11:29.748Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2020-09-30T21:11:29.748Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2020-09-30T21:11:30.131Z|00031|bridge|INFO|bridge br0: added interface vethec1140a0 on port 4
2020-09-30T21:11:30.132Z|00032|bridge|INFO|bridge br0: added interface br0 on port 65534
2020-09-30T21:11:30.132Z|00033|bridge|INFO|bridge br0: added interface vetha0b45f6d on port 6
2020-09-30T21:11:30.132Z|00034|bridge|INFO|bridge br0: added interface vethe77d62ce on port 10
2020-09-30T21:11:30.132Z|00035|bridge|INFO|bridge br0: using datapath ID 00001a0aabc20744
2020-09-30T21:11:30.132Z|00036|connmgr|INFO|br0: added service controller "punix:/var/run/openvswitch/br0.mgmt"
2020-09-30T21:11:30.135Z|00037|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.5
2020-09-30T21:11:30.197Z|00038|vconn|WARN|unix#0: version negotiation failed (we support version 0x04, peer supports version 0x01)
2020-09-30T21:11:30.197Z|00039|rconn|WARN|br0<->unix#0: connection dropped (Protocol error)
2020-09-30T21:11:30.252Z|00040|connmgr|INFO|br0<->unix#6: 111 flow_mods in the last 0 s (111 adds)
2020-09-30T21:11:39.747Z|00005|memory|INFO|7496 kB peak resident set size after 10.0 seconds
2020-09-30T21:11:39.747Z|00006|memory|INFO|cells:652 json-caches:1 monitors:2 sessions:2
2020-09-30T21:11:40.138Z|00041|memory|INFO|59596 kB peak resident set size after 10.3 seconds
2020-09-30T21:11:40.138Z|00042|memory|INFO|handlers:1 ports:10 revalidators:1 rules:115 udpif keys:132
2020-09-30T21:18:15.278Z|00043|connmgr|INFO|br0<->unix#58: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:18:15.309Z|00044|connmgr|INFO|br0<->unix#61: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:18:15.339Z|00045|bridge|INFO|bridge br0: deleted interface veth956fb903 on port 3
2020-09-30T21:18:26.104Z|00046|bridge|INFO|bridge br0: added interface vethd3e3323a on port 12
2020-09-30T21:18:26.142Z|00047|connmgr|INFO|br0<->unix#64: 5 flow_mods in the last 0 s (5 adds)
2020-09-30T21:18:26.183Z|00048|connmgr|INFO|br0<->unix#67: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:05.860Z|00049|connmgr|INFO|br0<->unix#132: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:06.011Z|00050|connmgr|INFO|br0<->unix#137: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:28:06.121Z|00051|bridge|INFO|bridge br0: deleted interface vetha0b45f6d on port 6
2020-09-30T21:28:06.256Z|00052|connmgr|INFO|br0<->unix#141: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:06.400Z|00053|connmgr|INFO|br0<->unix#144: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:28:06.725Z|00054|bridge|INFO|bridge br0: deleted interface veth7add96e2 on port 9
2020-09-30T21:28:06.878Z|00055|connmgr|INFO|br0<->unix#147: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:07.031Z|00056|connmgr|INFO|br0<->unix#150: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:28:07.334Z|00057|bridge|INFO|bridge br0: deleted interface vethec1140a0 on port 4
2020-09-30T21:28:07.471Z|00058|connmgr|INFO|br0<->unix#153: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:07.594Z|00059|connmgr|INFO|br0<->unix#156: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:28:07.675Z|00060|bridge|INFO|bridge br0: deleted interface vethe77d62ce on port 10
2020-09-30T21:28:08.166Z|00061|connmgr|INFO|br0<->unix#159: 2 flow_mods in the last 0 s (2 deletes)
2020-09-30T21:28:08.249Z|00062|connmgr|INFO|br0<->unix#162: 4 flow_mods in the last 0 s (4 deletes)
2020-09-30T21:28:08.376Z|00063|bridge|INFO|bridge br0: deleted interface vethac02a791 on port 8
2020-09-30 21:28:16 info: Saving flows ...
ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
rm: cannot remove '/var/run/openvswitch/ovs-vswitchd.pid': No such file or directory
openvswitch is running in systemd
(objectpath '/org/freedesktop/systemd1/job/796',)
tail: cannot open '/host/var/log/openvswitch/ovs-vswitchd.log' for reading: No such file or directory
tail: cannot open '/host/var/log/openvswitch/ovsdb-server.log' for reading: No such file or directory
tail: '/host/var/log/openvswitch/ovsdb-server.log' has appeared; following new file
2020-09-30T21:28:56.511Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2020-09-30T21:28:56.518Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.2
2020-09-30T21:28:58.661Z|00003|jsonrpc|WARN|unix#4: receive error: Connection reset by peer
2020-09-30T21:28:58.661Z|00004|reconnect|WARN|unix#4: connection dropped (Connection reset by peer)
2020-09-30T21:29:00.177Z|00005|jsonrpc|WARN|unix#7: receive error: Connection reset by peer
2020-09-30T21:29:00.177Z|00006|reconnect|WARN|unix#7: connection dropped (Connection reset by peer)
2020-09-30T21:29:06.526Z|00007|memory|INFO|7640 kB peak resident set size after 10.0 seconds
2020-09-30T21:29:06.526Z|00008|memory|INFO|cells:122 monitors:2 sessions:1
2020-09-30T21:29:44.579Z|00009|jsonrpc|WARN|unix#19: receive error: Connection reset by peer
2020-09-30T21:29:44.579Z|00010|reconnect|WARN|unix#19: connection dropped (Connection reset by peer)
2020-09-30T21:29:47.487Z|00011|jsonrpc|WARN|unix#21: receive error: Connection reset by peer
2020-09-30T21:29:47.487Z|00012|reconnect|WARN|unix#21: connection dropped (Connection reset by peer)
2020-09-30T21:29:52.488Z|00013|jsonrpc|WARN|unix#22: receive error: Connection reset by peer
2020-09-30T21:29:52.488Z|00014|reconnect|WARN|unix#22: connection dropped (Connection reset by peer)
2020-09-30T21:29:57.488Z|00015|jsonrpc|WARN|unix#23: receive error: Connection reset by peer
2020-09-30T21:29:57.488Z|00016|reconnect|WARN|unix#23: connection dropped (Connection reset by peer)
2020-09-30T21:30:02.484Z|00017|jsonrpc|WARN|unix#24: receive error: Connection reset by peer
2020-09-30T21:30:02.484Z|00018|reconnect|WARN|unix#24: connection dropped (Connection reset by peer)
2020-09-30T21:30:07.487Z|00019|jsonrpc|WARN|unix#25: receive error: Connection reset by peer
2020-09-30T21:30:07.487Z|00020|reconnect|WARN|unix#25: connection dropped (Connection reset by peer)
2020-09-30T21:30:12.494Z|00021|jsonrpc|WARN|unix#26: receive erro

Upgrade failure issue: https://bugzilla.redhat.com/show_bug.cgi?id=1884101

*** Bug 1875534 has been marked as a duplicate of this bug. ***

We no longer start OVS in the container, so the behavior described in this issue has been fixed; marking Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196
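A minimal spot check of the verified behavior described above, i.e. that the ovs pods now defer to the host systemd service instead of launching ovsdb-server and ovs-vswitchd themselves. This is a sketch only; it assumes the ovs DaemonSet pods in openshift-sdn carry the app=ovs label and that their log prints the "openvswitch is running in systemd" line seen in the logs earlier:

```shell
# Assumes the ovs pods carry the app=ovs label (adjust if they are labelled differently).
for pod in $(oc -n openshift-sdn get pods -l app=ovs -o name); do
  echo "== $pod"
  oc -n openshift-sdn logs "$pod" 2>&1 | grep -m1 'running in systemd' \
    || echo "no 'running in systemd' marker; OVS may still be started in the container"
done
```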
Hi, the issue has not been fixed: it still shows up in many OCP 4.5.x upgrades as well as in fresh installations of OCP 4.6.x.

I have a few cases where customers upgrading from OCP 4.5 hit openvswitch failures. Manually restarting openvswitch fixed the issue. On fresh installations of OCP 4.6, the installation does not move forward until openvswitch is restarted manually, which customers do not want to have to do on a fresh install. They want a permanent solution.

Regards,
Selim

Hi Selim, please do not reopen bugs that are CLOSED ERRATA. File a new one.
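For reference, a sketch of the manual workaround described in the comment above (re-enabling and restarting the host openvswitch service on an affected node). The node name is a placeholder; this is the commenter's stop-gap, not a supported fix, and the errata linked earlier remains the resolution path:

```shell
# Placeholder node name; re-enables and starts the host openvswitch service,
# then shows its status. Repeat per affected node.
NODE=worker-0.example.com
oc debug node/"$NODE" -- chroot /host systemctl enable --now openvswitch
oc debug node/"$NODE" -- chroot /host systemctl status openvswitch ovs-vswitchd --no-pager
```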