Description of problem:

It's easy to get openvswitch to segfault when changing the MTU of a vif port.

A customer of ours uses OVS with OSP10z3 and DPDK with jumbo frames. As vif ports do not currently inherit the MTU from the OVS bridge, the customer must run a cronjob to set the MTU on 'vhu*' ports when they come up. This results in ovs-vswitchd segfaulting very often.

E.g.:

  ovs-vsctl set interface vhu2fd7027c-33 mtu_request=9000

results in:

  Jul 05 12:52:46 tkll00p1 kernel: pmd459[6777]: segfault at 44 ip 00007fa334156dff sp 00007fa2517ef4d0 error 4 in ovs-vswitchd[7fa33407b000+3b1000]
  Jul 05 12:52:46 tkll00p1 systemd[1]: ovs-vswitchd.service: main process exited, code=killed, status=11/SEGV

In ovsdb-server.log we see a line such as this one:

  2017-07-05T17:52:47.975Z|00005|fatal_signal|WARN|terminating with signal 15 (Terminated)

Version-Release number of selected component (if applicable):
openvswitch-2.6.1-10.git20161206.el7fdp.x86_64

How reproducible:
From the field it seems there is about a 50% chance of a segfault when setting the MTU.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
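For context, the customer's cronjob workaround can be sketched roughly as follows. This is a minimal illustration, not the customer's actual job: the bridge name (br-int), the target MTU, and the set_vhu_mtu helper are all assumptions. On the affected openvswitch build, each mtu_request set issued below is what risks crashing ovs-vswitchd.

```shell
#!/bin/sh
# Illustrative sketch of a "set MTU on vhu* ports" cronjob.
# Bridge name and MTU value are assumptions for this example.

# Allow the ovs-vsctl binary to be overridden (e.g. for a dry run).
OVS=${OVS:-ovs-vsctl}

set_vhu_mtu() {
    # $1 = bridge name, $2 = target MTU.
    # Apply mtu_request to every vhost-user port (named vhu*) on the bridge.
    for port in $($OVS list-ports "$1" | grep '^vhu'); do
        $OVS set Interface "$port" mtu_request="$2"
    done
}

# A real cronjob would invoke something like:
#   set_vhu_mtu br-int 9000
```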
Most likely this needs upstream commit 546e57d44c473aac2915037f6906c9dd04294105. Will check to see if that is all.
Is it also possible to get a crash dump from the customer? That would confirm this is the issue.
I've posted an upstream fix for the crash reported: http://dpdk.org/ml/archives/dev/2017-August/072387.html
Hi,

How fast are we going to get that downstream, either as a hotfix or in the repos?

Thanks,
Andreas
Needs to be accepted upstream first - I don't know how long that will take, usually a few days to a week.
*** Bug 1477785 has been marked as a duplicate of this bug. ***
The fix has been applied in the upstream repository. Please build a new package with it included.
Brew build https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13784598
TAM and SA team met with Cisco 8/8. Cisco is open to providing us with the necessary hardware, or access to a lab with the hardware, or helping us determine if hardware we have is functionally equivalent. Ravi Anan (ravianan) is the Cisco engineer we need to contact about the test environment. Here are the notes from our meeting:

Current RH Software on Sprint Environment:
--RH OSP 10.z2
--OVS 2.6.1-3 beta (compiled with DPDK)

Current OVS Deployment:
The Open vSwitch version number does not necessarily imply which version of the DPDK library upstream used to compile it. Cisco does track which version of the DPDK library works with the VIC-1340. We don't know if OVS 2.6.1-3 uses a compatible DPDK library.

Action Item: Cisco to confirm which DPDK libraries work (are tested?) with the VIC-1340, and map DPDK library versions to OVS versions (or check upstream openvswitch.org?). Cisco PMD drivers are also upstreamed to openvswitch.org. Need to confirm which versions of OVS contain the correct DPDK library and correct PMD drivers. Need to confirm that we're using a stable branch of DPDK libraries that will accumulate support patches going forward.

Current RH QE Test Lab:
RH may lack the correct hardware in their lab to test OVS hotfixes.

Action Item: RH engineering will contact Ravi Anan to determine if our lab hardware is either the same as or compatible with the UCS-B200 and VIC-1340 (Jim Sisul will pass contact info to RH QE) and, if not, how to go about getting it for testing purposes.
Moving back to ASSIGNED. Due to lack of HW to verify and enable support, the ENIC PMD driver will be disabled for 10z4. fbl
I have reproduced this issue on the cisco machine; the detailed info follows.

ENV info:
Two machines: the cisco machine and dell-02. The cisco machine's NIC enp11s0 connects to dell-02's NIC p3p2. The cisco NIC's model is VIC 1225.

Steps:

(1) Did the following configuration on the cisco machine:

setenforce permissive
systemctl restart openvswitch
/usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
/usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x2
/usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem=1024,0
/usr/bin/ovs-vsctl --timeout 10 set Open_vSwitch . other_config:pmd-cpu-mask=1
ovs-vsctl --timeout 10 set Open_vSwitch . other_config:pmd-cpu-mask=30
/usr/bin/ovs-vsctl --timeout 10 add-br br0 -- set bridge br0 datapath_type=netdev
/usr/bin/ovs-vsctl --timeout 10 add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
/usr/bin/ovs-vsctl --timeout 10 add-port br0 dpdkvhostuser0 -- set Interface dpdkvhostuser0 type=dpdkvhostuser
sudo /usr/bin/ovs-ofctl -O OpenFlow13 --timeout 10 del-flows br0
/usr/bin/ovs-ofctl -O OpenFlow13 --timeout 10 add-flow br0 idle_timeout=0,in_port=1,action=output:2
/usr/bin/ovs-ofctl -O OpenFlow13 --timeout 10 add-flow br0 idle_timeout=0,in_port=2,action=output:1
chmod 777 /var/run/openvswitch/dpdkvhostuser0

Inside the guest:

ip addr add 192.168.10.1/24 dev eth1

(2) Checked the ovs topo:

[root@cisco-c220m3-01 openvswitch]# ovs-vsctl show
9b108ee7-ef08-4c14-92fb-8814206bb881
    Bridge "br0"
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
        Port "dpdkvhostuser0"
            Interface "dpdkvhostuser0"
                type: dpdkvhostuser
        Port "br0"
            Interface "br0"
                type: internal
    ovs_version: "2.6.1"

(3) Did the following configuration on dell-02:

ip addr add 192.168.10.2/24 dev p3p2

and pinged the guest on the cisco machine:

ping -n -i 0.001 192.168.10.1

(4) Ran the following commands, one at a time, on the cisco machine:

ovs-vsctl set int dpdkvhostuser0 mtu_request=1900
ovs-vsctl set int dpdkvhostuser0 mtu_request=2000
ovs-vsctl set int dpdkvhostuser0 mtu_request=2200
ovs-vsctl set int dpdkvhostuser0 mtu_request=2300
ovs-vsctl set int dpdkvhostuser0 mtu_request=9000
ovs-vsctl set int dpdkvhostuser0 mtu_request=2000
ovs-vsctl set int dpdkvhostuser0 mtu_request=1500

(5) dell-02 could ping the guest on the cisco machine at first, but after the MTU was set to 1500 it could no longer ping the guest. /var/log/messages shows the following errors:

Aug 24 04:08:53 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=1900
Aug 24 04:08:53 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:08:53 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:08:57 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=2000
Aug 24 04:09:01 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=2100
Aug 24 04:09:06 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=2200
Aug 24 04:09:10 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=2300
Aug 24 04:09:15 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=9000
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: ovs|00098|dpdk|ERR|Insufficient memory to create memory pool for netdev dpdkvhostuser0, with MTU 9000 on socket 0
Aug 24 04:09:23 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=2000
Aug 24 04:09:29 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=1500
Aug 24 04:09:29 cisco-c220m3-01 kernel: pmd26[16958]: segfault at 44 ip 000055b6b514bdff sp 00007f6fcbff44d0 error 4 in ovs-vswitchd[55b6b5070000+3b1000]
Aug 24 04:09:30 cisco-c220m3-01 systemd: ovs-vswitchd.service: main process exited, code=killed, status=11/SEGV
Aug 24 04:09:30 cisco-c220m3-01 systemd: Stopping Open vSwitch...
Aug 24 04:09:30 cisco-c220m3-01 systemd: Stopped Open vSwitch.
Aug 24 04:09:30 cisco-c220m3-01 ovs-ctl: 2017-08-24T08:09:30Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.16701.ctl
Aug 24 04:09:30 cisco-c220m3-01 ovs-appctl: ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.16701.ctl
Aug 24 04:09:30 cisco-c220m3-01 ovs-ctl: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.16701.ctl" (Connection refused)
Aug 24 04:09:30 cisco-c220m3-01 systemd: Stopped Open vSwitch Forwarding Unit.
Aug 24 04:09:30 cisco-c220m3-01 systemd: Unit ovs-vswitchd.service entered failed state.
Aug 24 04:09:30 cisco-c220m3-01 systemd: ovs-vswitchd.service failed.
Aug 24 04:09:30 cisco-c220m3-01 systemd: Stopping Open vSwitch Database Unit...
Aug 24 04:09:30 cisco-c220m3-01 ovs-ctl: Exiting ovsdb-server (16665) [  OK  ]
Aug 24 04:09:30 cisco-c220m3-01 systemd: Stopped Open vSwitch Database Unit.
Aug 24 04:13:33 cisco-c220m3-01 ovs-vsctl: ovs|00002|fatal_signal|WARN|terminating with signal 2 (Interrupt)
Aug 24 04:13:42 cisco-c220m3-01 systemd: Starting Open vSwitch Database Unit...
Aug 24 04:13:42 cisco-c220m3-01 ovs-ctl: Starting ovsdb-server [  OK  ]
Aug 24 04:13:42 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=7.14.0
Aug 24 04:13:42 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . ovs-version=2.6.1 "external-ids:system-id=\"03e57cb6-7c2b-4f8d-84d3-381889410eaf\"" "external-ids:hostname=\"cisco-c220m3-01.rhts.eng.pek2.redhat.com\"" "system-type=\"rhel\"" "system-version=\"7.4\""
Aug 24 04:13:42 cisco-c220m3-01 ovs-ctl: Configuring Open vSwitch system IDs [  OK  ]

(6) The ovs service had stopped, and it could be started again successfully by hand.

The issue can also be reproduced by running the following script on the cisco machine:

i=1500
while true
do
    ovs-vsctl set int dpdkvhostuser0 mtu_request=$i
    sleep 1
    ovs-vsctl list int dpdkvhostuser0 | grep mtu
    echo; echo;
    i=`expr $i + 10`
    if [ $i -gt 9000 ]; then break; fi
done | tee -a /tmp/test-results
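As a side note on the "RING: Cannot reserve memory" and "Insufficient memory to create memory pool ... MTU 9000" messages: these appear consistent with the other_config:dpdk-socket-mem=1024,0 setting used in step (1), since 1024 MB of hugepage memory on socket 0 may be too small to allocate an additional mempool for a 9000-byte MTU. A hedged sketch of giving DPDK more socket memory before retrying the MTU sweep follows; the 2048 value is illustrative, not a tested recommendation (and separate from the segfault itself, which is a use-after-free in the vswitchd code):

```shell
# Illustrative only: enlarge the socket-0 hugepage allocation for DPDK,
# then restart openvswitch so ovs-vswitchd picks up the new value.
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem=2048,0
systemctl restart openvswitch
```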
I have verified that this passes using openvswitch-2.6.1-13.git20161206.el7ost.x86_64.rpm.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:2648