Bug 1468631

Summary: openvswitch segfaults when changing port VIF MTU and there's traffic flowing
Product: Red Hat OpenStack
Reporter: Vincent S. Cojot <vcojot>
Component: openvswitch
Assignee: Aaron Conole <aconole>
Status: CLOSED ERRATA
QA Contact: Yariv <yrachman>
Severity: urgent
Priority: urgent
Version: 10.0 (Newton)
CC: aconole, akaris, aloughla, amuller, apevec, atelang, atragler, chrisw, ctrautma, fbaudin, fleitner, gvaradar, ihrachys, jjoyce, jraju, jsisul, kzhang, oblaut, ovs-qe, pmannidi, ravianan, rhos-maint, rkhan, smykhail, srevivo, sukulkar, tli, vcojot
Target Milestone: z4
Keywords: Reopened, Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Fixed In Version: openvswitch-2.6.1-13.git20161206.el7ost
Clone Of:
: 1489010 (view as bug list) Environment:
Last Closed: 2017-09-06 16:59:49 UTC
Type: Bug

Description Vincent S. Cojot 2017-07-07 15:07:11 UTC
Description of problem:

It's easy to get openvswitch to segfault when changing the MTU of a vif port.
A customer of ours uses OVS with OSP10z3 and DPDK with Jumbo frames.
As VIF ports do not currently inherit the MTU from the OVS bridge, the customer must run a cronjob to set the MTU on 'vhu*' ports when they come up.
This results in ovs-vswitchd segfaulting very often:

E.g: ovs-vsctl set interface vhu2fd7027c-33 mtu_request=9000

Results in:

Jul 05 12:52:46 tkll00p1 kernel: pmd459[6777]: segfault at 44 ip 00007fa334156dff sp 00007fa2517ef4d0 error 4 in ovs-vswitchd[7fa33407b000+3b1000]

Jul 05 12:52:46 tkll00p1 systemd[1]: ovs-vswitchd.service: main process exited, code=killed, status=11/SEGV

In ovsdb-server.log we see a line such as this one:
2017-07-05T17:52:47.975Z|00005|fatal_signal|WARN|terminating with signal 15 (Terminated)
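
The customer's cronjob is not shown in this report. The following is only a minimal sketch of what such a workaround might look like, assuming the ports are matched by the 'vhu*' name prefix mentioned above. The function name set_vhu_mtu and the OVS_VSCTL override variable are hypothetical and exist only for illustration. The sketch skips ports whose mtu column already reports the wanted value, which may reduce how often a port carrying traffic gets reconfigured; it does not avoid the crash on affected builds.

```shell
#!/bin/sh
# Hypothetical workaround sketch: request an MTU on every vhu* interface.
# OVS_VSCTL is an illustrative override so the command can be stubbed.
OVS_VSCTL=${OVS_VSCTL:-ovs-vsctl}

set_vhu_mtu() {
    want=$1
    # --bare prints one interface name per record, separated by blank lines.
    "$OVS_VSCTL" --bare --columns=name list Interface |
    while read -r name; do
        case $name in
            vhu*) ;;          # only vhost-user ports named vhu*
            *) continue ;;    # skip bridges, dpdk ports, blank lines
        esac
        # Skip ports that already report the wanted MTU.
        cur=$("$OVS_VSCTL" get Interface "$name" mtu </dev/null)
        [ "$cur" = "$want" ] && continue
        "$OVS_VSCTL" set Interface "$name" mtu_request="$want" </dev/null
    done
}

# Example call (for instance from cron): set_vhu_mtu 9000
```

On an affected openvswitch build, the final set command is the same operation that triggers the segfault, so this is an illustration of the workaround, not a fix.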

Version-Release number of selected component (if applicable):

openvswitch-2.6.1-10.git20161206.el7fdp.x86_64

How reproducible:

From the field it seems there's about a 50% chance of it segfaulting when setting the MTU.



Comment 1 Aaron Conole 2017-07-07 20:17:16 UTC
Most likely this needs commit 546e57d44c473aac2915037f6906c9dd04294105.

Will check to see if that is all.

Comment 2 Aaron Conole 2017-07-07 20:20:29 UTC
Is it also possible to get a crash dump from the customer?  That would confirm this is the issue.

Comment 9 Aaron Conole 2017-08-02 18:03:50 UTC
I've posted an upstream fix for the crash reported:

http://dpdk.org/ml/archives/dev/2017-August/072387.html

Comment 10 Andreas Karis 2017-08-02 18:33:15 UTC
Hi,

How fast are we going to have that downstream? Either as a hotfix or in the repos?

Thanks,

Andreas

Comment 11 Aaron Conole 2017-08-02 19:05:01 UTC
Needs to be accepted upstream first - I don't know how long that will take, usually a few days to a week.

Comment 12 Aaron Conole 2017-08-03 01:08:29 UTC
*** Bug 1477785 has been marked as a duplicate of this bug. ***

Comment 21 Ihar Hrachyshka 2017-08-04 15:26:50 UTC
The fix was applied in the upstream repository. Please build a new package with it included.

Comment 37 Jim Sisul 2017-08-09 13:53:17 UTC
TAM and SA team met with Cisco 8/8.  Cisco is open to providing us with the necessary hardware, or access to a lab with the hardware, or helping us determine if hardware we have is functionally equivalent.

Ravi Anan (ravianan) is the Cisco engineer we need to contact about the test environment.

Here are the notes from our meeting:

Current RH Software on Sprint Environment:
--RH OSP 10.z2
--OVS 2.6.1-3 beta (Compiled with DPDK)

Current OVS Deployment: Open vSwitch version number does not necessarily imply what version of DPDK library the upstream used to compile it.  Cisco does track what version of DPDK library works with VIC-1340.  We don't know if OVS 2.6.1-3 uses a compatible DPDK library.

Action Item: Cisco to confirm what DPDK libraries work (tested?) with VIC-1340, map DPDK library to OVS version (or check upstream openvswitch.org?).  Cisco PMD drivers also upstreamed to openvswitch.org.  Need to confirm which versions of OVS contain the correct DPDK library and correct PMD drivers.  Need to confirm that we're using a stable branch of DPDK libraries that will accumulate support patches going forward.

Current RH QE Test Lab:  RH may lack the correct hardware in their lab to test OVS hotfixes.

Action Item:  RH engineering will contact Ravi Anan to determine if our lab hardware is either the same as or compatible with UCS-B200 and VIC-1340 (Jim Sisul will pass contact info to RH QE) and if not, how to go about getting it for testing purposes.

Comment 42 Flavio Leitner 2017-08-21 15:48:55 UTC
Moving back to ASSIGNED.

Due to lack of HW to verify and enable support, the ENIC PMD driver will be disabled for 10z4.

fbl

Comment 49 liting 2017-08-24 09:39:30 UTC
I have reproduced this issue on a Cisco machine. Details follow.

ENV info:
Two machines: the Cisco machine and dell-02.
The Cisco machine's NIC enp11s0 is connected to dell-02's NIC p3p2.
The Cisco NIC model is VIC 1225.

Steps:
(1) Applied the following configuration on the Cisco machine:
setenforce permissive
systemctl restart openvswitch
/usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
/usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x2
/usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem=1024,0
/usr/bin/ovs-vsctl --timeout 10 set Open_vSwitch . other_config:pmd-cpu-mask=1
ovs-vsctl --timeout 10 set Open_vSwitch . other_config:pmd-cpu-mask=30
/usr/bin/ovs-vsctl --timeout 10 add-br br0 -- set bridge br0 datapath_type=netdev
/usr/bin/ovs-vsctl --timeout 10 add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
/usr/bin/ovs-vsctl --timeout 10 add-port br0 dpdkvhostuser0 -- set Interface dpdkvhostuser0 type=dpdkvhostuser
sudo /usr/bin/ovs-ofctl -O OpenFlow13 --timeout 10 del-flows br0
/usr/bin/ovs-ofctl -O OpenFlow13 --timeout 10 add-flow br0 idle_timeout=0,in_port=1,action=output:2
/usr/bin/ovs-ofctl -O OpenFlow13 --timeout 10 add-flow br0 idle_timeout=0,in_port=2,action=output:1
chmod 777 /var/run/openvswitch/dpdkvhostuser0

inside guest:
ip addr add 192.168.10.1/24 dev eth1
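
The core masks in step (1) are hex strings, so pmd-cpu-mask=30 means 0x30, not decimal 30. As a rough aid for reading them, the sketch below expands a hex mask into the CPU core numbers it selects. The helper name mask_to_cores is hypothetical and not part of OVS or DPDK.

```shell
#!/bin/sh
# Hypothetical helper: list the CPU core numbers selected by a hex mask.
# Accepts the mask with or without a leading 0x.
mask_to_cores() {
    mask=$(( 0x${1#0x} ))
    core=0
    out=
    while [ "$mask" -ne 0 ]; do
        # Bit N set in the mask means core N is selected.
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="${out:+$out,}$core"
        fi
        mask=$(( mask >> 1 ))
        core=$(( core + 1 ))
    done
    echo "$out"
}
```

With the values used above, mask_to_cores 0x2 prints 1 (the dpdk-lcore-mask), mask_to_cores 1 prints 0, and mask_to_cores 30 prints 4,5. The second pmd-cpu-mask command overrides the first, so the PMD threads in this setup would be expected on cores 4 and 5.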

(2) Checked the OVS topology:
[root@cisco-c220m3-01 openvswitch]# ovs-vsctl show
9b108ee7-ef08-4c14-92fb-8814206bb881
    Bridge "br0"
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
        Port "dpdkvhostuser0"
            Interface "dpdkvhostuser0"
                type: dpdkvhostuser
        Port "br0"
            Interface "br0"
                type: internal
    ovs_version: "2.6.1"


(3) Applied the following configuration on dell-02:
ip addr add 192.168.10.2/24 dev p3p2

and pinged the guest on the Cisco machine:
ping -n -i 0.001 192.168.10.1

(4) Ran the following commands one at a time on the Cisco machine:
ovs-vsctl set int dpdkvhostuser0 mtu_request=1900
ovs-vsctl set int dpdkvhostuser0 mtu_request=2000
ovs-vsctl set int dpdkvhostuser0 mtu_request=2200
ovs-vsctl set int dpdkvhostuser0 mtu_request=2300
ovs-vsctl set int dpdkvhostuser0 mtu_request=9000
ovs-vsctl set int dpdkvhostuser0 mtu_request=2000
ovs-vsctl set int dpdkvhostuser0 mtu_request=1500

(5) At first dell-02 could ping the guest on the Cisco machine, but after the MTU was set to 1500 the ping failed. /var/log/messages shows the following errors:
Aug 24 04:08:53 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=1900
Aug 24 04:08:53 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:08:53 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:08:57 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=2000
Aug 24 04:09:01 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=2100
Aug 24 04:09:06 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=2200
Aug 24 04:09:10 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=2300
Aug 24 04:09:15 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=9000
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: RING: Cannot reserve memory
Aug 24 04:09:15 cisco-c220m3-01 ovs-vswitchd[16701]: ovs|00098|dpdk|ERR|Insufficient memory to create memory pool for netdev dpdkvhostuser0, with MTU 9000 on socket 0
Aug 24 04:09:23 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=2000
Aug 24 04:09:29 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int dpdkvhostuser0 mtu_request=1500
Aug 24 04:09:29 cisco-c220m3-01 kernel: pmd26[16958]: segfault at 44 ip 000055b6b514bdff sp 00007f6fcbff44d0 error 4 in ovs-vswitchd[55b6b5070000+3b1000]
Aug 24 04:09:30 cisco-c220m3-01 systemd: ovs-vswitchd.service: main process exited, code=killed, status=11/SEGV
Aug 24 04:09:30 cisco-c220m3-01 systemd: Stopping Open vSwitch...
Aug 24 04:09:30 cisco-c220m3-01 systemd: Stopped Open vSwitch.
Aug 24 04:09:30 cisco-c220m3-01 ovs-ctl: 2017-08-24T08:09:30Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.16701.ctl
Aug 24 04:09:30 cisco-c220m3-01 ovs-appctl: ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.16701.ctl
Aug 24 04:09:30 cisco-c220m3-01 ovs-ctl: ovs-appctl: cannot connect to "/var/run/openvswitch/ovs-vswitchd.16701.ctl" (Connection refused)
Aug 24 04:09:30 cisco-c220m3-01 systemd: Stopped Open vSwitch Forwarding Unit.
Aug 24 04:09:30 cisco-c220m3-01 systemd: Unit ovs-vswitchd.service entered failed state.
Aug 24 04:09:30 cisco-c220m3-01 systemd: ovs-vswitchd.service failed.
Aug 24 04:09:30 cisco-c220m3-01 systemd: Stopping Open vSwitch Database Unit...
Aug 24 04:09:30 cisco-c220m3-01 ovs-ctl: Exiting ovsdb-server (16665) [  OK  ]
Aug 24 04:09:30 cisco-c220m3-01 systemd: Stopped Open vSwitch Database Unit.
Aug 24 04:13:33 cisco-c220m3-01 ovs-vsctl: ovs|00002|fatal_signal|WARN|terminating with signal 2 (Interrupt)
Aug 24 04:13:42 cisco-c220m3-01 systemd: Starting Open vSwitch Database Unit...
Aug 24 04:13:42 cisco-c220m3-01 ovs-ctl: Starting ovsdb-server [  OK  ]
Aug 24 04:13:42 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=7.14.0
Aug 24 04:13:42 cisco-c220m3-01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . ovs-version=2.6.1 "external-ids:system-id=\"03e57cb6-7c2b-4f8d-84d3-381889410eaf\"" "external-ids:hostname=\"cisco-c220m3-01.rhts.eng.pek2.redhat.com\"" "system-type=\"rhel\"" "system-version=\"7.4\""
Aug 24 04:13:42 cisco-c220m3-01 ovs-ctl: Configuring Open vSwitch system IDs [  OK  ]


(6) The OVS service had stopped. It could be started again manually.
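
One way to check whether this crash matches the one in the original description is to compare where the faulting instruction sits inside the ovs-vswitchd binary. The kernel segfault line gives the instruction pointer (ip) and the mapping as [base+size], so the offset is ip minus base. The helper below is a sketch; the name segv_offset is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: given a kernel "segfault at ..." log line, print the
# faulting instruction's offset from the start of the mapping it occurred in.
segv_offset() {
    # Instruction pointer: the hex value after " ip ".
    ip=$(printf '%s\n' "$1" | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
    # Mapping base: the hex value before "+" in the last [base+size] group.
    base=$(printf '%s\n' "$1" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*\].*/\1/p')
    [ -n "$ip" ] && [ -n "$base" ] || return 1
    printf '0x%x\n' "$(( 0x$ip - 0x$base ))"
}
```

Applied to the segfault line in the log above and to the one in the original description, both give 0xdbdff, which is consistent with the two crashes hitting the same instruction. On x86, error 4 indicates a user-mode read of an unmapped page, and the faulting address 44 (hex) is close to zero, which suggests a read through a NULL pointer, though only a core dump would confirm that. With matching debuginfo installed, the offset could be passed to addr2line -e against the ovs-vswitchd binary, assuming the mapping starts at file offset 0.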

Running the following script on the Cisco machine also reproduces the issue:
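# Step mtu_request from 1500 to 9000 in increments of 10, one change per second.
i=1500
while true
do
   ovs-vsctl set int dpdkvhostuser0 mtu_request="$i"
   sleep 1
   # Show the mtu and mtu_request columns after each change.
   ovs-vsctl list int dpdkvhostuser0 | grep mtu
   echo; echo
   i=$((i + 10))
   if [ "$i" -gt 9000 ]; then break; fi
done | tee -a /tmp/test-results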
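
After a run like this, it can help to know which MTU request immediately preceded the crash. The sketch below scans syslog-style input for the most recent "mtu_request=N" seen before each ovs-vswitchd segfault line. It assumes the log format shown in step (5); the function name last_mtu_before_segv is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the last requested MTU logged before each
# ovs-vswitchd segfault. Reads the named files, or stdin if none are given.
last_mtu_before_segv() {
    awk '
        # Remember the most recent mtu_request value ("mtu_request=" is 12 chars).
        match($0, /mtu_request=[0-9]+/) {
            last = substr($0, RSTART + 12, RLENGTH - 12)
        }
        # On a kernel segfault line for ovs-vswitchd, report that value.
        /segfault at/ && /ovs-vswitchd/ { print last }
    ' "$@"
}
```

For example, last_mtu_before_segv /var/log/messages run against the log in step (5) would print 1500, the request logged just before the segfault.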

Comment 50 liting 2017-08-25 09:10:07 UTC
I have verified that the fix passes with openvswitch-2.6.1-13.git20161206.el7ost.x86_64.rpm.

Comment 56 errata-xmlrpc 2017-09-06 16:59:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2648