Bug 1644982

Summary: Issue with LACP -- the link down is not detected and the path does not fail over as expected

Product: Red Hat OpenStack              Reporter: liuwei <wliu>
Component: openvswitch                  Assignee: Aaron Conole <aconole>
Status: CLOSED DUPLICATE                QA Contact: Roee Agiman <ragiman>
Severity: high                          Docs Contact:
Priority: high
Version: 13.0 (Queens)                  CC: aconole, apevec, bhaley, chrisw, rhos-maint, rkhan, wliu
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:                       Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:                               Environment:
Last Closed: 2018-11-14 13:49:45 UTC    Type: Bug
Regression: ---                         Mount Type: ---
Documentation: ---                      CRM:
Verified Versions:                      Category: ---
oVirt Team: ---                         RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---                    Target Upstream Version:
Embargoed:

Description liuwei 2018-11-01 06:48:04 UTC
Description of problem:
Issue with LACP: it does not detect the link going down, and the path does not fail over as expected.

The LACP status is shown below:

 ovs-appctl lacp/show
---- bond_dpdk ----
        status: active negotiated
        sys_id: a0:8c:f8:89:ca:90
        sys_priority: 65534
        aggregation key: 1
        lacp_time: slow

slave: dpdk0: current attached
        port_id: 2
        port_priority: 65535
        may_enable: true

        actor sys_id: a0:8c:f8:89:ca:90
        actor sys_priority: 65534
        actor port_id: 2
        actor port_priority: 65535
        actor key: 1
        actor state: activity aggregation synchronized collecting distributing

        partner sys_id: 80:b5:75:86:30:11
        partner sys_priority: 32768
        partner port_id: 2
        partner port_priority: 32768
        partner key: 5953
        partner state: activity timeout aggregation synchronized collecting distributing

slave: dpdk1: current attached
        port_id: 1
        port_priority: 65535
        may_enable: true

        actor sys_id: a0:8c:f8:89:ca:90
        actor sys_priority: 65534
        actor port_id: 1
        actor port_priority: 65535
        actor key: 1
        actor state: activity aggregation synchronized collecting distributing

        partner sys_id: 80:b5:75:86:30:11
        partner sys_priority: 32768
        partner port_id: 1
        partner port_priority: 32768
        partner key: 5953
        partner state: activity timeout aggregation synchronized collecting distributing
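
For reference, output in the format above can be checked programmatically, e.g. to verify the may_enable flag of each slave. A minimal sketch; the parse_lacp_show helper and the truncated sample text are illustrative, not part of this report:

```python
import re

def parse_lacp_show(text):
    """Parse `ovs-appctl lacp/show`-style output into {slave: {key: value}}.

    Minimal sketch based on the output format quoted in this report;
    real output may vary between Open vSwitch versions.
    """
    slaves = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"slave: (\S+): (.*)", line)
        if m:
            current = m.group(1)
            slaves[current] = {"status": m.group(2)}
        elif current and ": " in line:
            key, _, value = line.partition(": ")
            slaves[current][key] = value
    return slaves

# Truncated sample in the same format as the report's lacp/show output.
sample = """\
slave: dpdk0: current attached
        may_enable: true
        actor state: activity aggregation synchronized collecting distributing
slave: dpdk1: current attached
        may_enable: true
"""

info = parse_lacp_show(sample)
print(info["dpdk0"]["may_enable"])  # -> true
```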

Version-Release number of selected component (if applicable):

 cat installed-rpms | egrep 'kernel|openvswitch|dpdk'
dpdk-17.11-7.el7.x86_64                                     Thu Jun 21 16:29:26 2018
erlang-kernel-18.3.4.8-1.el7ost.x86_64                      Thu Jun 21 16:21:12 2018
kernel-3.10.0-862.3.3.el7.x86_64                            Thu Jun 21 15:42:09 2018
kernel-tools-3.10.0-862.3.3.el7.x86_64                      Thu Jun 21 15:42:42 2018
kernel-tools-libs-3.10.0-862.3.3.el7.x86_64                 Thu Jun 21 15:38:30 2018
openstack-neutron-openvswitch-12.0.2-0.20180421011362.0ec54fd.el7ost.noarch Thu Jun 21 16:24:25 2018
openvswitch-2.9.0-19.el7fdp.1.x86_64                        Thu Jun 21 16:17:13 2018
openvswitch-ovn-central-2.9.0-19.el7fdp.1.x86_64            Thu Jun 21 16:29:22 2018
openvswitch-ovn-common-2.9.0-19.el7fdp.1.x86_64             Thu Jun 21 16:18:25 2018
openvswitch-ovn-host-2.9.0-19.el7fdp.1.x86_64               Thu Jun 21 16:29:22 2018
python-openvswitch-2.9.0-19.el7fdp.1.noarch                 Thu Jun 21 16:17:15 2018


How reproducible:

100% reproducible.

Steps to Reproduce:


The networking is as follows:
(1) The two NICs on com6 and com7 form a DPDK bond. Detailed configuration: bond_mode=balance-tcp lacp=active other_config:lacp-time=slow other-config:lacp-fallback-ab=true other_config:bond-rebalance-interval=1000 other_config:bond-detect-mode=miimon other_config:bond-miimon-interval=100
(2) An eth-trunk is configured on the switch (SW) with LACP support.
(3) There are two links between each server (com6 and com7) and the SW.
        ------------------------
       |           SW           |
        ---|--|---------|--|----
        ---|--|--    ---|--|--
       |  com6   |  |  com7   |
        ---------    ---------
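
The bond described in step (1) would typically be created with an ovs-vsctl command along these lines. This is a sketch, not taken from the report; the bridge, bond, and port names come from the attached ifcfg-bond_dpdk file, and the option string is copied verbatim from step (1):

```shell
# Sketch: create the DPDK bond with the options quoted in step (1).
# Note: the quoted configuration spells one key as "other-config" (hyphen);
# it is reproduced verbatim here, but ovs-vsctl expects the column name
# "other_config", so that setting may be rejected or silently ineffective.
ovs-vsctl add-bond br-link0 bond_dpdk dpdk0 dpdk1 \
    bond_mode=balance-tcp lacp=active \
    other_config:lacp-time=slow \
    other-config:lacp-fallback-ab=true \
    other_config:bond-rebalance-interval=1000 \
    other_config:bond-detect-mode=miimon \
    other_config:bond-miimon-interval=100 \
    -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:3e:00.0 \
    -- set Interface dpdk1 type=dpdk options:dpdk-devargs=0000:3e:00.1
```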

(1) Create a virtual machine on each of com6 and com7.
(2) Configure eth0 of the two VMs as follows:
ifconfig eth0 7.7.7.1/24 up
ifconfig eth0 7.7.7.2/24 up
(3) From one VM, run a long ping to the other VM:
ping -c 10000 -s 60000 7.7.7.2
(4) Observe the ping traffic passing over one of the links between the servers (com6 and com7) and the SW. You can see which link carries the traffic with the ovs-appctl bond/show command.
(5) Cut off the link carrying the traffic.
(6) Observe the ping: no matter how long you wait, it stays blocked.
(7) Only if you stop the current ping (and do nothing else in the meantime) does the path recover, after about 10 s.
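
Worth noting as background (an observation, not from the report): with other_config:lacp-time=slow, LACPDUs are exchanged every 30 s and a partner is only timed out after three missed PDUs (roughly 90 s), so if the link-down event is not detected at the NIC/PMD level, failover driven purely by the LACP timeout can be very slow. A diagnostic sketch, assuming the bond name from this report:

```shell
# Diagnostic sketch (assumption, not a confirmed fix): switch the bond to
# fast LACP timers (1 s PDU interval, ~3 s timeout) to see whether the
# observed failover latency is governed by the LACP timeout.
ovs-vsctl set port bond_dpdk other_config:lacp-time=fast
```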

Actual results:

LACP does not detect the link going down, and the path does not fail over as expected.

Expected results:

The link down is detected and failover happens as expected.

Additional info:

The OVS bonding configuration is shown below:


cat etc/sysconfig/network-scripts/ifcfg-bond_dpdk 
# This file is autogenerated by os-net-config
DEVICE=bond_dpdk
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSDPDKBond
OVS_BRIDGE=br-link0
BOND_IFACES="dpdk0 dpdk1"
RX_QUEUE=1
OVS_OPTIONS="bond_mode=balance-tcp lacp=active other_config:lacp-time=slow other-config:lacp-fallback-ab=true other_config:bond-rebalance-interval=1000 other_config:bond-detect-mode=miimon other_config:bond-miimon-interval=100"
MTU=9000
OVS_EXTRA="set Interface dpdk0 options:dpdk-devargs=0000:3e:00.0 -- set Interface dpdk1 options:dpdk-devargs=0000:3e:00.1 -- set Interface dpdk0 mtu_request=$MTU -- set Interface dpdk1 mtu_request=$MTU -- set Interface dpdk0 options:n_rxq=$RX_QUEUE -- set Interface dpdk1 options:n_rxq=$RX_QUEUE"

The collected logs are below:

Browse the files here: http://collab-shell.usersys.redhat.com/02242686/

Comment 4 Aaron Conole 2018-11-05 19:27:34 UTC
How did you "cut off the traffic"?  Just want to know the exact method.

Comment 5 liuwei 2018-11-06 03:30:07 UTC
Dear Aaron Conole:

Thanks for your attention to this ticket. The following is from the customer side:

During failover testing, he shut down the switch port that was carrying the network traffic.

I hope that answers your question.


Wei

Comment 6 Aaron Conole 2018-11-06 14:32:47 UTC
Is this the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=1644383 ?

Comment 9 Aaron Conole 2018-11-14 13:49:45 UTC
Thanks.  I'll close this as a duplicate so that we don't duplicate efforts.

*** This bug has been marked as a duplicate of bug 1644383 ***