Bug 1480376 - LACP packets are sent on one member link only
Summary: LACP packets are sent on one member link only
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Emilien Macchi
QA Contact: Gurenko Alex
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-10 21:10 UTC by bigswitch
Modified: 2017-09-19 19:17 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-19 19:17:03 UTC
Target Upstream Version:
Embargoed:


Attachments
sos report from controller node with lacp bonding on p1p1 and p1p2 (11.82 MB, application/x-xz)
2017-08-10 21:36 UTC, bigswitch

Description bigswitch 2017-08-10 21:10:01 UTC
Description of problem:
Seen with the latest RHOSP 10 on RHEL 7.4. LACP bonding is configured in the heat templates and applied. However, LACP PDUs are transmitted on only one bond member, not the other, which causes that link to always fail.
This is not seen with RHOSP 10 on RHEL 7.3.

Version-Release number of selected component (if applicable):
RHOSP-10
RHEL-7.4

How reproducible:
Always
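
A quick way to confirm which member is actually transmitting LACPDUs (not part of the original report, just a generic check) is to capture the IEEE Slow Protocols ethertype on each bond member, for example:

# LACPDUs are carried over the Slow Protocols ethertype 0x8809
tcpdump -nei p1p1 ether proto 0x8809
tcpdump -nei p1p2 ether proto 0x8809

On the failing member, only the switch's LACPDUs (or nothing at all) would be expected to appear.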

[root@overcloud-controller-0 heat-admin]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 50
Up Delay (ms): 1000
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 90:e2:ba:6e:ff:c0
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 1
        Actor Key: 13
        Partner Key: 4099
        Partner Mac Address: 5c:16:c7:02:37:02

Slave Interface: p1p1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 90:e2:ba:6e:ff:c0
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 90:e2:ba:6e:ff:c0
    port key: 13
    port priority: 255
    port number: 1
    port state: 63
details partner lacp pdu:
    system priority: 32768
    system mac address: 5c:16:c7:02:37:02
    oper key: 4099
    port priority: 32768
    port number: 1
    port state: 63

Slave Interface: p1p2
MII Status: down
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 90:e2:ba:6e:ff:c1
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
    system priority: 65535
    system mac address: 90:e2:ba:6e:ff:c0
    port key: 0
    port priority: 255
    port number: 2
    port state: 71
details partner lacp pdu:
    system priority: 65535
    system mac address: 00:00:00:00:00:00
    oper key: 1
    port priority: 255
    port number: 1
    port state: 1
[root@overcloud-controller-0 heat-admin]#

[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-bond1
# This file is autogenerated by os-net-config
DEVICE=bond1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ex
BONDING_OPTS="mode=4 lacp_rate=1 updelay=1000 miimon=50"
[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-p1p1
# This file is autogenerated by os-net-config
DEVICE=p1p1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond1
SLAVE=yes
BOOTPROTO=none
[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-p1p2
# This file is autogenerated by os-net-config
DEVICE=p1p2
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond1
SLAVE=yes
BOOTPROTO=none
[root@overcloud-controller-0 heat-admin]#

Comment 1 bigswitch 2017-08-10 21:36:40 UTC
Created attachment 1311910 [details]
sos report from controller node with lacp bonding on p1p1 and p1p2

This sosreport was collected while the link p1p2 was initially down, then brought up after the overcloud deployment completed. The issue of LACP packets not being transmitted is still seen after the link comes up.

Comment 2 Dan Sneddon 2017-08-24 20:31:55 UTC
There is a mismatch between the bonding options and the bond type. The options "mode=4 lacp_rate=1 updelay=1000 miimon=50" apply only to Linux bonds, but the bond is configured as an OVS bond.

Either the NIC configs should use "type: linux_bond", or OVS bonding options should be used instead. LACP is configured on an OVS bond with the appropriate options, for instance:
"bond_mode=balance-tcp lacp=active other_config:lacp-time=fast other_config:lacp-fallback-ab=true"

The options above use LACP for load balancing (balance-tcp), use active LACP with fast timing, and fall back to active-backup if LACP cannot be negotiated with the switch. Alternatively, simply using "type: linux_bond" for the bond should work with the existing Linux bonding options. Rough sketches of both alternatives follow below.
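
For illustration only (these snippets are not from the reporter's templates; the interface names p1p1/p1p2 and the bridge br-ex are assumed from the ifcfg files above), the two alternatives would look roughly like this in an os-net-config NIC config.

Linux bond keeping the existing 802.3ad options:

# sketch only: linux_bond entry, names assumed from the configs above
- type: linux_bond
  name: bond1
  bonding_options: "mode=4 lacp_rate=1 updelay=1000 miimon=50"
  members:
    - type: interface
      name: p1p1
      primary: true
    - type: interface
      name: p1p2

OVS bond with the LACP options from this comment:

# sketch only: ovs_bond on br-ex with LACP negotiated by OVS
- type: ovs_bridge
  name: br-ex
  members:
    - type: ovs_bond
      name: bond1
      ovs_options: "bond_mode=balance-tcp lacp=active other_config:lacp-time=fast other_config:lacp-fallback-ab=true"
      members:
        - type: interface
          name: p1p1
        - type: interface
          name: p1p2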

Comment 3 Bob Fournier 2017-09-07 17:05:37 UTC
Any update on this? We don't think this is a bug, but we would like to know whether the recommendation has helped.

Comment 4 bigswitch 2017-09-07 18:05:34 UTC
We found an issue where miimon is not able to get the physical link status, so the bonding driver thinks the link is down even though the interface is up. This looks like a kernel bug.

We get around this by avoiding miimon's MII check and getting the carrier status from net_dev instead.

We added the bonding option "use_carrier=1", and so far it has worked well.
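
For reference, a sketch of what the adjusted options might look like (the exact working configuration was not posted in this bug, so this is an assumption based on the ifcfg-bond1 options above):

# use_carrier=1 makes the miimon check read link state from the device carrier flag
# (netif_carrier) instead of MII/ethtool ioctls
BONDING_OPTS="mode=4 lacp_rate=1 updelay=1000 miimon=50 use_carrier=1"

The same option string would go into the bonding_options field of a linux_bond entry in the NIC config templates.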

Comment 5 Bob Fournier 2017-09-07 18:51:32 UTC
OK, thanks. We wanted to make sure you were using the appropriate bond type in the configuration. It would be useful to see the network script that worked.

Btw, not sure if the carrier issue was related to this or not: https://review.openstack.org/#/c/419527/

Comment 6 Bob Fournier 2017-09-19 19:17:03 UTC
I'm closing this for now, as the issue appears to be with the port's physical state and the discovered workaround has taken care of it. There does not seem to be an issue with the handling or configuration of bonding in THT or os-net-config.

Please reopen if it appears that this is an issue in the Director's bonding management.

