Description of problem:

Seen with the latest RHOSP 10 on RHEL 7.4. LACP bonding is configured in the heat templates and applied, but LACP is transmitted on only one bond member, not the other, so that link always fails. This is not seen in RHOSP 10 with RHEL 7.3.

Version-Release number of selected component (if applicable):
RHOSP-10
RHEL-7.4

How reproducible:
Always

[root@overcloud-controller-0 heat-admin]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 50
Up Delay (ms): 1000
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 90:e2:ba:6e:ff:c0
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 1
        Actor Key: 13
        Partner Key: 4099
        Partner Mac Address: 5c:16:c7:02:37:02

Slave Interface: p1p1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 90:e2:ba:6e:ff:c0
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 90:e2:ba:6e:ff:c0
    port key: 13
    port priority: 255
    port number: 1
    port state: 63
details partner lacp pdu:
    system priority: 32768
    system mac address: 5c:16:c7:02:37:02
    oper key: 4099
    port priority: 32768
    port number: 1
    port state: 63

Slave Interface: p1p2
MII Status: down
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 90:e2:ba:6e:ff:c1
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
    system priority: 65535
    system mac address: 90:e2:ba:6e:ff:c0
    port key: 0
    port priority: 255
    port number: 2
    port state: 71
details partner lacp pdu:
    system priority: 65535
    system mac address: 00:00:00:00:00:00
    oper key: 1
    port priority: 255
    port number: 1
    port state: 1

[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-bond1
# This file is autogenerated by os-net-config
DEVICE=bond1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ex
BONDING_OPTS="mode=4 lacp_rate=1 updelay=1000 miimon=50"

[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-p1p1
# This file is autogenerated by os-net-config
DEVICE=p1p1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond1
SLAVE=yes
BOOTPROTO=none

[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-p1p2
# This file is autogenerated by os-net-config
DEVICE=p1p2
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond1
SLAVE=yes
BOOTPROTO=none
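One way to verify which members are actually sending LACP PDUs is to watch for the slow-protocols ethertype (0x8809) on each slave. This is a generic diagnostic, not a command taken from the original report; interface names are from the configs above:

    # Watch for LACP PDUs on each bond member (ethertype 0x8809 = slow protocols)
    tcpdump -eni p1p1 ether proto 0x8809
    tcpdump -eni p1p2 ether proto 0x8809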
Created attachment 1311910 [details]
sos report from controller node with LACP bonding on p1p1 and p1p2

This sosreport was collected while the p1p2 link was initially down, then up after the overcloud deployment finished. The LACP-packets-not-transmitted issue is still seen after the link comes up.
There is a mismatch between the bonding options and the bond type. The options "mode=4 lacp_rate=1 updelay=1000 miimon=50" are only applicable to Linux bonds, but the bond is configured as an OVS bond. Either the NIC configs should use "type: linux_bond", or OVS-style bonding options should be used.

LACP is configured on an OVS bond with the appropriate OVS options, for instance: "bond_mode=balance-tcp lacp=active other-config:lacp-time=fast other_config:lacp-fallback-ab=true". These options use LACP for load balancing (balance-tcp), run LACP in active mode with fast timing, and fall back to active-backup if LACP cannot be negotiated with the switch.

Alternatively, simply switching the bond to "type: linux_bond" should work with the original Linux bonding options quoted above. Both variants are sketched below.
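For illustration, a minimal sketch of the two variants as they might appear in the THT NIC config (os-net-config YAML). The bridge name and member interfaces are assumptions based on the ifcfg files in the description, not taken from the reporter's actual templates:

    # Variant 1: OVS bond with OVS-style LACP options
    - type: ovs_bridge
      name: br-ex
      members:
        - type: ovs_bond
          name: bond1
          ovs_options: "bond_mode=balance-tcp lacp=active other-config:lacp-time=fast other_config:lacp-fallback-ab=true"
          members:
            - type: interface
              name: p1p1
            - type: interface
              name: p1p2

    # Variant 2: Linux bond, which accepts the kernel bonding options
    - type: linux_bond
      name: bond1
      bonding_options: "mode=4 lacp_rate=1 updelay=1000 miimon=50"
      members:
        - type: interface
          name: p1p1
        - type: interface
          name: p1p2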
Any update on this? We don't think this is a bug, but we would like to know whether the recommendation has helped.
We found an issue where miimon is unable to read the physical link status, so the bonding driver thinks the link is down even though the interface is up. This looks like a kernel bug. We worked around it by avoiding the MII polling result and taking the carrier status from the net device instead: we added the bonding option "use_carrier=1", and so far it has worked well.
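For reference, a minimal sketch of what the resulting bond configuration might look like. The actual network-script that worked is not attached, so the file layout is an assumption, modeled on the ifcfg-bond1 from the description with the bond switched to a Linux bond per the earlier recommendation:

    # Illustrative only: Linux bond options with the reporter's workaround added.
    # use_carrier=1 tells the bonding link monitor to read the carrier state
    # from the net device (netif_carrier_ok) rather than via MII/ETHTOOL ioctls.
    DEVICE=bond1
    ONBOOT=yes
    NM_CONTROLLED=no
    BOOTPROTO=none
    BONDING_OPTS="mode=4 lacp_rate=1 updelay=1000 miimon=50 use_carrier=1"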
OK, thanks. We wanted to make sure the appropriate bond type was being used in the configuration. It would be useful to see the network-script that worked. By the way, I am not sure whether the carrier issue is related to this or not: https://review.openstack.org/#/c/419527/
I'm closing this for now as the issue appears to be with the port's physical state and the discovered workaround has taken care of it. There does not seem to be an issue with the handling or configuration of bonding in THT or os-net-config. Please reopen if it appears that this is an issue in the Director's bonding management.