Bug 804876 - Cannot bond Infiniband network interfaces (as VLAN id 0 is treated incorrectly)
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2012-03-19 23:44 EDT by Alexander Murashkin
Modified: 2012-11-13 09:54 EST
Doc Type: Bug Fix
Last Closed: 2012-11-13 09:54:44 EST
Description Alexander Murashkin 2012-03-19 23:44:13 EDT
Description of problem:

Enslaving Infiniband interfaces does not work. An error similar to the one below is printed:

bonding: bond2: Error: cannot enslave VLAN challenged slave ib0 on VLAN enabled bond bond2

Based on the kernel source code, the bonding module assumes that the bond2 interface has VLAN 0 enabled. VLAN ID 0 is a value reserved by 802.1Q to indicate that a frame does not belong to any VLAN, so it has to be treated as a special case in the 8021q module and in other modules that set NETIF_F_HW_VLAN_FILTER.

Specifically, the following happens.

When bond2 is being brought up:

- vlan_device_event() in vlan.c is called with the event NETDEV_UP
- because bonding has the NETIF_F_HW_VLAN_FILTER feature, vlan_device_event() calls the bonding ndo_vlan_rx_add_vid() with vlan_id 0
- bond_vlan_rx_add_vid() in bond_main.c calls bond_add_vlan(), which adds vlan_id 0 to bond->vlan_list

When an Infiniband interface is being enslaved:

- bond_enslave() in bond_main.c sees that bond->vlan_list is not empty (via bond_vlan_used()) and returns the error.
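
The sequence above can be sketched as a small userspace model. This is only an illustration of the control flow described in this report, not kernel code: every model_* name, the MODEL_MAX_VLANS limit and the boolean fields are assumptions made up for the sketch, standing in for the real feature flags and for bond->vlan_list.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace model only: these names mimic, but are not, the kernel's. */
#define MODEL_MAX_VLANS 8

struct model_bond {
	const char *name;
	bool hw_vlan_filter;		/* stands in for NETIF_F_HW_VLAN_FILTER */
	unsigned short vlans[MODEL_MAX_VLANS];
	int nr_vlans;			/* stands in for the length of bond->vlan_list */
};

struct model_slave {
	const char *name;
	bool vlan_challenged;		/* stands in for NETIF_F_VLAN_CHALLENGED */
};

/* Models bond_add_vlan(): every VID, including 0, is recorded. */
void model_add_vid(struct model_bond *bond, unsigned short vid)
{
	if (bond->nr_vlans < MODEL_MAX_VLANS)
		bond->vlans[bond->nr_vlans++] = vid;
}

/* Models the NETDEV_UP branch of vlan_device_event(): VID 0 is added. */
void model_netdev_up(struct model_bond *bond)
{
	if (bond->hw_vlan_filter)
		model_add_vid(bond, 0);
}

/* Models bond_vlan_used(): true as soon as the list is non-empty. */
bool model_vlan_used(const struct model_bond *bond)
{
	return bond->nr_vlans > 0;
}

/* Models the VLAN-challenged check in bond_enslave(). */
int model_enslave(const struct model_bond *bond, const struct model_slave *slave)
{
	if (slave->vlan_challenged && model_vlan_used(bond)) {
		fprintf(stderr,
			"%s: Error: cannot enslave VLAN challenged slave %s on VLAN enabled bond %s\n",
			bond->name, slave->name, bond->name);
		return -EPERM;
	}
	return 0;
}
```

In this model, enslaving succeeds on a freshly created bond and fails with -EPERM once model_netdev_up() has recorded VID 0, even though no real VLAN was ever configured.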

See more details below.

Version-Release number of selected component (if applicable):

kernel-3.2.7-1.fc16.x86_64

How reproducible:

Steps to Reproduce:
1. Configure bond2 with an Infiniband slave ib0 (or some other device that has the NETIF_F_VLAN_CHALLENGED flag)
2. Run ifup bond2
3. Run echo '+ib0' > /sys/class/net/bond2/bonding/slaves and observe that it fails
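
For anyone who wants to check the failure from a program rather than from the shell, the write in step 3 could be done with a small C helper like the one below. This is a sketch under assumptions: bond_slaves_write() is a hypothetical name introduced here, and the helper does nothing more than open the given file, write the operation string and return a negative errno value on failure.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a slave operation such as "+ib0" or "-ib0"
 * to a bonding "slaves" file. Returns 0 on success or a negative errno
 * value on failure.
 */
int bond_slaves_write(const char *slaves_path, const char *op)
{
	int fd = open(slaves_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	size_t len = strlen(op);
	ssize_t written = write(fd, op, len);
	int err = 0;

	if (written < 0)
		err = -errno;		/* the write itself was rejected */
	else if ((size_t)written != len)
		err = -EIO;		/* short write: treat as an I/O error */

	if (close(fd) < 0 && err == 0)
		err = -errno;
	return err;
}
```

Calling bond_slaves_write("/sys/class/net/bond2/bonding/slaves", "+ib0") on an affected system would be expected to return -EPERM, which corresponds to the "Operation not permitted" message reported under Actual results.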
  
Actual results:

echo: write error: Operation not permitted
bonding: bond2: Error: cannot enslave VLAN challenged slave ib0 on VLAN enabled bond bond2

Expected results:

Enslaving works, and the bonding module does not print any errors.

Additional info:

/* vlan.c: on NETDEV_UP, VLAN 0 is added to any device with a HW VLAN filter */
static int vlan_device_event(...)
{
	...
	if ((event == NETDEV_UP) &&
	    (dev->features & NETIF_F_HW_VLAN_FILTER) &&
	    dev->netdev_ops->ndo_vlan_rx_add_vid) {
		pr_info("adding VLAN 0 to HW filter on device %s\n",
			dev->name);
		dev->netdev_ops->ndo_vlan_rx_add_vid(dev, 0);
	}
	...
}

/* bond_main.c: the bond's ndo_vlan_rx_add_vid handler */
static void bond_vlan_rx_add_vid(struct net_device *bond_dev, uint16_t vid)
{
	...
	res = bond_add_vlan(bond, vid);
	...
}

/* bond_main.c: vlan_id 0 ends up on bond->vlan_list like any other ID */
static int bond_add_vlan(struct bonding *bond, unsigned short vlan_id)
{
	...
	INIT_LIST_HEAD(&vlan->vlan_list);
	vlan->vlan_id = vlan_id;
	list_add_tail(&vlan->vlan_list, &bond->vlan_list);
	...
	pr_debug("added VLAN ID %d on bond %s\n", vlan_id, bond->dev->name);
	...
}

/* bond_main.c: a non-empty vlan_list blocks VLAN-challenged slaves */
int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
{
	...
	if (slave_dev->features & NETIF_F_VLAN_CHALLENGED) {
		pr_debug("%s: NETIF_F_VLAN_CHALLENGED\n", slave_dev->name);
		if (bond_vlan_used(bond)) {
			pr_err("%s: Error: cannot enslave VLAN challenged slave %s on VLAN enabled bond %s\n",
			       bond_dev->name, slave_dev->name, bond_dev->name);
			return -EPERM;
	...
}
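
The description argues that VLAN ID 0 has to be treated as a special case. One way such a check could look is sketched below as a standalone function. It is purely hypothetical: model_vlan_used_ignoring_vid0() is an invented name, it works on a plain array instead of bond->vlan_list, and it is not claimed to be how the kernel handles this or how the problem was eventually fixed.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not a kernel function: report whether any real
 * VLAN is configured, ignoring VID 0 (reserved by 802.1Q to mean
 * "no VLAN").
 */
bool model_vlan_used_ignoring_vid0(const unsigned short *vids, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (vids[i] != 0)
			return true;
	}
	return false;
}
```

With a check of this shape, a list that holds only VID 0 would count as "no VLANs in use", so a VLAN challenged slave would not be rejected, while a list containing any non-zero VID would still block it.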

Mar 19 21:22:56 raptor kernel: [528511.145669] 8021q: adding VLAN 0 to HW filter on device bond2
Mar 19 21:22:56 raptor kernel: [528511.145672] bonding: bond: bond2, vlan id 0
Mar 19 21:22:56 raptor kernel: [528511.145675] bonding: added VLAN ID 0 on bond bond2
Mar 19 21:22:56 raptor kernel: [528511.145677] bonding: event_dev: bond2, event: 1
Mar 19 21:22:56 raptor kernel: [528511.145679] bonding: IFF_MASTER
Mar 19 21:22:56 raptor kernel: [528511.201175] ib0: enabling connected mode will cause multicast packet drops
Mar 19 21:22:56 raptor kernel: [528511.203841] bonding: bond2: Adding slave ib0.
Mar 19 21:22:56 raptor kernel: [528511.203846] bonding: ib0: NETIF_F_VLAN_CHALLENGED
Mar 19 21:22:56 raptor kernel: [528511.203848] bonding: bond2: Error: cannot enslave VLAN challenged slave ib0 on VLAN enabled bond bond2
Mar 19 21:22:57 raptor kernel: [528511.249553] ib1: enabling connected mode will cause multicast packet drops
Mar 19 21:22:57 raptor kernel: [528511.252253] bonding: bond2: Adding slave ib1.
Mar 19 21:22:57 raptor kernel: [528511.252257] bonding: ib1: NETIF_F_VLAN_CHALLENGED
Mar 19 21:22:57 raptor kernel: [528511.252260] bonding: bond2: Error: cannot enslave VLAN challenged slave ib1 on VLAN enabled bond bond2
Comment 1 Dave Jones 2012-03-22 12:51:32 EDT
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.
Comment 2 Dave Jones 2012-03-22 12:55:42 EDT
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.
Comment 3 Dave Jones 2012-03-22 13:06:30 EDT
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.
Comment 4 Alexander Murashkin 2012-03-22 22:30:47 EDT
The same problem occurs. Bonding works right after boot but stops working if the bonding interface is recycled (down/up). It works initially because the bonding module is loaded, and the slaves are enslaved, before the 8021q module is loaded. After 8021q is loaded, enslaving stops working.

Here are the relevant lines from syslog:

---- booting -------------

Mar 22 21:14:24 raptor kernel: [   83.070301] bonding: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Mar 22 21:14:24 raptor kernel: [   83.071132] bonding: bond2 is being created...
Mar 22 21:14:24 raptor kernel: [   83.132692] bonding: bond2: Setting MII monitoring interval to 100.
Mar 22 21:14:24 raptor kernel: [   83.132786] bonding: bond2: setting mode to active-backup (1).
Mar 22 21:14:24 raptor kernel: [   83.134247] ADDRCONF(NETDEV_UP): bond2: link is not ready
Mar 22 21:14:24 raptor kernel: [   83.173922] ib0: enabling connected mode will cause multicast packet drops
Mar 22 21:14:24 raptor kernel: [   83.176488] ib0: mtu > 2044 will cause multicast packet drops.
Mar 22 21:14:24 raptor kernel: [   83.178063] bonding: bond2: Adding slave ib0.
Mar 22 21:14:24 raptor kernel: [   83.178065] bonding: bond2: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will be blocked as long as ib0 is part of bond bond2
Mar 22 21:14:24 raptor kernel: [   83.178151] bonding: bond2: Warning: The first slave device specified does not support setting the MAC address. Setting fail_over_mac to active.
Mar 22 21:14:24 raptor kernel: [   83.179728] bonding: bond2: enslaving ib0 as a backup interface with a down link.
Mar 22 21:14:24 raptor kernel: [   83.227691] ib1: enabling connected mode will cause multicast packet drops
Mar 22 21:14:24 raptor kernel: [   83.229265] ib1: mtu > 2044 will cause multicast packet drops.
Mar 22 21:14:24 raptor kernel: [   83.231158] bonding: bond2: Adding slave ib1.
Mar 22 21:14:24 raptor kernel: [   83.231160] bonding: bond2: Warning: enslaved VLAN challenged slave ib1. Adding VLANs will be blocked as long as ib1 is part of bond bond2
Mar 22 21:14:24 raptor kernel: [   83.232751] bonding: bond2: enslaving ib1 as a backup interface with a down link.
Mar 22 21:14:24 raptor kernel: [   83.233020] bonding: bond2: link status definitely up for interface ib0, 4294967295 Mbps full duplex.
Mar 22 21:14:24 raptor kernel: [   83.233023] bonding: bond2: making interface ib0 the new active one.
Mar 22 21:14:24 raptor kernel: [   83.233044] bonding: bond2: first active interface up!
Mar 22 21:14:24 raptor kernel: [   83.233682] ADDRCONF(NETDEV_CHANGE): bond2: link becomes ready

....

Mar 22 21:14:33 raptor kernel: [   92.201617] 8021q: 802.1Q VLAN Support v1.8
Mar 22 21:14:33 raptor kernel: [   92.201629] 8021q: adding VLAN 0 to HW filter on device bond2

---- ifdown bond2 ------------------

Mar 22 21:17:12 raptor kernel: [  251.561320] bonding: bond2: Removing slave ib0.
Mar 22 21:17:12 raptor kernel: [  251.561338] bonding: bond2: releasing active interface ib0
Mar 22 21:17:12 raptor kernel: [  251.712468] bonding: bond2: Removing slave ib1.
Mar 22 21:17:12 raptor kernel: [  251.712486] bonding: bond2: releasing backup interface ib1
Mar 22 21:17:12 raptor kernel: [  251.712491] bonding: bond2: Warning: clearing HW address of bond2 while it still has VLANs.
Mar 22 21:17:12 raptor kernel: [  251.712494] bonding: bond2: When re-adding slaves, make sure the bond's HW address matches its VLANs'.

---- ifup bond2 --------------------

Mar 22 21:17:17 raptor kernel: [  256.443262] bonding: bond2: Setting MII monitoring interval to 100.
Mar 22 21:17:17 raptor kernel: [  256.443400] bonding: bond2: setting mode to active-backup (1).
Mar 22 21:17:17 raptor kernel: [  256.445385] ADDRCONF(NETDEV_UP): bond2: link is not ready
Mar 22 21:17:17 raptor kernel: [  256.445390] 8021q: adding VLAN 0 to HW filter on device bond2
Mar 22 21:17:17 raptor kernel: [  256.491592] ib0: enabling connected mode will cause multicast packet drops
Mar 22 21:17:17 raptor kernel: [  256.494235] bonding: bond2: Adding slave ib0.
Mar 22 21:17:17 raptor kernel: [  256.494238] bonding: bond2: Error: cannot enslave VLAN challenged slave ib0 on VLAN enabled bond bond2
Mar 22 21:17:17 raptor kernel: [  256.538680] ib1: enabling connected mode will cause multicast packet drops
Mar 22 21:17:17 raptor kernel: [  256.541346] bonding: bond2: Adding slave ib1.
Mar 22 21:17:17 raptor kernel: [  256.541350] bonding: bond2: Error: cannot enslave VLAN challenged slave ib1 on VLAN enabled bond bond2
Comment 5 Jon Stanley 2012-10-10 23:43:41 EDT
This is caused by upstream commit cc0e40700656b09d93b062ef6c818aa45429d09a and is still present in 3.6.1 upstream.

I haven't had a chance to look at the affected code to see if there's something obvious, but that's the commit that it bisects to.
Comment 6 Jon Stanley 2012-10-11 12:43:33 EDT
moving to rawhide since this is in 3.6 upstream.
Comment 7 Jon Stanley 2012-10-18 09:22:42 EDT
FYI, fix pending upstream.

http://patchwork.ozlabs.org/patch/191363/
http://patchwork.ozlabs.org/patch/192020/
Comment 8 Josh Boyer 2012-11-13 09:54:44 EST
This is fixed in the rawhide 3.7-rcX kernels.
