Summary: VLAN configured on top of a bonded interface (active-backup) does not fail over
Product: Red Hat Enterprise Linux 6
Reporter: Neal Kim <nkim>
Component: kernel
Assignee: Neil Horman <nhorman>
Status: CLOSED ERRATA
QA Contact: Liang Zheng <lzheng>
Version: 6.3
CC: ajb, bilias, cww, david, dhoward, fhrbata, gdurandv, gouyang, jcpunk, john.ronciak, kzhang, leiwang, lzheng, mgiles, mishu, ngalvin, nhorman, redhat-bugzilla, rik.theys, sforsber, sputhenp, toracat, vcojot, zhchen
Fixed In Version: kernel-2.6.32-294.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Last Closed: 2013-02-21 06:42:12 UTC
Type: Bug
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Description Neal Kim 2012-07-20 18:53:27 UTC
Description of problem:
When the bonding interface (in active-backup mode) fails over, the VLANs on top of it do not fail over as well.

Version-Release number of selected component (if applicable):
kernel-2.6.32-279.2.1.el6

How reproducible:
Always.

Steps to Reproduce:
* Configure a bonded interface, in active-backup bonding mode, with 2 Ethernet devices.
* Configure a VLAN on top of the bonded interface. Check that we can communicate with other devices on that VLAN.
* On the switch, disable the port that *either* the active *or* the standby Ethernet device is connected to.
* Verify that traffic on the bonded interface still works, i.e. if we disabled the active device, the bond has failed over.
* Observe that we can no longer communicate on the VLAN.
* Observe that "cat /sys/class/net/bond1.3091/operstate" returns "lowerlayerdown".

Actual results:
The VLAN does not fail over as expected.

Expected results:
The VLAN fails over successfully.

Additional info:
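For reference, the reproduction above can be sketched with iproute2 commands. This is a minimal sketch, not the reporter's exact setup: the slave names eth0/eth1 are assumptions, while bond1 and VLAN ID 3091 are taken from the operstate path in the report. The commands require root on a disposable test machine.

```shell
# Create an active-backup bond from two slaves (slave names are assumptions)
modprobe bonding
ip link add bond1 type bond mode active-backup miimon 100
ip link set eth0 down && ip link set eth0 master bond1
ip link set eth1 down && ip link set eth1 master bond1
ip link set bond1 up

# Add VLAN 3091 on top of the bond (ID taken from the report)
ip link add link bond1 name bond1.3091 type vlan id 3091
ip link set bond1.3091 up

# After disabling the switch port of either slave, the bond itself keeps
# passing traffic, but on an affected kernel the VLAN's operstate is wrong:
cat /sys/class/net/bond1.3091/operstate   # buggy kernel: "lowerlayerdown"
```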
Comment 4 Neil Horman 2012-07-20 20:03:34 UTC
Created attachment 599456 [details]

[PATCH] vlan: filter device events on bonds

Since bond masters and slaves now have separate vlan groups, the vlan_device_event handler has to be taught to ignore network events from slave devices when they are attached to a bond master. We do this by looking up the network device of a given vid on both the slave and its master. If they match, then we're processing an event for a physical device that we don't really care about (since the master's events are really what we're interested in). This patch adds that comparison, and allows us to filter those slave events that the vlan code should ignore.

---
 net/8021q/vlan.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 64 insertions(+), 0 deletions(-)
Comment 5 Neil Horman 2012-07-20 20:04:26 UTC
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4629537

Brew build link for you. Please test and report as to whether or not this corrects the reported problem.
Comment 6 Neal Kim 2012-07-21 06:06:06 UTC
Good news! Initial test results are looking good. Failing one interface results in the VLAN *not* going down. Cheers,
Comment 7 Neal Kim 2012-07-21 06:35:45 UTC
I can confirm the same on my virtual setup as well. After disconnecting one of the virtual interfaces, the operstate still reports "up":

[root@rhel63test ~]# ifconfig
bond0     Link encap:Ethernet  HWaddr 00:0C:29:8B:33:56
          inet addr:192.168.2.200  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8b:3356/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:2424 errors:0 dropped:0 overruns:0 frame:0
          TX packets:782 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:181326 (177.0 KiB)  TX bytes:262852 (256.6 KiB)

bond0.10  Link encap:Ethernet  HWaddr 00:0C:29:8B:33:56
          inet addr:192.168.2.175  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8b:3356/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:4 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:168 (168.0 b)  TX bytes:746 (746.0 b)

eth0      Link encap:Ethernet  HWaddr 00:0C:29:8B:33:56
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:2140 errors:0 dropped:0 overruns:0 frame:0
          TX packets:782 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:164302 (160.4 KiB)  TX bytes:262852 (256.6 KiB)

eth1      Link encap:Ethernet  HWaddr 00:0C:29:8B:33:56
          UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:284 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17024 (16.6 KiB)  TX bytes:0 (0.0 b)

[root@rhel63test ~]# uname -r
2.6.32-287.el6.test.x86_64
[root@rhel63test ~]# cat /sys/class/net/bond0.10/operstate
up

Just in case, I also disconnected *both* virtual interfaces that are part of bond0, and confirmed the bond0.10 operstate to be "lowerlayerdown". I then brought one virtual interface back up, thereby reactivating bond0, and could see the bond0.10 operstate return to "up" as well. Cheers,
Comment 8 Neil Horman 2012-07-21 11:06:02 UTC
OK, that is good news. When Bytemobile confirms the same, I'll post the patch. I recommend that you, Neal, flag this as a z-stream candidate as well.
Comment 9 Neil Horman 2012-07-21 19:12:50 UTC
Neal, quick note: please make sure to test the non-bonded case, i.e. in addition to adding a vlan to a bonded interface, also test the case in which you add a vlan to a single physical interface. Please make sure that, when the physical interface is taken down, the operstate of the vlan transitions to lowerlayerdown. I want to be sure this doesn't create any new regressions.
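The non-regression case Neil describes can be sketched as follows (a sketch only: the names eth1/eth1.20 match the later test report, but any spare NIC would do, and the commands need root):

```shell
# VLAN directly on a physical NIC, no bond involved (names are assumptions)
ip link add link eth1 name eth1.20 type vlan id 20
ip link set eth1.20 up

# With the cable pulled (or the switch port disabled), the patched kernel
# must STILL propagate the loss of link to the VLAN device:
cat /sys/class/net/eth1.20/operstate   # expected: "lowerlayerdown"

# After reconnecting, it should return to "up":
cat /sys/class/net/eth1.20/operstate   # expected: "up"
```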
Comment 10 Neal Kim 2012-07-21 19:20:54 UTC
No problem Neil, that should be easy enough to test.
Comment 11 Neal Kim 2012-07-21 19:53:44 UTC
So far so good. I configured a VLAN interface (eth1.20) and verified the link status and VLAN operstate with eth1 in both the up and down state.

eth1      Link encap:Ethernet  HWaddr 00:0C:29:8B:33:60
          inet addr:192.168.2.223  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8b:3360/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2449 errors:0 dropped:0 overruns:0 frame:0
          TX packets:23 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:149504 (146.0 KiB)  TX bytes:1742 (1.7 KiB)

eth1.20   Link encap:Ethernet  HWaddr 00:0C:29:8B:33:60
          inet addr:192.168.2.180  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8b:3360/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:720 (720.0 b)

[root@rhel63test ~]# ethtool eth1 | grep -i detected
Link detected: yes
[root@rhel63test ~]# ethtool eth1.20 | grep -i detected
Link detected: yes
[root@rhel63test ~]# cat /sys/class/net/eth1.20/operstate
up

+---------------------+
| Simulate Cable Pull |
+---------------------+

[root@rhel63test ~]# ethtool eth1 | grep -i detected
Link detected: no
[root@rhel63test ~]# ethtool eth1.20 | grep -i detected
Link detected: no
[root@rhel63test ~]# cat /sys/class/net/eth1.20/operstate
lowerlayerdown

I then reconnected the interfaces and the eth1.20 operstate reported as "up" (as expected). Nothing out of the ordinary recorded in dmesg either.
Comment 12 Neil Horman 2012-07-21 23:21:26 UTC
Excellent, thank you. Unless you object, I'll post this for review tomorrow (yes, Sunday), so we can get ACKs Monday. I suggest you nominate this for z-stream, so we can get them a z-stream kernel ASAP.
Comment 13 RHEL Program Management 2012-07-22 13:40:03 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
Comment 15 Liang Zheng 2012-07-24 02:46:43 UTC
Hi Neil, I have a question about the failover event. What's the difference between a cable pull and shutting down the interface on the switch to simulate a failover event? Can I just shut down the interface on the switch to simulate the failover events? Thank you. Liang Zheng.
Comment 16 Neil Horman 2012-07-24 12:48:46 UTC
The real answer to that question often lies in the driver details. For the purposes of this test I think the differences are largely irrelevant, but generally speaking, running ifdown will clear the IFF_UP flag from the interface before sending a carrier-off linkwatch event. Just pulling the cable will only send the linkwatch event, without clearing the IFF_UP flag. Listeners for the event may behave differently based on those differences.
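The two signals Neil distinguishes, the administrative IFF_UP flag and the physical carrier, can both be observed from userspace. A small sketch (the interface name eth0 is an assumption; reading these files needs no special privileges, but changing link state needs root):

```shell
# "ip link show" prints the flag set: UP in the angle brackets is IFF_UP
# (administrative), while LOWER_UP reflects carrier on the physical layer.
ip link show eth0

# The carrier file gives the same physical-link bit directly:
cat /sys/class/net/eth0/carrier   # 1 = link detected, 0 = no carrier

# A cable pull clears only carrier/LOWER_UP; IFF_UP stays set.
# An administrative down clears IFF_UP as well:
ip link set eth0 down             # equivalent in effect to ifdown for this flag
```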
Comment 17 Marcelo Giles 2012-07-24 15:17:05 UTC
(In reply to comment #12) I will test the patched kernel in this environment, which has 2 RHEL 6.3 KVM hosts using nic+bond+vlan+bridge, and let you know if it fixes the issues that we have observed. As a side note, we also have a RHEV 3 environment with the same network setup, and RHEV-M fails to create the bonds using the vlan interfaces on RHEV-H 6.3 hypervisors. It works fine with RHEV-H 6.2 hypervisors.
Comment 18 Kapetanakis Giannis 2012-07-24 16:07:24 UTC
I've just tested the patch https://bugzilla.redhat.com/attachment.cgi?id=599456&action=diff on top of 2.6.32-279.2.1 and it works fine. My setup is nics->bond->vlans->bridges, and I had the same problem after applying kernel 2.6.32-279.2.1. I've tested both ifup/ifdown as well as port disable on the switch. regards, Giannis
Comment 20 Marcelo Giles 2012-07-26 12:50:30 UTC
(In reply to comment #18) In the case I'm testing, the problem affects NICs bonded using mode 4 (link aggregation). Should I open a separate BZ? Or is one already open?
Comment 22 Zhenjie Chen 2012-07-27 07:43:59 UTC
Hi, I reproduced the bug in kernel 2.6.32-289. I also tested kernels 2.6.32-270, 2.6.32-279.5.1, and 2.6.32-293; this bug does not exist in those.
Comment 23 Kapetanakis Giannis 2012-08-01 09:07:53 UTC
Hi, What's the status on this one? Is it fixed in any publicly available kernel? thanx Giannis
Comment 24 Suzanne Forsberg 2012-08-01 14:34:59 UTC
(In reply to comment #23)
> Hi,
>
> What's the status on this one?
> Is it fixed on any kernel publicly available?
>
> thanx
>
> Giannis

Hi, Red Hat is working on a fix for this in an upcoming erratum for 6.3. We are targeting that release for mid-August (it is currently in test). Regards, - Sue
Comment 25 Jarod Wilson 2012-08-07 21:47:16 UTC
Patch(es) available on kernel-2.6.32-294.el6
Comment 28 Kapetanakis Giannis 2012-09-12 11:11:12 UTC
The problem seems to be solved in 2.6.32-279.5.2. I've verified that the patch https://bugzilla.redhat.com/attachment.cgi?id=599456&action=diff is applied in the source.
Comment 29 John Ronciak 2012-09-14 00:25:18 UTC
From the testing done by our validation people, the above kernel does indeed fix the issue. Sorry for the delay in getting this tested.
Comment 36 errata-xmlrpc 2013-02-21 06:42:12 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0496.html