Bug 849766

Summary:	Bonded and vlan tagged network does not work in KVM guest
Product:	Red Hat Enterprise Linux 6	Reporter:	matthew patton <pattonme>
Component:	kernel	Assignee:	Red Hat Kernel Manager <kernel-mgr>
Status:	CLOSED DUPLICATE	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	6.3	CC:	agospoda, akong, burghardt, cww, daniel, dhoward, dknierim, dougsland, dyasny, eblake, ekuric, gcosta, ggarland, gianluca.cecchi, haliu, hayato.suzuki, hfuchi, hjia, htaira, ilmis, jburke, jolsa, jpirko, jwest, kawasaki, kzhang, lwang, lzheng, md, michael.hagmann, nachandr, nhorman, pattonme, redhat-bugzilla, sputhenp, tburke, teruaki.ishizaki, tgraf, tis, tkubota, tomichi, yamato, ykawada
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	623199	Environment:
Last Closed:	2012-08-29 07:21:36 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description matthew patton 2012-08-20 19:29:18 UTC

+++ This bug was initially created as a clone of Bug #623199 +++

Description of problem:

The two ethernet ports are bonded. I then create a tagged vlan interface on top of the bond and associate the bond with a bridge. The physical switch ports are set to TRUNK mode.

bond0=eth0+eth1, mode=active/passive
bond0: Bridge=shared
bond0.39 defined as VLAN=yes with IPADDR and NETMASK defined

Note that the bridge is supposed to be VLAN-agnostic; i.e. I can use as many bond0.XX tagged interfaces as I like, as long as only 'bond0' is directly attached to the bridge. From a host networking perspective this works just fine and is expected behavior.

[ Aside. There is actually another problem in the vlan+bond+bridge code paths. If I were to 'brctl addif shared bond0.39' instead of 'bond0' then the kernel spews "bond0.39: received packet with own address as source address"]

I then defined a QEMU network and the CentOS 6.3 guest's interface stanza.

<network>
    <name>bridge-shared</name>
    <uuid>0824e08b-bb0e-452a-e2c9-dcca52af2341</uuid>
    <forward mode='bridge'/>
    <bridge name='shared' />
</network>

<interface type='bridge'>
   <mac address='52:54:00:c9:8d:d1'/>
   <source bridge='shared'/>
   <model type='virtio'/>
   <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>

[root@kvm1b ~]# virsh domiflist test
Interface  Type       Source     Model       MAC
-------------------------------------------------------
vnet0      bridge     shared     virtio      52:54:00:c9:8d:d1


[root@kvm1b networks]# virsh iface-dumpxml shared
<interface type='bridge' name='shared'>
    <protocol family='ipv6'>
      <ip address='fe80::225:90ff:fe4c:ba92' prefix='64'/>
    </protocol>
    <bridge>
      <interface type='bond' name='bond0'>
        <bond>
          <interface type='ethernet' name='eth0'>
            <mac address='00:25:90:4c:ba:92'/>
          </interface>
          <interface type='ethernet' name='eth1'>
            <mac address='00:25:90:4c:ba:92'/>
          </interface>
        </bond>
      </interface>
      <interface type='ethernet' name='vnet0'>
        <mac address='fe:54:00:c9:8d:d1'/>
      </interface>
    </bridge>
</interface>

Inside the guest, I obviously must use VLAN tags or else the traffic will not get handled by the physical switches. When I run tcpdump on bond0 and bond0.39 I see all ARP request and reply frames with the proper VLAN header and correct tag value.

However, (and IMO this is the root cause of the bug) the source (dest if reply) MAC address of these packets is the "private" MAC address of the guest (52:54:00:c9:8d:d1) instead of the host's vnet0 (fe:54:00:c9:8d:d1). And after I generate some traffic on the guest and do: 

# brctl showmacs shared

I see both of the MAC addresses being registered to bridge port #2. How is this correct? In any event, the ARP packets never get put back on 'vnet0' and definitely do not show up on the guest's side on eth0.

Isn't this supposed to work just like IPTABLES' S-NAT or D-NAT such that any guest packet picked up by the host on vnet0 must have the client's MAC address rewritten to the 'fe:*' value and every packet that shows up at the host from external sources must be rewritten when re-queued onto vnet0 for the guest to process?

Unlike the person who filed Bug 623199, dropping a bond member does nothing. Indeed the only way I've found to work around this problem is to not do any VLAN tagging anywhere except at the hardware switch (which rather misses the point) or to not do any bonding at all. And while in this latter case I believe I observed that the packets returned to the guest, that is probably a bug in and of itself.

In short, it is a violation to leak the client MAC to the world. Or if the intent was to proxy-arp for the guest, the logic to re-queue the packet on vnet0 is broken. IMO it makes rather more sense to have vnetN have the real (IANA registered range) MAC addresses and the guest's ethN side to have some "bogus" value instead of being the other way around. Not only do you need to play games with ARP, the hypervisor's networking stack is blissfully unaware of any hanky-panky going on. Only the QEMU network driver needs to know to rewrite the MAC address depending on which direction the packet is going.


Version-Release number of selected component (if applicable):

kernel 2.6.279.2 and 2.6.32-279.5.1 and probably all previous versions as well
libvirt v0.9.10-21.el6 


How reproducible:
This failure is 100% reproducible.

Steps to Reproduce:
1. Create an active-standby bond 
2. Create a tagged vlan interface on said bond
3. Create a bridge associated with said bond
4. Create a guest associated with said bridge and internally define a VLAN tagged sub-interface
5. Witness that ARP replies never get sent to the guest on either vnet0 or guest's eth0 but are very evident on bond0 and bond0.39
  
Actual results:
ARP replies are not passed to guest

Expected results:
ARP replies should be passed to guest

Additional info:

Comment 1 matthew patton 2012-08-20 19:33:50 UTC

sorry, typo.
"Not only do you need to play games with ARP," should be
"Not only do you NOT need to play games with ARP,"

Comment 3 matthew patton 2012-08-20 23:52:50 UTC

I have confirmed that the removal of all hypervisor bonding is NOT sufficient for it to work, either. Worse, the outbound ARP request doesn't show up when I tcpdump on eth0 whereas when I did the same on bond0, I saw both the request and the response. Though obviously the requests are being transmitted since I do see the responses.

On the guest:

[root@localhost network-scripts]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:c9:8d:d1 brd ff:ff:ff:ff:ff:ff
4: eth0.39@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 52:54:00:c9:8d:d1 brd ff:ff:ff:ff:ff:ff

If I sniff 'eth0' I get nothing at all. Huh, why?!?
If I sniff 'eth0.39' I get

19:32:53.808708 ARP, Request who-has 10.2.3.1 tell 10.2.3.49, length 28
        0x0000:  ffff ffff ffff 5254 00c9 8dd1 0806 0001
        0x0010:  0800 0604 0001 5254 00c9 8dd1 0a02 0331
        0x0020:  0000 0000 0000 0a02 0301

Note there is no VLAN header on the packet as expected.

On the hypervisor:
[root@kvm1b ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:90:4c:ba:92 brd ff:ff:ff:ff:ff:ff
...
6: shared: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:25:90:4c:ba:92 brd ff:ff:ff:ff:ff:ff
7: eth0.39@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:25:90:4c:ba:92 brd ff:ff:ff:ff:ff:ff
10: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 500
    link/ether fe:54:00:c9:8d:d1 brd ff:ff:ff:ff:ff:ff

[root@kvm1b ~]# brctl show
bridge name     bridge id               STP enabled     interfaces
shared          8000.0025904cba92       no              eth0
                                                        vnet0

If I sniff eth0 (or eth0.39) I never see the guest's request. The response is

19:38:29.000446 ARP, Reply 10.2.3.1 is-at 00:00:5e:00:01:01 (oui Unknown), length 46
        0x0000:  5254 00c9 8dd1 0050 56be 0e3c 8100 0027
        0x0010:  0806 0001 0800 0604 0002 0000 5e00 0101
        0x0020:  0a02 0301 5254 00c9 8dd1 0a02 0331 0000
        0x0030:  0000 0000 0000 0000 0000 0000 0000 0000

Notice the frame is properly tagged "8100" with vlan ID "0027". This packet is never queued onto vnet0 so that the guest receives it.
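
To make the two hex dumps easier to compare, here is a minimal Python sketch that decodes them. The byte strings are copied from the tcpdump output above (guest request and hypervisor-side reply); the function name parse_frame and the dictionary keys are illustrative only and come from no tool. It assumes a single 802.1Q tag at most and an Ethernet/IPv4 ARP payload.

```python
import struct

# Bytes copied from the two tcpdump hex dumps in this comment.
REQUEST_HEX = (
    "ffff ffff ffff 5254 00c9 8dd1 0806 0001 "
    "0800 0604 0001 5254 00c9 8dd1 0a02 0331 "
    "0000 0000 0000 0a02 0301"
)
REPLY_HEX = (
    "5254 00c9 8dd1 0050 56be 0e3c 8100 0027 "
    "0806 0001 0800 0604 0002 0000 5e00 0101 "
    "0a02 0301 5254 00c9 8dd1 0a02 0331"
)


def fmt_mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def fmt_ip(raw: bytes) -> str:
    return ".".join(str(b) for b in raw)


def parse_frame(frame: bytes) -> dict:
    """Decode the Ethernet header, an optional 802.1Q tag, and an ARP body."""
    info = {"dst": fmt_mac(frame[0:6]), "src": fmt_mac(frame[6:12]), "vlan_id": None}
    (ethertype,) = struct.unpack("!H", frame[12:14])
    offset = 14
    if ethertype == 0x8100:  # 802.1Q tag: 16-bit TCI, then the real EtherType
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        info["vlan_id"] = tci & 0x0FFF  # low 12 bits of the TCI are the VLAN ID
        offset = 18
    info["ethertype"] = ethertype
    if ethertype == 0x0806:  # ARP
        _htype, _ptype, _hlen, _plen, op = struct.unpack(
            "!HHBBH", frame[offset:offset + 8]
        )
        body = frame[offset + 8:offset + 28]
        info["arp_op"] = op  # 1 = request, 2 = reply
        info["sender_mac"] = fmt_mac(body[0:6])
        info["sender_ip"] = fmt_ip(body[6:10])
        info["target_mac"] = fmt_mac(body[10:16])
        info["target_ip"] = fmt_ip(body[16:20])
    return info


request = parse_frame(bytes.fromhex(REQUEST_HEX))
reply = parse_frame(bytes.fromhex(REPLY_HEX))
print(request)
print(reply)
```

The decoded reply carries VLAN ID 39 and is addressed to the guest's MAC (52:54:00:c9:8d:d1), while the request captured on the guest's eth0.39 has no tag, which matches the observations above.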

Otherwise when sniffing eth0 (on hypervisor) I properly see both ARP requests and responses as long as it's the hypervisor itself doing the asking or responding.

Comment 5 Jiri Pirko 2012-08-29 07:21:36 UTC

Hm. I think your problem is that you have eth0.39 defined on baremetal and, at the same time, eth0 has vlan rx accel turned on. May I ask what type of NIC eth0 is?

In the accelerated vlan rx path, the driver calls __vlan_hwaccel_rx() directly, where skb->dev is set to the vlan device (eth0.39 in your case). Since this device is not part of the bridge, the skb never gets injected into the bridge code.
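
The effect described above can be illustrated with a toy model. This is a Python sketch, not kernel code: the Host class and receive() function are hypothetical, and the model assumes only what the explanation states, namely that a tagged frame is retargeted to a matching VLAN device if one exists on the receiving device, and that only frames whose device is a bridge port reach the bridge.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    # Device names enslaved to the bridge (toy stand-in for bridge ports).
    bridge_ports: set
    # (parent device, VLAN ID) -> VLAN device name, e.g. ("eth0", 39) -> "eth0.39".
    vlan_devs: dict = field(default_factory=dict)


def receive(host: Host, rx_dev: str, vid: int) -> str:
    """Say where a tagged frame arriving on rx_dev ends up in this toy model."""
    # Accelerated path: if a VLAN device matches, the frame now belongs to it.
    dev = host.vlan_devs.get((rx_dev, vid), rx_dev)
    if dev in host.bridge_ports:
        return "bridge"
    return f"local:{dev}"


# Setup from this report: eth0 is a bridge port and eth0.39 also exists.
reported = Host(bridge_ports={"eth0", "vnet0"}, vlan_devs={("eth0", 39): "eth0.39"})
# Suggested setup: no eth0.39; the host's VLAN device sits on the bridge instead.
suggested = Host(bridge_ports={"eth0", "vnet0"}, vlan_devs={("shared", 39): "shared.39"})

print(receive(reported, "eth0", 39))   # frame is claimed by eth0.39
print(receive(suggested, "eth0", 39))  # frame reaches the bridge
```

In the reported setup the frame is claimed by eth0.39 and never reaches the bridge, so it cannot be forwarded to vnet0. With no VLAN device on eth0, the same frame is handed to the bridge.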

You can observe this behaviour by setting systemtap probes.

Anyway, the solution to your problem is to remove eth0.39 on your baremetal. (In case you need to access that vlan from there, put the vlan dev on top of the bridge device, i.e. shared.39.)

Closing this as notabug. Feel free to reopen in case I'm wrong with the above.

Comment 6 matthew patton 2012-09-13 19:22:59 UTC

I don't know where the vlan-handling code resides in the source tree or I'd point out the line numbers where the logic is wrong. It should be thus:

1) packet is ingested off wire.
2) if vlan header is present, find vlan virtual interface that matches tag. BUT DO NOT STRIP THE HEADER!!!!
3) If and ONLY if the destination MAC of the packet == vlan virtual interface's MAC THEN you can strip the header.
4) if vlan virtual interface is member of a bridge, THEN you may also strip the header and hand off to the bridge for forwarding to the proper port.
5) if the main interface (eg. non-tagged 'bond0' or 'eth0') is a member of a bridge DO NOT STRIP THE HEADER and hand off to the bridge to forward to the proper port.
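
The five steps above can be sketched as a small decision function. This is an illustrative Python model of the proposed ordering only, not kernel code: VlanIface and handle_rx are made-up names, and the handling of untagged frames and of frames that match no rule is an assumption, since the list does not cover those cases.

```python
from typing import NamedTuple, Optional


class VlanIface(NamedTuple):
    name: str
    mac: str
    in_bridge: bool


def handle_rx(dst_mac: str, vid: Optional[int], parent_in_bridge: bool,
              vlan_ifaces: dict) -> tuple:
    """Return (destination, strip_tag) following steps 2-5 of the list."""
    if vid is None:
        return ("parent", False)       # untagged frame: assumption, not in the list
    vif = vlan_ifaces.get(vid)         # step 2: look up by tag, keep the header
    if vif is not None:
        if dst_mac == vif.mac:
            return (vif.name, True)    # step 3: frame is for the VLAN interface
        if vif.in_bridge:
            return ("bridge", True)    # step 4: VLAN interface is a bridge member
    if parent_in_bridge:
        return ("bridge", False)       # step 5: bridge gets the frame, tag intact
    return ("drop", False)             # nothing matched: assumption


HOST_MAC = "00:25:90:4c:ba:92"
GUEST_MAC = "52:54:00:c9:8d:d1"
ifaces = {39: VlanIface("eth0.39", HOST_MAC, in_bridge=False)}

print(handle_rx(GUEST_MAC, 39, True, ifaces))  # guest traffic: to bridge, still tagged
print(handle_rx(HOST_MAC, 39, True, ifaces))   # host traffic: to eth0.39, tag stripped
```

With the addresses from this report, a frame for the guest goes to the bridge with its tag intact, and a frame for the host's own MAC goes to eth0.39 with the tag stripped.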


Whoever wrote the code didn't think this sequence thru and instead immediately stripped the VLAN header at step 2. NO!!!

Comment 7 matthew patton 2012-09-15 10:57:06 UTC

might as well, cross-post...

**My contributions are pseudo-code and may well contain errors.**

The line numbers may be off a little since the source viewer I'm using is referencing 2.6.36 but seems to be reasonably similar to 2.6.32-279.5.2.el6.

--------
So why is the rx_stats structure being updated in "net/8021q/vlan_dev.c:vlan_skb_recv()"? Doesn't netif_rx() which is called from seemingly every driver out there, lead directly to "net/core/dev.c:__netif_receive_skb()", which calls "net/8021q/vlan_core.c:vlan_hwaccel_do_receive()" which does the same stats updates?
-------- 

OK so let's say skb->dev is 'eth0' when vlan_hwaccel_do_receive() is called. It will short-circuit, return to __netif_receive_skb() which should call "deliver_skb()" but I can't figure out the voodoo of 'list_for_each_entry_rcu'. 

So does deliver_skb() call vlan_skb_recv() via function pointer? 

Are there plans to do something like
http://lxr.free-electrons.com/source/net/core/dev.c?v=2.6.36#L2903
instead of calling handle_{bridge,macvlan} specifically?


"net/8021q/vlan_dev.c:vlan_skb_recv()"
170 if (!vlan_dev) {
171    if (vlan_id) {
172       pr_debug ("%s: ERROR: No net_device for VID .... 

So here's the problem. Instead of throwing the packet away, we need to drop thru so we can get picked up by __netif_receive_skb:handle_bridge()? We don't care at this stage if ultimately there isn't a bridge member that will consume the vlan-tagged packet. That's up to handle_bridge() to find a matching destination MAC in its port list and queue it up for delivery, or toss it.

The placement of handle_bridge() and the loop to retry 'deliver_skb()' for vlan-stacked bonds concerns me. Nonetheless, what if vlan_skb_recv() did something like this?

if (!vlan_dev) {
  if (dev->br_port) {
    rcu_read_unlock();
    return NET_RX_SUCCESS;
  } else if (vlan_id) { // no next step so we can safely throw it away
    pr_debug (....
    goto err_unlock;
  }
  rx_stats = NULL;
} else {
  skb->dev = vlan_dev;
  ...


I thought about doing  
if (!vlan_dev) {
  if (dev->br_port) {
    skb->dev = (dev->br_port)->br;
   } else if (vlan_id) ....

but that would involve getting re-queued and the effects of vlan_set_encap_proto() further down in vlan_skb_recv() don't strike me as benign. I guess another 'goto' tag after "netif_rx(skb);" would be cleaner.


There are still more problems with the code path.
SCENARIO:
eth0 member of bridge0
eth0.39 is defined as a host vlan interface
virtual machine has vnet0 attached to bridge0. said virtual machine is doing VLAN tagging with VID=39 amongst others.

First time around, __netif_receive_skb() will invoke vlan_hwaccel_do_receive() with skb->dev set to eth0, which short-circuits back since it's not a vlan device. By virtue of 'eth0.39' being defined, the packet then goes into the 'else' clause of vlan_skb_recv(), where it (IMO wrongly) increments various counters and gets re-queued with skb->dev set to eth0.39. This time vlan_hwaccel_do_receive() gets past the !is_vlan_dev(dev) check and a couple of lines down just obliterates the VLAN header via:

     skb->vlan_tci = 0;

Except the packet probably wasn't for the host but the virtual machine! We need to check if the destination MAC == interface MAC. The line needs to be something like:

if (dev->br_port || (skb->pkt_type == PACKET_HOST) || \
  /* seems whomever wrote compare_ether_addr thought he was writing a shell script? comparison is backwards. */
  !compare_ether_addr(eth_hdr(skb)->h_dest, dev->dev_addr)) {
    skb->vlan_tci = 0;
}


Something else caught my eye. in __netif_receive_skb() there is:

  if (vlan_tx_tag_present(skb) && vlan_hwaccel_do_receive(skb))
    return NET_RX_SUCCESS;

From what I can see vlan_hwaccel_do_receive() always returns '0' so this statement will never evaluate to true. Why is this written this way? It is rather misleading.


Now will this problem be taken seriously and looked at?!?!

Comment 8 matthew patton 2012-09-15 10:59:15 UTC


*** This bug has been marked as a duplicate of bug 855437 ***

Comment 9 hayato.suzuki 2014-09-11 08:51:13 UTC

http://h20564.www2.hp.com/portal/site/hpsc/public/kb/docDisplay/?docId=emr_na-c04267968