Bug 849766
| Summary: | Bonded and vlan tagged network does not work in KVM guest | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | matthew patton <pattonme> |
| Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> |
| Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 6.3 | CC: | agospoda, akong, burghardt, cww, daniel, dhoward, dknierim, dougsland, dyasny, eblake, ekuric, gcosta, ggarland, gianluca.cecchi, haliu, hayato.suzuki, hfuchi, hjia, htaira, ilmis, jburke, jolsa, jpirko, jwest, kawasaki, kzhang, lwang, lzheng, md, michael.hagmann, nachandr, nhorman, pattonme, redhat-bugzilla, sputhenp, tburke, teruaki.ishizaki, tgraf, tis, tkubota, tomichi, yamato, ykawada |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 623199 | Environment: | |
| Last Closed: | 2012-08-29 07:21:36 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
matthew patton
2012-08-20 19:29:18 UTC
sorry, typo. "Not only do you need to play games with ARP," should be "Not only do you NOT need to play games with ARP," I have confirmed that the removal of all hypervisor bonding is NOT sufficient for it to work, either. Worse, the outbound ARP request doesn't show up when I tcpdump on eth0 whereas when I did the same on bond0, I saw both the request and the response. Though obviously the requests are being transmitted since I do see the responses.
On the guest:
[root@localhost network-scripts]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 52:54:00:c9:8d:d1 brd ff:ff:ff:ff:ff:ff
4: eth0.39@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 52:54:00:c9:8d:d1 brd ff:ff:ff:ff:ff:ff
If I sniff 'eth0' I get nothing at all. Huh, why?!?
If I sniff 'eth0.39' I get
19:32:53.808708 ARP, Request who-has 10.2.3.1 tell 10.2.3.49, length 28
0x0000: ffff ffff ffff 5254 00c9 8dd1 0806 0001
0x0010: 0800 0604 0001 5254 00c9 8dd1 0a02 0331
0x0020: 0000 0000 0000 0a02 0301
Note there is no VLAN header on the packet, as expected.
On the hypervisor:
[root@kvm1b ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:25:90:4c:ba:92 brd ff:ff:ff:ff:ff:ff
...
6: shared: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
link/ether 00:25:90:4c:ba:92 brd ff:ff:ff:ff:ff:ff
7: eth0.39@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 00:25:90:4c:ba:92 brd ff:ff:ff:ff:ff:ff
10: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 500
link/ether fe:54:00:c9:8d:d1 brd ff:ff:ff:ff:ff:ff
[root@kvm1b ~]# brctl show
bridge name bridge id STP enabled interfaces
shared 8000.0025904cba92 no eth0
vnet0
If I sniff eth0 (or eth0.39) I never see the guest's request. The response is
19:38:29.000446 ARP, Reply 10.2.3.1 is-at 00:00:5e:00:01:01 (oui Unknown), length 46
0x0000: 5254 00c9 8dd1 0050 56be 0e3c 8100 0027
0x0010: 0806 0001 0800 0604 0002 0000 5e00 0101
0x0020: 0a02 0301 5254 00c9 8dd1 0a02 0331 0000
0x0030: 0000 0000 0000 0000 0000 0000 0000 0000
Notice the frame is properly tagged "8100" with vlan ID "0027" (hex, i.e. VID 39). This packet is never queued onto vnet0, so the guest never receives it.
Otherwise when sniffing eth0 (on hypervisor) I properly see both ARP requests and responses as long as it's the hypervisor itself doing the asking or responding.
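The two captures above can be decoded mechanically to confirm what is and is not tagged. The following is only a minimal sketch: the hex strings are copied from the tcpdump output in this report (zero padding trimmed from the reply), while `parse_ethernet`, `fmt_mac`, `GUEST_REQUEST` and `HOST_REPLY` are names made up for this illustration.

```python
import struct

# Frames copied from the tcpdump output above; trailing zero padding trimmed.
GUEST_REQUEST = ("ffff ffff ffff 5254 00c9 8dd1 0806 0001 "
                 "0800 0604 0001 5254 00c9 8dd1 0a02 0331 "
                 "0000 0000 0000 0a02 0301")
HOST_REPLY = ("5254 00c9 8dd1 0050 56be 0e3c 8100 0027 "
              "0806 0001 0800 0604 0002 0000 5e00 0101 "
              "0a02 0301 5254 00c9 8dd1 0a02 0331")


def fmt_mac(raw):
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_ethernet(hex_dump):
    """Parse the Ethernet header and an optional single 802.1Q tag."""
    raw = bytes.fromhex("".join(hex_dump.split()))
    dst, src = raw[0:6], raw[6:12]
    (ethertype,) = struct.unpack("!H", raw[12:14])
    vlan_id = None
    if ethertype == 0x8100:
        # 802.1Q tag: 16-bit TCI (low 12 bits are the VID), then the real EtherType.
        tci, ethertype = struct.unpack("!HH", raw[14:18])
        vlan_id = tci & 0x0FFF
    return {"dst": fmt_mac(dst), "src": fmt_mac(src),
            "vlan_id": vlan_id, "ethertype": ethertype}


if __name__ == "__main__":
    for name, frame in (("guest request", GUEST_REQUEST),
                        ("reply on hypervisor", HOST_REPLY)):
        print(name, parse_ethernet(frame))
```

Both frames carry EtherType 0x0806 (ARP) once any tag is skipped; only the reply seen on the hypervisor's eth0 has the 0x8100 tag, and its TCI of 0x0027 is VID 39.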
Hm. I think that your problem is that you have eth0.39 defined on baremetal and, at the same time, eth0 has vlan rx accel turned on. May I ask what type of NIC eth0 is? In the accelerated vlan rx path, the driver calls __vlan_hwaccel_rx() directly, where skb->dev is set to the vlan device (eth0.39 in your case). And since this device is not part of the bridge, this skb never gets injected into the bridge code. You can observe this behaviour by setting systemtap probes.
Anyway, the solution to your problem is to remove eth0.39 on your baremetal. (In case you need to access that vlan from there, put the vlan dev upon the bridge device (shared.39).)
Closing this as notabug. Feel free to reopen in case I'm wrong with the above.

I don't know where the vlan-handling code resides in the source tree or I'd point out the line numbers where the logic is wrong. It should be thus:
1) packet is ingested off wire.
2) if vlan header is present, find vlan virtual interface that matches tag. BUT DO NOT STRIP THE HEADER!!!!
3) If and ONLY if the destination MAC of the packet == vlan virtual interface's MAC THEN you can strip the header.
4) if vlan virtual interface is member of a bridge, THEN you may also strip the header and hand off to the bridge for forwarding to the proper port.
5) if the main interface (eg. non-tagged 'bond0' or 'eth0') is a member of a bridge DO NOT STRIP THE HEADER and hand off to the bridge to forward to the proper port.
Whoever wrote the code didn't think this sequence thru and instead immediately stripped the VLAN header at step 2. NO!!!

might as well, cross-post...
**My contributions are pseudo-code and may well contain errors.** The line numbers may be off a little since the source viewer I'm using is referencing 2.6.36 but seems to be reasonably similar to 2.6.32-279.5.2.el6.
--------
So why is the rx_stats structure being updated in "net/8021q/vlan_dev.c:vlan_skb_recv()"?
Doesn't netif_rx(), which is called from seemingly every driver out there, lead directly to "net/core/dev.c:__netif_receive_skb()", which calls "net/8021q/vlan_core.c:vlan_hwaccel_do_receive()", which does the same stats updates?
--------
OK, so let's say skb->dev is 'eth0' when vlan_hwaccel_do_receive() is called. It will short-circuit and return to __netif_receive_skb(), which should call "deliver_skb()", but I can't figure out the voodoo of 'list_for_each_entry_rcu'. So does deliver_skb() call vlan_skb_recv() via function pointer? Are there plans to do something like http://lxr.free-electrons.com/source/net/core/dev.c?v=2.6.36#L2903 instead of calling handle_{bridge,macvlan} specifically?
"net/8021q/vlan_dev.c:vlan_skb_recv()"
170 if (!vlan_dev) {
171     if (vlan_id) {
172         pr_debug ("%s: ERROR: No net_device for VID ....
So here's the problem. Instead of throwing the packet away, we need to drop thru so we can get picked up by __netif_receive_skb:handle_bridge()? We don't care at this stage if ultimately there isn't a bridge member that will consume the vlan-tagged packet. It's up to handle_bridge() to find a matching destination MAC in its port list and queue it up for delivery, or toss it. The placement of handle_bridge() and the loop to retry 'deliver_skb()' for vlan-stacked bonds concerns me. Nonetheless, what if vlan_skb_recv() did something like this?
if (!vlan_dev) {
    if (dev->br_port) {
        rcu_read_unlock();
        return NET_RX_SUCCESS;
    } else if (vlan_id) {
        // no next step so we can safely throw it away
        pr_debug (....
        goto err_unlock;
    }
    rx_stats = NULL;
} else {
    skb->dev = vlan_dev;
    ...
I thought about doing
if (!vlan_dev) {
    if (dev->br_port) {
        skb->dev = (dev->br_port)->br;
    } else if (vlan_id) ....
but that would involve getting re-queued, and the effects of vlan_set_encap_proto() further down in vlan_skb_recv() don't strike me as benign. I guess another 'goto' label after "netif_rx(skb);" would be cleaner.
There are still more problems with the code path.
SCENARIO:
eth0 is a member of bridge0.
eth0.39 is defined as a host vlan interface.
A virtual machine has vnet0 attached to bridge0. Said virtual machine is doing VLAN tagging with VID=39, amongst others.
First time around, __netif_receive_skb() will invoke vlan_hwaccel_do_receive() with skb->dev set to eth0, which short-circuits back since it's not a vlan device. By virtue of 'eth0.39' being defined, the packet then goes into the 'else' clause of vlan_skb_recv(), where it (IMO wrongly) increments various counters and gets re-queued with skb->dev set to eth0.39. This time vlan_hwaccel_do_receive() gets past the !is_vlan_dev(dev) check and a couple of lines down just obliterates the VLAN header via:
skb->vlan_tci = 0;
Except the packet probably wasn't for the host but for the virtual machine! We need to check if the destination MAC == interface MAC. The line needs to be something like:
if (dev->br_port || (skb->pkt_type == PACKET_HOST) ||
    /* seems whoever wrote compare_ether_addr thought he was writing a shell script? comparison is backwards. */
    !compare_ether_addr(eth_hdr(skb)->h_dest, dev->dev_addr)) {
    skb->vlan_tci = 0;
}
Something else caught my eye. In __netif_receive_skb() there is:
if (vlan_tx_tag_present(skb) && vlan_hwaccel_do_receive(skb))
    return NET_RX_SUCCESS;
From what I can see, vlan_hwaccel_do_receive() always returns '0', so this statement will never evaluate to true. Why is it written this way? It is rather misleading.
Now will this problem be taken seriously and looked at?!?!

*** This bug has been marked as a duplicate of bug 855437 ***
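
For readers following the thread, the behaviour described in the comments and the five-step order the reporter proposes can be contrasted in a toy model. This is only a sketch in Python, not kernel code: `Host`, `observed_receive` and `proposed_receive` are hypothetical names, the device names and MAC addresses are taken from the output in this report, and the model ignores broadcast frames, bonding and stacked tags.

```python
from dataclasses import dataclass, field

GUEST_MAC = "52:54:00:c9:8d:d1"   # guest eth0 / eth0.39
HOST_MAC = "00:25:90:4c:ba:92"    # hypervisor eth0 / eth0.39


@dataclass
class Host:
    """Toy stand-in for the hypervisor's network configuration."""
    bridge_ports: set
    vlan_devs: dict = field(default_factory=dict)  # (parent dev, vid) -> vlan dev
    macs: dict = field(default_factory=dict)       # dev name -> MAC


def observed_receive(host, dev, vid):
    """Behaviour as described in the comments: a matching host vlan device
    takes over skb->dev and the tag is dropped before the bridge is consulted."""
    vlan_dev = host.vlan_devs.get((dev, vid))
    if vlan_dev is not None:
        dev, vid = vlan_dev, None
    target = "bridge" if dev in host.bridge_ports else "host"
    return target, dev, vid


def proposed_receive(host, dev, vid, dst_mac):
    """The reporter's proposed order (steps 2 to 5 of the list in the thread)."""
    vlan_dev = host.vlan_devs.get((dev, vid))        # step 2: look up, keep tag
    if vlan_dev is not None:
        if host.macs.get(vlan_dev) == dst_mac:       # step 3: addressed to host
            return "host", vlan_dev, None
        if vlan_dev in host.bridge_ports:            # step 4: vlan dev is bridged
            return "bridge", vlan_dev, None
    if dev in host.bridge_ports:                     # step 5: bridge, tag intact
        return "bridge", dev, vid
    return "drop", dev, vid


# Configuration from the report: eth0 and vnet0 bridged, eth0.39 on the host.
REPORTED = Host(bridge_ports={"eth0", "vnet0"},
                vlan_devs={("eth0", 39): "eth0.39"},
                macs={"eth0.39": HOST_MAC})
# Suggested workaround: no eth0.39 on the host.
WORKAROUND = Host(bridge_ports={"eth0", "vnet0"})

if __name__ == "__main__":
    print(observed_receive(REPORTED, "eth0", 39))
    print(observed_receive(WORKAROUND, "eth0", 39))
    print(proposed_receive(REPORTED, "eth0", 39, GUEST_MAC))
```

In this model the reported configuration hands the VID 39 frame to the host's eth0.39 and never to the bridge, removing eth0.39 lets the tagged frame reach the bridge, and the proposed order would deliver it to the bridge with the tag intact even while eth0.39 exists.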