Bug 733224
Summary: vlan not accessible through a bridge configuration

| Field | Value | Field | Value |
|---|---|---|---|
| Product | [Fedora] Fedora | Reporter | ejbg |
| Component | kernel | Assignee | Neil Horman <nhorman> |
| Status | CLOSED NEXTRELEASE | QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| Severity | urgent | Priority | unspecified |
| Version | 15 | Hardware | x86_64 |
| OS | Linux | Doc Type | Bug Fix |
| Last Closed | 2011-11-16 15:44:57 UTC | | |

CC: dalefarm, gansalmon, iarlyy, itamar, jonathan, kernel-maint, madhu.chinakonda, nhorman, notting, Per.t.Sjoholm, phresus, plautrba, ppisar, rrakus
Description
ejbg
2011-08-25 07:48:52 UTC
vconfig has long been obsolete and has been superseded by the 'ip' command from the 'iproute' package. AFAIK the /etc/sysconfig/network-scripts/* files are parsed by scripts provided by the 'initscripts' package, and a simple grep in initscripts-9.30-2.fc15.x86_64 did not find any calls to vconfig. So this is not an issue for the vconfig package. This is very probably a kernel issue, as vconfig or ip just configure interfaces and rules that the kernel applies to frames/packets.

Hi Bill, have you already found out whether this comes from an initscripts problem rather than a kernel problem? Right now, creating a KVM virtual machine on a particular VLAN is not possible, as the bridge configuration, whether on the direct vlan interface or on a bonding VLAN one, does not work at all on Fedora 15, and one needs to go back to Fedora 14 to test such a config. Thanks, Eric.

Was there any resolution to this? I'm having exactly the same problem on a Fedora 15 machine, and no permutation of the configuration files has resolved it.

Hi Ryan, I don't think this bug has been resolved yet; I never saw anything about it. I believe it is not a common usage, or maybe there is a way of doing this that I don't know of yet. I dropped Fedora 15 and went back to Fedora 14, where it works fine. I will try Fedora 16 to see if anything has changed there. Regards, Eric.

I've been having the same problem with a Fedora 15 install recently. I'm pretty sure it has nothing to do with initscripts. My configuration was working fine in F14. It also works fine in an F15 install using kernel <= 2.6.38.8-35. If I upgrade to a >= 2.6.40 kernel (aka 3.x), it no longer works. I've tried various different configurations (bridge-on-vlan, vlan-on-bridge) with no joy (see for example http://unix.stackexchange.com/questions/18576/why-does-adding-a-non-vlaned-interface-to-a-bridge-break-the-vlaned-interfaces).

Using another machine, I've sniffed the traffic from my F15 box.
I see that the egress VLAN traffic from that F15 box is not tagged with a VLAN ID, with the result that I cannot reach machines on my VLAN from the F15 box. However, if I sniff traffic on the F15 box itself, I do see the VLAN ID in the header. So I'm suspecting a driver / kernel problem here. My motherboard uses an NVIDIA MCP55 chipset, with the forcedeth driver. 'lspci -nn' shows :

    00:08.0 Bridge [0680]: nVidia Corporation MCP55 Ethernet [10de:0373] (rev a2)

I also note that there has been some VLAN-related rework in the forcedeth driver (kernel too?) of late that appears to have been causing various issues for some, e.g. https://lkml.org/lkml/2011/8/5/115

Eric, Ryan, what network adaptor / driver are you using? Which kernel?

dalefarm, when you sniff egress traffic, you're likely not going to see prepended vlan tags, as they are kept out of band in the vlan_tci field of the skb, to be prepended to the frame by the hardware on egress. Your best bet may be to run a stap script to probe egress skbs to see if $skb->vlan_tci is set appropriately on entry to the driver's ndo_start_xmit method. If it is, then it is likely the driver is not properly telling the hardware to attach a vlan header to the frame.

>> when you sniff egress traffic, you're likely not going to see prepended vlan tags
I'm sniffing the traffic from my F15 box, using a separate physical machine entirely.
In the 'working' case I see the VLAN tags in the egress traffic.
In the 'non-working' case I don't see the VLAN tags in the egress traffic.
The fact that I could see them (in either case) when I sniffed directly on the F15 box doesn't concern me - it just lends more credence to my suspicion that this is a kernel/driver issue.
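The on-the-wire check described above (tag present in the working case, absent in the non-working case) can be sketched in a few lines. This is a minimal illustration, not a tool used in this thread: it assumes a raw Ethernet frame captured as bytes, the function name `vlan_id_on_wire` is hypothetical, and the sample frames use placeholder all-zero MAC addresses.

```python
import struct

ETH_P_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag


def vlan_id_on_wire(frame: bytes):
    """Return the VLAN ID carried in a raw Ethernet frame, or None if untagged."""
    if len(frame) < 18:
        return None  # too short to hold two MACs, a 4-byte tag and an EtherType
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETH_P_8021Q:
        return None  # no 802.1Q tag directly after the MAC addresses
    (tci,) = struct.unpack("!H", frame[14:16])
    return tci & 0x0FFF  # the low 12 bits of the TCI are the VLAN ID


# Hypothetical frames: placeholder MACs, one tagged for VLAN 11, one untagged.
macs = bytes(6) + bytes(6)
tagged = macs + struct.pack("!HH", ETH_P_8021Q, 11) + struct.pack("!H", 0x0800)
untagged = macs + struct.pack("!H", 0x0800) + bytes(4)

print(vlan_id_on_wire(tagged))    # 11
print(vlan_id_on_wire(untagged))  # None
```

A capture taken on the sending host itself may still report the tag even when the wire does not carry it, which is why the comparison is made from a separate machine.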
I tend to agree, especially if you can run the above stap script that I outlined. If vlan_tci is set properly in the driver's ndo_start_xmit routine, but no vlan tags appear on the wire for that port, we can be fairly certain the driver isn't telling the hardware to add a vlan tag properly.

Thanks Neil. Unfortunately that MCP55-chipset based box is in a semi-lockdown state for the next few weeks, so I'm unlikely to be able to do any debugging there for a little while. However, I was able to try this same networking configuration on a box with Realtek hardware (RTL8111/8168B - PCI ID [10ec:8168]), and this works fine (kernel here is 2.6.40.4-5.fc15.x86_64).

Ok, while you wait to get back on the system in question I'll see if I can find a box here to recreate this with.

dalefarm, I tried it on 2 different machines, and both use the latest Fedora 15 x86_64 kernel (currently 2.6.40.6-0.fc15.x86_64).

The first machine uses network adapter :

    09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5752 Gigabit Ethernet PCI Express (rev 02)

The second machine uses network adapter :

    02:00.0 Ethernet controller: Broadcom Corporation NetLink BCM57785 Gigabit Ethernet PCIe (rev 10)

Regards, Eric.

ejbg, why did you do that? Did you get the same failure that he describes on the forcedeth hardware? Did you run the stap script I suggested?

Neil, I've been able to get a couple of hours with that problem system.
I hacked together a quick stap script 'nv.stp' to inspect calls to the forcedeth module 'nv_start_xmit' and 'nv_start_xmit_optimized' functions :

    #!/usr/bin/stap
    probe module("forcedeth").function("nv_start_xmit") {
        printf("nv_start_xmit\n");
    }
    probe module("forcedeth").function("nv_start_xmit_optimized") {
        printf("nv_start_xmit_optimized dev %p (features: %x hw %x wanted %x vlan %x) skb %p vlan_tci %x\n",
               $dev, $dev->features, $dev->hw_features, $dev->wanted_features, $dev->vlan_features, $skb, $skb->vlan_tci);
    }

Running this on kernel 2.6.40.6-0.fc15.x86_64 yields the following output :

    [root@localhost stap]# ./nv.stp
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600149a3 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff880174204700 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600149a3 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff88004841df00 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600149a3 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff88004841d400 vlan_tci 100b
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600149a3 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff880172b5ff00 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600149a3 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff88015f228b00 vlan_tci 100b
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600149a3 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff880173149c00 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600149a3 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff88017202bb00 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600149a3 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff880173669b00 vlan_tci 100b
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600149a3 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff8801752cb700 vlan_tci 0

There is a mix of non-vlan and vlan (id = 11) traffic on this system, so the vlan_tci values of '0' and '100b' are expected. However, as previously noted, an external system sees all packets as untagged.

I attempted to disable hw acceleration, but cannot disable hw vlan via ethtool :

    [root@localhost stap]# ethtool -K eth0 tx off
    [root@localhost stap]# ethtool -K eth0 rx off
    [root@localhost stap]# ethtool -K eth0 gro off
    [root@localhost stap]# ethtool -K eth0 tso off
    [root@localhost stap]# ethtool -K eth0 txvlan off
    Cannot set device flag settings: Operation not supported
    [root@localhost stap]# ethtool -K eth0 rxvlan off
    Cannot set device flag settings: Operation not supported
    [root@localhost stap]# ethtool -k eth0
    Offload parameters for eth0:
    rx-checksumming: on
    tx-checksumming: off
    scatter-gather: off
    tcp-segmentation-offload: off
    udp-fragmentation-offload: off
    generic-segmentation-offload: off
    generic-receive-offload: off
    large-receive-offload: off
    rx-vlan-offload: on
    tx-vlan-offload: on
    ntuple-filters: off
    receive-hashing: off

With these ethtool settings, the same 'nv' stap script yielded :

    [root@localhost stap]# ./nv.stp
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600001a0 hw 60014803 wanted 40000801 vlan 4020) skb 0xffff88017297aa00 vlan_tci 100b
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600001a0 hw 60014803 wanted 40000801 vlan 4020) skb 0xffff88002a561000 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600001a0 hw 60014803 wanted 40000801 vlan 4020) skb 0xffff880177193d00 vlan_tci 100b
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600001a0 hw 60014803 wanted 40000801 vlan 4020) skb 0xffff88002a561000 vlan_tci 100b
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600001a0 hw 60014803 wanted 40000801 vlan 4020) skb 0xffff88002a561b00 vlan_tci 100b
    nv_start_xmit_optimized dev 0xffff880172e04000 (features: 600001a0 hw 60014803 wanted 40000801 vlan 4020) skb 0xffff88002a561300 vlan_tci 0

I then reverted to a working configuration (kernel 2.6.38.8-35.fc15.x86_64) on the same system. (I had to tweak my stap script a little, since many $dev->features members are not present, or not traceable, in this setup; only dev->vlan_features was available.) Here's the output of the stap run :

    nv_start_xmit_optimized dev 0xffff8801708ee000 ( vlan 4020) skb 0xffff880093b88600 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff8801708ee000 ( vlan 4020) skb 0xffff8800938f1700 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff8801708ee000 ( vlan 4020) skb 0xffff880171da8400 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff8801708ee000 ( vlan 4020) skb 0xffff880171da8f00 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff8801708ee000 ( vlan 4020) skb 0xffff88016f9d0e00 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff8801708ee000 ( vlan 4020) skb 0xffff8800a18e8000 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff8801708ee000 ( vlan 4020) skb 0xffff880096539d00 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff8801708ee000 ( vlan 4020) skb 0xffff880096539700 vlan_tci 0

Surprisingly, even with a (verified) mix of tagged and non-tagged egress traffic, the vlan_tci field here is always '0', implying that no tagging is requested of the forcedeth driver. So at a guess :

1. kernel 2.6.38.8-35 itself is adding the vlan tag, prior to sending the skb to the driver for transmit
2. kernel 2.6.40.6-0 is attempting to make use of the driver to add vlan tags
3. The driver has a bug that results in no vlan tag being added

Further, I'd suggest that the driver may have had this bug for quite a while, but the change in kernel behavior has now exposed it. Plausible? -dalefarm

I'd say you're spot on. Looks like commit 0891b0e08937aaec2c4734acb94c5ff8042313bb borked you. It looks like that change cleared the VLAN_TX flag for the enabled features list, and set it on the user modifiable features list, but the driver uses the features flag to default enable VLAN acceleration on TX and RX, so you're left in this odd state where acceleration isn't enabled in the hardware by default, but the driver is telling the stack that it is.
And it appears you can't change the feature (as you've noted) because forcedeth uses the new set_features interface that ethtool in userspace doesn't yet support. I think the thing to do here is revert part of this commit, specifically the part that disabled vlan acceleration, and see if that gets you working again. I'll have a patch shortly.

Actually, scratch that, it appears that the features flag is set properly a little farther down, so I still need to figure out why it is that prior to the 2.6.40 kernel we didn't do vlan accel.

There's discussion of a very similar problem on LKML from February this year : https://lkml.org/lkml/2011/2/21/386

Hm, so the more I look at this the more confusing it seems. By all rights, before and after that patch, we should have dev->features set in such a way as to indicate that the device supports TX vlan acceleration. In fact, in 2.6.30 all the way through 3.1, this should be the case, yet in 2.6.38 we seem to have no vlan_tci information reaching the driver, which seems quite wrong. Can you do me a favor? Can you please attach /var/log/messages from your system here, from a boot of the 2.6.38 kernel and of the 2.6.40 kernel? I'd like to look at the banner information for forcedeth to confirm that I'm not missing something in the older driver that clears that flag. Thank you.

dalefarm, yes, I've seen that conversation, and it definitely looks suspicious, but from what I see forcedeth already uses the new vlan model. I might be missing something though.

Snippets of /var/log/messages : (I've redacted the MAC addresses; if you need the full log then I'll be happy to send it via PM)

2.6.38.8-35.fc15.x86_64 :

    forcedeth: Reverse Engineered nForce ethernet driver. Version 0.64.
    forcedeth 0000:00:08.0: PCI INT A -> Link[APCH] -> GSI 22 (level, low) -> IRQ 22
    forcedeth 0000:00:08.0: setting latency timer to 64
    forcedeth 0000:00:08.0: ifname eth0, PHY OUI 0x5043 @ 1, addr 00:18:f3:<xx:yy:zz>
    forcedeth 0000:00:08.0: highdma csum vlan pwrctl mgmt gbit lnktim msi desc-v3

2.6.40.6-0.fc15.x86_64 :

    forcedeth: Reverse Engineered nForce ethernet driver. Version 0.64.
    forcedeth 0000:00:08.0: PCI INT A -> Link[APCH] -> GSI 22 (level, low) -> IRQ 22
    forcedeth 0000:00:08.0: setting latency timer to 64
    forcedeth 0000:00:08.0: ifname eth0, PHY OUI 0x5043 @ 1, addr 00:18:f3:<xx:yy:zz>
    forcedeth 0000:00:08.0: highdma csum vlan pwrctl mgmt gbit lnktim msi desc-v3

Being selfish for a moment, I'm less worried about why kernel 2.6.38 does not attempt hw accel vlan tagging than I am about why the forcedeth driver doesn't seem to work properly when kernel 2.6.40 does attempt hw accel tagging.

Digging around in forcedeth.c, I'm surprised to see that the DEV_HAS_VLAN flag (in the pci_tbl struct) is advertised only for the MCP55-chipset variants. As a workaround, I disabled (i.e. removed) the 'DEV_HAS_VLAN' flag from the MCP55 entry; lo and behold, my vlans are now working with 2.6.40.
/var/log/messages shows :

    forcedeth 0000:00:08.0: highdma csum pwrctl mgmt gbit lnktim msi desc-v3

ethtool -k eth0 :

    Offload parameters for eth0:
    rx-checksumming: on
    tx-checksumming: on
    scatter-gather: on
    tcp-segmentation-offload: on
    udp-fragmentation-offload: off
    generic-segmentation-offload: on
    generic-receive-offload: on
    large-receive-offload: off
    rx-vlan-offload: off
    tx-vlan-offload: off
    ntuple-filters: off
    receive-hashing: off

and finally my 'nv.stp' script is showing no attempt at hw tagging :

    nv_start_xmit_optimized dev 0xffff88017385c000 (features: 60014823 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff88008bea5100 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff88017385c000 (features: 60014823 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff88008bea5d00 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff88017385c000 (features: 60014823 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff880141f86100 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff88017385c000 (features: 60014823 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff880141f86d00 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff88017385c000 (features: 60014823 hw 60014803 wanted 60014803 vlan 4020) skb 0xffff880141f86500 vlan_tci 0

Regarding commit 0891b0e08937aaec2c4734acb94c5ff8042313bb - 'forcedeth: fix vlans', as far as I can tell this doesn't come into play until kernel 3.1. The only tangible forcedeth commit I see from 2.6.38 -> 2.6.40 (i.e. 3.0) that affects the various 'features' flags is 569e146396cb3b378d2957b94671bf30cd777c67 - 'forcedeth: convert to hw_features'.
Interestingly though, the 'fix vlans' commit does add notable changes to the nv_probe() function relating to vlan feature flags :

     if (id->driver_data & DEV_HAS_VLAN) {
         np->vlanctl_bits = NVREG_VLANCONTROL_ENABLE;
    -    dev->features |= NETIF_F_HW_VLAN_RX | NETIF_F_HW_VLAN_TX;
    +    dev->hw_features |= NETIF_F_HW_VLAN_RX | NETIF_F_HW_VLAN_TX;
     }
    +dev->features |= dev->hw_features;

So this change specifically sets the NETIF_F_HW_VLAN_RX and NETIF_F_HW_VLAN_TX flags in dev->hw_features, in addition to dev->features.

If time allows, I may have a go at doing :

a. Regress forcedeth to the same rev as seen in kernel 2.6.38, to see if hw vlan stops being used
b. Bring in some/all of the forcedeth commits, post kernel 2.6.40, to see if hw vlan functionality springs to life.

I agree, the message logs confirm that, indicating that VLAN_TX was set in both versions of the file. Pulling some/all of the nv commits into f15 may be useful here, but I'm not seeing much in the way of vlan fixes since the above commit. I'm starting to wonder if the bridge isn't doing something odd to the frames. I'm going to set up a reproducer and take a look shortly.

Based on a comment in the LKML thread I linked previously [https://lkml.org/lkml/2011/2/25/411], I added a VLAN (with an unused id) on eth0 :

    > vconfig add eth0 6

Now my other vlan at br0.11 (id = 11) is working, using the unmodified 2.6.40 kernel and forcedeth driver. The stap script shows that the vlan_tci field is 100b, indicating hw vlan tx is active.

To verify, I removed eth0.6 :

    > vconfig rem eth0.6

... and br0.11 stopped working.

A follow-up on that same thread reads 'The right solution is convert the driver [Intel e1000e] over to the new vlan model'.
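For readers following the stap output, the '100b' value can be unpacked as below. This is a sketch under a stated assumption: the mask values mirror what include/linux/if_vlan.h defines in kernels of this era as far as I recall (with the CFI bit reused as a "tag present" marker), and they should be checked against the actual kernel source. The helper name `decode_vlan_tci` is hypothetical.

```python
# Assumed masks, believed to mirror include/linux/if_vlan.h in 2.6.38 - 3.1 kernels.
VLAN_PRIO_MASK = 0xE000
VLAN_PRIO_SHIFT = 13
VLAN_TAG_PRESENT = 0x1000  # assumption: CFI bit doubles as "skb carries a tag"
VLAN_VID_MASK = 0x0FFF


def decode_vlan_tci(tci: int) -> dict:
    """Split an skb->vlan_tci value into the fields the driver cares about."""
    return {
        "tag_present": bool(tci & VLAN_TAG_PRESENT),
        "vid": tci & VLAN_VID_MASK,
        "priority": (tci & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT,
    }


# The two values seen in the nv.stp output in this report.
for raw in (0x100B, 0x0):
    print(hex(raw), decode_vlan_tci(raw))
```

Under that assumption, 0x100b reads as "tag requested, VLAN ID 11, priority 0", which matches the id = 11 vlan described above, and 0 reads as "no hardware tagging requested for this skb".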
I brought in the forcedeth.c from kernel 3.1, which adds the following commits above and beyond what's in 2.6.40/3.0 :

    3326c784c9f492e988617d93f647ae0cfd4c8d09 - forcedeth: do vlan cleanup
    0891b0e08937aaec2c4734acb94c5ff8042313bb - forcedeth: fix vlans

This now looks to all work nicely with kernel 2.6.40; no feature-flag hacks, no fake-vlan config.

dmesg :

    forcedeth 0000:00:08.0: highdma csum vlan pwrctl mgmt gbit lnktim msi desc-v3

ethtool -k eth0 :

    Offload parameters for eth0:
    rx-checksumming: on
    tx-checksumming: on
    scatter-gather: on
    tcp-segmentation-offload: on
    udp-fragmentation-offload: off
    generic-segmentation-offload: on
    generic-receive-offload: on
    large-receive-offload: off
    rx-vlan-offload: on
    tx-vlan-offload: on
    ntuple-filters: off
    receive-hashing: off

my 'nv.stp' stap-script :

    nv_start_xmit_optimized dev 0xffff880173b78000 (features: 600149a3 hw 60014983 wanted 60014983 vlan 4020) skb 0xffff880149959200 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff880173b78000 (features: 600149a3 hw 60014983 wanted 60014983 vlan 4020) skb 0xffff880167ac5000 vlan_tci 100b
    nv_start_xmit_optimized dev 0xffff880173b78000 (features: 600149a3 hw 60014983 wanted 60014983 vlan 4020) skb 0xffff880167ac5d00 vlan_tci 100b
    nv_start_xmit_optimized dev 0xffff880173b78000 (features: 600149a3 hw 60014983 wanted 60014983 vlan 4020) skb 0xffff880149aca400 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff880173b78000 (features: 600149a3 hw 60014983 wanted 60014983 vlan 4020) skb 0xffff880149acae00 vlan_tci 100b
    nv_start_xmit_optimized dev 0xffff880173b78000 (features: 600149a3 hw 60014983 wanted 60014983 vlan 4020) skb 0xffff880172f17600 vlan_tci 100b
    nv_start_xmit_optimized dev 0xffff880173b78000 (features: 600149a3 hw 60014983 wanted 60014983 vlan 4020) skb 0xffff880172987f00 vlan_tci 0
    nv_start_xmit_optimized dev 0xffff880173b78000 (features: 600149a3 hw 60014983 wanted 60014983 vlan 4020) skb 0xffff880149aca400 vlan_tci 0

Probably also worth bringing in 9331db4f00cfee8a79d2147ac83723ef436b9759 - 'forcedeth: call vlan_mode only if hw supports vlans'. (Obviously this doesn't affect me directly, but it claims to address problems seen by many others when using non vlan-capable hardware.)

Ok, copy that, thanks, I'll pull those changes back.

@ ejbg : If indeed you are using Broadcom hardware (which my reading of your Comment #11 suggests that you are), then these issues with the NVIDIA/forcedeth driver are unlikely to be the cause of your problem.

Looking back at your network config, it appears that you have both a vlan (eth0.10) and a bridge (zbrz) on the same physical interface (eth0). From the various ifcfg files you posted, schematically your setup looks like :

             /-- zbrz [ @ 192.168.1.100]
            /
    eth0 --{
            \
             \-- eth0.10 -- zbr10 [ @ 192.168.10.100]

This type of configuration has been seen to be problematic, in that all inbound traffic first gets sent to the bridge, and is never seen by the vlan. The symptoms you're seeing are consistent with this situation. (see for example http://thread.gmane.org/gmane.linux.network/149864)

Can you try re-arranging your setup, so that the vlan hangs off the bridge? i.e. -

    eth0 -- zbrz [ @ 192.168.1.100]
              \
               \-- zbrz.10 -- zbr10 [ @ 192.168.10.100]

With regards to your posted ifcfg files - remove 'ifcfg-eth0.10', and create a new 'ifcfg-zbrz.10' to look like :

    VLAN=yes
    DEVICE=zbrz.10
    ONBOOT=yes
    Type=Ethernet
    BRIDGE=zbr10
    IPV6INIT=no
    IPV6_AUTOCONF=no
    NOZEROCONF=yes

See if that helps. -dalefarm

Since you've done the legwork on forcedeth, I'm going to use this bug to pull those changes in. I'll open a separate bug to see if we need to pull back similar changes for the broadcom driver.

Actually, I don't even have anything to apply. Looks like the head of the git tree for f15 has moved to 2.6.41, which already includes these changes.
You can get a recent build here: https://koji.fedoraproject.org/koji/buildinfo?buildID=273961 Or just wait for it to be pushed, which should be soon.

@ dalefarm / C27 : Following your advice, I changed the configuration as you stated :

    eth0 -- zbrz [ @ 192.168.1.100]
              \
               \-- zbrz.10 -- zbr10 [ @ 192.168.10.100]

Once the network has restarted, both segments 192.168.1.* and 192.168.10.* are now totally accessible :-). It now works perfectly fine. Many Thanks, Eric.

Neil, I'm happy with my patched forcedeth driver for now, and look forward to the release of 2.6.41. Thanks for all your help. -dalefarm
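As a footnote to the forcedeth analysis, the 'features' and 'hw' values printed by nv.stp in the different runs can be compared bit by bit. The sketch below assumes NETIF_F_HW_VLAN_TX is 0x80 and NETIF_F_HW_VLAN_RX is 0x100, which is what I believe include/linux/netdevice.h defines in these kernels and which agrees with the ethtool output quoted in this report; treat both constants as assumptions. The `vlan_offload` helper and the run labels are illustrative only.

```python
# Assumed bit values for the 2.6.40 / 3.x era netdev feature flags.
NETIF_F_HW_VLAN_TX = 0x80
NETIF_F_HW_VLAN_RX = 0x100


def vlan_offload(mask: int) -> dict:
    """Report whether the VLAN TX/RX offload bits are set in a feature mask."""
    return {
        "tx": bool(mask & NETIF_F_HW_VLAN_TX),
        "rx": bool(mask & NETIF_F_HW_VLAN_RX),
    }


# (features, hw_features) pairs copied from the nv.stp output in this report.
runs = {
    "2.6.40, stock forcedeth": (0x600149A3, 0x60014803),
    "2.6.40, DEV_HAS_VLAN removed": (0x60014823, 0x60014803),
    "2.6.40, forcedeth.c from 3.1": (0x600149A3, 0x60014983),
}

for label, (features, hw_features) in runs.items():
    print(label)
    print("  features   :", vlan_offload(features))
    print("  hw_features:", vlan_offload(hw_features))
```

If the assumed constants are right, the stock 2.6.40 driver advertised the VLAN bits in features but not in hw_features, and the 3.1 driver advertises them in both, which is consistent with the nv_probe() change quoted earlier in this report.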