Bug 1279161
Summary: | Bridge is not forwarding frames towards a connected tap device
---|---
Product: | Red Hat Enterprise Linux 6
Component: | kernel
Kernel sub component: | Bonding
Version: | 6.7
Hardware: | x86_64
OS: | Linux
Status: | CLOSED DUPLICATE
Severity: | high
Priority: | high
Reporter: | Ido Barkan <ibarkan>
Assignee: | Jakub Sitnicki <jsitnick>
QA Contact: | Amit Supugade <asupugad>
CC: | aloughla, atragler, danken, ibarkan, jarod, jpirko, jsitnick, kzhang, mleitner, network-qe, rkhan, sukulkar, tgraf, tspeetje, vyasevic, yliberma
Target Milestone: | rc
Target Release: | ---
Doc Type: | Bug Fix
Type: | Bug
Regression: | ---
Last Closed: | 2016-11-17 16:04:20 UTC
: | 1288828 1408958 (view as bug list)
Bug Blocks: | 1288828, 1370193
Description
Ido Barkan
2015-11-08 09:47:29 UTC
Which bond mode are you using? Please ensure it's either load balance or LACP. ARP replies are destined to the original requester's MAC, but some bond modes will overwrite the source MAC for load balancing, which would cause the bridge to not forward the packets back to the guest. (https://bugzilla.redhat.com/show_bug.cgi?id=1264029)

(In reply to Marcelo Ricardo Leitner from comment #2)
> Which bond mode are you using? Please ensure it's either load balance or LACP.

[root@ucs1-b200-2 ~]# cat /sys/class/net/bond0/bonding/mode
802.3ad 4

This is LACP, right?

The guest kernel is 2.6.32-504.el6.x86_64

(In reply to Ido Barkan from comment #3)
> [root@ucs1-b200-2 ~]# cat /sys/class/net/bond0/bonding/mode
> 802.3ad 4
>
> This is LACP, right?

Yes, that's LACP, good.

Please attach the binary captures for rhevm and vnet0 for the two tests, pinging the rhevm address directly and also an external one. Try to get both captures at the same time.

Also supplying info for vyasevic (also added to CC), requested by email:
> Actually since you mentioned that you are using UCS, there are
> vlans involved. UCS will by default tag any 'untagged' traffic
> with vlan id 0. This used to cause us all sorts of problems in
> rhel6 and vlan code had to be updated to handle it properly...
>
> As a pure work-around, you might be able simply 'modprobe 8021q'
> to make this work.
Host:
[root@ucs1-b200-2 ~]# lsmod | grep 8021q
8021q 20362 0
Guest:
On the guest this did not solve the problem.
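As an aside (a sketch, not part of the original exchange): if the 8021q workaround does help, it could be made persistent on a RHEL 6 guest roughly like this, assuming the stock /etc/sysconfig/modules hook is in place; the file name is illustrative:

# Load the 8021q module now so VLAN 0 (priority-tagged) frames get untagged.
modprobe 8021q

# Keep it loaded across reboots; on RHEL 6, rc.sysinit runs executable
# *.modules scripts from this directory (path is an assumption, adjust to
# local policy).
cat > /etc/sysconfig/modules/8021q.modules <<'EOF'
#!/bin/sh
/sbin/modprobe 8021q >/dev/null 2>&1
EOF
chmod +x /etc/sysconfig/modules/8021q.modules

# Confirm it is loaded.
lsmod | grep -w 8021q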
(In reply to Marcelo Ricardo Leitner from comment #5)
> Please attach the binary captures for rhevm and vnet0 for the two tests, pinging the rhevm address directly and also an external one. Try to get both captures at the same time.

Adding the results of:

[root@ucs1-b200-2 ~]# tcpdump -n -i rhevm -w - "(host 10.35.16.244) and (icmp or arp)" | tee /tmp/trace.txt

while running from the guest: 'ping 10.35.19.149 & ping 8.8.8.8'

Created attachment 1094352 [details]
binary tcpdump output
Host (recording traffic):
[root@ucs1-b200-2 ~]# tcpdump -n -i rhevm -w - "(host 10.35.16.244) and (icmp or arp)" | tee /tmp/trace.txt
Guest (generating traffic):
'ping 10.35.19.149 & ping 8.8.8.8'
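For what it's worth, a sketch (not from the original thread) of running the same capture on every hop at once — bridge, tap, and both bond slaves — so duplicated frames can be matched up by timestamp and source MAC later; interface names are the ones from this setup, output paths are illustrative:

# One capture per interface in the path of the guest traffic.
FILTER='host 10.35.16.244 and (icmp or arp)'
pids=""
for dev in rhevm vnet0 eth0 eth1; do
    tcpdump -n -i "$dev" -w "/tmp/trace-$dev.pcap" "$FILTER" &
    pids="$pids $!"
done

# Generate traffic from the guest (e.g. ping 10.35.19.149 & ping 8.8.8.8),
# give it some time, then stop all captures.
sleep 30
kill $pids

# Read the dumps back later with Ethernet headers visible, e.g.:
# tcpdump -e -n -r /tmp/trace-eth0.pcap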
(In reply to Ido Barkan from comment #4)
> The guest kernel is 2.6.32-504.el6.x86_64

That is the base RHEL 6.6 kernel, which has VLAN issues. If you look at the ARP packets you've provided (attachment 1094352 [details]), you'll see that they are in fact tagged with VLAN id 0. Loading the vlan module on the guest will work around this issue, but I think there might still be other issues.

You might consider updating the guest kernel.

-vlad

I'm thinking a bit differently. For each ARP request, we have 3 packets in there:
a. The request itself from the guest, 42 bytes long
b. The request again, 60 bytes long
c. The reply, 64 bytes long

Considering it was captured only on the rhevm bridge, packet (b) shouldn't exist, and its size and timing make me believe that this is either the NIC mirroring the packet back at us, or there is a network loop somewhere in there.
https://access.redhat.com/solutions/750553
https://access.redhat.com/solutions/774743

As the bridge is being fooled, it thinks that the guest is on the branch that the reply came from, so it does nothing/no forwarding.

You may confirm this situation by doing a capture on the NIC itself this time. (The captures so far were either rhevm or vnet0, which doesn't cover this situation.)

Note that this doesn't exclude the possible issue with processing VLAN 0.

Marcelo

I think Marcelo is right. If I had to guess, I'd say there is something wrong with the bond. I am thinking that the initial ARP is being looped back to the other bond port and thus fools the bridge. You could try removing one of the devices from the bond and see if connectivity is restored.

You may still have VLAN 0 issues on the guest if you are truly running 2.6.32-504.el6.x86_64.

-vlad

Created attachment 1095876 [details]
trace for eth0 (first bond member)

Trying to confirm Marcelo's theory from comment 10:
on the host:
[root@ucs1-b200-2 ~]# tcpdump -n -i eth0 -w - "(host 10.35.16.244) and (icmp or arp)" > /tmp/trace-eth0.txt
on the guest:
# ping 10.35.19.149 & ping 8.8.8.8

Created attachment 1095877 [details]
trace for eth1 (second bond member)

Trying to confirm Marcelo's theory from comment 10:
on the host:
[root@ucs1-b200-2 ~]# tcpdump -n -i eth1 -w - "(host 10.35.16.244) and (icmp or arp)" > /tmp/trace-eth1.txt
on the guest:
# ping 10.35.19.149 & ping 8.8.8.8

(In reply to Vlad Yasevich from comment #9)
> Loading the vlan module on the guest will work around this issue, but I think there might still be other issues.
> You might consider updating the guest kernel.

Vlad, I already tried to modprobe 8021q, which did not remedy the issue (in comment 6), but that is maybe because we have two issues here: duplicated packets + VLAN tag 0.

But assuming you guys confirm that the bond/NIC is mirroring the ARP packets back at the bridge (using the additional physical NIC traces), what is the workaround for the packet mirroring?
- change the bond mode? remove the bond? updated kernel/drivers?

Cool, yes, the ARP request is being reflected by the switch or the NIC.
Did this server ever work? Sounds like a bad switch config; its ports are not grouped.
(Or that SR-IOV thing I shared earlier.)

If you switch to active/backup, it should work, as bonding will drop all incoming packets from the inactive slave.

Before that, please provide 'cat /proc/net/bonding/bondX'; it will contain information that will allow us to diagnose the LACP link.

(In reply to Ido Barkan from comment #14)
> Vlad, I already tried to modprobe 8021q, which did not remedy the issue (in comment 6), but that is maybe because we have two issues here: duplicated packets + VLAN tag 0.

In comment 6 you mentioned that you did this on the host. What I am saying is that after we figure out the bonding issue, you may have to do this in the guest or upgrade the guest kernel.

> But assuming you guys confirm that the bond/NIC is mirroring the ARP packets back at the bridge (using the additional physical NIC traces), what is the workaround for the packet mirroring?
> - change the bond mode? remove the bond? updated kernel/drivers?

You have a few options:
1) Remove one of the bond members. That would resolve the issue, but you'd lose fault tolerance.
2) Change the bond mode. Mode 1 (active-backup) should work correctly.
3) Try to understand why the issue persists, by providing all the information from /proc/net/bonding/bond0 (like Marcelo asked). We don't really know if this is something that's wrong in the bond driver or not.

-vlad

(In reply to Marcelo Ricardo Leitner from comment #15)
> Before that, please provide 'cat /proc/net/bonding/bondX'; it will contain information that will allow us to diagnose the LACP link.

[root@ucs1-b200-2 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 1
Actor Key: 1
Partner Key: 1
Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth0
MII Status: up
Speed: 10240 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:b5:0a:00:09
Aggregator ID: 1
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 10240 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:25:b5:0b:00:09
Aggregator ID: 2
Slave queue ID: 0

Those "Aggregator ID"s should be matching. It's saying that the switch thinks each port belongs to a different aggregated port, which is not what bonding is expecting. This is very likely a switch config issue. Please engage with whoever maintains the switch and ask them to check that.

Vlad, on the bonding side, we could detect such an issue, print a warning in the kernel log, and reject packets from the slaves that aren't using the right AggID (by right, read: the first one to come up). What do you think?
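As an aside (a sketch, not something posted in the bug), the two things discussed above — which bridge port the guest's MAC is currently learned on, and whether the bond slaves agree on the LACP aggregator — could be checked along these lines (bridge and bond names taken from this setup):

# Where does the bridge currently think the guest MAC lives?
# A reflected ARP would make it flap from the vnet port to the bond port.
brctl showmacs rhevm

# Per-slave aggregator IDs; in a healthy LACP setup both slaves report
# the same ID (in this bug they came back as 1 and 2).
awk '/^Slave Interface:/ { slave = $3 }
     /Aggregator ID:/ && slave != "" { print slave, "aggregator", $NF }' /proc/net/bonding/bond0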
Hi guys, thanks for your findings. We have done a few tests:

1. Changed the bond to mode 1 (to stop a possible loop)
   -> el6 guest: no connectivity
2. modprobe 8021q in the el6 guest
   -> el6 guest: connectivity restored!
   -> el7 guest (just migrated to the host): has connectivity!
   * el7 guest kernel is 3.10.0-123.el7.x86_64
3. el6 guest: rmmod 8021q
   -> el6 guest: connectivity broken
   * until now there are no surprises; it corresponds directly to the two problems (VLAN 0 tag + a network loop)
4. Restored the bond to mode 4 + modprobe 8021q in the el6 guest
   -> el6 guest: connectivity _still_ working
   -> el7 guest: _still_ has connectivity
   * this is a surprise!
5. Leaving the setup some time (a day or so)
   -> el6 guest: no connectivity
   -> el7 guest: _still_ has connectivity
   * this is what I expected, but I still don't understand how this is time related. Maybe it has something to do with learning/aging in the bridge tables, or the network loop takes some time to appear.
   * and how come the el7 guest still operates after there is a loop?

Please s/network loop/bad switch config/. The loop ends up happening because the switch thinks that each of your NICs belongs to a different bond, so both end up receiving broadcasts, even those generated by this system.

(In reply to Ido Barkan from comment #19)
> 5. Leaving the setup some time (a day or so)
>    -> el6 guest: no connectivity
>    -> el7 guest: _still_ has connectivity
>
>    * this is what I expected, but I still don't understand how this is time related.
>    * and how come the el7 guest still operates after there is a loop?

Can you provide the info from /proc/net/bonding/bond0 on rhel7?

-vlad

Hi Vlad,
Bonding is configured only in the host, not in the guests.

Yaniv, I think Vlad wants to check how bonding negotiated the aggregation during the rhel7 test. Note that even if the aggregation is broken, there is a time dependency on it. Just like the broadcast packets from the switch confuse the bridge, any packet from the guest will fix it by "confusing" it again. That is, if the unwanted broadcast gets in late, it's not a problem. Also, if you stop issuing broadcasts for any reason, it won't trigger the issue.

Anyway, have you got in contact with the switch administrator to fix that config?
And the RHEL6 kernel, did you update it, so that you wouldn't need to load the 8021q module?

Seems a dup of Bug 1258446 - [RHEL6.7][kernel][bonding][bridging] KVM virt-guests can no longer pxe boot.

kernel-2.6.32-573.8.1.el6 should fix the issue.

(In reply to Zhenjie Chen from comment #25)
> Seems a dup of Bug 1258446 - [RHEL6.7][kernel][bonding][bridging] KVM virt-guests can no longer pxe boot.
> kernel-2.6.32-573.8.1.el6 should fix the issue.

I see. Thank you for the input. I'll look into this, and let you know if it worked.

Hi, please reply to comments 23 and 26 when possible. Thanks

(In reply to Marcelo Ricardo Leitner from comment #23)
> Yaniv, I think Vlad wants to check how bonding negotiated the aggregation during the rhel7 test. Note that even if the aggregation is broken, there is a time dependency on it. Just like the broadcast packets from the switch confuse the bridge, any packet from the guest will fix it by "confusing" it again.
> That is, if the unwanted broadcast gets in late, it's not a problem. Also, if you stop issuing broadcasts for any reason, it won't trigger the issue.
>
> 1. Anyway, have you got in contact with the switch administrator to fix that config?
> 2. And the RHEL6 kernel, did you update it, so that you wouldn't need to load the 8021q module?

1. Yes, we reconfigured the relevant UCS servers to 1 NIC and enabled fabric failover, which means that whenever 1 NIC dies, the other one will take its place immediately. According to the UCS technician, this doesn't affect performance all too much. The UCS servers were reinstalled to RHEL 6.7, network drivers have been updated, and bonding was removed.

2. I updated the kernel version in RHEL 6.7 to what Zhenjie Chen (comment 25) suggested, version 2.6.32-573.8.1.el6 (including the kernel-firmware package).

(In reply to Marcelo Ricardo Leitner from comment #27)
> Hi, please reply to comments 23 and 26 when possible. Thanks

Done. After the reconfiguration and upgrades we've made, we're still testing to see if this solved the problems we were experiencing. All we have to do now is to reactivate them in RHEV-TLV.

Reference: https://engineering.redhat.com/rt/Ticket/Display.html?id=388168

I'll update you guys whenever there's anything significant to update.

Hey guys,

In continuation to comments 26 and 28, I upgraded the kernel and kernel-firmware versions to version 2.6.32-573.12.1.el6.x86_64 in all 3 UCS servers (I couldn't find kernel-2.6.32-573.8.1.el6.x86_64 anywhere, and besides, 2.6.32-573.12.1.el6.x86_64 is a newer version, so I don't think there'd be a problem). The network in the UCS servers was reconfigured, bonding was removed, and network drivers were updated (more details in comment 28).

The problem still occurs. RHEL 6.7 and below VMs lose connectivity when migrated to the UCS servers, but RHEL 7.0 and above VMs still maintain their connectivity. The OS on the UCS servers is RHEL 6.7.

I then migrated the 2 VMs (RHEL 6.6 and RHEL 7.0) to a different host, a Dell PowerEdge C6220, where the OS is RHEL 6.7 too, and the kernel version is 2.6.32-573.8.1.el6.x86_64, and both VMs maintain their connectivity there.

I have no idea what's going on at this point... I was told by the UCS technician that was here to reconfigure the UCS servers and remove the bonding config that the firmware versions in our UCS setup (servers, fabric interconnects, etc.) are very outdated, so maybe that's the root cause for these problems... I can't think of anything else, really, even though it was, AFAIR, working properly before. A RHEL 6.x<->UCS incompatibility, perhaps?

Please let me know your thoughts on this.

(In reply to Yaniv Liberman from comment #29)
> In continuation to comments 26 and 28, I upgraded the kernel and kernel-firmware versions to version 2.6.32-573.12.1.el6.x86_64 in all 3 UCS servers.

You state that the kernel was upgraded on the UCS servers. What about the VMs? Are they still running old rhel6 kernels, or were they upgraded to newer -573.12.1 kernels as well?

The issue with UCS is that it is notorious for adding VLAN id 0 to packets that are sent to the host. If such packets are for the host itself, to consume, the VLAN id 0 has no impact. If, however, these packets are to be forwarded to the VM, this VLAN id 0 is forwarded. This is where old rhel6 kernels have issues.

There are two ways to solve them:
1) Load the 802.1q module on the guest so that the VLAN header can be processed correctly. Without this module loaded, the VLAN header remains in the packet and causes packets to be dropped.
2) Upgrade the kernel in the VM. There was a large effort to improve VLAN handling in rhel6. As a result, with newer kernels, you no longer need to load the module to correctly process VLAN 0 packets.

Hope this explains why, when you migrate your VMs outside of the UCS environment, everything works as it should.
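(An illustration, not from the original thread.) One way the VLAN 0 tagging could be confirmed from inside an affected guest, assuming tcpdump is installed there; -e prints the link-level header, so an 802.1Q tag with id 0 shows up directly. Depending on how the NIC/driver hands the tag to the capture layer, the explicit 'vlan' filter may or may not match, so the plain -e dump is the more reliable view:

# Watch ARP/ICMP arriving on the guest NIC, including the Ethernet header;
# look for "vlan 0" / 802.1Q in the output.
tcpdump -nn -e -i eth0 'arp or icmp'

# Alternatively, try to match tagged frames explicitly (VLAN id 0 only).
tcpdump -nn -e -i eth0 'vlan 0'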
(In reply to Vlad Yasevich from comment #31)
> You state that the kernel was upgraded on the UCS servers. What about the VMs? Are they still running old rhel6 kernels, or were they upgraded to newer -573.12.1 kernels as well?

Yaniv, do you happen to have the info on the kernel the VMs are/were running that Vlad was asking for?

(In reply to Jakub Sitnicki from comment #33)
> Yaniv, do you happen to have the info on the kernel the VMs are/were running that Vlad was asking for?

I don't. Sorry.

Tentative devel_ack+. We're committed to finding the root cause of this issue.

Hi Yaniv,

Let me recap what I gather from the previous investigation carried out by you, Ido, Marcelo, and Vlad:

1) the issue is not related to bonding; you have reconfigured the affected machines not to use bonding and the problem persists
2) the issue does not reproduce on RHEL7; kernel 3.10.0-123.el7.x86_64 was tested
3) the issue reproduces on a RHEL6.7 guest where the hypervisor is running on a UCS server that is connected to a UCS switch
4) some UCS switch firmware versions are known for tagging traffic sent to the host with VLAN id 0
5) it has not been confirmed yet that loading the 802.1q module in a guest running the 2.6.32-573.12.1.el6.x86_64 kernel solves the problem
6) guests running RHEL6.7 with a kernel newer than 573.12.1.el6 have not been tested
7) guests running RHEL6.8 or RHEL6.9 kernels have not been tested either

To do anything here I will need to confirm if the problem still happens with the latest RHEL6.9 kernel. Currently it is kernel-2.6.32-661.el6 and can be grabbed from:

http://download-node-02.eng.bos.redhat.com/brewroot/packages/kernel/2.6.32/661.el6/

Yaniv, is the environment where the problem was happening still available? If so, could you deploy a RHEL6.8 guest there, update the kernel to 2.6.32-661.el6, and check if the problem persists?

I will try to reproduce it in a virtual setup, but I don't know yet if mimicking the quirky UCS switch behavior is doable.
I'm not sure if I should be asking you or Ido about further testing. Please let me know if you're no longer involved with this bug.

Ido, Yaniv is not responding, so maybe you can help with checking if the latest RHEL6.9 kernel (-661.el6) still has the issue when the guest is running on a UCS server?

I'm afraid we're very short on time to find the root cause and backport a fix here, because the Kernel Patch Submission Deadline for RHEL6.9 is set for Tue, 2016-10-18. So far I haven't been able to reproduce the issue myself.

Hi Jakub,
1. We were on Public Holiday on Monday and Tuesday.
2. Ido does not work in Red Hat any more.

Our RHV environment was upgraded to version 4.0. Please let me know if this matters ASAP, as we are on Public Holiday again next week until October 24. I need to know if it's possible to check this in RHV 4.0, because if not, we don't have enough time to set up a RHV 3.# / RHEL 6.# environment before we enter our Public Holiday.

Thanks,
Yaniv

(In reply to Yaniv Liberman from comment #38)
> I need to know if it's possible to check this in RHV 4.0

Yes, it is. We (RHV) care about this issue only if it reproduces on RHV-4 on top of el7. Nothing on the bridge/tap/qemu level has changed between rhev-3.6 and 4.

(In reply to Dan Kenigsberg from comment #39)
> Yes, it is. We (RHV) care about this issue only if it reproduces on RHV-4 on top of el7.

OK. I installed RHEL 7.2 (3.10.0-327.el7.x86_64) on a UCS server. We'll add it to our RHV 4.0 environment as soon as possible, and then install a RHEL 6.9 VM there to check if the problem is reproduced.

(In reply to Yaniv Liberman from comment #38)
> 1. We were on Public Holiday on Monday and Tuesday.
> 2. Ido does not work in Red Hat any more.

Yaniv, my bad, I didn't know. Looking forward to your test result with RHV 4.0 and a RHEL 6.9 VM. I'm still working on a reproducer, but maybe you will be able to confirm if RHEL 6.9 still has the bug first. Either way, in the end we would need to test in a UCS environment.

Dan, thank you for chipping in and answering Yaniv's questions.

(In reply to Jakub Sitnicki from comment #41)
> Looking forward to your test result with RHV 4.0 and a RHEL 6.9 VM.

No worries.

I've just been told we don't have RHEL 6.9...

Do we have any alternatives or...?

Please advise.

(In reply to Yaniv Liberman from comment #42)
> I've just been told we don't have RHEL 6.9...
>
> Do we have any alternatives or...?
>
> Please advise.
We should also be able to test with RHEL 6.8 and upgrade the kernel to the latest version from 6.9, if that is an option.

I see. Alright. As we agreed on IRC, I'll try that when we come back from the Holidays. I'll keep you posted. Thanks!

Yaniv, could you please also gather info on what NICs are being used on the UCS server to provide connectivity to the guests?

# ethtool -i <dev>
# lspci -s <bus-info from ethtool output> -vmm

Thanks,
Jakub

Yaniv, I've managed to simulate the quirky Cisco UCS firmware behavior and can confirm that Vlad's suggested solution from comment #31 is what you need when no bonding is involved:

> There are two ways to solve them:
> 1) Load the 802.1q module on the guest so that the VLAN header can be processed correctly. Without this module loaded, the VLAN header remains in the packet and causes packets to be dropped.
> 2) Upgrade the kernel in the VM. There was a large effort to improve VLAN handling in rhel6. As a result, with newer kernels, you no longer need to load the module to correctly process VLAN 0 packets.

As per my tests, when a RHEL 6.6 VM gets an ARP reply or an ICMP Echo reply with a VLAN 0 tag, the outcome is:

* kernel-2.6.32-504.el6 (base RHEL 6.6 kernel) up to 2.6.32-504.21.1.el6, 8021q module not loaded - tagged packets don't get untagged, ping doesn't work,
* kernel-2.6.32-504.el6 (base RHEL 6.6 kernel) up to 2.6.32-504.21.1.el6, 8021q module loaded - tagged packets get untagged, ping works,
* 2.6.32-504.22.1.el6 and above, 8021q module not loaded - tagged packets get untagged, ping works.

Hence at the moment I'm inclined to close this as a duplicate of BZ 1135347 - the backport of the new VLAN model that Vlad has been referring to. These changes have been backported into the 6.6 z-stream in version.

(In reply to Jakub Sitnicki from comment #46)
> Hence at the moment I'm inclined to close this as a duplicate of BZ 1135347 - the backport of the new VLAN model that Vlad has been referring to. These changes have been backported into the 6.6 z-stream in version.

.. in version 2.6.32-504.22.1.el6.

(In reply to Jakub Sitnicki from comment #45)
> Yaniv, could you please also gather info on what NICs are being used on the UCS server to provide connectivity to the guests?
> # ethtool -i <dev>
> # lspci -s <bus-info from ethtool output> -vmm
>
> Thanks,
> Jakub

Just to clear this needinfo flag, even though it's probably irrelevant at this point:

--
[root@ucs1-b200-1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:b5:0c:00:01 brd ff:ff:ff:ff:ff:ff
    inet 10.35.19.148/22 brd 10.35.19.255 scope global dynamic enp6s0
       valid_lft 42760sec preferred_lft 42760sec
    inet6 2620:52:0:2310:225:b5ff:fe0c:1/64 scope global noprefixroute dynamic
       valid_lft 2591713sec preferred_lft 604513sec
    inet6 fe80::225:b5ff:fe0c:1/64 scope link
       valid_lft forever preferred_lft forever

[root@ucs1-b200-1 ~]# ethtool -i enp6s0
driver: enic
version: 2.1.1.83
firmware-version: 2.1(2a)
bus-info: 0000:06:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

[root@ucs1-b200-1 ~]# lspci -s 0000:06:00.0 -vmm
Slot: 06:00.0
Class: Ethernet controller
Vendor: Cisco Systems Inc
Device: VIC Ethernet NIC
SVendor: Cisco Systems Inc
SDevice: VIC 1240 MLOM Ethernet NIC
Rev: a2
--

(In reply to Jakub Sitnicki from comment #46)
> As per my tests, when a RHEL 6.6 VM gets an ARP reply or an ICMP Echo reply with a VLAN 0 tag, the outcome is:
>
> * kernel-2.6.32-504.el6 (base RHEL 6.6 kernel) up to 2.6.32-504.21.1.el6, 8021q module not loaded - tagged packets don't get untagged, ping doesn't work,
> * kernel-2.6.32-504.el6 (base RHEL 6.6 kernel) up to 2.6.32-504.21.1.el6, 8021q module loaded - tagged packets get untagged, ping works,
> * 2.6.32-504.22.1.el6 and above, 8021q module not loaded - tagged packets get untagged, ping works.
>
> Hence at the moment I'm inclined to close this as a duplicate of BZ 1135347 - the backport of the new VLAN model that Vlad has been referring to.

I see. So kernel version 2.6.32-504.el6 fixes this problem in RHEL 6.6?

If so, then I guess it's safe to assume that in RHEL 6.8-9 it's already working properly, yes?

I haven't tested this yet to confirm, though, but I think it makes sense that it'd work.

(In reply to Yaniv Liberman from comment #48)
> I see. So kernel version 2.6.32-504.el6 fixes this problem in RHEL 6.6?

2.6.32-504.22.1.el6 or later from the RHEL6.6 z-stream. 2.6.32-504.el6 is buggy.

> If so, then I guess it's safe to assume that in RHEL 6.8-9 it's already working properly, yes?
>
> I haven't tested this yet to confirm, though, but I think it makes sense that it'd work.

Yes, it is safe to assume. I've tested 2.6.32-663.el6.x86_64 (recentish RHEL6.9 development version) and VLAN 0 tagged frames are handled by the stack.

So, are you okay with me closing this one? Or would you like me to keep it open until you can confirm that the latest 6.6 z-stream guest kernels work in the UCS environment?

Thanks!
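(A quick guest-side check, not from the original thread, corresponding to the test matrix above: which kernel is running, whether 8021q is loaded, and whether ping works without it. The ping target is the rhevm address used earlier in this bug.)

# The two variables the matrix is built around.
uname -r
lsmod | grep -w 8021q || echo "8021q not loaded"

# Functional check: if this succeeds with 8021q unloaded, the running
# kernel already strips VLAN 0 tags on its own.
ping -c 3 -W 2 10.35.19.149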
This is good news. Please keep this open for the time being.

This issue was fixed in bz1135347. Marking as dupe.

*** This bug has been marked as a duplicate of bug 1135347 ***